

Imbalanced Data in IEEE-CIS Fraud Detection

- STEP 1: Load and preprocess the IEEE-CIS Fraud Detection data
- STEP 2: Test a baseline LightGBM model
- STEP 3-4: Perform oversampling
  • Author/Reviewer: 김태영


STEP 1~4. The Imbalanced Data Problem in IEEE-CIS Fraud Detection

Background

Using the data from IEEE-CIS Fraud Detection, a fraudulent-transaction prediction competition actually hosted on Kaggle, we will analyze imbalanced data and build a simple prediction model. To improve the model's predictive performance, we will apply the various imbalanced-data handling techniques covered in this part and examine the results.

Goals

  • Load and preprocess the IEEE-CIS Fraud Detection data
  • Test a baseline LightGBM model
  • Perform oversampling with SMOTE
  • Perform oversampling with BLSM (Borderline-SMOTE)

STEP 1. Loading and Preprocessing the IEEE-CIS Fraud Detection Data

Load the IEEE-CIS Fraud Detection data and preprocess it.

  • Download train_transaction.csv and test_transaction.csv from the Data tab at https://www.kaggle.com/c/ieee-fraud-detection
  • Handle categorical variables and missing values
  • The preprocessing step is not part of the evaluation in this part, so the source code is provided
import numpy as np
import pandas as pd

train = pd.read_csv("train_transaction.csv")
test = pd.read_csv("test_transaction.csv")

train["isFraud"].mean() #  0.03499000914417313
0.03499000914417313
train.head()
TransactionID isFraud TransactionDT TransactionAmt ProductCD card1 card2 card3 card4 card5 ... V330 V331 V332 V333 V334 V335 V336 V337 V338 V339
0 2987000 0 86400 68.5 W 13926 NaN 150.0 discover 142.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 2987001 0 86401 29.0 W 2755 404.0 150.0 mastercard 102.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 2987002 0 86469 59.0 W 4663 490.0 150.0 visa 166.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 2987003 0 86499 50.0 W 18132 567.0 150.0 mastercard 117.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 2987004 0 86506 50.0 H 4497 514.0 150.0 mastercard 102.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

5 rows × 394 columns

# generate time of day (TransactionDT is in seconds: /3600 converts to hours, /183 forms coarse buckets)
train["Time of Day"] = np.floor(train["TransactionDT"]/3600/183)
test["Time of Day"] = np.floor(test["TransactionDT"]/3600/183)

# drop columns
train.drop("TransactionDT",axis=1,inplace=True)
test.drop("TransactionDT",axis=1,inplace=True)

# define continuous and categorical variables
cont_vars = ["TransactionAmt"]
cat_vars = ["ProductCD","addr1","addr2","P_emaildomain","R_emaildomain","Time of Day"] + [col for col in train.columns if "card" in col]

# set training and testing set
# (the Kaggle test_transaction.csv has no isFraud labels, so the evaluation
#  set used throughout this part is drawn from the training data)
x_train = train[cont_vars + cat_vars].copy()
y_train = train["isFraud"].copy()
x_test = train[cont_vars + cat_vars].copy()
y_test = train["isFraud"].copy()

# process cont_vars
# scale values
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
x_train["TransactionAmt"] = scaler.fit_transform(x_train["TransactionAmt"].values.reshape(-1,1))
x_test["TransactionAmt"] = scaler.transform(x_test["TransactionAmt"].values.reshape(-1,1))

# reduce cardinality of categorical variables:
# pool card1 values that occur 100 times or fewer into a single "Others" level
idx_list = x_train["card1"].value_counts()[x_train["card1"].value_counts()<=100].index.tolist()
x_train.loc[x_train["card1"].isin(idx_list),"card1"] = "Others"
x_test.loc[x_test["card1"].isin(idx_list),"card1"] = "Others"

# convert to numerical values for modelling
def categorify(df, cat_vars):
    # turn each categorical column into an ordered pandas Categorical and
    # remember the category levels so they can be reused on the test set
    categories = {}
    for cat in cat_vars:
        df[cat] = df[cat].astype("category").cat.as_ordered()
        categories[cat] = df[cat].cat.categories
    return categories

def apply_test(test, categories):
    # re-encode the test set with the category levels learned on train
    for cat, index in categories.items():
        test[cat] = pd.Categorical(test[cat], categories=index, ordered=True)

# convert to integers
categories = categorify(x_train, cat_vars)
apply_test(x_test, categories)

for cat in cat_vars:
    # cat.codes maps NaN to -1, so the +1 shift encodes missing values as 0
    x_train[cat] = x_train[cat].cat.codes + 1
    x_test[cat] = x_test[cat].cat.codes + 1



# fill missing (after the +1 code shift above the categorical columns no longer
# contain NaN, so this fillna is kept only as a safety net)
x_train[cat_vars] = x_train[cat_vars].fillna("Missing")
x_test[cat_vars] = x_test[cat_vars].fillna("Missing")
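
One detail worth knowing about this encoding: pandas assigns the category code -1 to NaN, so the +1 shift above is what sends missing values to 0. A minimal illustration (not part of the original notebook):

s = pd.Series(["a", None, "b"]).astype("category")
print(s.cat.codes.tolist())        # [0, -1, 1]: NaN gets code -1
print((s.cat.codes + 1).tolist())  # [1, 0, 2]: missing now maps to 0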

The evaluation code is shown below.

from sklearn.metrics import confusion_matrix

def model_evaluation(label, predict):
    cf_matrix = confusion_matrix(label, predict)
    Accuracy = (cf_matrix[0][0] + cf_matrix[1][1]) / sum(sum(cf_matrix))
    # note: 0/0 here yields nan when the model makes no positive predictions
    Precision = cf_matrix[1][1] / (cf_matrix[1][1] + cf_matrix[0][1])
    Recall = cf_matrix[1][1] / (cf_matrix[1][1] + cf_matrix[1][0])
    F1_Score = (2 * Recall * Precision) / (Recall + Precision)
    print("Model_Evaluation with Label:1")
    print("Accuracy: ", Accuracy)
    print("Precision: ", Precision)
    print("Recall: ", Recall)
    print("F1-Score: ", F1_Score)

STEP 2. Baseline LightGBM Model Test

Train and evaluate a baseline LightGBM model on the imbalanced data as-is.

  • Convert the training data into LightGBM's Dataset format
  • Train the LightGBM model
  • Evaluate the LightGBM model
# run LightGBM
from sklearn.metrics import confusion_matrix
import lightgbm as lgb
lgb_dtrain = lgb.Dataset(data = pd.DataFrame(x_train), label = pd.DataFrame(y_train)) # convert the training data into LightGBM's Dataset format
lgb_param = {'max_depth': 10, # tree depth
            'learning_rate': 0.01, # step size
            'n_estimators': 50, # number of trees
            'objective': 'multiclass', # objective function
            'num_class': y_train.nunique()} # number of classes; labels must lie in [0, num_class)
lgb_model = lgb.train(params = lgb_param, train_set = lgb_dtrain) # train the model
lgb_model_predict = np.argmax(lgb_model.predict(x_test), axis = 1) # predict on the evaluation data: take the label with the largest softmax output
model_evaluation(y_test, lgb_model_predict) # evaluate the classification results
C:\Users\drogpard\Anaconda3\lib\site-packages\lightgbm\engine.py:148: UserWarning: Found `n_estimators` in params. Will use it instead of argument
  warnings.warn("Found `{}` in params. Will use it instead of argument".format(alias))


Model_Evaluation with Label:1
Accuracy:  0.9650099908558268
Precision:  nan
Recall:  0.0
F1-Score:  nan


C:\Users\drogpard\Anaconda3\lib\site-packages\ipykernel_launcher.py:4: RuntimeWarning: invalid value encountered in longlong_scalars
  after removing the cwd from sys.path.
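
The nan Precision is the class imbalance showing up directly: Precision is TP / (TP + FP), and with Recall at 0.0 the baseline model never predicts class 1, so there are no positive predictions to score. A quick way to confirm this (a sketch added for illustration, not in the original notebook):

# count predictions per class; the baseline puts every row in class 0
print(np.bincount(lgb_model_predict, minlength=2))  # expected: [590540      0]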

STEP 3. Oversampling with SMOTE

Perform oversampling with SMOTE, then evaluate with the baseline LightGBM model. A toy sketch of how SMOTE works follows the checklist below.

  • Perform oversampling with SMOTE
  • Convert the training data into LightGBM's Dataset format
  • Train the LightGBM model
  • Evaluate the LightGBM model
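Before applying SMOTE to the fraud data, here is a toy sketch of what it actually does: for each minority sample it picks one of its k nearest minority-class neighbors and places a synthetic point on the line segment between them. (This block is illustrative only; make_classification just builds a small imbalanced dataset.)

from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

# small two-class set with roughly a 5% minority class
X_toy, y_toy = make_classification(n_samples=1000, weights=[0.95, 0.05],
                                   random_state=42)
print("before:", Counter(y_toy))

# interpolate new minority points until minority:majority reaches 0.5
X_res, y_res = SMOTE(sampling_strategy=0.5, random_state=42).fit_resample(X_toy, y_toy)
print("after: ", Counter(y_res))
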
# check the shapes of the existing x_train, y_train, x_test, y_test
print("x_train: ", x_train.shape)
print("y_train: ", y_train.shape)
print("x_test: ", x_test.shape)
print("y_test: ", y_test.shape)
x_train:  (590540, 13)
y_train:  (590540,)
x_test:  (590540, 13)
y_test:  (590540,)
# run SMOTE
from imblearn.over_sampling import SMOTE
print("Before OverSampling, counts of label '1': {}".format(sum(y_train == 1))) # number of label-1 rows in y_train
print("Before OverSampling, counts of label '0': {} \n".format(sum(y_train == 0))) # number of label-0 rows in y_train

sm = SMOTE(random_state = 42, sampling_strategy = 0.3) # SMOTE: grow the minority class to a 0.3 minority:majority ratio
x_train_res, y_train_res = sm.fit_resample(x_train, y_train.ravel()) # perform the oversampling

print("After OverSampling, counts of label '1': {}".format(sum(y_train_res==1)))
print("After OverSampling, counts of label '0': {}".format(sum(y_train_res==0)))


Before OverSampling, counts of label '1': 20663
Before OverSampling, counts of label '0': 569877 




After OverSampling, counts of label '1': 170963
After OverSampling, counts of label '0': 569877
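
The counts line up with the requested ratio: sampling_strategy=0.3 asks SMOTE to synthesize minority samples until the minority:majority ratio reaches 0.3, leaving the majority class untouched. A quick check:

print(sum(y_train_res == 1) / sum(y_train_res == 0))  # 170963 / 569877 ≈ 0.3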
# check data shapes before and after SMOTE
print("Before OverSampling, the shape of X_train: {}".format(x_train.shape)) # shape before applying SMOTE
print("Before OverSampling, the shape of y_train: {}".format(y_train.shape)) # shape before applying SMOTE
print('After OverSampling, the shape of X_train: {}'.format(x_train_res.shape)) # shape after applying SMOTE
print('After OverSampling, the shape of y_train: {}'.format(y_train_res.shape)) # shape after applying SMOTE
Before OverSampling, the shape of X_train: (590540, 13)
Before OverSampling, the shape of y_train: (590540,)
After OverSampling, the shape of X_train: (740840, 13)
After OverSampling, the shape of y_train: (740840,)
# run LightGBM
lgb_dtrain2 = lgb.Dataset(data = pd.DataFrame(x_train_res), label = pd.DataFrame(y_train_res)) # convert the training data into LightGBM's Dataset format
lgb_param2 = {'max_depth': 10, # tree depth
            'learning_rate': 0.01, # step size
            'n_estimators': 50, # number of trees
            'objective': 'multiclass', # objective function
            'num_class': len(np.unique(y_train_res))} # number of classes; labels must lie in [0, num_class)
lgb_model2 = lgb.train(params = lgb_param2, train_set = lgb_dtrain2) # train the model
lgb_model2_predict = np.argmax(lgb_model2.predict(x_test), axis = 1) # predict on the evaluation data: take the label with the largest softmax output
model_evaluation(y_test, lgb_model2_predict) # evaluate the classification results


Model_Evaluation with Label:1
Accuracy:  0.9650099908558268
Precision:  nan
Recall:  0.0
F1-Score:  nan


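Note that SMOTE at a 0.3 ratio did not move the metrics here: Precision is still nan and Recall 0.0, meaning the resampled model still makes no positive predictions on this evaluation set. STEP 4 pushes the minority ratio higher (0.6) with Borderline-SMOTE.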

STEP 4. Oversampling with BLSM

Perform oversampling with BLSM (Borderline-SMOTE), then evaluate with the baseline LightGBM model. A short note on how BLSM differs from SMOTE follows the checklist below.

  • Perform oversampling with BLSM
  • Convert the training data into LightGBM's Dataset format
  • Train the LightGBM model
  • Evaluate the LightGBM model
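Unlike plain SMOTE, Borderline-SMOTE only synthesizes points around "danger" samples: minority samples whose nearest neighbors are mostly majority class, i.e. the ones sitting near the class border. imblearn exposes two variants through the kind parameter; a minimal sketch of the constructor options (illustrative, not from the original notebook):

from imblearn.over_sampling import BorderlineSMOTE

# "borderline-1": danger samples interpolate toward minority-class neighbors only
# "borderline-2": interpolation may also move toward majority-class neighbors
blsm1 = BorderlineSMOTE(kind="borderline-1", random_state=42)
blsm2 = BorderlineSMOTE(kind="borderline-2", random_state=42)
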
# run BLSM
from imblearn.over_sampling import BorderlineSMOTE
sm4 = BorderlineSMOTE(random_state = 42, sampling_strategy = 0.6) # Borderline-SMOTE: grow the minority class to a 0.6 minority:majority ratio
x_train_res4, y_train_res4 = sm4.fit_resample(x_train, y_train.ravel()) # perform the oversampling
# run LightGBM
lgb_dtrain5 = lgb.Dataset(data = pd.DataFrame(x_train_res4), label = pd.DataFrame(y_train_res4)) # convert the training data into LightGBM's Dataset format
lgb_param5 = {'max_depth': 10, # tree depth
            'learning_rate': 0.01, # step size
            'n_estimators': 50, # number of trees
            'objective': 'multiclass', # objective function
            'num_class': len(np.unique(y_train_res4))} # number of classes; labels must lie in [0, num_class)
lgb_model5 = lgb.train(params = lgb_param5, train_set = lgb_dtrain5) # train the model
lgb_model5_predict = np.argmax(lgb_model5.predict(x_test), axis = 1) # predict on the evaluation data: take the label with the largest softmax output
model_evaluation(y_test, lgb_model5_predict) # evaluate the classification results


Model_Evaluation with Label:1
Accuracy:  0.9261201612083856
Precision:  0.19434092844974446
Recall:  0.35333688235009436
F1-Score:  0.25075990451821195
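
Compared with the baseline in STEP 2 (Accuracy 0.9650, Precision nan, Recall 0.0), BLSM gives up about four points of accuracy but the model now actually detects fraud (Recall ≈ 0.35, F1 ≈ 0.25), which is exactly the trade-off the oversampling techniques in this part aim for.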