불균형 데이터 처리
IEEE-CIS Fraud Detection의 불균형 데이터
- STEP 1 : IEEE-CIS Fraud Detection 데이터 불러오기 및 전처리
- STEP 2 : Light GBM 기본 모델 테스트
- STEP 3-4 : Oversampling 수행
- 작성자: 김태영 감수자
STEP 1~4. IEEE-CIS Fraud Detection의 불균형 데이터 문제
배경
실제 캐글에서 개최된 부정거래 예측 대회인 IEEE-CIS Fraud Detection에서 사용된 데이터를 가지고 불균형 데이터 분석을 수행해보고, 간단한 예측 모델을 만들어보겠습니다. 예측 모델의 정확도를 높이기 위해서 본 파트에서 배운 여러가지 불균형 데이터 처리 기법을 적용해보고 그 결과를 살펴보겠습니다.
목표
- IEEE-CIS Fraud Detection 데이터 불러오기 및 전처리
- Light GBM 기본 모델 테스트
- SMOTE을 이용한 Oversampling 수행
-
BLSM을 이용한 Oversampling 수행
STEP 1. IEEE-CIS Fraud Detection 데이터 불러오기 및 전처리
IEEE-CIS Fraud Detection 데이터를 불러와서 전처리를 수행합니다.
- https://www.kaggle.com/c/ieee-fraud-detection의 data 탭에서 train_transaction.csv와 test_transaction.csv를 다운로드 함
- 카테고리형 및 결측치 처리
- 본 파트에서 데이터 전처리 과정은 평가에 포함되지 않으므로 소스코드 제공
import numpy as np
import pandas as pd
train = pd.read_csv("train_transaction.csv")
test = pd.read_csv("test_transaction.csv")
train["isFraud"].mean() # 0.03499000914417313
0.03499000914417313
train.head()
TransactionID | isFraud | TransactionDT | TransactionAmt | ProductCD | card1 | card2 | card3 | card4 | card5 | ... | V330 | V331 | V332 | V333 | V334 | V335 | V336 | V337 | V338 | V339 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2987000 | 0 | 86400 | 68.5 | W | 13926 | NaN | 150.0 | discover | 142.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
1 | 2987001 | 0 | 86401 | 29.0 | W | 2755 | 404.0 | 150.0 | mastercard | 102.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2 | 2987002 | 0 | 86469 | 59.0 | W | 4663 | 490.0 | 150.0 | visa | 166.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
3 | 2987003 | 0 | 86499 | 50.0 | W | 18132 | 567.0 | 150.0 | mastercard | 117.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
4 | 2987004 | 0 | 86506 | 50.0 | H | 4497 | 514.0 | 150.0 | mastercard | 102.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
5 rows × 394 columns
# generate time of day
train["Time of Day"] = np.floor(train["TransactionDT"]/3600/183)
test["Time of Day"] = np.floor(test["TransactionDT"]/3600/183)
# drop columns
train.drop("TransactionDT",axis=1,inplace=True)
test.drop("TransactionDT",axis=1,inplace=True)
# define continuous and categorical variables
cont_vars = ["TransactionAmt"]
cat_vars = ["ProductCD","addr1","addr2","P_emaildomain","R_emaildomain","Time of Day"] + [col for col in train.columns if "card" in col]
# set training and testing set
x_train = train[cont_vars + cat_vars].copy()
y_train = train["isFraud"].copy()
x_test = train[cont_vars + cat_vars].copy()
y_test = train["isFraud"].copy()
# process cont_vars
# scale values
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
x_train["TransactionAmt"] = scaler.fit_transform(x_train["TransactionAmt"].values.reshape(-1,1))
x_test["TransactionAmt"] = scaler.transform(x_test["TransactionAmt"].values.reshape(-1,1))
# reduce cardinality of categorical variables
idx_list = x_train["card1"].value_counts()[x_train["card1"].value_counts()<=100].index.tolist()
x_train.loc[x_train["card1"].isin(idx_list),"card1"] = "Others"
x_test.loc[x_test["card1"].isin(idx_list),"card1"] = "Others"
# convert to numerical value for modelling
def categorify(df, cat_vars):
categories = {}
for cat in cat_vars:
df[cat] = df[cat].astype("category").cat.as_ordered()
categories[cat] = df[cat].cat.categories
return categories
def apply_test(test,categories):
for cat, index in categories.items():
test[cat] = pd.Categorical(test[cat],categories=categories[cat],ordered=True)
# convert to integers
categories = categorify(x_train,cat_vars)
apply_test(x_test,categories)
for cat in cat_vars:
x_train[cat] = x_train[cat].cat.codes+1
x_test[cat] = x_test[cat].cat.codes+1
# fill missing
x_train[cat_vars] = x_train[cat_vars].fillna("Missing")
x_test[cat_vars] = x_test[cat_vars].fillna("Missing")
for cat, index in categories.items():
test[cat] = pd.Categorical(test[cat],categories=categories[cat],ordered=True)
평가 코드는 아래와 같습니다.
def model_evaluation(label, predict):
cf_matrix = confusion_matrix(label, predict)
Accuracy = (cf_matrix[0][0] + cf_matrix[1][1]) / sum(sum(cf_matrix))
Precision = cf_matrix[1][1] / (cf_matrix[1][1] + cf_matrix[0][1])
Recall = cf_matrix[1][1] / (cf_matrix[1][1] + cf_matrix[1][0])
F1_Score = (2 * Recall * Precision) / (Recall + Precision)
print("Model_Evaluation with Label:1")
print("Accuracy: ", Accuracy)
print("Precision: ", Precision)
print("Recall: ", Recall)
print("F1-Score: ", F1_Score)
STEP 2. Light GBM 기본 모델 테스트
Light GBM 기본 모델로 불균형 데이터 처리를 수행합니다.
- 학습 데이터를 LightGBM 모델에 맞게 변환함
- LightGBM 모델 학습함
- LightGBM 모델 평가함
# LightGBM 수행
from sklearn.metrics import confusion_matrix
import lightgbm as lgb
lgb_dtrain = lgb.Dataset(data = pd.DataFrame(x_train), label = pd.DataFrame(y_train)) # 학습 데이터를 LightGBM 모델에 맞게 변환
lgb_param = {'max_depth': 10, # 트리 깊이
'learning_rate': 0.01, # Step Size
'n_estimators': 50, # Number of trees, 트리 생성 개수
'objective': 'multiclass', # 목적 함수
'num_class': len(set(pd.DataFrame(y_train))) + 1} # 파라미터 추가, Label must be in [0, num_class) -> num_class보다 1 커야한다.
lgb_model = lgb.train(params = lgb_param, train_set = lgb_dtrain) # 학습 진행
lgb_model_predict = np.argmax(lgb_model.predict(x_test), axis = 1) # 평가 데이터 예측, Softmax의 결과값 중 가장 큰 값의 Label로 예측
model_evaluation(y_test, lgb_model_predict) # 모델 분류 결과 평가
C:\Users\drogpard\Anaconda3\lib\site-packages\lightgbm\engine.py:148: UserWarning: Found `n_estimators` in params. Will use it instead of argument
warnings.warn("Found `{}` in params. Will use it instead of argument".format(alias))
Model_Evaluation with Label:1
Accuracy: 0.9650099908558268
Precision: nan
Recall: 0.0
F1-Score: nan
C:\Users\drogpard\Anaconda3\lib\site-packages\ipykernel_launcher.py:4: RuntimeWarning: invalid value encountered in longlong_scalars
after removing the cwd from sys.path.
STEP 3. SMOTE을 이용한 Oversampling 수행
SMOTE를 이용하여 Oversampling를 수행한 후 Light GBM 기본 모델로 평가해봅니다.
- SMOTE을 이용하여 Oversampling 수행
- 학습 데이터를 LightGBM 모델에 맞게 변환함
- LightGBM 모델 학습함
- LightGBM 모델 평가함
# 기존의 X_train, y_train, X_test, y_test의 형태 확인
print("x_train: ", x_train.shape)
print("y_train: ", y_train.shape)
print("x_test: ", x_test.shape)
print("y_test: ", y_test.shape)
x_train: (590540, 13)
y_train: (590540,)
x_test: (590540, 13)
y_test: (590540,)
# SMOTE 수행
from imblearn.over_sampling import SMOTE
print("Before OverSampling, counts of label '1': {}".format(sum(y_train == 1))) # y_train 중 레이블 값이 1인 데이터의 개수
print("Before OverSampling, counts of label '0': {} \n".format(sum(y_train == 0))) # y_train 중 레이블 값이 0 인 데이터의 개수
sm = SMOTE(random_state = 42, ratio = 0.3) # SMOTE 알고리즘, 비율 증가
x_train_res, y_train_res = sm.fit_sample(x_train, y_train.ravel()) # Over Sampling 진행
print("After OverSampling, counts of label '1': {}".format(sum(y_train_res==1)))
print("After OverSampling, counts of label '0': {}".format(sum(y_train_res==0)))
C:\Users\drogpard\Anaconda3\lib\site-packages\sklearn\utils\deprecation.py:143: FutureWarning: The sklearn.neighbors.base module is deprecated in version 0.22 and will be removed in version 0.24. The corresponding classes / functions should instead be imported from sklearn.neighbors. Anything that cannot be imported from sklearn.neighbors is now part of the private API.
warnings.warn(message, FutureWarning)
C:\Users\drogpard\Anaconda3\lib\site-packages\sklearn\utils\deprecation.py:143: FutureWarning: The sklearn.ensemble.bagging module is deprecated in version 0.22 and will be removed in version 0.24. The corresponding classes / functions should instead be imported from sklearn.ensemble. Anything that cannot be imported from sklearn.ensemble is now part of the private API.
warnings.warn(message, FutureWarning)
C:\Users\drogpard\Anaconda3\lib\site-packages\sklearn\utils\deprecation.py:143: FutureWarning: The sklearn.ensemble.base module is deprecated in version 0.22 and will be removed in version 0.24. The corresponding classes / functions should instead be imported from sklearn.ensemble. Anything that cannot be imported from sklearn.ensemble is now part of the private API.
warnings.warn(message, FutureWarning)
C:\Users\drogpard\Anaconda3\lib\site-packages\sklearn\utils\deprecation.py:143: FutureWarning: The sklearn.ensemble.forest module is deprecated in version 0.22 and will be removed in version 0.24. The corresponding classes / functions should instead be imported from sklearn.ensemble. Anything that cannot be imported from sklearn.ensemble is now part of the private API.
warnings.warn(message, FutureWarning)
C:\Users\drogpard\Anaconda3\lib\site-packages\sklearn\utils\deprecation.py:143: FutureWarning: The sklearn.utils.testing module is deprecated in version 0.22 and will be removed in version 0.24. The corresponding classes / functions should instead be imported from sklearn.utils. Anything that cannot be imported from sklearn.utils is now part of the private API.
warnings.warn(message, FutureWarning)
C:\Users\drogpard\Anaconda3\lib\site-packages\sklearn\utils\deprecation.py:143: FutureWarning: The sklearn.metrics.classification module is deprecated in version 0.22 and will be removed in version 0.24. The corresponding classes / functions should instead be imported from sklearn.metrics. Anything that cannot be imported from sklearn.metrics is now part of the private API.
warnings.warn(message, FutureWarning)
Before OverSampling, counts of label '1': 20663
Before OverSampling, counts of label '0': 569877
C:\Users\drogpard\Anaconda3\lib\site-packages\sklearn\utils\deprecation.py:86: FutureWarning: Function safe_indexing is deprecated; safe_indexing is deprecated in version 0.22 and will be removed in version 0.24.
warnings.warn(msg, category=FutureWarning)
After OverSampling, counts of label '1': 170963
After OverSampling, counts of label '0': 569877
# SMOTE 전후 데이터 형태 확인
print("Before OverSampling, the shape of X_train: {}".format(x_train.shape)) # SMOTE 적용 이전 데이터 형태
print("Before OverSampling, the shape of y_train: {}".format(y_train.shape)) # SMOTE 적용 이전 데이터 형태
print('After OverSampling, the shape of X_train: {}'.format(x_train_res.shape)) # SMOTE 적용 결과 확인
print('After OverSampling, the shape of y_train: {}'.format(y_train_res.shape)) # # SMOTE 적용 결과 확인
Before OverSampling, the shape of X_train: (590540, 13)
Before OverSampling, the shape of y_train: (590540,)
After OverSampling, the shape of X_train: (740840, 13)
After OverSampling, the shape of y_train: (740840,)
# LightGBM 수행
lgb_dtrain2 = lgb.Dataset(data = pd.DataFrame(x_train_res), label = pd.DataFrame(y_train_res)) # 학습 데이터를 LightGBM 모델에 맞게 변환
lgb_param2 = {'max_depth': 10, # 트리 깊이
'learning_rate': 0.01, # Step Size
'n_estimators': 50, # Number of trees, 트리 생성 개수
'objective': 'multiclass', # 목적 함수
'num_class': len(set(pd.DataFrame(y_train_res))) + 1} # 파라미터 추가, Label must be in [0, num_class) -> num_class보다 1 커야한다.
lgb_model2 = lgb.train(params = lgb_param2, train_set = lgb_dtrain2) # 학습 진행
lgb_model2_predict = np.argmax(lgb_model2.predict(x_test), axis = 1) # 평가 데이터 예측, Softmax의 결과값 중 가장 큰 값의 Label로 예측
model_evaluation(y_test, lgb_model2_predict) # 모델 분류 평가 결과
C:\Users\drogpard\Anaconda3\lib\site-packages\lightgbm\engine.py:148: UserWarning: Found `n_estimators` in params. Will use it instead of argument
warnings.warn("Found `{}` in params. Will use it instead of argument".format(alias))
Model_Evaluation with Label:1
Accuracy: 0.9650099908558268
Precision: nan
Recall: 0.0
F1-Score: nan
C:\Users\drogpard\Anaconda3\lib\site-packages\ipykernel_launcher.py:4: RuntimeWarning: invalid value encountered in longlong_scalars
after removing the cwd from sys.path.
STEP 4. BLSM을 이용한 Oversampling 수행
BLSM을 이용하여 Oversampling를 수행한 후 Light GBM 기본 모델로 평가해봅니다.
- BLSM을 이용하여 Oversampling 수행
- 학습 데이터를 LightGBM 모델에 맞게 변환함
- LightGBM 모델 학습함
- LightGBM 모델 평가함
# BLSM 수행
from imblearn.over_sampling import BorderlineSMOTE
sm4 = BorderlineSMOTE(random_state = 42, sampling_strategy = 0.6) # BLSM 알고리즘 적용
x_train_res4, y_train_res4 = sm4.fit_sample(x_train, y_train.ravel()) # Over Sampling 적용
C:\Users\drogpard\Anaconda3\lib\site-packages\sklearn\utils\deprecation.py:86: FutureWarning: Function safe_indexing is deprecated; safe_indexing is deprecated in version 0.22 and will be removed in version 0.24.
warnings.warn(msg, category=FutureWarning)
C:\Users\drogpard\Anaconda3\lib\site-packages\sklearn\utils\deprecation.py:86: FutureWarning: Function safe_indexing is deprecated; safe_indexing is deprecated in version 0.22 and will be removed in version 0.24.
warnings.warn(msg, category=FutureWarning)
C:\Users\drogpard\Anaconda3\lib\site-packages\sklearn\utils\deprecation.py:86: FutureWarning: Function safe_indexing is deprecated; safe_indexing is deprecated in version 0.22 and will be removed in version 0.24.
warnings.warn(msg, category=FutureWarning)
# LightGBM 수행
lgb_dtrain5 = lgb.Dataset(data = pd.DataFrame(x_train_res4), label = pd.DataFrame(y_train_res4)) # 학습 데이터를 LightGBM 모델에 맞게 변환
lgb_param5 = {'max_depth': 10, # 트리 깊이
'learning_rate': 0.01, # Step Size
'n_estimators': 50, # Number of trees, 트리 생성 개수
'objective': 'multiclass', # 목적 함수
'num_class': len(set(pd.DataFrame(y_train_res4))) + 1} # 파라미터 추가, Label must be in [0, num_class) -> num_class보다 1 커야한다.
lgb_model5 = lgb.train(params = lgb_param5, train_set = lgb_dtrain5) # 학습 진행
lgb_model5_predict = np.argmax(lgb_model5.predict(x_test), axis = 1) # 평가 데이터 예측, Softmax의 결과값 중 가장 큰 값의 Label로 예측
model_evaluation(y_test, lgb_model5_predict) # 모델 분류 평가 결과
C:\Users\drogpard\Anaconda3\lib\site-packages\lightgbm\engine.py:148: UserWarning: Found `n_estimators` in params. Will use it instead of argument
warnings.warn("Found `{}` in params. Will use it instead of argument".format(alias))
Model_Evaluation with Label:1
Accuracy: 0.9261201612083856
Precision: 0.19434092844974446
Recall: 0.35333688235009436
F1-Score: 0.25075990451821195