

Imbalanced Data in IEEE-CIS Fraud Detection

- STEP 1: Load and preprocess the IEEE-CIS Fraud Detection data
- STEP 2: Test a baseline LightGBM model
- STEP 3-4: Perform oversampling
  • Author/Reviewer: 김태영


STEP 1~4. The Imbalanced Data Problem in IEEE-CIS Fraud Detection

Background

Using the data from IEEE-CIS Fraud Detection, a fraudulent-transaction prediction competition actually hosted on Kaggle, we will analyze imbalanced data and build a simple prediction model. To improve the model's predictive performance, we will apply the various imbalanced-data handling techniques covered in this part and examine the results.

Goals

  • Load and preprocess the IEEE-CIS Fraud Detection data
  • Test a baseline LightGBM model
  • Perform oversampling with SMOTE
  • Perform oversampling with BLSM (Borderline-SMOTE)

STEP 1. Loading and Preprocessing the IEEE-CIS Fraud Detection Data

Load the IEEE-CIS Fraud Detection data and preprocess it.

  • Download train_transaction.csv and test_transaction.csv from the Data tab at https://www.kaggle.com/c/ieee-fraud-detection
  • Handle categorical variables and missing values
  • The preprocessing step is not part of the evaluation in this part, so the source code is provided
import numpy as np
import pandas as pd

train = pd.read_csv("train_transaction.csv")
test = pd.read_csv("test_transaction.csv")

train["isFraud"].mean() #  0.03499000914417313
0.03499000914417313
train.head()
TransactionID isFraud TransactionDT TransactionAmt ProductCD card1 card2 card3 card4 card5 ... V330 V331 V332 V333 V334 V335 V336 V337 V338 V339
0 2987000 0 86400 68.5 W 13926 NaN 150.0 discover 142.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 2987001 0 86401 29.0 W 2755 404.0 150.0 mastercard 102.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 2987002 0 86469 59.0 W 4663 490.0 150.0 visa 166.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 2987003 0 86499 50.0 W 18132 567.0 150.0 mastercard 117.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 2987004 0 86506 50.0 H 4497 514.0 150.0 mastercard 102.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

5 rows × 394 columns

# generate time of day (TransactionDT is in seconds: /3600 converts to hours, /183 forms coarse buckets)
train["Time of Day"] = np.floor(train["TransactionDT"]/3600/183)
test["Time of Day"] = np.floor(test["TransactionDT"]/3600/183)

# drop columns
train.drop("TransactionDT",axis=1,inplace=True)
test.drop("TransactionDT",axis=1,inplace=True)

# define continuous and categorical variables
cont_vars = ["TransactionAmt"]
cat_vars = ["ProductCD","addr1","addr2","P_emaildomain","R_emaildomain","Time of Day"] + [col for col in train.columns if "card" in col]

# set training and testing set
# (the Kaggle test_transaction.csv has no isFraud labels, so the evaluation
#  set used throughout this part is drawn from the training data)
x_train = train[cont_vars + cat_vars].copy()
y_train = train["isFraud"].copy()
x_test = train[cont_vars + cat_vars].copy()
y_test = train["isFraud"].copy()

# process cont_vars
# scale values
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
x_train["TransactionAmt"] = scaler.fit_transform(x_train["TransactionAmt"].values.reshape(-1,1))
x_test["TransactionAmt"] = scaler.transform(x_test["TransactionAmt"].values.reshape(-1,1))

# reduce cardinality of categorical variables:
# pool card1 values that occur 100 times or fewer into a single "Others" level
idx_list = x_train["card1"].value_counts()[x_train["card1"].value_counts()<=100].index.tolist()
x_train.loc[x_train["card1"].isin(idx_list),"card1"] = "Others"
x_test.loc[x_test["card1"].isin(idx_list),"card1"] = "Others"

# convert to numerical values for modelling
def categorify(df, cat_vars):
    # turn each categorical column into an ordered pandas Categorical and
    # remember the category levels so they can be reused on the test set
    categories = {}
    for cat in cat_vars:
        df[cat] = df[cat].astype("category").cat.as_ordered()
        categories[cat] = df[cat].cat.categories
    return categories

def apply_test(test, categories):
    # re-encode the test set with the category levels learned on train
    for cat, index in categories.items():
        test[cat] = pd.Categorical(test[cat], categories=index, ordered=True)

# convert to integers
categories = categorify(x_train, cat_vars)
apply_test(x_test, categories)

for cat in cat_vars:
    # cat.codes maps NaN to -1, so the +1 shift encodes missing values as 0
    x_train[cat] = x_train[cat].cat.codes + 1
    x_test[cat] = x_test[cat].cat.codes + 1



# fill missing (after the +1 code shift above the categorical columns no longer
# contain NaN, so this fillna is kept only as a safety net)
x_train[cat_vars] = x_train[cat_vars].fillna("Missing")
x_test[cat_vars] = x_test[cat_vars].fillna("Missing")
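
One detail worth knowing about this encoding: pandas assigns the category code -1 to NaN, so the +1 shift above is what sends missing values to 0. A minimal illustration (not part of the original notebook):

s = pd.Series(["a", None, "b"]).astype("category")
print(s.cat.codes.tolist())        # [0, -1, 1]: NaN gets code -1
print((s.cat.codes + 1).tolist())  # [1, 0, 2]: missing now maps to 0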

The evaluation code is shown below.

from sklearn.metrics import confusion_matrix

def model_evaluation(label, predict):
    cf_matrix = confusion_matrix(label, predict)
    Accuracy = (cf_matrix[0][0] + cf_matrix[1][1]) / sum(sum(cf_matrix))
    # note: 0/0 here yields nan when the model makes no positive predictions
    Precision = cf_matrix[1][1] / (cf_matrix[1][1] + cf_matrix[0][1])
    Recall = cf_matrix[1][1] / (cf_matrix[1][1] + cf_matrix[1][0])
    F1_Score = (2 * Recall * Precision) / (Recall + Precision)
    print("Model_Evaluation with Label:1")
    print("Accuracy: ", Accuracy)
    print("Precision: ", Precision)
    print("Recall: ", Recall)
    print("F1-Score: ", F1_Score)

STEP 2. Baseline LightGBM Model Test

Train and evaluate a baseline LightGBM model on the imbalanced data as-is.

  • Convert the training data into LightGBM's Dataset format
  • Train the LightGBM model
  • Evaluate the LightGBM model
# run LightGBM
from sklearn.metrics import confusion_matrix
import lightgbm as lgb
lgb_dtrain = lgb.Dataset(data = pd.DataFrame(x_train), label = pd.DataFrame(y_train)) # convert the training data into LightGBM's Dataset format
lgb_param = {'max_depth': 10, # tree depth
            'learning_rate': 0.01, # step size
            'n_estimators': 50, # number of trees
            'objective': 'multiclass', # objective function
            'num_class': y_train.nunique()} # number of classes; labels must lie in [0, num_class)
lgb_model = lgb.train(params = lgb_param, train_set = lgb_dtrain) # train the model
lgb_model_predict = np.argmax(lgb_model.predict(x_test), axis = 1) # predict on the evaluation data: take the label with the largest softmax output
model_evaluation(y_test, lgb_model_predict) # evaluate the classification results
C:\Users\drogpard\Anaconda3\lib\site-packages\lightgbm\engine.py:148: UserWarning: Found `n_estimators` in params. Will use it instead of argument
  warnings.warn("Found `{}` in params. Will use it instead of argument".format(alias))


Model_Evaluation with Label:1
Accuracy:  0.9650099908558268
Precision:  nan
Recall:  0.0
F1-Score:  nan


C:\Users\drogpard\Anaconda3\lib\site-packages\ipykernel_launcher.py:4: RuntimeWarning: invalid value encountered in longlong_scalars
  after removing the cwd from sys.path.
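
The nan Precision is the class imbalance showing up directly: Precision is TP / (TP + FP), and with Recall at 0.0 the baseline model never predicts class 1, so there are no positive predictions to score. A quick way to confirm this (a sketch added for illustration, not in the original notebook):

# count predictions per class; the baseline puts every row in class 0
print(np.bincount(lgb_model_predict, minlength=2))  # expected: [590540      0]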

STEP 3. Oversampling with SMOTE

Perform oversampling with SMOTE, then evaluate with the baseline LightGBM model. A toy sketch of how SMOTE works follows the checklist below.

  • Perform oversampling with SMOTE
  • Convert the training data into LightGBM's Dataset format
  • Train the LightGBM model
  • Evaluate the LightGBM model
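Before applying SMOTE to the fraud data, here is a toy sketch of what it actually does: for each minority sample it picks one of its k nearest minority-class neighbors and places a synthetic point on the line segment between them. (This block is illustrative only; make_classification just builds a small imbalanced dataset.)

from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

# small two-class set with roughly a 5% minority class
X_toy, y_toy = make_classification(n_samples=1000, weights=[0.95, 0.05],
                                   random_state=42)
print("before:", Counter(y_toy))

# interpolate new minority points until minority:majority reaches 0.5
X_res, y_res = SMOTE(sampling_strategy=0.5, random_state=42).fit_resample(X_toy, y_toy)
print("after: ", Counter(y_res))
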
# check the shapes of the existing x_train, y_train, x_test, y_test
print("x_train: ", x_train.shape)
print("y_train: ", y_train.shape)
print("x_test: ", x_test.shape)
print("y_test: ", y_test.shape)
x_train:  (590540, 13)
y_train:  (590540,)
x_test:  (590540, 13)
y_test:  (590540,)
# run SMOTE
from imblearn.over_sampling import SMOTE
print("Before OverSampling, counts of label '1': {}".format(sum(y_train == 1))) # number of label-1 rows in y_train
print("Before OverSampling, counts of label '0': {} \n".format(sum(y_train == 0))) # number of label-0 rows in y_train

sm = SMOTE(random_state = 42, sampling_strategy = 0.3) # SMOTE: grow the minority class to a 0.3 minority:majority ratio
x_train_res, y_train_res = sm.fit_resample(x_train, y_train.ravel()) # perform the oversampling

print("After OverSampling, counts of label '1': {}".format(sum(y_train_res==1)))
print("After OverSampling, counts of label '0': {}".format(sum(y_train_res==0)))


Before OverSampling, counts of label '1': 20663
Before OverSampling, counts of label '0': 569877 




After OverSampling, counts of label '1': 170963
After OverSampling, counts of label '0': 569877
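
The counts line up with the requested ratio: sampling_strategy=0.3 asks SMOTE to synthesize minority samples until the minority:majority ratio reaches 0.3, leaving the majority class untouched. A quick check:

print(sum(y_train_res == 1) / sum(y_train_res == 0))  # 170963 / 569877 ≈ 0.3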
# check data shapes before and after SMOTE
print("Before OverSampling, the shape of X_train: {}".format(x_train.shape)) # shape before applying SMOTE
print("Before OverSampling, the shape of y_train: {}".format(y_train.shape)) # shape before applying SMOTE
print('After OverSampling, the shape of X_train: {}'.format(x_train_res.shape)) # shape after applying SMOTE
print('After OverSampling, the shape of y_train: {}'.format(y_train_res.shape)) # shape after applying SMOTE
Before OverSampling, the shape of X_train: (590540, 13)
Before OverSampling, the shape of y_train: (590540,)
After OverSampling, the shape of X_train: (740840, 13)
After OverSampling, the shape of y_train: (740840,)
# run LightGBM
lgb_dtrain2 = lgb.Dataset(data = pd.DataFrame(x_train_res), label = pd.DataFrame(y_train_res)) # convert the training data into LightGBM's Dataset format
lgb_param2 = {'max_depth': 10, # tree depth
            'learning_rate': 0.01, # step size
            'n_estimators': 50, # number of trees
            'objective': 'multiclass', # objective function
            'num_class': len(np.unique(y_train_res))} # number of classes; labels must lie in [0, num_class)
lgb_model2 = lgb.train(params = lgb_param2, train_set = lgb_dtrain2) # train the model
lgb_model2_predict = np.argmax(lgb_model2.predict(x_test), axis = 1) # predict on the evaluation data: take the label with the largest softmax output
model_evaluation(y_test, lgb_model2_predict) # evaluate the classification results


Model_Evaluation with Label:1
Accuracy:  0.9650099908558268
Precision:  nan
Recall:  0.0
F1-Score:  nan


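Note that SMOTE at a 0.3 ratio did not move the metrics here: Precision is still nan and Recall 0.0, meaning the resampled model still makes no positive predictions on this evaluation set. STEP 4 pushes the minority ratio higher (0.6) with Borderline-SMOTE.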

STEP 4. Oversampling with BLSM

Perform oversampling with BLSM (Borderline-SMOTE), then evaluate with the baseline LightGBM model. A short note on how BLSM differs from SMOTE follows the checklist below.

  • Perform oversampling with BLSM
  • Convert the training data into LightGBM's Dataset format
  • Train the LightGBM model
  • Evaluate the LightGBM model
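Unlike plain SMOTE, Borderline-SMOTE only synthesizes points around "danger" samples: minority samples whose nearest neighbors are mostly majority class, i.e. the ones sitting near the class border. imblearn exposes two variants through the kind parameter; a minimal sketch of the constructor options (illustrative, not from the original notebook):

from imblearn.over_sampling import BorderlineSMOTE

# "borderline-1": danger samples interpolate toward minority-class neighbors only
# "borderline-2": interpolation may also move toward majority-class neighbors
blsm1 = BorderlineSMOTE(kind="borderline-1", random_state=42)
blsm2 = BorderlineSMOTE(kind="borderline-2", random_state=42)
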
# run BLSM
from imblearn.over_sampling import BorderlineSMOTE
sm4 = BorderlineSMOTE(random_state = 42, sampling_strategy = 0.6) # Borderline-SMOTE: grow the minority class to a 0.6 minority:majority ratio
x_train_res4, y_train_res4 = sm4.fit_resample(x_train, y_train.ravel()) # perform the oversampling
# run LightGBM
lgb_dtrain5 = lgb.Dataset(data = pd.DataFrame(x_train_res4), label = pd.DataFrame(y_train_res4)) # convert the training data into LightGBM's Dataset format
lgb_param5 = {'max_depth': 10, # tree depth
            'learning_rate': 0.01, # step size
            'n_estimators': 50, # number of trees
            'objective': 'multiclass', # objective function
            'num_class': len(np.unique(y_train_res4))} # number of classes; labels must lie in [0, num_class)
lgb_model5 = lgb.train(params = lgb_param5, train_set = lgb_dtrain5) # train the model
lgb_model5_predict = np.argmax(lgb_model5.predict(x_test), axis = 1) # predict on the evaluation data: take the label with the largest softmax output
model_evaluation(y_test, lgb_model5_predict) # evaluate the classification results


Model_Evaluation with Label:1
Accuracy:  0.9261201612083856
Precision:  0.19434092844974446
Recall:  0.35333688235009436
F1-Score:  0.25075990451821195
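
Compared with the baseline in STEP 2 (Accuracy 0.9650, Precision nan, Recall 0.0), BLSM gives up about four points of accuracy but the model now actually detects fraud (Recall ≈ 0.35, F1 ≈ 0.25), which is exactly the trade-off the oversampling techniques in this part aim for.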