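Last time, simply filling in the missing values did not improve the score at all.

So how about looking at other people's notebooks first?

One opinion I came across was that the variables are related to one another. If that is true, the problem gets more complicated.

So while searching the discussion board for ways to handle missing values, I found an interesting notebook.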

 

 

Categorical Feature Encoding Challenge II

Explore and run machine learning code with Kaggle Notebooks | Using data from Categorical Feature Encoding Challenge II

www.kaggle.com

The gist of this notebook is as follows.

(test.isna().sum(axis=1)==0).sum()/test.shape[0]
0.495405

When some rows have too many missing values, we can consider deleting them. Here, every row contains less than 8 missing values. So we would probably discard more information than noise by deleting it. Moreover, to drop rows can bias the dataset if the values are not randomly missing.

 

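In short, the NaNs are spread thinly across the rows (only about half of the rows are complete, and no row has many NaNs), so dropna would throw away more information than noise.

I had expected as much.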
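
To check this kind of claim on your own data, something like the following might do. It is only a minimal sketch: the `toy` frame and its values are made up and merely stand in for the competition's test set.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the competition's test set (values are made up).
toy = pd.DataFrame({
    "bin_0": [0, 1, np.nan, 1],
    "nom_1": ["Circle", np.nan, "Star", "Square"],
    "ord_0": [1, 2, 3, np.nan],
    "day":   [1, 5, 7, 2],
})

missing_per_row = toy.isna().sum(axis=1)        # NaN count in each row
complete_ratio = (missing_per_row == 0).mean()  # share of rows without any NaN
worst_row = missing_per_row.max()               # largest NaN count in a single row
kept_by_dropna = len(toy.dropna())              # rows that would survive dropna()

print(complete_ratio, worst_row, kept_by_dropna)
```

Here each incomplete row is missing only one value, yet dropna() would keep just one row out of four, which is the same trade-off the notebook warns about.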

 

II.2 Different missing data mechanisms.

     Understanding the reasons why data are missing helps to handle them. Therefore, let's remind the generally considered missing data mechanisms.

  • Missingness completely at random (MCAR): the values of a variable are missing completely at random if the propensity of missingness is the same for each value. When data are MCAR, the analysis performed on the non-missing data are unbiased. Unfortunately, data are rarely MCAR.
  • Missing at random (MAR) occurs when the probability a variable is missing depends only on available information, that is to say on the other variables of the dataset (the expression "at random" refers to the random impact of the unavailable information on the missingness).
  • Missing not at random (MNAR): this mechanism includes the cases where the missingness depends on unobserved predictors or on the missing value itself (whose a special case is censoring : it happens for example if people with a salary superior to a certain threshold refuse to communicate on its value).

     Of course, the missing data mechanisms can be different for different variables of our dataset. We can never be sure of the mechanism by which the data are missing. Furthermore, data are never MCAR or even MAR. If a data is missing, there is necessarily a cause (or even several) as nothing happens by chance. This cause is in fact a variable linked to the missingness. But we are looking for a reasonable (in the sense that it is coherent and it conforms well to our observations) assumption more than the strict reality.

     We should usually try to include new features in our dataset if possible. Generally, more we add variables in our dataset, more reasonable is the MAR assumption. Even if these features are not necessarily efficient for the final task (whatever it is, prediction or clustering), they can be useful to handle the NaN (it is always possible to drop them once the NaN have been managed).

 

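That is a lot of English, but a three-line summary goes like this.

When data are missing, the commonly considered mechanisms are:

- Missing completely at random (MCAR): the propensity to be missing is the same for every value. Data are rarely MCAR.

- Missing at random (MAR): the probability that a variable is missing depends only on the available information, that is, on the other variables in the dataset.

- Missing not at random (MNAR): the missingness depends on unobserved predictors or on the missing value itself (censoring is a special case).

In general, MAR is the reasonable assumption (the missingness depends on other observed information, so it is not purely random and can be partly predicted).

So the point is that here it is best to assume MAR and process the dataset on the premise that whether a value is NaN depends on the other information.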
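
The three mechanisms are easier to tell apart with a small simulation. The following is a sketch under made-up assumptions: the variables `age` and `salary`, the probabilities, and the thresholds are all invented for illustration and have nothing to do with the competition data.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
age = rng.integers(20, 65, size=n)                     # always observed
salary = 1000 + 50 * age + rng.normal(0, 300, size=n)  # the variable that gets NaNs

# MCAR: every value has the same 30% chance of being missing.
mcar_mask = rng.random(n) < 0.3
# MAR: missingness depends only on the observed variable `age`.
mar_mask = rng.random(n) < np.where(age > 45, 0.5, 0.1)
# MNAR: missingness depends on the hidden value itself (high salaries go missing).
mnar_mask = rng.random(n) < np.where(salary > np.median(salary), 0.5, 0.1)

true_mean = salary.mean()
masks = {"MCAR": mcar_mask, "MAR": mar_mask, "MNAR": mnar_mask}
# Bias of the mean computed on the non-missing values only.
bias = {name: salary[~mask].mean() - true_mean for name, mask in masks.items()}

for name, value in bias.items():
    print(f"{name}: observed mean - true mean = {value:+.1f}")
```

In this toy setup the complete-case mean stays close to the truth only under MCAR. Under MAR it also shifts, but the cause (`age`) is observed, so it can be accounted for; under MNAR the cause is the hidden value itself.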

 

II.3 Missing data exploration strategy.

     To identify the mechanisms involved, we can use knowledge about the data (their meaning, how they were collected, etc.) and investigate them to find patterns. Here, as mentionned previously, we have no knowledge about data. So we are bound to investigate patterns.

     The exploration of the relationship between two categorical variables is usually done with contingency tables. For example, below is the contingency table between nom_1 and nom_2. As all the categories are not equally represented, each value in the table is divided by the number of observations in the category corresponding to the row (so we can compare the values between the rows).

 

 

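Since we have no knowledge about the data (we do not know how the variables relate), we have to investigate patterns.

 

Relationships are explored with contingency tables.

Below is the contingency table between nom_1 and nom_2.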

cont_table = pd.crosstab(df.nom_1, df.nom_2, normalize='index')
cont_table
nom_2	Axolotl	Cat	Dog	Hamster	Lion	Snake
nom_1						
Circle	0.260670	0.045213	0.181373	0.283664	0.204534	0.024546
Polygon	0.262407	0.044725	0.179016	0.283159	0.206252	0.024440
Square	0.259412	0.044773	0.184271	0.280475	0.207748	0.023321
Star	0.263891	0.046376	0.182368	0.280735	0.203223	0.023407
Trapezoid	0.262692	0.045957	0.180624	0.282211	0.204246	0.024270
Triangle	0.261726	0.044783	0.179622	0.284450	0.205132	0.024287

 

Or as a heatmap:

plt.matshow(cont_table, cmap=plt.cm.gray)

 

 

   Here, we observe that nom_2 takes more often the value Dog when nom_1 takes the value Polygon than when nom_1 takes the value Square. But how to know if this observation is enough to conclude to a link between the two variables ?
     We can't do it properly just by visualizing the data. Instead, we need to quantify the pattern observed. The classical way to check (or at least to estimate properly) if a pattern is due to sampling fluctuation or to an actual link is to perform a statistical test. Thus, we will not display a myriad of contingency tables. It would be useless. Instead, we will perform statistical tests.

 

How do we conclude from this observation that the two variables are related?

Visualizing the data is not enough. The observed pattern has to be quantified.

-> Perform a statistical test.

 

II.4 Tests of independence.

 

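We need a formal test of whether two variables are independent or not.

It looks difficult, but the idea loosely resembles RMSE: the chi-squared statistic adds up the squared gaps between the observed counts and the counts expected under independence, each scaled by the expected count.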
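
To make the statistic concrete, here is a minimal hand-rolled sketch of Pearson's chi-squared statistic. The helper name `chi2_statistic` and the 2x2 table are my own illustration, not something taken from the notebook.

```python
import numpy as np

def chi2_statistic(observed):
    """Pearson chi-squared statistic and degrees of freedom for a contingency table."""
    observed = np.asarray(observed, dtype=float)
    row_totals = observed.sum(axis=1, keepdims=True)
    col_totals = observed.sum(axis=0, keepdims=True)
    n = observed.sum()
    # Counts we would expect in each cell if rows and columns were independent.
    expected = row_totals @ col_totals / n
    chi2 = ((observed - expected) ** 2 / expected).sum()
    dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)
    return chi2, dof

# Made-up 2x2 table: rows = "value missing?" (no / yes), columns = two categories.
table = [[10, 20],
         [20, 10]]
chi2, dof = chi2_statistic(table)
print(round(chi2, 4), dof)  # 6.6667 1
```

Note that scipy's chi2_contingency applies Yates' continuity correction by default on 2x2 tables, so its statistic for this particular table would come out smaller than the uncorrected value above.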

 

  For each variable, we will apply this test between the missingness indicator of that variable and each other variable in our dataset. The null hypothesis will be rejected when the p-value is under α = 5% (the threshold generally used).

 

In other words, for each pair of variables, independence is rejected when the p-value falls below 5%.

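from datetime import datetime

import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency
from statsmodels.stats.multitest import fdrcorrection

def cochran_criterion(crosstab):
  # Rule of thumb on expected counts: none may be 0 and more than 80% must exceed 4.
  criterion = True
  E = []
  N = crosstab.values.sum()  # number of observations counted in this table
  for O_i in crosstab.sum(axis=1):
    for O_j in crosstab.sum():
      if O_i*O_j/N==0:
        criterion = False
      E.append(O_i*O_j/N > 4)
  criterion = criterion & (np.mean(E)>0.8)
  return criterion

def chi2_test(X,Y):
  # Chi-squared test of independence between two categorical series.
  crosstab = pd.crosstab(X, Y)
  criterion = cochran_criterion(crosstab)
  chi2, p = chi2_contingency(crosstab)[:2]
  return [criterion, chi2, p]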
start = datetime.now()

df_total = pd.concat([df.drop('target', axis=1), test])
chi2_missingness = pd.DataFrame(columns=['Cochran_criterion', 'Chi2', 'p_value'])

for col1 in df_total.columns:
  for col2 in df_total.columns.drop(col1):
    missingness_indicator = df_total[col1].isna()
    other_variable = df_total[col2]
    chi2_missingness.loc[col1 + '_' + col2] = chi2_test(missingness_indicator, other_variable)

for col in df.columns.drop('target'):
  missingness_indicator = df[col].isna()
  chi2_missingness.loc[col + '_target'] = chi2_test(missingness_indicator, df.target)

runtime = datetime.now() - start
print('Runtime ', runtime)
chi2_missingness.sort_values('p_value', inplace=True)
reject, pvalue_corrected = fdrcorrection(pvals=chi2_missingness.p_value, alpha=0.05, method='indep', is_sorted=True)
chi2_missingness['p_value_corrected'] = pvalue_corrected
chi2_missingness['BH_reject'] = chi2_missingness['Cochran_criterion'] & reject
chi2_missingness.to_csv('chi2_tests_missingness')

 

chi2_missingness.head()
	Cochran_criterion	Chi2	p_value	p_value_corrected	BH_reject
bin_0_ord_1	True	19.899569	0.000523	0.276519	False
nom_4_bin_3	True	10.559514	0.001156	0.305740	False
nom_9_nom_3	True	18.400236	0.002484	0.339804	False
day_target	True	9.090502	0.002569	0.339804	False
ord_0_nom_5	True	1352.369550	0.004373	0.462678	False
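
The p_value_corrected column comes from the Benjamini-Hochberg procedure that fdrcorrection applies. As a rough sketch of what that correction does (the helper `bh_adjust` and the four p-values are my own illustration, not the statsmodels implementation):

```python
import numpy as np

def bh_adjust(pvals):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    pvals = np.asarray(pvals, dtype=float)
    m = len(pvals)
    order = np.argsort(pvals)
    ranked = pvals[order] * m / np.arange(1, m + 1)  # p * m / rank
    # Enforce monotonicity from the largest p-value downwards.
    ranked = np.minimum.accumulate(ranked[::-1])[::-1]
    adjusted = np.empty(m)
    adjusted[order] = np.clip(ranked, 0, 1)          # adjusted p-values cannot exceed 1
    return adjusted

adjusted = bh_adjust([0.04, 0.001, 0.5, 0.03])
reject = adjusted < 0.05
print(adjusted, reject)
```

The smallest raw p-value is multiplied by the total number of tests, which is why even 0.000523 ends up far above 0.05 in the table above once hundreds of variable pairs have been tested.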

 

 

 As we imputed all their missing values, the features bin_0, bin_1, bin_2, bin_3 and bin_4 are binary variables (hence their names) taking values 0 and 1 (for bin_0, bin_1 and bin_2) or F and T (for bin_3) or N and Y (for bin_4). Therefore, bin_0, bin_1 and bin_2 are kept in this form. The values of bin_3 and bin_4 are just mapped with 0 (for F and N respectively) and 1 (for T and Y respectively) to be in an appropriate format.

     The predictors nom_0, nom_1, nom_2, nom_3 and nom_4 have a priori no order and have a cardinality low enough to be one-hot encoded. So we will one-hot encode them.

     The predictors nom_5, nom_6, nom_7, nom_8, nom_9 have a priori no order either. Albeit their values look like hexadecimal numbers, to encode them this way was not efficient. But, as they have a high cardinality, we will not one-hot encode them (neither binary encode them or base-N encode them). We won't use ordinal encoding as it would create an artificial order in these nominal features. Instead, we will apply a target encoding in order to create an order that has sense: each modality will be encoded with the probability that the target equals 1 given that modality (in fact this probability is regularized by a weighted averaging with the a priori probability the target equals 1, the weights being tuned by a smoothing parameter).
     Nonetheless, target encoding is more prone to overfitting than ordinal encoding as target encoding use information about the variable we want to predict. To avoid overfitting, one approach is to add random noise (and sometimes to exclude the current row’s target when calculating the mean target for a level to reduce the effect of outliers). Instead, here, we use a stratified 5-fold splitting strategy in the same way as for 5-fold cross-validation: we train a target encoder on 80% of the training set and use it to encode the remaining 20%. Thus, with 5 splits we encode all the training set. The test set is then encoded with a target encoder trained on all the training set.

     Although ord_0 take three ordered values (1, 2 and 3), as we don't know the meaning of that variable, there is no particular reason to think the distance between the three values should be the same. I tried to use some transformations (taking the square, the square root, etc.) of ord_0 but the default order worked better, so we will keep it.
     To impute its missing values with the mode (which is 1) would bias not only its distribution (because it would over-represent the value 1) but also its relation with the target. Indeed, since the test of independence between the missingness of ord_0 and the target was not significant, there is no reason to think the target more often equals to 1 when ord_0 is missing. But, as shown in the graph below, the modality 1 of ord_0 is associated with a higher propensity of the target to equal 0 (as the target is positively correlated to ord_0 and 1 is the minimum of ord_0).

 

As expected, bin_0 through bin_4 only need to be turned into 0 or 1.
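
For bin_3 and bin_4 the mapping quoted above (F/N to 0, T/Y to 1) could look like this. It is a minimal sketch on a made-up `toy` frame.

```python
import pandas as pd

# Made-up rows using the letter codes described for bin_3 and bin_4.
toy = pd.DataFrame({
    "bin_3": ["T", "F", "F", "T"],
    "bin_4": ["Y", "N", "Y", "N"],
})

# Map the letter codes to 0/1 so every bin_* column shares one numeric format.
toy["bin_3"] = toy["bin_3"].map({"F": 0, "T": 1})
toy["bin_4"] = toy["bin_4"].map({"N": 0, "Y": 1})
print(toy)
```

Values that are not in the mapping, including NaN, stay NaN after map(), so this step assumes the missing values are handled separately.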

 

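The rest is as expected too. The one difference is that nom_5 through nom_9 were target-encoded. That brings a risk of overfitting, but apparently the encoding is useful enough to be worth it.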
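
The out-of-fold target encoding described in the quoted text can be sketched roughly as below. Several things here are my assumptions rather than the notebook's code: the helper names, the additive smoothing formula, the `smoothing` value, the plain (non-stratified) fold assignment used instead of a stratified split, and the made-up data.

```python
import numpy as np
import pandas as pd

def smoothed_means(cats, target, prior, smoothing):
    """Per-category target mean, shrunk toward the global prior."""
    stats = target.groupby(cats).agg(["sum", "count"])
    return (stats["sum"] + smoothing * prior) / (stats["count"] + smoothing)

def oof_target_encode(cats, target, n_splits=5, smoothing=10.0, seed=0):
    """Encode each row with statistics learned on the other folds only."""
    rng = np.random.default_rng(seed)
    fold_of_row = rng.permutation(len(cats)) % n_splits  # plain (non-stratified) folds
    encoded = pd.Series(np.nan, index=cats.index)
    for fold in range(n_splits):
        held_out = fold_of_row == fold
        prior = target[~held_out].mean()
        means = smoothed_means(cats[~held_out], target[~held_out], prior, smoothing)
        # Categories unseen in the training folds fall back to the prior.
        encoded[held_out] = cats[held_out].map(means).fillna(prior)
    return encoded

# Made-up data standing in for a high-cardinality column such as nom_5.
toy = pd.DataFrame({
    "nom_5": ["a1", "b2", "a1", "c3", "b2", "a1", "c3", "a1", "b2", "a1"],
    "target": [1, 0, 1, 0, 0, 1, 1, 0, 0, 1],
})
toy["nom_5_te"] = oof_target_encode(toy["nom_5"], toy["target"])
print(toy)
```

Because a row's own target never enters the statistics used to encode it, the leakage that makes target encoding prone to overfitting is reduced.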

 

In the end, the notebook trained its prediction model with XGBoost using the following code.

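#### Modelisation with XGBoost Gradient Boosting trees Classifier.
from datetime import datetime

import pandas as pd
import xgboost as xgb
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold

# Ratio of negatives to positives, used to rebalance the classes.
weight = sum(y.values==0)/sum(y.values==1)

# eval_metric goes in the constructor; passing it to fit() has been deprecated since XGBoost 1.6.
xgb_gbrf = xgb.XGBClassifier(n_estimators=818, random_state=0, objective='binary:logistic', scale_pos_weight=weight, 
                             learning_rate=0.15, max_depth=2, subsample=0.7, min_child_weight=500,  colsample_bytree = 0.2,
                             reg_lambda = 3.5, reg_alpha=1.5, num_parallel_tree= 5, eval_metric='auc')

xgb_gbrf.fit(X, y, eval_set= [(X, y)], verbose=False)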
start = datetime.now()

train_scores = []
val_scores = []
y_validations = []
y_predictions = []

for index_train, index_val in StratifiedKFold(n_splits=5, random_state=42, shuffle=True).split(X, y):
        
    X_train = X.iloc[index_train]
    y_train = y.iloc[index_train]
    X_valid = X.iloc[index_val]
    y_valid = y.iloc[index_val]

    weights = sum(y_train.values==0)/sum(y_train.values==1)

    
    # eval_metric and early_stopping_rounds go in the constructor; passing them to fit() has been deprecated since XGBoost 1.6.
    xgb_gbrf = xgb.XGBClassifier(n_estimators=3000, random_state=0, objective='binary:logistic', scale_pos_weight=weights, 
                                    learning_rate=0.15, max_depth=2, subsample=0.7, min_child_weight=500,  
                                    colsample_bytree = 0.2, reg_lambda = 3.5, reg_alpha=1.5, num_parallel_tree= 5,
                                    eval_metric='auc', early_stopping_rounds=100)
    
    xgb_gbrf.fit(X_train, y_train, eval_set= [(X_train, y_train), (X_valid, y_valid)], verbose=False)
    
    # We stock the results.
    y_pred = xgb_gbrf.predict_proba(X_train)
    train_score = roc_auc_score(y_train, y_pred[:,1])
    train_scores.append(train_score)

    y_pred = xgb_gbrf.predict_proba(X_valid)
    val_score = roc_auc_score(y_valid, y_pred[:,1])
    val_scores.append(val_score)
    
    y_validations.append(y_valid)
    y_predictions.append(y_pred[:,1])
    
    
display_scores(train_scores, val_scores, y_validations, y_predictions)

runtime = datetime.now() - start
print('\n\nRuntime ', runtime)
gbrf_feature_importances = pd.DataFrame({'Predictors': X.columns, 'Normalized gains': xgb_gbrf.feature_importances_}) 
gbrf_feature_importances.sort_values('Normalized gains',ascending=False, inplace=True)
gbrf_feature_importances.reset_index(inplace=True, drop=True)
gbrf_feature_importances.to_csv('fi_gbrf_1st_strategy')

And which features turned out to matter?

	Predictors	Normalized gains
0	ord_3	0.166439
1	ord_2	0.098356
2	ord_0	0.087061
3	ord_5	0.060891
4	bin_0	0.055027
5	bin_2	0.052249
6	ord_4	0.040665
7	ord_1	0.038445
8	nom_8	0.038403
9	sin_month	0.030957
10	nom_7	0.030522
11	x1_Trapezoid	0.028946
12	cos_day	0.023740
13	x3_Russia	0.022690
14	x4_Bassoon	0.022120
15	nom_9	0.020635
16	x1_Polygon	0.018377
17	x2_Lion	0.015700
18	x3_Costa Rica	0.015243
19	x4_Piano	0.014937
20	x1_Star	0.012118
21	x2_Axolotl	0.010810
22	x3_China	0.010142
23	nom_5	0.009305
24	bin_1	0.008811
25	bin_4	0.008791
26	cos_month	0.007786
27	x2_Snake	0.007034
28	x0_Blue	0.006531
29	x1_Circle	0.005164
30	x3_Canada	0.003338
31	x1_nan	0.003295
32	x3_nan	0.003044
33	x2_Cat	0.003010
34	x3_Finland	0.002858
35	x2_Dog	0.002835
36	x2_nan	0.002622
37	nom_6	0.002579
38	x1_Square	0.001794
39	x4_nan	0.001478
40	sin_day	0.001458
41	x0_Green	0.001300
42	x4_Oboe	0.001299
43	bin_3	0.001196

 

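Surprisingly, this matches quite well what I had guessed from the graphs.

Most people also said that bin_3 has hardly any effect, and the table agrees.

 

I have not studied XGBoost yet, but seeing this I think I understand why people use it. It makes the relationships that matter for classification much clearer.

 

Another notebook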

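from sklearn.linear_model import LogisticRegressionCV
from sklearn.metrics import roc_auc_score

# Logistic regression with built-in 5-fold cross-validation over 7 values of C.
lr_cv = LogisticRegressionCV(Cs=7,
                        solver="lbfgs",
                        tol=0.0001,
                        max_iter=30000,
                        cv=5)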

lr_cv.fit(train, labels)

lr_cv_pred = lr_cv.predict_proba(train)[:, 1]
score = roc_auc_score(labels, lr_cv_pred)

print("score: ", score)

 

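This notebook pushes what I take to be a strength of regression models, iterating many times, to the extreme (max_iter=30000).

 

It used pandas get_dummies for plain one-hot encoding and did nothing special about the NaN values.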
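
What "doing nothing about NaN" means for get_dummies can be seen in a minimal sketch like this one (the `toy` frame is made up; `dummy_na` is an existing get_dummies option that the notebook did not use):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"nom_1": ["Circle", np.nan, "Star", "Circle"]})

# Default behaviour: the row holding NaN simply gets zeros in every dummy column.
plain = pd.get_dummies(toy, columns=["nom_1"])
# With dummy_na=True an extra indicator column marks the missing values.
with_nan = pd.get_dummies(toy, columns=["nom_1"], dummy_na=True)

print(plain)
print(with_nan)
```

So even without explicit imputation, a missing category is implicitly encoded as "none of the known values".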

 

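The result?

 

The baseline model, with nothing else done, scored 0.78493.

That is enough to tell that a regression model clearly has its limits here.

So I decided to come back to this problem once I understand it better. Once I have learned XGBoost, I will try it out.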
