motivation

Two common mistakes that many machine learning theorists and practitioners make are:

  • imputing missing values and standardizing features before splitting the data
  • resampling imbalanced data before cross-validation

Here I want to explain why the order of these operations matters.

why split before standardization and imputation?

Some people treat standardization/normalization/scaling as the same thing; however, they are different (see: https://scikit-learn.org/stable/modules/preprocessing.html). That distinction is not the concern here. We should split the data into training/test (or training/val/test) sets before performing any of these operations. If you do it the other way round, information from the test data leaks into your training set via the standardizer/scaler (normally through the denominator, i.e. statistics computed over all rows). Your training set then carries information about the future, which makes the test scores invalid (over-estimated). Test data should be strictly reserved; that is why it is called unseen data. No information from the test data should be passed to the training set.
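As a minimal sketch (the toy data and random_state values are made up for illustration), the scaler is fitted on the training split only and merely applied to the test split:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# toy data: 100 rows, 3 features, binary labels (purely illustrative)
rng = np.random.RandomState(0)
X = rng.normal(size=(100, 3))
y = rng.randint(0, 2, size=100)

# split first
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # mean/std learned from training rows only
X_test_scaled = scaler.transform(X_test)        # test rows are transformed, never fitted on

# the wrong order would be scaler.fit_transform(X) before the split, letting the
# test rows influence the statistics used to scale the training rows
```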

The same applies to missing value imputation.
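A similar minimal sketch for imputation (the injected missing-value pattern is made up): the fill values are learned from the training split only.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)
X = rng.normal(size=(100, 3))
X[::7, 0] = np.nan                              # inject some missing values into one column

X_train, X_test = train_test_split(X, test_size=0.2, random_state=42)

imputer = SimpleImputer(strategy="mean")
X_train_imp = imputer.fit_transform(X_train)    # column means computed on training rows
X_test_imp = imputer.transform(X_test)          # test rows filled with the training means
```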

why perform CV before re-sampling?

To clarify, there are two ways to combine CV and re-sampling:

  • split into training/test sets, then re-sample (under-/over-sample) the training set (no action on the test set), and then run CV on the training set. Finally, score the un-resampled test set.
  • split into training/test sets, then run CV on the training set, and re-sample (under-/over-sample) each training fold (no action on the validation fold or the test set). Finally, score the un-resampled test set. (This is also termed "resampling during CV".)

The first method seems more "efficient"; however, it over-estimates the model performance. This is known as "overoptimism" and is a faulty CV design (see: https://www.researchgate.net/post/should_oversampling_be_done_before_or_within_cross-validation).
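Here is a hedged sketch of the first sequence (the dataset, classifier and metric are made up for illustration), where SMOTE from imblearn resamples the whole training set before cross-validation:

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier
from imblearn.over_sampling import SMOTE

# imbalanced toy problem: roughly 10% minority class
X, y = make_classification(n_samples=500, weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

# method 1: resample the whole training set, THEN cross-validate
X_res, y_res = SMOTE(random_state=0).fit_resample(X_train, y_train)
clf = DecisionTreeClassifier(random_state=0)
cv_auc = cross_val_score(clf, X_res, y_res, cv=5, scoring="roc_auc").mean()
print("CV AUC (resample before CV):", cv_auc)   # tends to look too good

clf.fit(X_res, y_res)
test_auc = roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])
print("test AUC:", test_auc)                    # the untouched test set usually tells a different story
```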

From my point of view, there are two issues here. First, when you re-sample the data before CV, you mix up the folds (imagine the folds that will later be generated by CV): copies of the same minority observations end up on both sides of each fold boundary. Information leakage happens (similar to the splitting/standardization case), so when you later score a validation fold (iteratively generated by CV), the score is over-estimated.

Second, in a re-sample-then-CV sequence, the model picked by the (averaged) CV score is usually the most oversampled/overfitted one (e.g. the deepest tree in a tree-based model), i.e. overfitting becomes an issue. Imagine you are doing a classification task with 12 training samples/observations, of which 3 belong to the target (rare) class, and you apply over-sampling to achieve a 50-50 balance.

To simplify the re-sampling mechanism, let's assume it simply copies the under-represented class: each of the 3 rare observations is copied once to give 6 rare observations, and the remaining 6 samples are taken from the normal class. Then you apply 2-fold CV (or stratified 2-fold CV; people normally set k to 3, 5 or 10, but we use 2 here as an example), i.e. the training set is split into training/validation folds. Your training fold now contains 6 observations (3 rare, 3 normal); you train a model and get scores on both the training and validation folds, and they are expected to be consistent.

The purpose of CV, in most cases, is to tune your (hyper)parameters. So you nest the CV inside loops (each loop scanning a parameter range), each iteration gives you a validation score, and you choose the model with the optimal validation score (highest or lowest depending on the scoring metric). Unfortunately, chances are you will pick the most over-fitted model, because it gives the optimal validation score. The reason is that training and validation performance are guaranteed to be consistent here: since both CV folds come from the re-sampled data, the rare samples are the same (or near-identical) in both folds, even though the normal samples can vary a bit. So a model that performs well on the training fold will perform similarly well on the validation fold. This invalidates the use of CV, i.e. it does not make the best use of the validation set.
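To make the leakage in this toy example concrete, here is a small sketch (the observation ids are invented for illustration) showing that after naively copying the 3 rare rows, the same rare identities necessarily show up in both folds of a 2-fold split:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

rare_ids = np.array([0, 1, 2])                          # the 3 rare observations
normal_ids = np.array([10, 11, 12, 13, 14, 15])         # the 6 normal observations kept
ids = np.concatenate([rare_ids, rare_ids, normal_ids])  # naive oversampling: copy each rare row once
y = np.array([1] * 6 + [0] * 6)

skf = StratifiedKFold(n_splits=2, shuffle=True, random_state=0)
for fold, (tr, va) in enumerate(skf.split(ids.reshape(-1, 1), y)):
    train_rare = sorted(set(ids[tr][y[tr] == 1]))
    val_rare = sorted(set(ids[va][y[va] == 1]))
    print(f"fold {fold}: rare ids in training {train_rare}, in validation {val_rare}")

# with only 3 unique rare identities spread over 6 copies, the two folds of 3 rare rows
# each cannot avoid sharing an identity, so the validation fold is not truly unseen
```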

So the more reasonable approach is to do CV first and apply the re-sampling technique (e.g. SMOTE) within each fold. This respects the variation between the training and validation folds, i.e. it does not make them homogeneous, because only the training fold is re-sampled, independently in each iteration (the validation fold does not need to be re-sampled at all). The resulting training and validation scores may differ, and CV can then pick the optimal estimator based on a valid validation score.
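Here is a minimal sketch of this second sequence (the dataset, parameter grid and metric are illustrative), using an imblearn Pipeline so that SMOTE is re-fitted on each CV training fold only while the validation fold is scored untouched:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline

X, y = make_classification(n_samples=500, weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

# the sampler is a pipeline step, so it is applied inside each CV training fold only
pipe = Pipeline([("smote", SMOTE(random_state=0)),
                 ("tree", DecisionTreeClassifier(random_state=0))])
search = GridSearchCV(pipe,
                      param_grid={"tree__max_depth": [2, 4, 8, None]},
                      cv=5, scoring="roc_auc")
search.fit(X_train, y_train)                 # re-sampling happens inside each training fold

print("best CV (validation) AUC:", search.best_score_)
print("test AUC:", search.score(X_test, y_test))  # should now be consistent with the validation score
```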

Besides, the validation score can then be viewed as a good estimate of the test score, because no information leakage occurred when scoring the validation fold. If you follow the first method instead, the training and validation scores will be similar, but the test score will degrade, i.e. re-sampling before CV can make test performance much worse through over-fitting and over-estimation (essentially you pick an inappropriate model, evaluated on inappropriate samples, generated by an inappropriate sequence).

In summary, a model is judged 'good' or 'bad' in CV based on its validation score. If the training/validation folds are generated inappropriately (i.e. with information leakage), the validation score is over-estimated, which invalidates the model selection (parameter tuning) exercise. Mixing up re-sampling and CV in the wrong order undermines our confidence in the validation score, as evidenced by a degradation in test performance.

Note that you do not need to re-sample the test set (or the validation fold in CV), as your model has already been fed sufficient (if not complete) information from the re-sampled training data. Always keep in mind that your test set can only be used once: it is there to evaluate the 'real' model performance, by checking the consistency between the validation/training and test performances.

There are many ways to do re-sampling, and 'imblearn' is a library for learning from imbalanced data that takes care of the aforementioned considerations. Also, here is a suggested reading on the sequence of CV and re-sampling: https://www.researchgate.net/publication/328315720_Cross-Validation_for_Imbalanced_Datasets_Avoiding_Overoptimistic_and_Overfitting_Approaches

Thanks, Yong 19/08/2020