Deep Learning

164_K-fold Cross Validation

elif 2024. 5. 13. 23:04

There are various general cross-validation techniques, such as hold-out cross-validation and $k$-fold cross-validation. These techniques help to reliably estimate the generalization performance of a model, that is, how well the model performs on unseen data.

 

 

A classical and popular approach to estimating the generalization performance of machine learning models is the hold-out method. The initial dataset is divided into a training dataset and a test dataset: the training dataset is used to train the model, and the test dataset is used to estimate its generalization performance. In a typical machine learning workflow, however, various parameter settings are also tuned and compared in order to improve prediction performance on unseen data. This process is called model selection, meaning the selection of the optimal hyperparameter values for a given problem. If the same test dataset is reused repeatedly during model selection, it effectively becomes part of the training data, and the model is more likely to overfit.
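
As a minimal sketch of the hold-out split (assuming 'X' and 'y' are a hypothetical feature matrix and label vector), scikit-learn's 'train_test_split' can be used:

from sklearn.model_selection import train_test_split

# Hold out 30% of the (hypothetical) data as a test set;
# stratify=y keeps the class proportions in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=1)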

 

Therefore, the data is divided into three parts: training, validation, and test datasets. The training dataset is used to fit various models, and the performance on the validation dataset is used for model selection. The advantage of having a test dataset that the model has not seen before during training and model selection is that it provides a less biased estimate of the model's generalization ability on new data. Once the hyperparameter tuning is complete, the generalization performance of the model is estimated using the test dataset.
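
A rough sketch of such a three-way split, again assuming hypothetical 'X' and 'y' arrays and illustrative 60/20/20 proportions, could look like this:

from sklearn.model_selection import train_test_split

# First split off the test set, then carve a validation set out of the remainder
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=1)
X_train, X_valid, y_train, y_valid = train_test_split(
    X_temp, y_temp, test_size=0.25, stratify=y_temp, random_state=1)  # 0.25 x 0.8 = 0.2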

 

A drawback of the hold-out method is that the performance estimate can be highly sensitive to how the training dataset is partitioned. Therefore, a more robust technique, $k$-fold cross-validation, is often used for performance estimation. In this method, the training data is divided into $k$ subsets, and the hold-out method is repeated $k$ times.

Specifically, in $k$-fold cross-validation, the training dataset is randomly divided into $k$ folds. Out of these, $k-1$ folds are used for model training, and the remaining fold is used for performance evaluation. This procedure is repeated $k$ times, resulting in $k$ models and $k$ performance estimates. Then, the average performance across the different, independent test folds is calculated, providing a performance estimate that is less sensitive to the partitioning of the training data than the hold-out method. Consequently, $k$-fold cross-validation is generally used to find the optimal hyperparameter values that yield satisfactory generalization performance.
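
A simplified, non-stratified sketch of this procedure with scikit-learn's 'KFold' iterator (assuming 'X_train', 'y_train', and the 'pipe_lr' pipeline defined later in this post) might look like this:

import numpy as np
from sklearn.model_selection import KFold

kfold = KFold(n_splits=10, shuffle=True, random_state=1)
fold_scores = []
for train_idx, test_idx in kfold.split(X_train):
    # Fit on k-1 folds, evaluate on the held-out fold
    pipe_lr.fit(X_train[train_idx], y_train[train_idx])
    fold_scores.append(pipe_lr.score(X_train[test_idx], y_train[test_idx]))
# Average the k estimates to get a less partition-sensitive performance estimate
print(f'CV accuracy: {np.mean(fold_scores):.3f}')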

 

In summary, $k$-fold cross-validation makes better use of the dataset than the hold-out method with a separate validation set, because every example is used for validation exactly once. Increasing the value of $k$ means more training data is used in each iteration, which reduces the pessimistic bias of the averaged performance estimate. However, as $k$ increases, the runtime of the cross-validation algorithm also increases, and the variance of the estimate can rise because the training folds become more similar to each other.
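
As a small numeric illustration of this trade-off (using a hypothetical training set of 1,000 examples), the per-iteration training and validation sizes can be computed as follows:

n = 1000  # hypothetical number of training examples
for k in (2, 5, 10, 20):
    # each iteration trains on roughly (k-1)/k of the data and validates on 1/k
    print(f'k={k:2d}: ~{n * (k - 1) // k} training examples, '
          f'~{n // k} validation examples per fold')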

 

A slightly improved variant of standard $k$-fold cross-validation is stratified $k$-fold cross-validation, which preserves the class proportions in each fold. This can yield better bias and variance estimates, especially when the class ratios are imbalanced. Using scikit-learn's 'StratifiedKFold' iterator, it can be implemented as follows:

 

import numpy as np
from sklearn.model_selection import StratifiedKFold

# Stratified 10-fold split generator over the training data
kfold = StratifiedKFold(n_splits=10).split(X_train, y_train)

scores = []
for k, (train, test) in enumerate(kfold):
    # Fit the pipeline on k-1 folds and score it on the held-out fold
    pipe_lr.fit(X_train[train], y_train[train])
    score = pipe_lr.score(X_train[test], y_train[test])
    scores.append(score)
    print(f'Fold: {k+1:02d}, '
          f'Class distr.: {np.bincount(y_train[train])}, '
          f'Acc.: {score:.3f}')
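
One way to obtain the summary statistics mentioned below is to average the collected scores afterwards, for example:

print(f'\nCV accuracy: {np.mean(scores):.3f} +/- {np.std(scores):.3f}')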

 

 

Here, the 'StratifiedKFold' iterator from the 'sklearn.model_selection' module is initialized with the number of folds via the 'n_splits' parameter, and its 'split' method receives the 'X_train' features and the 'y_train' class labels so that the class proportions are maintained in each fold. When iterating over the $k$ folds returned by the kfold iterator, a logistic regression pipeline is fitted using the train indices. The pipeline ensures that the examples are appropriately scaled in each iteration, and the accuracy score of the model is then computed using the test indices. These scores are collected to compute the mean accuracy and the standard deviation of the estimate.

scikit-learn also allows a more concise implementation of this stratified $k$-fold cross-validation evaluation, using the 'cross_val_score' scorer function.

 

from sklearn.model_selection import cross_val_score

# cv=10 performs stratified 10-fold CV for classifiers;
# n_jobs=-1 would distribute the folds across all available CPUs
scores = cross_val_score(estimator=pipe_lr,
                         X=X_train,
                         y=y_train,
                         cv=10,
                         n_jobs=1)
print(f'CV accuracy scores: {scores}')
print(f'CV accuracy: {np.mean(scores):.3f} '
      f'+/- {np.std(scores):.3f}')

 

 

Here, 'pipe_lr' is defined as follows.

 

from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Chain standardization, PCA dimensionality reduction, and logistic regression
pipe_lr = make_pipeline(StandardScaler(),
                        PCA(n_components=2),
                        LogisticRegression())

 

ref: Raschka, Sebastian, et al. Machine Learning with PyTorch and Scikit-Learn: Develop machine learning and deep learning models with Python. Packt Publishing Ltd, 2022.