Deep Learning

165_Debugging with Learning and Validation Curves


If a model is too complex relative to the given training dataset, it tends to overfit the training data and may not generalize well to unseen data. To reduce the degree of overfitting, collecting more training examples can be beneficial. However, in many real-world scenarios, collecting more data is often very challenging. By plotting the model's training and validation accuracy as a function of the training dataset size, it becomes easier to detect whether the model is experiencing high variance or high bias issues and to determine if collecting more data could help resolve these problems.

 

First, here is how to use scikit-learn's 'learning_curve' function to evaluate the model.
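The code below assumes that X_train and y_train already exist. As a self-contained stand-in (an assumption; the data actually used for these plots may differ), scikit-learn's built-in Breast Cancer Wisconsin dataset can be split off first:

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

# Hypothetical setup: any labeled classification dataset works here
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=1)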

 

import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

pipe_lr = make_pipeline(StandardScaler(),
                        LogisticRegression(penalty='l2',
                                           max_iter=10000))

# Evaluate the pipeline at 10 training-set sizes between 10% and 100%
# of X_train, using 10-fold stratified cross-validation at each size
train_sizes, train_scores, test_scores = \
    learning_curve(estimator=pipe_lr,
                   X=X_train,
                   y=y_train,
                   train_sizes=np.linspace(0.1, 1.0, 10),
                   cv=10,
                   n_jobs=1)

# Mean and standard deviation of the fold scores at each training size
train_mean = np.mean(train_scores, axis=1)
train_std = np.std(train_scores, axis=1)
test_mean = np.mean(test_scores, axis=1)
test_std = np.std(test_scores, axis=1)

plt.plot(train_sizes, train_mean, color='blue', marker='o',
         markersize=5, label='Training accuracy')
plt.fill_between(train_sizes, train_mean + train_std,
                 train_mean - train_std, alpha=0.15, color='blue')
plt.plot(train_sizes, test_mean, color='green', linestyle='--',
         marker='s', markersize=5, label='Validation accuracy')
plt.fill_between(train_sizes, test_mean + test_std,
                 test_mean - test_std, alpha=0.15, color='green')
plt.grid()
plt.xlabel('Number of training examples')
plt.ylabel('Accuracy')
plt.legend(loc='lower right')
plt.ylim([0.8, 1.03])
plt.show()

 

The 'train_sizes' parameter of the 'learning_curve' function controls the number of training examples used to generate the learning curve. In this example, np.linspace(0.1, 1.0, 10) requests 10 evenly spaced relative sizes, from 10% to 100% of the training dataset. By default, the 'learning_curve' function uses stratified $k$-fold cross-validation to calculate the classifier's cross-validation accuracy, and we set the cv parameter to 10 to perform 10-fold stratified cross-validation.
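Note that 'train_sizes' also accepts absolute example counts instead of fractions. A minimal sketch (the counts below are illustrative, not taken from the plot above):

# Absolute training-set sizes; distinct variable names so the results
# from the run above are not overwritten
abs_sizes, abs_train_scores, abs_test_scores = learning_curve(
    estimator=pipe_lr,
    X=X_train,
    y=y_train,
    train_sizes=[50, 100, 200, 300, 400],
    cv=10)
print(abs_sizes)  # the training-set sizes actually used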

 

Next, we compute the average accuracy of the cross-validated training and test scores for the various training dataset sizes and plot these values. We also use the 'fill_between' function to shade a band of one standard deviation around the mean accuracy, indicating the variance of the estimates.
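To inspect the same numbers the shaded bands show, the per-size means and standard deviations can be printed directly (a small inspection sketch using the variables defined above):

# One row per training-set size: mean validation accuracy +/- one std
for size, m, s in zip(train_sizes, test_mean, test_std):
    print(f'{size:4d} examples: validation accuracy {m:.3f} +/- {s:.3f}')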

 

 

When the model is trained on more than 250 examples, we can observe quite good performance on both the training and validation datasets. Below 250 training examples, however, the training accuracy increases while the gap between training and validation accuracy widens, an indication of a growing degree of overfitting.
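This diagnosis can also be read off numerically: the gap between training and validation accuracy shrinks as the training set grows. A brief sketch using the arrays computed above:

# Train/validation gap per training-set size; a shrinking gap with
# more data indicates a decreasing degree of overfitting
for size, gap in zip(train_sizes, train_mean - test_mean):
    print(f'{size:4d} examples: train-validation gap {gap:.3f}')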

 

Validation curves are a useful tool for improving model performance by addressing overfitting and underfitting. They are related to learning curves, but instead of plotting training and validation accuracy as a function of sample size, they plot accuracy as a function of a model parameter's value.

 

from sklearn.model_selection import validation_curve

# Vary the inverse regularization strength C of the logistic
# regression step over six orders of magnitude
param_range = [0.001, 0.01, 0.1, 1.0, 10.0, 100.0]
train_scores, test_scores = validation_curve(
    estimator=pipe_lr,
    X=X_train,
    y=y_train,
    param_name='logisticregression__C',
    param_range=param_range,
    cv=10)

# Mean and standard deviation of the fold scores per value of C
train_mean = np.mean(train_scores, axis=1)
train_std = np.std(train_scores, axis=1)
test_mean = np.mean(test_scores, axis=1)
test_std = np.std(test_scores, axis=1)

plt.plot(param_range, train_mean,
         color='blue', marker='o',
         markersize=5, label='Training accuracy')
plt.fill_between(param_range, train_mean + train_std,
                 train_mean - train_std, alpha=0.15,
                 color='blue')
plt.plot(param_range, test_mean,
         color='green', linestyle='--',
         marker='s', markersize=5,
         label='Validation accuracy')
plt.fill_between(param_range,
                 test_mean + test_std,
                 test_mean - test_std,
                 alpha=0.15, color='green')
plt.grid()
plt.xscale('log')
plt.legend(loc='lower right')
plt.xlabel('Parameter C')
plt.ylabel('Accuracy')
plt.ylim([0.8, 1.0])
plt.show()

 

 

Like the 'learning_curve' function, the 'validation_curve' function uses stratified $k$-fold cross-validation by default to estimate the classifier's performance. Here we target the inverse regularization parameter C of the 'LogisticRegression' object inside the pipeline, addressing it as 'logisticregression__C' and evaluating it over the range of values specified with the 'param_range' parameter. As in the previous code, we plot the mean training and cross-validation accuracy along with the respective standard deviations.
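The name 'logisticregression__C' follows scikit-learn's '<step>__<parameter>' convention for pipelines; the valid names for a given pipeline can be listed with get_params:

# List the tunable parameter names of the logistic regression step
for name in pipe_lr.get_params().keys():
    if name.startswith('logisticregression__'):
        print(name)  # e.g. logisticregression__C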

The differences in accuracy across the parameter values are subtle, but we can observe that the model slightly underfits the data when the regularization strength is increased (small values of C), while for large values of C the regularization strength decreases and the model tends to slightly overfit.
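One simple way to act on the curve is to pick the C value with the highest mean validation accuracy (a sketch; ties and the one-standard-error rule are ignored here):

# Choose the C with the best mean cross-validated accuracy
best_idx = int(np.argmax(test_mean))
print(f'Best C: {param_range[best_idx]} '
      f'(validation accuracy {test_mean[best_idx]:.3f})')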