Regularization: The Key to Building Better Machine Learning Models
How to Prevent Your Models from Overfitting and Underfitting
“Regularization is a powerful tool that can help you improve the performance of your machine learning models.” — Ian Goodfellow
Introduction
Machine learning has become increasingly popular in recent years, with applications in image and speech recognition, natural language processing, and predictive analytics. However, building accurate and reliable machine learning models can be challenging, especially when working with complex datasets that contain noisy or incomplete information. One of the key challenges in building machine learning models is finding the right balance between accuracy and generalization, or the ability of the model to perform well on new, unseen data.
Regularization is a technique that can be used to improve the generalization performance of machine learning models. Regularization works by adding a penalty to the cost function, which discourages the model from becoming too complex. This can help to prevent overfitting, which is a problem that occurs when a model learns the training data too well and is unable to generalize to new data.
The optimal amount of regularization depends on the specific machine learning model and the dataset it is trained on, and cross-validation is the standard way to find it. Cross-validation splits the training data into several folds; the model is trained on all but one fold and evaluated on the held-out fold, rotating through the folds. This process is repeated for different values of the regularization hyperparameter, and the value that gives the best average validation performance is the one to use.
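For a concrete sketch of this search, scikit-learn's RidgeCV class cross-validates a list of candidate regularization strengths in one call (the diabetes dataset below is just a stand-in for your own data):
from sklearn.datasets import load_diabetes
from sklearn.linear_model import RidgeCV
# Example dataset; substitute your own features and targets.
X, y = load_diabetes(return_X_y=True)
# Cross-validate several candidate regularization strengths.
model = RidgeCV(alphas=[0.01, 0.1, 1.0, 10.0], cv=5)
model.fit(X, y)
# The regularization strength that scored best across the folds.
print(model.alpha_)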
Why do we need Regularization?
Regularization is a machine learning approach used to reduce overfitting and improve the model’s generalization capacity. Overfitting happens when a model is trained on training data so successfully that it memorizes the data rather than understanding the underlying patterns and relationships. As a result, the model performs badly with fresh, previously unseen data. Regularization prevents overfitting by introducing a penalty term into the loss function that discourages the model from fitting the noise in the data.
Regularization for Overfitting
Regularization is a technique used in machine learning to prevent overfitting, which occurs when a model is too complex and fits the training data too well, but performs poorly on new, unseen data. Overfitting can lead to poor generalization performance and reduced accuracy.
“Regularization is a way of preventing your model from overfitting the training data.” — Yoshua Bengio
Cost Function and Overfitting
Before we dive into regularization, let’s first define the cost function. The cost function measures how well the model fits the training data. In linear regression, for example, the cost function is the mean squared error between the predicted values and the actual values.
When we train a model, we want to minimize the cost function. However, if the model is too complex, it may fit the training data too well, resulting in a low training error but a high test error. This is where regularization comes in.
Regularization adds a penalty term to the cost function that controls the complexity of the model. The penalty term penalizes complex models and encourages simpler models that are more likely to generalize well to unseen data.
Types of Regularization
L1 regularization
L1 regularization, also known as Lasso regularization, adds a penalty term to the loss function that is proportional to the absolute value of the weights. This has the effect of shrinking some of the weights to zero, effectively removing some of the features from the model. L1 regularization can be useful when working with high-dimensional datasets where there are many irrelevant or redundant features.
The cost function for a linear regression model with L1 regularization is given by:
J(θ) = 1/m * Σ(y(i) - hθ(x(i)))^2 + λ * Σ|θ|
where:
m is the number of training examples
y(i) is the ground truth label for the i-th training example
hθ(x(i)) is the model's prediction for the i-th training example
θ is the model's parameters
λ is the regularization hyperparameter
The first term in the cost function is the usual squared error loss function. The second term is the regularization penalty term. This term penalizes large values of the model parameters. As a result, the model parameters are forced to be small, which reduces the model complexity.
To see how L1 regularization affects the model parameters, let’s look at an example in Python:
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Lasso
# Load data (the diabetes dataset stands in for your own X and y)
X, y = load_diabetes(return_X_y=True)
# Create Lasso model
lasso = Lasso(alpha=0.1)
# Train model
lasso.fit(X, y)
# Print model coefficients
print(lasso.coef_)
In this example, we load some data and create a Lasso model with an alpha parameter of 0.1. We then train the model on the data and print the model coefficients. The Lasso model sets some of the coefficients to zero, resulting in a sparse solution.
“Regularization is like using a straitjacket to keep your model from getting too complex.” — Andrew Ng
L2 regularization
L2 regularization, also known as Ridge regularization, adds a penalty term to the loss function that is proportional to the square of the weights. This has the effect of shrinking all of the weights towards zero, but not necessarily to zero. L2 regularization can be useful when working with datasets where all of the features are relevant, but some of them may be more important than others.
The cost function for a linear regression model with L2 regularization is given by:
J(θ) = 1/m * Σ(y(i) - hθ(x(i)))^2 + λ * Σ(θ^2)
where:
m is the number of training examples
y(i) is the ground truth label for the i-th training example
hθ(x(i)) is the model's prediction for the i-th training example
θ is the model's parameters
λ is the regularization hyperparameter
The first term in the cost function is the usual squared error loss function. The second term is the regularization penalty term, which penalizes the squared values of the model parameters. As a result, the model parameters are forced to be close to zero, which reduces the model complexity.
To see how L2 regularization affects the model parameters, let’s look at an example in Python:
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
# Load data (the diabetes dataset stands in for your own X and y)
X, y = load_diabetes(return_X_y=True)
# Create Ridge model
ridge = Ridge(alpha=0.1)
# Train model
ridge.fit(X, y)
# Print model coefficients
print(ridge.coef_)
In this example, we load some data and create a Ridge model with an alpha parameter of 0.1. We then train the model on the data and print the model coefficients. The Ridge model encourages smaller values of the coefficients, resulting in a simpler model that is less likely to overfit.
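A quick comparison of the two fitted models (reusing the lasso and ridge objects from the snippets above) makes the difference visible: Lasso produces exact zeros, while Ridge only shrinks the coefficients toward zero.
import numpy as np
# Lasso zeroes out some coefficients; Ridge merely shrinks them.
print("Zero coefficients (Lasso):", np.sum(lasso.coef_ == 0))
print("Zero coefficients (Ridge):", np.sum(ridge.coef_ == 0))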
Elastic Net Regularization
Elastic net regularization is a regularization technique that combines L1 and L2 regularization. This can help to improve the performance of the model by reducing overfitting and improving the interpretability of the model.
The mathematical expression for elastic net regularization is as follows:
loss = Σ(y - h)^2 + λ * (α * Σ|w| + (1 - α) * Σw^2)
where:
α is the mixing parameter, which controls the relative weight of L1 and L2 regularization
from sklearn.linear_model import ElasticNet
# x_train, y_train, and x_test are assumed to have been prepared earlier
# (for example, with sklearn.model_selection.train_test_split).
# Create a model with Elastic Net regularization.
model = ElasticNet(alpha=0.5, l1_ratio=0.5)
# Train the model.
model.fit(x_train, y_train)
# Make predictions.
y_pred = model.predict(x_test)
This code will train a model with Elastic Net regularization. The alpha parameter controls the amount of regularization, and the l1_ratio parameter controls the ratio of L1 to L2 regularization.
Elastic Net regularization is a combination of L1 and L2 regularization. L1 regularization, also known as Lasso, penalizes the sum of the absolute values of the coefficients. L2 regularization, also known as Ridge, penalizes the sum of the squares of the coefficients.
Elastic Net regularization can be used to prevent overfitting in linear regression models. Overfitting occurs when a model learns the training data too well and is not able to generalize to new data. Elastic Net regularization helps to prevent overfitting by adding a penalty to the loss function that penalizes the model for having large coefficients.
The alpha and l1_ratio parameters can be tuned to find the best combination of regularization and accuracy. A good starting point is an alpha of 0.5 and an l1_ratio of 0.5. If the model is still overfitting, you can increase the alpha parameter; if the model is underfitting, you can decrease it.
You can also use cross-validation to find the best combination of alpha and l1_ratio parameters. Cross-validation is a technique for evaluating the performance of a model on data that it has not seen before.
Here is an example of how to use cross-validation to find the best combination of alpha and l1_ratio parameters:
from sklearn.model_selection import cross_val_score
# Create a list of alpha values to try.
alphas = [0.1, 0.5, 1.0]
# Create a list of l1_ratio values to try.
l1_ratios = [0.1, 0.5, 1.0]
# Track the best score and the corresponding hyperparameters.
best_score = float("-inf")
best_alpha = None
best_l1_ratio = None
# Loop over the alpha values.
for alpha in alphas:
    # Loop over the l1_ratio values.
    for l1_ratio in l1_ratios:
        # Create a model with the current alpha and l1_ratio values.
        model = ElasticNet(alpha=alpha, l1_ratio=l1_ratio)
        # Calculate the mean 5-fold cross-validation score.
        score = cross_val_score(model, x_train, y_train, cv=5).mean()
        # Keep the best alpha and l1_ratio values seen so far.
        if score > best_score:
            best_score = score
            best_alpha = alpha
            best_l1_ratio = l1_ratio
# Create a model with the best alpha and l1_ratio values.
model = ElasticNet(alpha=best_alpha, l1_ratio=best_l1_ratio)
# Train the model.
model.fit(x_train, y_train)
# Make predictions.
y_pred = model.predict(x_test)
Regularization is an important technique for preventing overfitting in machine learning models. By adding a regularization term to the loss function, the model is encouraged to have smaller weights, which can help to reduce the model’s complexity and improve its ability to generalize to new data.
“Regularization is not a silver bullet, but it can be a very effective way to improve the performance of your models.” — Michael Nielsen
Dropout
Dropout is a regularization technique that randomly drops out (or “turns off”) some of the neurons in a neural network during training. This means that the output of these neurons is set to zero, and they do not contribute to the forward pass or backward pass of the network.
The idea behind dropout is that it forces the network to learn more robust features by preventing any single neuron from relying too heavily on any other neuron. This can help to prevent overfitting and improve the generalization performance of the network.
J(θ) = E[(1 - p)L(θ, x, y)]
where:
J(θ) is the loss function
θ is the model parameters
p is the dropout rate
L(θ, x, y) is the loss function for a single data point
The factor (1 - p) reflects the fact that each neuron is only active part of the time, which prevents the model from relying too heavily on any single neuron. This forces the model to spread what it learns across all of the neurons, making it more robust to overfitting.
The dropout rate is a hyperparameter that needs to be chosen by the user. A good starting point is to use a dropout rate of 0.5. If the model is still overfitting, you can increase the dropout rate. If the model is not learning well enough, you can decrease the dropout rate.
Let’s look at the mathematics behind dropout. During training, we randomly drop out neurons with probability p. This means that the output of each neuron is multiplied by an element of a binary mask m, where each element is 0 with probability p and 1 with probability 1 - p. The expected value of each mask element is therefore 1 - p.
During testing, we do not drop out any neurons and use the full network for prediction.
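A quick numerical check of that expectation, as a standalone sketch (the drop probability p here is chosen arbitrarily):
import numpy as np
p = 0.5  # drop probability
# Each mask element is 0 with probability p and 1 with probability 1 - p.
mask = (np.random.rand(1_000_000) >= p).astype(float)
print(mask.mean())  # close to 1 - p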
import tensorflow as tf
from tensorflow import keras

class DropoutRegularization(keras.layers.Layer):
    def __init__(self, rate):
        super().__init__()
        self.rate = rate

    def call(self, inputs, training=False):
        # Use the full network at test time.
        if not training:
            return inputs
        # Keep each neuron with probability 1 - rate.
        keep_mask = tf.cast(
            tf.random.uniform(tf.shape(inputs)) >= self.rate, inputs.dtype
        )
        # Zero out the dropped neurons and rescale the survivors so the
        # expected activation matches test-time behaviour (inverted dropout).
        return inputs * keep_mask / (1.0 - self.rate)
The cost function for a neural network with dropout is:
J(w) = (1/m) * (sum(L(y_i, f(x_i, w, m))) + lambda * sum(w^2))
where:
w is the vector of model parameters
f(x, w, m) is the output of the network with input x
m is the binary mask matrix
L(y, y_hat) is the loss function
lambda is the regularization parameter.
The first term in the cost function is the average loss over all training examples, while the second term is the L2 regularization term. The regularization term encourages smaller values of the model parameters by adding a penalty proportional to the square of each parameter. The lambda parameter controls the strength of the regularization.
To see how dropout affects the performance of a neural network, let’s look at an example in Python:
import tensorflow as tf

# Define a neural network with dropout
model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(10)
])

# Compile the model with a loss function and optimizer
model.compile(loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
              optimizer=tf.keras.optimizers.Adam(),
              metrics=['accuracy'])

# Train the model with dropout
# (train_images, train_labels, test_images, and test_labels are assumed
# to have been loaded and flattened into feature vectors beforehand.)
history = model.fit(train_images, train_labels, epochs=10,
                    validation_data=(test_images, test_labels))
In this example, we define a neural network with a dropout layer that drops out 50% of the neurons. We then compile the model with a loss function and optimizer and train it on a dataset. The dropout layer helps to prevent overfitting and improve the generalization performance of the network.
Regularization for underfitting
Underfitting is a common problem in machine learning, especially when working with simple or linear models that are not expressive enough to capture the complexity of the data. Underfitting occurs when the model is too simple and does not fit the training data well enough, resulting in high training error and poor generalization performance.
Regularization interacts with underfitting in the opposite way: if the penalty term is too strong, it pushes the model toward overly simple solutions that cannot fit the data. Reducing the regularization hyperparameter relaxes the penalty and allows the model to learn the more complex features it needs, so tuning the penalty is just as important for avoiding underfitting as it is for avoiding overfitting.
You can also write a small function that implements regularized linear regression and exposes the regularization hyperparameter so it can be tuned:
import numpy as np

def regularized_linear_regression(x, y, lam, eta=0.01, n_iters=1000):
    """
    Performs L2-regularized linear regression via gradient descent.
    Args:
        x: The training data, shape (n_samples, n_features).
        y: The target data, shape (n_samples,).
        lam: The regularization hyperparameter (λ in the text).
        eta: The learning rate.
        n_iters: The number of gradient descent steps.
    Returns:
        The model's parameters.
    """
    # Initialize the model's parameters.
    theta = np.zeros(x.shape[1])
    for _ in range(n_iters):
        # Prediction error for the current parameters.
        error = y - x @ theta
        # Gradient of the regularized loss J(θ) = 1/n * Σ(y - θ^T x)^2 + λ * Σθ^2.
        grad = -2 / len(x) * (x.T @ error) + 2 * lam * theta
        # Update the model's parameters using gradient descent.
        theta = theta - eta * grad
    # Return the model's parameters.
    return theta
The mathematics of how the regularization strength controls the fit works as follows:
Consider a linear regression model with the following loss function:
J(θ) = 1/n * Σ(y - θ^T x)^2
where:
θ is the model’s parameters
x is the training data
y is the target data.
The gradient of this loss function with respect to θ is:
∇J(θ) = -2/n * Σ x(y - θ^T x)
Regularization can be added to this loss function by adding a penalty term to the sum of the squares of the model’s parameters:
J(θ) = 1/n * Σ(y - θ^T x)^2 + λ * Σθ^2
where λ is the regularization hyperparameter. The gradient of this loss function with respect to θ is:
∇J(θ) = -2/n * Σ x(y - θ^T x) + 2λθ
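Setting this gradient to zero gives the familiar closed-form ridge solution (written in matrix form, with X denoting the full data matrix and I the identity matrix):
θ = (X^T X + nλI)^(-1) X^T y
As λ grows, the nλI term dominates and the parameters shrink toward zero; as λ approaches zero, the solution approaches ordinary least squares.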
The regularization term encourages the model’s parameters to be smaller, which can help to prevent overfitting. However, if the regularization term is too large, it can also lead to underfitting. The optimal value of the regularization hyperparameter λ needs to be tuned on the training data.
In the Python function above, the regularization hyperparameter λ is used to control the amount of regularization that is applied to the model. The larger the value of λ, the more regularization is applied, and the more likely the model is to underfit. The smaller the value of λ, the less regularization is applied, and the more likely the model is to overfit.
The optimal value of λ needs to be tuned on the training data. This can be done by using a cross-validation technique such as k-fold cross-validation. K-fold cross-validation involves splitting the training data into k folds. The model is then trained on k-1 folds and evaluated on the remaining fold. This process is repeated k times, and the average performance of the model is used to select the optimal value of λ.
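As a sketch of that search with scikit-learn (assuming the training data X and y are already loaded, and using Ridge as the L2-regularized model):
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
# Candidate values of the regularization hyperparameter λ (called alpha in scikit-learn).
param_grid = {"alpha": [0.001, 0.01, 0.1, 1.0, 10.0]}
# 5-fold cross-validation over the candidate values.
search = GridSearchCV(Ridge(), param_grid, cv=5)
search.fit(X, y)
# The value of alpha with the best average validation score.
print(search.best_params_)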
Let’s look more closely at the mathematics of regularization in the context of underfitting. We will use L2 regularization as an example, which adds a penalty term proportional to the square of the model parameters to the loss function.
The cost function for linear regression with L2 regularization is:
J(w) = (1/2m) * (sum((h(x_i) - y_i)^2)) + (lambda/2m) * (sum(w_j^2))
where:
w is the vector of model parameters
h(x) is the predicted value for input x
y is the actual output value
m is the number of training examples
lambda is the regularization parameter.
The first term in the cost function is the mean squared error between the predicted values and the actual values, while the second term is the L2 penalty term. The L2 penalty term encourages smaller values of the model parameters by adding a penalty proportional to the square of each parameter. The lambda parameter controls the strength of the regularization.
To see how L2 regularization affects the model parameters, let’s look at an example in Python:
import numpy as np
# Generate some sample data
np.random.seed(0)
X = np.random.rand(100, 1)
y = 5 * X + np.random.randn(100, 1)
# Add polynomial features
from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(degree=3, include_bias=False)
X_poly = poly.fit_transform(X)
# Train a linear regression model with L2 regularization
from sklearn.linear_model import Ridge
ridge = Ridge(alpha=1.0)
ridge.fit(X_poly, y)
# Plot the results (sort by X so the fitted curve draws as a clean line)
import matplotlib.pyplot as plt
order = X[:, 0].argsort()
plt.scatter(X, y, color='blue')
plt.plot(X[order], ridge.predict(X_poly)[order], color='red')
plt.show()
In this example, we generate some sample data and add polynomial features to the input. We then train a linear regression model with L2 regularization using the Ridge class from scikit-learn. The alpha parameter controls the strength of the regularization. Finally, we plot the results to see how the regularization affects the model.
Conclusion
“Regularization is a technique that is worth learning about if you are serious about machine learning.” — Aurélien Géron
In conclusion, regularization is a powerful technique that plays a vital role in building better machine learning models. By preventing overfitting and underfitting, regularization helps to strike the right balance between accuracy and generalization, making it possible to build models that can perform well on new, unseen data.
Regularization techniques like L1 and L2 regularization, as well as dropout, have been proven to be effective in preventing overfitting and underfitting. These techniques add a penalty term to the cost function that encourages the model to learn more robust features and avoid relying too heavily on any single feature.
Regularization is an essential tool in your machine learning toolbox, whether you are working with image recognition, natural language processing, or predictive analytics. It can help you build more accurate and generalizable models that can make better predictions and drive better business outcomes.
In summary, if you want to build better machine learning models, don’t forget to apply regularization techniques. They can make a significant difference in the performance of your models and help you achieve better results in your machine learning projects.
Thank you for reading my blog post on Regularization: The Key to Building Better Machine Learning Models. I hope you found it informative and helpful. If you have any questions or feedback, please feel free to leave a comment below.
I also encourage you to check out my Portfolio and GitHub. You can find links to both in the description below.
I am always working on new and exciting projects, so be sure to subscribe to my blog so you don’t miss a thing!
Thanks again for reading, and I hope to see you next time!