Ensemble Learning: A Beginner’s Guide

Using Ensemble Learning to Improve Machine Learning Models

Mohit Mishra
10 min read · Jun 14, 2023

“Two heads are better than one.” — English proverb

About Me

My name is Mohit Mishra, and I’m a blogger who creates intriguing content that leaves readers wanting more. Anyone interested in machine learning and data science should check out my blog. My writing is designed to keep you engaged and intrigued, with a new piece published every two days. Follow along for in-depth content!

If you liked the article, please clap and follow me, since that pushes me to write more and better content. Links to my GitHub account and Portfolio are at the bottom of the blog.

What is Ensemble Learning?

Ensemble learning is a machine learning technique that combines multiple models to create a more accurate and robust model than any of the individual models could be on its own. This is done by training multiple models on the same data and then combining their predictions to make a final prediction.

There are many different ways to combine the predictions of multiple models, but some of the most common methods include:

  • Voting: Each model makes a prediction, and the final prediction is the most common one (see the toy sketch after this list).
  • Bagging: Each model is trained on a different bootstrap sample of the data, and the final prediction is the average of the predictions from all of the models.
  • Boosting: Each model is trained on a weighted sample of the data, where the weights are adjusted to make the model more likely to correct the mistakes of the previous models.
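As a toy illustration of how predictions are combined, suppose three already-trained models each predict a class label for the same five samples; the ensemble prediction is simply the most common label for each sample. A minimal sketch with made-up predictions:

import numpy as np

# Made-up predictions from three models for five samples
preds = np.array([
    [0, 1, 1, 0, 1],  # model 1
    [0, 1, 0, 0, 1],  # model 2
    [1, 1, 1, 0, 0],  # model 3
])

# Majority vote per sample (per column)
final_pred = np.array([np.bincount(col).argmax() for col in preds.T])
print(final_pred)  # [0 1 1 0 1]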

Why Do We Need Ensemble Learning?

There are several reasons why ensemble learning is important. In this article, we will explore some of the key ones.

1. Reducing Overfitting

One of the main benefits of ensemble learning is that it can help to reduce overfitting. Overfitting occurs when a model is too complex and fits the training data too closely, resulting in poor generalization to new data.

Ensemble learning can help to reduce overfitting by combining multiple models that have been trained on different subsets of the data or with different algorithms. This helps to reduce the variance of individual models and improve the generalization performance of the ensemble.

from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Create a decision tree classifier
tree = DecisionTreeClassifier()

# Create a bagging classifier
bag = BaggingClassifier(estimator=tree, n_estimators=10, random_state=42)  # older scikit-learn versions use base_estimator

# Train the bagging classifier
bag.fit(X_train, y_train)

# Make predictions on the test data
y_pred = bag.predict(X_test)

In this example, we create a decision tree classifier and use it as the base estimator for the bagging classifier. We set the number of estimators to 10, which means we will train 10 trees, each on its own bootstrap sample of the training data.
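To see the reduction in overfitting in practice, a simple check (assuming a held-out X_test and y_test, as above) is to compare the test accuracy of a single decision tree with that of the bagged ensemble; the ensemble usually generalizes better:

from sklearn.metrics import accuracy_score

# Fit a single decision tree for comparison
tree.fit(X_train, y_train)

# Compare test accuracy of the single tree and the bagged ensemble
print("Single tree accuracy:", accuracy_score(y_test, tree.predict(X_test)))
print("Bagging accuracy:", accuracy_score(y_test, y_pred))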

2. Improving Accuracy

Ensemble learning can also help to improve the accuracy of models. By combining multiple models, we can take advantage of their individual strengths and weaknesses.

For example, one model may be good at capturing linear relationships in the data, while another model may be good at capturing non-linear relationships. By combining these models, we can create a more accurate model that takes advantage of the strengths of each individual model.

from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Create the individual classifiers
lr = LogisticRegression(random_state=42)
knn = KNeighborsClassifier()
tree = DecisionTreeClassifier(random_state=42)

# Create the ensemble classifier
ensemble = VotingClassifier(estimators=[('lr', lr), ('knn', knn), ('tree', tree)], voting='hard')

# Train the ensemble classifier
ensemble.fit(X_train, y_train)

# Make predictions on the test data
y_pred = ensemble.predict(X_test)

In this example, we create three different classifiers (logistic regression, k-nearest neighbors, and decision tree) and use them as the base estimators for the voting classifier. We set the voting parameter to ‘hard’, which means that the predictions are based on a simple majority vote.
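Since all three base models here implement predict_proba, an alternative is soft voting, which averages the predicted class probabilities instead of counting votes; a minimal variant of the code above:

# Soft voting averages the predicted class probabilities of the base models
ensemble_soft = VotingClassifier(estimators=[('lr', lr), ('knn', knn), ('tree', tree)], voting='soft')
ensemble_soft.fit(X_train, y_train)
y_pred_soft = ensemble_soft.predict(X_test)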

3. Handling Noisy Data

Ensemble learning can also be useful when working with noisy data. Noisy data can make it difficult to train accurate models, as the noise can introduce errors and biases into the model.

Ensemble learning can help to reduce the impact of noise by combining multiple models that have been trained on different subsets of the data or with different algorithms. This helps to reduce the impact of noise on individual models and improve the overall accuracy of the ensemble.

from sklearn.ensemble import RandomForestClassifier

# Create a random forest classifier
rf = RandomForestClassifier(n_estimators=100, max_features='sqrt', random_state=42)

# Train the random forest classifier
rf.fit(X_train, y_train)

# Make predictions on the test data
y_pred = rf.predict(X_test)

In this example, we create a random forest classifier with 100 trees and use the square root of the number of features as the maximum number of features to consider at each split. We then train the classifier on the training data and make predictions on the test data.

4. Handling Imbalanced Data

Ensemble learning can also be useful when working with imbalanced data. Imbalanced data occurs when one class is much more prevalent than the others. This can make it difficult to train accurate models, as the model may be biased towards the majority class.

Ensemble learning can help to address this issue by combining multiple models that have been trained on different subsets of the data or with different algorithms. This helps to reduce the bias towards the majority class and improve the accuracy of the model for all classes.

from imblearn.ensemble import BalancedBaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Create a decision tree classifier
tree = DecisionTreeClassifier()

# Create a balanced bagging classifier
bag = BalancedBaggingClassifier(estimator=tree, n_estimators=10, random_state=42)  # older imbalanced-learn versions use base_estimator

# Train the balanced bagging classifier
bag.fit(X_train, y_train)

# Make predictions on the test data
y_pred = bag.predict(X_test)

In this example, we create a decision tree classifier and use it as the base estimator for the balanced bagging classifier. We set the number of estimators to 10, which means we will train 10 models on different subsets of the data. The balanced bagging classifier uses a resampling strategy to balance the class distribution in each subset of the data.
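Because plain accuracy can be misleading on imbalanced data, it is worth looking at per-class metrics as well; a quick check (assuming y_test is available) could be:

from sklearn.metrics import balanced_accuracy_score, classification_report

# Per-class precision/recall plus an accuracy score that weights all classes equally
print(classification_report(y_test, y_pred))
print("Balanced accuracy:", balanced_accuracy_score(y_test, y_pred))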

5. Handling Large Datasets

Ensemble learning can also be useful when working with large datasets. Large datasets can be computationally expensive to train and may require specialized hardware or distributed computing systems.

Ensemble learning can help to reduce the computational cost of training models by allowing us to train multiple models in parallel on different subsets of the data or with different algorithms.

One common pattern is to keep scikit-learn's RandomForestClassifier but dispatch the training of its trees to a Dask cluster through joblib:

import joblib
from dask.distributed import Client
from sklearn.ensemble import RandomForestClassifier

# Start a local Dask cluster (this could also point to a remote cluster)
client = Client()

# Create a random forest classifier that trains its trees in parallel
rf = RandomForestClassifier(n_estimators=100, max_features='sqrt', n_jobs=-1, random_state=42)

# Train the random forest classifier, spreading the work across the Dask workers
with joblib.parallel_backend('dask'):
    rf.fit(X_train, y_train)

# Make predictions on the test data
y_pred = rf.predict(X_test)

In this example, we create a random forest classifier with 100 trees and use the square root of the number of features as the maximum number of features to consider at each split. Training happens inside a joblib parallel_backend('dask') context, so scikit-learn's internal parallelism is dispatched to a Dask cluster, which lets us train the trees on multiple processors or even multiple machines. Note that the training data still needs to fit in memory on the workers; for datasets that do not, the dask-ml library provides wrappers such as Incremental and BlockwiseVotingClassifier.

Types of Ensemble Learning

There are many different types of ensemble learning, but some of the most common types include:

“Diversity is the key to resilience.” — Nassim Nicholas Taleb

1. Bagging

Bagging, short for bootstrap aggregating, is a type of ensemble learning that involves training multiple models on different subsets of the training data. The idea behind bagging is to reduce the variance of individual models by training them on different subsets of the data.

The basic algorithm for bagging is as follows:

  1. Draw N bootstrap samples (random samples drawn with replacement) from the training data.
  2. Train a model on each bootstrap sample.
  3. Combine the predictions of all N models to make a final prediction.

Here’s an example code for implementing bagging using scikit-learn:

from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Create a decision tree classifier
tree = DecisionTreeClassifier()

# Create a bagging classifier
bag = BaggingClassifier(estimator=tree, n_estimators=10, random_state=42)  # older scikit-learn versions use base_estimator

# Train the bagging classifier
bag.fit(X_train, y_train)

# Make predictions on the test data
y_pred = bag.predict(X_test)

In this example, we create a decision tree classifier and use it as the base estimator for the bagging classifier. We set the number of estimators to 10, which means we will train 10 trees, each on its own bootstrap sample of the training data.
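BaggingClassifier also lets you control how much data and how many features each model sees through the max_samples and max_features parameters; a variant of the example above (the values here are purely illustrative) might be:

# Each tree sees 80% of the samples and 50% of the features
bag_sub = BaggingClassifier(estimator=tree, n_estimators=10, max_samples=0.8, max_features=0.5, random_state=42)
bag_sub.fit(X_train, y_train)
y_pred_sub = bag_sub.predict(X_test)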

2. Boosting

Boosting is another type of ensemble learning that involves training multiple models sequentially. The idea behind boosting is to focus on the samples that are difficult to classify and improve the accuracy of the model on these samples.


The basic algorithm for boosting is as follows:

  1. Train a model on the entire training data.
  2. Identify the samples that are misclassified by the model.
  3. Give more weight to the misclassified samples and train another model.
  4. Repeat steps 2–3 until a predefined number of models have been trained.
  5. Combine the predictions of all models to make a final prediction.

Here’s an example code for implementing boosting using scikit-learn:

from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

# Create a decision tree classifier
tree = DecisionTreeClassifier()

# Create an AdaBoost classifier
boost = AdaBoostClassifier(estimator=tree, n_estimators=10, random_state=42)  # older scikit-learn versions use base_estimator

# Train the AdaBoost classifier
boost.fit(X_train, y_train)

# Make predictions on the test data
y_pred = boost.predict(X_test)

In this example, we create a decision tree classifier and use it as the base estimator for the AdaBoost classifier. We set the number of estimators to 10, which means we will train 10 models sequentially.
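AdaBoost is only one flavor of boosting. Gradient boosting, which fits each new model to the residual errors of the current ensemble, is another widely used variant; a minimal sketch using scikit-learn's GradientBoostingClassifier (with illustrative hyperparameters):

from sklearn.ensemble import GradientBoostingClassifier

# Gradient boosting with 100 shallow trees and a modest learning rate
gb = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, max_depth=3, random_state=42)
gb.fit(X_train, y_train)
y_pred = gb.predict(X_test)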

3. Stacking

Stacking is a type of ensemble learning that involves training multiple models and using their predictions as input to another model. The idea behind stacking is to combine the strengths of different models and make more accurate predictions.

“A chain is only as strong as its weakest link.” — Roman proverb

The basic algorithm for stacking is as follows:

  1. Split the training data into two subsets.
  2. Train several models on the first subset.
  3. Use the trained models to make predictions on the second subset.
  4. Combine the predictions of all models and use them as input to another model.
  5. Train the final model on the combined predictions.

Here’s an example code for implementing stacking using scikit-learn:

import numpy as np

from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

# Create several classifiers
rf = RandomForestClassifier(random_state=42)
lr = LogisticRegression(random_state=42)

# Use cross-validation to generate out-of-fold predictions on the training data
rf_pred = cross_val_predict(rf, X_train, y_train, cv=5, method='predict_proba')
lr_pred = cross_val_predict(lr, X_train, y_train, cv=5, method='predict_proba')

# Combine the predictions of all classifiers into meta-features
X_meta = np.concatenate((rf_pred, lr_pred), axis=1)

# Train a logistic regression classifier on the combined predictions
meta = LogisticRegression(random_state=42)
meta.fit(X_meta, y_train)

# Fit the base classifiers on the full training data before predicting on the test set
rf.fit(X_train, y_train)
lr.fit(X_train, y_train)

# Generate predictions on the test data
rf_test_pred = rf.predict_proba(X_test)
lr_test_pred = lr.predict_proba(X_test)
X_test_meta = np.concatenate((rf_test_pred, lr_test_pred), axis=1)
y_pred = meta.predict(X_test_meta)

In this example, we create two classifiers (a random forest classifier and a logistic regression classifier) and use 5-fold cross-validation to generate out-of-fold probability predictions on the training data, which plays the role of the "second subset" in the algorithm above. We then combine the predictions of both classifiers and use them as input to another logistic regression classifier, and refit the base classifiers on the full training data before producing the test-set meta-features.
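Note that scikit-learn also ships a built-in StackingClassifier that handles the cross-validated meta-features and the final model for you; a minimal sketch using the same base models might look like this:

from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression

# Base estimators plus a logistic regression meta-model
stack = StackingClassifier(
    estimators=[('rf', RandomForestClassifier(random_state=42)),
                ('lr', LogisticRegression(random_state=42))],
    final_estimator=LogisticRegression(random_state=42),
    cv=5
)

# Fit the whole stack and predict in one go
stack.fit(X_train, y_train)
y_pred = stack.predict(X_test)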

4. Random Forest

Random forest is a popular ensemble learning algorithm that is based on bagging. It involves training multiple decision trees on different subsets of the training data, and then combining their predictions to make a final prediction.


The key difference between random forest and bagging is that random forest also randomly selects a subset of features to consider at each split. This helps to reduce overfitting and increase the diversity of the models.

The basic algorithm for random forest is as follows:

  1. Draw a bootstrap sample (a random sample with replacement) of the training data.
  2. Randomly select a subset of features to consider at each split.
  3. Train a decision tree on the selected data and features.
  4. Repeat steps 1–3 for a fixed number of trees.
  5. Combine the predictions of all trees to make a final prediction.

Here’s an example code for implementing random forest using scikit-learn:

from sklearn.ensemble import RandomForestClassifier

# Create a random forest classifier
rf = RandomForestClassifier(n_estimators=100, max_features='sqrt', random_state=42)

# Train the random forest classifier
rf.fit(X_train, y_train)

# Make predictions on the test data
y_pred = rf.predict(X_test)

In this example, we create a random forest classifier with 100 trees and use the square root of the number of features as the maximum number of features to consider at each split. We then train the classifier on the training data and make predictions on the test data.

Random forest is a powerful algorithm that can be used for both classification and regression tasks. It is particularly useful when working with high-dimensional data or when dealing with noisy data.
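For regression tasks, the same idea applies with RandomForestRegressor, which averages the predictions of the individual trees; a minimal sketch (assuming a continuous target in y_train) could be:

from sklearn.ensemble import RandomForestRegressor

# Create and train a random forest regressor
rf_reg = RandomForestRegressor(n_estimators=100, random_state=42)
rf_reg.fit(X_train, y_train)

# Predict continuous values for the test data
y_pred = rf_reg.predict(X_test)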

“The whole is greater than the sum of its parts.” — Aristotle

Conclusion

Ensemble learning is a powerful technique that can be used to improve the accuracy and robustness of machine learning models. There are many different types of ensemble learning, and the best type to use will depend on the specific problem that you are trying to solve.

I hope this blog has given you a better understanding of ensemble learning. If you have any questions, please feel free to leave a comment below.

“Ensemble learning is the art of building better models by combining multiple, simpler models.” — Leo Breiman

Thank you for reading my blog post on Ensemble Learning: A Beginner’s Guide. I hope you found it informative and helpful. If you have any questions or feedback, please feel free to leave a comment below.

I also encourage you to check out my Portfolio and GitHub. You can find links to both in the description below.

I am always working on new and exciting projects, so be sure to subscribe to my blog so you don’t miss a thing!

Thanks again for reading, and I hope to see you next time!

[Portfolio Link] [Github Link]
