Correlation in Machine Learning: What You Need to Know

The Essential Guide to Understanding and Using Correlation in Machine Learning

Mohit Mishra
7 min read · Jun 21, 2023

Machine learning is a popular field of study that uses algorithms to build models that can make predictions or decisions based on data. One of the most important concepts in machine learning is correlation. Correlation is a statistical measure that describes the relationship between two variables. In this article, we will explore what correlation is, why it is important in machine learning, and how to use it in Python.

It’s important to note that correlation does not imply causation. Just because two variables are correlated does not mean that one causes the other. Correlation is just one tool in our machine learning toolbox that we can use to build better models.

About Me

My name is Mohit Mishra, and I’m a blogger who creates intriguing content that leaves readers wanting more. Anyone interested in machine learning and data science should check out my blog. I publish a new piece every two days, and my writing is designed to keep you engaged. Follow along for in-depth information!

If you liked the article, please clap and follow me, since it motivates me to write more and better content. I have also linked my GitHub account and Portfolio at the bottom of the blog.

What is Correlation?

Correlation is a statistical measure that describes the strength and direction of the relationship between two variables. The most common type of correlation is Pearson’s correlation coefficient, which measures the linear relationship between two variables. Pearson’s correlation coefficient ranges from -1 to 1, where -1 indicates a perfect negative correlation, 0 indicates no correlation, and 1 indicates a perfect positive correlation.
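To make the definition concrete, here is a minimal sketch that computes Pearson’s coefficient directly as the covariance of the two variables divided by the product of their standard deviations, and compares it with NumPy’s result. The helper name pearson_r and the sample values are made up for illustration.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson's r: co-variation of x and y, scaled by their individual spreads."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()  # deviations from the mean of x
    dy = y - y.mean()  # deviations from the mean of y
    return float(np.sum(dx * dy) / np.sqrt(np.sum(dx ** 2) * np.sum(dy ** 2)))

# Illustrative values only (hypothetical study hours and exam scores)
hours = [1, 2, 3, 4, 5]
scores = [52, 55, 61, 70, 68]

r_manual = pearson_r(hours, scores)
r_numpy = np.corrcoef(hours, scores)[0, 1]

print(r_manual, r_numpy)
```

Both values should agree up to floating-point rounding, which shows that np.corrcoef() is computing this same ratio.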

There are other types of correlation coefficients as well, such as Spearman’s rank correlation coefficient and Kendall’s tau correlation coefficient. These rank-based coefficients are typically used when the relationship between variables is monotonic but not necessarily linear, or when the data is ordinal.
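The difference is easiest to see on data that always increases but does not follow a straight line. The sketch below uses an exponential curve as an assumed example; any strictly increasing nonlinear function would do.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr, kendalltau

x = np.arange(1, 11)
y = np.exp(x)  # strictly increasing, but far from a straight line

pearson_r = pearsonr(x, y)[0]
spearman_rho = spearmanr(x, y)[0]
kendall_tau = kendalltau(x, y)[0]

print("Pearson: ", pearson_r)
print("Spearman:", spearman_rho)
print("Kendall: ", kendall_tau)
```

Pearson’s coefficient comes out well below 1 because the points do not lie on a line, while the two rank-based coefficients report a perfect monotonic relationship.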

“Correlation is a necessary but not sufficient condition for causation.” — Judea Pearl

Benefits of Correlation

There are several benefits to using correlation in machine learning:

  • It can help you to identify important features. By understanding the correlation between each feature and the target variable, you can identify the features that are likely to matter most for your machine learning model. This can help you to reduce the number of features in your dataset and may improve the performance of your model.
  • It can help you to avoid overfitting. Overfitting occurs when a machine learning model learns the noise in the data instead of the underlying patterns. Correlation can help you to spot redundant features that are highly correlated with one another and so add little new information. Removing them simplifies the model, which can reduce the risk of overfitting.
  • It can help you to select the right machine learning algorithm. Different machine learning algorithms are better suited for different types of data. For example, linear models are sensitive to strongly correlated features, while the predictions of tree-based models are usually less affected by them. Understanding the correlation between different features can therefore inform your choice of algorithm.

For example, if we are building a model to predict the price of a house, we might include features such as the number of bedrooms, the square footage of the house, and the location of the house. By understanding the correlation between these features, we can determine which features are most important for predicting the price of the house.
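As a rough sketch of the house price example, the snippet below ranks features by their correlation with the price using pandas. The numbers are invented purely for illustration, and the column distance_to_center_km is an assumed numeric stand-in for “location”, since Pearson’s coefficient needs numeric inputs.

```python
import pandas as pd

# Hypothetical toy data, invented for illustration only
houses = pd.DataFrame({
    "bedrooms": [2, 3, 3, 4, 4, 5],
    "sqft": [850, 1200, 1500, 1800, 2100, 2600],
    "distance_to_center_km": [12, 3, 9, 4, 10, 6],
    "price": [150_000, 220_000, 260_000, 310_000, 350_000, 430_000],
})

# Pearson correlation of every feature with the target,
# ordered by absolute strength (the sign is kept in the output)
corr_with_price = (
    houses.corr(method="pearson")["price"]
    .drop("price")
    .sort_values(key=abs, ascending=False)
)

print(corr_with_price)
```

A ranking like this is only a first screening step: it captures linear association with the target and says nothing about whether a feature causes the price to change.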

How Does Correlation Impact Machine Learning Modeling?

Correlation can impact machine learning modeling in several ways. First, highly correlated features can contribute to overfitting. Overfitting occurs when a model is too complex and fits the training data too closely, resulting in poor performance on new data. Redundant, highly correlated features add complexity without adding much new information, which can make overfitting more likely.


Second, highly correlated features can cause multicollinearity. Multicollinearity occurs when two or more features are highly correlated with each other. This can cause problems when building linear regression models because it can be difficult to determine the effect of each feature on the target variable.
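One common way to quantify multicollinearity is the variance inflation factor (VIF), which for feature j equals 1 / (1 - R²), where R² comes from regressing feature j on the other features. The sketch below relies on the fact that these values are also the diagonal of the inverse correlation matrix. The synthetic features (sqft, rooms, age) are assumptions made up for the demonstration, and the often-quoted cutoff of 10 is a rule of thumb rather than a strict rule.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Synthetic features, invented for illustration
sqft = rng.normal(1500, 300, n)
rooms = sqft / 400 + rng.normal(0, 0.1, n)  # almost a rescaled copy of sqft
age = rng.normal(30, 10, n)                 # generated independently of the others

X = np.column_stack([sqft, rooms, age])
corr = np.corrcoef(X, rowvar=False)  # 3x3 feature correlation matrix

# VIF of each feature = matching diagonal entry of the inverse correlation matrix
vif = np.diag(np.linalg.inv(corr))

for name, value in zip(["sqft", "rooms", "age"], vif):
    print(f"{name}: VIF = {value:.1f}")
```

The two nearly collinear features should show very large VIF values, while the independent feature stays close to 1.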

Finally, understanding the correlation between features can help us select the best features for our model. By selecting features that are highly correlated with the target variable but not highly correlated with each other, we can build more accurate models.
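The selection rule described above can be sketched as a simple greedy filter. This is one possible approach under stated assumptions, not a standard library routine: the function name select_features and both thresholds are hypothetical choices, and the data is synthetic.

```python
import numpy as np
import pandas as pd

def select_features(df, target, min_target_corr=0.3, max_feature_corr=0.8):
    """Greedy filter: keep features correlated with the target, skip near-duplicates."""
    corr = df.corr().abs()
    # Candidate features, strongest absolute correlation with the target first
    ranked = corr[target].drop(target).sort_values(ascending=False)
    selected = []
    for name, strength in ranked.items():
        if strength < min_target_corr:
            break  # every remaining feature is weaker still
        # Keep the feature only if it is not too similar to one already kept
        if all(corr.loc[name, kept] <= max_feature_corr for kept in selected):
            selected.append(name)
    return selected

# Synthetic demo data, invented for illustration
rng = np.random.default_rng(42)
n = 300
a = rng.normal(size=n)
b = rng.normal(size=n)
demo = pd.DataFrame({
    "a": a,
    "a_copy": a + rng.normal(scale=0.05, size=n),  # near-duplicate of "a"
    "b": b,
    "noise": rng.normal(size=n),                   # unrelated to the target
    "target": 2 * a + 1.5 * b + rng.normal(scale=0.5, size=n),
})

chosen = select_features(demo, "target")
print(chosen)
```

The filter should keep only one of the two near-duplicate columns, keep "b", and drop "noise". Because it only looks at pairwise linear correlation, it can miss features that matter through interactions or nonlinear effects.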

Using Correlation in Python

Python provides several libraries for calculating correlation coefficients. The most common library is NumPy, which provides the corrcoef() function for calculating Pearson’s correlation coefficient. Here’s an example:

import numpy as np

# Create two arrays
x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 4, 6, 8, 10])

# Calculate Pearson's correlation coefficient
corr_coef = np.corrcoef(x, y)[0, 1]

print(corr_coef)

This code will output 1.0 (or a value extremely close to it, because of floating-point rounding), which indicates a perfect positive correlation between x and y.

To calculate Spearman’s rank correlation coefficient or Kendall’s tau correlation coefficient, we can use the spearmanr() and kendalltau() functions from the SciPy library:

import numpy as np
from scipy.stats import spearmanr, kendalltau

# Create two arrays
x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 4, 6, 8, 10])

# Calculate Spearman's rank correlation coefficient
spearman_coef = spearmanr(x, y)[0]

# Calculate Kendall's tau correlation coefficient
kendall_coef = kendalltau(x, y)[0]

print(spearman_coef)
print(kendall_coef)

This code will output 1.0 for both Spearman’s rank correlation coefficient and Kendall’s tau correlation coefficient.

“Correlation can be misleading.” — Hans Rosling

Types of Correlation

By direction, correlation falls into three types: positive correlation, negative correlation, and zero correlation.


Positive Correlation

Positive correlation occurs when two variables increase or decrease together. For example, as the temperature outside increases, so does the number of ice cream cones sold. Here’s an example of how to calculate positive correlation in Python:

import numpy as np

# Create two arrays with positive correlation
x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 4, 6, 8, 10])

# Calculate Pearson's correlation coefficient
corr_coef = np.corrcoef(x, y)[0, 1]

print(corr_coef)

This code will output 1.0, which indicates a perfect positive correlation between x and y.

Negative Correlation

Negative correlation occurs when two variables move in opposite directions. For example, as the price of a product increases, the demand for that product decreases. Here’s an example of how to calculate negative correlation in Python:

import numpy as np

# Create two arrays with negative correlation
x = np.array([1, 2, 3, 4, 5])
y = np.array([10, 8, 6, 4, 2])

# Calculate Pearson's correlation coefficient
corr_coef = np.corrcoef(x, y)[0, 1]

print(corr_coef)

This code will output -1.0, which indicates a perfect negative correlation between x and y.

Zero Correlation

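Zero correlation occurs when there is no linear relationship between two variables. For example, there is no relationship between the number of shoes a person owns and their height. In real data the coefficient is rarely exactly zero, so in practice we look for values close to zero. Here’s an example of how to calculate the correlation of two weakly related variables in Python: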

import numpy as np

# Create two arrays with only a weak linear relationship
x = np.array([1, 2, 3, 4, 5])
y = np.array([10, 9, 7, 11, 8])

# Calculate Pearson's correlation coefficient
corr_coef = np.corrcoef(x, y)[0, 1]

print(corr_coef)

This code will output approximately -0.2, which indicates only a weak negative correlation between x and y. With so few data points, a sample correlation will rarely come out as exactly 0.
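It is worth adding that a Pearson coefficient of zero only rules out a linear relationship, not every relationship. The sketch below uses an assumed example, y = x², in which y is completely determined by x and yet the coefficient is zero because the increasing and decreasing halves cancel out.

```python
import numpy as np

x = np.array([-3, -2, -1, 0, 1, 2, 3])
y = x ** 2  # y depends entirely on x, but not linearly

r = np.corrcoef(x, y)[0, 1]

print(abs(round(r, 6)))
```

This is why it helps to plot the data as well as compute the coefficient before concluding that two variables are unrelated.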

Most Commonly Used Correlation Methods

Here are some of the most commonly used correlation methods in machine learning:

  1. Pearson Correlation Coefficient
  2. Spearman’s Rank Correlation Coefficient
  3. Kendall’s Tau Correlation Coefficient

“Correlation is a powerful tool, but it is important to use it wisely.” — David Hand

Pearson Correlation Coefficient

Pearson correlation coefficient is a measure of the linear relationship between two variables. It ranges from -1 to 1, where -1 indicates a perfect negative correlation, 0 indicates no correlation, and 1 indicates a perfect positive correlation.

Here’s how to calculate the Pearson correlation coefficient in Python:

import pandas as pd

# Load the data (assumes a file data.csv with numeric columns 'x' and 'y')
data = pd.read_csv('data.csv')

# Calculate the Pearson correlation coefficient
corr = data['x'].corr(data['y'], method='pearson')

print('Pearson correlation coefficient:', corr)

Spearman’s Rank Correlation Coefficient

Spearman’s rank correlation coefficient is a measure of the monotonic relationship between two variables. It ranges from -1 to 1, where -1 indicates a perfect negative monotonic relationship, 0 indicates no monotonic relationship, and 1 indicates a perfect positive monotonic relationship.

Here’s how to calculate the Spearman’s rank correlation coefficient in Python:

import pandas as pd

# Load the data (assumes a file data.csv with numeric columns 'x' and 'y')
data = pd.read_csv('data.csv')

# Calculate the Spearman's rank correlation coefficient
corr = data['x'].corr(data['y'], method='spearman')

print("Spearman's rank correlation coefficient:", corr)
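Spearman’s coefficient is simply Pearson’s coefficient applied to the ranks of the values instead of the values themselves. The following sketch checks this on a small set of made-up numbers, using SciPy’s rankdata() to compute the ranks.

```python
import numpy as np
from scipy.stats import rankdata, spearmanr

# Illustrative values only
x = np.array([10, 20, 30, 40, 50])
y = np.array([1.2, 0.9, 3.5, 3.1, 8.0])

# Spearman's rho as computed by SciPy
rho_scipy = spearmanr(x, y)[0]

# The same quantity: Pearson's r computed on the ranks of x and y
rho_ranks = np.corrcoef(rankdata(x), rankdata(y))[0, 1]

print(rho_scipy, rho_ranks)
```

Both approaches should give the same value up to floating-point rounding. Working on ranks is what makes the measure insensitive to the exact shape of a monotonic relationship.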

Kendall’s Tau Correlation Coefficient

Kendall’s tau correlation coefficient is a measure of the ordinal relationship between two variables. It ranges from -1 to 1, where -1 indicates a perfect negative ordinal relationship, 0 indicates no ordinal relationship, and 1 indicates a perfect positive ordinal relationship.

“Correlation is a fickle beast.” — Andrew Gelman

Here’s how to calculate the Kendall’s tau correlation coefficient in Python:

import pandas as pd

# Load the data (assumes a file data.csv with numeric columns 'x' and 'y')
data = pd.read_csv('data.csv')

# Calculate the Kendall's tau correlation coefficient
corr = data['x'].corr(data['y'], method='kendall')

print("Kendall's tau correlation coefficient:", corr)
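To see what “ordinal relationship” means in practice, Kendall’s tau can be computed by hand from pairs of observations: a pair is concordant if both variables move in the same direction and discordant if they move in opposite directions. The helper kendall_tau_a below is a hypothetical name for a minimal sketch that assumes the data contains no ties, in which case it matches the tau-b variant that SciPy computes by default.

```python
from itertools import combinations
from scipy.stats import kendalltau

def kendall_tau_a(x, y):
    """Tau-a: (concordant - discordant) / number of pairs. Assumes no ties."""
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        product = (x1 - x2) * (y1 - y2)
        if product > 0:
            concordant += 1   # both variables moved in the same direction
        elif product < 0:
            discordant += 1   # the variables moved in opposite directions
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs

# Illustrative values only
x = [1, 2, 3, 4, 5]
y = [2, 1, 4, 3, 5]

tau_manual = kendall_tau_a(x, y)
tau_scipy = kendalltau(x, y)[0]

print(tau_manual, tau_scipy)
```

Here 8 of the 10 pairs are concordant and 2 are discordant, so tau is (8 - 2) / 10 = 0.6.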

Conclusion

Correlation is an important concept in machine learning that helps us understand the relationships between the features in our dataset, and between each feature and the target variable. Python provides several libraries for calculating different types of correlation coefficients. By using these libraries and understanding the different types of correlation, we can select better features and build more accurate machine learning models.

“Correlation is not enough, but it is a start.” — Gary Marcus

Thank you for reading my blog post on Correlation in Machine Learning. I hope you found it informative and helpful. If you have any questions or feedback, please feel free to leave a comment below.

I also encourage you to check out my Portfolio and GitHub. You can find links to both in the description below.

I am always working on new and exciting projects, so be sure to subscribe to my blog so you don’t miss a thing!

Thanks again for reading, and I hope to see you next time!

[Portfolio Link] [Github Link]
