The Art of Feature Engineering

The Importance of Feature Engineering in Machine Learning

Mohit Mishra
8 min read · Jun 12, 2023

“Feature engineering is the art of turning data into gold.” — Oluwafemi Sule

Introduction

Feature engineering is the process of transforming raw data into features that are more informative and useful for machine learning algorithms. It is a critical step in the machine learning process, as it can significantly improve the performance of models.


There are many different challenges associated with feature engineering. Some of the most common challenges include:

  • Missing data: Missing data can be a major problem, as it can lead to bias in models. There are a number of techniques that can be used to deal with missing data, such as imputation and deletion.
  • Categorical data: Categorical data can be difficult to handle, as it is not directly understandable by machine learning algorithms. There are a number of techniques that can be used to convert categorical data into numerical data, such as one-hot encoding and ordinal encoding.
  • Outliers: Outliers are data points that are significantly different from the rest of the data. Outliers can lead to bias in models, and they should be treated carefully.
  • High dimensionality: High-dimensional data can be difficult to handle, as it can lead to overfitting. Techniques such as principal component analysis (PCA) and feature selection can be used to reduce the number of dimensions. (A short code sketch addressing several of these challenges follows this list.)
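To make these challenges concrete, here is a minimal, hedged sketch using pandas and scikit-learn; the tiny dataset and the column names (age, city, income) are invented purely for illustration:

import pandas as pd
from sklearn.decomposition import PCA

# A tiny made-up dataset that exhibits all four challenges
df = pd.DataFrame({
    'age': [25, None, 40, 35, 29],
    'city': ['NY', 'SF', 'NY', 'LA', 'SF'],
    'income': [50_000, 62_000, 1_000_000, 58_000, 61_000],
})

# Missing data: impute the median instead of deleting rows
df['age'] = df['age'].fillna(df['age'].median())

# Categorical data: one-hot encode the 'city' column
df = pd.get_dummies(df, columns=['city'], dtype=float)

# Outliers: clip extreme incomes to the 1st-99th percentile range
low, high = df['income'].quantile([0.01, 0.99])
df['income'] = df['income'].clip(low, high)

# High dimensionality: project onto a few principal components
reduced = PCA(n_components=2).fit_transform(df)
print(reduced.shape)  # (5, 2)

Median imputation, one-hot encoding, clipping, and PCA are only one option for each challenge; the right choice depends on the data and the model.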

“Feature engineering is the process of transforming data into features that better represent the underlying problem.” — Chris Albon

Why Do We Need Feature Engineering?

Feature engineering is important because it can significantly improve the performance of machine learning models. A well-engineered feature set can make it easier for models to learn the underlying relationships in the data, and it can also help reduce overfitting.

How Can We Address These Challenges?

There are several techniques that can be used to address the challenges of feature engineering. These techniques include:

  1. Domain knowledge: Having a deep understanding of the problem domain can help identify relevant features and transformations that are likely to be useful for the model. For example, in a medical diagnosis problem, domain knowledge about symptoms and diseases can be used to create features that capture the most important information.
  2. Feature selection: Feature selection is the process of selecting a subset of features that are most relevant to the problem at hand. This can help reduce the number of features and improve the performance of the model. One approach to feature selection is to use statistical methods such as correlation analysis or feature importance ranking.
  3. Feature scaling: Feature scaling is the process of standardizing or normalizing the values of each feature so that they have similar ranges. This can help improve the performance of some machine learning algorithms, such as those that rely on distance-based metrics.
  4. Feature transformation: Feature transformation is the process of applying mathematical functions to the features to create new features that capture more complex relationships between the variables. This can help to improve the performance of some machine learning algorithms, such as those that rely on linear models.
  5. Feature extraction: Feature extraction is the process of creating new features from raw data using techniques such as dimensionality reduction or clustering. This can help to reduce the complexity of the data and improve the performance of some machine learning algorithms. (A short code sketch combining several of these techniques appears right after this list.)
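As a minimal sketch of how feature selection, scaling, and transformation look in code, assuming made-up column names (f1, f2, f3) and an arbitrary correlation threshold of 0.2, one possible pipeline is:

import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler, PolynomialFeatures

# Made-up numeric data with a target column
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(100, 3)), columns=['f1', 'f2', 'f3'])
df['target'] = 2 * df['f1'] + rng.normal(size=100)

# Feature selection: keep features whose absolute correlation with the target exceeds 0.2
corr = df.corr()['target'].drop('target').abs()
selected = corr[corr > 0.2].index.tolist()

# Feature scaling: standardize the selected columns to zero mean and unit variance
X = StandardScaler().fit_transform(df[selected])

# Feature transformation: add squared and interaction terms for linear models
X_poly = PolynomialFeatures(degree=2, include_bias=False).fit_transform(X)
print(selected, X_poly.shape)

PolynomialFeatures is only one way to add non-linear terms; log, square-root, and domain-specific transforms are equally common.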

How to Perform Feature Engineering

“Feature engineering is the key to unlocking the power of machine learning.” — Jason Brownlee

Feature engineering is a creative process, and there is no single right way to do it. However, there are a few general steps that can be followed:

  1. Understand the data: The first step is to understand the data that you are working with. This includes understanding the data types, the distribution of the data, and any missing values.
  2. Identify the target variable: The target variable is the variable that you are trying to predict. Once you have identified the target variable, you can start to identify features that are likely to be predictive of the target variable.
  3. Select features: Once you have identified a set of features, you need to select the most informative features for your model. This can be done using a variety of techniques, such as feature selection algorithms and domain knowledge.
  4. Transform features: Once you have selected the features, you may need to transform them into a format that is more informative for your model. This can be done using techniques such as normalization, discretization, and binning.
  5. Create features: You may also want to create new features from existing ones. This can be done using techniques such as aggregation, correlation, and feature hashing.
  6. Train a model: Once you have engineered your features, you can train a machine learning model.
  7. Evaluate the model: Once you have trained a model, you need to evaluate its performance. This can be done using a variety of metrics, such as accuracy, precision, and recall.

Let’s implement the steps outlined above:

import pandas as pd
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Load the data
data = pd.read_csv('data.csv')

# Identify the target variable
target = data['target']

# Select the candidate features
features = data.drop('target', axis=1)

# Encode categorical features as numbers (one-hot encoding)
features = pd.get_dummies(features, dtype=float)

# Create new features: a log transform of every column,
# shifted so the argument is always positive
log_features = np.log1p(features - features.min())
features = pd.concat([features, log_features.add_suffix('_log')], axis=1)

# Transform features so they have comparable ranges
scaler = StandardScaler()
features = scaler.fit_transform(features)

# Select the most informative features
# (f_classif is used because chi2 cannot handle the negative values produced by scaling)
selector = SelectKBest(f_classif, k=min(10, features.shape[1]))
features = selector.fit_transform(features, target)

# Hold out a validation set so the model is not evaluated on its own training data
X_train, X_val, y_train, y_val = train_test_split(
    features, target, test_size=0.2, random_state=42)

# Train a model
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Evaluate the model
score = model.score(X_val, y_val)
print('Score:', score)

This code performs the following steps:

  1. Load the data into a Pandas DataFrame.
  2. Identify the target variable, which is the variable that we are trying to predict.
  3. One-hot encode the categorical columns so that every feature is numeric.
  4. Create new features from the existing ones using a (shifted) log transform.
  5. Scale the features with StandardScaler so they have comparable ranges.
  6. Select the most informative features with SelectKBest, using the f_classif score (chi2 requires non-negative inputs, which the standardized features are not).
  7. Train a Logistic Regression model on a training split.
  8. Evaluate the model’s performance on a held-out validation split using the score() method.

“Feature engineering is the difference between a good and a great machine learning model.” — Yufeng G

Mathematical Intuition

Here is an informal argument (not a formal proof) for why feature engineering can improve the performance of machine learning models.

  • Claim: Let X be a dataset and let X′ be X augmented with engineered features, so that every original feature of X is still available in X′. Then a model trained on X′ can fit the training data at least as well as the same kind of model trained on X.
  • Argument: Any model that uses only the original features of X is also a valid model over X′; it simply ignores the added columns. The space of models available over X′ therefore contains the space available over X, so the best achievable training performance can only stay the same or improve. This containment is written compactly below.
  • Caveat: The argument concerns training performance only. Added features can still hurt generalization through overfitting, which is why feature selection and validation on held-out data remain essential.
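In symbols, writing H(X) for the set of models that can be expressed using only the features in X, and L(h) for the training loss of a model h, the argument is just a containment of hypothesis spaces (a compact restatement of the claim above, not an additional result):

\[
  \mathcal{H}(X) \subseteq \mathcal{H}(X') \;\Longrightarrow\; \min_{h \in \mathcal{H}(X')} L(h) \le \min_{h \in \mathcal{H}(X)} L(h).
\]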

Let’s look at this from another angle.

Let X be a dataset of n observations with p features. Let Y be the target variable. The goal of feature engineering is to find a set of features F such that the model can learn a good predictive relationship between F and Y.

One way to find F is to use a greedy algorithm. The greedy algorithm starts with an empty set F and iteratively adds features to F that improve the model’s performance on a held-out validation set. The algorithm terminates when no further improvements can be made.

The greedy algorithm can be formalized as follows:

function GreedyFeatureSelection(X, Y, X_val, Y_val, k)
    F = {}
    best_score = -infinity
    repeat
        best_feature = none
        for each feature j not in F
            model = TrainModel(X, Y, F U {j})
            score = model.score(X_val, Y_val)
            if score > best_score
                best_score = score
                best_feature = j
        if best_feature != none
            F = F U {best_feature}
    until best_feature = none or |F| = k
    return F
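The pseudocode above can be turned into a small runnable Python sketch. This is only one way to write it, assuming the features are columns of a pandas DataFrame, the estimator follows the scikit-learn API, and separate validation data is available; the function name and signature are illustrative, not taken from any library:

import numpy as np
from sklearn.base import clone

def greedy_feature_selection(model, X_train, y_train, X_val, y_val, k):
    # Greedily add the single feature that most improves the validation score,
    # stopping after k features or when no addition helps any more.
    selected, remaining = [], list(X_train.columns)
    best_score = -np.inf
    while remaining and len(selected) < k:
        best_feature = None
        for feature in remaining:
            candidate = selected + [feature]
            fitted = clone(model).fit(X_train[candidate], y_train)
            score = fitted.score(X_val[candidate], y_val)
            if score > best_score:
                best_score, best_feature = score, feature
        if best_feature is None:  # no single feature improves the score
            break
        selected.append(best_feature)
        remaining.remove(best_feature)
    return selected

For example, greedy_feature_selection(LogisticRegression(max_iter=1000), X_train, y_train, X_val, y_val, k=10) would return up to ten column names, in the order they were added.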

The greedy algorithm is a simple and efficient way to find a good set of features. However, it is not guaranteed to find the optimal set of features. A more sophisticated approach is to use a genetic algorithm.

The genetic algorithm starts with a population of randomly generated feature sets. The population is then iteratively improved by applying genetic operators such as mutation and crossover. The algorithm terminates when a pre-defined number of generations has been reached or when no further improvements can be made.

The genetic algorithm can be formalized as follows:

function GeneticFeatureSelection(X, Y, X_val, Y_val, k)
    population = GenerateRandomPopulation(p, k)
    for i = 1 to num_generations
        EvaluateFitness(population, X, Y, X_val, Y_val)
        new_population = {}
        for j = 1 to population_size
            parent1, parent2 = SelectParents(population)   # fitter feature sets are more likely to be chosen
            child = Mutate(Crossover(parent1, parent2))
            new_population = new_population U {child}
        population = new_population
    EvaluateFitness(population, X, Y, X_val, Y_val)
    return BestIndividual(population)

The genetic algorithm is a more powerful approach to feature engineering than the greedy algorithm. However, it is also more computationally expensive.

The choice of feature engineering method depends on the specific problem at hand. For simple problems, the greedy algorithm may be sufficient. For more complex problems, the genetic algorithm may be required.

In addition to the greedy algorithm and the genetic algorithm, there are many other methods for feature engineering. Some of these methods include:

  • Principal component analysis (PCA)
  • Singular value decomposition (SVD)
  • Independent component analysis (ICA)
  • Feature selection algorithms
  • Feature extraction algorithms

Again, which method to use depends on the problem: PCA is often used to reduce the dimensionality of data, while ICA is often used to find statistically independent components.
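As a hedged sketch of that last point (the synthetic data below is invented only to show the API), both decompositions are one-liners in scikit-learn:

import numpy as np
from sklearn.decomposition import PCA, FastICA

# Synthetic data: 3 non-Gaussian hidden sources mixed into 10 observed features
rng = np.random.default_rng(0)
sources = rng.laplace(size=(200, 3))
X = sources @ rng.normal(size=(3, 10))

# PCA: reduce dimensionality while keeping most of the variance
X_pca = PCA(n_components=3).fit_transform(X)

# ICA: recover statistically independent components
X_ica = FastICA(n_components=3, random_state=0).fit_transform(X)

print(X_pca.shape, X_ica.shape)  # (200, 3) (200, 3)

PCA components are ordered by explained variance, whereas ICA components have no natural ordering; that difference often decides which one suits a given problem.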

“Feature engineering is where the real magic happens in machine learning.” — Pedro Domingos

Books To Read for Better Understanding

  • Feature Engineering for Machine Learning: Principles and Techniques for Data Scientists by Alice Zheng and Amanda Casari. This book is a great introduction to feature engineering, with a focus on the principles and techniques used in the field.
  • Feature Engineering and Selection: A Practical Approach for Predictive Models by Max Kuhn and Kjell Johnson. This book is another excellent introduction, covering a wide range of topics, from data cleaning to feature selection.
  • Python Feature Engineering Cookbook by Soledad Galli. This book is a great resource for learning how to use Python to perform feature engineering tasks.
  • The Art of Feature Engineering: Essentials for Machine Learning by Pablo Duboue. This book is a more advanced book on feature engineering, covering topics such as feature selection, dimensionality reduction, and feature extraction.

Conclusion

Feature engineering is an important part of the machine learning process. It can significantly improve the performance of machine learning models by transforming raw data into features that are more informative and useful. There are a number of different challenges associated with feature engineering, but there are also a number of techniques that can be used to overcome these challenges.

Thank you for reading my blog post on The Art of Feature Engineering. I hope you found it informative and helpful. If you have any questions or feedback, please feel free to leave a comment below.

I also encourage you to check out my Portfolio and GitHub. You can find links to both in the description below.

I am always working on new and exciting projects, so be sure to subscribe to my blog so you don’t miss a thing!

Thanks again for reading, and I hope to see you next time!

[Portfolio Link] [Github Link]
