[Paper Summary] ImageNet Classification with Deep Convolutional Neural Networks
The Impact of Depth on Image Classification Performance
About Me
My name is Mohit Mishra, and I’m a blogger who creates intriguing content that leaves readers wanting more. Anyone interested in machine learning and data science should check out my blog. My writing is designed to keep you engaged and intrigued, with a regular publishing schedule of a new piece every two days. Follow along for in-depth information that will leave you wanting more!
If you liked the article, please clap and follow me, since that encourages me to write more and better content. Links to my GitHub account and Portfolio are at the bottom of the blog.
Abstract
- They trained a deep CNN to classify the 1.2 million high-resolution images of the ImageNet LSVRC-2010 contest into 1000 different classes.
- They achieved top-1 and top-5 error rates of 37.5% and 17.0%, respectively, which was considerably better than the previous state of the art.
- To make training faster, they used non-saturating neurons (ReLUs) and a very efficient GPU implementation of the convolution operation.
- To reduce overfitting in the fully connected layers, they employed a regularization method called Dropout.
Non-Saturating Neurons
- These are neurons that do not saturate when the inputs pass through them.
- To illustrate this, let’s take the example of the Sigmoid Activation Function (Saturating Neurons).
- Sigmoid maps its input to a value between 0 and 1.
- For a very large positive input the output is essentially 1, and for a very large negative input it is essentially 0.
- Therefore, if these neurons continuously receive large-magnitude inputs, the activation function gets stuck near its extreme values, leading to saturation.
- Saturation means the gradient of the activation function becomes very small, and consequently the gradient of the loss function with respect to the network’s weights also becomes very small.
- This makes it difficult to train the network effectively because the weights do not get updated properly during backpropagation.
- To address this issue, we can use non-saturating neurons like ReLU (see the sketch below).
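To make the saturation argument concrete, here is a minimal NumPy sketch (my own illustration, not code from the paper) comparing the gradients of sigmoid and ReLU for increasingly large inputs:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)            # derivative of sigmoid, shrinks toward 0 for large |x|

def relu_grad(x):
    return 1.0 if x > 0 else 0.0    # derivative of ReLU stays at 1 for any positive input

for x in [0.0, 5.0, 20.0]:
    print(f"x={x:5.1f}  sigmoid'={sigmoid_grad(x):.2e}  relu'={relu_grad(x):.0f}")
```

For x = 20 the sigmoid gradient is on the order of 1e-9, so almost no learning signal flows back through such a neuron, whereas the ReLU gradient is still 1.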
Introduction
- Previously, datasets of labeled images were relatively small, on the order of tens of thousands of images (e.g., NORB, Caltech, and CIFAR).
- Simple image recognition tasks can easily be done using datasets of this size, especially if they are augmented with label-preserving transformations.
- For example, the current best error rate on the MNIST digit recognition task (<0.3%) approaches human performance.
- Researchers have long acknowledged the limitations of small image datasets: objects in realistic settings exhibit considerable variability, so much larger training sets are needed to learn to recognize them.
- Collecting labeled datasets with millions of images has recently become possible.
- Notable larger datasets include LabelMe, which has hundreds of thousands of fully-segmented images.
- ImageNet is another notable dataset, containing over 15 million labeled high-resolution images in 22,000 categories.
- Learning about thousands of objects from millions of images requires a model with a large learning capacity.
- Convolutional neural networks (CNNs) are effective models for object recognition.
- CNNs have fewer connections and parameters compared to standard feedforward neural networks, making them easier to train.
- Current GPUs, along with optimized implementations of 2D convolution, enable the training of large CNNs on high-resolution images.
- The paper presents the training of one of the largest CNNs on ImageNet subsets, achieving the best results reported on those datasets.
- The network’s depth (five convolutional and three fully-connected layers) is crucial for its performance.
- Overfitting was addressed through various techniques.
- The network’s size is limited by available GPU memory and training time.
- Results can potentially be improved with faster GPUs and larger datasets.
Top-1 and Top-5 Error Rates
When we talk about ImageNet, there are two types of errors that we look at: top-1 and top-5 error rates. Let me explain what these mean.
Imagine you have a picture, and you want the computer to tell you what’s in the picture. The computer looks at the picture and makes a guess about what it could be. The top-1 error rate is the percentage of times when the computer’s first guess is incorrect. So, if the computer says the picture shows a cat, but it’s actually a dog, that would be a top-1 error.
Now, let’s move on to the top-5 error rate. The computer doesn’t just make one guess; it considers multiple possibilities. The top-5 error rate measures how often the correct answer is not among the computer’s top five guesses. So, if the computer’s top five guesses are cat, tree, car, bird, and house, but the picture is of a dog, then that would be a top-5 error.
These error rates help us understand how well the computer can identify objects in pictures. The lower the error rates, the better the computer is at recognizing things accurately.
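Here is a minimal NumPy sketch of how these two error rates can be computed; the scores and labels are random placeholders standing in for a model's outputs on a test set:

```python
import numpy as np

def topk_error(scores, labels, k):
    """scores: (N, num_classes) model scores; labels: (N,) true class indices."""
    topk = np.argsort(scores, axis=1)[:, -k:]         # indices of the k highest scores
    correct = (topk == labels[:, None]).any(axis=1)   # is the true label among the top k?
    return 1.0 - correct.mean()

rng = np.random.default_rng(0)
scores = rng.standard_normal((1000, 1000))            # 1000 fake images, 1000 classes
labels = rng.integers(0, 1000, size=1000)
print("top-1 error:", topk_error(scores, labels, 1))  # ~0.999 for random scores
print("top-5 error:", topk_error(scores, labels, 5))  # ~0.995 for random scores
```

With purely random scores the errors sit near 99.9% and 99.5%, which is exactly chance level for 1000 classes; a trained model pushes both numbers far lower.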
The Dataset
- ImageNet is a dataset of over 15 million labeled high-resolution images belonging to roughly 22,000 categories.
- An annual competition called the ImageNet Large-Scale Visual Recognition Challenge (ILSVRC) has been held since 2010, using a subset of ImageNet with roughly 1000 images in each of 1000 categories.
- On ImageNet, it is customary to report two error rates: top-1 and top-5.
- The architecture of their network contains eight learned layers — five convolutional and three fully-connected.
- Their network was trained on the (centered) raw RGB values of the pixels; a preprocessing sketch follows below.
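As a rough illustration of that last bullet, here is a minimal NumPy/PIL sketch of the preprocessing: rescale so the shorter side is 256, take the central 256×256 crop, and subtract the training-set mean image. The file path and the zero mean image are placeholders rather than values from the paper:

```python
import numpy as np
from PIL import Image

def load_centered(path, mean_image):
    img = Image.open(path).convert("RGB")
    w, h = img.size
    scale = 256 / min(w, h)                                # shorter side -> 256 pixels
    img = img.resize((round(w * scale), round(h * scale)))
    left, top = (img.width - 256) // 2, (img.height - 256) // 2
    img = img.crop((left, top, left + 256, top + 256))     # central 256x256 patch
    x = np.asarray(img, dtype=np.float32)                  # raw RGB values
    return x - mean_image                                  # "centered": mean subtracted

mean_image = np.zeros((256, 256, 3), dtype=np.float32)     # placeholder for the training-set mean
# patch = load_centered("some_image.jpg", mean_image)      # -> (256, 256, 3) float array
```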
ReLU Nonlinearity
The standard way to model a neuron’s output is with saturating nonlinearities like tanh(x) or (1 + e^-x)^-1. However, Rectified Linear Units (ReLUs) offer faster training and better performance. Deep convolutional neural networks with ReLUs train several times faster than those with saturating units. By using ReLUs, large neural networks can be experimented with more effectively. While other alternatives have been considered, the accelerated ability to fit the training set distinguishes ReLUs. Faster learning greatly influences the performance of large models trained on large datasets.
Training with Multiple GPUs
To overcome the memory limitation of a single GTX 580 GPU with 3GB of memory, we spread the network across two GPUs. Current GPUs can read from and write to one another's memory directly, which facilitates efficient parallelization. In our parallelization scheme, we divide the kernels (neurons) between the GPUs, and the GPUs communicate only in certain layers. This architecture, similar to a "columnar" CNN, reduces the top-1 and top-5 error rates compared with a network with half as many kernels in each convolutional layer trained on one GPU. The two-GPU network also takes slightly less time to train than the one-GPU network.
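The idea of splitting a layer's kernels between two devices can be sketched in a few lines of PyTorch. This is only an illustration of the scheme, assuming two available CUDA devices and made-up layer sizes; it is not the authors' original CUDA implementation:

```python
import torch
import torch.nn as nn

class SplitConv(nn.Module):
    """Half of the output kernels live on each GPU; both halves see the same input."""
    def __init__(self, in_ch, out_ch, **kw):
        super().__init__()
        self.a = nn.Conv2d(in_ch, out_ch // 2, **kw).to("cuda:0")
        self.b = nn.Conv2d(in_ch, out_ch // 2, **kw).to("cuda:1")

    def forward(self, x):
        ya = self.a(x.to("cuda:0"))                 # GPU 0 computes its half of the kernels
        yb = self.b(x.to("cuda:1"))                 # GPU 1 computes the other half
        return torch.cat([ya, yb.to("cuda:0")], 1)  # "communication": gather on one device

# layer = SplitConv(3, 96, kernel_size=11, stride=4)   # requires two CUDA devices
```

In the paper the cross-GPU communication happens only in certain layers; in the other layers each GPU's kernels read only from the kernel maps that live on the same GPU.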
Local Response Normalization
ReLUs (Rectified Linear Units) have a desirable property: unlike some other activation functions, they do not require input normalization to avoid saturating. As long as at least some training examples produce a positive input to a ReLU, learning will happen in that neuron. However, the researchers found that applying a specific local normalization scheme still improves the network's generalization. This scheme adjusts the activity of each neuron based on the activities of neighboring kernels at the same position, after the ReLU is applied, and helps keep the network's responses balanced for better learning and performance.
Imagine you have a group of neurons in a neural network. Each neuron calculates its activity based on the inputs it receives from other neurons. In this case, the activity is computed using something called a kernel, and then a special function called ReLU is applied to it. But wait, there’s more! After applying ReLU, we want to further adjust the neuron’s activity to make it work even better. This adjustment is done using a process called response normalization. It’s like giving the neuron a boost or making it compete with its neighboring neurons.
Here’s how it works: the activity of a neuron at a particular position is divided by a term that sums the squared activities of a certain number of adjacent kernel maps at the same spatial position. The ordering of the kernel maps is arbitrary and fixed before training begins. This response normalization creates competition among neighboring neurons for large activities, a form of lateral inhibition inspired by real neurons in the brain. These adjustments help the network recognize patterns and generalize better.
The constants used in this process, k, n, α, and β, are hyperparameters tuned on a validation set; the paper uses k = 2, n = 5, α = 10^-4, and β = 0.75. It’s like tuning the knobs on a stereo to get the right sound. The researchers found that this scheme reduced the network’s top-1 and top-5 error rates by 1.4% and 1.2%, respectively, and that it also helped on CIFAR-10.
So, in simple terms, response normalization is like giving a boost to some neurons and making them work together, which helps the neural network perform better at recognizing things.
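For reference, this normalization can be written out in a few lines. Below is a minimal NumPy sketch of the scheme described above, using the paper's reported hyperparameters (k = 2, n = 5, α = 10^-4, β = 0.75); the input activations are random placeholders:

```python
import numpy as np

def local_response_norm(a, k=2.0, n=5, alpha=1e-4, beta=0.75):
    """a: ReLU activations of shape (channels, height, width); normalize across channels."""
    C = a.shape[0]
    b = np.empty_like(a)
    for i in range(C):
        lo, hi = max(0, i - n // 2), min(C - 1, i + n // 2)    # n adjacent kernel maps
        denom = k + alpha * np.sum(a[lo:hi + 1] ** 2, axis=0)  # sum of squared activities
        b[i] = a[i] / denom ** beta
    return b

a = np.random.rand(96, 55, 55).astype(np.float32)   # e.g. output of the first conv layer
print(local_response_norm(a).shape)                 # (96, 55, 55)
```

PyTorch ships an equivalent layer, torch.nn.LocalResponseNorm, if you want to drop this into a model instead of writing it by hand.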
Overlapping Pooling
- Imagine you have a bunch of neurons arranged in a grid. Each neuron calculates something based on its input. Now, imagine another layer called a pooling layer. This layer summarizes the outputs of nearby groups of neurons.
- In traditional pooling, each group of neurons doesn’t overlap with its neighbors. It’s like having different groups of friends where no one is in multiple groups at the same time.
- But sometimes, overlapping can be useful. In our case, we use overlapping pooling, which means the groups of neurons have some overlap. It’s like having a group of friends where some people are part of multiple groups.
- By using overlapping pooling, the network performs slightly better at recognizing things correctly: it reduces the top-1 and top-5 error rates by 0.4% and 0.3%, respectively, compared to the non-overlapping pooling approach.
- So, overlapping pooling is like having friends who are part of multiple groups, and this helps our network perform better at recognizing things accurately.
- They also observed during training that models with overlapping pooling are slightly harder to overfit (a small sketch follows below).
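Here is a minimal PyTorch sketch of the difference: the paper's overlapping pooling uses a 3×3 window with stride 2, versus the traditional non-overlapping 2×2 window with stride 2. The input tensor is a made-up example with the shape of the first convolutional layer's output:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 96, 55, 55)                             # e.g. activations after conv1

overlapping     = nn.MaxPool2d(kernel_size=3, stride=2)    # windows overlap (window > stride)
non_overlapping = nn.MaxPool2d(kernel_size=2, stride=2)    # windows tile the grid (window == stride)

print(overlapping(x).shape)       # torch.Size([1, 96, 27, 27])
print(non_overlapping(x).shape)   # torch.Size([1, 96, 27, 27])
```

Both settings happen to produce a 27×27 output grid here; the difference is that in the overlapping case each pooling window shares a row and column of inputs with its neighbors, which is what gives the small accuracy gain reported above.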
Architecture
- The CNN architecture consists of eight layers: five convolutional and three fully connected.
- The last fully-connected layer produces a distribution over 1000 class labels.
- Response-normalization layers and max-pooling layers are applied at specific points in the network.
- The ReLU non-linearity is applied to the output of every convolutional and fully-connected layer.
- The first convolutional layer uses 96 kernels of size 11x11x3, and the second convolutional layer uses 256 kernels of size 5x5x48.
- The third, fourth, and fifth convolutional layers are connected to each other without pooling or normalization.
- The fully-connected layers each have 4096 neurons; a single-GPU sketch of the full architecture follows below.
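The following PyTorch sketch puts these bullets together. It is a single-GPU approximation with the layer sizes listed above, so it omits the two-GPU kernel split (which is why the paper quotes kernel depths like 5×5×48) and should be read as an illustration rather than a faithful reproduction of the original network:

```python
import torch
import torch.nn as nn

class AlexNetSketch(nn.Module):
    def __init__(self, num_classes=1000):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 96, kernel_size=11, stride=4), nn.ReLU(inplace=True),
            nn.LocalResponseNorm(5, alpha=1e-4, beta=0.75, k=2.0),
            nn.MaxPool2d(kernel_size=3, stride=2),                  # overlapping pooling
            nn.Conv2d(96, 256, kernel_size=5, padding=2), nn.ReLU(inplace=True),
            nn.LocalResponseNorm(5, alpha=1e-4, beta=0.75, k=2.0),
            nn.MaxPool2d(kernel_size=3, stride=2),
            nn.Conv2d(256, 384, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(384, 384, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(384, 256, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),
        )
        self.classifier = nn.Sequential(
            nn.Dropout(0.5), nn.Linear(256 * 6 * 6, 4096), nn.ReLU(inplace=True),
            nn.Dropout(0.5), nn.Linear(4096, 4096), nn.ReLU(inplace=True),
            nn.Linear(4096, num_classes),                           # 1000-way output
        )

    def forward(self, x):
        x = self.features(x)
        return self.classifier(torch.flatten(x, 1))

# logits = AlexNetSketch()(torch.randn(1, 3, 227, 227))   # -> shape (1, 1000)
```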
Reducing Overfitting
Their network has 60 million parameters. Although the 1000 ILSVRC classes constrain the mapping from images to labels, that constraint is not enough to learn so many parameters without considerable overfitting. They employ two main strategies to combat overfitting.
Data Augmentation
To reduce overfitting on image data, they use data augmentation. The first method extracts random 224×224 patches (and their horizontal reflections) from the 256×256 images, increasing the effective training set size by a factor of 2048. At test time, predictions are made by averaging the network's softmax outputs over ten patches: the four corner patches and the center patch, plus their horizontal reflections. The second method alters the RGB intensities by performing PCA on the set of RGB pixel values in the training set and adding multiples of the principal components; this captures the fact that object identity is invariant to changes in the intensity and color of the illumination. The second method reduces the top-1 error rate by over 1%.
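Both augmentation ideas are easy to sketch. Below is a minimal NumPy version; the eigenvalues and eigenvectors are placeholders (in practice they come from a PCA over the training set's RGB pixel values), and the noise scale 0.1 is the standard deviation reported in the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def random_crop_flip(img, size=224):
    """img: (256, 256, 3) array -> random crop, horizontally flipped half the time."""
    top  = rng.integers(0, img.shape[0] - size + 1)
    left = rng.integers(0, img.shape[1] - size + 1)
    patch = img[top:top + size, left:left + size]
    return patch[:, ::-1] if rng.random() < 0.5 else patch

def pca_color_jitter(img, eigvals, eigvecs, sigma=0.1):
    """Add eigvecs @ (alpha * eigvals) to every pixel, with alpha ~ N(0, sigma)."""
    alpha = rng.normal(0.0, sigma, size=3)
    shift = eigvecs @ (alpha * eigvals)      # one RGB offset applied to the whole image
    return img + shift

img = rng.random((256, 256, 3)).astype(np.float32)   # stand-in for a preprocessed image
eigvals = np.array([0.2, 0.05, 0.01])                # placeholder PCA eigenvalues
eigvecs = np.eye(3)                                  # placeholder PCA eigenvectors
out = pca_color_jitter(random_crop_flip(img), eigvals, eigvecs)
print(out.shape)                                     # (224, 224, 3)
```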
Dropout
- Combining predictions from multiple models can reduce test errors.
- Traditional model combination is expensive for large neural networks.
- Dropout is an efficient technique that sets the output of hidden neurons to zero with a 0.5 probability during training.
- Dropped-out neurons don’t contribute to the forward pass or backpropagation.
- Dropout forces the network to learn more robust features that work with different subsets of neurons.
- At test time, all neurons are used, but their outputs are multiplied by 0.5 to approximate the geometric mean of the predictive distributions of the exponentially many dropout networks (a small sketch follows below).
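Here is a minimal NumPy sketch of this scheme as described above: zero each hidden activation with probability 0.5 during training, and scale all activations by 0.5 at test time. (Modern frameworks typically implement the equivalent "inverted" dropout, which scales at training time instead.)

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(activations, p=0.5, train=True):
    if train:
        mask = rng.random(activations.shape) >= p   # keep each unit with probability 1 - p
        return activations * mask                   # dropped units contribute nothing
    return activations * (1.0 - p)                  # test time: scale all outputs by 0.5

h = rng.standard_normal(4096)                       # e.g. the output of a fully-connected layer
print(dropout(h, train=True)[:5])                   # roughly half the entries are zeroed
print(dropout(h, train=False)[:5])                  # all entries, halved
```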
Details of Learning
- Models were trained using stochastic gradient descent with specific settings: batch size of 128, momentum of 0.9, and weight decay of 0.0005.
- Weight decay was important for the model’s learning and reduced training error.
- The weight update rule combined momentum, weight decay, and the average gradient of the objective with respect to the weights over the batch (see the sketch after this list).
- Weights were initialized from a Gaussian distribution with a standard deviation of 0.01.
- Biases in some layers were initialized to 1 (giving the ReLUs positive inputs early on) and to 0 in the remaining layers, to accelerate the early stages of learning.
- The learning rate was adjusted manually throughout training: it started at 0.01 and was divided by 10 whenever the validation error stopped improving.
- The network was trained for approximately 90 cycles on a dataset of 1.2 million images over five to six days using two NVIDIA GTX 580 3GB GPUs.
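The update rule itself is compact. Here is a minimal NumPy sketch using the hyperparameters listed above; the gradient is a random placeholder standing in for the batch-averaged gradient of the loss:

```python
import numpy as np

def sgd_step(w, v, grad, lr, momentum=0.9, weight_decay=0.0005):
    """v <- 0.9*v - 0.0005*lr*w - lr*grad;  w <- w + v"""
    v = momentum * v - weight_decay * lr * w - lr * grad
    return w + v, v

rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.01, size=1000)   # weights drawn from a Gaussian with std 0.01
v = np.zeros_like(w)                   # momentum buffer
grad = rng.standard_normal(1000)       # placeholder for the batch-averaged gradient
w, v = sgd_step(w, v, grad, lr=0.01)   # one update with the initial learning rate
```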
Results
- The results of the CNN model on the ILSVRC-2010 test set show a top-1 error rate of 37.5% and a top-5 error rate of 17.0%.
- Other approaches achieved lower performance with top-1 and top-5 error rates of 47.1%/28.2% (sparse coding) and 45.7%/25.7% (SIFT + FVs).
- In the ILSVRC-2012 competition, the CNN achieved a top-5 error rate of 18.2%, and averaging predictions of similar CNNs reduced the error rate to 16.4%.
- Pre-training a CNN on the ImageNet Fall 2011 release and fine-tuning it on ILSVRC-2012 gave a top-5 error rate of 16.6%, and averaging two such pre-trained CNNs with the five CNNs above reduced it to 15.3%.
- The second-best contest entry achieved an error rate of 26.2% using different types of densely-sampled features.
- On the Fall 2009 version of ImageNet, the CNN with an additional convolutional layer achieved a top-1 error rate of 67.4% and a top-5 error rate of 40.9%.
- The previously best published results on this dataset were 78.1% and 60.9%.
Qualitative Evaluation
- The learned convolutional kernels show clear specialization: kernels on one GPU are largely color-agnostic, while those on the other GPU are largely color-specific.
- The qualitative figures in the paper demonstrate the network's ability to recognize objects even when they are off-center, and show reasonable top-5 label predictions.
- Feature activations in the last hidden layer reveal similarities between images based on their Euclidean distance, despite variations in pixel-level appearance.
- Computing Euclidean distances between such high-dimensional vectors is inefficient, but it could be made efficient by training an auto-encoder to compress them into short binary codes, enabling image retrieval based on semantic similarity (a small retrieval sketch follows below).
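Here is a minimal NumPy sketch of the retrieval idea: compare images by the Euclidean distance between their last-hidden-layer feature vectors. The 4096-dimensional features here are random placeholders standing in for real network activations:

```python
import numpy as np

rng = np.random.default_rng(0)
gallery = rng.standard_normal((10000, 4096))      # features of a hypothetical image collection
query   = rng.standard_normal(4096)               # features of the query image

dists = np.linalg.norm(gallery - query, axis=1)   # Euclidean distance to every gallery image
nearest = np.argsort(dists)[:5]                   # indices of the 5 most similar images
print(nearest)
```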
Discussion
- A large, deep convolutional neural network achieves remarkable performance on a challenging dataset through supervised learning.
- Removing a single convolutional layer results in a significant performance drop of approximately 2% for the top-1 accuracy, highlighting the importance of network depth.
- Although unsupervised pre-training is not utilized in the experiments, it is expected to be beneficial, especially with increased computational power and network size.
- The network’s performance has improved with larger size and longer training, but there is still a substantial gap compared to the capabilities of the human visual system.
- The goal is to apply very large and deep convolutional networks to video sequences, leveraging temporal information absent in static images.
Thank you for reading my blog post on [Paper Summary] ImageNet Classification with Deep Convolutional Neural Networks. I hope you found it informative and helpful. If you have any questions or feedback, please feel free to leave a comment below.
I also encourage you to check out my Portfolio and GitHub. You can find links to both in the description below.
I am always working on new and exciting projects, so be sure to subscribe to my blog so you don’t miss a thing!
Thanks again for reading, and I hope to see you next time!