The Power of Recurrent Neural Networks: A Fun Introduction

A Beginner’s Guide to Recurrent Neural Networks: Understanding the Basics

Mohit Mishra
7 min read · Mar 25


Neurons with Recurrence

  • If you look at just one of these nodes, you can see that it takes an input, does some work on it, and then gives an output. Here x and y are both real numbers.
Single Neuron
  • The above image simply depicts a perceptron; the green block is where all the operations on our input happen. This is for a single time step, but the same computation can be repeated over many time steps, as shown below.

We can also simplify the whole diagram into a more compact form:

Note: The same function and set of parameters are used at every time step.

Code for the above intuition:

# This is just pseudocode: RNN() stands in for any recurrent cell

my_rnn = RNN()
hidden_state = [0, 0, 0]
sentence = ["I", "love", "mathematics"]

# Iterate through the whole sentence, word by word
for word in sentence:
    # At each step the RNN updates the hidden state and makes a prediction
    prediction, hidden_state = my_rnn(word, hidden_state)

# The prediction made after the last word is the next-word prediction
next_word_prediction = prediction
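For readers who want something they can actually run, here is a minimal sketch of the same loop in plain Python. The class name `TinyRNN`, the helper `dot`, the one-hot encoding, and the weight ranges are all my assumptions for illustration. The weights are random and untrained, so the predicted word carries no meaning; the point is only to see the hidden state being passed from one step to the next.

```python
import math
import random


def dot(a, b):
    """Dot product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))


class TinyRNN:
    """Hypothetical toy RNN cell with random, untrained weights."""

    def __init__(self, vocab, hidden_size=3, seed=0):
        rng = random.Random(seed)
        self.vocab = vocab
        n = len(vocab)

        def matrix(rows, cols):
            return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

        self.W_xh = matrix(hidden_size, n)            # input  -> hidden
        self.W_hh = matrix(hidden_size, hidden_size)  # hidden -> hidden
        self.W_hy = matrix(n, hidden_size)            # hidden -> output

    def __call__(self, word, hidden_state):
        # One-hot encode the current word
        x = [1.0 if w == word else 0.0 for w in self.vocab]
        # New hidden state: tanh(W_hh h + W_xh x)
        new_hidden = [
            math.tanh(dot(hh_row, hidden_state) + dot(xh_row, x))
            for hh_row, xh_row in zip(self.W_hh, self.W_xh)
        ]
        # Score every word in the vocabulary and pick the highest one
        scores = [dot(row, new_hidden) for row in self.W_hy]
        prediction = self.vocab[scores.index(max(scores))]
        return prediction, new_hidden


sentence = ["I", "love", "mathematics"]
my_rnn = TinyRNN(vocab=sentence)
hidden_state = [0.0, 0.0, 0.0]

for word in sentence:
    prediction, hidden_state = my_rnn(word, hidden_state)

next_word_prediction = prediction
```

Note that the same `my_rnn` object, and therefore the same three weight matrices, is called at every step of the loop.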

RNNs: Computational Graph Across Time

Code for the above intuition:

import tensorflow as tf


class MyRNNCell(tf.keras.layers.Layer):
    def __init__(self, rnn_units, input_dim, output_dim):
        super().__init__()

        # Initialise the weight matrices
        self.W_xh = self.add_weight(shape=(rnn_units, input_dim))
        self.W_hh = self.add_weight(shape=(rnn_units, rnn_units))
        self.W_hy = self.add_weight(shape=(output_dim, rnn_units))

        # Initialise the hidden state to zeros (a column vector)
        self.h = tf.zeros([rnn_units, 1])

    def call(self, x):
        # x is expected to be a column vector of shape [input_dim, 1]
        # Update the hidden state: h_t = tanh(W_hh h_{t-1} + W_xh x_t)
        self.h = tf.math.tanh(tf.matmul(self.W_hh, self.h) + tf.matmul(self.W_xh, x))

        # Compute the output: y_t = W_hy h_t
        output = tf.matmul(self.W_hy, self.h)

        # Return the current output and hidden state
        return output, self.h

# TensorFlow also ships a built-in layer for a simple RNN: tf.keras.layers.SimpleRNN

RNN for Sequence Modelling

RNNs can be used for several kinds of sequence modeling problems. Some of them are listed below:

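  • One-to-One: one input, one output (e.g. Image to Image)
  • One-to-Many: one input, a sequence of outputs (e.g. Image to Text, as in image captioning)
  • Many-to-One: a sequence of inputs, one output (e.g. Text to Image, or text to a single sentiment label)
  • Many-to-Many: a sequence of inputs, a sequence of outputs (e.g. Text Translation)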

The many-to-many setup can also be used for frame-level video classification.
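The difference between many-to-many and many-to-one mostly comes down to which outputs we keep. Here is a minimal sketch with a scalar RNN; the function name `run_rnn` and the weight values are made up for illustration and nothing is trained.

```python
import math


def run_rnn(inputs, w_xh=0.5, w_hh=0.8, w_hy=2.0):
    """Run a scalar tanh RNN and return the output at every time step."""
    h = 0.0
    outputs = []
    for x in inputs:
        h = math.tanh(w_hh * h + w_xh * x)  # update the hidden state
        outputs.append(w_hy * h)            # output for this time step
    return outputs


sequence = [1.0, -0.5, 0.25, 2.0]

# Many-to-many: keep one output per input step (e.g. one label per video frame)
many_to_many = run_rnn(sequence)

# Many-to-one: keep only the final output (e.g. one label for the whole sequence)
many_to_one = run_rnn(sequence)[-1]
```

The network itself is identical in both cases; only the way we read its outputs changes.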

Design Criteria

By design criteria, I mean the requirements a network has to meet in order to handle sequence modeling problems robustly and reliably.

In simple words: what characteristics does an RNN need in order to handle sequential data?

Some of the criteria are listed below:

  • Handle variable-length sequences
  • Track long-term dependencies
  • Maintain information about order
  • Share parameters across the sequence: the same set of weights is used at every time step.
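Three of these criteria (variable length, order, and shared parameters) can be seen directly in a toy recurrence. This is only a sketch: the `PARAMS` values and the function name `final_hidden_state` are hypothetical, and tracking long-term dependencies is not something this tiny example demonstrates.

```python
import math

# One shared set of parameters, reused at every time step (hypothetical values)
PARAMS = {"w_xh": 0.7, "w_hh": 0.9}


def final_hidden_state(inputs, params=PARAMS):
    """Run a scalar tanh RNN over a sequence of any length."""
    h = 0.0
    for x in inputs:  # the loop simply runs as long as the sequence does
        h = math.tanh(params["w_hh"] * h + params["w_xh"] * x)
    return h


# Variable length: the same two parameters handle both sequences
short_result = final_hidden_state([1.0, 2.0])
long_result = final_hidden_state([1.0, 2.0, -1.0, 0.5, 3.0])

# Order matters: the same values in a different order give a different state
forward = final_hidden_state([1.0, 2.0, 3.0])
backward = final_hidden_state([3.0, 2.0, 1.0])
```

The number of parameters stays at two no matter how long the input is, which is exactly what parameter sharing buys us.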

RNNs: Backpropagation Through Time

Here, instead of backpropagating through a single network, we backpropagate the error at each individual time step and then across all time steps, from the current time t back to the beginning of the sequence.

This is why the algorithm is called backpropagation through time. It is very tricky to implement in practice.
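To make this less abstract, here is a minimal sketch of backpropagation through time for a scalar RNN, checked against a numerical gradient. The setup is my own choice for illustration: the recurrence is h_t = tanh(w * h_{t-1} + u * x_t), the loss is a squared error summed over all time steps, and the names `forward` and `bptt` and all the numbers are hypothetical.

```python
import math


def forward(w, u, xs, ys):
    """Return the total loss and all hidden states (hs[0] is the initial state)."""
    hs = [0.0]
    loss = 0.0
    for x, y in zip(xs, ys):
        h = math.tanh(w * hs[-1] + u * x)
        hs.append(h)
        loss += 0.5 * (h - y) ** 2
    return loss, hs


def bptt(w, u, xs, ys):
    """Gradients of the total loss with respect to w and u."""
    _, hs = forward(w, u, xs, ys)
    dw = du = 0.0
    dh_from_future = 0.0
    # Walk backwards from the last time step to the first
    for t in reversed(range(len(xs))):
        h, h_prev = hs[t + 1], hs[t]
        dh = (h - ys[t]) + dh_from_future  # loss at this step plus all later steps
        da = dh * (1.0 - h * h)            # back through tanh
        dw += da * h_prev                  # w is shared, so contributions add up
        du += da * xs[t]
        dh_from_future = da * w            # hand the gradient to step t - 1
    return dw, du


xs, ys = [0.5, -1.0, 0.3, 0.8], [0.2, -0.4, 0.1, 0.6]
w, u = 0.7, 1.1
dw, du = bptt(w, u, xs, ys)

# Numerical check with central differences
eps = 1e-6
numeric_dw = (forward(w + eps, u, xs, ys)[0] - forward(w - eps, u, xs, ys)[0]) / (2 * eps)
numeric_du = (forward(w, u + eps, xs, ys)[0] - forward(w, u - eps, xs, ys)[0]) / (2 * eps)
```

Notice the line `dh_from_future = da * w`: every step back in time multiplies the gradient by `w` once more.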

Standard RNN Gradient Flow

If we look closely, we can follow the flow of the gradient: as it travels back through the time steps, it gets multiplied by the weight matrix W over and over again.
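A quick way to get a feel for this repeated multiplication is to shrink the problem to a single number. This sketch is a deliberate simplification: it treats W as one scalar weight and ignores the derivative of the activation function. The function name `gradient_scale` is hypothetical.

```python
def gradient_scale(w, num_steps):
    """Factor a gradient picks up after num_steps multiplications by w."""
    scale = 1.0
    for _ in range(num_steps):
        scale *= w
    return scale


shrinking = gradient_scale(0.5, 50)  # far below 1: the gradient all but disappears
growing = gradient_scale(1.5, 50)    # far above 1: the gradient blows up
```

After only 50 steps, a weight of 0.5 leaves practically nothing of the gradient, while a weight of 1.5 inflates it by many orders of magnitude.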


  1. If the values in the weight matrix W are large, these repeated multiplications can lead to the Exploding Gradient problem: the gradients grow so big that training becomes unstable. To tackle this, we typically use Gradient Clipping, which scales down gradients that get too large.
  2. Conversely, if the values in W are small, the repeated multiplications shrink the gradient toward zero, which leads to the Vanishing Gradient problem. This is a very common problem in RNNs in particular.
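Gradient clipping, mentioned in point 1, can be sketched in a few lines. This version clips by norm, which is one common variant: if the gradient vector is longer than a chosen threshold, it is rescaled to that length while keeping its direction. The function name `clip_by_norm` and the threshold of 5.0 are assumptions for illustration.

```python
import math


def clip_by_norm(gradients, max_norm):
    """Rescale the gradient vector if its L2 norm exceeds max_norm."""
    norm = math.sqrt(sum(g * g for g in gradients))
    if norm <= max_norm:
        return list(gradients)  # small enough, leave it alone
    scale = max_norm / norm
    return [g * scale for g in gradients]


exploded = [300.0, -400.0]                      # L2 norm is 500
clipped = clip_by_norm(exploded, max_norm=5.0)  # same direction, norm of about 5
untouched = clip_by_norm([0.3, -0.4], max_norm=5.0)
```

Clipping only limits how large a single update can be. It does nothing for the vanishing case, which is why that problem needs different remedies.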

There are three main ways to tackle the vanishing gradient problem:

  1. Activation Function
  2. Weight Initialization
  3. Network Architecture

The Problem of Long-Term Dependencies

First, we should understand why the vanishing gradient is one of the major issues for Recurrent Neural Networks (RNNs).

If the input sentence is short, the vanishing gradient is not much of a problem. In such cases, an RNN can use information from the recent past to make its prediction.

But if a sentence has long-term dependencies, we need information from far back in the sequence. The gap between the relevant information and the point where we currently are becomes very large, and the gradient vanishes across that gap. As a result, the RNN is not able to connect the dots and eventually fails to do its job.
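We can measure this effect of distance in a toy setting. The sketch below feeds a scalar tanh RNN one input at the first step and zeros afterwards, then uses the chain rule to compute how strongly the final hidden state depends on that first input. The function name `sensitivity_to_first_input` and the weights 0.9 and 1.0 are assumptions I picked for illustration.

```python
import math


def sensitivity_to_first_input(num_steps, w_hh=0.9, w_xh=1.0):
    """Derivative of the final hidden state with respect to the first input."""
    h = 0.0
    dh_dx0 = 0.0
    for t in range(num_steps):
        x = 1.0 if t == 0 else 0.0              # only the first step has input
        h = math.tanh(w_hh * h + w_xh * x)
        local = 1.0 - h * h                     # derivative of tanh at this step
        direct = w_xh if t == 0 else 0.0        # x0 enters directly only at t = 0
        dh_dx0 = local * (w_hh * dh_dx0 + direct)  # chain rule through time
    return dh_dx0


near = sensitivity_to_first_input(2)    # the first word is 1 step back
middle = sensitivity_to_first_input(20)
far = sensitivity_to_first_input(50)    # the first word is 49 steps back
```

The further back the first input lies, the less the final state (and hence the gradient) feels it, which is the long-term dependency problem in miniature.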

Now here comes the part where we should be more focused.

How are we going to solve this problem of Vanishing Gradient?

Activation Functions

  • We can choose an activation function that helps keep the gradients from shrinking. For example, ReLU prevents the gradient from shrinking when x > 0.
  • The reason is simple: for all x > 0, the derivative of the ReLU function is always 1. Since it is not less than 1, passing through the activation does not shrink the gradient, which helps mitigate the Vanishing Gradient problem. You can confirm this by checking the graph of the ReLU activation function given below:
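The same point can be checked numerically. This small sketch compares the derivative of ReLU with the derivative of tanh at the same positive input and multiplies each over 20 time steps. The input value 2.0 and the step count are arbitrary choices for illustration.

```python
import math


def relu_grad(x):
    """Derivative of ReLU: 1 for positive inputs, 0 otherwise."""
    return 1.0 if x > 0 else 0.0


def tanh_grad(x):
    """Derivative of tanh: 1 - tanh(x)^2, which is at most 1."""
    return 1.0 - math.tanh(x) ** 2


steps = 20
relu_product = relu_grad(2.0) ** steps  # stays exactly 1.0
tanh_product = tanh_grad(2.0) ** steps  # collapses toward zero
```

Keep in mind that for x <= 0 the ReLU derivative is 0, so ReLU helps with vanishing gradients but is not a complete cure on its own.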

Parameter Initialization

  • We can also mitigate the Vanishing Gradient problem by initializing the weight matrix as an identity matrix and the biases to zero. This helps keep the weights from shrinking to zero.
Identity Matrix
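Here is a minimal sketch of why the identity matrix is a friendly starting point. It runs a ReLU recurrence with the input switched off, once with an identity weight matrix and once with a scaled-down one. The helper names `scaled_identity` and `relu_step`, the starting state, and the 100 steps are all assumptions for illustration.

```python
def scaled_identity(n, scale=1.0):
    """n x n matrix with `scale` on the diagonal and zeros elsewhere."""
    return [[scale if i == j else 0.0 for j in range(n)] for i in range(n)]


def relu_step(W, b, h):
    """One recurrent step with no input: h_new = relu(W h + b)."""
    return [
        max(0.0, sum(w * hj for w, hj in zip(row, h)) + bi)
        for row, bi in zip(W, b)
    ]


h0 = [0.5, 2.0, 1.0]
zero_bias = [0.0, 0.0, 0.0]

h_identity = list(h0)
h_small = list(h0)
for _ in range(100):
    h_identity = relu_step(scaled_identity(3), zero_bias, h_identity)
    h_small = relu_step(scaled_identity(3, scale=0.5), zero_bias, h_small)
```

With the identity matrix and zero biases the state is carried through all 100 steps unchanged, while the smaller weights wipe it out almost completely.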

Gated Cells

  • This is one of the most robust solutions to the vanishing gradient problem.
  • It focuses on controlling the flow of information through the network. In simple words, it filters out what is not important while keeping what is.
  • The most popular type of RNN that achieves this is the Long Short-Term Memory (LSTM) network. It relies on gated cells to track information across many time steps.

Long Short Term Memory (LSTM)

LSTM cells can control the information flow in the following ways:

  • Forget
  • Store
  • Update
  • Output

LSTM cells are able to track information through many time steps.

How LSTM operates

  1. It maintains a cell state, which is kept separate from the output (the hidden state).
  2. It uses gates to control the flow of information:
     • The forget gate gets rid of irrelevant information that has little or no relation to the current state.
     • The input gate stores relevant information from the current input.
     • The cell state is updated selectively.
     • The output gate returns a filtered version of the cell state.
  3. We can train the LSTM using backpropagation through time, and the way the cell state is updated allows the gradient to flow back largely uninterrupted. This greatly reduces the vanishing gradient problem.
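The steps above can be sketched as a single LSTM step in plain Python. To keep it readable, this is a scalar cell (hidden size 1) following the standard LSTM equations; the function name `lstm_step`, the parameter layout, and the hand-picked numbers are hypothetical, and nothing here is trained.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x, h_prev, c_prev, params):
    """One step of a scalar LSTM cell.

    params maps each gate name to a tuple (w_x, w_h, bias).
    """
    def pre_activation(name):
        w_x, w_h, bias = params[name]
        return w_x * x + w_h * h_prev + bias

    f = sigmoid(pre_activation("forget"))       # how much old cell state to keep
    i = sigmoid(pre_activation("input"))        # how much new information to store
    g = math.tanh(pre_activation("candidate"))  # new information from the input
    o = sigmoid(pre_activation("output"))       # how much cell state to expose

    c = f * c_prev + i * g   # selective update of the cell state
    h = o * math.tanh(c)     # output is a filtered version of the cell state
    return h, c


# Hand-picked parameters: forget gate almost fully open, input gate almost closed
params = {
    "forget": (0.0, 0.0, 10.0),
    "input": (0.0, 0.0, -10.0),
    "candidate": (1.0, 0.0, 0.0),
    "output": (0.0, 0.0, 0.0),
}

h, c = 0.0, 1.0
for step in range(100):
    x = math.sin(step)  # arbitrary input signal
    h, c = lstm_step(x, h, c, params)
```

With the forget gate close to 1 and the input gate close to 0, the cell state is still close to its starting value of 1.0 after 100 steps. The update `c = f * c_prev + i * g` is additive, which is what lets information (and gradients) travel across many time steps.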

Limitations of Recurrent Models

  • Encoding bottleneck: Say we have a very long sentence and need information from the first word to predict the last one. That information has to be encoded, maintained, and carried by the network all the way to the end. In practice this is very hard, and a lot of information can be lost.
  • Slow, with no parallelization: Because everything is processed time step by time step, RNNs can be quite slow, and there is no easy way to parallelize the computation.
  • Limited long-term memory: Together, the encoding bottleneck and the lack of parallelization lead to the most important problem, which is that RNNs do not have a long memory.

With that, let’s wrap up this blog on Recurrent Neural Networks (RNNs). If you find any issue with the concepts or the code, you can message me on my Twitter or LinkedIn. The next blog will be published on 29 March 2023.

Some words about me

I’m Mohit.❤️ You can also call me Chessman. I’m a Machine learning Developer and a competitive programmer. Most of my time is spent staring at a computer screen. During the day, I am usually programming, working to derive insight from large datasets. My skills include Data Analysis, Data Visualization, Machine learning, Deep Learning, DevOps and working toward Full Stack. I have developed a strong acumen for problem-solving, and I enjoy occasional challenges.

My Portfolio and Github.


