Activation functions — Why the need?


You might often wonder about activation functions and why we need them. There is a plethora of information on activation functions online, but can you get a quick introduction with minimal reading and useful insights? Of course you can, and you are in the right place. I have tried to jot down the points that helped me understand the “why” and the “when to use” of activation functions quickly, without spending too much time on the details. Let’s get started.

Purpose: to decide whether a neuron should be activated or not.

How: by applying non-linear transformations.

Why not linear? Because without non-linear transformations our model would just be a stack of linear regressions, and it would fail to learn the complexities of the data, just like any linear regression model. So, at each layer we use an activation function.
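
To see this concretely, here is a minimal NumPy sketch (the shapes and random weights are just for illustration): two linear layers stacked with no activation in between behave exactly like a single linear layer.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))      # a small batch of 4 inputs with 3 features
W1 = rng.normal(size=(3, 5))     # weights of "layer 1"
W2 = rng.normal(size=(5, 2))     # weights of "layer 2"

two_linear_layers = (x @ W1) @ W2   # stacked linear layers, no activation
one_linear_layer = x @ (W1 @ W2)    # a single equivalent linear layer

print(np.allclose(two_linear_layers, one_linear_layer))  # True
```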

How many are there? Quite a few:

Step function

Sigmoid function

Tanh

ReLU

Leaky ReLU

Softmax

Great! Can we learn about each of them? Sure :)

Step function


Essentially, if the input is greater than or equal to a threshold, the output is 1, meaning the neuron is activated; otherwise the output is 0. It is NOT used often in practice, largely because its gradient is zero almost everywhere, which makes gradient-based training impossible.
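
For illustration, a minimal NumPy sketch of the step function (the helper name and the threshold of 0 are just examples):

```python
import numpy as np

def step(x, threshold=0.0):
    # Fire (output 1) when the input reaches the threshold, stay at 0 otherwise.
    return np.where(x >= threshold, 1.0, 0.0)

print(step(np.array([-2.0, 0.0, 3.0])))  # [0. 1. 1.]
```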

Sigmoid function


What this function does is take the input and squash it into a value between 0 and 1, which can be interpreted as a probability. For that reason it is mostly used in the last layer of binary classification models.
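
A minimal NumPy sketch of the sigmoid:

```python
import numpy as np

def sigmoid(x):
    # Squashes any real number into the open interval (0, 1).
    return 1.0 / (1.0 + np.exp(-x))

print(sigmoid(np.array([-5.0, 0.0, 5.0])))  # ~[0.0067 0.5 0.9933]
```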

Tanh function


The hyperbolic tangent (tanh) function is essentially a scaled and shifted sigmoid, so it outputs values that lie between -1 and +1. It is mostly used in the hidden layers.
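
A small sketch showing both the output range and the “scaled, shifted sigmoid” relation, tanh(x) = 2 * sigmoid(2x) - 1:

```python
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
x = np.array([-2.0, 0.0, 2.0])

print(np.tanh(x))              # values lie between -1 and +1
print(2 * sigmoid(2 * x) - 1)  # identical: tanh is a scaled, shifted sigmoid
```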

ReLU function


The ReLU activation function is widely used in the hidden layers and is one of the most popular choices. What it does is simple: it outputs 0 for negative inputs and passes positive inputs through unchanged. It might look linear, but because of the kink at zero it is a non-linear transformation.
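
For illustration, ReLU in one line of NumPy:

```python
import numpy as np

def relu(x):
    # 0 for negative inputs, pass-through for positive inputs.
    return np.maximum(0.0, x)

print(relu(np.array([-3.0, -0.5, 0.0, 2.0])))  # [0. 0. 0. 2.]
```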

Leaky ReLU function


Leaky ReLU is an improvement over ReLU and tries to solve the “dying ReLU” problem, where neurons get stuck outputting zero. How is it similar to ReLU? For inputs greater than 0 the output is the input (just like ReLU), but for inputs less than 0 the output is not 0; instead it is the input multiplied by a very small constant (a common default is 0.01). How does it help? With ReLU, since the output is zero for inputs less than 0, the gradient is also zero during backpropagation, meaning there are no weight updates and those neurons stop learning. So, when the weights are seen not updating, it is a good idea to try Leaky ReLU in the hidden layers.
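
A minimal sketch of Leaky ReLU (the slope of 0.01 for negative inputs is a commonly used default, not the only choice):

```python
import numpy as np

def leaky_relu(x, negative_slope=0.01):
    # Same as ReLU for positive inputs; negative inputs keep a small,
    # non-zero slope so their gradients never become exactly zero.
    return np.where(x > 0, x, negative_slope * x)

print(leaky_relu(np.array([-3.0, 0.0, 2.0])))  # [-0.03  0.    2.  ]
```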

Softmax function


The softmax function is a popular choice for the last layer of a multi-class classification problem. It takes the raw scores (logits) and converts them into probabilities: each output lies between 0 and 1, and together they sum to 1, forming a probability distribution over the classes.
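
A minimal NumPy sketch (subtracting the maximum before exponentiating is a standard trick for numerical stability):

```python
import numpy as np

def softmax(logits):
    # Exponentiate and normalise so the outputs are positive and sum to 1.
    shifted = logits - np.max(logits)   # for numerical stability
    exps = np.exp(shifted)
    return exps / np.sum(exps)

scores = np.array([2.0, 1.0, 0.1])
probs = softmax(scores)
print(probs, probs.sum())  # ~[0.659 0.242 0.099] 1.0
```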

Conclusion: With this, I will end my introduction to activation functions. This write-up can be used as a quick reference to activation functions during interview prep, or in general while working on deep learning :)

I am working as a Senior Data Scientist at Hewlett Packard Enterprise. I love exploring new ideas and new places!! :)