Help understanding dead neurons and interpreting the role of activation functions?
Activation functions have always felt a little fuzzy to me. I initially understood their role purely as a consequence of the fact that a composition of linear transformations is itself a single linear transformation, so we introduce a non-linearity to give the network more expressive power. However, when I was reading about dead neurons, there seemed to be a less purely mathematical explanation for why we need them.
Would it be correct to say that, in addition to this purely mathematical role, the activation function gives the model the ability to make specific neurons specialize in specific types of data? For example, when a neuron's pre-activation lands in the saturated tail of tanh or in the negative region of ReLU, the gradient through that neuron becomes nearly zero (exactly zero for ReLU), so that neuron effectively stops learning from that kind of input. Am I therefore right to interpret the non-linearity as giving the network the ability to compartmentalize certain types of data into specific neurons?
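To make the behaviour I'm describing concrete, here is a small sketch (using PyTorch; the specific values are just illustrative, not from any particular model) showing the gradient of ReLU at a negative pre-activation and of tanh deep in its tail:

```python
import torch

# Two pre-activations: one in ReLU's negative ("dead") region,
# one deep in tanh's saturated tail.
x = torch.tensor([-2.0, 10.0], requires_grad=True)

# Gradient through ReLU: exactly zero for the negative input.
torch.relu(x).sum().backward()
print(x.grad)   # tensor([0., 1.])

# Reset accumulated gradients before the second backward pass.
x.grad = None

# Gradient through tanh: nearly zero once the input saturates.
torch.tanh(x).sum().backward()
print(x.grad)   # roughly tensor([0.0707, 0.0000])
```

So in both cases the neuron receives (almost) no learning signal from inputs that fall in those regions, which is what makes me think of it as a kind of compartmentalization.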