Ethan (@ethanrosenthal.com)

# Three ways to differentiate ReLU

When a function is not differentiable in the classical sense, there are multiple ways to compute a generalized derivative. This post looks at three generalizations of the classical derivative, each applied to the ReLU (rectified linear unit) function.

The ReLU function is a commonly used activation function for neural networks. It’s also called the ramp function because of the shape of its graph: flat for negative _x_, then rising with slope 1. The function is simply _r_(_x_) = max(0, _x_).

## Pointwise derivative

The **pointwise** derivative is 0 for _x_ < 0, 1 for _x_ > 0, and undefined at _x_ = 0. So except at 0, the pointwise derivative of the ramp function is the Heaviside function _H_(_x_).

In a real analysis course, you’d simply say _r_ ′(_x_) = _H_(_x_) because functions are only defined up to equivalence modulo sets of measure zero, i.e. the definition at _x_ = 0 doesn’t matter.

## Distributional derivative

In **distribution theory** you’d identify the function _r_(_x_) with the distribution whose action on a test function φ is

$$\langle r, \varphi \rangle = \int_{-\infty}^{\infty} r(x)\, \varphi(x)\, dx = \int_0^\infty x\, \varphi(x)\, dx.$$

Then the derivative of _r_ would be the distribution _r_ ′ satisfying

$$\langle r', \varphi \rangle = -\langle r, \varphi' \rangle = -\int_0^\infty x\, \varphi'(x)\, dx$$

for all smooth functions φ with compact support. You can prove using integration by parts that the above equals the integral of φ from 0 to ∞, which is the same as the action of _H_(_x_) on φ.

In this case the distributional derivative of _r_ is the same as the pointwise derivative of _r_ interpreted as a distribution. This does not happen in general [1]. For example, the pointwise derivative of _H_ is zero everywhere except at 0, where it is undefined, but the distributional derivative of _H_ is δ, the Dirac delta distribution.

For more on distributional derivatives, see How to differentiate a non-differentiable function.

## Subgradient

The subgradient of a convex function _f_ at a point _x_, written ∂ _f_(_x_), is the set of slopes of tangent lines to the graph of _f_ at _x_, that is, lines through (_x_, _f_(_x_)) that never rise above the graph.
If _f_ is differentiable at _x_, then there is only one slope, namely _f_ ′(_x_), and we typically say the subgradient of _f_ at _x_ is simply _f_ ′(_x_) when strictly speaking we should say it is the one-element set {_f_ ′(_x_)}.

A line tangent to the graph of the ReLU function at a negative value of _x_ has slope 0, and a tangent line at a positive _x_ has slope 1. But because there’s a sharp corner at _x_ = 0, a tangent at this point could have any slope between 0 and 1.

My dissertation was full of subgradients of convex functions. This made me uneasy because subgradients are not real-valued functions; they’re set-valued functions. Most of the time you can blithely ignore this distinction, but there’s always a nagging suspicion that it’s going to bite you unexpectedly.

[1] When _is_ the pointwise derivative of _f_ as a function equal to the derivative of _f_ as a distribution? It’s not enough for _f_ to be continuous, but it is sufficient for _f_ to be _absolutely_ continuous.

https://www.johndcook.com/blog/2026/04/30/derivative-of-relu/
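
As a rough numerical sanity check of the distributional and subgradient views of ReLU described in this post, here is a minimal Python sketch. Everything in it is an assumption chosen for illustration and not part of the post: the function names, the bump test function centered at 0.25, the midpoint-rule grid, and the convention _H_(0) = 0. It compares −∫ _r_ φ′ with ∫ _H_ φ, and tests candidate slopes at 0 against the inequality _r_(_y_) ≥ _r_(_x_) + _g_ (_y_ − _x_), which is one standard way to say a line through the point stays on or below the graph.

```python
import math


def relu(x):
    """Ramp function r(x) = max(0, x)."""
    return max(0.0, x)


def heaviside(x):
    """Heaviside step; the value at 0 is arbitrary (0.0 chosen here)."""
    return 1.0 if x > 0 else 0.0


CENTER = 0.25  # hypothetical shift so the test function is not symmetric about 0


def phi(x):
    """Smooth bump with compact support (CENTER - 1, CENTER + 1)."""
    u = x - CENTER
    if abs(u) >= 1.0:
        return 0.0
    return math.exp(-1.0 / (1.0 - u * u))


def dphi(x):
    """Exact derivative of phi, obtained with the chain rule."""
    u = x - CENTER
    if abs(u) >= 1.0:
        return 0.0
    return phi(x) * (-2.0 * u) / (1.0 - u * u) ** 2


def midpoint(f, a, b, n):
    """Midpoint-rule approximation of the integral of f over [a, b]."""
    h = (b - a) / n
    return h * sum(f(a + (i + 0.5) * h) for i in range(n))


# The interval [A, B] contains the support of phi.
A, B, N = -1.0, 2.0, 30000

# Action of the distributional derivative r' on phi: -<r, phi'>.
lhs = -midpoint(lambda x: relu(x) * dphi(x), A, B, N)
# Action of H on phi: the integral of phi over (0, infinity).
rhs = midpoint(lambda x: heaviside(x) * phi(x), A, B, N)


def relu_subdifferential(x):
    """Subgradient of ReLU at x as a closed interval (low, high)."""
    if x < 0:
        return (0.0, 0.0)
    if x > 0:
        return (1.0, 1.0)
    return (0.0, 1.0)


def satisfies_subgradient_inequality(g, x, ys):
    """Check r(y) >= r(x) + g * (y - x) at every sample point y."""
    return all(relu(y) >= relu(x) + g * (y - x) - 1e-12 for y in ys)


samples = [k / 10.0 for k in range(-30, 31)]

print("difference between -<r, phi'> and <H, phi>:", abs(lhs - rhs))
print("subgradient at 0:", relu_subdifferential(0.0))
print("slope 0.5 at 0 ok:", satisfies_subgradient_inequality(0.5, 0.0, samples))
print("slope 1.5 at 0 ok:", satisfies_subgradient_inequality(1.5, 0.0, samples))
```

The two integrals should agree up to quadrature error, and only slopes in [0, 1] should pass the inequality at 0. Returning an interval rather than a number keeps the set-valued nature of the subgradient explicit instead of blithely ignoring it.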