LSTM as a sideways ResNet
I watched a talk where Ilya Sutskever described LSTMs as "a ResNet rotated 90 degrees". I love this analogy and will explain the connection between LSTM memory vectors and residual streams.
The ResNet view
In a ResNet, each layer takes its input and adds to it:

h_{l+1} = h_l + f(h_l)

This lets the network learn small, composable changes without losing what it already knows. The vector h_l acts as a residual stream: it carries everything computed so far, and each layer only writes a small update into it.
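Here's a minimal PyTorch sketch of that update; the two-layer MLP inside the block is just an illustrative stand-in for f, not any particular ResNet's layer:

```python
import torch.nn as nn

class ResidualBlock(nn.Module):
    """h_{l+1} = h_l + f(h_l): the block only adds an edit to its input."""

    def __init__(self, dim: int):
        super().__init__()
        # f is a small two-layer MLP here, purely for illustration.
        self.f = nn.Sequential(
            nn.Linear(dim, dim),
            nn.ReLU(),
            nn.Linear(dim, dim),
        )

    def forward(self, h):
        # The skip connection: the output starts from h unchanged.
        return h + self.f(h)
```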
The original reason for adding skip connections in ResNets was to avoid the vanishing gradients problem, but that's out of scope for this post.
The LSTM view
LSTMs do something very similar, but over time instead of depth. Each step has a "memory" vector c_t that is updated as

c_t = f_t * c_{t-1} + i_t * c̃_t

where f_t is the forget gate, i_t is the input gate, and c̃_t is the new candidate content. That's a gated residual update. The memory stream c_t flows horizontally across time steps, mostly carried forward unchanged, with the gates making small edits at each step.
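A minimal sketch of that memory update, assuming hypothetical weight matrices W_f, W_i, W_c, with biases and the output gate omitted to keep the focus on c_t:

```python
import torch

def lstm_memory_step(c_prev, x_t, h_prev, W_f, W_i, W_c):
    # Gates are computed from the previous hidden state and the current input.
    z = torch.cat([h_prev, x_t], dim=-1)
    f_t = torch.sigmoid(z @ W_f)   # forget gate: how much old memory to keep
    i_t = torch.sigmoid(z @ W_i)   # input gate: how much new content to write
    c_tilde = torch.tanh(z @ W_c)  # candidate content for this step

    # The gated residual update: mostly carry c_{t-1} forward, add a small edit.
    return f_t * c_prev + i_t * c_tilde
```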
What about transformers?
Transformers sort of do both. As in ResNets, there's a residual stream flowing vertically through the layers that protects the flow of information and keeps gradients alive. Information is also transmitted time-wise, but not via a hard-coded mechanism as in LSTMs; instead, it's a learned, emergent property of attention.
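A sketch of a pre-norm transformer block makes both directions explicit: the residual stream x is only ever added to, attention mixes information across positions (time-wise), and the MLP edits each position (layer-wise). The dimensions and module choices below are illustrative, not from any particular model:

```python
import torch.nn as nn

class TransformerBlock(nn.Module):
    def __init__(self, dim: int, n_heads: int):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim),
            nn.GELU(),
            nn.Linear(4 * dim, dim),
        )

    def forward(self, x):
        # Time-wise: attention mixes information across positions...
        a = self.norm1(x)
        x = x + self.attn(a, a, a, need_weights=False)[0]
        # ...layer-wise: the MLP adds a per-position edit to the residual stream.
        x = x + self.mlp(self.norm2(x))
        return x
```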
Takeaway
I think this is an interesting design principle: protect the flow of information, and let the model make small, reversible edits along the way.