LSTM as a sideways ResNet

I watched a talk where Ilya Sutskever described LSTMs as "a ResNet rotated 90 degrees". I love this analogy and will use it to explain the connection between memory vectors and residual streams.


The ResNet View

In a ResNet, each layer takes its input and adds to it:

x_{ℓ+1} = x_ℓ + Δ_ℓ

This lets the network learn small, composable changes without losing what it already knows. The vector x_ℓ is often conceptualized as a residual stream that encodes state. This stream flows vertically, from layer to layer.
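To make the shape of that update concrete, here is a minimal sketch in PyTorch. The names (ResidualBlock, delta) and the two-layer MLP used as the delta function are my own illustration, not anything prescribed by the original ResNet architecture:

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Each layer adds a small delta to the residual stream instead of replacing it."""

    def __init__(self, dim):
        super().__init__()
        # The delta function here is an arbitrary two-layer MLP; any sub-network works.
        self.delta = nn.Sequential(
            nn.Linear(dim, dim),
            nn.ReLU(),
            nn.Linear(dim, dim),
        )

    def forward(self, x):
        # x_{l+1} = x_l + delta_l(x_l): the stream passes through untouched,
        # and the layer only contributes an additive edit.
        return x + self.delta(x)


stream = torch.randn(8, 64)                    # batch of 8 vectors on the residual stream
for block in [ResidualBlock(64) for _ in range(4)]:
    stream = block(stream)                     # the stream flows vertically through layers
```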

The original reason for adding skip connections in ResNets was to avoid the vanishing gradients problem, but that's out of scope for this post.


The LSTM View: Memory as a Residual Stream

LSTMs do something very similar, but over time instead of depth.

Each step has a "memory" vector c_t, often called an "integrator", that's passed forward like this:

c_{t+1} = forget_t ⋅ c_t + input_t ⋅ new_info_t

That's a gated residual update. The memory stream c_t is just another form of a residual stream, except it flows horizontally, across time steps.
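Here is a sketch of that update, again in PyTorch. It is deliberately simplified to isolate the memory stream: the gates are computed from the current input and memory, and the hidden state and output gate of a full LSTM are omitted, so the names (MemoryStream, forget_gate, input_gate, candidate) are illustrative rather than the canonical LSTM formulation:

```python
import torch
import torch.nn as nn

class MemoryStream(nn.Module):
    """Gated residual update over time: c_{t+1} = forget_t * c_t + input_t * new_info_t.
    Simplified: a real LSTM computes its gates from the hidden state h_t and also
    carries an output gate; both are left out here to highlight the memory stream."""

    def __init__(self, input_dim, memory_dim):
        super().__init__()
        self.forget_gate = nn.Linear(input_dim + memory_dim, memory_dim)
        self.input_gate = nn.Linear(input_dim + memory_dim, memory_dim)
        self.candidate = nn.Linear(input_dim + memory_dim, memory_dim)

    def forward(self, x_t, c_t):
        z = torch.cat([x_t, c_t], dim=-1)
        forget_t = torch.sigmoid(self.forget_gate(z))    # how much old memory to keep
        input_t = torch.sigmoid(self.input_gate(z))      # how much new content to write
        new_info_t = torch.tanh(self.candidate(z))       # the candidate content itself
        # The memory flows horizontally across time, edited rather than overwritten.
        return forget_t * c_t + input_t * new_info_t


cell = MemoryStream(input_dim=16, memory_dim=32)
c = torch.zeros(1, 32)
for x_t in torch.randn(10, 1, 16):               # a 10-step input sequence
    c = cell(x_t, c)                             # the stream is passed step to step
```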


What About Transformers?

Transformers sort of do both. As in ResNets, there's a residual stream flowing vertically, layer by layer. Information is also transmitted across time, but not via a hard-coded mechanism as in LSTMs. Instead, it's a learned, emergent property via attention.
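A minimal pre-norm transformer block makes both flows visible: the x + ... additions are the vertical residual stream, and the self-attention call is the learned mixing across time. The specific block layout below (layer norms, 4x MLP) is a common convention I've assumed, not the only one:

```python
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    """Pre-norm transformer block: additions to x form the vertical residual stream,
    while self-attention moves information horizontally across time steps."""

    def __init__(self, dim, num_heads):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim),
            nn.GELU(),
            nn.Linear(4 * dim, dim),
        )

    def forward(self, x):                        # x: (batch, time, dim)
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h)         # learned mixing across time steps
        x = x + attn_out                         # additive edit to the residual stream
        x = x + self.mlp(self.norm2(x))          # another additive edit
        return x


block = TransformerBlock(dim=64, num_heads=4)
tokens = torch.randn(2, 10, 64)                  # batch of 2 sequences, 10 steps each
out = block(tokens)                              # same shape; the stream is preserved
```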


Takeaway

I think this is an interesting design principle: protect the flow of information, and let the model make small, reversible edits along the way.

Copyright Ricardo Decal. richarddecal.com