LSTM as a sideways ResNet
I watched a talk where Ilya Sutskever described LSTMs as "a ResNet rotated 90 degrees". I love this analogy and will explain the connection between LSTM memory vectors and residual streams.
The ResNet view
In a ResNet, each layer takes its input and adds to it:

h_{l+1} = h_l + f(h_l)

This lets the network learn small, composable changes without losing what it already knows. The vector h_l acts as a residual stream: it carries everything computed so far, and each layer only writes a small update into it.
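Here's a minimal PyTorch sketch of that update; the two-layer MLP inside the block is just an illustrative stand-in for f, not any particular ResNet's layer:

```python
import torch.nn as nn

class ResidualBlock(nn.Module):
    """h_{l+1} = h_l + f(h_l): the block only adds an edit to its input."""

    def __init__(self, dim: int):
        super().__init__()
        # f is a small two-layer MLP here, purely for illustration.
        self.f = nn.Sequential(
            nn.Linear(dim, dim),
            nn.ReLU(),
            nn.Linear(dim, dim),
        )

    def forward(self, h):
        # The skip connection: the output starts from h unchanged.
        return h + self.f(h)
```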
The original reason for adding skip connections in ResNets was to avoid the vanishing gradients problem, but that's out of scope for this post.
The LSTM view
LSTMs do something very similar, but over time instead of depth. Each step has a "memory" vector c_t that is updated as

c_t = f_t * c_{t-1} + i_t * c̃_t

where f_t is the forget gate, i_t is the input gate, and c̃_t is the new candidate content. That's a gated residual update. The memory stream c_t flows horizontally across time steps, mostly carried forward unchanged, with the gates making small edits at each step.
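A minimal sketch of that memory update, assuming hypothetical weight matrices W_f, W_i, W_c, with biases and the output gate omitted to keep the focus on c_t:

```python
import torch

def lstm_memory_step(c_prev, x_t, h_prev, W_f, W_i, W_c):
    # Gates are computed from the previous hidden state and the current input.
    z = torch.cat([h_prev, x_t], dim=-1)
    f_t = torch.sigmoid(z @ W_f)   # forget gate: how much old memory to keep
    i_t = torch.sigmoid(z @ W_i)   # input gate: how much new content to write
    c_tilde = torch.tanh(z @ W_c)  # candidate content for this step

    # The gated residual update: mostly carry c_{t-1} forward, add a small edit.
    return f_t * c_prev + i_t * c_tilde
```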
What about transformers?
Transformers sort of do both. As in ResNets, there's a residual stream flowing vertically through the layers that protects the flow of information and keeps gradients alive. Information is also transmitted time-wise, but not via a hard-coded mechanism as in LSTMs; instead, it's a learned, emergent property of attention.
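A sketch of a pre-norm transformer block makes both directions explicit: the residual stream x is only ever added to, attention mixes information across positions (time-wise), and the MLP edits each position (layer-wise). The dimensions and module choices below are illustrative, not from any particular model:

```python
import torch.nn as nn

class TransformerBlock(nn.Module):
    def __init__(self, dim: int, n_heads: int):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim),
            nn.GELU(),
            nn.Linear(4 * dim, dim),
        )

    def forward(self, x):
        # Time-wise: attention mixes information across positions...
        a = self.norm1(x)
        x = x + self.attn(a, a, a, need_weights=False)[0]
        # ...layer-wise: the MLP adds a per-position edit to the residual stream.
        x = x + self.mlp(self.norm2(x))
        return x
```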
Takeaway
I think this is an interesting design principle: protect the flow of information, and let the model make small, reversible edits along the way.