LSTM as a sideways ResNet
I watched a talk where Ilya Sutskever described LSTMs as "a ResNet rotated 90 degrees". I love this analogy, and in this post I'll explain the connection between memory vectors and residual streams.
The ResNet View
In a ResNet, each layer takes its input and adds to it:

$$x_{l+1} = x_l + F(x_l)$$

This lets the network learn small, composable changes without losing what it already knows. The vector $x_l$ is the residual stream: it flows from layer to layer, and each layer only nudges it.
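Here's a minimal sketch of that idea in PyTorch (not from the talk, just an illustration): the block's transformation `F` is an arbitrary little MLP I made up for the example, and the forward pass returns `x + F(x)`, so the input passes through untouched and the layer only contributes a delta.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        # F can be any transformation; a tiny MLP is used here purely for illustration.
        self.f = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The layer adds its output to its input; x itself flows through unchanged.
        return x + self.f(x)

x = torch.randn(8, 64)
y = ResidualBlock(64)(x)   # same shape as x: the block has only nudged the stream
```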
The original reason for adding skip connections in ResNets was to avoid the vanishing gradients problem, but that's out of scope for this post.
The LSTM View: Memory as a Residual Stream
LSTMs do something very similar, but over time instead of depth.
Each step has a "memory" vector (the cell state), sometimes described as an "integrator". At every time step it is updated as

$$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$$

where $f_t$ is the forget gate, $i_t$ is the input gate, and $\tilde{c}_t$ is the candidate update. That's a gated residual update. The memory stream flows sideways through time, and each step makes a small, gated edit to it rather than overwriting it, just as each ResNet layer makes a small edit to the vertical stream.
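A tiny sketch of just this memory update, assuming the gates and candidate have already been computed elsewhere (in a real LSTM they come from the current input and previous hidden state; here they are random placeholders):

```python
import torch

def memory_update(c_prev, f_gate, i_gate, candidate):
    # c_t = f_t * c_{t-1} + i_t * candidate_t  -- the gated residual update
    return f_gate * c_prev + i_gate * candidate

c = torch.zeros(64)
for _ in range(100):
    # Placeholder gates; a real LSTM computes these from x_t and h_{t-1}.
    f = torch.sigmoid(torch.randn(64) + 2.0)   # forget gate biased toward 1: keep the memory
    i = torch.sigmoid(torch.randn(64) - 2.0)   # input gate biased toward 0: make small edits
    g = torch.tanh(torch.randn(64))            # candidate memory content
    c = memory_update(c, f, i, g)
```

With the forget gate near 1 and the input gate near 0, the memory passes through almost unchanged, which is exactly the skip-connection behavior, rotated into the time dimension.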
What About Transformers?
Transformers sort of do both. As with ResNets, there's a residual stream flowing vertically, layer-wise. Information is also transmitted time-wise, but not via a hard-coded mechanism as in LSTMs; instead, it's a learned, emergent behavior of attention.
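A rough PyTorch sketch of that structure (a simplified pre-norm block of my own, not any particular library's implementation): the `x + ...` additions are the vertical residual stream, and `nn.MultiheadAttention` is what moves information across time steps.

```python
import torch
import torch.nn as nn

class Block(nn.Module):
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Attention mixes information across time steps, then adds it back into the stream.
        h = self.norm1(x)
        a, _ = self.attn(h, h, h)
        x = x + a
        # The MLP edits each position independently, again as a residual addition.
        return x + self.mlp(self.norm2(x))

x = torch.randn(2, 16, 64)   # (batch, time, dim)
y = Block(64)(x)             # same shape; each block only nudges the vertical stream
```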
Takeaway
I think this is an interesting design principle: protect the flow of information, and let the model make small, reversible edits along the way.