What do 1.58-bit models even mean?
Today I was reading about 1-bit and 1.58-bit networks, which enable faster inference using less compute. The first question I had was "what does 1.58 bits even mean?"
It boils down to this: 1-bit networks have just two possible weight values, {-1, +1}. A 1.58-bit network is ternary: each weight is one of {-1, 0, +1}. Three possible values carry log2(3) ≈ 1.58 bits of information per weight, which is where the name comes from.
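Here's the arithmetic behind that number as a quick sanity check (nothing model-specific, just the information content per weight):

```python
import math

# 1-bit:   2 possible values per weight -> log2(2) = 1.0 bit
# ternary: 3 possible values per weight -> log2(3) ≈ 1.58 bits
print(math.log2(2))  # 1.0
print(math.log2(3))  # 1.584962500721156
```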
How do you store "1.58 bits" on real hardware? Since bits come in whole numbers, you pack several weights together into the same byte: 5 ternary weights have 3^5 = 243 possible combinations, which fits in the 256 values a byte can hold, so you end up spending 8/5 = 1.6 bits per weight.
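Here's a sketch of how that packing could work. The base-3 layout below is my own illustration, not the format any particular inference kernel actually uses, but it shows why 5 ternary weights fit in one byte:

```python
import itertools

def pack5(ternary):
    """Pack 5 ternary weights (each -1, 0, or +1) into one byte.

    Shift each weight from {-1, 0, +1} to a base-3 digit {0, 1, 2},
    then read the 5 digits as a base-3 number: 3**5 = 243 <= 256,
    so the result always fits in a single byte.
    """
    assert len(ternary) == 5 and all(w in (-1, 0, 1) for w in ternary)
    value = 0
    for w in ternary:
        value = value * 3 + (w + 1)
    return value  # 0..242

def unpack5(byte):
    """Recover the 5 ternary weights from a packed byte."""
    digits = []
    for _ in range(5):
        digits.append(byte % 3 - 1)  # peel off the least-significant digit
        byte //= 3
    return digits[::-1]

# Round-trip check over all 3**5 = 243 possible groups of 5 weights.
for combo in itertools.product((-1, 0, 1), repeat=5):
    assert unpack5(pack5(list(combo))) == list(combo)
print("8 bits / 5 weights =", 8 / 5, "bits per weight")
```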
Having the third value (the zero weight) opens up a few tricks over binary weights. One is that it lets you sparsify models by pruning no-op connections: a zero weight simply drops its input. It also gives you a bigger effective dynamic range, so you can quantize fp16 transformers without losing as much expressive power as you do with 1-bit networks.
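To make the quantization idea concrete, here's a minimal sketch of one way to ternarize an fp16 weight matrix using a single absmean scale. The function name and the exact scaling rule are illustrative assumptions on my part, not a specific paper's recipe:

```python
import numpy as np

def ternarize(W, eps=1e-8):
    """Quantize a float weight matrix to {-1, 0, +1} plus one fp scale.

    Absmean-style scheme: divide by the mean absolute value, round,
    then clip to the ternary range. Weights near zero map to 0,
    which is exactly what enables the sparsification trick above.
    """
    scale = np.abs(W).mean() + eps
    W_ternary = np.clip(np.round(W / scale), -1, 1).astype(np.int8)
    return W_ternary, scale

# Dequantized approximation: W ≈ scale * W_ternary
W = np.random.randn(4, 4).astype(np.float16)
W_t, s = ternarize(W.astype(np.float32))
print(W_t)                # entries are only -1, 0, or +1
print(np.mean(W_t == 0))  # fraction of connections that got zeroed out
```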