What do 1.58-bit models even mean?
Today I was reading about 1-bit and 1.58-bit networks, which enable faster inference using less compute. The first question I had was "what does 1.58 bits even mean?"
It boils down to this: 1-bit networks have just two possible weight values, {-1, +1}. A 1.58-bit network is ternary: each weight is one of {-1, 0, +1}. Three possible values carry log2(3) ≈ 1.58 bits of information per weight, which is where the name comes from.
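Here's the arithmetic behind that number as a quick sanity check (nothing model-specific, just the information content per weight):

```python
import math

# 1-bit:   2 possible values per weight -> log2(2) = 1.0 bit
# ternary: 3 possible values per weight -> log2(3) ≈ 1.58 bits
print(math.log2(2))  # 1.0
print(math.log2(3))  # 1.584962500721156
```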
How do you store "1.58 bits" on real hardware? Since bits come in whole numbers, you pack several weights together into the same byte: 5 ternary weights have 3^5 = 243 possible combinations, which fits in the 256 values a byte can hold, so you end up spending 8/5 = 1.6 bits per weight.
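Here's a sketch of how that packing could work. The base-3 layout below is my own illustration, not the format any particular inference kernel actually uses, but it shows why 5 ternary weights fit in one byte:

```python
import itertools

def pack5(ternary):
    """Pack 5 ternary weights (each -1, 0, or +1) into one byte.

    Shift each weight from {-1, 0, +1} to a base-3 digit {0, 1, 2},
    then read the 5 digits as a base-3 number: 3**5 = 243 <= 256,
    so the result always fits in a single byte.
    """
    assert len(ternary) == 5 and all(w in (-1, 0, 1) for w in ternary)
    value = 0
    for w in ternary:
        value = value * 3 + (w + 1)
    return value  # 0..242

def unpack5(byte):
    """Recover the 5 ternary weights from a packed byte."""
    digits = []
    for _ in range(5):
        digits.append(byte % 3 - 1)  # peel off the least-significant digit
        byte //= 3
    return digits[::-1]

# Round-trip check over all 3**5 = 243 possible groups of 5 weights.
for combo in itertools.product((-1, 0, 1), repeat=5):
    assert unpack5(pack5(list(combo))) == list(combo)
print("8 bits / 5 weights =", 8 / 5, "bits per weight")
```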
Having the third value (the zero weight) opens up a few tricks over binary weights. One is that it lets you sparsify models by pruning no-op connections: a zero weight simply drops its input. It also gives you a bigger effective dynamic range, so you can quantize fp16 transformers without losing as much expressive power as you do with 1-bit networks.
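To make the quantization idea concrete, here's a minimal sketch of one way to ternarize an fp16 weight matrix using a single absmean scale. The function name and the exact scaling rule are illustrative assumptions on my part, not a specific paper's recipe:

```python
import numpy as np

def ternarize(W, eps=1e-8):
    """Quantize a float weight matrix to {-1, 0, +1} plus one fp scale.

    Absmean-style scheme: divide by the mean absolute value, round,
    then clip to the ternary range. Weights near zero map to 0,
    which is exactly what enables the sparsification trick above.
    """
    scale = np.abs(W).mean() + eps
    W_ternary = np.clip(np.round(W / scale), -1, 1).astype(np.int8)
    return W_ternary, scale

# Dequantized approximation: W ≈ scale * W_ternary
W = np.random.randn(4, 4).astype(np.float16)
W_t, s = ternarize(W.astype(np.float32))
print(W_t)                # entries are only -1, 0, or +1
print(np.mean(W_t == 0))  # fraction of connections that got zeroed out
```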