The same can be said about any recurrent network. To predict token n+1 you can either recalculate the hidden state from scratch up to token n, or reuse the hidden state for token n cached from the previous forward pass. The result is identical; the only difference is the amount of memory and computation you trade off.
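
A minimal PyTorch sketch of that equivalence (the GRU and the sizes are just illustrative):

    import torch

    torch.manual_seed(0)
    rnn = torch.nn.GRU(input_size=4, hidden_size=8)
    tokens = torch.randn(10, 1, 4)  # (seq_len, batch, input_size)

    # Option A: recalculate the hidden state from scratch up to token n.
    _, h_full = rnn(tokens)                # O(n) work every step

    # Option B: reuse the hidden state cached after token n-1.
    _, h_prev = rnn(tokens[:-1])           # "the previous forward pass"
    _, h_step = rnn(tokens[-1:], h_prev)   # O(1) work for the new token

    print(torch.allclose(h_full, h_step))  # True: same state, less compute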

The thing is that, fundamentally, an auto-regressive transformer is a model whose state (the KV cache) grows linearly with the number of tokens, with no compression, which is what gives it (theoretically) perfect recall.
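
A toy sketch of that linearly growing state (single attention head, made-up random weights; a real model keeps one such cache per layer and head):

    import torch

    # The model's "state" is just the list of keys/values,
    # one entry per token, never compressed or evicted.
    d = 8
    Wq, Wk, Wv = (torch.randn(d, d) for _ in range(3))
    K_cache, V_cache = [], []

    def step(x):                    # x: (d,) embedding of the newest token
        q = x @ Wq
        K_cache.append(x @ Wk)      # the state grows by one key...
        V_cache.append(x @ Wv)      # ...and one value, for every token
        K, V = torch.stack(K_cache), torch.stack(V_cache)  # (n, d)
        attn = torch.softmax(q @ K.T / d**0.5, dim=-1)
        return attn @ V             # attends over every past token

    for _ in range(5):
        step(torch.randn(d))
    print(len(K_cache))             # 5 -- linear in the number of tokens

Nothing ever gets summarized away, which is exactly the memory and compute price paid for that recall.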


