To be clear, this is a 7B model. It's just trained on 1.7 trillion tokens. At first I was confused why they were making such a big deal of a massive 1.7T model outperforming a 7B model.
By the way, GPT-4 has been rumored to be a ~1.7T-parameter model, although OpenAI has not confirmed this to my knowledge.
The most interesting bit is that this is an RWKV model, meaning a constant-size state (no quadratic attention). AFAIK it's the biggest open-weights non-transformer model.
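To make "constant size state" concrete: instead of attending over the whole context, the time mixing keeps a fixed-size running state per channel. Here's a rough numpy sketch of an RWKV-4-style WKV recurrence (the real model adds token shift, a numerically stable formulation, and matrix-valued state in newer versions, so treat this as illustrative only):

```python
import numpy as np

def wkv_recurrence(k, v, w, u):
    """Simplified RWKV-4-style WKV time mixing, per channel.

    k, v : (T, C) key/value projections for T tokens, C channels
    w    : (C,) per-channel decay (negative, so exp(w) < 1)
    u    : (C,) bonus applied to the current token
    The running state (a, b) is a fixed-size (C,) pair, independent of T,
    which is what "constant size state" means -- nothing grows with context.
    """
    T, C = k.shape
    a = np.zeros(C)                      # decayed weighted sum of past values
    b = np.zeros(C)                      # decayed sum of past weights (normaliser)
    out = np.zeros((T, C))
    for t in range(T):
        # mix the accumulated history with the current token's k/v
        out[t] = (a + np.exp(u + k[t]) * v[t]) / (b + np.exp(u + k[t]))
        # update the constant-size state: decay history, add the current token
        a = np.exp(w) * a + np.exp(k[t]) * v[t]
        b = np.exp(w) * b + np.exp(k[t])
    return out
```

The point is the per-token cost stays constant no matter how long the context is, whereas attention's per-token cost grows with context length.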
My bet is that this is the reason they are scoring high in "their" benchmarks. For models which are just trained on completely unlabelled data, like llama, 0-shot won't work well.

E.g. for llama, HellaSwag accuracy is 57.13% in their benchmark, compared to 78.59% in [1].
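For context, harnesses usually score HellaSwag 0-shot by comparing log-likelihoods of the candidate endings, and differences in prompt formatting or length normalisation can shift the number by several points. A rough sketch of that kind of scoring (names are made up for illustration, assuming a HuggingFace-style causal LM, not any particular harness's API):

```python
import torch
import torch.nn.functional as F

def pick_ending(model, tokenizer, context, endings):
    """0-shot multiple choice: pick the ending with the highest
    length-normalised log-likelihood under the model.

    No in-context examples are given; choices like normalisation and
    how the endings are tokenised vary between harnesses and affect
    the reported accuracy.
    """
    scores = []
    ctx_ids = tokenizer.encode(context)
    for ending in endings:
        end_ids = tokenizer.encode(" " + ending)
        ids = torch.tensor([ctx_ids + end_ids])
        with torch.no_grad():
            logits = model(ids).logits                    # (1, T, vocab)
        logps = F.log_softmax(logits[0, :-1], dim=-1)     # position t predicts token t+1
        targets = ids[0, 1:]
        # log-prob of each ending token given the context and earlier ending tokens
        end_lp = logps[len(ctx_ids) - 1:].gather(-1, targets[len(ctx_ids) - 1:, None])
        scores.append(end_lp.sum().item() / len(end_ids))  # length-normalised
    return int(torch.tensor(scores).argmax())
```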
How does that work with the RWKV architecture? Wouldn't you have to feed the same data through all the experts, regardless of whether they're currently active, to keep the rolling state consistent? Or am I misunderstanding the architecture?
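To be concrete about what I mean by "active" experts, here's a minimal sketch of a token-routed (switch-style) MoE feed-forward block, with made-up names and shapes. In a transformer this block is stateless per token, so skipping experts is fine; my question is whether the state-carrying RWKV time mixing can be routed the same way.

```python
import torch
import torch.nn as nn

class TinyMoEFFN(nn.Module):
    """Minimal top-1 (switch-style) MoE feed-forward block, for illustration.

    Each token is routed to exactly one expert; the other experts never
    see that token. That only works cleanly because the FFN has no state
    carried across tokens.
    """
    def __init__(self, d_model=64, d_ff=256, n_experts=4):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):                       # x: (tokens, d_model)
        gate = self.router(x).softmax(-1)       # (tokens, n_experts)
        top = gate.argmax(-1)                   # chosen expert per token
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            mask = top == i
            if mask.any():                      # only the routed tokens run here
                out[mask] = gate[mask, i, None] * expert(x[mask])
        return out
```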