I’m finishing up a language identification model that runs on CPU at ~70k texts/s on a single thread, with a 13 MB model artifact and 148 supported languages (though only ~100 have good accuracy).
The model is trained as static embeddings derived from the Gemma 3 token embeddings.
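For intuition, static-embedding classifiers of this kind typically look up a fixed embedding per token, mean-pool them, and apply a linear head over the language classes. The sketch below illustrates that pattern with toy random weights and hypothetical dimensions; it is not the actual WordLlamaDetect architecture or weights, just a guess at the general shape under those assumptions.

```python
import numpy as np

# Toy stand-in sizes (hypothetical): the real model uses the Gemma 3
# tokenizer vocabulary and its own embedding width.
VOCAB_SIZE = 1000   # stand-in for the tokenizer vocabulary size
EMBED_DIM = 64      # stand-in for the static embedding width
N_LANGS = 148       # number of supported languages, from the post

rng = np.random.default_rng(0)
embeddings = rng.standard_normal((VOCAB_SIZE, EMBED_DIM)).astype(np.float32)
W = rng.standard_normal((EMBED_DIM, N_LANGS)).astype(np.float32)
b = np.zeros(N_LANGS, dtype=np.float32)

def detect(token_ids: list[int]) -> int:
    """Mean-pool the static token embeddings, then score each language
    with a linear classifier and return the argmax class index."""
    pooled = embeddings[token_ids].mean(axis=0)  # (EMBED_DIM,)
    logits = pooled @ W + b                      # (N_LANGS,)
    return int(np.argmax(logits))

# A list of token ids would come from the tokenizer in the real pipeline.
lang_id = detect([3, 17, 256, 999])
print(lang_id)
```

Because inference is just a table lookup, a mean, and one matrix-vector product, this kind of model is cheap enough to explain throughput in the tens of thousands of texts per second on a single CPU thread.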
https://github.com/dleemiller/WordLlamaDetect