I’m finishing up a language identification model that runs on CPU at ~70k texts/s on a single thread, with a 13 MB model artifact and 148 supported languages (though only ~100 have good accuracy).
The model is trained as static embeddings derived from the Gemma 3 token embeddings.
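For intuition, static-embedding classifiers of this kind typically look up a fixed embedding per token, mean-pool them, and apply a linear head over the language classes. The sketch below illustrates that pattern with toy random weights and hypothetical dimensions; it is not the actual WordLlamaDetect architecture or weights, just a guess at the general shape under those assumptions.

```python
import numpy as np

# Toy stand-in sizes (hypothetical): the real model uses the Gemma 3
# tokenizer vocabulary and its own embedding width.
VOCAB_SIZE = 1000   # stand-in for the tokenizer vocabulary size
EMBED_DIM = 64      # stand-in for the static embedding width
N_LANGS = 148       # number of supported languages, from the post

rng = np.random.default_rng(0)
embeddings = rng.standard_normal((VOCAB_SIZE, EMBED_DIM)).astype(np.float32)
W = rng.standard_normal((EMBED_DIM, N_LANGS)).astype(np.float32)
b = np.zeros(N_LANGS, dtype=np.float32)

def detect(token_ids: list[int]) -> int:
    """Mean-pool the static token embeddings, then score each language
    with a linear classifier and return the argmax class index."""
    pooled = embeddings[token_ids].mean(axis=0)  # (EMBED_DIM,)
    logits = pooled @ W + b                      # (N_LANGS,)
    return int(np.argmax(logits))

# A list of token ids would come from the tokenizer in the real pipeline.
lang_id = detect([3, 17, 256, 999])
print(lang_id)
```

Because inference is just a table lookup, a mean, and one matrix-vector product, this kind of model is cheap enough to explain throughput in the tens of thousands of texts per second on a single CPU thread.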
https://github.com/dleemiller/WordLlamaDetect