santhoshtr/lid.md

## lid.md

      
An Open Dataset and Model for Language Identification

From https://github.com/laurieburchell/open-lid-dataset
Paper: https://arxiv.org/pdf/2305.13820.pdf
Model: https://data.statmt.org/lid/lid201-model.bin.gz
Licensed under GPLv3
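
The model link above points to a gzip-compressed file, so it has to be unpacked before fastText can load it. A minimal sketch using only the standard library follows; the local file names are assumptions about where the download was saved, and `gunzip` is a helper introduced here for illustration, not something from the OpenLID repository.

```python
import gzip
import shutil
from pathlib import Path


def gunzip(src: str, dst: str) -> Path:
    """Decompress the gzip file at `src` into `dst` and return the output path."""
    out = Path(dst)
    # Stream the copy so the large model file is never held fully in memory.
    with gzip.open(src, "rb") as f_in, open(out, "wb") as f_out:
        shutil.copyfileobj(f_in, f_out)
    return out


# Assumed local file name; adjust to wherever the archive was downloaded.
archive = Path("lid201-model.bin.gz")
if archive.exists():
    unpacked = gunzip(str(archive), "lid201-model.bin")
    print(f"Model unpacked to {unpacked}")
```

The unpacked `lid201-model.bin` can then be passed to `fasttext.load_model` in the same way as the NLLB model.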
NLLB LID model for 218 languages

wget https://dl.fbaipublicfiles.com/nllb/lid/lid218e.bin
Then use it for inference with the fastText Python bindings:
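import fasttext
pretrained_lang_model = "lid218e.bin"  # path to the downloaded model file
model = fasttext.load_model(pretrained_lang_model)
text = "これ、浅草に、行きますか"
predictions = model.predict(text, k=1)  # k=1 returns only the top prediction
print(predictions)  # a (labels, probabilities) pair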
The model is licensed under CC-BY-NC.
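
The inference snippet above prints the raw `(labels, probabilities)` pair. A small wrapper can make the output easier to consume. The sketch below rests on two assumptions: labels carry the usual fastText `__label__` prefix (for example `__label__jpn_Jpan`), and input should be a single line, since fastText's `predict` rejects text containing newlines. The helper names `clean_text`, `parse_label`, and `identify` are hypothetical and introduced only for illustration.

```python
from typing import List, Tuple

LABEL_PREFIX = "__label__"  # prefix fastText puts on supervised labels


def clean_text(text: str) -> str:
    """Collapse all whitespace, including newlines, into single spaces."""
    return " ".join(text.split())


def parse_label(label: str) -> str:
    """Strip the label prefix, e.g. '__label__jpn_Jpan' -> 'jpn_Jpan'."""
    if label.startswith(LABEL_PREFIX):
        return label[len(LABEL_PREFIX):]
    return label


def identify(model, text: str, k: int = 3) -> List[Tuple[str, float]]:
    """Return up to k (language_code, probability) pairs, best first."""
    labels, probs = model.predict(clean_text(text), k=k)
    return [(parse_label(label), float(p)) for label, p in zip(labels, probs)]


# Usage, assuming lid218e.bin is in the working directory:
#   import fasttext
#   model = fasttext.load_model("lid218e.bin")
#   print(identify(model, "これ、浅草に、行きますか"))
```

Passing the model in as an argument keeps the helpers independent of any particular model file, so the same code should work with the OpenLID model as long as its labels use the same prefix.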
Links


https://github.com/slone-nlp/myv-nmt/blob/main/dirty-code-2022/model_training/01_multilang-detect.ipynb
https://huggingface.co/slone/fastText-LID-323
https://fasttext.cc/blog/2017/10/02/blog-post.html
https://fasttext.cc/docs/en/supervised-tutorial.html
facebookresearch/fastText#1323: Any plans to update the pre-trained model for Language Identification?