@santhoshtr
Created June 23, 2023 06:31
Language identification - notes

An Open Dataset and Model for Language Identification

From https://github.com/laurieburchell/open-lid-dataset

Paper: https://arxiv.org/pdf/2305.13820.pdf

Model: https://data.statmt.org/lid/lid201-model.bin.gz (licensed under GPLv3)
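The model file at that URL is gzip-compressed, so it has to be unpacked before fastText can load it. A minimal sketch of that step, assuming the archive has already been downloaded to the working directory; the helper name `decompress_gz` and the local file names are illustrative, not part of the OpenLID repository:

```python
import gzip
import shutil
from pathlib import Path


def decompress_gz(src: str, dst: str) -> Path:
    """Stream-decompress the gzip file at src into dst and return dst as a Path."""
    # copyfileobj copies in chunks, so the whole model is never held in memory
    with gzip.open(src, "rb") as f_in, open(dst, "wb") as f_out:
        shutil.copyfileobj(f_in, f_out)
    return Path(dst)


# Usage (assumed local file names):
# decompress_gz("lid201-model.bin.gz", "lid201-model.bin")
# model = fasttext.load_model("lid201-model.bin")
```

Running `gunzip lid201-model.bin.gz` in a shell does the same job.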

NLLB LID model for 218 languages

wget https://dl.fbaipublicfiles.com/nllb/lid/lid218e.bin

Then load it with fastText and run inference:

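import fasttext

# Path to the NLLB LID model downloaded with wget above
pretrained_lang_model = "lid218e.bin"
model = fasttext.load_model(pretrained_lang_model)
text = "これ、浅草に、行きますか"
# k=1 asks for only the single most likely label and its score
predictions = model.predict(text, k=1)
print(predictions)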

The model's license is CC-BY-NC.
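`model.predict` returns a pair of sequences: the labels and their scores. The sketch below turns that pair into plain tuples. It assumes the labels follow the `__label__<language>_<Script>` pattern (for example `__label__jpn_Jpan`); the helper names `single_line` and `parse_predictions` are illustrative and not part of fastText. Because fastText predicts one line at a time and rejects input containing newlines, the text is flattened first.

```python
from typing import List, Sequence, Tuple

LABEL_PREFIX = "__label__"  # prefix fastText puts on every label


def single_line(text: str) -> str:
    """Collapse all whitespace, including newlines, into single spaces."""
    return " ".join(text.split())


def parse_predictions(
    labels: Sequence[str], scores: Sequence[float]
) -> List[Tuple[str, str, float]]:
    """Convert a (labels, scores) pair into (language, script, score) tuples."""
    results = []
    for label, score in zip(labels, scores):
        # Strip the fastText prefix if it is present
        code = label[len(LABEL_PREFIX):] if label.startswith(LABEL_PREFIX) else label
        # Assumed label layout: "<language>_<Script>"; script is "" when absent
        lang, _, script = code.partition("_")
        results.append((lang, script, float(score)))
    return results


# Usage with the model loaded above:
# labels, scores = model.predict(single_line(text), k=3)
# for lang, script, score in parse_predictions(labels, scores):
#     print(lang, script, round(score, 3))
```

Passing a larger `k` returns the top `k` candidates, which is useful for short or mixed-language inputs where the top score is low.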

Links
