@tathagata-raha
Created August 30, 2021 22:28
Example code for the code-mixed language identification PR in the iNLTK toolkit
from inltk.inltk import identify_language, reset_language_identifying_models
inp = 'The model was trained on code-mixed data'
print('inp:', identify_language(inp))

# To check code-mixed or romanized Indian languages, import all the classes from
# codemixed_util; otherwise an AttributeError is raised. Comment out this line to see it.
from inltk.codemixed_util import *

# Passing check_codemixed=True checks for romanized Indian languages and code-mixed text.
inp2 = 'Tu achha insan hain'
print('inp2:', identify_language(inp2, check_codemixed=True))

# When check_codemixed is not set to True, anything written in Latin script returns 'en'.
inp3 = 'Tu achha insan hain'
print('inp3:', identify_language(inp3))

# The new functionality also works on code-mixed sentences.
inp4 = 'thanks, nahi khoj paye to batana, i have a few tough ones, but will need to work together for them'
print('inp4:', identify_language(inp4, check_codemixed=True))
'''
Output:
-----------------------------------------------------------
inp: en
inp2: hi-en
inp3: en
inp4: hi-en
-----------------------------------------------------------
'''
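
The calls above label one string at a time. As a minimal sketch of labeling several inputs in one pass, the helper below takes the identifier function as an argument. The names `label_texts` and `fake_identify` are hypothetical and not part of iNLTK. `fake_identify` is a toy stand-in that only mimics the `identify_language(text, check_codemixed=...)` call shape used above, so the sketch runs without the library or its models.

```python
def label_texts(texts, identify, check_codemixed=True):
    """Return a dict mapping each input string to the label from `identify`."""
    return {text: identify(text, check_codemixed=check_codemixed) for text in texts}


def fake_identify(text, check_codemixed=False):
    # Hypothetical stand-in for inltk's identify_language.
    # Toy keyword rule for illustration only; this is NOT how iNLTK decides.
    hindi_hints = {"tu", "achha", "insan", "hain", "nahi"}
    if check_codemixed and hindi_hints & set(text.lower().split()):
        return "hi-en"
    return "en"


samples = [
    "The model was trained on code-mixed data",
    "Tu achha insan hain",
]

labels = label_texts(samples, fake_identify)
for text, label in labels.items():
    print(f"{label}\t{text}")
```

To try this against the real library, one would pass `identify_language` in place of `fake_identify`, after the `codemixed_util` import shown above.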