Created
August 30, 2021 22:28
Example code for the code-mixed language identification PR in the iNLTK toolkit
from inltk.inltk import identify_language, reset_language_identifying_models
inp = 'The model was trained on code-mixed data'
print('inp: ', identify_language(inp))
# To check code-mixed or romanized Indian languages, you have to import all the classes from codemixed_util.
# Otherwise identify_language raises an AttributeError. Comment out the next line to see this for yourself.
from inltk.codemixed_util import *
inp2 = 'Tu achha insan hain'
# Passing check_codemixed=True also checks for romanized Indian languages and code-mixed text.
print('inp2: ', identify_language(inp2, check_codemixed=True))
inp3 = 'Tu achha insan hain'
# With check_codemixed left at False, anything written in Latin script is reported as 'en'.
print('inp3: ', identify_language(inp3))
inp4 = 'thanks, nahi khoj paye to batana, i have a few tough ones, but will need to work together for them'
# The new functionality also works on code-mixed sentences.
print('inp4: ', identify_language(inp4, check_codemixed=True))
'''
Output:
-----------------------------------------------------------
inp: en
inp2: hi-en
inp3: en
inp4: hi-en
-----------------------------------------------------------
'''
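The example above calls identify_language once per input. To compare the two modes side by side over several inputs, a small helper along the following lines might be convenient. This is only a sketch: `compare_modes` and `fake_identify` are hypothetical names that are not part of iNLTK, and `fake_identify` is a keyword-based stand-in (not the real model) so that the snippet runs without iNLTK installed.

```python
def compare_modes(texts, identify):
    """Call `identify` on each text with and without check_codemixed.

    `identify` is any callable with the same calling convention as the
    identify_language calls in the example above.
    """
    results = {}
    for text in texts:
        results[text] = {
            "default": identify(text),
            "codemixed": identify(text, check_codemixed=True),
        }
    return results


def fake_identify(text, check_codemixed=False):
    """Hypothetical stand-in for identify_language; NOT the real model.

    It flags a few romanized Hindi words so the helper can be exercised
    without downloading any iNLTK models.
    """
    hinglish_markers = {"tu", "achha", "nahi", "batana"}
    words = set(text.lower().replace(",", " ").split())
    if check_codemixed and words & hinglish_markers:
        return "hi-en"
    return "en"


sample = ["The model was trained on code-mixed data", "Tu achha insan hain"]
for text, labels in compare_modes(sample, fake_identify).items():
    print(text, labels)
```

With iNLTK installed, the same helper would presumably be called as `compare_modes(sample, identify_language)`, after the wildcard import from `inltk.codemixed_util` shown in the example.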