@janithl janithl/getlang.py
Last active Dec 18, 2016

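from collections import defaultdict

# Unicode block per language code. range() excludes its stop value,
# so each stop is one past the block's last code point.
UNICODE_BLOCKS = {
    'en': range(0x0000, 0x02B0),  # Basic Latin through IPA Extensions
    'si': range(0x0D80, 0x0E00),  # Sinhala
    'ta': range(0x0B80, 0x0C00),  # Tamil
    'dv': range(0x0780, 0x07C0)   # Thaana (Dhivehi)
}


def getlang(text):
    """Get language via Unicode range. Partially based on:
    https://github.com/kent37/guess-language/blob/master/guess_language/guess_language.py#L344
    """
    run_types = defaultdict(int)
    for c in text:
        # Count letters only; digits, punctuation and combining marks are skipped.
        if c.isalpha():
            for block in UNICODE_BLOCKS:
                if ord(c) in UNICODE_BLOCKS[block]:
                    run_types[block] += 1
    # max() raises ValueError on an empty mapping, so handle text with no known letters.
    if not run_types:
        return None
    return max(run_types, key=run_types.get)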
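
A minimal, self-contained sketch of how the counting approach behaves on sample input. The name guess_script, the SCRIPT_BLOCKS table and the sample strings are illustrative assumptions, not part of the gist; the block boundaries are written as inclusive (start, end) pairs.

```python
from collections import Counter

# Hypothetical restatement of the gist's table, using inclusive end points.
SCRIPT_BLOCKS = {
    'en': (0x0000, 0x02AF),  # Basic Latin through IPA Extensions
    'si': (0x0D80, 0x0DFF),  # Sinhala
    'ta': (0x0B80, 0x0BFF),  # Tamil
    'dv': (0x0780, 0x07BF),  # Thaana (Dhivehi)
}


def guess_script(text):
    """Return the language code whose block holds the most letters, or None."""
    counts = Counter()
    for ch in text:
        if not ch.isalpha():
            continue  # skip digits, spaces, punctuation and combining marks
        cp = ord(ch)
        for lang, (start, end) in SCRIPT_BLOCKS.items():
            if start <= cp <= end:
                counts[lang] += 1
    if not counts:
        return None  # no letter fell inside any known block
    return counts.most_common(1)[0][0]


# Illustrative sample strings (assumed, not taken from the author's test file).
samples = ["Hello world", "සිංහල", "தமிழ்", "ދިވެހި", "1234"]
for s in samples:
    print(repr(s), '->', guess_script(s))
```

Because only characters for which isalpha() is true are counted, vowel signs and other combining marks in Sinhala, Tamil and Thaana text do not contribute to the totals; the base letters alone decide the result.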

pathumego commented Dec 15, 2016

nice :)

kiriappeee commented Dec 15, 2016

Can you add some sample data for this? I just want to try an overthought optimization.

janithl commented Dec 18, 2016

@kiriappeee Hey man, sorry I just saw this! Made a quick and incomplete test file. https://gist.github.com/janithl/bdc5d0470e024cc284fb777c92081428
