@avriiil
Created April 5, 2021 19:50
Perform morphological tokenization on Arabic text
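from camel_tools.disambig.mle import MLEDisambiguator
from camel_tools.tokenizers.morphological import MorphologicalTokenizer
from camel_tools.tokenizers.word import simple_word_tokenize

# Pretrained MLE disambiguator (requires the CAMeL Tools data packages to be installed)
mle = MLEDisambiguator.pretrained()

# `df` is assumed to be an existing DataFrame whose `tweet_text` column holds raw strings.
# tokenize() expects a list of words, so split the tweet first (skip this if already split).
words = simple_word_tokenize(df.tweet_text.iloc[0])

# atbseg scheme
tokenizer = MorphologicalTokenizer(mle, scheme='atbseg')
tokens = tokenizer.tokenize(words)
print(tokens)
# atbtok scheme
tokenizer = MorphologicalTokenizer(mle, scheme='atbtok')
tokens = tokenizer.tokenize(words)
print(tokens)
# bwtok scheme
tokenizer = MorphologicalTokenizer(mle, scheme='bwtok')
tokens = tokenizer.tokenize(words)
print(tokens)
# ...and so on...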
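The repeated per-scheme blocks above could be folded into a loop. The following is a minimal sketch, not part of the original gist: `tokenize_with_schemes` and `EchoTokenizer` are hypothetical names introduced here, and the stand-in tokenizer exists only so the example runs without CAMeL Tools or its data installed.

```python
from typing import Callable, Dict, Iterable, List


def tokenize_with_schemes(
    words: List[str],
    schemes: Iterable[str],
    make_tokenizer: Callable[[str], object],
) -> Dict[str, List[str]]:
    """Build one tokenizer per scheme and run each over the same word list."""
    results = {}
    for scheme in schemes:
        # With CAMeL Tools this would be MorphologicalTokenizer(mle, scheme=scheme)
        tokenizer = make_tokenizer(scheme)
        results[scheme] = tokenizer.tokenize(words)
    return results


class EchoTokenizer:
    """Hypothetical stand-in tokenizer: tags each word with the scheme name."""

    def __init__(self, scheme: str):
        self.scheme = scheme

    def tokenize(self, words: List[str]) -> List[str]:
        return [f"{self.scheme}:{w}" for w in words]


# Demo with the stand-in; the scheme names are the ones used in the gist.
demo = tokenize_with_schemes(["w1", "w2"], ["atbseg", "atbtok", "bwtok"], EchoTokenizer)
for scheme, tokens in demo.items():
    print(scheme, tokens)
```

With the real library, the factory argument would presumably be something like `lambda s: MorphologicalTokenizer(mle, scheme=s)`, with `mle` being the disambiguator the gist already relies on.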