@dyerrington
Last active November 9, 2022 19:35
Google Translate API demo, tested with Python 3.9.x. It may not work as well with Python 3.10, but if you otherwise follow the guide I referenced, you should be in business. It is highly recommended that you create a new Python environment before engaging in any serious development, if you haven't done so already.

dyerrington commented Nov 9, 2022

One limitation of the TextBlob sentence tokenizer is that it really only works well on Latin-based languages and stumbles on multi-byte scripts and punctuation, such as Cyrillic, Hanzi (and other phono-semantic scripts), and Asian UTF-8 strings. This is where spaCy is a better choice, but it requires a bit more planning to set up and run, since you have to load more libraries and handle context more selectively. So, if you have a specific case you want to handle, you should be able to extend the examples above with a switch on language to use better sentence handling prior to translation.

Here's a good starting point if you want better sentence handling for non-latin based languages:
https://spacy.io/api/sentencizer

An example of using this (on English, at least):

import spacy

nlp = spacy.load('en_core_web_sm')  # See supported language models here: https://spacy.io/usage/models
doc = nlp(text)  # `text` is the string you want to split into sentences
for sent in doc.sents:
    print(sent.text)
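To illustrate the "switch" idea above, here is a minimal, self-contained sketch that routes text to a different sentence splitter based on a language code before translation. In practice the non-Latin branch would call spaCy's sentencizer; the regex splitters and the `split_sentences` helper here are stand-in assumptions so the example runs without any models installed.

```python
import re

def split_latin(text: str) -> list[str]:
    # Naive split on ASCII sentence-ending punctuation followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def split_cjk(text: str) -> list[str]:
    # Split on full-width terminators used in Chinese/Japanese text.
    return [p for p in re.split(r"(?<=[。！？])", text.strip()) if p]

def split_sentences(text: str, lang: str) -> list[str]:
    # The "switch": route by language code prior to translation.
    if lang in {"zh", "ja"}:
        return split_cjk(text)
    return split_latin(text)

print(split_sentences("Hello there. How are you?", "en"))
# → ['Hello there.', 'How are you?']
print(split_sentences("你好。你好吗？", "zh"))
# → ['你好。', '你好吗？']
```

Each branch could then feed its sentence list to the translation call, so only the splitting strategy varies per language.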
