Google Translate API demo tested with Python 3.9.x. This may not work as well with Python 3.10 for some reason, but if you follow the guide referenced below you should be in business. I highly recommend creating a new Python environment before engaging in any serious development if you haven't done so.
{
"cells": [
{
"cell_type": "markdown",
"id": "58bea5fd-1e38-4b7c-805f-fd27f7158297",
"metadata": {},
"source": [
"# Google Translate Demo\n",
"David Yerrington <david@yerrington.net>\n",
"\n",
"A client I worked with last year had me on a project related to sentiment using multi-langauge corpus. The steps in the Pipeline looked roughly like:\n",
"\n",
"1. Detect language from text\n",
"2. Translate source languages `['es', 'ar', 'zh', ]` to target `en`\n",
" - By sentence\n",
"3. Feature engineering\n",
" - Source langauge 2 letter code\n",
" - Source confidence score\n",
" - Sentiment\n",
" - English translation text\n",
" - Sentence #\n",
" - Parent observation ID"
]
},
{
"cell_type": "markdown",
"id": "10506d17-ea22-480d-bd97-863a1e0f1b64",
"metadata": {},
"source": [
"## Imports"
]
},
{
"cell_type": "code",
"execution_count": 124,
"id": "94421d6c-d2bd-4ffc-b2ce-269f00eee8d1",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"INFO:numexpr.utils:NumExpr defaulting to 8 threads.\n"
]
}
],
"source": [
"from os import environ\n",
"from google.cloud import translate\n",
"import logging, sys\n",
"import pandas as pd"
]
},
{
"cell_type": "markdown",
"id": "08a05d81-3e45-40d6-a0ed-5b1ef30f953a",
"metadata": {},
"source": [
"### Environment Variables\n",
"\n",
"If you're running this in a notebook, you need to authenticate and/or configure your credentials properly with your Google service account. If you've never worked with Google API services, you need to setup a service acccount. It's actually quite easy compared to the other cloud services since Google takes a very proactive and developer-centric approach to working with their services.\n",
"\n",
"**Use this guide to setup your account properly and setup your credentials:**\n",
"\n",
"https://codelabs.developers.google.com/codelabs/cloud-translation-python3\n",
"\n",
"> **If you've already setup your account**, you just need to set the project ID in your environment then make sure you load Juptyer in that same environment. Long-term, you should set this up in a Docker instance after you get it to work like how you want.\n",
"\n",
"```bash\n",
"export PROJECT_ID=$(gcloud config get-value core/project)\n",
"echo $PROJECT_ID\n",
"```"
]
},
{
"cell_type": "code",
"execution_count": 189,
"id": "cf9583ce-1dcf-490c-a014-cec62b8799c6",
"metadata": {},
"outputs": [],
"source": [
"%%bash\n",
"# You can use this cell to check that your PROJECT_ID is set properly or possible to set properly.\n",
"# You may have to gconsole login again.\n",
"export PROJECT_ID=$(gcloud config get-value core/project)\n",
"export GOOGLE_APPLICATION_CREDENTIALS=\"~/my-translation-sa-key.json\"\n",
"echo $PROJECT_ID"
]
},
{
"cell_type": "markdown",
"id": "a24f62fd-cd4d-4a47-97f9-c66d661522cc",
"metadata": {},
"source": [
"### Alterantively - Set ENV from Python\n",
"If you want to set these values from your notebook or Python script, use this method to set them.\n",
"\n",
"> You should review https://codelabs.developers.google.com/codelabs/cloud-translation-python3 to see how we get the JSON credentials file. Either set your environment variable, or use the code below to reference your .json file from Google -- described in the docs above."
]
},
{
"cell_type": "code",
"execution_count": 44,
"id": "06cc63eb-6778-4c17-8399-5bc4e5862a3c",
"metadata": {},
"outputs": [],
"source": [
"environ['GOOGLE_APPLICATION_CREDENTIALS'] = \"/Users/david.yerrington/my-translation-sa-key.json\""
]
},
{
"cell_type": "markdown",
"id": "ac51ff79-9626-4de5-8ef8-3ae60556ef94",
"metadata": {
"tags": []
},
"source": [
"### Translate API Wrapper Class\n",
"\n",
"Feel free to update this for your own needs. This class is a _stripped down_ version from another project but I created it specifically for working with Pandas DataFrames.\n",
"\n",
"> You might find it handy to save this cell to a file and include it in files you're using to automate this process."
]
},
{
"cell_type": "code",
"execution_count": 197,
"id": "983d4054-0463-4f19-9076-006a248c90b0",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"DEBUG:google.auth._default:Checking /Users/david.yerrington/my-translation-sa-key.json for explicit credentials as part of auth process...\n",
"INFO:root:Set 136 available language code(s).\n"
]
}
],
"source": [
"logging.basicConfig(stream=sys.stdout, level=logging.INFO)\n",
"\n",
"class GCPTranslateAPI:\n",
" \n",
" logger = False\n",
" \n",
" project_id = False\n",
" parent = False\n",
" client = False\n",
" \n",
" available_languages = []\n",
" \n",
" def __init__(self, **kwargs):\n",
" \n",
" # Setup basic logging\n",
" self.logger = logging.getLogger()\n",
" self.logger.setLevel(logging.DEBUG)\n",
" \n",
" # Set class attributes from init\n",
" for attribute, value in kwargs.items():\n",
" if hasattr(self, attribute):\n",
" setattr(self, attribute, value)\n",
" \n",
" self.set_credentials()\n",
" self.set_available_languages()\n",
" \n",
" \n",
" \n",
" def set_credentials(self):\n",
" assert self.project_id\n",
" self.client = translate.TranslationServiceClient()\n",
" \n",
" def get_available_languages(self):\n",
" response = self.client.get_supported_languages(\n",
" parent = self.parent, \n",
" display_language_code = \"en\"\n",
" )\n",
" languages = [\n",
" {\n",
" \"code\": language.language_code, \n",
" \"language\": language.display_name\n",
" } \n",
" for language in response.languages\n",
" ]\n",
" return languages\n",
" \n",
" def set_available_languages(self):\n",
" self.available_languages = self.get_available_languages()\n",
" self.logger.info(f\"Set {len(self.available_languages)} available language code(s).\")\n",
" \n",
" def get_language_from_code(self, code):\n",
" for row in self.available_languages:\n",
" if row['code'] == code:\n",
" return row['language']\n",
" else:\n",
" return \"Unknown\"\n",
" \n",
" def detect_language(self, text):\n",
" response = self.client.detect_language(\n",
" parent = self.parent, \n",
" content = text\n",
" )\n",
" languages = [\n",
" {\n",
" \"code\": language.language_code, \n",
" \"language\": self.get_language_from_code(language.language_code),\n",
" \"confidence\": language.confidence,\n",
" \"text\": text\n",
" } \n",
" for language in response.languages\n",
" ]\n",
" return languages\n",
" \n",
" def get_translation(self, text, target=\"en\", confidence_score = False):\n",
" \n",
" if confidence_score:\n",
" detected_language = self.detect_language(text)\n",
" languages_detected = len(detected_language)\n",
" else:\n",
" detected_language = False\n",
" languages_detected = 0\n",
" \n",
" response = self.client.translate_text(\n",
" contents=[text],\n",
" target_language_code = target,\n",
" parent=self.parent,\n",
" )\n",
" \n",
" if hasattr(response, \"translations\"):\n",
" translations = []\n",
" for translation in response.translations:\n",
" result = dict(\n",
" source_text = text,\n",
" detected_language_code = translation.detected_language_code,\n",
" translated_text = translation.translated_text\n",
" )\n",
" if languages_detected > 1:\n",
" result['languages_found'] = []\n",
" for detected in detected_language:\n",
" languages_found.append((detected['code'], detected['language'], detected['confidence']))\n",
" result['languages_found'].append(languages_found)\n",
" elif languages_detected == 1:\n",
" result['confidence'] = detected_language[0]['confidence']\n",
" translations.append(result)\n",
" \n",
" return translations\n",
" return False # default = nothing returned\n",
"\n",
"# Your project_id from Google service account\n",
"project_id = environ.get(\"PROJECT_ID\", \"\")\n",
"\n",
"# Class options -- mainly setting up authentication / connection parameters\n",
"api_options = dict(\n",
" project_id = project_id,\n",
" parent = f\"projects/{project_id}\"\n",
")\n",
"\n",
"api = GCPTranslateAPI(**api_options)"
]
},
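{
"cell_type": "markdown",
"id": "5a1f0c2e-7b3d-4e2a-9c41-aaa111222333",
"metadata": {},
"source": [
"As mentioned above, you can save the class cell to a file and import it from the scripts that automate this process. A minimal sketch, assuming you saved that cell to a hypothetical `gcp_translate_api.py` next to your script (the file name is my own choice):\n",
"\n",
"```python\n",
"from os import environ\n",
"\n",
"# Hypothetical module name -- adjust to wherever you saved the class cell\n",
"from gcp_translate_api import GCPTranslateAPI\n",
"\n",
"project_id = environ.get(\"PROJECT_ID\", \"\")\n",
"api = GCPTranslateAPI(project_id=project_id, parent=f\"projects/{project_id}\")\n",
"```"
]
},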
{
"cell_type": "markdown",
"id": "7059f5c1-50fa-455a-b9be-f49fe081b5c7",
"metadata": {},
"source": [
"## Using With Pandas\n",
"\n",
"This wrapper class was created specifcally for using Pandas but it mainly just returns lists of dictionaries so it's easy to use it standalone to write your own scripts."
]
},
{
"cell_type": "markdown",
"id": "8d1db9a0-0c7c-4431-8d95-04bed1f1488f",
"metadata": {},
"source": [
"### Example 0: Get reference list of supported langauges\n",
"\n",
"|Variable|Description|\n",
"|--|--|\n",
"|**code**|2 letter language code|\n",
"|**language**|Full language name|\n",
"\n",
"The api class I wrote automatically sets this upon initialization so they are handy for lookups if you need them."
]
},
{
"cell_type": "code",
"execution_count": 192,
"id": "d24ed4cd-1b31-457b-9f2a-172083a755fd",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>code</th>\n",
" <th>language</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>af</td>\n",
" <td>Afrikaans</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>ak</td>\n",
" <td>Akan</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>sq</td>\n",
" <td>Albanian</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>am</td>\n",
" <td>Amharic</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>ar</td>\n",
" <td>Arabic</td>\n",
" </tr>\n",
" <tr>\n",
" <th>...</th>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>131</th>\n",
" <td>cy</td>\n",
" <td>Welsh</td>\n",
" </tr>\n",
" <tr>\n",
" <th>132</th>\n",
" <td>xh</td>\n",
" <td>Xhosa</td>\n",
" </tr>\n",
" <tr>\n",
" <th>133</th>\n",
" <td>yi</td>\n",
" <td>Yiddish</td>\n",
" </tr>\n",
" <tr>\n",
" <th>134</th>\n",
" <td>yo</td>\n",
" <td>Yoruba</td>\n",
" </tr>\n",
" <tr>\n",
" <th>135</th>\n",
" <td>zu</td>\n",
" <td>Zulu</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>136 rows × 2 columns</p>\n",
"</div>"
],
"text/plain": [
" code language\n",
"0 af Afrikaans\n",
"1 ak Akan\n",
"2 sq Albanian\n",
"3 am Amharic\n",
"4 ar Arabic\n",
".. ... ...\n",
"131 cy Welsh\n",
"132 xh Xhosa\n",
"133 yi Yiddish\n",
"134 yo Yoruba\n",
"135 zu Zulu\n",
"\n",
"[136 rows x 2 columns]"
]
},
"execution_count": 192,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_languages = pd.DataFrame(api.available_languages)\n",
"df_languages"
]
},
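{
"cell_type": "markdown",
"id": "1b2c3d4e-5f60-4a71-8b92-bbb444555666",
"metadata": {},
"source": [
"If you only need the full name for a single code, the wrapper's lookup method reads from this same reference list. A small usage sketch:\n",
"\n",
"```python\n",
"# Returns the display name for a 2-letter code, or \"Unknown\" if it isn't in the reference list\n",
"api.get_language_from_code(\"ca\")  # 'Catalan'\n",
"```"
]
},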
{
"cell_type": "markdown",
"id": "e6bd8fce-4af9-4915-821d-fee47b309074",
"metadata": {},
"source": [
"### Setup Basic Example Data"
]
},
{
"cell_type": "code",
"execution_count": 125,
"id": "1c8520ce-6196-4c57-b348-21b164b806e4",
"metadata": {
"tags": []
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>row_id</th>\n",
" <th>text</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>1</td>\n",
" <td>Cual es el razon por la vida? Trabajar? Ganas ...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>2</td>\n",
" <td>No posis mai sal als ulls. Te'n penediràs.</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>3</td>\n",
" <td>Никогда не сыпьте соль в глаза. Вы будете сожа...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>4</td>\n",
" <td>Este es un ejemplo más sofisticado que demuest...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>5</td>\n",
" <td>强制性普通话示例。 如果你重视你的视力,千万不要在你的眼睛里撒盐</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" row_id text\n",
"0 1 Cual es el razon por la vida? Trabajar? Ganas ...\n",
"1 2 No posis mai sal als ulls. Te'n penediràs.\n",
"2 3 Никогда не сыпьте соль в глаза. Вы будете сожа...\n",
"3 4 Este es un ejemplo más sofisticado que demuest...\n",
"4 5 强制性普通话示例。 如果你重视你的视力,千万不要在你的眼睛里撒盐"
]
},
"execution_count": 125,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"examples = [\n",
" (1, \"Cual es el razon por la vida? Trabajar? Ganas de libro tiempo? Queseria saber hasta pronto.\"),\n",
" (2, \"No posis mai sal als ulls. Te'n penediràs.\"),\n",
" (3, \"Никогда не сыпьте соль в глаза. Вы будете сожалеть об этом.\"),\n",
" (4, \"Este es un ejemplo más sofisticado que demuestra un mayor nivel de uso del idioma dentro de la construcción del castellano, también conocido como español europeo.\"),\n",
" (5, \"强制性普通话示例。 如果你重视你的视力,千万不要在你的眼睛里撒盐\")\n",
"]\n",
"\n",
"df = pd.DataFrame(examples, columns = [\"row_id\", \"text\"])\n",
"df"
]
},
{
"cell_type": "markdown",
"id": "321efebc-fa3b-4be1-8adc-87032fc65f6a",
"metadata": {},
"source": [
"### Example 1: Detect language(s)\n",
"\n",
"While it's possible to return multiple results per language detection call, this example will `.pop()` the last result and extend the current dataframe with:\n",
"\n",
"|Variable|Definition|\n",
"|--|--|\n",
"|**code**| 2 letter language code detected.|\n",
"|**language**| Full language name pulled from an internal reference that's set upon the API wrappers initialization. This is nice because otherwise you have look these up manually.|\n",
"|**confidence**| Google's confidence score on the text example.| \n",
"\n",
"You may want to prequalify certain observations / rows prior to translating them if you have a very large batch to process. This implementation will look each text series (row) with one request per row. For lookups < 10,000 this is probably ok. You can also feed it many results at a time but you will want to update the class a little bit -- just write a new method.\n",
"\n",
"> Note the example #3 is written by me in the moment so it's likely badly written. "
]
},
{
"cell_type": "code",
"execution_count": 140,
"id": "6e0f3eb7-f014-40a0-89df-db88daea2597",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"CPU times: user 46.5 ms, sys: 28.6 ms, total: 75.1 ms\n",
"Wall time: 818 ms\n"
]
},
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>row_id</th>\n",
" <th>text</th>\n",
" <th>code</th>\n",
" <th>language</th>\n",
" <th>confidence</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>1</td>\n",
" <td>Cual es el razon por la vida? Trabajar? Ganas ...</td>\n",
" <td>es</td>\n",
" <td>Spanish</td>\n",
" <td>1.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>2</td>\n",
" <td>No posis mai sal als ulls. Te'n penediràs.</td>\n",
" <td>ca</td>\n",
" <td>Catalan</td>\n",
" <td>1.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>3</td>\n",
" <td>Никогда не сыпьте соль в глаза. Вы будете сожа...</td>\n",
" <td>ru</td>\n",
" <td>Russian</td>\n",
" <td>1.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>4</td>\n",
" <td>Este es un ejemplo más sofisticado que demuest...</td>\n",
" <td>es</td>\n",
" <td>Spanish</td>\n",
" <td>0.770804</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>5</td>\n",
" <td>强制性普通话示例。 如果你重视你的视力,千万不要在你的眼睛里撒盐</td>\n",
" <td>zh-CN</td>\n",
" <td>Chinese (Simplified)</td>\n",
" <td>1.000000</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" row_id text code \\\n",
"0 1 Cual es el razon por la vida? Trabajar? Ganas ... es \n",
"1 2 No posis mai sal als ulls. Te'n penediràs. ca \n",
"2 3 Никогда не сыпьте соль в глаза. Вы будете сожа... ru \n",
"3 4 Este es un ejemplo más sofisticado que demuest... es \n",
"4 5 强制性普通话示例。 如果你重视你的视力,千万不要在你的眼睛里撒盐 zh-CN \n",
"\n",
" language confidence \n",
"0 Spanish 1.000000 \n",
"1 Catalan 1.000000 \n",
"2 Russian 1.000000 \n",
"3 Spanish 0.770804 \n",
"4 Chinese (Simplified) 1.000000 "
]
},
"execution_count": 140,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"%%time\n",
"\n",
"def engineer_language_detection_features(row):\n",
" detected = api.detect_language(row['text'])\n",
" for key, value in detected.pop().items():\n",
" row[key] = value\n",
" return row\n",
"\n",
"# This will only be in memory unless you assign it back to the original DataFrame or a copy of it.\n",
"df.apply(engineer_language_detection_features, axis = 1)"
]
},
{
"cell_type": "markdown",
"id": "a02a9613-78a7-4833-92c2-e2bce8ea6b24",
"metadata": {},
"source": [
"For the case of sending larger batches (recommended if using large datasets), you can return all of the examples as chunks of lists by using a function like the following:\n",
"\n",
"```python\n",
"def chunks(input_list, n):\n",
" \"\"\"Yield successive n-sized chunks from input_list.\"\"\"\n",
" for i in range(0, len(input_list), n):\n",
" yield input_list[i:i + n]\n",
"```\n",
"\n",
"Then you can either iterate by offset from `df['text'].loc[start:end].tolist()`, or all at once `df['text'].tolist()`."
]
},
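{
"cell_type": "markdown",
"id": "2c3d4e5f-6071-4b82-93a4-ccc777888999",
"metadata": {},
"source": [
"To make that concrete, here is a rough sketch of a hypothetical batch method you could add to `GCPTranslateAPI`. It reuses the `chunks` helper above and the same `translate_text` call the class already makes (which accepts a list of `contents`); the method name and chunk size are illustrative, not part of the original class:\n",
"\n",
"```python\n",
"def get_translations_batch(self, texts, target=\"en\", chunk_size=100):\n",
"    # Hypothetical addition: one request per chunk instead of one request per row\n",
"    results = []\n",
"    for chunk in chunks(texts, chunk_size):\n",
"        response = self.client.translate_text(\n",
"            contents=chunk,\n",
"            target_language_code=target,\n",
"            parent=self.parent,\n",
"        )\n",
"        # Translations come back in the same order as the contents that were sent\n",
"        for source_text, translation in zip(chunk, response.translations):\n",
"            results.append(dict(\n",
"                source_text=source_text,\n",
"                detected_language_code=translation.detected_language_code,\n",
"                translated_text=translation.translated_text,\n",
"            ))\n",
"    return results\n",
"\n",
"# Attach it to the class, then call it with a list of texts:\n",
"# GCPTranslateAPI.get_translations_batch = get_translations_batch\n",
"# api.get_translations_batch(df['text'].tolist())\n",
"```"
]
},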
{
"cell_type": "markdown",
"id": "0b74ff8e-93b5-4e11-8a5d-d75313c4819e",
"metadata": {},
"source": [
"**Example 1.1: Removing records below confidence threashold**\n",
"\n",
"As mentioned, you may want to remove records that may not translate well if Google can't detected it properly. This is not a bad idea from experience. You can set the confidence to anything you want but I recall above .6 is usually pretty good. You can plot a histogram to see what your population looks like and examine further."
]
},
{
"cell_type": "code",
"execution_count": 144,
"id": "d7b0c334-5930-4ab3-9fc3-7c71cd881971",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>row_id</th>\n",
" <th>text</th>\n",
" <th>code</th>\n",
" <th>language</th>\n",
" <th>confidence</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>1</td>\n",
" <td>Cual es el razon por la vida? Trabajar? Ganas ...</td>\n",
" <td>es</td>\n",
" <td>Spanish</td>\n",
" <td>1.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>2</td>\n",
" <td>No posis mai sal als ulls. Te'n penediràs.</td>\n",
" <td>ca</td>\n",
" <td>Catalan</td>\n",
" <td>1.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>3</td>\n",
" <td>Никогда не сыпьте соль в глаза. Вы будете сожа...</td>\n",
" <td>ru</td>\n",
" <td>Russian</td>\n",
" <td>1.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>5</td>\n",
" <td>强制性普通话示例。 如果你重视你的视力,千万不要在你的眼睛里撒盐</td>\n",
" <td>zh-CN</td>\n",
" <td>Chinese (Simplified)</td>\n",
" <td>1.0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" row_id text code \\\n",
"0 1 Cual es el razon por la vida? Trabajar? Ganas ... es \n",
"1 2 No posis mai sal als ulls. Te'n penediràs. ca \n",
"2 3 Никогда не сыпьте соль в глаза. Вы будете сожа... ru \n",
"4 5 强制性普通话示例。 如果你重视你的视力,千万不要在你的眼睛里撒盐 zh-CN \n",
"\n",
" language confidence \n",
"0 Spanish 1.0 \n",
"1 Catalan 1.0 \n",
"2 Russian 1.0 \n",
"4 Chinese (Simplified) 1.0 "
]
},
"execution_count": 144,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_detected = df.apply(engineer_language_detection_features, axis = 1)\n",
"df_detected.query(\"confidence > .8\")"
]
},
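{
"cell_type": "markdown",
"id": "3d4e5f60-7182-4c93-a4b5-ddd000111222",
"metadata": {},
"source": [
"To pick a threshold for your own data, a quick look at the confidence distribution helps. A minimal sketch using pandas' built-in plotting (assumes matplotlib is installed):\n",
"\n",
"```python\n",
"# Distribution of Google's detection confidence across your rows\n",
"df_detected['confidence'].plot.hist(bins=20, title=\"Language detection confidence\")\n",
"```"
]
},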
{
"cell_type": "markdown",
"id": "810298e9-2219-47cf-bb39-f92dc9162bbc",
"metadata": {
"tags": []
},
"source": [
"### Example 2: Translations \n",
"\n",
"Again, this example assumes `row['text']` is the target column for translation. It will create a redundant `source_text` but you can ommit from your final application. "
]
},
{
"cell_type": "code",
"execution_count": 141,
"id": "04f292f6-e9a1-4f31-87b8-f142e3ab44f4",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"CPU times: user 52.5 ms, sys: 24.7 ms, total: 77.2 ms\n",
"Wall time: 1.1 s\n"
]
},
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>row_id</th>\n",
" <th>text</th>\n",
" <th>source_text</th>\n",
" <th>detected_language_code</th>\n",
" <th>translated_text</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>1</td>\n",
" <td>Cual es el razon por la vida? Trabajar? Ganas ...</td>\n",
" <td>Cual es el razon por la vida? Trabajar? Ganas ...</td>\n",
" <td>es</td>\n",
" <td>What is the reason for life? To work? Do you e...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>2</td>\n",
" <td>No posis mai sal als ulls. Te'n penediràs.</td>\n",
" <td>No posis mai sal als ulls. Te'n penediràs.</td>\n",
" <td>ca</td>\n",
" <td>Never put salt in your eyes. You will regret it.</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>3</td>\n",
" <td>Никогда не сыпьте соль в глаза. Вы будете сожа...</td>\n",
" <td>Никогда не сыпьте соль в глаза. Вы будете сожа...</td>\n",
" <td>ru</td>\n",
" <td>Never put salt in your eyes. You will regret it.</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>4</td>\n",
" <td>Este es un ejemplo más sofisticado que demuest...</td>\n",
" <td>Este es un ejemplo más sofisticado que demuest...</td>\n",
" <td>es</td>\n",
" <td>This is a more sophisticated example that demo...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>5</td>\n",
" <td>强制性普通话示例。 如果你重视你的视力,千万不要在你的眼睛里撒盐</td>\n",
" <td>强制性普通话示例。 如果你重视你的视力,千万不要在你的眼睛里撒盐</td>\n",
" <td>zh-CN</td>\n",
" <td>Mandatory Mandarin example. If you value your ...</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" row_id text \\\n",
"0 1 Cual es el razon por la vida? Trabajar? Ganas ... \n",
"1 2 No posis mai sal als ulls. Te'n penediràs. \n",
"2 3 Никогда не сыпьте соль в глаза. Вы будете сожа... \n",
"3 4 Este es un ejemplo más sofisticado que demuest... \n",
"4 5 强制性普通话示例。 如果你重视你的视力,千万不要在你的眼睛里撒盐 \n",
"\n",
" source_text detected_language_code \\\n",
"0 Cual es el razon por la vida? Trabajar? Ganas ... es \n",
"1 No posis mai sal als ulls. Te'n penediràs. ca \n",
"2 Никогда не сыпьте соль в глаза. Вы будете сожа... ru \n",
"3 Este es un ejemplo más sofisticado que demuest... es \n",
"4 强制性普通话示例。 如果你重视你的视力,千万不要在你的眼睛里撒盐 zh-CN \n",
"\n",
" translated_text \n",
"0 What is the reason for life? To work? Do you e... \n",
"1 Never put salt in your eyes. You will regret it. \n",
"2 Never put salt in your eyes. You will regret it. \n",
"3 This is a more sophisticated example that demo... \n",
"4 Mandatory Mandarin example. If you value your ... "
]
},
"execution_count": 141,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"%%time\n",
"\n",
"def translate_rows(row):\n",
" detected = api.get_translation(row['text'])\n",
" for key, value in detected.pop().items():\n",
" row[key] = value\n",
" return row\n",
"\n",
"# This will only be in memory unless you assign it back to the original DataFrame or a copy of it.\n",
"df.apply(translate_rows, axis = 1)"
]
},
{
"cell_type": "markdown",
"id": "629ee53b-173f-4d34-bdd2-ec174262cc59",
"metadata": {},
"source": [
"**Example 2.1: Translations with confidence \"detection\" scores**\n",
"\n",
"With these sort of projects I found it handy to get the detected confidence scores with the translations so that's why this feature exists. It's annoying to mung such a simple piece of data but you can always create one DataFrame for the detected language, then update it with the `get_translation` function but passing `get_confidence_score=True` to `.get_translation(text)` combines the two into one. Make note this will eat up 2x requests per row observation."
]
},
{
"cell_type": "code",
"execution_count": 148,
"id": "84543218-1649-4894-a0a9-bb69a66e50c3",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"CPU times: user 67.2 ms, sys: 37.8 ms, total: 105 ms\n",
"Wall time: 1.54 s\n"
]
},
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>row_id</th>\n",
" <th>text</th>\n",
" <th>source_text</th>\n",
" <th>detected_language_code</th>\n",
" <th>translated_text</th>\n",
" <th>confidence</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>1</td>\n",
" <td>Cual es el razon por la vida? Trabajar? Ganas ...</td>\n",
" <td>Cual es el razon por la vida? Trabajar? Ganas ...</td>\n",
" <td>es</td>\n",
" <td>What is the reason for life? To work? Do you e...</td>\n",
" <td>1.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>2</td>\n",
" <td>No posis mai sal als ulls. Te'n penediràs.</td>\n",
" <td>No posis mai sal als ulls. Te'n penediràs.</td>\n",
" <td>ca</td>\n",
" <td>Never put salt in your eyes. You will regret it.</td>\n",
" <td>1.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>3</td>\n",
" <td>Никогда не сыпьте соль в глаза. Вы будете сожа...</td>\n",
" <td>Никогда не сыпьте соль в глаза. Вы будете сожа...</td>\n",
" <td>ru</td>\n",
" <td>Never put salt in your eyes. You will regret it.</td>\n",
" <td>1.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>4</td>\n",
" <td>Este es un ejemplo más sofisticado que demuest...</td>\n",
" <td>Este es un ejemplo más sofisticado que demuest...</td>\n",
" <td>es</td>\n",
" <td>This is a more sophisticated example that demo...</td>\n",
" <td>0.770804</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>5</td>\n",
" <td>强制性普通话示例。 如果你重视你的视力,千万不要在你的眼睛里撒盐</td>\n",
" <td>强制性普通话示例。 如果你重视你的视力,千万不要在你的眼睛里撒盐</td>\n",
" <td>zh-CN</td>\n",
" <td>Mandatory Mandarin example. If you value your ...</td>\n",
" <td>1.000000</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" row_id text \\\n",
"0 1 Cual es el razon por la vida? Trabajar? Ganas ... \n",
"1 2 No posis mai sal als ulls. Te'n penediràs. \n",
"2 3 Никогда не сыпьте соль в глаза. Вы будете сожа... \n",
"3 4 Este es un ejemplo más sofisticado que demuest... \n",
"4 5 强制性普通话示例。 如果你重视你的视力,千万不要在你的眼睛里撒盐 \n",
"\n",
" source_text detected_language_code \\\n",
"0 Cual es el razon por la vida? Trabajar? Ganas ... es \n",
"1 No posis mai sal als ulls. Te'n penediràs. ca \n",
"2 Никогда не сыпьте соль в глаза. Вы будете сожа... ru \n",
"3 Este es un ejemplo más sofisticado que demuest... es \n",
"4 强制性普通话示例。 如果你重视你的视力,千万不要在你的眼睛里撒盐 zh-CN \n",
"\n",
" translated_text confidence \n",
"0 What is the reason for life? To work? Do you e... 1.000000 \n",
"1 Never put salt in your eyes. You will regret it. 1.000000 \n",
"2 Never put salt in your eyes. You will regret it. 1.000000 \n",
"3 This is a more sophisticated example that demo... 0.770804 \n",
"4 Mandatory Mandarin example. If you value your ... 1.000000 "
]
},
"execution_count": 148,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"%%time\n",
"\n",
"def translate_rows(row):\n",
" detected = api.get_translation(row['text'], confidence_score = True)\n",
" for key, value in detected.pop().items():\n",
" row[key] = value\n",
" return row\n",
"\n",
"# This will only be in memory unless you assign it back to the original DataFrame or a copy of it.\n",
"df.apply(translate_rows, axis = 1)"
]
},
{
"cell_type": "markdown",
"id": "2351bdde-d4cd-4d05-b20f-6a64ab1b6355",
"metadata": {},
"source": [
"### Example 3: Transforming rows to row->sentences\n",
"\n",
"Because I had to do this, and since it's possible there could be multiple languages being spoken at the observation level, you might find it handy to translate your DataFrame prior to translating with a `sentence #` feature and the original `row_id`.\n",
"\n",
"> **TextBlob** is a \"quick-and-dirty-get-it-done\" solution. For larger projects, check out spaCy for this."
]
},
{
"cell_type": "code",
"execution_count": 167,
"id": "f8a0d49e-b129-42aa-8750-a7a7277d2404",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>row_id</th>\n",
" <th>sentence 0</th>\n",
" <th>sentence 1</th>\n",
" <th>sentence 2</th>\n",
" <th>sentence 3</th>\n",
" <th>text</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>1</td>\n",
" <td>Cual es el razon por la vida?</td>\n",
" <td>Trabajar?</td>\n",
" <td>Ganas de libro tiempo?</td>\n",
" <td>Queseria saber hasta pronto.</td>\n",
" <td>Cual es el razon por la vida? Trabajar? Ganas ...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>2</td>\n",
" <td>No posis mai sal als ulls.</td>\n",
" <td>Te'n penediràs.</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>No posis mai sal als ulls. Te'n penediràs.</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>3</td>\n",
" <td>Никогда не сыпьте соль в глаза.</td>\n",
" <td>Вы будете сожалеть об этом.</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>Никогда не сыпьте соль в глаза. Вы будете сожа...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>4</td>\n",
" <td>Este es un ejemplo más sofisticado que demuest...</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>Este es un ejemplo más sofisticado que demuest...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>5</td>\n",
" <td>强制性普通话示例。 如果你重视你的视力,千万不要在你的眼睛里撒盐</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>强制性普通话示例。 如果你重视你的视力,千万不要在你的眼睛里撒盐</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" row_id sentence 0 \\\n",
"0 1 Cual es el razon por la vida? \n",
"1 2 No posis mai sal als ulls. \n",
"2 3 Никогда не сыпьте соль в глаза. \n",
"3 4 Este es un ejemplo más sofisticado que demuest... \n",
"4 5 强制性普通话示例。 如果你重视你的视力,千万不要在你的眼睛里撒盐 \n",
"\n",
" sentence 1 sentence 2 \\\n",
"0 Trabajar? Ganas de libro tiempo? \n",
"1 Te'n penediràs. NaN \n",
"2 Вы будете сожалеть об этом. NaN \n",
"3 NaN NaN \n",
"4 NaN NaN \n",
"\n",
" sentence 3 \\\n",
"0 Queseria saber hasta pronto. \n",
"1 NaN \n",
"2 NaN \n",
"3 NaN \n",
"4 NaN \n",
"\n",
" text \n",
"0 Cual es el razon por la vida? Trabajar? Ganas ... \n",
"1 No posis mai sal als ulls. Te'n penediràs. \n",
"2 Никогда не сыпьте соль в глаза. Вы будете сожа... \n",
"3 Este es un ejemplo más sofisticado que demuest... \n",
"4 强制性普通话示例。 如果你重视你的视力,千万不要在你的眼睛里撒盐 "
]
},
"execution_count": 167,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# In a new cell if module not found:\n",
"# !pip install textblob\n",
"from textblob import TextBlob\n",
"\n",
"def transform_to_sentences(row):\n",
" blob = TextBlob(row['text'])\n",
" for index, sentence in enumerate(blob.sentences):\n",
" row[f\"sentence {index}\"] = sentence.raw\n",
" return row\n",
"\n",
"# step 1: Push each sentence to a new column feature\n",
"df_sentence = df.apply(transform_to_sentences, axis = 1)\n",
"df_sentence"
]
},
{
"cell_type": "code",
"execution_count": 188,
"id": "7681daf1-9d70-4e29-a30b-997b5953f712",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>row_id</th>\n",
" <th>text</th>\n",
" <th>sentence_n</th>\n",
" <th>sentence_text</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>1</td>\n",
" <td>Cual es el razon por la vida? Trabajar? Ganas ...</td>\n",
" <td>0</td>\n",
" <td>Cual es el razon por la vida?</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>1</td>\n",
" <td>Cual es el razon por la vida? Trabajar? Ganas ...</td>\n",
" <td>1</td>\n",
" <td>Trabajar?</td>\n",
" </tr>\n",
" <tr>\n",
" <th>10</th>\n",
" <td>1</td>\n",
" <td>Cual es el razon por la vida? Trabajar? Ganas ...</td>\n",
" <td>2</td>\n",
" <td>Ganas de libro tiempo?</td>\n",
" </tr>\n",
" <tr>\n",
" <th>15</th>\n",
" <td>1</td>\n",
" <td>Cual es el razon por la vida? Trabajar? Ganas ...</td>\n",
" <td>3</td>\n",
" <td>Queseria saber hasta pronto.</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>2</td>\n",
" <td>No posis mai sal als ulls. Te'n penediràs.</td>\n",
" <td>0</td>\n",
" <td>No posis mai sal als ulls.</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>2</td>\n",
" <td>No posis mai sal als ulls. Te'n penediràs.</td>\n",
" <td>1</td>\n",
" <td>Te'n penediràs.</td>\n",
" </tr>\n",
" <tr>\n",
" <th>11</th>\n",
" <td>2</td>\n",
" <td>No posis mai sal als ulls. Te'n penediràs.</td>\n",
" <td>2</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>16</th>\n",
" <td>2</td>\n",
" <td>No posis mai sal als ulls. Te'n penediràs.</td>\n",
" <td>3</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>3</td>\n",
" <td>Никогда не сыпьте соль в глаза. Вы будете сожа...</td>\n",
" <td>0</td>\n",
" <td>Никогда не сыпьте соль в глаза.</td>\n",
" </tr>\n",
" <tr>\n",
" <th>17</th>\n",
" <td>3</td>\n",
" <td>Никогда не сыпьте соль в глаза. Вы будете сожа...</td>\n",
" <td>3</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7</th>\n",
" <td>3</td>\n",
" <td>Никогда не сыпьте соль в глаза. Вы будете сожа...</td>\n",
" <td>1</td>\n",
" <td>Вы будете сожалеть об этом.</td>\n",
" </tr>\n",
" <tr>\n",
" <th>12</th>\n",
" <td>3</td>\n",
" <td>Никогда не сыпьте соль в глаза. Вы будете сожа...</td>\n",
" <td>2</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>4</td>\n",
" <td>Este es un ejemplo más sofisticado que demuest...</td>\n",
" <td>0</td>\n",
" <td>Este es un ejemplo más sofisticado que demuest...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>8</th>\n",
" <td>4</td>\n",
" <td>Este es un ejemplo más sofisticado que demuest...</td>\n",
" <td>1</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>18</th>\n",
" <td>4</td>\n",
" <td>Este es un ejemplo más sofisticado que demuest...</td>\n",
" <td>3</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>13</th>\n",
" <td>4</td>\n",
" <td>Este es un ejemplo más sofisticado que demuest...</td>\n",
" <td>2</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9</th>\n",
" <td>5</td>\n",
" <td>强制性普通话示例。 如果你重视你的视力,千万不要在你的眼睛里撒盐</td>\n",
" <td>1</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>5</td>\n",
" <td>强制性普通话示例。 如果你重视你的视力,千万不要在你的眼睛里撒盐</td>\n",
" <td>0</td>\n",
" <td>强制性普通话示例。 如果你重视你的视力,千万不要在你的眼睛里撒盐</td>\n",
" </tr>\n",
" <tr>\n",
" <th>14</th>\n",
" <td>5</td>\n",
" <td>强制性普通话示例。 如果你重视你的视力,千万不要在你的眼睛里撒盐</td>\n",
" <td>2</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>19</th>\n",
" <td>5</td>\n",
" <td>强制性普通话示例。 如果你重视你的视力,千万不要在你的眼睛里撒盐</td>\n",
" <td>3</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" row_id text sentence_n \\\n",
"0 1 Cual es el razon por la vida? Trabajar? Ganas ... 0 \n",
"5 1 Cual es el razon por la vida? Trabajar? Ganas ... 1 \n",
"10 1 Cual es el razon por la vida? Trabajar? Ganas ... 2 \n",
"15 1 Cual es el razon por la vida? Trabajar? Ganas ... 3 \n",
"1 2 No posis mai sal als ulls. Te'n penediràs. 0 \n",
"6 2 No posis mai sal als ulls. Te'n penediràs. 1 \n",
"11 2 No posis mai sal als ulls. Te'n penediràs. 2 \n",
"16 2 No posis mai sal als ulls. Te'n penediràs. 3 \n",
"2 3 Никогда не сыпьте соль в глаза. Вы будете сожа... 0 \n",
"17 3 Никогда не сыпьте соль в глаза. Вы будете сожа... 3 \n",
"7 3 Никогда не сыпьте соль в глаза. Вы будете сожа... 1 \n",
"12 3 Никогда не сыпьте соль в глаза. Вы будете сожа... 2 \n",
"3 4 Este es un ejemplo más sofisticado que demuest... 0 \n",
"8 4 Este es un ejemplo más sofisticado que demuest... 1 \n",
"18 4 Este es un ejemplo más sofisticado que demuest... 3 \n",
"13 4 Este es un ejemplo más sofisticado que demuest... 2 \n",
"9 5 强制性普通话示例。 如果你重视你的视力,千万不要在你的眼睛里撒盐 1 \n",
"4 5 强制性普通话示例。 如果你重视你的视力,千万不要在你的眼睛里撒盐 0 \n",
"14 5 强制性普通话示例。 如果你重视你的视力,千万不要在你的眼睛里撒盐 2 \n",
"19 5 强制性普通话示例。 如果你重视你的视力,千万不要在你的眼睛里撒盐 3 \n",
"\n",
" sentence_text \n",
"0 Cual es el razon por la vida? \n",
"5 Trabajar? \n",
"10 Ganas de libro tiempo? \n",
"15 Queseria saber hasta pronto. \n",
"1 No posis mai sal als ulls. \n",
"6 Te'n penediràs. \n",
"11 NaN \n",
"16 NaN \n",
"2 Никогда не сыпьте соль в глаза. \n",
"17 NaN \n",
"7 Вы будете сожалеть об этом. \n",
"12 NaN \n",
"3 Este es un ejemplo más sofisticado que demuest... \n",
"8 NaN \n",
"18 NaN \n",
"13 NaN \n",
"9 NaN \n",
"4 强制性普通话示例。 如果你重视你的视力,千万不要在你的眼睛里撒盐 \n",
"14 NaN \n",
"19 NaN "
]
},
"execution_count": 188,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# step 2: Pivot sentence features to rows\n",
"# Finds all \"sentence #\" columns since we may not know how many there are\n",
"sentence_features = [column for column in df_sentence.columns if \"sentence\" in column]\n",
"\n",
"# The feared melt transformation -- they're not that bad really\n",
"df_sentence_melt = df_sentence.melt(\n",
"\n",
" # columns to remain at the \"row level\" as an index\n",
" id_vars = [\"row_id\", \"text\"], \n",
" \n",
" # The columns to turn into rows\n",
" value_vars = sentence_features,\n",
" \n",
" # Name the new row columns to something more expected\n",
" var_name = \"sentence_n\",\n",
" value_name = \"sentence_text\"\n",
")\n",
"\n",
"# Might be nice to transform the \"sentence n\" values to just \"n\" for sentence_n\n",
"df_sentence_melt['sentence_n'] = df_sentence_melt['sentence_n'].map(lambda text: text[-1])\n",
"\n",
"# Sort by original row_id so you can see the sentences grouped by their original text\n",
"df_sentence_melt.sort_values(\"row_id\")"
]
},
{
"cell_type": "markdown",
"id": "e93cfc4c-6315-4c80-87aa-56bfe92de64d",
"metadata": {},
"source": [
"Because the rows with fewer than the maximum number of sentences produce `NaN`, you may want to drop them."
]
},
{
"cell_type": "code",
"execution_count": 196,
"id": "3df23e2c-53a7-4146-a1fc-d16cf3671d87",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>row_id</th>\n",
" <th>text</th>\n",
" <th>sentence_n</th>\n",
" <th>sentence_text</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>1</td>\n",
" <td>Cual es el razon por la vida? Trabajar? Ganas ...</td>\n",
" <td>0</td>\n",
" <td>Cual es el razon por la vida?</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>2</td>\n",
" <td>No posis mai sal als ulls. Te'n penediràs.</td>\n",
" <td>0</td>\n",
" <td>No posis mai sal als ulls.</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>3</td>\n",
" <td>Никогда не сыпьте соль в глаза. Вы будете сожа...</td>\n",
" <td>0</td>\n",
" <td>Никогда не сыпьте соль в глаза.</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>4</td>\n",
" <td>Este es un ejemplo más sofisticado que demuest...</td>\n",
" <td>0</td>\n",
" <td>Este es un ejemplo más sofisticado que demuest...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>5</td>\n",
" <td>强制性普通话示例。 如果你重视你的视力,千万不要在你的眼睛里撒盐</td>\n",
" <td>0</td>\n",
" <td>强制性普通话示例。 如果你重视你的视力,千万不要在你的眼睛里撒盐</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>1</td>\n",
" <td>Cual es el razon por la vida? Trabajar? Ganas ...</td>\n",
" <td>1</td>\n",
" <td>Trabajar?</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>2</td>\n",
" <td>No posis mai sal als ulls. Te'n penediràs.</td>\n",
" <td>1</td>\n",
" <td>Te'n penediràs.</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7</th>\n",
" <td>3</td>\n",
" <td>Никогда не сыпьте соль в глаза. Вы будете сожа...</td>\n",
" <td>1</td>\n",
" <td>Вы будете сожалеть об этом.</td>\n",
" </tr>\n",
" <tr>\n",
" <th>10</th>\n",
" <td>1</td>\n",
" <td>Cual es el razon por la vida? Trabajar? Ganas ...</td>\n",
" <td>2</td>\n",
" <td>Ganas de libro tiempo?</td>\n",
" </tr>\n",
" <tr>\n",
" <th>15</th>\n",
" <td>1</td>\n",
" <td>Cual es el razon por la vida? Trabajar? Ganas ...</td>\n",
" <td>3</td>\n",
" <td>Queseria saber hasta pronto.</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" row_id text sentence_n \\\n",
"0 1 Cual es el razon por la vida? Trabajar? Ganas ... 0 \n",
"1 2 No posis mai sal als ulls. Te'n penediràs. 0 \n",
"2 3 Никогда не сыпьте соль в глаза. Вы будете сожа... 0 \n",
"3 4 Este es un ejemplo más sofisticado que demuest... 0 \n",
"4 5 强制性普通话示例。 如果你重视你的视力,千万不要在你的眼睛里撒盐 0 \n",
"5 1 Cual es el razon por la vida? Trabajar? Ganas ... 1 \n",
"6 2 No posis mai sal als ulls. Te'n penediràs. 1 \n",
"7 3 Никогда не сыпьте соль в глаза. Вы будете сожа... 1 \n",
"10 1 Cual es el razon por la vida? Trabajar? Ganas ... 2 \n",
"15 1 Cual es el razon por la vida? Trabajar? Ganas ... 3 \n",
"\n",
" sentence_text \n",
"0 Cual es el razon por la vida? \n",
"1 No posis mai sal als ulls. \n",
"2 Никогда не сыпьте соль в глаза. \n",
"3 Este es un ejemplo más sofisticado que demuest... \n",
"4 强制性普通话示例。 如果你重视你的视力,千万不要在你的眼睛里撒盐 \n",
"5 Trabajar? \n",
"6 Te'n penediràs. \n",
"7 Вы будете сожалеть об этом. \n",
"10 Ganas de libro tiempo? \n",
"15 Queseria saber hasta pronto. "
]
},
"execution_count": 196,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_sentence_melt.dropna(subset=\"sentence_text\") # only done in memory -- commit to new variable or update existing dataframe"
]
},
{
"cell_type": "markdown",
"id": "9cf5e118-7989-4e92-9f60-262db991ddd8",
"metadata": {},
"source": [
"# That's it! \n",
"From here you can translate at the sentence level if you like but if you want futher direction on how to process in batches to really get the most efficiency out of the Google Translate API, contact me for more advice!"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python [conda env:cx-galaxy]",
"language": "python",
"name": "conda-env-cx-galaxy-py"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.13"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
dyerrington commented Nov 9, 2022

One limitation of the TextBlob sentence tokenizer is that it really only works well on Latin-based languages and stumbles a bit with multi-byte punctuation, such as Cyrillic, Hanzi/phono-semantic, and other Asian UTF-8 strings. This is where spaCy is a better choice, but it requires a bit more planning to set up and execute since you have to load more libraries and deal with context a bit more selectively. So, if you have a specific case you want to handle, you should be able to extend the above examples with a switch statement to use better sentence handling prior to translation.

Here's a good starting point if you want better sentence handling for non-Latin-based languages:
https://spacy.io/api/sentencizer

An example of using this (on English at least), with the import and input text made explicit:

```python
import spacy

nlp = spacy.load('en_core_web_sm')  # See supported language models here: https://spacy.io/usage/models
doc = nlp(text)  # `text` is the string you want to split into sentences
for sent in doc.sents:
    print(sent.text)
```
