@rhiskey
Created July 23, 2022 19:51
Seminarus Lang_det_ML_Google.ipynb
{
"nbformat": 4,
"nbformat_minor": 0,
"metadata": {
"colab": {
"name": "Seminarus Lang_det_ML_Google.ipynb",
"provenance": [],
"collapsed_sections": [],
"authorship_tag": "ABX9TyM3eiN+NbcdMMz9E+ZQZbwH",
"include_colab_link": true
},
"kernelspec": {
"name": "python3",
"display_name": "Python 3"
},
"language_info": {
"name": "python"
}
},
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "view-in-github",
"colab_type": "text"
},
"source": [
"<a href=\"https://colab.research.google.com/gist/rhiskey/bbb32def43c48c8bda5fdbafe7b88f67/seminarus-lang_det_ml_google.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
]
},
{
"cell_type": "markdown",
"source": [
"## Language Detection with Python"
],
"metadata": {
"id": "GhE2KmUgh5Wu"
}
},
{
"cell_type": "code",
"source": [
"import pandas as pd\n",
"import numpy as np\n",
"from sklearn.feature_extraction.text import CountVectorizer\n",
"from sklearn.model_selection import train_test_split\n",
"from sklearn.naive_bayes import MultinomialNB\n",
"\n",
"data = pd.read_csv(\"dataset.csv\")\n",
"data.head()"
],
"metadata": {
"id": "8HKqEKSCh29A",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 206
},
"outputId": "094c817b-c476-4435-e547-82317ca32ceb"
},
"execution_count": 3,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
" Text language\n",
"0 klement gottwaldi surnukeha palsameeriti ning ... Estonian\n",
"1 sebes joseph pereira thomas på eng the jesuit... Swedish\n",
"2 ถนนเจริญกรุง อักษรโรมัน thanon charoen krung เ... Thai\n",
"3 விசாகப்பட்டினம் தமிழ்ச்சங்கத்தை இந்துப் பத்திர... Tamil\n",
"4 de spons behoort tot het geslacht haliclona en... Dutch"
],
"text/html": [
"\n",
" <div id=\"df-b039d807-6260-4b15-9bcd-96dd39c760da\">\n",
" <div class=\"colab-df-container\">\n",
" <div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Text</th>\n",
" <th>language</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>klement gottwaldi surnukeha palsameeriti ning ...</td>\n",
" <td>Estonian</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>sebes joseph pereira thomas på eng the jesuit...</td>\n",
" <td>Swedish</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>ถนนเจริญกรุง อักษรโรมัน thanon charoen krung เ...</td>\n",
" <td>Thai</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>விசாகப்பட்டினம் தமிழ்ச்சங்கத்தை இந்துப் பத்திர...</td>\n",
" <td>Tamil</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>de spons behoort tot het geslacht haliclona en...</td>\n",
" <td>Dutch</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>\n",
" </div>\n",
" </div>\n",
" "
]
},
"metadata": {},
"execution_count": 3
}
]
},
{
"cell_type": "code",
"source": [
"data.isnull().sum()"
],
"metadata": {
"id": "RCYFI_ubiB2r",
"colab": {
"base_uri": "https://localhost:8080/"
},
"outputId": "f646cf85-e131-4d40-968c-d357b548da36"
},
"execution_count": 4,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"Text 0\n",
"language 0\n",
"dtype: int64"
]
},
"metadata": {},
"execution_count": 4
}
]
},
{
"cell_type": "code",
"source": [
"data[\"language\"].value_counts()"
],
"metadata": {
"id": "8X3i2L9CiG6I",
"colab": {
"base_uri": "https://localhost:8080/"
},
"outputId": "697d0ca6-cf59-45f9-d214-a39356f7d6c7"
},
"execution_count": 5,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"Estonian 1000\n",
"Swedish 1000\n",
"English 1000\n",
"Russian 1000\n",
"Romanian 1000\n",
"Persian 1000\n",
"Pushto 1000\n",
"Spanish 1000\n",
"Hindi 1000\n",
"Korean 1000\n",
"Chinese 1000\n",
"French 1000\n",
"Portugese 1000\n",
"Indonesian 1000\n",
"Urdu 1000\n",
"Latin 1000\n",
"Turkish 1000\n",
"Japanese 1000\n",
"Dutch 1000\n",
"Tamil 1000\n",
"Thai 1000\n",
"Arabic 1000\n",
"Name: language, dtype: int64"
]
},
"metadata": {},
"execution_count": 5
}
]
},
{
"cell_type": "markdown",
"source": [
"# Language Detection Model\n"
],
"metadata": {
"id": "CkqfODJiiM2-"
}
},
{
"cell_type": "code",
"source": [
"x = np.array(data[\"Text\"])\n",
"y = np.array(data[\"language\"])\n",
"\n",
"cv = CountVectorizer()\n",
"X = cv.fit_transform(x)\n",
"X_train, X_test, y_train, y_test = train_test_split(\n",
" X, y,\n",
" test_size=0.20,\n",
" random_state=61\n",
")"
],
"metadata": {
"id": "FaaD2P1LiQ5H"
},
"execution_count": 6,
"outputs": []
},
{
"cell_type": "code",
"source": [
"model = MultinomialNB()\n",
"model.fit(X_train, y_train)\n",
"model.score(X_test, y_test)"
],
"metadata": {
"id": "LFoi05JwicbQ",
"colab": {
"base_uri": "https://localhost:8080/"
},
"outputId": "e01260d7-1dba-4452-ad6a-8fa428f3a531"
},
"execution_count": 7,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"0.956590909090909"
]
},
"metadata": {},
"execution_count": 7
}
]
},
{
"cell_type": "markdown",
"source": [
""
],
"metadata": {
"id": "Np8SsYUYienq"
}
},
{
"cell_type": "code",
"source": [
"user = input(\"Enter text: \")\n",
"# Use a new name so the `data` DataFrame loaded earlier is not overwritten.\n",
"features = cv.transform([user]).toarray()\n",
"output = model.predict(features)\n",
"\n",
"output"
],
"metadata": {
"id": "aQyop5aUieAh",
"colab": {
"base_uri": "https://localhost:8080/"
},
"outputId": "b60b2d52-08ce-4e59-ef75-67c610b0f581"
},
"execution_count": 11,
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Enter text: Я записал семинар\n"
]
},
{
"output_type": "execute_result",
"data": {
"text/plain": [
"array(['Russian'], dtype='<U10')"
]
},
"metadata": {},
"execution_count": 11
}
]
},
{
"cell_type": "markdown",
"source": [
"The model reaches about 95.7% accuracy on the held-out test set and correctly labels the sample sentence as Russian. Note that it can only predict the 22 languages present in the dataset; text in any other language is still assigned one of those labels."
],
"metadata": {
"id": "C91RPRePimHQ"
}
}
]
}
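
The notebook's closing note says the model can only detect languages present in the dataset. One possible way to make that limitation visible at prediction time is to return the predicted probability alongside the label. The sketch below is not part of the notebook: the tiny in-memory corpus stands in for `dataset.csv`, and the helper name `detect_language` is hypothetical. It bundles `CountVectorizer` and `MultinomialNB` into a scikit-learn pipeline so that raw text is the only input.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny stand-in corpus (an assumption; the notebook trains on dataset.csv).
texts = [
    "the cat sat on the mat",
    "this is a short english sentence",
    "le chat est sur le tapis",
    "ceci est une courte phrase en français",
    "я записал семинар вчера",
    "это короткое предложение на русском языке",
]
labels = ["English", "English", "French", "French", "Russian", "Russian"]

# Vectorizer and classifier are fitted together, so callers pass raw strings.
pipeline = make_pipeline(CountVectorizer(), MultinomialNB())
pipeline.fit(texts, labels)


def detect_language(text):
    """Return the predicted label and the probability the model assigns to it."""
    probs = pipeline.predict_proba([text])[0]
    best = probs.argmax()
    return pipeline.classes_[best], float(probs[best])


label, confidence = detect_language("я записал семинар")
print(label, round(confidence, 3))

# Text with no known words yields an all-zero count vector, so the model
# falls back to the class priors and the confidence stays low.
label, confidence = detect_language("zzz qqq")
print(label, round(confidence, 3))
```

A caller could treat predictions below some chosen confidence threshold as "unknown language"; the right threshold depends on the data and would need to be tuned on held-out examples.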