@vene
Last active December 30, 2015 22:09
Simple language similarity with character n-grams
{
"metadata": {
"name": ""
},
"nbformat": 3,
"nbformat_minor": 0,
"worksheets": [
{
"cells": [
{
"cell_type": "code",
"collapsed": false,
"input": [
"from sklearn.feature_extraction.text import TfidfVectorizer"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 38
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"texts = [(u\"Personalmente, e credo condividiate la mia opinione, sento \"\n",
" u\"una dolorosa sensazione di d\u00e9j\u00e0 vu quando vedo queste immagini in televisione.\"),\n",
" (u\"Personal, \u015fi sunt sigur c\u0103 acest lucru este valabil pentru cei \"\n",
" u\"mai mul\u0163i dintre noi, imaginile transmise la televizor \u00eemi trezesc \"\n",
" u\"un dureros sentiment de d\u00e9j\u00e0 vu\"),\n",
" u\"An English sentence that shouldn't be too similar.\"]"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 39
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"vect = TfidfVectorizer(analyzer=\"char_wb\", # n-grams within word boundary\n",
" ngram_range=(3, 9),\n",
" lowercase=True,\n",
" use_idf=False)"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 52
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"X = vect.fit_transform(texts)"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 53
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"from sklearn.metrics import euclidean_distances"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 57
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"import numpy as np\n",
"print(1 - euclidean_distances(X) / np.sqrt(2))  # similarity in [0, 1]; the cosine itself would be 1 - d**2 / 2"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"[[ 1. 0.10266664 0.02078989]\n",
" [ 0.10266664 1. 0.01606852]\n",
" [ 0.02078989 0.01606852 1. ]]\n"
]
}
],
"prompt_number": 65
}
],
"metadata": {}
}
]
}
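The last cell's score is related to cosine similarity, but it is not the same quantity. `TfidfVectorizer` L2-normalizes each row by default, and for unit vectors the squared Euclidean distance satisfies d² = 2 − 2·cos, so cos = 1 − d²/2. The sketch below checks that identity in pure Python on two short phrases taken from the notebook's sentences. The helper names (`char_wb_ngrams`, `l2_normalize`, `cosine`, `euclidean`) are hypothetical, and the n-gram extraction is only a simplified approximation of scikit-learn's `char_wb` analyzer, not its actual implementation.

```python
import math
from collections import Counter


def char_wb_ngrams(text, min_n=3, max_n=9):
    """Count character n-grams inside space-padded words.

    Simplified stand-in for the char_wb analyzer (an assumption, not
    scikit-learn's code): words shorter than n simply yield no n-gram.
    """
    counts = Counter()
    for word in text.lower().split():
        padded = " " + word + " "
        for n in range(min_n, max_n + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    return counts


def l2_normalize(counts):
    """Scale a sparse count vector to unit Euclidean length."""
    norm = math.sqrt(sum(v * v for v in counts.values()))
    return {k: v / norm for k, v in counts.items()}


def cosine(a, b):
    """Dot product of two sparse vectors; equals the cosine for unit vectors."""
    return sum(v * b.get(k, 0.0) for k, v in a.items())


def euclidean(a, b):
    """Euclidean distance between two sparse vectors."""
    keys = set(a) | set(b)
    return math.sqrt(sum((a.get(k, 0.0) - b.get(k, 0.0)) ** 2 for k in keys))


a = l2_normalize(char_wb_ngrams("una dolorosa sensazione di déjà vu"))
b = l2_normalize(char_wb_ngrams("un dureros sentiment de déjà vu"))

d = euclidean(a, b)
cos_direct = cosine(a, b)               # cosine similarity computed directly
cos_from_distance = 1 - d ** 2 / 2      # same value, recovered from the distance
notebook_score = 1 - d / math.sqrt(2)   # the expression used in the notebook

print(cos_direct, cos_from_distance, notebook_score)
```

Because d/√2 lies in [0, 1] for non-negative unit vectors, the notebook's score never exceeds the true cosine and ranks pairs in the same order, so it still works as a similarity measure. If the cosine itself is wanted, `sklearn.metrics.pairwise.cosine_similarity(X)` computes it directly.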