@maxjf1
Last active September 24, 2018 07:15
Information Retrieval assignment
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Assignment 2: Testing term-weighting methods\n",
"\n",
"Term weighting is the process of computing a suitable weight for each term in the documents. In this assignment we test some of these methods on small document matrices, just to illustrate how they apply to the Vector Space Model.\n",
"\n",
"Below is our list of documents."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import math\n",
"import pandas as pd\n",
"\n",
"mydoclist = ['estudei redes de computadores ontem e estudei mal',\n",
" 'gostaria de estudar mais sobre redes',\n",
" 'terminei o trabalho após deixar três computadores ligados em rede',\n",
" 'gosto de usar a rede da casa de Julia',\n",
" 'quando Julia ligar, termino de processar o trabalho',\n",
" 'ela gosta de mim, mas não gosto dela',\n",
" 'gostaria de estudar, mas estudo menos do que gostaria'\n",
" ]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Computing term frequencies\n",
"\n",
"We will count how often each term occurs in each document. The `Counter` class from `collections` is handy for this. See the example below."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Counter({'estudei': 2, 'redes': 1, 'de': 1, 'computadores': 1, 'ontem': 1, 'e': 1, 'mal': 1})\n",
"Counter({'gostaria': 1, 'de': 1, 'estudar': 1, 'mais': 1, 'sobre': 1, 'redes': 1})\n",
"Counter({'terminei': 1, 'o': 1, 'trabalho': 1, 'após': 1, 'deixar': 1, 'três': 1, 'computadores': 1, 'ligados': 1, 'em': 1, 'rede': 1})\n",
"Counter({'de': 2, 'gosto': 1, 'usar': 1, 'a': 1, 'rede': 1, 'da': 1, 'casa': 1, 'Julia': 1})\n",
"Counter({'quando': 1, 'Julia': 1, 'ligar,': 1, 'termino': 1, 'de': 1, 'processar': 1, 'o': 1, 'trabalho': 1})\n",
"Counter({'ela': 1, 'gosta': 1, 'de': 1, 'mim,': 1, 'mas': 1, 'não': 1, 'gosto': 1, 'dela': 1})\n",
"Counter({'gostaria': 2, 'de': 1, 'estudar,': 1, 'mas': 1, 'estudo': 1, 'menos': 1, 'do': 1, 'que': 1})\n"
]
}
],
"source": [
"from collections import Counter\n",
"for doc in mydoclist:\n",
"    tf = Counter()\n",
"    for word in doc.split():\n",
"        tf[word] += 1\n",
"    print(tf)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"All right, let's go! First we build our vocabulary (lexicon). To do so, implement the function below. It receives a list of texts like the one created above and must add the words you care about as search terms. As in the previous assignment, you may use stemming, stopword removal, or anything else you like. The more you shrink the vocabulary, the easier the matrices will be to read later..."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"{'e', 'o', 'd', 'termin', 'comput', 'lig', 'estud', 'jul', 'mim', 'process', 'red', 'a', 'gost', 'deix', 'cas'}\n"
]
}
],
"source": [
"import spacy\n",
"import Stemmer\n",
"\n",
"# 'pt' is the shortcut model name in use when this notebook was written;\n",
"# newer spaCy releases expect a full model name such as 'pt_core_news_sm'\n",
"nlp = spacy.load('pt')\n",
"stemmer = Stemmer.Stemmer('portuguese')\n",
"\n",
"def build_lexicon(corpus):\n",
"    lexicon = set()\n",
"    for doc in corpus:\n",
"        doc = nlp(doc.lower())\n",
"        # keep every token that is neither punctuation nor a stopword\n",
"        lexicon |= set([token.orth_ for token in doc if not token.is_punct and not token.is_stop])\n",
"    # your code here\n",
"    return set(stemmer.stemWords(lexicon))\n",
"\n",
"vocabulary = build_lexicon(mydoclist)\n",
"print(vocabulary)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `tf` function must return the term frequency of a term in a text. For now, we make tf exactly equal to the raw count.\n",
"\n",
"`` Example: tf('estud', 'estudei redes de computadores ontem e estudei mal') ==> 2 ``"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"2\n"
]
}
],
"source": [
"def freq(term, document):\n",
"    document = nlp(document.lower())\n",
"    term = stemmer.stemWord(term)\n",
"    frequency = Counter()\n",
"    stemmed = stemmer.stemWords([token.orth_ for token in document if not token.is_punct and not token.is_stop])\n",
"    for word in stemmed:\n",
"        frequency[word] += 1\n",
"    return frequency[term]\n",
"\n",
"def tf(term, document):\n",
"    return freq(term, document)\n",
"\n",
"print(tf('estudei', mydoclist[0]))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now let's build our document-term matrix. You must compute the frequency of each ``vocabulary`` term in each document."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Nosso vetor do vocabulário é [e, o, d, termin, comput, lig, estud, jul, mim, process, red, a, gost, deix, cas]\n",
"\n",
"O doc é \"estudei redes de computadores ontem e estudei mal\"\n",
"O vetor tf para o doc 1 é [1, 0, 0, 0, 1, 0, 2, 0, 0, 0, 1, 0, 0, 0, 0]\n",
"\n",
"O doc é \"gostaria de estudar mais sobre redes\"\n",
"O vetor tf para o doc 2 é [0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0]\n",
"\n",
"O doc é \"terminei o trabalho após deixar três computadores ligados em rede\"\n",
"O vetor tf para o doc 3 é [0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0]\n",
"\n",
"O doc é \"gosto de usar a rede da casa de Julia\"\n",
"O vetor tf para o doc 4 é [0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 1]\n",
"\n",
"O doc é \"quando Julia ligar, termino de processar o trabalho\"\n",
"O vetor tf para o doc 5 é [0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0]\n",
"\n",
"O doc é \"ela gosta de mim, mas não gosto dela\"\n",
"O vetor tf para o doc 6 é [0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 2, 0, 0]\n",
"\n",
"O doc é \"gostaria de estudar, mas estudo menos do que gostaria\"\n",
"O vetor tf para o doc 7 é [0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 2, 0, 0]\n",
"\n",
"A matriz final é: \n"
]
},
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>0</th>\n",
" <th>1</th>\n",
" <th>2</th>\n",
" <th>3</th>\n",
" <th>4</th>\n",
" <th>5</th>\n",
" <th>6</th>\n",
" <th>7</th>\n",
" <th>8</th>\n",
" <th>9</th>\n",
" <th>10</th>\n",
" <th>11</th>\n",
" <th>12</th>\n",
" <th>13</th>\n",
" <th>14</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>2</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>2</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>2</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>2</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14\n",
"0 1 0 0 0 1 0 2 0 0 0 1 0 0 0 0\n",
"1 0 0 0 0 0 0 1 0 0 0 1 0 1 0 0\n",
"2 0 1 0 1 1 1 0 0 0 0 1 0 0 1 0\n",
"3 0 0 0 0 0 0 0 1 0 0 1 1 1 0 1\n",
"4 0 1 0 1 0 1 0 1 0 1 0 0 0 0 0\n",
"5 0 0 1 0 0 0 0 0 1 0 0 0 2 0 0\n",
"6 0 0 0 0 0 0 2 0 0 0 0 0 2 0 0"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"def createMatrix(document):\n",
"    matrix = []\n",
"    print('Nosso vetor do vocabulário é [' + ', '.join(list(vocabulary)) + ']')\n",
"    # enumerate gives the right document number even when two documents are identical\n",
"    for doc_number, doc in enumerate(document, start=1):\n",
"        # build the vector by walking the vocabulary and taking the tf of each term in the doc\n",
"        tf_vector = []\n",
"        for term in vocabulary:\n",
"            tf_vector.append(tf(term, doc))\n",
"        matrix.append(tf_vector)\n",
"\n",
"        # the lines below are for debugging only. Do not remove.\n",
"        print('\\nO doc é \"' + doc + '\"')\n",
"        tf_vector_string = ', '.join(format(value, 'd') for value in tf_vector)\n",
"        print('O vetor tf para o doc %d é [%s]' % (doc_number, tf_vector_string))\n",
"\n",
"    return matrix\n",
"\n",
"\n",
"matrix = createMatrix(mydoclist)\n",
"\n",
"print('\\nA matriz final é: ')\n",
"pd.DataFrame(matrix)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Normalizing\n",
"Raw term frequency is not a good metric, because a term appearing 2 times in a 4-word document is not the same as a term appearing 4 times in a 100-word document. There are several ways to normalize documents. A very common one is to divide each vector by its norm (also called the $L^2$ norm). If you are unsure how to compute the norm, see http://mathworld.wolfram.com/L2-Norm.html."
]
},
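{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick worked example of this normalization (a sketch; the symbols $w_i$ for the raw weight of term $i$ in a document vector $\\mathbf{w}$, and $\\hat{w}_i$ for its normalized weight, are notation introduced here for illustration only):\n",
"\n",
"$$\\lVert \\mathbf{w} \\rVert_2 = \\sqrt{\\sum_i w_i^2}, \\qquad \\hat{w}_i = \\frac{w_i}{\\lVert \\mathbf{w} \\rVert_2}$$\n",
"\n",
"For document 1, whose only nonzero counts are 1, 1, 2 and 1:\n",
"\n",
"$$\\lVert \\mathbf{w} \\rVert_2 = \\sqrt{1^2 + 1^2 + 2^2 + 1^2} = \\sqrt{7} \\approx 2.6458, \\qquad \\frac{1}{\\sqrt{7}} \\approx 0.378, \\qquad \\frac{2}{\\sqrt{7}} \\approx 0.756$$"
]
},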
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"A nova matriz normalizada:\n"
]
},
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>0</th>\n",
" <th>1</th>\n",
" <th>2</th>\n",
" <th>3</th>\n",
" <th>4</th>\n",
" <th>5</th>\n",
" <th>6</th>\n",
" <th>7</th>\n",
" <th>8</th>\n",
" <th>9</th>\n",
" <th>10</th>\n",
" <th>11</th>\n",
" <th>12</th>\n",
" <th>13</th>\n",
" <th>14</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>0.377964</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.377964</td>\n",
" <td>0.000000</td>\n",
" <td>0.755929</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.377964</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.577350</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.577350</td>\n",
" <td>0.000000</td>\n",
" <td>0.577350</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>0.000000</td>\n",
" <td>0.408248</td>\n",
" <td>0.000000</td>\n",
" <td>0.408248</td>\n",
" <td>0.408248</td>\n",
" <td>0.408248</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.408248</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.408248</td>\n",
" <td>0.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.447214</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.447214</td>\n",
" <td>0.447214</td>\n",
" <td>0.447214</td>\n",
" <td>0.000000</td>\n",
" <td>0.447214</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>0.000000</td>\n",
" <td>0.447214</td>\n",
" <td>0.000000</td>\n",
" <td>0.447214</td>\n",
" <td>0.000000</td>\n",
" <td>0.447214</td>\n",
" <td>0.000000</td>\n",
" <td>0.447214</td>\n",
" <td>0.000000</td>\n",
" <td>0.447214</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.408248</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.408248</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.816497</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.707107</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.707107</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" 0 1 2 3 4 5 6 \\\n",
"0 0.377964 0.000000 0.000000 0.000000 0.377964 0.000000 0.755929 \n",
"1 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.577350 \n",
"2 0.000000 0.408248 0.000000 0.408248 0.408248 0.408248 0.000000 \n",
"3 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 \n",
"4 0.000000 0.447214 0.000000 0.447214 0.000000 0.447214 0.000000 \n",
"5 0.000000 0.000000 0.408248 0.000000 0.000000 0.000000 0.000000 \n",
"6 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.707107 \n",
"\n",
" 7 8 9 10 11 12 13 \\\n",
"0 0.000000 0.000000 0.000000 0.377964 0.000000 0.000000 0.000000 \n",
"1 0.000000 0.000000 0.000000 0.577350 0.000000 0.577350 0.000000 \n",
"2 0.000000 0.000000 0.000000 0.408248 0.000000 0.000000 0.408248 \n",
"3 0.447214 0.000000 0.000000 0.447214 0.447214 0.447214 0.000000 \n",
"4 0.447214 0.000000 0.447214 0.000000 0.000000 0.000000 0.000000 \n",
"5 0.000000 0.408248 0.000000 0.000000 0.000000 0.816497 0.000000 \n",
"6 0.000000 0.000000 0.000000 0.000000 0.000000 0.707107 0.000000 \n",
"\n",
" 14 \n",
"0 0.000000 \n",
"1 0.000000 \n",
"2 0.000000 \n",
"3 0.447214 \n",
"4 0.000000 \n",
"5 0.000000 \n",
"6 0.000000 "
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# receives a vector, computes its norm, and normalizes every term by that norm\n",
"def normalizer(vetor):\n",
"    vetorNormalizado = []\n",
"    norma = 0\n",
"    for peso in vetor:\n",
"        norma += peso ** 2\n",
"    norma = math.sqrt(norma)\n",
"    # an all-zero vector has norm 0; return it unchanged to avoid dividing by zero\n",
"    if norma == 0:\n",
"        return list(vetor)\n",
"    for peso in vetor:\n",
"        vetorNormalizado.append(peso / norma)\n",
"    return vetorNormalizado\n",
"\n",
"# walks the whole document matrix and normalizes each document with the function above\n",
"def matrixNormalizer(matrix):\n",
"    matrix2 = []\n",
"    for vec in matrix:\n",
"        matrix2.append(normalizer(vec))\n",
"    return matrix2\n",
"\n",
"print('A nova matriz normalizada:')\n",
"matrix_normalizada = matrixNormalizer(matrix)\n",
"pd.DataFrame(matrix_normalizada)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Computing the IDF (inverse document frequency)\n",
"Now we compute the IDF value of each term. Remember that IDF is computed as the inverse of the term's document frequency in the collection. So a term that appears in 3 documents (regardless of how many times it appears in each one) could have an IDF of one third."
]
},
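{
"cell_type": "markdown",
"metadata": {},
"source": [
"The definition above can be written as follows (a sketch; $\\mathrm{df}(t)$, the number of documents that contain term $t$, is notation introduced here):\n",
"\n",
"$$\\mathrm{idf}(t) = \\frac{1}{\\mathrm{df}(t)}$$\n",
"\n",
"For example, the stem `estud` occurs in documents 1, 2 and 7, so $\\mathrm{df} = 3$ and $\\mathrm{idf} = 1/3 \\approx 0.333$. This is the simplified form used in this notebook; a widely used variant takes a logarithm instead, $\\log\\left(N / \\mathrm{df}(t)\\right)$, where $N$ is the number of documents in the collection."
]
},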
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0.3333333333333333"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Returns the number of documents in which a given term appears\n",
"def numDocsContaining(word, doclist):\n",
"    doccount = 0\n",
"    for doc in doclist:\n",
"        if freq(word, doc) > 0:\n",
"            doccount += 1\n",
"    return doccount\n",
"\n",
"# compute the IDF of a term here\n",
"def idf(word, doclist):\n",
"    doccount = numDocsContaining(word, doclist)\n",
"    # a term found in no document gets weight 0 instead of raising ZeroDivisionError\n",
"    return 1 / doccount if doccount else 0.0\n",
"\n",
"# Testing...\n",
"idf('estud', mydoclist)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now build a vector that holds the idf value of each vocabulary term."
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>0</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>e</th>\n",
" <td>1.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>o</th>\n",
" <td>0.500000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>d</th>\n",
" <td>1.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>termin</th>\n",
" <td>0.500000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>comput</th>\n",
" <td>0.500000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>lig</th>\n",
" <td>0.500000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>estud</th>\n",
" <td>0.333333</td>\n",
" </tr>\n",
" <tr>\n",
" <th>jul</th>\n",
" <td>0.500000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>mim</th>\n",
" <td>1.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>process</th>\n",
" <td>1.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>red</th>\n",
" <td>0.250000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>a</th>\n",
" <td>1.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>gost</th>\n",
" <td>0.250000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>deix</th>\n",
" <td>1.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>cas</th>\n",
" <td>1.000000</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" 0\n",
"e 1.000000\n",
"o 0.500000\n",
"d 1.000000\n",
"termin 0.500000\n",
"comput 0.500000\n",
"lig 0.500000\n",
"estud 0.333333\n",
"jul 0.500000\n",
"mim 1.000000\n",
"process 1.000000\n",
"red 0.250000\n",
"a 1.000000\n",
"gost 0.250000\n",
"deix 1.000000\n",
"cas 1.000000"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
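],
"source": [
"idf_vetor = []\n",
"for term in vocabulary:\n",
"    idf_vetor.append(idf(term, mydoclist))\n",
"\n",
"# pass the index as a list: it iterates in the same order as the set used in the loop above\n",
"pd.DataFrame(idf_vetor, index=list(vocabulary))"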
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Putting it all together!\n",
"Now we finally build a TF-IDF matrix. For each term of each document in the matrix, multiply its tf value by its idf value. A few suggestions at this point:\n",
"* Our tf(term, doc) function only returns the raw frequency of the term. It would be more interesting to apply a log function to that result, although this is not required. If you want to do so, I suggest overriding tf(term, doc): just define it again below and Python will ignore the earlier definition.\n",
"* You can also override createMatrix so that it already returns the normalized vectors. If you redefine tf(), remember to normalize the weights again!\n",
"* If you are feeling lazy, just write the code below directly, without creating functions :-)"
]
},
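{
"cell_type": "markdown",
"metadata": {},
"source": [
"A worked instance of this multiplication, using the normalized tf values and the idf values already computed in this notebook (the symbol $\\hat{w}_{t,d}$ for the normalized tf of term $t$ in document $d$ is notation assumed here):\n",
"\n",
"$$w_{t,d} = \\hat{w}_{t,d} \\cdot \\mathrm{idf}(t)$$\n",
"\n",
"For document 1, the stem `estud` gets $0.755929 \\times \\tfrac{1}{3} \\approx 0.251976$ and the stem `red` gets $0.377964 \\times \\tfrac{1}{4} \\approx 0.094491$.\n",
"\n",
"If you opt for the log-scaled tf mentioned in the suggestions, one common choice (an assumption here, not something the assignment prescribes) is $1 + \\log(\\mathrm{tf})$ when $\\mathrm{tf} > 0$ and $0$ otherwise."
]
},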
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Nosso vetor do vocabulário é [e, o, d, termin, comput, lig, estud, jul, mim, process, red, a, gost, deix, cas]\n",
"e 1\n",
"o 0\n",
"d 0\n",
"termin 0\n",
"comput 1\n",
"lig 0\n",
"estud 2\n",
"jul 0\n",
"mim 0\n",
"process 0\n",
"red 1\n",
"a 0\n",
"gost 0\n",
"deix 0\n",
"cas 0\n",
"\n",
"O doc é \"estudei redes de computadores ontem e estudei mal\"\n",
"O vetor tf para o doc 1 é [1, 0, 0, 0, 1, 0, 2, 0, 0, 0, 1, 0, 0, 0, 0]\n",
"e 0\n",
"o 0\n",
"d 0\n",
"termin 0\n",
"comput 0\n",
"lig 0\n",
"estud 1\n",
"jul 0\n",
"mim 0\n",
"process 0\n",
"red 1\n",
"a 0\n",
"gost 1\n",
"deix 0\n",
"cas 0\n",
"\n",
"O doc é \"gostaria de estudar mais sobre redes\"\n",
"O vetor tf para o doc 2 é [0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0]\n",
"e 0\n",
"o 1\n",
"d 0\n",
"termin 1\n",
"comput 1\n",
"lig 1\n",
"estud 0\n",
"jul 0\n",
"mim 0\n",
"process 0\n",
"red 1\n",
"a 0\n",
"gost 0\n",
"deix 1\n",
"cas 0\n",
"\n",
"O doc é \"terminei o trabalho após deixar três computadores ligados em rede\"\n",
"O vetor tf para o doc 3 é [0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0]\n",
"e 0\n",
"o 0\n",
"d 0\n",
"termin 0\n",
"comput 0\n",
"lig 0\n",
"estud 0\n",
"jul 1\n",
"mim 0\n",
"process 0\n",
"red 1\n",
"a 1\n",
"gost 1\n",
"deix 0\n",
"cas 1\n",
"\n",
"O doc é \"gosto de usar a rede da casa de Julia\"\n",
"O vetor tf para o doc 4 é [0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 1]\n",
"e 0\n",
"o 1\n",
"d 0\n",
"termin 1\n",
"comput 0\n",
"lig 1\n",
"estud 0\n",
"jul 1\n",
"mim 0\n",
"process 1\n",
"red 0\n",
"a 0\n",
"gost 0\n",
"deix 0\n",
"cas 0\n",
"\n",
"O doc é \"quando Julia ligar, termino de processar o trabalho\"\n",
"O vetor tf para o doc 5 é [0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0]\n",
"e 0\n",
"o 0\n",
"d 1\n",
"termin 0\n",
"comput 0\n",
"lig 0\n",
"estud 0\n",
"jul 0\n",
"mim 1\n",
"process 0\n",
"red 0\n",
"a 0\n",
"gost 2\n",
"deix 0\n",
"cas 0\n",
"\n",
"O doc é \"ela gosta de mim, mas não gosto dela\"\n",
"O vetor tf para o doc 6 é [0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 2, 0, 0]\n",
"e 0\n",
"o 0\n",
"d 0\n",
"termin 0\n",
"comput 0\n",
"lig 0\n",
"estud 2\n",
"jul 0\n",
"mim 0\n",
"process 0\n",
"red 0\n",
"a 0\n",
"gost 2\n",
"deix 0\n",
"cas 0\n",
"\n",
"O doc é \"gostaria de estudar, mas estudo menos do que gostaria\"\n",
"O vetor tf para o doc 7 é [0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 2, 0, 0]\n",
"{'e', 'o', 'd', 'termin', 'comput', 'lig', 'estud', 'jul', 'mim', 'process', 'red', 'a', 'gost', 'deix', 'cas'}\n"
]
},
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>0</th>\n",
" <th>1</th>\n",
" <th>2</th>\n",
" <th>3</th>\n",
" <th>4</th>\n",
" <th>5</th>\n",
" <th>6</th>\n",
" <th>7</th>\n",
" <th>8</th>\n",
" <th>9</th>\n",
" <th>10</th>\n",
" <th>11</th>\n",
" <th>12</th>\n",
" <th>13</th>\n",
" <th>14</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>0.377964</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.188982</td>\n",
" <td>0.000000</td>\n",
" <td>0.251976</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.094491</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.192450</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.144338</td>\n",
" <td>0.000000</td>\n",
" <td>0.144338</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>0.000000</td>\n",
" <td>0.204124</td>\n",
" <td>0.000000</td>\n",
" <td>0.204124</td>\n",
" <td>0.204124</td>\n",
" <td>0.204124</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.102062</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.408248</td>\n",
" <td>0.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.223607</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.111803</td>\n",
" <td>0.447214</td>\n",
" <td>0.111803</td>\n",
" <td>0.000000</td>\n",
" <td>0.447214</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>0.000000</td>\n",
" <td>0.223607</td>\n",
" <td>0.000000</td>\n",
" <td>0.223607</td>\n",
" <td>0.000000</td>\n",
" <td>0.223607</td>\n",
" <td>0.000000</td>\n",
" <td>0.223607</td>\n",
" <td>0.000000</td>\n",
" <td>0.447214</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.408248</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.408248</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.204124</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.235702</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.176777</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" 0 1 2 3 4 5 6 \\\n",
"0 0.377964 0.000000 0.000000 0.000000 0.188982 0.000000 0.251976 \n",
"1 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.192450 \n",
"2 0.000000 0.204124 0.000000 0.204124 0.204124 0.204124 0.000000 \n",
"3 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 \n",
"4 0.000000 0.223607 0.000000 0.223607 0.000000 0.223607 0.000000 \n",
"5 0.000000 0.000000 0.408248 0.000000 0.000000 0.000000 0.000000 \n",
"6 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.235702 \n",
"\n",
" 7 8 9 10 11 12 13 \\\n",
"0 0.000000 0.000000 0.000000 0.094491 0.000000 0.000000 0.000000 \n",
"1 0.000000 0.000000 0.000000 0.144338 0.000000 0.144338 0.000000 \n",
"2 0.000000 0.000000 0.000000 0.102062 0.000000 0.000000 0.408248 \n",
"3 0.223607 0.000000 0.000000 0.111803 0.447214 0.111803 0.000000 \n",
"4 0.223607 0.000000 0.447214 0.000000 0.000000 0.000000 0.000000 \n",
"5 0.000000 0.408248 0.000000 0.000000 0.000000 0.204124 0.000000 \n",
"6 0.000000 0.000000 0.000000 0.000000 0.000000 0.176777 0.000000 \n",
"\n",
" 14 \n",
"0 0.000000 \n",
"1 0.000000 \n",
"2 0.000000 \n",
"3 0.447214 \n",
"4 0.000000 \n",
"5 0.000000 \n",
"6 0.000000 "
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"def tf(term, document):\n",
"    result = freq(term, document)\n",
"    print(term, result)\n",
"    return result\n",
"\n",
"def createNormalizedMatrix(document):\n",
"    return matrixNormalizer(createMatrix(document))\n",
"\n",
"new_matriz = []\n",
"\n",
"for doc in createNormalizedMatrix(mydoclist):\n",
"    items = []\n",
"    for term in range(len(doc)):\n",
"        items.append(doc[term] * idf_vetor[term])\n",
"    new_matriz.append(items)\n",
"\n",
"print(vocabulary)\n",
"pd.DataFrame(new_matriz)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Assignment: now the real work begins :)\n",
"\n",
"Several term-weighting methods exist. In this assignment, you must show in your notebook, step by step, the implementation of two other term-weighting methods:\n",
"* Term Discrimination Model\n",
"* Signal-to-Noise Ratio\n",
"\n",
"The formulas for these models are in the lecture notes. You only need to implement them. I recommend reusing the functions already implemented above as much as possible.\n",
"\n",
"To make your work really nice, it would be interesting to visualize these weight matrices after applying each method, to show how they change. Two suggestions: (1) draw a bar chart for each document, where each bar represents the weight of a term, or (2) plot a matrix whose colors represent the weight of each term in the document (see https://about.sofia2.com/2017/09/13/analitica-de-datos-con-python-y-sofia2-24-graficos-de-relacion/)\n",
"\n",
"Happy coding :)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"-----\n",
"### Helpers\n",
"First we define some helper functions, such as math functions for vectors."
]
},
{
"cell_type": "code",
"execution_count": 193,
"metadata": {
"scrolled": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"14\n",
"1.4142135623730951\n",
"0.7071067811865475 0.7071067811865476\n"
]
},
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>0</th>\n",
" <th>1</th>\n",
" <th>2</th>\n",
" <th>3</th>\n",
" <th>4</th>\n",
" <th>5</th>\n",
" <th>6</th>\n",
" <th>7</th>\n",
" <th>8</th>\n",
" <th>9</th>\n",
" <th>10</th>\n",
" <th>11</th>\n",
" <th>12</th>\n",
" <th>13</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>2</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>2</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>2</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>2</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" 0 1 2 3 4 5 6 7 8 9 10 11 12 13\n",
"0 0 0 0 1 0 2 0 0 0 1 0 0 0 0\n",
"1 0 0 0 0 0 1 0 0 0 1 0 1 0 0\n",
"2 1 0 1 1 1 0 0 0 0 1 0 0 1 0\n",
"3 0 0 0 0 0 0 1 0 0 1 1 1 0 1\n",
"4 1 0 1 0 1 0 1 0 1 0 0 0 0 0\n",
"5 0 1 0 0 0 0 0 1 0 0 0 2 0 0\n",
"6 0 0 0 0 0 2 0 0 0 0 0 2 0 0"
]
},
"execution_count": 193,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import matplotlib\n",
"import matplotlib.pyplot as plt\n",
"import seaborn as sns\n",
"matplotlib.rcParams['figure.figsize'] = (20.0, 5.0)\n",
"sns.set()\n",
"\n",
"def displayMatrix(matrix, vocabulary): \n",
" labels = []\n",
" for i in range(len(matrix)):\n",
" labels.append('Doc '+str(i+1))\n",
" sns.heatmap(matrix,yticklabels=labels, xticklabels=vocabulary, square=True, annot=True)\n",
"\n",
"# Obtém um objeto range dos termos da matriz\n",
"def getTermsRange(matrix):\n",
" return range(len(matrix[0]))\n",
"\n",
"# Produto de vetores\n",
"def vecProduct(v1, v2):\n",
" sum = 0\n",
" for i in range(len(v1)):\n",
" sum+=v1[i]*v2[i]\n",
" return sum\n",
"\n",
"# Módulo de vetor\n",
"def vecModule(vector):\n",
" sum = 0\n",
" for val in vector:\n",
" sum+= val**2\n",
" return math.sqrt(sum)\n",
"\n",
"# Obtem o angulo ou cosseno entre dois vetores\n",
"def vecAngle(v1, v2):\n",
" return vecProduct(v1, v2) / (vecModule(v1)*vecModule(v2))\n",
"\n",
"\n",
"# Documentos de testes do slide\n",
"testDoc1 = [\n",
" [10, 1, 0],\n",
" [9, 2, 10],\n",
" [8, 1, 1],\n",
" [8, 1, 50],\n",
" [19, 2, 15],\n",
" [9, 2, 0]\n",
"]\n",
"\n",
"testDoc2 = [\n",
" [10, 1, 1],\n",
" [9, 2, 10],\n",
" [8, 1, 1],\n",
" [8, 1, 50],\n",
" [19, 2, 15],\n",
" [9, 2, 1]\n",
"]\n",
"\n",
"print(vecProduct([3,4], [-2, 5]))\n",
"print(vecModule([1,1]))\n",
"print(vecAngle([2,2], [0,2]), math.sqrt(2)/2)\n",
"pd.DataFrame(matrix)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Term Discrimination\n",
"Definimos os métodos de obtenção de centróide, similaridade entre termos, média das similaridades.\n",
"\n",
"Em seguida, implementamos o método de obtenção da lista de discriminação dos termos, e aplica-se estes pesos na matriz tf"
]
},
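{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a compact summary of what the code in the next cell computes (the symbols $N$, $d_i$, $c$, $Q$, $Q_k$ and $DV_k$ are notation introduced here, not names from the notebook): with $N$ documents $d_1, \\dots, d_N$ and a centroid $c$ proportional to their mean,\n",
"\n",
"$$Q = \\frac{1}{N} \\sum_{i=1}^{N} \\cos(d_i, c), \\qquad DV_k = Q_k - Q, \\qquad w_{ik} = tf_{ik} \\cdot DV_k$$\n",
"\n",
"where $Q_k$ is the same average computed after removing term $k$ from every document. A term with $DV_k > 0$ is a good discriminator, since removing it makes the documents look more alike; a term with $DV_k < 0$ is a poor one. Because the cosine is unchanged when $c$ is multiplied by a positive constant, only the direction of the centroid matters."
]
},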
{
"cell_type": "code",
"execution_count": 183,
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>0</th>\n",
" <th>1</th>\n",
" <th>2</th>\n",
" <th>3</th>\n",
" <th>4</th>\n",
" <th>5</th>\n",
" <th>6</th>\n",
" <th>7</th>\n",
" <th>8</th>\n",
" <th>9</th>\n",
" <th>10</th>\n",
" <th>11</th>\n",
" <th>12</th>\n",
" <th>13</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>-0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>-0.000000</td>\n",
" <td>0.003007</td>\n",
" <td>-0.000000</td>\n",
" <td>0.002907</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>-0.015417</td>\n",
" <td>0.000000</td>\n",
" <td>-0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>-0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>-0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>-0.000000</td>\n",
" <td>0.001454</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>-0.015417</td>\n",
" <td>0.000000</td>\n",
" <td>-0.053579</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>-0.000639</td>\n",
" <td>0.000000</td>\n",
" <td>-0.000639</td>\n",
" <td>0.003007</td>\n",
" <td>-0.000639</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>-0.015417</td>\n",
" <td>0.000000</td>\n",
" <td>-0.000000</td>\n",
" <td>0.003887</td>\n",
" <td>0.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>-0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>-0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>-0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.001183</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>-0.015417</td>\n",
" <td>0.006429</td>\n",
" <td>-0.053579</td>\n",
" <td>0.000000</td>\n",
" <td>0.006429</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>-0.000639</td>\n",
" <td>0.000000</td>\n",
" <td>-0.000639</td>\n",
" <td>0.000000</td>\n",
" <td>-0.000639</td>\n",
" <td>0.000000</td>\n",
" <td>0.001183</td>\n",
" <td>0.000000</td>\n",
" <td>0.002695</td>\n",
" <td>-0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>-0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>-0.000000</td>\n",
" <td>0.004438</td>\n",
" <td>-0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>-0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.004438</td>\n",
" <td>0.000000</td>\n",
" <td>-0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>-0.107157</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>-0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>-0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>-0.000000</td>\n",
" <td>0.002907</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>-0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>-0.107157</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" 0 1 2 3 4 5 6 \\\n",
"0 -0.000000 0.000000 -0.000000 0.003007 -0.000000 0.002907 0.000000 \n",
"1 -0.000000 0.000000 -0.000000 0.000000 -0.000000 0.001454 0.000000 \n",
"2 -0.000639 0.000000 -0.000639 0.003007 -0.000639 0.000000 0.000000 \n",
"3 -0.000000 0.000000 -0.000000 0.000000 -0.000000 0.000000 0.001183 \n",
"4 -0.000639 0.000000 -0.000639 0.000000 -0.000639 0.000000 0.001183 \n",
"5 -0.000000 0.004438 -0.000000 0.000000 -0.000000 0.000000 0.000000 \n",
"6 -0.000000 0.000000 -0.000000 0.000000 -0.000000 0.002907 0.000000 \n",
"\n",
" 7 8 9 10 11 12 13 \n",
"0 0.000000 0.000000 -0.015417 0.000000 -0.000000 0.000000 0.000000 \n",
"1 0.000000 0.000000 -0.015417 0.000000 -0.053579 0.000000 0.000000 \n",
"2 0.000000 0.000000 -0.015417 0.000000 -0.000000 0.003887 0.000000 \n",
"3 0.000000 0.000000 -0.015417 0.006429 -0.053579 0.000000 0.006429 \n",
"4 0.000000 0.002695 -0.000000 0.000000 -0.000000 0.000000 0.000000 \n",
"5 0.004438 0.000000 -0.000000 0.000000 -0.107157 0.000000 0.000000 \n",
"6 0.000000 0.000000 -0.000000 0.000000 -0.107157 0.000000 0.000000 "
]
},
"execution_count": 183,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import seaborn as sns\n",
"\n",
"# Obtém o centróide de uma matriz de documentos\n",
"def getCentroid(matrix):\n",
" centroid = []\n",
" size = getTermsRange(matrix)\n",
" for i in size:\n",
" centroid.append(0)\n",
"\n",
" for doc in matrix:\n",
" for i in size:\n",
" centroid[i]+=doc[i]\n",
"\n",
" for i in size:\n",
" centroid[i] /= len(matrix[0])\n",
" return centroid\n",
"\n",
"# Retorna a similaridade entre a frequencia de 2 documentos\n",
"def similarity(doc1, doc2):\n",
" return vecAngle(doc1, doc2)\n",
"\n",
"# Obtém a média das similaridades\n",
"def avgSim(matrix):\n",
" avg = 0\n",
" centroid = getCentroid(matrix)\n",
" for doc in matrix:\n",
" avg+= similarity(doc, centroid)\n",
" return avg/len(matrix)\n",
"\n",
"# Obtém a lista de discriminação dos termos\n",
"def getDiscList(matrix):\n",
" discList = []\n",
" avg = avgSim(matrix)\n",
" for term in getTermsRange(matrix):\n",
" discMatrix = []\n",
" for doc in matrix:\n",
" newDoc = doc.copy()\n",
" del newDoc[term]\n",
" discMatrix.append(newDoc)\n",
" discList.append(avgSim(discMatrix) - avg)\n",
" return discList\n",
"\n",
"# Obtém uma matriz de pesos aplicando discriminação\n",
"def getDiscMatrix(matrix):\n",
" discList = getDiscList(matrix)\n",
" discMatrix = []\n",
" termsRange = getTermsRange(matrix)\n",
" for doc in matrix:\n",
" discDoc = []\n",
" for i in termsRange:\n",
" discDoc.append(doc[i] * discList[i])\n",
" discMatrix.append(discDoc)\n",
" return discMatrix\n",
"\n",
"pd.DataFrame(getDiscMatrix(matrix))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Relação sinal-ruído\n",
"Nesta são feitos alguns cálculos para se obter a relação sinal-ruído da matriz de tf.\n",
"\n",
"É importante fazer o tratamento no cálculo dos logs, pois o valor pode estar zerado."
]
},
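{
"cell_type": "markdown",
"metadata": {},
"source": [
"The weighting in the next cell amounts to the following (the symbols are notation introduced here, and $0 \\log_2 0$ is taken to be $0$):\n",
"\n",
"$$tf_k = \\sum_i tf_{ik}, \\qquad p_{ik} = \\frac{tf_{ik}}{tf_k}, \\qquad \\mathrm{noise}_k = -\\sum_i p_{ik} \\log_2 p_{ik}$$\n",
"\n",
"$$\\mathrm{signal}_k = \\log_2 tf_k - \\mathrm{noise}_k, \\qquad w_{ik} = tf_{ik} \\cdot \\mathrm{signal}_k$$\n",
"\n",
"As a small worked case, a term that occurs once in each of two documents has $tf_k = 2$ and $p_{ik} = 1/2$, so $\\mathrm{noise}_k = 1$ and $\\mathrm{signal}_k = \\log_2 2 - 1 = 0$. A term that occurs twice in a single document has $p_{ik} = 1$, so $\\mathrm{noise}_k = 0$ and $\\mathrm{signal}_k = 1$. More generally, any term that occurs exactly once in each document where it appears gets a signal of zero."
]
},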
{
"cell_type": "code",
"execution_count": 185,
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"image/png": "\n",
"text/plain": [
"<Figure size 1368x360 with 2 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# Obtém a lista da soma das frequências dos termos\n",
"def getTfkList(matrix):\n",
" tfkList = []\n",
" termsRange = getTermsRange(matrix)\n",
" for i in termsRange:\n",
" tfkList.append(0)\n",
" for doc in matrix:\n",
" for i in termsRange:\n",
" tfkList[i]+= doc[i]\n",
" return tfkList\n",
"\n",
"# Obtém a matriz de probabilidade do termo no documento\n",
"def getProbMatrix(matrix):\n",
" probMatrix = []\n",
" tfkList = getTfkList(matrix)\n",
" for doc in matrix:\n",
" probDoc = []\n",
" for i, term in enumerate(doc):\n",
" probDoc.append(term/tfkList[i])\n",
" probMatrix.append(probDoc)\n",
" return probMatrix\n",
"\n",
"# Obtém a matriz de informação\n",
"def getInfo(matrix):\n",
" probMatrix = getProbMatrix(matrix)\n",
" infoMatrix = []\n",
" for doc in probMatrix:\n",
" infoDoc = []\n",
" for i, term in enumerate(doc):\n",
" infoDoc.append(term if term == 0 else -term * math.log(term, 2))\n",
" infoMatrix.append(infoDoc)\n",
" return infoMatrix\n",
"\n",
"# Obtém a média de informação dos termos\n",
"def getAvgInfo(matrix):\n",
" infoMatrix = getInfo(matrix)\n",
" avgInfo = []\n",
" for i in infoMatrix[0]:\n",
" avgInfo.append(0)\n",
" \n",
" for doc in infoMatrix:\n",
" for i, term in enumerate(doc):\n",
" avgInfo[i]+= term\n",
" return avgInfo\n",
"\n",
"# Obtém o ruído dos termos\n",
"def getNoise(matrix):\n",
" noise = getAvgInfo(matrix)\n",
" for i in range(len(noise)):\n",
" noise[i]*=-1\n",
" return noise\n",
"\n",
"# Obtém o sinal dos termos\n",
"def getSignal(matrix):\n",
" noise = getNoise(matrix)\n",
" tfk = getTfkList(matrix)\n",
" signal = []\n",
" for i, docNoise in enumerate(noise):\n",
" signal.append(math.log(tfk[i], 2)+ docNoise)\n",
" return signal\n",
"\n",
"# Obtém a matriz tf com peso em sinal-ruído\n",
"def getSignalNoiseMatrix(matrix):\n",
" signal = getSignal(matrix)\n",
" sigNoiseMatrix = []\n",
" for doc in matrix:\n",
" sigNoiseDoc = []\n",
" for i, term in enumerate(doc):\n",
" sigNoiseDoc.append(term * signal[i])\n",
" sigNoiseMatrix.append(sigNoiseDoc)\n",
" return sigNoiseMatrix\n",
" \n",
"\n",
"displayMatrix(getSignalNoiseMatrix(matrix), vocabulary)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Análise geral"
]
},
{
"cell_type": "code",
"execution_count": 186,
"metadata": {
"scrolled": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Matriz tf\n"
]
},
{
"data": {
"image/png": "\n",
"text/plain": [
"<Figure size 1368x360 with 2 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"print(\"Matriz tf\")\n",
"displayMatrix(matrix, vocabulary)"
]
},
{
"cell_type": "code",
"execution_count": 187,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Matriz TF normalizada\n"
]
},
{
"data": {
"image/png": "\n",
"text/plain": [
"<Figure size 1368x360 with 2 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"print(\"Matriz TF normalizada\")\n",
"\n",
"displayMatrix(matrix_normalizada, vocabulary)"
]
},
{
"cell_type": "code",
"execution_count": 188,
"metadata": {
"scrolled": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Matriz TFxIDF\n"
]
},
{
"data": {
"image/png": "\n",
"text/plain": [
"<Figure size 1368x360 with 2 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"print(\"Matriz TFxIDF\")\n",
"\n",
"displayMatrix(new_matriz, vocabulary)"
]
},
{
"cell_type": "code",
"execution_count": 189,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Matriz Discriminatória\n"
]
},
{
"data": {
"image/png": "\n",
"text/plain": [
"<Figure size 1368x360 with 2 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"print(\"Matriz Discriminatória\")\n",
"displayMatrix(getDiscMatrix(matrix), vocabulary)"
]
},
{
"cell_type": "code",
"execution_count": 194,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Matriz Sinal-Ruído\n"
]
},
{
"data": {
"image/png": "\n",
"text/plain": [
"<Figure size 1440x360 with 2 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"print(\"Matriz Sinal-Ruído\")\n",
"\n",
"displayMatrix(getSignalNoiseMatrix(matrix), vocabulary)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Descreva aqui sua análise sobre as técnicas acima. Como a tua matriz mudou, qual método você considera mais interessante (gerou melhor distribuição dos pesos), etc? É uma análise empírica que você fará, apenas com sua impressão mesmo. \n",
"\n",
"Nota: talvez a coleção inicial de documentos não seja a melhor para evidenciar essa diferença das técnicas. Neste caso, você está livre para alterar a coleção e definir as que forem mais interessantes para a sua análise, ok?"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Através da execução das técnicas de discriminação e sinal-ruído, é possível notar que apenas o TF pode não ser o suficiente. \n",
"\n",
"A técnica TFxIDF trouxe boas pontuações para os termos nos documentos, possibilitando julgá-los em cima disto.\n",
"\n",
"A técnica de discriminação trouxe resultados interessantes para estes documentos, pois por ela foi possível identificar os termos que são bons discriminantes e os que eram ruíns.\n",
"\n",
"Já a técnica Sinal-Ruído, por conta do tamanho dos documentos aparenta não ter trazido resultados tão bons pois poucos termos tiveram um peso relevante."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.5"
}
},
"nbformat": 4,
"nbformat_minor": 2
}