Skip to content

Instantly share code, notes, and snippets.

@lbourbon
Last active May 4, 2017 22:33
Show Gist options
  • Save lbourbon/5ef76a9c3b6fb848c08199e1c09c9b19 to your computer and use it in GitHub Desktop.
Save lbourbon/5ef76a9c3b6fb848c08199e1c09c9b19 to your computer and use it in GitHub Desktop.
Display the source blob
Display the rendered blob
Raw
{
"cells": [
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"import pandas as pd \n",
"import numpy as np "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# INTRODUÇÃO"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As unidades básicas de trabalho com Pandas são as <b>'Series'</b> e os <b>'DataFrames'</b>. \n",
"\n",
"Series nada mais são do que um conjunto de elementos de 1 dimensão com índices. Series podem conter basicamente qualquer objeto: int , float, string, etc...\n",
"A forma mais fácil para criar uma series é através da função <i>pd.Series(data=, index=):</i>"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"0 5\n",
"1 6\n",
"2 8\n",
"3 -3\n",
"4 9\n",
"dtype: int64 \n",
"\n",
"Campeão Flamengo\n",
"Terceiro Bangu\n",
"Vice Vasco\n",
"dtype: object \n",
"\n",
"0 0.851781\n",
"2 0.086502\n",
"4 0.884759\n",
"6 0.712156\n",
"dtype: float64 \n",
"\n",
"a -9\n",
"b -9\n",
"c -9\n",
"d -9\n",
"dtype: int64\n"
]
}
],
"source": [
"# usando uma lista\n",
"s = pd.Series (data = [5, 6, 8, -3, 9])\n",
"print (s, '\\n')\n",
"\n",
"# usando um dicionário\n",
"d = {'Campeão': 'Flamengo', 'Vice':'Vasco', 'Terceiro':'Bangu'}\n",
"s2 = pd.Series(d)\n",
"print (s2, '\\n')\n",
"\n",
"# usando um numpy array\n",
"dados = np.random.random(4)\n",
"indices = np.arange(0, 8, 2)\n",
"s3 = pd.Series(data = dados, index = indices)\n",
"print(s3, '\\n')\n",
"\n",
"# usando uma variável escalável\n",
"s4 = pd.Series(-9, index=['a', 'b', 'c', 'd'])\n",
"print (s4)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## <u>DataFrames: </u>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
" são simplesmente um conjunto de séries que compartilham o mesmo índice. Similiar a uma tabela ou planilha do excel. É objeto Pandas mais utilizado! Por isso é muito importante dominar todas as operações que envolvam dataframes. Existem várias formas de criar dataframes, vamos ver as mais importantes:"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": true
},
"source": [
"1) Através da função: <i>pd.DataFrame(data= , index= , columns=)"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" Times Cariocas Times Paulistas Times Ruins\n",
"0 Flamengo São Paulo Palmeiras\n",
"1 Fluminense Santos Vasco\n",
"2 Botafogo Corinthians Remo \n",
"\n",
" 60 26 50\n",
"0 0.671725 0.683336 0.662552\n",
"1 0.627958 0.137151 0.603874\n",
"2 0.479381 0.764465 0.963936\n",
"3 0.775946 0.766948 0.934370\n",
"4 0.722976 0.156485 0.384721 \n",
"\n",
" A B C\n",
"2 9 3 6\n",
"3 6 2 9\n",
"4 6 1 6\n"
]
}
],
"source": [
"# Data aceita diferentes inputs:\n",
"\n",
"# Dicionário\n",
"d = {'Times Paulistas': ['São Paulo', 'Santos', 'Corinthians'], \n",
" 'Times Cariocas': ['Flamengo', 'Fluminense', 'Botafogo'], \n",
" 'Times Ruins': ['Palmeiras', 'Vasco', 'Remo']}\n",
"df = pd.DataFrame(d)\n",
"print(df, '\\n')\n",
"\n",
"# Arrays\n",
"dados = np.random.random((5,3))\n",
"df2 = pd.DataFrame(data = dados, index = np.arange(5), columns = np.random.randint(1, 100, 3))\n",
"print (df2, '\\n')\n",
"\n",
"# Listas ou Tuplas\n",
"listas = [[9,3,6], [6,2,9], [6,1,6]]\n",
"df3 = pd.DataFrame(listas, index = (2,3,4), columns = ['A', 'B', 'C'])\n",
"print (df3)"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": true
},
"source": [
"2) Por leitura de arquivo de diferentes formatos com os métodos: \n",
" <i>pd.read_csv();\n",
" pd.read_excel();\n",
" pd.read_pickle();\n",
" pd.read_sql();\n",
" pd.read_json();\n",
" pd.read_html() e outros...\n",
" "
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>SepalLengthCm</th>\n",
" <th>SepalWidthCm</th>\n",
" <th>PetalLengthCm</th>\n",
" <th>PetalWidthCm</th>\n",
" <th>Species</th>\n",
" </tr>\n",
" <tr>\n",
" <th>Id</th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>5.1</td>\n",
" <td>3.5</td>\n",
" <td>1.4</td>\n",
" <td>0.2</td>\n",
" <td>Iris-setosa</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>4.9</td>\n",
" <td>3.0</td>\n",
" <td>1.4</td>\n",
" <td>0.2</td>\n",
" <td>Iris-setosa</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>4.7</td>\n",
" <td>3.2</td>\n",
" <td>1.3</td>\n",
" <td>0.2</td>\n",
" <td>Iris-setosa</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>4.6</td>\n",
" <td>3.1</td>\n",
" <td>1.5</td>\n",
" <td>0.2</td>\n",
" <td>Iris-setosa</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>5.0</td>\n",
" <td>3.6</td>\n",
" <td>1.4</td>\n",
" <td>0.2</td>\n",
" <td>Iris-setosa</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" SepalLengthCm SepalWidthCm PetalLengthCm PetalWidthCm Species\n",
"Id \n",
"1 5.1 3.5 1.4 0.2 Iris-setosa\n",
"2 4.9 3.0 1.4 0.2 Iris-setosa\n",
"3 4.7 3.2 1.3 0.2 Iris-setosa\n",
"4 4.6 3.1 1.5 0.2 Iris-setosa\n",
"5 5.0 3.6 1.4 0.2 Iris-setosa"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# basta passar a localização do arquivo para criar um novo dataframe\n",
"\n",
"df = pd.read_csv('iris.csv', index_col='Id')\n",
"df.head() # o método head() retorna as primeiras cinco linhas do dataframe"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": true
},
"source": [
"Veja que só passamos como argumento 'iris.csv', isso porque o arquivo csv estava na mesma pasta que o nosso código, caso contrário seria necessário passar todo o caminho, exemplo: \"c:/users/exemplo/pasta/arquivo.csv\".\n",
"O dataset (conjunto de dados) iris.csv foi baixado de kaggle.com e contém características de três espécies de flores.\n",
"O parâmetro index_col foi usado para selecionar uma coluna do arquivo csv que contém o índice (opcional)."
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": true
},
"source": [
"Todos os outros métodos são bem semelhantes, o que muda um pouco é o read_html(). Veja o exemplo:"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Quantidade de tabelas: 58\n"
]
}
],
"source": [
"# Criamos um dataframe chamado fut e passamos o link que contém a tabela que queremos:\n",
"fut = pd.read_html(\"https://pt.wikipedia.org/wiki/Campeonato_Carioca_de_Futebol\")\n",
"print(\"Quantidade de tabelas:\", len(fut))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Perceba que o Pandas pega todas as tabelas presentes na página, como só queremos uma delas, selecionamos a escolhida:"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>0</th>\n",
" <th>1</th>\n",
" <th>2</th>\n",
" <th>3</th>\n",
" <th>4</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>Clube</td>\n",
" <td>Títulos</td>\n",
" <td>Vices</td>\n",
" <td>Terceiro lugar</td>\n",
" <td>Quarto lugar</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>Flamengo</td>\n",
" <td>33 (1914, 1915, 1920, 1921, 1925, 1927, 1939, ...</td>\n",
" <td>31 (1912(2), 1919, 1922, 1923, 1924(2), 1932, ...</td>\n",
" <td>14 (1913, 1917, 1935(3), 1946, 1957, 1964, 196...</td>\n",
" <td>14 (1916, 1918, 1928, 1945, 1948, 1949, 1951, ...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>Fluminense</td>\n",
" <td>31 (1906, 1907(1), 1908, 1909, 1911, 1917, 191...</td>\n",
" <td>21 (1910, 1915, 1920, 1925, 1927, 1933(2), 193...</td>\n",
" <td>23 (1922, 1926, 1942, 1948, 1952, 1962, 1965, ...</td>\n",
" <td>19 (1914, 1923, 1929, 1939, 1944, 1947, 1955, ...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>Vasco da Gama</td>\n",
" <td>24 (1923, 1924(1), 1929, 1934(1), 1936(1), 194...</td>\n",
" <td>24 (1926, 1928, 1930, 1931, 1935(2), 1944, 194...</td>\n",
" <td>13 (1925, 1933(3), 1937, 1940, 1955, 1972, 197...</td>\n",
" <td>18 (1927, 1938, 1941, 1943, 1953, 1954, 1957, ...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>Botafogo</td>\n",
" <td>20 (1907(1), 1910, 1912(1), 1930, 1932, 1933(1...</td>\n",
" <td>20 (1908, 1909, 1913, 1914, 1916, 1918, 1939, ...</td>\n",
" <td>21 (1919, 1928, 1938, 1941, 1944, 1949, 1951, ...</td>\n",
" <td>26 (1906, 1915, 1920, 1922, 1924(4), 1925, 193...</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" 0 1 \\\n",
"0 Clube Títulos \n",
"1 Flamengo 33 (1914, 1915, 1920, 1921, 1925, 1927, 1939, ... \n",
"2 Fluminense 31 (1906, 1907(1), 1908, 1909, 1911, 1917, 191... \n",
"3 Vasco da Gama 24 (1923, 1924(1), 1929, 1934(1), 1936(1), 194... \n",
"4 Botafogo 20 (1907(1), 1910, 1912(1), 1930, 1932, 1933(1... \n",
"\n",
" 2 \\\n",
"0 Vices \n",
"1 31 (1912(2), 1919, 1922, 1923, 1924(2), 1932, ... \n",
"2 21 (1910, 1915, 1920, 1925, 1927, 1933(2), 193... \n",
"3 24 (1926, 1928, 1930, 1931, 1935(2), 1944, 194... \n",
"4 20 (1908, 1909, 1913, 1914, 1916, 1918, 1939, ... \n",
"\n",
" 3 \\\n",
"0 Terceiro lugar \n",
"1 14 (1913, 1917, 1935(3), 1946, 1957, 1964, 196... \n",
"2 23 (1922, 1926, 1942, 1948, 1952, 1962, 1965, ... \n",
"3 13 (1925, 1933(3), 1937, 1940, 1955, 1972, 197... \n",
"4 21 (1919, 1928, 1938, 1941, 1944, 1949, 1951, ... \n",
"\n",
" 4 \n",
"0 Quarto lugar \n",
"1 14 (1916, 1918, 1928, 1945, 1948, 1949, 1951, ... \n",
"2 19 (1914, 1923, 1929, 1939, 1944, 1947, 1955, ... \n",
"3 18 (1927, 1938, 1941, 1943, 1953, 1954, 1957, ... \n",
"4 26 (1906, 1915, 1920, 1922, 1924(4), 1925, 193... "
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"fut = fut[5] # veremos mais para frente como selecionar os dados de interesse no df e como torná-lo mais apresentável\n",
"fut.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Output:"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Pandas dá a opção de salvar nosso df em diferentes formatos: (análogo aos métodos de leitura)\n",
"\n",
" to_csv(); to_excel(), to_pickle(), to_json(), to_sql(), to_html e outros..."
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Aquela tabela que lemos no formato .html, podemos salvar assim:\n",
"\n",
"fut.to_excel('Tabela dos Campeões.xlsx') # Excel\n",
"fut.to_csv('Tabela dos Campeões.csv') # Arquivo .csv\n",
"fut.to_pickle('Tabela dos Campeões.pickle') # Pickle"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": true
},
"source": [
"# Atributos e Funções:"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Podemos obter muitas informações dos nossos dataframes, para isso basta conhecer as principais funções disponíveis:"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
".head() e .tail()"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" SepalLengthCm SepalWidthCm PetalLengthCm PetalWidthCm Species\n",
"Id \n",
"1 5.1 3.5 1.4 0.2 Iris-setosa\n",
"2 4.9 3.0 1.4 0.2 Iris-setosa\n",
"3 4.7 3.2 1.3 0.2 Iris-setosa\n",
"4 4.6 3.1 1.5 0.2 Iris-setosa\n",
" SepalLengthCm SepalWidthCm PetalLengthCm PetalWidthCm Species\n",
"Id \n",
"148 6.5 3.0 5.2 2.0 Iris-virginica\n",
"149 6.2 3.4 5.4 2.3 Iris-virginica\n",
"150 5.9 3.0 5.1 1.8 Iris-virginica\n"
]
}
],
"source": [
"# Criamos nosso dataframe\n",
"df = pd.read_csv('iris.csv', index_col='Id')\n",
"\n",
"# Como já vimos, para retornar as n primeiras usamos:\n",
"print(df.head(4)) # se não passar nenhum valor, o padrão é 5\n",
"\n",
"# Para ver as últimas linhas do nosso df:\n",
"print(df.tail(3))"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Index(['SepalLengthCm', 'SepalWidthCm', 'PetalLengthCm', 'PetalWidthCm',\n",
" 'Species'],\n",
" dtype='object')\n",
"Int64Index([ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10,\n",
" ...\n",
" 141, 142, 143, 144, 145, 146, 147, 148, 149, 150],\n",
" dtype='int64', name='Id', length=150)\n"
]
}
],
"source": [
"# Para obter as colunas de seu dataframe:\n",
"\n",
"print (df.columns)\n",
"\n",
"# Para obter os índices de seu dataframe:\n",
"\n",
"print (df.index)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Para obter informações rápidas, use os métodos .info() e .describe()"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"<class 'pandas.core.frame.DataFrame'>\n",
"Int64Index: 150 entries, 1 to 150\n",
"Data columns (total 5 columns):\n",
"SepalLengthCm 150 non-null float64\n",
"SepalWidthCm 150 non-null float64\n",
"PetalLengthCm 150 non-null float64\n",
"PetalWidthCm 150 non-null float64\n",
"Species 150 non-null object\n",
"dtypes: float64(4), object(1)\n",
"memory usage: 7.0+ KB\n"
]
}
],
"source": [
"# Retorna quantidade de linhas, colunas, tipo de dados...\n",
"\n",
"df.info()"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>SepalLengthCm</th>\n",
" <th>SepalWidthCm</th>\n",
" <th>PetalLengthCm</th>\n",
" <th>PetalWidthCm</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>count</th>\n",
" <td>150.000000</td>\n",
" <td>150.000000</td>\n",
" <td>150.000000</td>\n",
" <td>150.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>mean</th>\n",
" <td>5.843333</td>\n",
" <td>3.054000</td>\n",
" <td>3.758667</td>\n",
" <td>1.198667</td>\n",
" </tr>\n",
" <tr>\n",
" <th>std</th>\n",
" <td>0.828066</td>\n",
" <td>0.433594</td>\n",
" <td>1.764420</td>\n",
" <td>0.763161</td>\n",
" </tr>\n",
" <tr>\n",
" <th>min</th>\n",
" <td>4.300000</td>\n",
" <td>2.000000</td>\n",
" <td>1.000000</td>\n",
" <td>0.100000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>25%</th>\n",
" <td>5.100000</td>\n",
" <td>2.800000</td>\n",
" <td>1.600000</td>\n",
" <td>0.300000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>50%</th>\n",
" <td>5.800000</td>\n",
" <td>3.000000</td>\n",
" <td>4.350000</td>\n",
" <td>1.300000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>75%</th>\n",
" <td>6.400000</td>\n",
" <td>3.300000</td>\n",
" <td>5.100000</td>\n",
" <td>1.800000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>max</th>\n",
" <td>7.900000</td>\n",
" <td>4.400000</td>\n",
" <td>6.900000</td>\n",
" <td>2.500000</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" SepalLengthCm SepalWidthCm PetalLengthCm PetalWidthCm\n",
"count 150.000000 150.000000 150.000000 150.000000\n",
"mean 5.843333 3.054000 3.758667 1.198667\n",
"std 0.828066 0.433594 1.764420 0.763161\n",
"min 4.300000 2.000000 1.000000 0.100000\n",
"25% 5.100000 2.800000 1.600000 0.300000\n",
"50% 5.800000 3.000000 4.350000 1.300000\n",
"75% 6.400000 3.300000 5.100000 1.800000\n",
"max 7.900000 4.400000 6.900000 2.500000"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Retorna um df com contagem, média, soma, desvio padrão, quartis e valores mínimos e máximos de cada coluna :\n",
"\n",
"df.describe()"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"SepalLengthCm 150\n",
"SepalWidthCm 150\n",
"PetalLengthCm 150\n",
"PetalWidthCm 150\n",
"Species 150\n",
"dtype: int64\n",
"SepalLengthCm 5.843333\n",
"SepalWidthCm 3.054000\n",
"PetalLengthCm 3.758667\n",
"PetalWidthCm 1.198667\n",
"dtype: float64\n",
"SepalLengthCm 0.828066\n",
"SepalWidthCm 0.433594\n",
"PetalLengthCm 1.764420\n",
"PetalWidthCm 0.763161\n",
"dtype: float64\n",
"SepalLengthCm 4.3\n",
"SepalWidthCm 2\n",
"PetalLengthCm 1\n",
"PetalWidthCm 0.1\n",
"Species Iris-setosa\n",
"dtype: object\n",
"SepalLengthCm 7.9\n",
"SepalWidthCm 4.4\n",
"PetalLengthCm 6.9\n",
"PetalWidthCm 2.5\n",
"Species Iris-virginica\n",
"dtype: object\n"
]
}
],
"source": [
"# Podemos obter cada informação acima em uma Series, aplicando as funções:\n",
"\n",
"print(df.count())\n",
"print(df.mean())\n",
"print(df.std())\n",
"print(df.min())\n",
"print(df.max())"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"SepalLengthCm 876.5\n",
"SepalWidthCm 458.1\n",
"PetalLengthCm 563.8\n",
"PetalWidthCm 179.8\n",
"Species Iris-setosaIris-setosaIris-setosaIris-setosaIr...\n",
"dtype: object\n",
"SepalLengthCm 2.257440e+114\n",
"SepalWidthCm 1.197477e+72\n",
"PetalLengthCm 3.774489e+76\n",
"PetalWidthCm 2.972714e-12\n",
"dtype: float64\n"
]
}
],
"source": [
"# Para obter a soma e o produto de cada coluna do df:\n",
"\n",
"print(df.sum())\n",
"print(df.prod())"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>SepalLengthCm</th>\n",
" <th>SepalWidthCm</th>\n",
" <th>PetalLengthCm</th>\n",
" <th>PetalWidthCm</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>SepalLengthCm</th>\n",
" <td>1.000000</td>\n",
" <td>-0.109369</td>\n",
" <td>0.871754</td>\n",
" <td>0.817954</td>\n",
" </tr>\n",
" <tr>\n",
" <th>SepalWidthCm</th>\n",
" <td>-0.109369</td>\n",
" <td>1.000000</td>\n",
" <td>-0.420516</td>\n",
" <td>-0.356544</td>\n",
" </tr>\n",
" <tr>\n",
" <th>PetalLengthCm</th>\n",
" <td>0.871754</td>\n",
" <td>-0.420516</td>\n",
" <td>1.000000</td>\n",
" <td>0.962757</td>\n",
" </tr>\n",
" <tr>\n",
" <th>PetalWidthCm</th>\n",
" <td>0.817954</td>\n",
" <td>-0.356544</td>\n",
" <td>0.962757</td>\n",
" <td>1.000000</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" SepalLengthCm SepalWidthCm PetalLengthCm PetalWidthCm\n",
"SepalLengthCm 1.000000 -0.109369 0.871754 0.817954\n",
"SepalWidthCm -0.109369 1.000000 -0.420516 -0.356544\n",
"PetalLengthCm 0.871754 -0.420516 1.000000 0.962757\n",
"PetalWidthCm 0.817954 -0.356544 0.962757 1.000000"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# A função .corr() faz a correlação de todos os atributos, excluindo NaN values:\n",
"\n",
"df.corr()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Para ser aplicado nas colunas (lembre que cada coluna de um df é um Series):"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"array(['Iris-setosa', 'Iris-versicolor', 'Iris-virginica'], dtype=object)"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Para saber quais são os valores únicos que determinada coluna possui, use .unique()\n",
"\n",
"df['Species'].unique()"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"3\n",
"3\n"
]
}
],
"source": [
"# Existem duas formas de saber a quantidade de valores únicos:\n",
"\n",
"print(len(df['Species'].unique()))\n",
"print(df['Species'].nunique())"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"Iris-virginica 50\n",
"Iris-setosa 50\n",
"Iris-versicolor 50\n",
"Name: Species, dtype: int64"
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Outro método importante é o .value_counts() , retorna a contagem de cada valor presente na coluna:\n",
"\n",
"df['Species'].value_counts()"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": true
},
"source": [
"# Seleção e Indexação:"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Times_Cariocas</th>\n",
" <th>Times_Paulistas</th>\n",
" <th>Times_Ruins</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>Campeão</th>\n",
" <td>Flamengo</td>\n",
" <td>São Paulo</td>\n",
" <td>Palmeiras</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Vice</th>\n",
" <td>Fluminense</td>\n",
" <td>Santos</td>\n",
" <td>Vasco</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Terceiro</th>\n",
" <td>Botafogo</td>\n",
" <td>Corinthians</td>\n",
" <td>Remo</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Times_Cariocas Times_Paulistas Times_Ruins\n",
"Campeão Flamengo São Paulo Palmeiras\n",
"Vice Fluminense Santos Vasco\n",
"Terceiro Botafogo Corinthians Remo"
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Dataframe de Exemplo\n",
"df = pd.DataFrame({'Times_Paulistas': ['São Paulo', 'Santos', 'Corinthians'], \n",
" 'Times_Cariocas': ['Flamengo', 'Fluminense', 'Botafogo'], \n",
" 'Times_Ruins': ['Palmeiras', 'Vasco', 'Remo']}, index = ['Campeão', 'Vice', 'Terceiro'])\n",
"df"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Existem duas formas de selecionar <u>colunas</u> do seu df:"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"Campeão São Paulo\n",
"Vice Santos\n",
"Terceiro Corinthians\n",
"Name: Times_Paulistas, dtype: object"
]
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.Times_Paulistas\n",
"\n",
"# ou, a minha preferência pessoal:\n",
"\n",
"df['Times_Paulistas']"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Times_Paulistas</th>\n",
" <th>Times_Cariocas</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>Campeão</th>\n",
" <td>São Paulo</td>\n",
" <td>Flamengo</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Vice</th>\n",
" <td>Santos</td>\n",
" <td>Fluminense</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Terceiro</th>\n",
" <td>Corinthians</td>\n",
" <td>Botafogo</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Times_Paulistas Times_Cariocas\n",
"Campeão São Paulo Flamengo\n",
"Vice Santos Fluminense\n",
"Terceiro Corinthians Botafogo"
]
},
"execution_count": 20,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Para pegar mais de uma coluna, basta:\n",
"\n",
"df[['Times_Paulistas', 'Times_Cariocas']] # perceba que estamos passando uma lista de colunas ao invés de uma coluna,\n",
" # por isso os colchetes duplos... Retorna um dataframe!"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Existem duas formas para selecionar <u>linhas</u>: pela posição ou pelo nome do index"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"Times_Cariocas Flamengo\n",
"Times_Paulistas São Paulo\n",
"Times_Ruins Palmeiras\n",
"Name: Campeão, dtype: object"
]
},
"execution_count": 21,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Pela posição, com iloc[]:\n",
"\n",
"df.iloc[0]"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"Times_Cariocas Fluminense\n",
"Times_Paulistas Santos\n",
"Times_Ruins Vasco\n",
"Name: Vice, dtype: object"
]
},
"execution_count": 22,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Pelo nome do índice:\n",
"\n",
"df.loc['Vice']"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"'Vasco'"
]
},
"execution_count": 23,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Para escolher uma única célula, use os rótulos da coluna e da linha:\n",
"\n",
"df['Times_Ruins']['Vice']"
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# Adicionar colunas é muito simples\n",
"\n",
"df['Nova Coluna'] = [1,2,3]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Índices:"
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Times_Cariocas</th>\n",
" <th>Times_Paulistas</th>\n",
" <th>Times_Ruins</th>\n",
" </tr>\n",
" <tr>\n",
" <th>Nova Coluna</th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>Flamengo</td>\n",
" <td>São Paulo</td>\n",
" <td>Palmeiras</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>Fluminense</td>\n",
" <td>Santos</td>\n",
" <td>Vasco</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>Botafogo</td>\n",
" <td>Corinthians</td>\n",
" <td>Remo</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Times_Cariocas Times_Paulistas Times_Ruins\n",
"Nova Coluna \n",
"1 Flamengo São Paulo Palmeiras\n",
"2 Fluminense Santos Vasco\n",
"3 Botafogo Corinthians Remo"
]
},
"execution_count": 25,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Para mudar o índice - escolhemos como índice a nova coluna\n",
"\n",
"df.set_index('Nova Coluna') "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"ATENÇÃO!!! A maioria das mudanças que fizermos nos dataframes não é permanente, na verdade o Pandas cria um novo objeto\n",
"e retorna o mesmo. \n",
"Isso evita a perda de informação acidental. Para que a mudança seja permanente, use o parâmetro <i>inplace = True"
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Times_Cariocas</th>\n",
" <th>Times_Paulistas</th>\n",
" <th>Times_Ruins</th>\n",
" <th>Nova Coluna</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>Campeão</th>\n",
" <td>Flamengo</td>\n",
" <td>São Paulo</td>\n",
" <td>Palmeiras</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Vice</th>\n",
" <td>Fluminense</td>\n",
" <td>Santos</td>\n",
" <td>Vasco</td>\n",
" <td>2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Terceiro</th>\n",
" <td>Botafogo</td>\n",
" <td>Corinthians</td>\n",
" <td>Remo</td>\n",
" <td>3</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Times_Cariocas Times_Paulistas Times_Ruins Nova Coluna\n",
"Campeão Flamengo São Paulo Palmeiras 1\n",
"Vice Fluminense Santos Vasco 2\n",
"Terceiro Botafogo Corinthians Remo 3"
]
},
"execution_count": 26,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Nesse caso não tem sentido mudar o índice, mas veja que a mudança não é permanente\n",
"\n",
"df"
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>index</th>\n",
" <th>Times_Cariocas</th>\n",
" <th>Times_Paulistas</th>\n",
" <th>Times_Ruins</th>\n",
" <th>Nova Coluna</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>Campeão</td>\n",
" <td>Flamengo</td>\n",
" <td>São Paulo</td>\n",
" <td>Palmeiras</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>Vice</td>\n",
" <td>Fluminense</td>\n",
" <td>Santos</td>\n",
" <td>Vasco</td>\n",
" <td>2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>Terceiro</td>\n",
" <td>Botafogo</td>\n",
" <td>Corinthians</td>\n",
" <td>Remo</td>\n",
" <td>3</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" index Times_Cariocas Times_Paulistas Times_Ruins Nova Coluna\n",
"0 Campeão Flamengo São Paulo Palmeiras 1\n",
"1 Vice Fluminense Santos Vasco 2\n",
"2 Terceiro Botafogo Corinthians Remo 3"
]
},
"execution_count": 27,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Para resetar o índice, fazendo com que a mudança seja permanente\n",
"\n",
"df.reset_index(inplace = True)\n",
"df"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Dando um tapa no visu:"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Boa parte do trabalho de um cientista de dados, consiste em organizar e limpar os dados de forma que seja mais fácil analisá-los"
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>0</th>\n",
" <th>Títulos</th>\n",
" <th>Vices</th>\n",
" <th>Terceiro lugar</th>\n",
" <th>Quarto lugar</th>\n",
" </tr>\n",
" <tr>\n",
" <th>Clube</th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>Flamengo</th>\n",
" <td>1</td>\n",
" <td>33 (1914, 1915, 1920, 1921, 1925, 1927, 1939, ...</td>\n",
" <td>31 (1912(2), 1919, 1922, 1923, 1924(2), 1932, ...</td>\n",
" <td>14 (1913, 1917, 1935(3), 1946, 1957, 1964, 196...</td>\n",
" <td>14 (1916, 1918, 1928, 1945, 1948, 1949, 1951, ...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Fluminense</th>\n",
" <td>2</td>\n",
" <td>31 (1906, 1907(1), 1908, 1909, 1911, 1917, 191...</td>\n",
" <td>21 (1910, 1915, 1920, 1925, 1927, 1933(2), 193...</td>\n",
" <td>23 (1922, 1926, 1942, 1948, 1952, 1962, 1965, ...</td>\n",
" <td>19 (1914, 1923, 1929, 1939, 1944, 1947, 1955, ...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Vasco da Gama</th>\n",
" <td>3</td>\n",
" <td>24 (1923, 1924(1), 1929, 1934(1), 1936(1), 194...</td>\n",
" <td>24 (1926, 1928, 1930, 1931, 1935(2), 1944, 194...</td>\n",
" <td>13 (1925, 1933(3), 1937, 1940, 1955, 1972, 197...</td>\n",
" <td>18 (1927, 1938, 1941, 1943, 1953, 1954, 1957, ...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Botafogo</th>\n",
" <td>4</td>\n",
" <td>20 (1907(1), 1910, 1912(1), 1930, 1932, 1933(1...</td>\n",
" <td>20 (1908, 1909, 1913, 1914, 1916, 1918, 1939, ...</td>\n",
" <td>21 (1919, 1928, 1938, 1941, 1944, 1949, 1951, ...</td>\n",
" <td>26 (1906, 1915, 1920, 1922, 1924(4), 1925, 193...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>America</th>\n",
" <td>5</td>\n",
" <td>7 (1913, 1916, 1922, 1928, 1931, 1935(1) e 1960)</td>\n",
" <td>7 (1911, 1917, 1921, 1929, 1950, 1954 e 1955)</td>\n",
" <td>17 (1908, 1909, 1910, 1912(3), 1914, 1915, 192...</td>\n",
" <td>4 (1946, 1968, 1970 e 1983)</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" 0 Títulos \\\n",
"Clube \n",
"Flamengo 1 33 (1914, 1915, 1920, 1921, 1925, 1927, 1939, ... \n",
"Fluminense 2 31 (1906, 1907(1), 1908, 1909, 1911, 1917, 191... \n",
"Vasco da Gama 3 24 (1923, 1924(1), 1929, 1934(1), 1936(1), 194... \n",
"Botafogo 4 20 (1907(1), 1910, 1912(1), 1930, 1932, 1933(1... \n",
"America 5 7 (1913, 1916, 1922, 1928, 1931, 1935(1) e 1960) \n",
"\n",
" Vices \\\n",
"Clube \n",
"Flamengo 31 (1912(2), 1919, 1922, 1923, 1924(2), 1932, ... \n",
"Fluminense 21 (1910, 1915, 1920, 1925, 1927, 1933(2), 193... \n",
"Vasco da Gama 24 (1926, 1928, 1930, 1931, 1935(2), 1944, 194... \n",
"Botafogo 20 (1908, 1909, 1913, 1914, 1916, 1918, 1939, ... \n",
"America 7 (1911, 1917, 1921, 1929, 1950, 1954 e 1955) \n",
"\n",
" Terceiro lugar \\\n",
"Clube \n",
"Flamengo 14 (1913, 1917, 1935(3), 1946, 1957, 1964, 196... \n",
"Fluminense 23 (1922, 1926, 1942, 1948, 1952, 1962, 1965, ... \n",
"Vasco da Gama 13 (1925, 1933(3), 1937, 1940, 1955, 1972, 197... \n",
"Botafogo 21 (1919, 1928, 1938, 1941, 1944, 1949, 1951, ... \n",
"America 17 (1908, 1909, 1910, 1912(3), 1914, 1915, 192... \n",
"\n",
" Quarto lugar \n",
"Clube \n",
"Flamengo 14 (1916, 1918, 1928, 1945, 1948, 1949, 1951, ... \n",
"Fluminense 19 (1914, 1923, 1929, 1939, 1944, 1947, 1955, ... \n",
"Vasco da Gama 18 (1927, 1938, 1941, 1943, 1953, 1954, 1957, ... \n",
"Botafogo 26 (1906, 1915, 1920, 1922, 1924(4), 1925, 193... \n",
"America 4 (1946, 1968, 1970 e 1983) "
]
},
"execution_count": 28,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Voltando para o nosso antigo dataframe 'fut'\n",
"# escolhemos o header(linha com os nomes das colunas) e o índice no momento de chamar a função \n",
"\n",
"fut = pd.read_csv(\"Tabela dos Campeões.csv\", encoding = \"ISO-8859-1\", header= 1, index_col=1)\n",
"fut.head()"
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Deleta as colunas indesejadas\n",
"\n",
"fut.drop(['0', 'Quarto lugar'], axis=1, inplace = True)"
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Títulos</th>\n",
" <th>Vices</th>\n",
" <th>Terceiro lugar</th>\n",
" </tr>\n",
" <tr>\n",
" <th>Clube</th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>Flamengo</th>\n",
" <td>33 (1914, 1915, 1920, 1921, 1925, 1927, 1939, ...</td>\n",
" <td>31 (1912(2), 1919, 1922, 1923, 1924(2), 1932, ...</td>\n",
" <td>14 (1913, 1917, 1935(3), 1946, 1957, 1964, 196...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Fluminense</th>\n",
" <td>31 (1906, 1907(1), 1908, 1909, 1911, 1917, 191...</td>\n",
" <td>21 (1910, 1915, 1920, 1925, 1927, 1933(2), 193...</td>\n",
" <td>23 (1922, 1926, 1942, 1948, 1952, 1962, 1965, ...</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Títulos \\\n",
"Clube \n",
"Flamengo 33 (1914, 1915, 1920, 1921, 1925, 1927, 1939, ... \n",
"Fluminense 31 (1906, 1907(1), 1908, 1909, 1911, 1917, 191... \n",
"\n",
" Vices \\\n",
"Clube \n",
"Flamengo 31 (1912(2), 1919, 1922, 1923, 1924(2), 1932, ... \n",
"Fluminense 21 (1910, 1915, 1920, 1925, 1927, 1933(2), 193... \n",
"\n",
" Terceiro lugar \n",
"Clube \n",
"Flamengo 14 (1913, 1917, 1935(3), 1946, 1957, 1964, 196... \n",
"Fluminense 23 (1922, 1926, 1942, 1948, 1952, 1962, 1965, ... "
]
},
"execution_count": 30,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"fut.head(2)"
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Pegando apenas o valor de interesse em cada célula - vamos aprender isso em breve\n",
"\n",
"fut['Títulos'] = fut['Títulos'].apply(lambda x: int(x.split()[0]))\n",
"fut['Vices'] = fut['Vices'].apply(lambda x: int(x.split()[0]))\n",
"fut['Terceiro lugar'] = fut['Terceiro lugar'].apply(lambda x: int(x.split()[0]))"
]
},
{
"cell_type": "code",
"execution_count": 32,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Títulos</th>\n",
" <th>Vices</th>\n",
" <th>Terceiro lugar</th>\n",
" </tr>\n",
" <tr>\n",
" <th>Clube</th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>Flamengo</th>\n",
" <td>33</td>\n",
" <td>31</td>\n",
" <td>14</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Fluminense</th>\n",
" <td>31</td>\n",
" <td>21</td>\n",
" <td>23</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Vasco da Gama</th>\n",
" <td>24</td>\n",
" <td>24</td>\n",
" <td>13</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Botafogo</th>\n",
" <td>20</td>\n",
" <td>20</td>\n",
" <td>21</td>\n",
" </tr>\n",
" <tr>\n",
" <th>America</th>\n",
" <td>7</td>\n",
" <td>7</td>\n",
" <td>17</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Títulos Vices Terceiro lugar\n",
"Clube \n",
"Flamengo 33 31 14\n",
"Fluminense 31 21 23\n",
"Vasco da Gama 24 24 13\n",
"Botafogo 20 20 21\n",
"America 7 7 17"
]
},
"execution_count": 32,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Agora temos um dataframe bem organizado\n",
"\n",
"fut = fut.head(5) # pegamos só as cinco primeiras linhas para facilitar a visualização dos exemplos\n",
"fut"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### SELEÇÃO CONDICIONAL:"
]
},
{
"cell_type": "code",
"execution_count": 33,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Títulos</th>\n",
" <th>Vices</th>\n",
" <th>Terceiro lugar</th>\n",
" </tr>\n",
" <tr>\n",
" <th>Clube</th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>Flamengo</th>\n",
" <td>True</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Fluminense</th>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Vasco da Gama</th>\n",
" <td>True</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Botafogo</th>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" </tr>\n",
" <tr>\n",
" <th>America</th>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Títulos Vices Terceiro lugar\n",
"Clube \n",
"Flamengo True True False\n",
"Fluminense True False True\n",
"Vasco da Gama True True False\n",
"Botafogo False False False\n",
"America False False False"
]
},
"execution_count": 33,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Passando uma condição para o df, obtem-se um df com variáveis booleanas, que satisfazem essa condição, exemplo:\n",
"\n",
"fut>22 # retorna True onde os valores forem maiores que 22"
]
},
{
"cell_type": "code",
"execution_count": 34,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Títulos</th>\n",
" <th>Vices</th>\n",
" <th>Terceiro lugar</th>\n",
" </tr>\n",
" <tr>\n",
" <th>Clube</th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>Flamengo</th>\n",
" <td>33.0</td>\n",
" <td>31.0</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Fluminense</th>\n",
" <td>31.0</td>\n",
" <td>NaN</td>\n",
" <td>23.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Vasco da Gama</th>\n",
" <td>24.0</td>\n",
" <td>24.0</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Botafogo</th>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>America</th>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Títulos Vices Terceiro lugar\n",
"Clube \n",
"Flamengo 33.0 31.0 NaN\n",
"Fluminense 31.0 NaN 23.0\n",
"Vasco da Gama 24.0 24.0 NaN\n",
"Botafogo NaN NaN NaN\n",
"America NaN NaN NaN"
]
},
"execution_count": 34,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Para fazer uma selação condicional, passamos a condição como se fosse um índice:\n",
"\n",
"fut[fut>22] # ou seja, quero os valores do df fut, apenas onde esses valores sejam maiores que 22"
]
},
{
"cell_type": "code",
"execution_count": 35,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"\"\\nDestrinchando essa linha acima:\\n\\nfut['Títulos'].max() retorna o valor máximo da coluna títulos\\nfut['Títulos'] == fut['Títulos'].max() é a condição que estamos passando, ou seja, quero os valores da coluna Títulos, apenas \\nonde o valor seja igual ao valor máximo\\n\\n\""
]
},
"execution_count": 35,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Outro exemplo: para saber quem tem mais títulos:\n",
"\n",
"fut['Títulos'][ fut['Títulos'] == fut['Títulos'].max() ]\n",
"\n",
"'''\n",
"Destrinchando essa linha acima:\n",
"\n",
"fut['Títulos'].max() retorna o valor máximo da coluna títulos\n",
"fut['Títulos'] == fut['Títulos'].max() é a condição que estamos passando, ou seja, quero os valores da coluna Títulos, apenas \n",
"onde o valor seja igual ao valor máximo\n",
"\n",
"''' \n"
]
},
{
"cell_type": "code",
"execution_count": 36,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"Clube\n",
"America 7\n",
"Botafogo 20\n",
"Fluminense 21\n",
"Vasco da Gama 24\n",
"Flamengo 31\n",
"Name: Vices, dtype: int64"
]
},
"execution_count": 36,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Um método mais simples para comparar valores de uma coluna:\n",
"\n",
"fut['Vices'].sort_values()\n",
"\n",
"# Perceba que o Vasco está em segundo lugar, ou seja, ele é o vice dos vices! rsrsrsrs"
]
},
{
"cell_type": "code",
"execution_count": 37,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# Criando uma nova coluna\n",
"\n",
"fut['Pódios'] = fut['Títulos'] + fut['Vices'] + fut['Terceiro lugar']"
]
},
{
"cell_type": "code",
"execution_count": 38,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Títulos</th>\n",
" <th>Vices</th>\n",
" <th>Terceiro lugar</th>\n",
" <th>Pódios</th>\n",
" </tr>\n",
" <tr>\n",
" <th>Clube</th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>Flamengo</th>\n",
" <td>33</td>\n",
" <td>31</td>\n",
" <td>14</td>\n",
" <td>78</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Fluminense</th>\n",
" <td>31</td>\n",
" <td>21</td>\n",
" <td>23</td>\n",
" <td>75</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Vasco da Gama</th>\n",
" <td>24</td>\n",
" <td>24</td>\n",
" <td>13</td>\n",
" <td>61</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Botafogo</th>\n",
" <td>20</td>\n",
" <td>20</td>\n",
" <td>21</td>\n",
" <td>61</td>\n",
" </tr>\n",
" <tr>\n",
" <th>America</th>\n",
" <td>7</td>\n",
" <td>7</td>\n",
" <td>17</td>\n",
" <td>31</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Títulos Vices Terceiro lugar Pódios\n",
"Clube \n",
"Flamengo 33 31 14 78\n",
"Fluminense 31 21 23 75\n",
"Vasco da Gama 24 24 13 61\n",
"Botafogo 20 20 21 61\n",
"America 7 7 17 31"
]
},
"execution_count": 38,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"fut"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": []
}
],
"metadata": {
"anaconda-cloud": {},
"kernelspec": {
"display_name": "Python [conda env:k35]",
"language": "python",
"name": "conda-env-k35-py"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.5.2"
}
},
"nbformat": 4,
"nbformat_minor": 1
}
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment