Skip to content

Instantly share code, notes, and snippets.

@jazzido
Last active January 29, 2022 14:49
Show Gist options
  • Star 3 You must be signed in to star a gist
  • Fork 1 You must be signed in to fork a gist
  • Save jazzido/d067c834d0990fe3cf932cf6e7ec1881 to your computer and use it in GitHub Desktop.
Save jazzido/d067c834d0990fe3cf932cf6e7ec1881 to your computer and use it in GitHub Desktop.
Display the source blob
Display the rendered blob
Raw
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Reproduciendo el cuadro de población según ingreso individual usando los microdatos de EPH"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Cada vez que el [INDEC](http://www.indec.gob.ar) publica datos de la *Encuesta Permanente de Hogares* (EPH), hay repercusiones mediáticas de algunas de las variables que mide esa encuesta. El ingreso suele ser de particular interés para la cobertura periodística, por ejemplo en [esta nota de La Nación](http://www.lanacion.com.ar/1973472-la-mitad-de-los-argentinos-tiene-ingresos-inferiores-a-8000-por-mes).\n",
"\n",
"Esos artículos, en general, usan como fuente a los \"cuadros\" que publica el INDEC. Esos cuadros contienen estadísticas calculadas a partir de los registros individuales de la EPH. Es decir, las repuestas de cada individuo que fue encuestado. A esas tablas, en la jerga, se las llama *microdatos*.\n",
"\n",
"Inspirado por [el post](https://medium.com/@fernandezpablo/an%C3%A1lisis-de-deciles-de-ingresos-d6fddc885aff#.uec4863wk) de [Pablo Fernández](https://twitter.com/fernandezpablo) en el que analizó los cuadros de población según escala de ingreso individual para el tercer trimestre de 2016, voy a intentar reproducirlos a partir de los *microdatos* de la EPH. Vamos a usar el lenguaje de programación Python, y la librería de análisis y manipulación de datos [pandas](http://pandas.pydata.org/)."
]
},
{
"cell_type": "code",
"execution_count": 167,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import pandas as pd"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"A diferencia de otros organismos de estadística pública, como [las APIs del Census Bureau de Estados Unidos](http://www.census.gov/data/developers/data-sets.html), el INDEC no ofrece demasiada sistematización para obtener las bases de datos. El sistema de distribución de esa información es un rudimentario *download* de un archivo ZIP, que contiene las tablas en formato TXT o Excel. Vamos a trabajar con [el último *dataset* disponible a la fecha, que corresponde al segundo trimestre de 2016](http://www.indec.gov.ar/ftp/cuadros/menusuperior/eph/EPH_usu_2doTrim_2016_xls.zip).\n",
"\n",
"Una vez abierto el archivo ZIP, leemos la tabla de respuestas de individuos."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"eph = pd.read_excel('usu_individual_T216.xls')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Queremos reproducir un [*cuadro de población total según escala de ingreso individual*](http://www.indec.gov.ar/ftp/cuadros/sociedad/pob_ingreso_individual_2t16.xls). Consultamos el documento de [*Diseño de Registro y Estructura para las bases preliminares Hogar y Personas*](http://www.indec.gov.ar/ftp/cuadros/menusuperior/eph/EPH_registro_2_trim_2016.pdf) y vemos que la variable que contiene el monto de ingreso total individual se llama $P47T$. Filtramos la tabla y obtenemos los registros que declaran algún ingreso. Esto es, aquellos en los que $P47T > 0$."
]
},
{
"cell_type": "code",
"execution_count": 145,
"metadata": {
"collapsed": false,
"scrolled": false
},
"outputs": [],
"source": [
"with_income = eph[eph['P47T'] > 0]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Dado que las encuestas como la EPH contienen una muestra de la población, las variables suelen estar ponderadas (*weighted*) por un factor de ajuste que debemos usar para el cálculo de estadísticas. Para tratar el ingreso total individual, la EPH provee la variable $PONDII$, documentada en el [Anexo I del documento de diseño](http://www.indec.gov.ar/ftp/cuadros/menusuperior/eph/EPH_registro_2_trim_2016.pdf).\n",
"\n",
"El *decil* o *grupo decílico* para el ingreso total individual, ya está provisto por la EPH en la variable $DECINDR$, pero no contiene los *límites* de cada uno. Los calculamos con la ayuda de la librería [`weightedcalcs`](https://github.com/jsvine/weightedcalcs)."
]
},
{
"cell_type": "code",
"execution_count": 178,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import weightedcalcs\n",
"upper = []\n",
"lower = []\n",
"\n",
"wc = weightedcalcs.Calculator('PONDII')\n",
"lowerb = with_income['P47T'].min()\n",
"for q in range(1,11):\n",
" upperb = wc.quantile(with_income, 'P47T', q/10.0)\n",
" lower.append(lowerb)\n",
" upper.append(upperb)\n",
" lowerb = upperb\n",
"deciles = pd.DataFrame({'lbound': lower, 'ubound': upper}).reset_index().rename(columns={'index': 'decile'})\n",
"deciles['decile'] = deciles['decile'] + 1"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Verificamos que las cotas de los deciles son iguales a los del cuadro que estamos replicando:"
]
},
{
"cell_type": "code",
"execution_count": 179,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>decile</th>\n",
" <th>lbound</th>\n",
" <th>ubound</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>1</td>\n",
" <td>22.0</td>\n",
" <td>2470.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>2</td>\n",
" <td>2470.0</td>\n",
" <td>4000.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>3</td>\n",
" <td>4000.0</td>\n",
" <td>4800.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>4</td>\n",
" <td>4800.0</td>\n",
" <td>6000.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>5</td>\n",
" <td>6000.0</td>\n",
" <td>7200.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>6</td>\n",
" <td>7200.0</td>\n",
" <td>9000.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>7</td>\n",
" <td>9000.0</td>\n",
" <td>10500.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7</th>\n",
" <td>8</td>\n",
" <td>10500.0</td>\n",
" <td>14000.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>8</th>\n",
" <td>9</td>\n",
" <td>14000.0</td>\n",
" <td>20000.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9</th>\n",
" <td>10</td>\n",
" <td>20000.0</td>\n",
" <td>715000.0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" decile lbound ubound\n",
"0 1 22.0 2470.0\n",
"1 2 2470.0 4000.0\n",
"2 3 4000.0 4800.0\n",
"3 4 4800.0 6000.0\n",
"4 5 6000.0 7200.0\n",
"5 6 7200.0 9000.0\n",
"6 7 9000.0 10500.0\n",
"7 8 10500.0 14000.0\n",
"8 9 14000.0 20000.0\n",
"9 10 20000.0 715000.0"
]
},
"execution_count": 179,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"deciles"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Usamos el *grupo decílico* provisto en la tabla para agregarle los intervalos de cada uno que acabamos de calcular."
]
},
{
"cell_type": "code",
"execution_count": 188,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"with_income = with_income.merge(deciles, left_on='DECINDR', right_on='decile')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Sumando el ponderador ($PONDII$) para cada grupo decílico, obtenemos la población de cada uno. Creamos un nuevo `DataFrame` en el que almacenaremos las variables del cuadro que estamos replicando."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"cuadro = pd.DataFrame(with_income.groupby('DECINDR')['PONDII'].sum()).rename(columns={'PONDII': 'decile_population'})"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Agregamos una variable con el ingreso *total* de cada decil."
]
},
{
"cell_type": "code",
"execution_count": 198,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"with_income.loc[:, 'weighted_income'] = with_income['P47T'] * with_income['PONDII']\n",
"cuadro['decile_total_income'] = pd.DataFrame(with_income.groupby('DECINDR')['weighted_income'].sum() / 1000)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Agregamos otra con el porcentaje del ingreso de cada decil sobre el ingreso total"
]
},
{
"cell_type": "code",
"execution_count": 201,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"cuadro['decile_percentage_income'] = (cuadro['decile_total_income'] / cuadro['decile_total_income'].sum()) * 100"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Calculamos el ingreso medio por decil"
]
},
{
"cell_type": "code",
"execution_count": 203,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"cuadro['decile_mean_income'] = cuadro['decile_total_income'] / cuadro['decile_population']"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Finalmente, agregamos las cotas de cada decil"
]
},
{
"cell_type": "code",
"execution_count": 208,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"cuadro = cuadro.reset_index().merge(deciles, left_on='DECINDR', right_on='decile').rename(columns={'lbound': 'decile_lower_bound', 'ubound': 'decile_upper_bound'})\n",
"del cuadro['decile']"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Como decía la revista Anteojito, acá el *modelo terminado*:"
]
},
{
"cell_type": "code",
"execution_count": 209,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>DECINDR</th>\n",
" <th>decile_population</th>\n",
" <th>decile_total_income</th>\n",
" <th>decile_percentage_income</th>\n",
" <th>decile_mean_income</th>\n",
" <th>decile_lower_bound</th>\n",
" <th>decile_upper_bound</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>1</td>\n",
" <td>1618265</td>\n",
" <td>2.121558e+06</td>\n",
" <td>1.350246</td>\n",
" <td>1.311008</td>\n",
" <td>22.0</td>\n",
" <td>2470.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>2</td>\n",
" <td>1616771</td>\n",
" <td>5.337631e+06</td>\n",
" <td>3.397086</td>\n",
" <td>3.301414</td>\n",
" <td>2470.0</td>\n",
" <td>4000.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>3</td>\n",
" <td>1617945</td>\n",
" <td>7.243509e+06</td>\n",
" <td>4.610064</td>\n",
" <td>4.476981</td>\n",
" <td>4000.0</td>\n",
" <td>4800.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>4</td>\n",
" <td>1615707</td>\n",
" <td>8.257147e+06</td>\n",
" <td>5.255185</td>\n",
" <td>5.110547</td>\n",
" <td>4800.0</td>\n",
" <td>6000.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>5</td>\n",
" <td>1617192</td>\n",
" <td>1.046035e+07</td>\n",
" <td>6.657395</td>\n",
" <td>6.468220</td>\n",
" <td>6000.0</td>\n",
" <td>7200.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>6</td>\n",
" <td>1617391</td>\n",
" <td>1.318178e+07</td>\n",
" <td>8.389421</td>\n",
" <td>8.150026</td>\n",
" <td>7200.0</td>\n",
" <td>9000.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>7</td>\n",
" <td>1617847</td>\n",
" <td>1.577172e+07</td>\n",
" <td>10.037767</td>\n",
" <td>9.748588</td>\n",
" <td>9000.0</td>\n",
" <td>10500.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7</th>\n",
" <td>8</td>\n",
" <td>1616376</td>\n",
" <td>1.957398e+07</td>\n",
" <td>12.457680</td>\n",
" <td>12.109795</td>\n",
" <td>10500.0</td>\n",
" <td>14000.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>8</th>\n",
" <td>9</td>\n",
" <td>1617180</td>\n",
" <td>2.605983e+07</td>\n",
" <td>16.585536</td>\n",
" <td>16.114364</td>\n",
" <td>14000.0</td>\n",
" <td>20000.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9</th>\n",
" <td>10</td>\n",
" <td>1617037</td>\n",
" <td>4.911631e+07</td>\n",
" <td>31.259620</td>\n",
" <td>30.374264</td>\n",
" <td>20000.0</td>\n",
" <td>715000.0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" DECINDR decile_population decile_total_income decile_percentage_income \\\n",
"0 1 1618265 2.121558e+06 1.350246 \n",
"1 2 1616771 5.337631e+06 3.397086 \n",
"2 3 1617945 7.243509e+06 4.610064 \n",
"3 4 1615707 8.257147e+06 5.255185 \n",
"4 5 1617192 1.046035e+07 6.657395 \n",
"5 6 1617391 1.318178e+07 8.389421 \n",
"6 7 1617847 1.577172e+07 10.037767 \n",
"7 8 1616376 1.957398e+07 12.457680 \n",
"8 9 1617180 2.605983e+07 16.585536 \n",
"9 10 1617037 4.911631e+07 31.259620 \n",
"\n",
" decile_mean_income decile_lower_bound decile_upper_bound \n",
"0 1.311008 22.0 2470.0 \n",
"1 3.301414 2470.0 4000.0 \n",
"2 4.476981 4000.0 4800.0 \n",
"3 5.110547 4800.0 6000.0 \n",
"4 6.468220 6000.0 7200.0 \n",
"5 8.150026 7200.0 9000.0 \n",
"6 9.748588 9000.0 10500.0 \n",
"7 12.109795 10500.0 14000.0 \n",
"8 16.114364 14000.0 20000.0 \n",
"9 30.374264 20000.0 715000.0 "
]
},
"execution_count": 209,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"cuadro"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Por supuesto, este ejercicio es apenas un ejemplo de lo que se puede hacer con los *microdatos* de la Encuesta Permanente de Hogares. El procesamiento de otras variables queda como actividad para el lector entusiasta."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 2",
"language": "python",
"name": "python2"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 2
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython2",
"version": "2.7.12"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment