Skip to content

Instantly share code, notes, and snippets.

@JamesSaxon
Created January 18, 2022 22:55
Show Gist options
  • Save JamesSaxon/1a0ba15276a1a6d864f9b868a61baa28 to your computer and use it in GitHub Desktop.
Save JamesSaxon/1a0ba15276a1a6d864f9b868a61baa28 to your computer and use it in GitHub Desktop.
CPS NTIA Issues
Display the source blob
Display the rendered blob
Raw
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Issue with the NTIA supplement to the CPS, for San Francisco\n",
"\n",
"## Sources\n",
"* Data: https://www.ntia.doc.gov/page/download-digital-nation-datasets\n",
"* Map (useful for checks!): https://www.ntia.doc.gov/data/digital-nation-data-explorer#sel=internetAtHome&disp=map\n",
"* Docs: https://www.ntia.doc.gov/files/ntia/publications/november-2019-techdocs.pdf\n",
"* Universes & code examples: https://www.ntia.doc.gov/files/ntia/data_central_downloads/code/create-ntia-tables-stata.zip\n",
"\n",
"\n",
"## The issue with San Francisco.\n",
"\n",
"Let's check in on households in CA and SF (the problem shown is similar for person-based estimates). We will compare the \"internet use by anyone in the household\" (internetAtHome) value of the NTIA estimate for CA, which is 79.9% as can be seen here:\n",
"* https://www.ntia.doc.gov/data/digital-nation-data-explorer#sel=internetAtHome&disp=map"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(0.799, 0.7992982448757221, 0.6540205571069156)"
]
},
"execution_count": 1,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"cps_test = pd.read_csv(\"data/nov19-cps.csv\")\n",
"\n",
"# Householders\n",
"cps_test[\"isHouseholder\"] = (cps_test.perrp > 0) & (cps_test.perrp < 3) & \\\n",
" (cps_test.hrhtype > 0) & (cps_test.hrhtype < 9)\n",
"\n",
"# Households in California, and those in SF county/city.\n",
"ca_households = cps_test.query(\"(gestfips == 6) & isHouseholder\")\n",
"sf_households = ca_households.query(\"gtco == 75\")\n",
"\n",
"# Get the \"official\" estimate and compare to our calculted quantity for CA, \n",
"# and then SF (no official available)\n",
"ca_ntia_official = 0.799\n",
"ca_internet = ca_households.query(\"heinhome == 1\").hwhhwgt.sum() / ca_households.hwhhwgt.sum()\n",
"sf_internet = sf_households.query(\"heinhome == 1\").hwhhwgt.sum() / sf_households.hwhhwgt.sum()\n",
"\n",
"ca_ntia_official, ca_internet, sf_internet"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"San Francisco is 15% worse than California as a whole!? Are the errors just enormous?? No: just 5%."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0.049732365088150265"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"hh = sf_households.filter(regex = \"hhwgt[1-9]\\d*\", axis = 1).sum()\n",
"hh_with_internet = sf_households.query(\"heinhome == 1\").filter(regex = \"hhwgt[1-9]\\d*\", axis = 1).sum()\n",
"\n",
"sf_internet_reps = hh_with_internet / hh\n",
"\n",
"np.sqrt((4 / 160) * ((sf_internet_reps - sf_internet)**2).sum())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"So I exactly reproduce the \"correct\" answer, but SF county shows up as quite significantly lower internet use at home than the rest of CA.\n",
"\n",
"This simply does not feel correct to me, and indeed, we get a different impression from the ACS.\n",
"\n",
"Dallas is the other outlier in the ACS / CPS comparison.\n",
"\n",
"Is my definition of SF county just erroneous? (Note that the \"usual\" definition in the CPS NTIA Universe is (`prtage >= 3`) and non-Armed Forces household (`prpertyp != 3`). But I will not apply that.)"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'pwsswgt': {'sum': 804606.1993000001, 'count': 200.0}}"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"cps_test.query(\"(gestfips == 6) & (gtco == 75) & (pwsswgt > 0)\").agg({\"pwsswgt\" : [\"sum\", \"count\"]}).to_dict()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This isn't perfect: the current number is around 880k, but it's not far off. The outliers by this metric are Dallas and Fort Worth (are they swapped?) and San Jose (also low -- so, not swapped within CBSA with SF)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Are there negative weights? (This isn't quantum mechanics, but...)"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0.0"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"(cps_test.pwsswgt < 0).mean()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"My replicate weights, following the procedures below, are *half* of Rafi's... but he assures me that it's 1.96, not 2: he quotes 95% CI's and I quote SEs.\n",
"\n",
"* https://cps.ipums.org/cps/repwt.shtml\n",
"* https://www2.census.gov/programs-surveys/cps/datasets/2020/march/2020_ASEC_Replicate_Weight_Usage_Instructions.docx"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0.006545476572600432"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"ca_internet = ca_households.query(\"heinhome == 1\").hwhhwgt.sum() / ca_households.hwhhwgt.sum()\n",
"hh = ca_households.filter(regex = \"hhwgt[1-9]\\d*\", axis = 1).sum()\n",
"hh_with_internet = ca_households.query(\"heinhome == 1\").filter(regex = \"hhwgt[1-9]\\d*\", axis = 1).sum()\n",
"\n",
"ca_internet_reps = hh_with_internet / hh\n",
"\n",
"np.sqrt((4 / 160) * ((ca_internet_reps - ca_internet)**2).sum())"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0.023113497617789587"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"ar_households = cps_test.query(\"(gestfips == 5) & isHouseholder\")\n",
"\n",
"ar_internet = ar_households.query(\"heinhome == 1\").hwhhwgt.sum() / ar_households.hwhhwgt.sum()\n",
"hh = ar_households.filter(regex = \"hhwgt[1-9]\\d*\", axis = 1).sum()\n",
"hh_with_internet = ar_households.query(\"heinhome == 1\").filter(regex = \"hhwgt[1-9]\\d*\", axis = 1).sum()\n",
"\n",
"ar_internet_reps = hh_with_internet / hh\n",
"\n",
"np.sqrt((4 / 160) * ((ar_internet_reps - ar_internet)**2).sum())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Compare CA and AR error as as 1.3% and 4.5%."
]
}
],
"metadata": {
"anaconda-cloud": {},
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.3"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment