@johansigfrids
Last active December 30, 2015 20:38
My Term Project for Global Warming MOOC
{
"metadata": {
"language": "Julia",
"name": "Term Project"
},
"nbformat": 3,
"nbformat_minor": 0,
"worksheets": [
{
"cells": [
{
"cell_type": "heading",
"level": 1,
"metadata": {},
"source": "The probability Michael Crichton picked his 15 temperature series randomly"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Johan Sigfrids"
},
{
"cell_type": "heading",
"level": 2,
"metadata": {},
"source": "Introduction"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "In 2004 Michael Crichton released the techno-thriller State of Fear [1]. He used this fiction book as a pulpit to trash the science behind global warming. One of the key pieces of evidence he presents is a series of 19 temperature series for 15 different parts of the United States. Of these 15 locations only 5 show a warming trend. The difference between the number of series and number of locations is because he shows multiple series of different lengths for the same location.\n\nI decided to find out how likely it is that you can randomly pick temperature series for 15 weather stations in US and only get 5 that show warming. 15 because I focused on only the longest series for each location. If the probability is low it would suggest that the data Crichton used was cherry picked. "
},
{
"cell_type": "heading",
"level": 2,
"metadata": {},
"source": "Data and data munging"
},
{
"cell_type": "code",
"collapsed": false,
"input": "using DataFrames, GLM, Datetime, Distributions",
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 1
},
{
"cell_type": "markdown",
"metadata": {},
"source": "The software I am using for this is Julia 0.2 [2], a new upcoming programming language in the data science scene. Along with Julia itself I use four packages for Julia: DataFrames 0.4.1 [3], a package for handling heterogenous types of data similar to the data frames available in R, GLM 0.2.2 [4], a package for fitting linear and generalized linear models, Datetime 0.1.2 [5], a package for handling dates and times, and Distributions 0.2.11 [6]. \n\nI put in a data request at NOAA's National Climate Data Center [7] for the monthly mean temperature data for every single weather station in the US. I got back a 100MB csv file with a little over 5 million records."
},
{
"cell_type": "code",
"collapsed": false,
"input": "# Import the data into a DataFrame\nraw_data = readtable(\"237753.csv\");\n# Show the first 20 rows of the data\nraw_data[1:20,:]",
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 2,
"text": "20x3 DataFrame:\n STATION DATE MNTM\n[1,] \"COOP:509551\" 195003 339\n[2,] \"COOP:509551\" 195004 355\n[3,] \"COOP:509551\" 195005 387\n[4,] \"COOP:509551\" 195006 430\n[5,] \"COOP:509551\" 196009 462\n[6,] \"COOP:509551\" 196010 374\n[7,] \"COOP:509551\" 196101 336\n[8,] \"COOP:509551\" 196103 306\n[9,] \"COOP:509551\" 196104 371\n[10,] \"COOP:509551\" 196105 417\n[11,] \"COOP:509551\" 196106 456\n[12,] \"COOP:509551\" 196110 391\n[13,] \"COOP:509551\" 196111 353\n[14,] \"COOP:509551\" 196203 315\n[15,] \"COOP:509551\" 196204 366\n[16,] \"COOP:509551\" 196205 396\n[17,] \"COOP:509551\" 196206 457\n[18,] \"COOP:509551\" 196207 505\n[19,] \"COOP:509551\" 196208 502\n[20,] \"COOP:509551\" 196209 443\n"
}
],
"prompt_number": 2
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Each row in the dataset represents a single temperature measurement for a single weather station. The format is: first a COOP code unique to each station, then a year/month stamp in the format YYYYMM, and last the temperature in Farenheit with the decimal separator missing. So 339 means 33.9 Farenheit. The data for each weather stations goes back as far as is available at NOAA.\n\nThen I need to munge the data so that it is in such a format that I can estimate the trend for each weather station. I convert the temperature into Celcius, with the appropriate decimal separator. I also split up the data so that each weather station is in its own column. I also add a column for month so that I can control for the seasonal effect. "
},
{
"cell_type": "code",
"collapsed": false,
"input": "# Helper function to convert the temperature it Celcius.\ntemp(f) = (f/10. - 32.)/1.8\n# Helper function to calculate the row in the dataframe based on a date.\ncalcindex(dt, beginyear) = (div(dt, 100) - beginyear)* 12 + (dt % 100)\n# Helper function to convert station names into a valid format for \n# dataframe column name.\nmakename(str) = string(str[1:4],str[6:end])\n\n# Function that munges the data into a format appropriate for running \n# regressions on each station.\nfunction mungedata(raw_data)\n # Extract the first and last years for which there is data.\n oldest_date_string = minimum(raw_data[\"DATE\"])\n newest_date_string = maximum(raw_data[\"DATE\"])\n oldest_date = date(div(oldest_date_string,100),oldest_date_string%100)\n newest_date = date(div(newest_date_string,100),newest_date_string%100)\n oldest_year = year(oldest_date)\n newest_year = year(newest_date)\n # Create a new dataframe with columns for dates, months and time.\n df = DataFrame(date = [date(oldest_year,1):month(1):date(newest_year,12)])\n df[\"month\"] = PooledDataArray(rep(\n [\"Jan\",\"Feb\",\"Mar\",\"Apr\",\"May\",\"Jun\",\"Jul\",\"Aug\",\"Sep\",\"Oct\",\"Nov\",\"Dec\"],\n newest_year - oldest_year + 1\n ))\n df[\"time\"] = [1:nrow(df)]\n stations = ASCIIString[]\n # For each station add a column to the dataframe with its temperature \n # recordings.\n for station in groupby(raw_data, \"STATION\")\n # Convert the COOP code into a format that is a valid DataFrame column \n # name.\n station_name = makename(station[1,\"STATION\"])\n # Create a new column in the new DataFrame for the weather station.\n df[station_name] = DataArray(Float64, nrow(df))\n # Add the station to the list of stations.\n append!(stations, [station_name])\n for i in 1:nrow(station)\n # Calculate the index in the new DataFrame for temperature based on \n # the date.\n index = calcindex(station[i,\"DATE\"], oldest_year)\n # Convert the temperature to Celcius and inset it into the new \n # DataFrame.\n 
df[index, station_name] = temp(station[i,\"MNTM\"])\n end\n end\n return df, stations\nend\n\ndata, stations = mungedata(raw_data);",
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 2
},
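{
"cell_type": "markdown",
"metadata": {},
"source": "As a quick sanity check of the helper functions, using values from the first record shown above: an MNTM value of 339 means 33.9 Fahrenheit, which is about 1.06 Celsius, and the date 195003 should land in row 3 when the earliest year in the data is 1950."
},
{
"cell_type": "code",
"collapsed": false,
"input": "# 33.9 Fahrenheit converted to Celsius (about 1.06).\ntemp(339)\n# Row index for March 1950 when the data starts in 1950 (should be 3).\ncalcindex(195003, 1950)\n# COOP code converted to a valid column name.\nmakename(\"COOP:509551\")",
"language": "julia",
"metadata": {},
"outputs": [],
"prompt_number": 3
},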
{
"cell_type": "markdown",
"metadata": {},
"source": "Here is a small sample of the resulting DataFrame (NA represents missing data):"
},
{
"cell_type": "code",
"collapsed": false,
"input": "# Display a sample form the new DataFrame\ndata[1400:1420,2:7]",
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 3,
"text": "21x6 DataFrame:\n month time COOP010063 COOP010116 COOP010148 COOP010160\n[1,] \"Aug\" 1400 25.5 NA NA 25.7222\n[2,] \"Sep\" 1401 21.2222 NA NA 21.8889\n[3,] \"Oct\" 1402 14.6111 NA NA 15.7222\n[4,] \"Nov\" 1403 13.7222 NA NA 14.8889\n[5,] \"Dec\" 1404 8.66667 NA NA 10.2778\n[6,] \"Jan\" 1405 6.11111 NA NA 7.5\n[7,] \"Feb\" 1406 6.11111 NA NA 7.16667\n[8,] \"Mar\" 1407 11.1111 NA NA 12.6111\n[9,] \"Apr\" 1408 17.0556 NA NA 18.7222\n[10,] \"May\" 1409 19.3889 NA NA 20.8333\n[11,] \"Jun\" 1410 23.9444 NA NA 24.6667\n[12,] \"Jul\" 1411 26.2222 27.6111 NA NA\n[13,] \"Aug\" 1412 NA NA NA 26.6667\n[14,] \"Sep\" 1413 24.5 25.8889 NA 24.6111\n[15,] \"Oct\" 1414 18.1111 20.2222 NA 19.7222\n[16,] \"Nov\" 1415 8.66667 11.2222 NA 10.2778\n[17,] \"Dec\" 1416 5.44444 7.88889 NA 6.44444\n[18,] \"Jan\" 1417 1.66667 4.72222 NA 4.11111\n[19,] \"Feb\" 1418 6.44444 9.94444 NA 8.61111\n[20,] \"Mar\" 1419 NA 14.1667 NA 13.8333\n[21,] \"Apr\" 1420 16.3889 17.8333 NA 16.8333\n"
}
],
"prompt_number": 3
},
{
"cell_type": "heading",
"level": 2,
"metadata": {},
"source": "Data analysis"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "In order to determine if the temperature series is sloping downwards I use linear regression, which appears to be the same thing Crichton used in his book to classify the temperature series he displayed as increasing or decreasing. I estimate a function of the form:\n\n$$ temp = \\beta_0 + \\beta_1 time + \\beta_2 I_{month=Jan} + \\beta_3 I_{month=Feb} +\\\\ \\beta_4 I_{month=Mar} + \\beta_5 I_{month=Apr} + \\beta_6 I_{month=May} + \\beta_7 I_{month=Jun} +\\\\ \\beta_8 I_{month=Jul} + \\beta_9 I_{month=Aug} + \\beta_{10} I_{month=Sep} + \\beta_{11} I_{month=Oct} +\\\\ \\beta_{12} I_{month=Nov} $$\n\n$ \\beta_1 $ will be the slope coefficient for time. If it is negative then the trend is downwards. It is interpreted as an estimate of how much change in temperature is to be expected in that location if everything else were held constant. $ I $ is an indicator function which takes the value $ 1 $ if the test in the subscript is true, and $ 0 $ otherwise. So $ I_{month=Jan} $ is equal to $1$ if and only if the month is January. These functions add separate constant offsets to the slope for each month to account for the seasonal patterns. December is missing in order to avoid colinearity with the constant $ \\beta_0 $. \n\nI only calculate these slopes for weather stations that have at least 360 data points, which is a minimum of 30 years. I figure I need at least that much to be able to estimate a good long term trend in the data and not get swamped with local weather variations. "
},
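{
"cell_type": "markdown",
"metadata": {},
"source": "Note that since $ time $ is measured in months, $ \\beta_1 $ is in degrees Celsius per month. To express it as a more familiar per-century figure it would be multiplied by $ 12 \\times 100 = 1200 $; a slope of $ 0.001 $ would thus correspond to $ 1.2 $ degrees of warming per century. For classifying a trend as upward or downward, however, only the sign of $ \\beta_1 $ matters."
},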
{
"cell_type": "code",
"collapsed": false,
"input": "# Function for running a regression on each station and returning the \n# slope coefficient.\nfunction calcslopes(data, stations)\n coefs = DataArray(Float64,length(stations))\n for i in 1:length(stations)\n # Skip stations with fewer than 360 points of data.\n if length(removeNA(data[stations[i]])) < 360\n continue\n end\n # Construct a expression for each station of the form: \n # station ~ time + month.\n expr = Expr(:call, \n :~, \n symbol(stations[i]), \n Expr(:call, \n :+, \n :time, \n :month\n )\n )\n # Run the regression and extract the coefficient for time.\n coefs[i] = coef(lm(expr,data))[2]\n end\n return coefs\nend\nslopes = calcslopes(data, stations);",
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 3
},
{
"cell_type": "markdown",
"metadata": {},
"source": "The total number of slopes I calculated, after discarding those with less than 360 data points, was 5781."
},
{
"cell_type": "code",
"collapsed": false,
"input": "# The total number of slopws calculated.\ntotalslopes = length(removeNA(slopes))",
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 4,
"text": "5781"
}
],
"prompt_number": 4
},
{
"cell_type": "markdown",
"metadata": {},
"source": "The number of slopes that were negative, indicating a downward trend was 2224. "
},
{
"cell_type": "code",
"collapsed": false,
"input": "# Count the number of positive slopes.\npositiveslopes = count((x)->x>0,removeNA(slopes))\n# Count the number of negative slopes.\nnegativeslopes = count((x)->x<0,removeNA(slopes))",
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 5,
"text": "2224"
}
],
"prompt_number": 5
},
{
"cell_type": "code",
"collapsed": false,
"input": "negativeslopes/totalslopes",
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 6,
"text": "0.3847085279363432"
}
],
"prompt_number": 6
},
{
"cell_type": "markdown",
"metadata": {},
"source": "So roughly 38% of the weather stations in the US have seen temperatures trending downward. "
},
{
"cell_type": "heading",
"level": 2,
"metadata": {},
"source": "Results"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Crichton presents temperature series for these 15 locations:\n\n1. Pasadena, CA 1930-2000, upward trending\n2. Berkely, CA 1930-2000, upward trending\n3. Death Valley, CA 1933-2000, upward trending\n4. McGill, NV 1930-2000, downward trending\n5. Guthire, OK 1930-2000, downward trending\n6. Boulder, CO 1930-1997, downward trending\n7. Truman, MO 1931-2000, downward trending\n8. Greenville, SC 1930-2000, downward trending\n9. Ann Arbor, MI 1930-2000, downward trending\n10. Charleston, SC 1930-2000, upward trending\n11. Syracuse, NY 1930-2000, downward trending\n12. Oswego, NY 1930-2000, downward trending\n13. West Point, NY, 1826-2000, upward trending\n14. New York, NY, 1822-2000, upward trending\n15. Albany, NY 1930-2000, downward trending\n\nTo calculate the probability of picking these trends by chance I need to apply probability theory. The probability of getting 10 downward-sloping trends when picking 15 trends at random follows a hypergeometric distribution."
},
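{
"cell_type": "markdown",
"metadata": {},
"source": "Concretely, if $ N = 5781 $ is the total number of slopes, of which $ K = 2224 $ are negative, then the probability of drawing exactly $ k $ negative slopes in a random sample of $ n = 15 $ stations is\n\n$$ P(X = k) = \\frac{\\binom{K}{k}\\binom{N-K}{n-k}}{\\binom{N}{n}} $$"
},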
{
"cell_type": "code",
"collapsed": false,
"input": "pdf(Hypergeometric(negativeslopes, positiveslopes, 15),10)",
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 7,
"text": "0.01871333418088134"
}
],
"prompt_number": 7
},
{
"cell_type": "markdown",
"metadata": {},
"source": "The probability of getting that exactly 10 downward-sloping when picking 15 is 1.9%."
},
{
"cell_type": "code",
"collapsed": false,
"input": "pdf(Hypergeometric(negativeslopes, positiveslopes, 15),10)+\npdf(Hypergeometric(negativeslopes, positiveslopes, 15),11)+\npdf(Hypergeometric(negativeslopes, positiveslopes, 15),12)+\npdf(Hypergeometric(negativeslopes, positiveslopes, 15),13)+\npdf(Hypergeometric(negativeslopes, positiveslopes, 15),14)+\npdf(Hypergeometric(negativeslopes, positiveslopes, 15),15)",
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 8,
"text": "0.02528649975507994"
}
],
"prompt_number": 8
},
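{
"cell_type": "markdown",
"metadata": {},
"source": "As a cross-check, the same tail probability can be computed from the cumulative distribution function instead of summing the individual probabilities (assuming cdf is implemented for the Hypergeometric distribution in this version of Distributions):"
},
{
"cell_type": "code",
"collapsed": false,
"input": "# P(X >= 10) = 1 - P(X <= 9)\n1 - cdf(Hypergeometric(negativeslopes, positiveslopes, 15), 9)",
"language": "julia",
"metadata": {},
"outputs": [],
"prompt_number": 9
},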
{
"cell_type": "markdown",
"metadata": {},
"source": "The probability of getting 10 or more downward-sloping trends when picking 15 is only 2.5%\n\nBased on these results I would say it is probable that Crichton cherry picked his data in order to get enough downward-sloping temperature trends to show off. "
},
{
"cell_type": "heading",
"level": 2,
"metadata": {},
"source": "References"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "[1] http://www.crichton-official.com/books-stateoffear.html\n\n[2] http://julialang.org/\n\n[3] https://github.com/JuliaStats/DataFrames.jl\n\n[4] https://github.com/JuliaStats/GLM.jl\n\n[5] https://github.com/karbarcca/Datetime.jl\n\n[6] https://github.com/JuliaStats/Distributions.jl\n\n[7] http://www.ncdc.noaa.gov/"
}
],
"metadata": {}
}
]
}