jmcastagnetto/repdata_project2.Rmd

## repdata_project2.Rmd
---
title: "The human health and financial impact of storms (1950-2011)"
author: "Jesus M. Castagnetto"
date: "2014-08-22"
output: html_document
---

**Course: Reproducible Research (Coursera), Assignment 2**

```{r}
# knitr configuration
library(knitr)
opts_knit$set(progress=FALSE)
opts_chunk$set(echo=TRUE, message=FALSE, tidy=TRUE, comment=NA,
               cache=TRUE, fig.path="figure/", fig.keep="high",
               fig.width=10, fig.height=6,
               fig.align="center")
# load required libs
library(dplyr, quietly=TRUE, warn.conflicts=FALSE)
library(ggplot2, quietly=TRUE, warn.conflicts=FALSE)
library(pander, quietly=TRUE, warn.conflicts=FALSE)
library(gridExtra, quietly=TRUE, warn.conflicts=FALSE)
```

## Synopsis

Storms have a (usually negative) effect on human populations, and in this analysis
we study the overall impact on public health (in terms of deaths or
injuries caused by the events), as well as the financial losses incurred
by damage to property and crops.

The data being used was provided in the "Reproducible Research" course (Coursera,
August 2014 session), based on the NOAA NCDC Storm Events database. The data was
cleaned up, recoding the event types to comply with the official list of event
classes mentioned in the accompanying documentation.

In terms of the effects of storms in human health, the results show that
*tornadoes are the most deleterious*, causing about 62% of
the deaths and injuries registered in the data set: 97,043 people or were
injured over the 1950-2011 time period, ~1,500 people/year. In fact, 10 of the event
types (out of 50 types considered by NOAA) are responsible for over 92% of
human victims, in descending order: Tornadoes, lightning, excessive heat,
flooding (including flash floods), thunderstorms, winter/ice storms, high winds
and wildfire.

An analysis of the financial impact of storms, indicate that *floods are the
number one threat to properties*, representing about 150.2 billions[^billions]
USD over the 1950-2011 period (a rate of loss of ~2.5 billion USD a year).

A similar analysis indicate that *drought and floods are responsible for more than
half of the losses for damaged crops*, for a total of 24.83 billions USD over
the 1993-2011 period (~ 1.3 billions USD a year).

## Data Processing

### Data Source

The data for this analysis has been given as part of August 2014 session of the
course "Reproducible Research" (Coursera), and comprised of records from the
NOAA Storm Database, and ancillary documentation.

- Storm data set (https://d396qusza40orc.cloudfront.net/repdata/data/StormData.csv.bz2)
- Documentation:
    - National Weather Service Storm Data Documentation (https://d396qusza40orc.cloudfront.net/repdata/peer2_doc/pd01016005curr.pdf)
    - National Climatic Data Center Storm Events FAQ (https://d396qusza40orc.cloudfront.net/repdata/peer2_doc/NCDC%20Storm%20Events-FAQ%20Page.pdf)

The data set and its documentation were dowloaded using the following code

```{r eval=FALSE}
datasrc <- "https://d396qusza40orc.cloudfront.net/repdata/data/StormData.csv.bz2"
download.file(url=datasrc, destfile="StormData.csv.bz2", method="curl")
datadoc <- "https://d396qusza40orc.cloudfront.net/repdata/peer2_doc/pd01016005curr.pdf"
download.file(url=datadoc, destfile="pd01016005curr.pdf", method="curl")
datafaq <- "https://d396qusza40orc.cloudfront.net/repdata/peer2_doc/NCDC%20Storm%20Events-FAQ%20Page.pdf"
download.file(url=datadoc, destfile="NCDC Storm Events-FAQ Page.pdf", method="curl")
```

### Data exploration

To get an idea of the structure of the data set, the first 10 lines of the
data file were read.

```{r}
tmp1 <- read.csv("StormData.csv.bz2", nrows=10)
str(tmp1)
```

The data set has `r ncol(tmp1)` columns, several of those are relevant to the analysis at
hand, namely those that indicate the date the event was reported (`BGN_DATE`),
in what US State the even occurred (`STATE`), the event type (`EVTYPE`), the
number of people dying (`FATALITIES`) or being injured (`INJURIES`) due to the
event, the economical cost of the damages (`PROPDMG`, `PROPDMGEXP`, `CROPDMG`,
and `CROPDMGEXP`)

```{r}
# removing temporary data frame
rm(tmp1)
```

### Reading, Cleaning and Normalizing the data

To simplify the analysis (and save time and memory), only the relevant columns
will be read from the data file:


```{r loaddata}
storm <- read.csv("StormData.csv.bz2",
                  colClasses=c("NULL", "character", rep("NULL",4),
                               rep("character",2), rep("NULL",14),
                               rep("numeric", 3), "character",
                               "numeric", "character", rep("NULL",9))
                  )
str(storm)
```

The read data set, contains `r nrow(storm)` rows and `r ncol(storm)` columns.

According to the original data source, the event type column (`EVTYPE`)
comprises a definite and limited vocabulary of 50 valid event names.

The list of valid events was extracted from the file `pd01016005curr.pdf`
(section 2.1.1 "Storm Data Event Table"), and saved to a csv file
(`valid_events.csv`)

```{r}
valid_events <- read.csv("valid_events.csv", stringsAsFactors=FALSE)
valid_events$Event.Name <- toupper(valid_events$Event.Name)
# number of valid event names
n_valid <- length(valid_events$Event.Name)
# number of unique event names in the data set
n_evtype <- length(unique(storm$EVTYPE))
goodrows <- subset(storm, EVTYPE %in% valid_events$Event.Name)$EVTYPE
fraction_good <- 100*length(goodrows)/nrow(storm)
```

The crucial `EVTYPE` column did contain more than the documented
`r length(valid_events$Event.Name)`
unique values, in fact it had `r length(unique(storm$EVTYPE))`
possible distinct values.

Not only that, but only about `r round(fraction_good, 1)`% of the records
in the data set correspond to the documented vocabulary for the event type.

Also, there seems to be a great diversity in the way some events have been
recorded over the years, including misspellings, case-mixing, combination
of a an event with some sort of numeric value, etc.

```{r}
set.seed(567)
sample(unique(subset(storm, !EVTYPE %in% valid_events$Event.Name)$EVTYPE), 20)
```

Therefore, some serious data cleanup needs to be done on this column.

```{r}
# normalize all to uppercase
storm$EVTYPE <- toupper(storm$EVTYPE)
events <- storm$EVTYPE
# replace extraneous chars by a single space
events <- gsub("( ){1,}"," ", gsub("[^A-Z0-9 ]"," ", events))
# FLOOD related events
events[grepl("COASTAL|STORM SURGE", events)] <- "COASTAL FLOOD"
events[grepl("FLASH", events)] <- "FLASH FLOOD"
events[!grepl("FLASH|COASTAL", events) & grepl("FLOOD", events)] <- "FLOOD"
events[grepl("STREAM|URBAN", events)] <- "FLOOD"
# HEAT related events
events[grepl("HEAT|DRY", events)] <- "EXCESSIVE HEAT"
events[grepl("HOT|WARM", events)] <- "EXCESSIVE HEAT"
events[grepl("RECORD (HIGH|.*TEMP)|HIGH TEMPERA", events)] <- "EXCESSIVE HEAT"
# COLD related events
events[grepl("SLEET", events)] <- "SLEET"
events[grepl("BLIZZARD", events)] <- "BLIZZARD"
events[grepl("EXTREME", events) & grepl("CHILL|COLD", events)] <- "EXTREME COLD/WIND CHILL"
events[!grepl("EXTREME", events) & grepl("CHILL|COLD", events)] <- "COLD/WIND CHILL"
events[grepl("LAKE", events) & grepl("SNOW", events)] <- "LAKE-EFFECT SNOW"
events[!grepl("LAKE", events) & grepl("SNOW", events)] <- "HEAVY SNOW"
events[grepl("FROST|FREEZE", events)] <- "FROST/FREEZE"
events[!grepl("FROST", events) & grepl("FREEZE", events)] <- "SLEET"
events[grepl("FREEZ", events) & grepl("RAIN", events)] <- "SLEET"
events[grepl("DRIZZLE", events)] <- "SLEET"
events[grepl("(RECORD LOW|LOW TEMP)", events)] <- "EXTREME COLD/WIND CHILL"
events[grepl("GLAZE", events)] <- "EXTREME COLD/WIND CHILL"
events[grepl("ICE", events)] <- "ICE STORM"
events[grepl("WINT", events)] <- "WINTER STORM"
events[grepl("HAIL", events)] <- "HAIL"
# WIND, RAIN and LIGHTING related events
events <- gsub("WINDS", "WIND", events)
events[!grepl("DERSTORM WIND", events) & grepl("THUN|TSTM", events)] <- "LIGHTNING"
events[grepl("LIGHT|LIGN", events)] <- "LIGHTNING"
events[grepl("DERSTORM WIND", events)] <- "THUNDERSTORM WIND"
events[grepl("TORN", events)] <- "TORNADO"
events[grepl("SPOUT", events)] <- "WATERSPOUT"
events[grepl("HURRICANE|TYPHOON", events)] <- "HURRICANE (TYPHOON)"
events[grepl("FIRE", events)] <- "WILDFIRE"
events[!grepl("MARINE", events) & grepl("HIGH WIND", events)] <- "HIGH WIND"
events[grepl("GUST", events)] <- "STRONG WIND"
events[!grepl("COLD|MARINE|THUNDER|STRONG|HIGH", events) & grepl("WIND", events)] <- "STRONG WIND"
events[grepl("FUNNEL", events)] <- "FUNNEL CLOUD"
events[grepl("TROPICAL STORM", events)] <- "TROPICAL STORM"
events[!grepl("FREEZIN", events) & grepl("FOG|VOG", events)] <- "DENSE FOG"
events[grepl("WET|RAIN|SHOWER|PRECIP", events)] <- "HEAVY RAIN"
# DUST related events
events[grepl("DUST DEVEL", events)] <- "DUST DEVIL"
events[!grepl("DEVIL", events) & grepl("DUST", events)] <- "DUST STORM"
# MARINE EVENTS
events[grepl("RIP CURRENT", events)] <- "RIP CURRENT"
events[!grepl("LOW", events) & grepl("TIDE|WAVE|SWELL", events)] <- "STORM SURGE/TIDE"
events[grepl("SURF", events)] <- "HIGH SURF"
# MISC events
events[grepl("VOLCAN", events)] <- "VOLCANIC ASH"
# Not a storm, but is there, so we will classify it
events[grepl("(MUD|LAND|ROCK).*SLIDE", events)] <- "LANDSLIDE"
# everything else
events[grepl("SUMMARY", events)] <- "OTHER/UNKOWN"
events[!events %in% c("LANDSLIDE", "OTHER", valid_events$Event.Name)] <- "OTHER/UNKNOWN"

# re-assign the cleaned up column values
storm$EVTYPE <- events
```

To be able to accumulate by year, a variable was created to store the value
extracted from the `BGN_DATE` column

```{r}
storm$BGN_DATE <- as.Date(storm$BGN_DATE, format="%m/%d/%Y %H:%M:%S")
storm$year <- as.POSIXlt(storm$BGN_DATE)$year + 1900
```

Finally, to be able to estimate the monetary cost due to damage caused
by the storms, we have to examine the appropriate columns.

```{r results='asis'}
# values in
tp <- table(storm$PROPDMGEXP)
tc <- table(storm$CROPDMGEXP)
pander(tp, caption="*Property damage 'exponents'*")
pander(tc, caption="*Crop damage 'exponents'*")
```

It would seem that there are a mixture of coding standards for these columns,
and the great majority of the "exponents" (multipliers really) correspond
to a coding such that:

- H or h: x100
- K or k: x1000
- M or m: x1'000,000
- B or b: x1,000'000,000

The meaning of the other codes is not clear. Even after checking the documentation
on the site that was the source for the data, several incompatible definitions could
be glimpsed:

- The numbers are exponents base 10 ($10^n$)
- The numbers are really an artifact of conversion to CSV, and are part of the
  decimals of the `PROPDMG` column. Even if we accept this interpretation, there
  is the issue as to whether the units are in thousands, millions, or billions
  [^usbillions]
- The numbers are used to indicate categories that represent ranges of values,
  according to a 1959 document ("STORM DATA", May 1959, Volume I No. 5)[^1959code]:
    - Cat. 1 --> Less than USD 50
    - Cat. 2 --> USD 50 to USD 500
    - Cat. 3 --> USD 500 to USD 5,000
    - Cat. 4 --> USD 5,000 to USD 50,000
    - Cat. 5 --> USD 50,000 to USD 500,000
    - Cat. 6 --> USD 500,000 to USD 5'000,000
    - Cat. 7 --> USD 5'000,000 to USD 50'000,000
    - Cat. 8 --> USD 50'000,000 to USD 500'000,000
    - Cat. 9 --> USD 500'000,000 to USD 5,000'000,000

[^usbillions]: As this data comes from USA's NOAA, we are using the "billion"
definition commonly employed there, namely $10^9$, equivalent to the ISO *Giga-*

[^1959code]: http://www1.ncdc.noaa.gov/pub/data/swdi/stormevents/pub-pdf/storm_1959_05.pdf

The bottomline is that is not feasible to apply only one interpretation to the
numeric codes in these columns.

To assertain if it would be possible to omit them in the analysis, we calculated
the percentage of these codes in the column, from among the records that have
a value for `PROPDMG`

```{r results='asis'}
storm$PROPDMGEXP <- toupper(storm$PROPDMGEXP)
storm$CROPDMGEXP <- toupper(storm$CROPDMGEXP)
pdmg_storm <- subset(storm, PROPDMG > 0)
cdmg_storm <- subset(storm, CROPDMG > 0)

undef_p <- 100*sum(!pdmg_storm$PROPDMGEXP %in% c("B", "H", "K", "M")) / nrow(pdmg_storm)
undef_c <- 100*sum(!cdmg_storm$CROPDMGEXP %in% c("B", "H", "K", "M")) / nrow(cdmg_storm)

badcode_storm <- data.frame(column=c("PROPDMGEXP", "CROPDMGEXP"),
                 percent=c(
                    paste0(round(undef_p,3), "%"),
                    paste0(round(undef_c,3), "%")))
colnames(badcode_storm) <- c("Column", "Percent of undefined codes")
pander(badcode_storm)
```

As can be seen, the fraction of records with uninterpretable codes is very
small (<< 1%), thus we can safely drop them from the respective data frames.

```{r}
pdmg_storm <- subset(pdmg_storm, PROPDMGEXP %in% c("B", "H", "K", "M"))
cdmg_storm <- subset(cdmg_storm, CROPDMGEXP %in% c("B", "H", "K", "M"))
```

## Results

In the cleaned up `storm` data set, there is an unequal distribution of the
reported events during the period under analysis, as can be seen from the
table below

```{r results='asis'}
# summary of all events
t_event <- storm %>% group_by(EVTYPE) %>% summarise(total=n()) %>%
    mutate(perc_total=100*total/sum(total)) %>% arrange(desc(total))
# top 10
top10_events <- t_event[1:10,]
top10_percent <- sum(top10_events$perc_total)
colnames(top10_events) <- c("Event Class", "Frequency", "Percentage of reports")
pander(top10_events, caption="*Top 10 events reported in the storm data set*", round=2)
```

The top 10 events in the data set (*vide supra*) are responsible for
`r round(top10_percent,2)`% of the reports from 1950--2011

### Human health impact

The data contains columns that can help us measure the impact of storms
in Public Health, understood in term of the number of victims that suffer
death or injury as a result of one of these events.

```{r}
# records indicating impact on human health
hi_storm <- subset(storm, FATALITIES > 0 | INJURIES > 0)
hi_storm$victims <- hi_storm$FATALITIES + hi_storm$INJURIES
```

About `r round(100*nrow(hi_storm) / nrow(storm), 2)`% of the records
in the Storm data indicate that there were human victims.

In the table below we can see the top 10 storm types (events) that
impacted more human health in the time period under consideration

```{r results='asis'}
hi_table <- hi_storm %>% group_by(EVTYPE) %>%
    summarise(c_tot=sum(victims)) %>%
    arrange(desc(c_tot)) %>%
    mutate(c_perc=100*c_tot/sum(c_tot), c_cummperc=cumsum(c_perc))
colnames(hi_table) <- c("Event type", "Deaths/Injuries", "Percent", "Cumm Percent")
pander(hi_table[1:10,],
       caption = "Top 10 causes of death or injury due to storms [1950-2011]",
       round=1)
```

We can see that the top 10 causes comprise about `r round(hi_table[10, 4], 1)`%
of all the victims affected in all those years, and that Tornadoes are by far the
most important cause of death or injuries to humans.

The impact on humans has not been constant over the years, in fact there have
been major events that went outside the norm, as can be seen in the graph below.

```{r}
t_ph_year <- hi_storm %>% group_by(year) %>%
    summarize(t_fatal=sum(FATALITIES), t_injur=sum(INJURIES))
pdeath <- ggplot(t_ph_year, aes(x=year, y=t_fatal)) +
    geom_line(stat="identity", col="black", size=1.5) +
    xlab("") + ylab("Number of deaths") +
    theme_bw() +
    theme(axis.text.x=element_blank(), axis.ticks.x=element_blank(),
          plot.margin=unit(c(1,1,-1,1), "cm"))
pinjur <- ggplot(t_ph_year, aes(x=year, y=t_injur)) +
    geom_line(stat="identity", col="red", size=1.5) +
    xlab("Year") + ylab("Number of injuries") +
    theme_bw() +
    theme(plot.margin=unit(c(0,1,0,1), "cm"))
grid.arrange(pdeath, pinjur, main = "Deaths and injuries due to storms [1950-2011]")
```

The graph shows events such as 1995's Chicago Heat Wave [^1995chicago] (maximum
value in the top chart) which, during the month of July of that year, caused
about many deaths in a period of only five days[^1995paper].

```{r results='asis'}
chicago1995 <- subset(hi_storm, STATE=="IL" &
                         (BGN_DATE >= "1995-07-12" & BGN_DATE <= "1995-07-16"))
chicago1995$BGN_DATE <- as.character(chicago1995$BGN_DATE)
pander(as.list(chicago1995[,1:5]))
```

Also of note are 1998's South Texas floods [^1998texas],[^1998noaatx], that
in October of that year caused a great number of injuries and death. This
event is responsible for the maximum value in the injuries plot.

```{r results='asis'}
texas1998 <- subset(hi_storm, STATE=="TX" & EVTYPE=="FLOOD" & year==1998 &
                        months(hi_storm$BGN_DATE, abbreviate=TRUE)=="Oct") %>%
    group_by(BGN_DATE, STATE, EVTYPE) %>%
    summarise(tot_fatal=sum(FATALITIES), tot_injur=sum(INJURIES))
texas1998$BGN_DATE <- as.character(texas1998$BGN_DATE)
colnames(texas1998) <- c("Date", "State", "Event", "Total deaths", "Total injuries")
pander(texas1998[,1:5], split.tables=120)
```

[^1995chicago]: http://en.wikipedia.org/wiki/1995_Chicago_heat_wave
[^1995paper]: http://www.annals.org/cgi/content/abstract/129/3/173
[^1998texas]: http://en.wikipedia.org/wiki/October_1998_Texas_Flooding
[^1998noaatx]: http://www.nws.noaa.gov/om/assessments/pdfs/txflood.pdf

### Financial impact

The storms have also had a negative financial impact due to damage produced to
property and crops.

About `r round(100*nrow(pdmg_storm)/nrow(storm), 2)`% of records in the data set
include an estimate for the property damage, and `r round(100*nrow(cdmg_storm)/nrow(storm), 2)`%
have data on the cost of damage to crops.

To evaluate the costs, we will add a column that traduces the character
code into a multiplier, which will allow us to calculate the appropriate
amount in each event.

The top 10 events in terms of property damage are listed in the table below,
with flooding being the number one source of property loss.

```{r results='asis'}
mults <- list("H"=10^2, "K"=10^3, "M"=10^6, "B"=10^9)
pdmg_storm$multiplier <- sapply(pdmg_storm$PROPDMGEXP,
                            function(x){
                                return(mults[[as.character(x)]])
                            })
pdmg_storm$amount <- pdmg_storm$PROPDMG * pdmg_storm$multiplier
pdmg_storm$type <- "Property damage"
summ_pdmg <- pdmg_storm %>% group_by(EVTYPE) %>%
    summarise(total=sum(amount)/10^9) %>%
    mutate(percent=100*total/sum(total)) %>%
    arrange(desc(total))
colnames(summ_pdmg) <- c("Event", "Cost (in Billions USD)", "Percent from total")
pander(summ_pdmg[1:10, ], round=2)
```

And the correspoding events for crop damage shows that drought and flooding
(two counterposed atmospheric events) are responsible for more that 50% of
losses to crops.

```{r results='asis'}
cdmg_storm$multiplier <- sapply(cdmg_storm$CROPDMGEXP,
                            function(x){
                                return(mults[[as.character(x)]])
                            })
cdmg_storm$amount <- cdmg_storm$CROPDMG * cdmg_storm$multiplier
cdmg_storm$type <- "Crop damage"
summ_cdmg <- cdmg_storm %>% group_by(EVTYPE) %>%
    summarise(total=sum(amount)/10^9) %>%
    mutate(percent=100*total/sum(total)) %>%
    arrange(desc(total))
colnames(summ_cdmg) <- c("Event", "Cost (in Billions USD)", "Percent from total")
pander(summ_cdmg[1:10, ], round=2)
```

When looking at the total losses per year over the study period, we observe
a definite growth trend due to property damage by storms. Whereas, for crops
there has been a decrease, at least since `r min(cdmg_storm$year)`,
which is the first record of such losses in the data set.

For illustration purposes (because it might not be the best model for this data),
we are superimposing a linear estimate, mainly to drive home the possible
underlying trend.

```{r}
dmg <- rbind(
    pdmg_storm %>% select(STATE, EVTYPE, year, amount, type),
    cdmg_storm %>% select(STATE, EVTYPE, year, amount, type)
    )
dmg$type <- as.factor(dmg$type)
summyr_dmg <- dmg %>% group_by(type, year) %>%
    summarise(yr_damage=sum(amount)) %>%
    arrange(type, year)
dmg_yr_plot <- ggplot(summyr_dmg, aes(x=year, y=yr_damage/10^9, colour=type)) +
    geom_point(aes(group=type)) + scale_y_log10() + geom_smooth(method="lm") +
    ylab("Damage in billions of USD") + xlab("Year") +
    ggtitle("Property and Crop Damage caused by storms [1950-2011]") +
    scale_color_discrete(guide=FALSE) +
    facet_wrap(~type, nrow=1) + theme_bw()
dmg_yr_plot
```

Some non-exhaustive reasons could be advanced for these trends:

1. There has been a steady increase in the size and density of urban populations,
   so storms can generate a bigger financial loss beacuse over the same area.
2. There has been a decrease in the population of rural areas, and an increase
   in the yield of crops due to modernization of the agricultural methods.
3. There is better forecasting and/or advanced warning of impending storms, so
   farmers can take measures to minimize loss.

## Reproducibility information

```{r}
sessionInfo()
```

The source code for this document can be found at the URL:
https://gist.github.com/jmcastagnetto/c4f9dad8f7b0fc146198

## valid_events.csv
"Event Name"
Astronomical Low Tide
Avalanche
Blizzard
Coastal Flood
Cold/Wind Chill
Debris Flow
Dense Fog
Dense Smoke
Drought
Dust Devil
Dust Storm
Excessive Heat
Extreme Cold/Wind Chill
Flash Flood
Flood
Frost/Freeze
Funnel Cloud
Freezing Fog
Hail
Heat
Heavy Rain
Heavy Snow
High Surf
High Wind
Designator
Event Name
Hurricane (Typhoon)
Ice Storm
Lake-Effect Snow
Lakeshore Flood
Lightning
Marine Hail
Marine High Wind
Marine Strong Wind
Marine Thunderstorm Wind
Rip Current
Seiche
Sleet
Storm Surge/Tide
Strong Wind
Thunderstorm Wind
Tornado
Tropical Depression
Tropical Storm
Tsunami
Volcanic Ash
Waterspout
Wildfire
Winter Storm
Winter Weather
	---
	title: "The human health and financial impact of storms (1950-2011)"
	author: "Jesus M. Castagnetto"
	date: "2014-08-22"
	output: html_document
	---

	Course: Reproducible Research (Coursera), Assignment 2

	```{r}
	# knitr configuration
	library(knitr)
	opts_knit$set(progress=FALSE)
	opts_chunk$set(echo=TRUE, message=FALSE, tidy=TRUE, comment=NA,
	cache=TRUE, fig.path="figure/", fig.keep="high",
	fig.width=10, fig.height=6,
	fig.align="center")
	# load required libs
	library(dplyr, quietly=TRUE, warn.conflicts=FALSE)
	library(ggplot2, quietly=TRUE, warn.conflicts=FALSE)
	library(pander, quietly=TRUE, warn.conflicts=FALSE)
	library(gridExtra, quietly=TRUE, warn.conflicts=FALSE)
	```

	## Synopsis

	Storms have a (usually negative) effect on human populations, and in this analysis
	we study the overall impact on public health (in terms of deaths or
	injuries caused by the events), as well as the financial losses incurred
	by damage to property and crops.

	The data being used was provided in the "Reproducible Research" course (Coursera,
	August 2014 session), based on the NOAA NCDC Storm Events database. The data was
	cleaned up, recoding the event types to comply with the official list of event
	classes mentioned in the accompanying documentation.

	In terms of the effects of storms in human health, the results show that
	tornadoes are the most deleterious, causing about 62% of
	the deaths and injuries registered in the data set: 97,043 people or were
	injured over the 1950-2011 time period, ~1,500 people/year. In fact, 10 of the event
	types (out of 50 types considered by NOAA) are responsible for over 92% of
	human victims, in descending order: Tornadoes, lightning, excessive heat,
	flooding (including flash floods), thunderstorms, winter/ice storms, high winds
	and wildfire.

	An analysis of the financial impact of storms, indicate that *floods are the
	number one threat to properties*, representing about 150.2 billions[^billions]
	USD over the 1950-2011 period (a rate of loss of ~2.5 billion USD a year).

	A similar analysis indicate that *drought and floods are responsible for more than
	half of the losses for damaged crops*, for a total of 24.83 billions USD over
	the 1993-2011 period (~ 1.3 billions USD a year).

	## Data Processing

	### Data Source

	The data for this analysis has been given as part of August 2014 session of the
	course "Reproducible Research" (Coursera), and comprised of records from the
	NOAA Storm Database, and ancillary documentation.

	- Storm data set (https://d396qusza40orc.cloudfront.net/repdata/data/StormData.csv.bz2)
	- Documentation:
	- National Weather Service Storm Data Documentation (https://d396qusza40orc.cloudfront.net/repdata/peer2_doc/pd01016005curr.pdf)
	- National Climatic Data Center Storm Events FAQ (https://d396qusza40orc.cloudfront.net/repdata/peer2_doc/NCDC%20Storm%20Events-FAQ%20Page.pdf)

	The data set and its documentation were dowloaded using the following code

	```{r eval=FALSE}
	datasrc <- "https://d396qusza40orc.cloudfront.net/repdata/data/StormData.csv.bz2"
	download.file(url=datasrc, destfile="StormData.csv.bz2", method="curl")
	datadoc <- "https://d396qusza40orc.cloudfront.net/repdata/peer2_doc/pd01016005curr.pdf"
	download.file(url=datadoc, destfile="pd01016005curr.pdf", method="curl")
	datafaq <- "https://d396qusza40orc.cloudfront.net/repdata/peer2_doc/NCDC%20Storm%20Events-FAQ%20Page.pdf"
	download.file(url=datadoc, destfile="NCDC Storm Events-FAQ Page.pdf", method="curl")
	```

	### Data exploration

	To get an idea of the structure of the data set, the first 10 lines of the
	data file were read.

	```{r}
	tmp1 <- read.csv("StormData.csv.bz2", nrows=10)
	str(tmp1)
	```

	The data set has `r ncol(tmp1)` columns, several of those are relevant to the analysis at
	hand, namely those that indicate the date the event was reported (`BGN_DATE`),
	in what US State the even occurred (`STATE`), the event type (`EVTYPE`), the
	number of people dying (`FATALITIES`) or being injured (`INJURIES`) due to the
	event, the economical cost of the damages (`PROPDMG`, `PROPDMGEXP`, `CROPDMG`,
	and `CROPDMGEXP`)

	```{r}
	# removing temporary data frame
	rm(tmp1)
	```

	### Reading, Cleaning and Normalizing the data

	To simplify the analysis (and save time and memory), only the relevant columns
	will be read from the data file:


	```{r loaddata}
	storm <- read.csv("StormData.csv.bz2",
	colClasses=c("NULL", "character", rep("NULL",4),
	rep("character",2), rep("NULL",14),
	rep("numeric", 3), "character",
	"numeric", "character", rep("NULL",9))
	)
	str(storm)
	```

	The read data set, contains `r nrow(storm)` rows and `r ncol(storm)` columns.

	According to the original data source, the event type column (`EVTYPE`)
	comprises a definite and limited vocabulary of 50 valid event names.

	The list of valid events was extracted from the file `pd01016005curr.pdf`
	(section 2.1.1 "Storm Data Event Table"), and saved to a csv file
	(`valid_events.csv`)

	```{r}
	valid_events <- read.csv("valid_events.csv", stringsAsFactors=FALSE)
	valid_events$Event.Name <- toupper(valid_events$Event.Name)
	# number of valid event names
	n_valid <- length(valid_events$Event.Name)
	# number of unique event names in the data set
	n_evtype <- length(unique(storm$EVTYPE))
	goodrows <- subset(storm, EVTYPE %in% valid_events$Event.Name)$EVTYPE
	fraction_good <- 100*length(goodrows)/nrow(storm)
	```

	The crucial `EVTYPE` column did contain more than the documented
	`r length(valid_events$Event.Name)`
	unique values, in fact it had `r length(unique(storm$EVTYPE))`
	possible distinct values.

	Not only that, but only about `r round(fraction_good, 1)`% of the records
	in the data set correspond to the documented vocabulary for the event type.

	Also, there seems to be a great diversity in the way some events have been
	recorded over the years, including misspellings, case-mixing, combination
	of a an event with some sort of numeric value, etc.

	```{r}
	set.seed(567)
	sample(unique(subset(storm, !EVTYPE %in% valid_events$Event.Name)$EVTYPE), 20)
	```

	Therefore, some serious data cleanup needs to be done on this column.

	```{r}
	# normalize all to uppercase
	storm$EVTYPE <- toupper(storm$EVTYPE)
	events <- storm$EVTYPE
	# replace extraneous chars by a single space
	events <- gsub("( ){1,}"," ", gsub("[^A-Z0-9 ]"," ", events))
	# FLOOD related events
	events[grepl("COASTAL\|STORM SURGE", events)] <- "COASTAL FLOOD"
	events[grepl("FLASH", events)] <- "FLASH FLOOD"
	events[!grepl("FLASH\|COASTAL", events) & grepl("FLOOD", events)] <- "FLOOD"
	events[grepl("STREAM\|URBAN", events)] <- "FLOOD"
	# HEAT related events
	events[grepl("HEAT\|DRY", events)] <- "EXCESSIVE HEAT"
	events[grepl("HOT\|WARM", events)] <- "EXCESSIVE HEAT"
	events[grepl("RECORD (HIGH\|.*TEMP)\|HIGH TEMPERA", events)] <- "EXCESSIVE HEAT"
	# COLD related events
	events[grepl("SLEET", events)] <- "SLEET"
	events[grepl("BLIZZARD", events)] <- "BLIZZARD"
	events[grepl("EXTREME", events) & grepl("CHILL\|COLD", events)] <- "EXTREME COLD/WIND CHILL"
	events[!grepl("EXTREME", events) & grepl("CHILL\|COLD", events)] <- "COLD/WIND CHILL"
	events[grepl("LAKE", events) & grepl("SNOW", events)] <- "LAKE-EFFECT SNOW"
	events[!grepl("LAKE", events) & grepl("SNOW", events)] <- "HEAVY SNOW"
	events[grepl("FROST\|FREEZE", events)] <- "FROST/FREEZE"
	events[!grepl("FROST", events) & grepl("FREEZE", events)] <- "SLEET"
	events[grepl("FREEZ", events) & grepl("RAIN", events)] <- "SLEET"
	events[grepl("DRIZZLE", events)] <- "SLEET"
	events[grepl("(RECORD LOW\|LOW TEMP)", events)] <- "EXTREME COLD/WIND CHILL"
	events[grepl("GLAZE", events)] <- "EXTREME COLD/WIND CHILL"
	events[grepl("ICE", events)] <- "ICE STORM"
	events[grepl("WINT", events)] <- "WINTER STORM"
	events[grepl("HAIL", events)] <- "HAIL"
	# WIND, RAIN and LIGHTING related events
	events <- gsub("WINDS", "WIND", events)
	events[!grepl("DERSTORM WIND", events) & grepl("THUN\|TSTM", events)] <- "LIGHTNING"
	events[grepl("LIGHT\|LIGN", events)] <- "LIGHTNING"
	events[grepl("DERSTORM WIND", events)] <- "THUNDERSTORM WIND"
	events[grepl("TORN", events)] <- "TORNADO"
	events[grepl("SPOUT", events)] <- "WATERSPOUT"
	events[grepl("HURRICANE\|TYPHOON", events)] <- "HURRICANE (TYPHOON)"
	events[grepl("FIRE", events)] <- "WILDFIRE"
	events[!grepl("MARINE", events) & grepl("HIGH WIND", events)] <- "HIGH WIND"
	events[grepl("GUST", events)] <- "STRONG WIND"
	events[!grepl("COLD\|MARINE\|THUNDER\|STRONG\|HIGH", events) & grepl("WIND", events)] <- "STRONG WIND"
	events[grepl("FUNNEL", events)] <- "FUNNEL CLOUD"
	events[grepl("TROPICAL STORM", events)] <- "TROPICAL STORM"
	events[!grepl("FREEZIN", events) & grepl("FOG\|VOG", events)] <- "DENSE FOG"
	events[grepl("WET\|RAIN\|SHOWER\|PRECIP", events)] <- "HEAVY RAIN"
	# DUST related events
	events[grepl("DUST DEVEL", events)] <- "DUST DEVIL"
	events[!grepl("DEVIL", events) & grepl("DUST", events)] <- "DUST STORM"
	# MARINE EVENTS
	events[grepl("RIP CURRENT", events)] <- "RIP CURRENT"
	events[!grepl("LOW", events) & grepl("TIDE\|WAVE\|SWELL", events)] <- "STORM SURGE/TIDE"
	events[grepl("SURF", events)] <- "HIGH SURF"
	# MISC events
	events[grepl("VOLCAN", events)] <- "VOLCANIC ASH"
	# Not a storm, but is there, so we will classify it
	events[grepl("(MUD\|LAND\|ROCK).*SLIDE", events)] <- "LANDSLIDE"
	# everything else
	events[grepl("SUMMARY", events)] <- "OTHER/UNKOWN"
	events[!events %in% c("LANDSLIDE", "OTHER", valid_events$Event.Name)] <- "OTHER/UNKNOWN"

	# re-assign the cleaned up column values
	storm$EVTYPE <- events
	```

	To be able to accumulate by year, a variable was created to store the value
	extracted from the `BGN_DATE` column

	```{r}
	storm$BGN_DATE <- as.Date(storm$BGN_DATE, format="%m/%d/%Y %H:%M:%S")
	storm$year <- as.POSIXlt(storm$BGN_DATE)$year + 1900
	```

	Finally, to be able to estimate the monetary cost due to damage caused
	by the storms, we have to examine the appropriate columns.

	```{r results='asis'}
	# values in
	tp <- table(storm$PROPDMGEXP)
	tc <- table(storm$CROPDMGEXP)
	pander(tp, caption="Property damage 'exponents'")
	pander(tc, caption="Crop damage 'exponents'")
	```

	It would seem that there are a mixture of coding standards for these columns,
	and the great majority of the "exponents" (multipliers really) correspond
	to a coding such that:

	- H or h: x100
	- K or k: x1000
	- M or m: x1'000,000
	- B or b: x1,000'000,000

	The meaning of the other codes is not clear. Even after checking the documentation
	on the site that was the source for the data, several incompatible definitions could
	be glimpsed:

	- The numbers are exponents base 10 ($10^n$)
	- The numbers are really an artifact of conversion to CSV, and are part of the
	decimals of the `PROPDMG` column. Even if we accept this interpretation, there
	is the issue as to whether the units are in thousands, millions, or billions
	[^usbillions]
	- The numbers are used to indicate categories that represent ranges of values,
	according to a 1959 document ("STORM DATA", May 1959, Volume I No. 5)[^1959code]:
	- Cat. 1 --> Less than USD 50
	- Cat. 2 --> USD 50 to USD 500
	- Cat. 3 --> USD 500 to USD 5,000
	- Cat. 4 --> USD 5,000 to USD 50,000
	- Cat. 5 --> USD 50,000 to USD 500,000
	- Cat. 6 --> USD 500,000 to USD 5'000,000
	- Cat. 7 --> USD 5'000,000 to USD 50'000,000
	- Cat. 8 --> USD 50'000,000 to USD 500'000,000
	- Cat. 9 --> USD 500'000,000 to USD 5,000'000,000

	[^usbillions]: As this data comes from USA's NOAA, we are using the "billion"
	definition commonly employed there, namely $10^9$, equivalent to the ISO Giga-

	[^1959code]: http://www1.ncdc.noaa.gov/pub/data/swdi/stormevents/pub-pdf/storm_1959_05.pdf

	The bottomline is that is not feasible to apply only one interpretation to the
	numeric codes in these columns.

	To assertain if it would be possible to omit them in the analysis, we calculated
	the percentage of these codes in the column, from among the records that have
	a value for `PROPDMG`

	```{r results='asis'}
	storm$PROPDMGEXP <- toupper(storm$PROPDMGEXP)
	storm$CROPDMGEXP <- toupper(storm$CROPDMGEXP)
	pdmg_storm <- subset(storm, PROPDMG > 0)
	cdmg_storm <- subset(storm, CROPDMG > 0)

	undef_p <- 100*sum(!pdmg_storm$PROPDMGEXP %in% c("B", "H", "K", "M")) / nrow(pdmg_storm)
	undef_c <- 100*sum(!cdmg_storm$CROPDMGEXP %in% c("B", "H", "K", "M")) / nrow(cdmg_storm)

	badcode_storm <- data.frame(column=c("PROPDMGEXP", "CROPDMGEXP"),
	percent=c(
	paste0(round(undef_p,3), "%"),
	paste0(round(undef_c,3), "%")))
	colnames(badcode_storm) <- c("Column", "Percent of undefined codes")
	pander(badcode_storm)
	```

	As can be seen, the fraction of records with uninterpretable codes is very
	small (<< 1%), thus we can safely drop them from the respective data frames.

	```{r}
	pdmg_storm <- subset(pdmg_storm, PROPDMGEXP %in% c("B", "H", "K", "M"))
	cdmg_storm <- subset(cdmg_storm, CROPDMGEXP %in% c("B", "H", "K", "M"))
	```

	## Results

	In the cleaned up `storm` data set, there is an unequal distribution of the
	reported events during the period under analysis, as can be seen from the
	table below

	```{r results='asis'}
	# summary of all events
	t_event <- storm %>% group_by(EVTYPE) %>% summarise(total=n()) %>%
	mutate(perc_total=100*total/sum(total)) %>% arrange(desc(total))
	# top 10
	top10_events <- t_event[1:10,]
	top10_percent <- sum(top10_events$perc_total)
	colnames(top10_events) <- c("Event Class", "Frequency", "Percentage of reports")
	pander(top10_events, caption="Top 10 events reported in the storm data set", round=2)
	```

	The top 10 events in the data set (vide supra) are responsible for
	`r round(top10_percent,2)`% of the reports from 1950--2011

	### Human health impact

	The data contains columns that can help us measure the impact of storms
	in Public Health, understood in term of the number of victims that suffer
	death or injury as a result of one of these events.

	```{r}
	# records indicating impact on human health
	hi_storm <- subset(storm, FATALITIES > 0 \| INJURIES > 0)
	hi_storm$victims <- hi_storm$FATALITIES + hi_storm$INJURIES
	```

	About `r round(100*nrow(hi_storm) / nrow(storm), 2)`% of the records
	in the Storm data indicate that there were human victims.

	In the table below we can see the top 10 storm types (events) that
	impacted more human health in the time period under consideration

	```{r results='asis'}
	hi_table <- hi_storm %>% group_by(EVTYPE) %>%
	summarise(c_tot=sum(victims)) %>%
	arrange(desc(c_tot)) %>%
	mutate(c_perc=100*c_tot/sum(c_tot), c_cummperc=cumsum(c_perc))
	colnames(hi_table) <- c("Event type", "Deaths/Injuries", "Percent", "Cumm Percent")
	pander(hi_table[1:10,],
	caption = "Top 10 causes of death or injury due to storms [1950-2011]",
	round=1)
	```

	We can see that the top 10 causes comprise about `r round(hi_table[10, 4], 1)`%
	of all the victims affected in all those years, and that Tornadoes are by far the
	most important cause of death or injuries to humans.

	The impact on humans has not been constant over the years, in fact there have
	been major events that went outside the norm, as can be seen in the graph below.

	```{r}
	t_ph_year <- hi_storm %>% group_by(year) %>%
	summarize(t_fatal=sum(FATALITIES), t_injur=sum(INJURIES))
	pdeath <- ggplot(t_ph_year, aes(x=year, y=t_fatal)) +
	geom_line(stat="identity", col="black", size=1.5) +
	xlab("") + ylab("Number of deaths") +
	theme_bw() +
	theme(axis.text.x=element_blank(), axis.ticks.x=element_blank(),
	plot.margin=unit(c(1,1,-1,1), "cm"))
	pinjur <- ggplot(t_ph_year, aes(x=year, y=t_injur)) +
	geom_line(stat="identity", col="red", size=1.5) +
	xlab("Year") + ylab("Number of injuries") +
	theme_bw() +
	theme(plot.margin=unit(c(0,1,0,1), "cm"))
	grid.arrange(pdeath, pinjur, main = "Deaths and injuries due to storms [1950-2011]")
	```

	The graph shows events such as 1995's Chicago Heat Wave [^1995chicago] (maximum
	value in the top chart) which, during the month of July of that year, caused
	about many deaths in a period of only five days[^1995paper].

	```{r results='asis'}
	chicago1995 <- subset(hi_storm, STATE=="IL" &
	(BGN_DATE >= "1995-07-12" & BGN_DATE <= "1995-07-16"))
	chicago1995$BGN_DATE <- as.character(chicago1995$BGN_DATE)
	pander(as.list(chicago1995[,1:5]))
	```

	Also of note are 1998's South Texas floods [^1998texas],[^1998noaatx], that
	in October of that year caused a great number of injuries and death. This
	event is responsible for the maximum value in the injuries plot.

	```{r results='asis'}
	texas1998 <- subset(hi_storm, STATE=="TX" & EVTYPE=="FLOOD" & year==1998 &
	months(hi_storm$BGN_DATE, abbreviate=TRUE)=="Oct") %>%
	group_by(BGN_DATE, STATE, EVTYPE) %>%
	summarise(tot_fatal=sum(FATALITIES), tot_injur=sum(INJURIES))
	texas1998$BGN_DATE <- as.character(texas1998$BGN_DATE)
	colnames(texas1998) <- c("Date", "State", "Event", "Total deaths", "Total injuries")
	pander(texas1998[,1:5], split.tables=120)
	```

	[^1995chicago]: http://en.wikipedia.org/wiki/1995_Chicago_heat_wave
	[^1995paper]: http://www.annals.org/cgi/content/abstract/129/3/173
	[^1998texas]: http://en.wikipedia.org/wiki/October_1998_Texas_Flooding
	[^1998noaatx]: http://www.nws.noaa.gov/om/assessments/pdfs/txflood.pdf

	### Financial impact

	The storms have also had a negative financial impact due to damage produced to
	property and crops.

	About `r round(100*nrow(pdmg_storm)/nrow(storm), 2)`% of records in the data set
	include an estimate for the property damage, and `r round(100*nrow(cdmg_storm)/nrow(storm), 2)`%
	have data on the cost of damage to crops.

	To evaluate the costs, we will add a column that traduces the character
	code into a multiplier, which will allow us to calculate the appropriate
	amount in each event.

	The top 10 events in terms of property damage are listed in the table below,
	with flooding being the number one source of property loss.

	```{r results='asis'}
	mults <- list("H"=10^2, "K"=10^3, "M"=10^6, "B"=10^9)
	pdmg_storm$multiplier <- sapply(pdmg_storm$PROPDMGEXP,
	function(x){
	return(mults[[as.character(x)]])
	})
	pdmg_storm$amount <- pdmg_storm$PROPDMG * pdmg_storm$multiplier
	pdmg_storm$type <- "Property damage"
	summ_pdmg <- pdmg_storm %>% group_by(EVTYPE) %>%
	summarise(total=sum(amount)/10^9) %>%
	mutate(percent=100*total/sum(total)) %>%
	arrange(desc(total))
	colnames(summ_pdmg) <- c("Event", "Cost (in Billions USD)", "Percent from total")
	pander(summ_pdmg[1:10, ], round=2)
	```

	And the correspoding events for crop damage shows that drought and flooding
	(two counterposed atmospheric events) are responsible for more that 50% of
	losses to crops.

	```{r results='asis'}
	cdmg_storm$multiplier <- sapply(cdmg_storm$CROPDMGEXP,
	function(x){
	return(mults[[as.character(x)]])
	})
	cdmg_storm$amount <- cdmg_storm$CROPDMG * cdmg_storm$multiplier
	cdmg_storm$type <- "Crop damage"
	summ_cdmg <- cdmg_storm %>% group_by(EVTYPE) %>%
	summarise(total=sum(amount)/10^9) %>%
	mutate(percent=100*total/sum(total)) %>%
	arrange(desc(total))
	colnames(summ_cdmg) <- c("Event", "Cost (in Billions USD)", "Percent from total")
	pander(summ_cdmg[1:10, ], round=2)
	```

	When looking at the total losses per year over the study period, we observe
	a definite growth trend due to property damage by storms. Whereas, for crops
	there has been a decrease, at least since `r min(cdmg_storm$year)`,
	which is the first record of such losses in the data set.

	For illustration purposes (because it might not be the best model for this data),
	we are superimposing a linear estimate, mainly to drive home the possible
	underlying trend.

	```{r}
	dmg <- rbind(
	pdmg_storm %>% select(STATE, EVTYPE, year, amount, type),
	cdmg_storm %>% select(STATE, EVTYPE, year, amount, type)
	)
	dmg$type <- as.factor(dmg$type)
	summyr_dmg <- dmg %>% group_by(type, year) %>%
	summarise(yr_damage=sum(amount)) %>%
	arrange(type, year)
	dmg_yr_plot <- ggplot(summyr_dmg, aes(x=year, y=yr_damage/10^9, colour=type)) +
	geom_point(aes(group=type)) + scale_y_log10() + geom_smooth(method="lm") +
	ylab("Damage in billions of USD") + xlab("Year") +
	ggtitle("Property and Crop Damage caused by storms [1950-2011]") +
	scale_color_discrete(guide=FALSE) +
	facet_wrap(~type, nrow=1) + theme_bw()
	dmg_yr_plot
	```

	Some non-exhaustive reasons could be advanced for these trends:

	1. There has been a steady increase in the size and density of urban populations,
	so storms can generate a bigger financial loss beacuse over the same area.
	2. There has been a decrease in the population of rural areas, and an increase
	in the yield of crops due to modernization of the agricultural methods.
	3. There is better forecasting and/or advanced warning of impending storms, so
	farmers can take measures to minimize loss.

	## Reproducibility information

	```{r}
	sessionInfo()
	```

	The source code for this document can be found at the URL:
	https://gist.github.com/jmcastagnetto/c4f9dad8f7b0fc146198
	"Event Name"
	Astronomical Low Tide
	Avalanche
	Blizzard
	Coastal Flood
	Cold/Wind Chill
	Debris Flow
	Dense Fog
	Dense Smoke
	Drought
	Dust Devil
	Dust Storm
	Excessive Heat
	Extreme Cold/Wind Chill
	Flash Flood
	Flood
	Frost/Freeze
	Funnel Cloud
	Freezing Fog
	Hail
	Heat
	Heavy Rain
	Heavy Snow
	High Surf
	High Wind
	Designator
	Event Name
	Hurricane (Typhoon)
	Ice Storm
	Lake-Effect Snow
	Lakeshore Flood
	Lightning
	Marine Hail
	Marine High Wind
	Marine Strong Wind
	Marine Thunderstorm Wind
	Rip Current
	Seiche
	Sleet
	Storm Surge/Tide
	Strong Wind
	Thunderstorm Wind
	Tornado
	Tropical Depression
	Tropical Storm
	Tsunami
	Volcanic Ash
	Waterspout
	Wildfire
	Winter Storm
	Winter Weather