% psychemedia/bowelCancer.Rnw
\documentclass[a4paper]{article}
\SweaveOpts{echo=FALSE, keep.source=TRUE}
\usepackage{a4wide}
\usepackage{color}
\usepackage{hyperref}

\begin{document}
\section{Example of self-documenting data journalism notes}

This is an example of using Sweave to combine code and output from the R statistical programming environment with the LaTeX document processing environment. The result is a self-documenting script in which the actual code used to do the stats and generate the statistical graphics is displayed alongside the charts it directly produces.
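
As a rough guide to how a file like this gets turned into a PDF, the commands below sketch the two processing steps. The file name {\tt bowelCancer.Rnw} is assumed here, and the chunk is marked {\tt eval=FALSE} so that it is shown but not run when the document itself is processed.

<< echo = T, eval = FALSE >>=
# Step 1: run the R code chunks and weave their output into a LaTeX file
# (assumes the source file is called bowelCancer.Rnw and sits in the working directory)
Sweave("bowelCancer.Rnw")
# Step 2: compile the generated LaTeX file to PDF (needs a LaTeX installation)
tools::texi2pdf("bowelCancer.tex")
@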

\subsection{Getting Started...}
The aim is to try to replicate a graphic included by Ben Goldacre in his article \emph{DIY statistical analysis: experience the thrill of touching real data}\footnote{\url{http://www.guardian.co.uk/commentisfree/2011/oct/28/bad-science-diy-data-analysis}}.


<< echo = T >>=
# The << echo = T >>= line marks the start of an R code region;
# echo=T means the code itself is shown in the document, along with any output it prints
# In the code area, lines beginning with a # are comment lines and are not executed

#First, we need to load in the XML library that contains the scraper function
library(XML)

#Now we scrape the table
srcURL='http://www.guardian.co.uk/commentisfree/2011/oct/28/bad-science-diy-data-analysis'
cancerdata=data.frame(
  readHTMLTable( srcURL, which=1, header=c('Area','Rate','Population','Number') ) )

#The @ symbol on its own at the start of a line marks the end of a code block
@

The format is simple: {\tt readHTMLTable(url, which=TABLENUMBER)}, where {\tt TABLENUMBER} selects the Nth table in the page. The {\tt header} argument labels the columns (the data pulled in from the HTML table itself contains all sorts of clutter).
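
To see what {\tt readHTMLTable} does without relying on a live web page, here is a minimal sketch that parses a small HTML table held in a string. The table contents and the names {\tt demoHTML}, {\tt demoDoc} and {\tt demoTable} are made up for illustration; they are not the Guardian data.

<< echo = T >>=
library(XML)
# A tiny, made-up HTML page containing a single table (illustrative values only)
demoHTML <- paste0(
  "<html><body><table>",
  "<tr><th>Area</th><th>Rate</th></tr>",
  "<tr><td>Area A</td><td>17.2</td></tr>",
  "<tr><td>Area B</td><td>9.5</td></tr>",
  "</table></body></html>")
# Parse the string as HTML, then pull out the first table on the "page"
demoDoc <- htmlParse(demoHTML, asText = TRUE)
demoTable <- readHTMLTable(demoDoc, which = 1, header = TRUE,
                           stringsAsFactors = FALSE)
demoTable
# The values arrive as text, not as numbers
class(demoTable$Rate)
@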

We can inspect the data we've imported as follows:

<< echo = T >>=
#Look at the whole table (the whole table is quite long,
# so don't display it; the command is commented out for now).
#cancerdata
#If you are using RStudio, you can inspect the data using the command: View(cancerdata)
#Look at the column headers
names(cancerdata)
#If there's a problem with the header argument not doing its stuff,
# uncomment the next two lines to fix it and check the fix...
#names(cancerdata)<-c('Area','Rate','Population','Number')
#names(cancerdata)
#Look at the first few rows (head() shows 6 by default)
head(cancerdata)
#Look at the last few rows (tail() also shows 6 by default)
tail(cancerdata)
#What sort of datatype is in the Number column?
class(cancerdata$Number)
@


The last line, {\tt class(cancerdata\$Number)}, identifies the data as type \emph{factor}. In order to do stats and plot graphs, we need the Number, Rate and Population columns to contain actual numbers. (Factors organise data into categories: when the table is loaded, each value arrives as a string of characters, so rather than being treated as a number it is treated as a category label.) The next code block converts each of these columns to a numeric type.


<< echo=T >>=
#Convert the numerical columns to a numeric datatype
cancerdata$Rate =
  as.numeric(levels(cancerdata$Rate)[as.integer(cancerdata$Rate)])
cancerdata$Population =
  as.numeric(levels(cancerdata$Population)[as.integer(cancerdata$Population)])
cancerdata$Number =
  as.numeric(levels(cancerdata$Number)[as.integer(cancerdata$Number)])
#Just check it worked...
class(cancerdata$Number)
class(cancerdata$Rate)
class(cancerdata$Population)
head(cancerdata)
@
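
The conversion idiom above is easy to get wrong, so here is a small self-contained sketch using a made-up factor ({\tt demoFactor} is not part of the cancer data). Calling {\tt as.numeric} directly on a factor returns the internal category codes rather than the values; going via the levels, or via {\tt as.character}, recovers the numbers.

<< echo = T >>=
# A made-up factor whose labels look like numbers
demoFactor <- factor(c("17.2", "9.5", "17.2", "30.1"))
# Wrong: this returns the internal integer codes of the categories
as.numeric(demoFactor)
# Right: look up each code in the table of levels, then convert
as.numeric(levels(demoFactor)[as.integer(demoFactor)])
# An equivalent, arguably more readable, alternative
as.numeric(as.character(demoFactor))
@

The {\tt as.character} form also works if a column has already been read in as character strings, which the levels-based idiom does not.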

We can now plot the data as a simple scatterplot using the {\tt plot} command (figure \ref{fig:simpleplot}) or we can add a title to the graph and tweak the axis labels (figure \ref{fig:simpleplot2}).

\begin{figure}
\begin{center}
<<label=fig_simpleplot,fig=TRUE,echo=T>>=
#Plot the Number of deaths by the Population
plot(Number ~ Population, data=cancerdata)
@
\end{center}
\caption{Vanilla scatter plot}
\label{fig:simpleplot}
\end{figure}

\begin{figure}
\begin{center}
<<label=fig_simpleplot2,fig=TRUE,echo=T>>=
#Plot the Number of deaths by the Population.
#Add in a title (main) and tweak the y-axis label (ylab).
plot(Number ~ Population, data=cancerdata,
     main='Bowel Cancer Occurrence by Population', ylab='Number of deaths')
@
\end{center}
\caption{Scatter plot with a title and a relabelled y-axis}
\label{fig:simpleplot2}
\end{figure}

The {\tt plot} command is great for generating quick charts. If we want a bit more control over the charts we produce, the {\tt ggplot2} library is the way to go. (ggplot2 isn't part of the standard R bundle, so you'll need to install the package yourself if you haven't already. In RStudio, find the Packages tab, click Install Packages, search for ggplot2 and then install it, along with its dependencies.) You can see the sort of chart ggplot2 creates out of the box in figure \ref{fig:simpleggplot}.
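
If you prefer the console to the RStudio menus, the check below is one way to see whether ggplot2 is available. It is a sketch only: the variable name {\tt haveGgplot2} is made up, and the install command is left as a comment because it downloads from CRAN and so needs a network connection.

<< echo = T >>=
# TRUE if the ggplot2 package is installed, FALSE otherwise
haveGgplot2 <- requireNamespace("ggplot2", quietly = TRUE)
if (!haveGgplot2) {
  # To install the package and its dependencies, run this line at the console:
  # install.packages("ggplot2")
  message("ggplot2 is not installed yet")
}
haveGgplot2
@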

\begin{figure}
\begin{center}
<<label=fig_simpleggplot,fig=TRUE,echo=T>>=
require(ggplot2)
#Plot the Number of deaths by the Population
p=ggplot(cancerdata)+geom_point(aes(x=Population, y=Number))
print(p)
@
\end{center}
\caption{A rather prettier plot}
\label{fig:simpleggplot}
\end{figure}

\newpage
\subsection{Generating the Funnel Plot}

Doing a bit of searching for the ``funnel plot'' chart type used to display the data in Goldacre's article, I came across a post on Cross Validated, the Stack Overflow/Stack Exchange site dedicated to statistics-related Q\&A: \emph{How to draw funnel plot using ggplot2 in R?}\footnote{\url{http://stats.stackexchange.com/questions/5195/how-to-draw-funnel-plot-using-ggplot2-in-r/5210\#5210}}.

The meta-analysis answer seemed to produce a similar chart type, so I had a go at cribbing the code, with confidence limits set at the 95\% and 99.9\% levels. Note that I needed to do a few things:

\begin{enumerate}
\item work out what values to use where! I did this by looking at the ggplot code to see what was plotted. {\tt p} was on the y-axis and should be used to present the death rate. The data provides this as a rate per 100,000, so we need to divide by 100,000 to make it a rate in the range 0..1. The x-axis is the population.
\item change the range and width of samples used to create the curves
\item change the y-axis range.
\end{enumerate}

You can see the result in figure \ref{fig:funnelplot}.
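
Reading the funnel plot code, the curves appear to correspond to the following formulas. The symbols $p_i$, $n_i$, $\mathrm{se}_i$, $\bar{p}$ and $z$ are introduced here for illustration and do not appear in the original answer. Each area $i$ has a death rate $p_i$ (as a proportion) and a population $n_i$, giving a standard error
\[ \mathrm{se}_i = \sqrt{\frac{p_i(1-p_i)}{n_i}} . \]
The centre line is the inverse-variance weighted mean of the rates,
\[ \bar{p} = \frac{\sum_i p_i/\mathrm{se}_i^2}{\sum_i 1/\mathrm{se}_i^2} , \]
and the limits drawn for a population of size $n$ are
\[ \bar{p} \pm z \sqrt{\frac{\bar{p}(1-\bar{p})}{n}} , \]
with $z = 1.96$ for the 95\% limits and $z = 3.29$ for the 99.9\% limits.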

\begin{figure}
\begin{center}
<<label=fig_funnelplot,fig=TRUE,echo=T>>=
#TH: funnel plot code from:
#stats.stackexchange.com/questions/5195/how-to-draw-funnel-plot-using-ggplot2-in-r/5210#5210
#TH: Use our cancerdata
number=cancerdata$Population
#TH: The rate is given as a 'per 100,000' value, so normalise it
p=cancerdata$Rate/100000

p.se <- sqrt((p*(1-p)) / (number))
df <- data.frame(p, number, p.se, Area=cancerdata$Area)

## common effect (fixed effect model)
p.fem <- weighted.mean(p, 1/p.se^2)

## lower and upper limits for 95% and 99.9% CI, based on FEM estimator
#TH: I'm going to alter the spacing of the samples used to generate the curves
number.seq <- seq(1000, max(number), 1000)
number.ll95 <- p.fem - 1.96 * sqrt((p.fem*(1-p.fem)) / (number.seq))
number.ul95 <- p.fem + 1.96 * sqrt((p.fem*(1-p.fem)) / (number.seq))
number.ll999 <- p.fem - 3.29 * sqrt((p.fem*(1-p.fem)) / (number.seq))
number.ul999 <- p.fem + 3.29 * sqrt((p.fem*(1-p.fem)) / (number.seq))
dfCI <- data.frame(number.ll95, number.ul95, number.ll999, number.ul999, number.seq, p.fem)

## draw plot
#TH: note that we need to tweak the limits of the y-axis
fp <- ggplot(aes(x = number, y = p), data = df) +
geom_point(shape = 1) +
geom_line(aes(x = number.seq, y = number.ll95), data = dfCI) +
geom_line(aes(x = number.seq, y = number.ul95), data = dfCI) +
geom_line(aes(x = number.seq, y = number.ll999), linetype = 2, data = dfCI) +
geom_line(aes(x = number.seq, y = number.ul999), linetype = 2, data = dfCI) +
geom_hline(aes(yintercept = p.fem), data = dfCI) +
xlab("Population") + ylab("Bowel cancer death rate") + theme_bw()

#Automatically set the maximum y-axis value to be just a bit larger than the max data value
fp=fp+scale_y_continuous(limits = c(0,1.1*max(p)))

#Label the outlier point
fp=fp+geom_text(aes(x = number, y = p,label=Area),size=3,data=subset(df,p>0.0003))

print(fp)
@
\end{center}
\caption{Funnel plot of bowel cancer death rate against population, with 95\% and 99.9\% limits}
\label{fig:funnelplot}
\end{figure}
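
The multipliers 1.96 and 3.29 in the funnel plot code are the standard normal quantiles for two-sided 95\% and 99.9\% intervals. The sketch below checks that, and then computes the 95\% limits for a few made-up population sizes to show why the funnel narrows; {\tt demoRate}, {\tt demoPop}, {\tt demoSE} and {\tt demoLimits} are illustrative names and values, not taken from the data.

<< echo = T >>=
# Two-sided 95% leaves 2.5% in each tail; 99.9% leaves 0.05% in each tail
round(qnorm(c(0.975, 0.9995)), 2)
# Made-up overall rate (20 deaths per 100,000) and population sizes
demoRate <- 20/100000
demoPop <- c(50000, 200000, 800000)
# The standard error of a proportion shrinks as the population grows
demoSE <- sqrt(demoRate * (1 - demoRate) / demoPop)
# Lower and upper 95% limits for each population size
demoLimits <- data.frame(Population = demoPop,
                         lower95 = demoRate - 1.96 * demoSE,
                         upper95 = demoRate + 1.96 * demoSE)
demoLimits
@

Quadrupling the population halves the width of the band, which is what gives the plot its funnel shape.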

\end{document}