{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Lecture 10 - Plotting\n",
    "\n",
    "In parallel with statistical (or machine learning) skills, the ability to produce clear, informative graphics is among the most important skills a biologist can develop. While conceptually easier to grasp than modelling, good scientific graphics require as much (or more) time to produce than their underlying analyses. Similar to modelling itself, approaches to graphics vary from out of the box, **high-level** approaches, such as [ggplot](http://ggplot.yhathq.com), that make intelligent assumptions about how to make things look good, and **low-level** approaches, such as the *base* plotting package in R, that make very few assumptions while allowing for maximum flexibility. While high-level packages are a great innovation that help save time and can produce production-quality graphics, low-level skills are ultimately more powerful, putting no limits on what you can produce. \n",
    "\n",
    "An example from 2015:\n",
    "\n",
    "<img src=\"Figure_2.png\" alt=\"Drawing\" style=\"width: 800px;\"/>\n",
    "\n",
    "This is a somewhat complex plot to produce in anything but a base-level plotting package - each point, line, shade, and colour has been custom edited, making automation meaningless."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "So if low-level graphics are the way to produce final, custom graphics, why learn anything else? Because final graphics are only one component of the analytical process useful to biologists; what comes first is data exploration, looking and thinking about complex data to see what the key patterns are and to consider things you may not have thought of before you designed the study. This figure (stolen from [Sean Anderson's webiste](http://seananderson.ca)) illustrates these tradeoffs nicely:\n",
    "\n",
    "<img src=\"gg-vs-base.png\" alt=\"Drawing\" style=\"width: 800px;\"/>\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Here you can see that R base graphics is initially more time consuming than high level graphics in something like ggplot, and that base graphics generally don't scale well (in terms of your time) as data becomes more complex. As Anderson articulately states:\n",
    "\n",
    "> Good graphical displays of data require rapid iteration and lots of exploration. If it takes you hours to code a plot in base graphics, you're unlikely to throw it out and explore other ways of visualizing the data, and you're unlikely to explore all the dimensions of the data.\n",
    "\n",
    "To get started, we will do some low-level R base package plotting, then will get into ggplot to see how to use both systems.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Base R plotting\n",
    "\n",
    "The 'base' package in R - that this, the functions that come pre-loaded with every installation - inludes all the low-level plotting functions that underpin how R produces graphics. High-level graphics packages manipulate these functions (under the hood) to produce graphics that guess at what a 'good' should look like, with a minimum number of commands. But these underlying commands are important if you want to re-produce a good graphic. \n",
    "\n",
    "**NB**: some R users advocate for creating your graphics in R and then manipulating them in something like Adobe Photoshop or Illustrator before submission. In some cases this is necessary (essential even) but in my experience of revision after revision, it is far better to **do as much as possible in your plot scripts**, because every time you revise an image you'll have to re-open your graphics program and do all the tweaks over again. So basic plotting..."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "To be on the same page, let's open up the baseball batting data:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# Import batting data\n",
    "mlbdata = na.omit(read.csv(\"mlb2017_batting.txt\"))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "The most basic plotting function is, unsurprisingly, `plot()`. So let's plot something and see what happens:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "head(mlbdata)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# Make plots a resonable size\n",
    "options(repr.plot.width=5, repr.plot.height=5)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Plot batting averages\n",
    "plot(mlbdata$BA)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "What's going on? We passed the plot a single column of batting averages but there is an index value on the x-axis too. This is because `plot()` is for **biplots** x vs y plots of two variables in 2 dimensional-space. So let's put something more interesting on the x-axis:"
   ]
  },
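  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Before that, a minimal sketch of the index behaviour, using a small made-up vector (`ba_demo` is hypothetical and not part of the MLB data): calling `plot()` on one vector should give the same picture as plotting it against `seq_along()` of itself."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Hypothetical batting averages, for illustration only\n",
    "ba_demo <- c(0.251, 0.310, 0.198, 0.275, 0.000)\n",
    "\n",
    "# With one vector, plot() puts the position of each value on the x-axis\n",
    "plot(ba_demo)\n",
    "\n",
    "# The same picture, with the index supplied explicitly\n",
    "index <- seq_along(ba_demo)\n",
    "plot(index, ba_demo)"
   ]
  },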
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Plot batting averages\n",
    "plot(mlbdata$BA,mlbdata$OBP)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "Now we have batting average on the x-axis vs on-base percentage on the y-axis. Those labels look awful though, so we need to add new ones:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Plot batting averages\n",
    "plot(mlbdata$BA,mlbdata$OBP, xlab=\"Batting average\", ylab=\"On-base percentage\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "Looking better, but how about those zeros? Pitchers? Let's add some colour to see:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Plot batting averages\n",
    "plot(mlbdata$BA,mlbdata$OBP, xlab=\"Batting average\", ylab=\"On-base percentage\", col=grepl(\"1\",mlbdata$Pos.Summary)*1+1)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "That looks ok, but the colours are ugly:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Plot batting averages\n",
    "plot(mlbdata$BA,mlbdata$OBP, xlab=\"Batting average\", ylab=\"On-base percentage\", col=c(\"darkgrey\",\"dodgerblue\")[grepl(\"1\",mlbdata$Pos.Summary)*1+1])"
   ]
  },
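  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A note on the `col=` argument used above: it relies on turning a TRUE/FALSE test into the index values 1 and 2, which then pick colours out of a two-element vector. Here is a minimal sketch of that trick on a made-up vector of position summaries (`pos_demo`, `idx`, `palette_demo` and `cols` are hypothetical names, and the strings are invented rather than taken from the data file). Note that `grepl(\"1\", ...)` matches any string containing the character 1, so it is only as reliable as the coding of the column."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Hypothetical position summaries, for illustration only\n",
    "pos_demo <- c(\"1\", \"*3/D\", \"/1\", \"*64\", \"2/3\")\n",
    "\n",
    "# grepl() returns TRUE/FALSE; multiplying by 1 gives 0/1, adding 1 gives 1/2\n",
    "idx <- grepl(\"1\", pos_demo) * 1 + 1\n",
    "\n",
    "# Use the 1/2 values to pick from a two-colour palette\n",
    "palette_demo <- c(\"darkgrey\", \"dodgerblue\")\n",
    "cols <- palette_demo[idx]\n",
    "print(cols)"
   ]
  },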
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "So mostly pitchers (in blue) who didn't get a hit but ended up on base with a walk (or were hit by the pitcher); what about those along the 1:1 line?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Plot batting averages\n",
    "plot(mlbdata$BA,mlbdata$OBP, xlab=\"Batting average\", ylab=\"On-base percentage\", col=c(\"darkgrey\",\"dodgerblue\")[grepl(\"1\",mlbdata$Pos.Summary)*1+1])\n",
    "abline(0,1)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "But it looks like there is at least one player with a higher batting average than on-base percentage - how is this possible? And who is it?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "---\n",
    "# Task 1\n",
    "---\n",
    "\n",
    "Use the `text()` function to plot the name of the player(s) who have a higher batting average than on-base percentage next to their point on the plot above."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# Your answer here (feel free to add cells to complete your answer)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "In truth there are DOZENS of paramters that we can manipulate with base plotting in R - literally everything is up for grabs - so much so that listing them here is kind of redundant (a really good set of comprehensive examples from Murray Logan is available here: http://users.monash.edu.au/~murray/AIMS-R-users/ws/ws11.html). So a few principles:"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Figure dimensions\n",
    "\n",
    "Not the most exciting topic you might say but having control of the output size of your figures is a huge deal if you're trying to get published in a [tabloid journal](https://www.sciencemag.org), where  dimensions are **VERY SPECIFIC**:http://www.sciencemag.org/authors/instructions-preparing-initial-manuscript (BTW they even have specific $\\LaTeX$ [instructions](http://www.sciencemag.org/authors/preparing-manuscripts-using-latex)). In any case, it matters."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "\n",
    "\n",
    "<img src=\"figureanatomy1.png\" alt=\"Drawing\" style=\"width: 600px;\"/>\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "\n",
    "<img src=\"figureanatomy2.png\" alt=\"Drawing\" style=\"width: 600px;\"/>\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "<table class=\"paramsTablea\">\n",
    "\t<tr>\n",
    "\t  <th>Parameter</th><th>Value</th><th>Description</th>\n",
    "\t</tr>\n",
    "\t<tr>\n",
    "\t  <td class=\"Rc\">din,fin,pin</td><td class=\"Rc\">=c(width,height)</td><td>Dimensions (width and height) of the device, figure and plotting regions (in inches)</td>\n",
    "\t</tr>\n",
    "\t<tr>\n",
    "\t  <td class=\"Rc\">fig</td><td class=\"Rc\">=c(left,right,bottom,top)</td><td>Coordinates of the figure region within the device.  Coordinates expressed as a fraction of the device region.</td>\n",
    "\t</tr>\n",
    "\t<tr>\n",
    "\t  <td class=\"Rc\">mai,mar</td><td class=\"Rc\">=c(bottom,left,top,right)</td><td>Size of each of the four figure margins in inches and lines of text (relative to current font size).</td>\n",
    "\t</tr>\n",
    "\t<tr>\n",
    "\t  <td class=\"Rc\">mfg</td><td class=\"Rc\">=c(row,column)</td><td>Position of the currently active figure within a grid of figures defined by either mfcol or mfrow.</td>\n",
    "\t</tr>\n",
    "\t<tr>\n",
    "\t  <td class=\"Rc\">mfcol,mfrow</td><td class=\"Rc\">=c(rows,columns)</td><td>Number of rows and columns in a multi-figure grid.</td>\n",
    "\t</tr>\n",
    "\t<tr>\n",
    "\t  <td class=\"Rc\">new</td><td class=\"Rc\">=TRUE or =FALSE</td><td>Indicates whether to treat the current figure region as a new frame (and thus begin a new plot over the top of the previous plot (TRUE) or to allow a new high level plotting function to clear the figure region first (FALSE).</td>\n",
    "\t</tr>\n",
    "\t<tr>\n",
    "\t  <td class=\"Rc\">oma,omd,omi</td><td class=\"Rc\">=c(bottom,left,top,right)</td><td>Size of each of the four outer margins in lines of text (relative to current font size), inches and as a fraction of the device region dimensions</td>\n",
    "\t</tr>\n",
    "\t<tr>\n",
    "\t  <td class=\"Rc\">plt</td><td class=\"Rc\">=c(left,right,bottom,top)</td><td>Coordinates of the plotting region expressed as a fraction of the device region.</td>\n",
    "\t</tr>\n",
    "\t<tr>\n",
    "\t  <td class=\"Rc\">pty</td><td class=\"Rc\">=\"s\" or \"m\"</td><td>Type of plotting region within the figure region.  Is the plotting region a square (=\"s\") or is it maximized (=\"m\") to fit within the shape of the figure region.</td>\n",
    "\t</tr>\n",
    "\t<tr>\n",
    "\t  <td class=\"Rc\">usr</td><td class=\"Rc\">=c(left,right,bottom,top)</td><td>Coordinates of the plotting region corresponding to the axes limits of the plot.</td>\n",
    "\t</tr>\n",
    "  </table>"
   ]
  },
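  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To make a few of the parameters in the table concrete, here is a minimal sketch using simulated data (`x`, `y`, `old_par` and `sizes` are hypothetical names, and the margin values are arbitrary). It sets up a one-by-two grid with `mfrow`, sets figure and outer margins in lines of text with `mar` and `oma`, and then queries the device, figure and plot sizes in inches with `din`, `fin` and `pin`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Simulated data, for illustration only\n",
    "set.seed(1)\n",
    "x <- rnorm(50)\n",
    "y <- x + rnorm(50)\n",
    "\n",
    "# Save the current settings so they can be restored afterwards\n",
    "old_par <- par(no.readonly = TRUE)\n",
    "\n",
    "# 1 row x 2 columns; margins in lines of text (bottom, left, top, right)\n",
    "par(mfrow = c(1, 2), mar = c(4, 4, 1, 1), oma = c(0, 0, 2, 0))\n",
    "plot(x, y)\n",
    "hist(x, main = \"\")\n",
    "mtext(\"Two panels sharing one outer margin\", outer = TRUE)\n",
    "\n",
    "# Query sizes in inches: device, current figure and current plot region\n",
    "sizes <- list(din = par(\"din\"), fin = par(\"fin\"), pin = par(\"pin\"))\n",
    "print(sizes)\n",
    "\n",
    "# Restore the settings saved above\n",
    "par(old_par)"
   ]
  },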
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "From this table and the figures above you can see which dimensions affect which attributes of figure output, with particular emphasis on the fact that there are options for output in inches and output in relative dimensions. If you're exporting a file for publication, USE INCHES. The reason is illustrated here: "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "<img src=\"Picture1.png\" alt=\"Drawing\" style=\"width: 600px;\"/>\n",
    "\n",
    "<img src=\"Picture2.png\" alt=\"Drawing\" style=\"width: 600px;\"/>\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "These two figures have the same data, with the same point sizes but the one on top is relative while the one below is in inches. Deep stuff, but a major deal if you're publishing a paper with one common legend. \n"
   ]
  },
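  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "One way to work in inches is to size the graphics device itself in inches and set the margins with `mai` rather than `mar`. The following is a minimal sketch, assuming PNG output and simulated data; the file name `figure_demo.png`, the 4 x 3 inch size and the resolution of 150 pixels per inch are placeholder choices, not any journal's requirements."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Simulated data, for illustration only\n",
    "set.seed(1)\n",
    "x <- rnorm(50)\n",
    "y <- x + rnorm(50)\n",
    "\n",
    "# Hypothetical output file in the session's temporary directory\n",
    "out_file <- file.path(tempdir(), \"figure_demo.png\")\n",
    "\n",
    "# Size the device in inches; res is pixels per inch (placeholder values)\n",
    "png(out_file, width = 4, height = 3, units = \"in\", res = 150)\n",
    "par(mai = c(0.8, 0.8, 0.2, 0.2))  # margins in inches: bottom, left, top, right\n",
    "plot(x, y, xlab = \"x\", ylab = \"y\")\n",
    "dev.off()\n",
    "\n",
    "# Check that the file was written\n",
    "file.exists(out_file)"
   ]
  },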
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "---\n",
    "# Task 2\n",
    "---\n",
    "\n",
    "Plot the baseball hitting plot from Task 1 in the same aspect ratio as the typical movie theatre, at the recommended figure resolution for **final** publication of a jpeg in *Science*."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# Your answer here (feel free to add cells to complete your answer)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "## High-level plotting in the base package\n",
    "\n",
    "Much of what we've looked at so far is low-level in the base package, but the base package is not without it's own virtues. Specifically there are a set of high(er) level plotting functions that are workhorses for using graphics in R."
   ]
  },
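  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A minimal sketch of three of these workhorses, `hist()`, `boxplot()` and `pairs()`, on simulated data (the data frame `demo` and its columns are invented for illustration, so swap in columns of `mlbdata` as needed):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Simulated data, for illustration only\n",
    "set.seed(1)\n",
    "demo <- data.frame(\n",
    "  group = rep(c(\"a\", \"b\"), each = 50),\n",
    "  x = rnorm(100),\n",
    "  z = runif(100)\n",
    ")\n",
    "demo$y <- demo$x + rnorm(100, sd = 0.5)\n",
    "\n",
    "# Histogram of one variable\n",
    "hist(demo$x, main = \"\", xlab = \"x\")\n",
    "\n",
    "# Boxplot of y by group, using the formula interface\n",
    "boxplot(y ~ group, data = demo)\n",
    "\n",
    "# Scatterplot matrix of the numeric columns\n",
    "pairs(demo[, c(\"x\", \"y\", \"z\")])"
   ]
  },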
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# hist"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# boxplot"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# scatterplot"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# Scatterplot matricies"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# Gridded"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# Contour"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "# Task 3\n",
    "---\n",
    "\n",
    "Create a contour plot of batting average vs on-base percentage for the MLB 2017 batting data, where the height of the contours is slugging average."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# Your answer here (feel free to add cells to complete your answer)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# ggplotting"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Well, here we are - at the alter of H Wickham himself (praises be). The thing that Hadley Wickham is best known for is the ggplot package for R - a very high level graphics package that implemented the theory of entire book called [*The Grammar of  Graphics*](https://www.springer.com/us/book/9781475731002), which presents\n",
    "\n",
    "> a unique foundation for producing almost every quantitative graphic found in scientific journals, newspapers, statistical packages, and data visualization systems\n",
    "\n",
    "The history of how this came to be is outlined by the author Leland Wilkinson:\n",
    "\n",
    "> Before writing the graphics for SYSTAT in the 1980's, I began by teaching a seminar in statistical graphics and collecting as many different quantitative graphics as I could find. I was determined to produce a package that could draw every statistical graphic I had ever seen. The structure of the program was a collection of procedures named after the basic graph types they produced. The graphics code was roughly one and a half megabytes in size. In the early 1990's, I redesigned the SYSTAT graphics package using object-based technology. I intended to produce a more comprehensive and dynamic package. I accomplished this by embedding graphical elements in a tree structure. Rendering graphics was done by walking the tree and editing worked by adding and deleting nodes. The code size fell to under a megabyte. In the late 1990's, I collaborated with Dan Rope at the Bureau of Labor Statistics and Dan Carr at George Mason University to produce a graphics production library called GPL, this time in Java. Our goal was to develop graphics components. This book was nourished by that project. So far, the GPL code size is under half a megabyte.\n",
    "\n",
    "This is unbeliveable. To have derived a comprehensive theory for all scientific graphics is an astonishing feat. It also lends a particular flavour to things, which some find more or less to their taste (less to mine) but it notheless is one of the great books of 20th Century scientific computing. So what does this look like in `ggplot`?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "### Core ggplot\n",
    "\n",
    "(It's a good idea to open up the `ggplot` [documentation](https://ggplot2.tidyverse.org/reference/) and keep it in another browser tab.)\n",
    "\n",
    "`ggplot` has a core syntax of a minimal three elements:\n",
    "\n",
    "1. `ggplot()` - initalize the plot with some data\n",
    "2. `aes()` - specify figure aesthetics\n",
    "3. `(<gg>)` - specify kind of plot and other aspects\n",
    "\n",
    "The first two arguments set up the plot and the approximate theme of how it will look. The third argument could be many elements long, but it specifies the type of plot as well. As outlined by Wilkinson, grammar graphics are made \"*by walking the tree and editing worked by adding and deleting nodes*\", leading to the distinctive `ggplot` approach of piping elements using a `+` sign:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "library(ggplot2)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "pitcher = grepl(\"1\",mlbdata$Pos.Summary)*1+1\n",
    "ggplot(data=mlbdata)+aes(BA,OBP,colour=pitcher)+geom_point()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "This additive quality is similar to piping, where you start with a basic element and add complexity one element at a time. As a result **later entries trump earlier ones.** You can see by comparison with our default base plot for the same data that `ggplot` makes some decent assumptions about what looks nice. But maybe we want to change that:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "pitcher = c(\"red\",\"blue\")[grepl(\"1\",mlbdata$Pos.Summary)*1+1]\n",
    "ggplot(data=mlbdata)+aes(BA,OBP)+geom_point(colour=pitcher)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "Lovely! \n",
    "\n",
    "**Note** nesting the `aes()` argument within one of the other ggplot functions will map those aesthetic arguments  to the data, assigning the colour or shape automatically. "
   ]
  },
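  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The difference between mapping and setting is easiest to see side by side. Here is a minimal sketch on a small invented data frame (`demo`, its columns, and the names `p_mapped` and `p_set` are hypothetical): the first plot puts `colour` inside `aes()`, the second passes a fixed colour to the geom."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "library(ggplot2)\n",
    "\n",
    "# Hypothetical data, for illustration only\n",
    "demo <- data.frame(\n",
    "  ba = c(0.000, 0.250, 0.300, 0.100),\n",
    "  obp = c(0.100, 0.320, 0.380, 0.150),\n",
    "  role = c(\"pitcher\", \"batter\", \"batter\", \"pitcher\")\n",
    ")\n",
    "\n",
    "# Mapping: colour inside aes() is tied to a column, so ggplot2 chooses\n",
    "# the colours and draws a legend\n",
    "p_mapped <- ggplot(demo, aes(ba, obp, colour = role)) + geom_point()\n",
    "\n",
    "# Setting: colour outside aes() is used literally, with no legend\n",
    "p_set <- ggplot(demo, aes(ba, obp)) + geom_point(colour = \"dodgerblue\")\n",
    "\n",
    "print(p_mapped)\n",
    "print(p_set)"
   ]
  },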
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Geoms\n",
    "\n",
    "There's a bit of lingo to `ggplot` that can take some getting used to. Saying 'aesthetics' a lot for example. Geoms is another - `geoms` in ggplot refers to geometric objects, or what we typically think of in a plot. There are rather a lot of them:\n",
    "\n",
    "`geom_abline() geom_hline() geom_vline()` - Reference lines: horizontal, vertical, and diagonal\n",
    "\n",
    "`geom_bar() geom_col() stat_count()` - Bar charts\n",
    "\n",
    "`geom_bin2d() stat_bin_2d()` - Heatmap of 2d bin counts\n",
    "\n",
    "`geom_blank()` - Draw nothing\n",
    "\n",
    "`geom_boxplot() stat_boxplot()` - A box and whiskers plot (in the style of Tukey)\n",
    "\n",
    "`geom_contour() stat_contour()` - 2d contours of a 3d surface\n",
    "\n",
    "`geom_count() stat_sum()` - Count overlapping points\n",
    "\n",
    "`geom_density() stat_density()` - Smoothed density estimates\n",
    "\n",
    "`geom_density_2d() stat_density_2d()` - Contours of a 2d density estimate\n",
    "\n",
    "`geom_dotplot()` - Dot plot \n",
    "\n",
    "`geom_errorbarh()` - Horizontal error bars\n",
    "\n",
    "`geom_hex() stat_bin_hex()` - Hexagonal heatmap of 2d bin counts\n",
    "\n",
    "`geom_freqpoly() geom_histogram() stat_bin()` - Histograms and frequency polygons\n",
    "\n",
    "`geom_jitter()` - Jittered points\n",
    "\n",
    "`geom_crossbar() geom_errorbar() geom_linerange() geom_pointrange()` - Vertical intervals: lines, crossbars & errorbars\n",
    "\n",
    "`geom_map()` -Polygons from a reference map\n",
    "\n",
    "`geom_path() geom_line() geom_step()` - Connect observations\n",
    "\n",
    "`geom_point()` - Points\n",
    "\n",
    "`geom_polygon()` -  Polygons\n",
    "\n",
    "`geom_qq_line() stat_qq_line() geom_qq() stat_qq()` - A quantile-quantile plot\n",
    "\n",
    "`geom_quantile() stat_quantile()` - Quantile regression\n",
    "\n",
    "`geom_ribbon() geom_area()` - Ribbons and area plots\n",
    "\t\n",
    "`geom_rug()` - Rug plots in the margins\n",
    "\t\n",
    "`geom_segment() geom_curve()` - Line segments and curves\n",
    "\n",
    "`geom_smooth() stat_smooth()` - Smoothed conditional means\n",
    "\t\n",
    "`geom_spoke()` - Line segments parameterised by location, direction and distance\n",
    "\n",
    "`geom_label() geom_text()` - Text\n",
    "\t\n",
    "`geom_raster() geom_rect() geom_tile()`- Rectangles\n",
    "\t\n",
    "`geom_violin() stat_ydensity()` - Violin plot\n",
    "\t\n",
    "`stat_sf() geom_sf() geom_sf_label() geom_sf_text() coord_sf()` - Visualise sf objects"
   ]
  },
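  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Geoms are meant to be stacked. As a minimal sketch on simulated data (the data frame `demo` and the plot object `p` are hypothetical, and the choice of geoms is arbitrary), the following layers points, a fitted straight line, a 1:1 reference line and marginal rugs on the same axes, each added with `+`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "library(ggplot2)\n",
    "\n",
    "# Simulated data, for illustration only\n",
    "set.seed(1)\n",
    "demo <- data.frame(x = runif(60))\n",
    "demo$y <- demo$x + rnorm(60, sd = 0.1)\n",
    "\n",
    "p <- ggplot(demo, aes(x, y)) +\n",
    "  geom_point(colour = \"grey40\") +                        # raw observations\n",
    "  geom_smooth(method = \"lm\", se = FALSE) +               # fitted straight line\n",
    "  geom_abline(intercept = 0, slope = 1, linetype = 2) +  # 1:1 reference line\n",
    "  geom_rug(alpha = 0.3)                                  # marginal tick marks\n",
    "\n",
    "print(p)"
   ]
  },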
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The grammar of graphics approach is poweful and useful, particularly for getting to somwhere decent quickly. There are numerous places to see example ggplots (and steal code):\n",
    "\n",
    "1. [Top 50 ggplot2 Visualizations](http://r-statistics.co/Top50-Ggplot2-Visualizations-MasterList-R-Code.html)\n",
    "2. [R4stats.com examples](https://r4stats.com/examples/graphics-ggplot2/)\n",
    "3. [STHA](http://www.sthda.com/english/wiki/be-awesome-in-ggplot2-a-practical-guide-to-be-highly-effective-r-software-and-data-visualization)\n",
    "\n",
    "These are just a few of the galleries that you can scroll through to get figure ideas. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "# Task 4\n",
    "---\n",
    "\n",
    "Find a creative way to ggplot MLB 2017 games played vs batting average, highlighting players who played more than 100 games in 2017. Use any geom and aesthetic you like."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Your answer here (feel free to add cells to complete your answer)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Themes\n",
    "\n",
    "A key advantage of ggplot is the ability to carry aesthetic styles throughout every plot in a series, in order to keep things like colours and line thicknesses consistent. This is very useful if you're producing a thesis, paper, or a book. Every aspect of a plot can be set and tweaked, which can be tedious, but getting things looking the way you want is important. \n",
    "\n",
    "`ggplot2` comes with some pre-loaded themes:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Classic theme\n",
    "ggplot(data=mlbdata,aes(BA,OBP))+geom_point(colour=pitcher)+theme_classic()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Light theme\n",
    "ggplot(data=mlbdata,aes(BA,OBP))+geom_point(colour=pitcher)+theme_light()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Dark theme\n",
    "ggplot(data=mlbdata,aes(BA,OBP))+geom_point(colour=pitcher)+theme_dark()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Minimal theme\n",
    "ggplot(data=mlbdata,aes(BA,OBP))+geom_point(colour=pitcher)+theme_minimal()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Looking through, you can see there have been various decisions made about lines and background colours etc., all of which you can set yourself:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# Alter the minimal theme\n",
    "theme_example = function (base_size = 11, base_family = \"Helvetica\") {\n",
    "  theme_minimal(base_size = base_size, base_family = base_family) %+replace%\n",
    "    theme(axis.text = element_text(colour = \"grey50\"),\n",
    "          axis.title.x = element_text(colour = \"dodgerblue\"),\n",
    "          axis.title.y = element_text(colour = \"grey50\", angle = 90),\n",
    "  )\n",
    "}"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Example theme\n",
    "ggplot(data=mlbdata,aes(BA,OBP))+geom_point(colour=pitcher)+theme_example()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "All the arguments to a theme are available here: https://ggplot2.tidyverse.org/reference/theme.html\n"
   ]
  },
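  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To carry one style through a whole series of plots without adding the theme to each of them, `ggplot2` provides `theme_set()`, which changes the session-wide default. Here is a minimal sketch on an invented data frame (`demo`, `old_theme`, `p1` and `p2` are hypothetical names, and `theme_classic()` stands in for whichever theme you prefer):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "library(ggplot2)\n",
    "\n",
    "# Hypothetical data, for illustration only\n",
    "demo <- data.frame(x = 1:10, y = (1:10)^2)\n",
    "\n",
    "# theme_set() returns the previously active theme, so it can be restored later\n",
    "old_theme <- theme_set(theme_classic(base_size = 12))\n",
    "\n",
    "# Both plots pick up the session-wide theme without naming it\n",
    "p1 <- ggplot(demo, aes(x, y)) + geom_point()\n",
    "p2 <- ggplot(demo, aes(x, y)) + geom_line()\n",
    "print(p1)\n",
    "print(p2)\n",
    "\n",
    "# Restore whatever theme was active before\n",
    "theme_set(old_theme)"
   ]
  },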
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "# Task 5\n",
    "---\n",
    "\n",
    "Modify a theme to suit your tastes and plot anything from the MLB 2017 batting data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# Your answer here (feel free to add cells to complete your answer)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# What have you learned and what's next?\n",
    "\n",
    "The point of today's lab was to understand some basic plotting information\n",
    "\n",
    "**You should at this point be comfortable:**\n",
    " 1. Knowing the difference between high-level and low-level plotting\n",
    " 2. Be able to alter plot attributes\n",
    " 3. Bulid a ggplot\n",
    " 4. Create your own ggplot theme\n",
    "\n",
    "Next week we will get into the good stuff and discuss **Tufte!**\n",
    "\n",
    "\n",
    "---\n",
    "# ** A bientôt ** !"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "R",
   "language": "R",
   "name": "ir"
  },
  "language_info": {
   "codemirror_mode": "r",
   "file_extension": ".r",
   "mimetype": "text/x-r-source",
   "name": "R",
   "pygments_lexer": "r",
   "version": "3.4.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}