Skip to content

Instantly share code, notes, and snippets.

@mamacneil mamacneil/Lecture_6.ipynb Secret
Created Oct 10, 2018

Embed
What would you like to do?
Lecture 6
{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"collapsed": true
},
"source": [
"# Lecture 6 - Temporal Data"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Time is a surprisingly difficult thing to get right sometimes. The concept that seems straightforward, then Einstein comes along and tell us that [it's all relative](https://www.independent.co.uk/news/science/einsteins-theory-is-proved-and-it-is-bad-news-if-you-own-a-penthouse-2088195.html). While far more linear on a computer, dealing with time does sometimes feel like you're solving equations for quantium gravity. \n",
"\n",
"<img src=\"Einstein.jpg\" alt=\"airplane\" width=\"500\"/>\n",
"\n",
"The problem is that so many people store time differently. French Canada uses 24-hr notation, while us anglos tend to use 12-hour notation. Then there's that annonying American habit of using month-day-year, rather than day-month-year. Time zones are a pain. This is just the start of where problems begin. So let's start with some notation."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Date coding notation in R\n",
"\n",
"There are several options for dates and times, some of which have both. It is generally recommended to stick to the simplest level you need, so stick with dates if you just have dates, and times with times. The date-only package is `as.Date()`.\n",
"\n",
"Because there are so many ways to keep track of time, high-level programming languages have date-time notation assoacited with them to handle and convert dates and times into a standardized format. In `as.Date()`, the notation is:\n",
"\n",
"| Code | Value |\n",
"| ------------- |:-------------:|\n",
"| %d | Day of the month (decimal number) |\n",
"| %m | Month (decimal number) |\n",
"| %b | Month (abbreviated) |\n",
"| %B | Month (full name) |\n",
"| %y | Year (2 digit) |\n",
"| %Y | Year (4 digit) |\n",
"\n",
"For each piece of data one might look at, a date object can be created by passing a string into the `as.Date()` function and specifying what it looks like (I also had to add the timezone on my computer):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"as.Date('15/1/2001', format='%d/%m/%Y', tz='AST')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Having a standardized format makes it much easier to manipulate things that are dates. For example:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"date1 = as.Date('15/1/2001', format='%d/%m/%Y', tz=\"America/Halifax\")\n",
"date2 = as.Date('15/5/2001', format='%d/%m/%Y', tz='AST')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"date1-date2"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"It's a good idea to specify the timezone where the time was collected as this will anchor time into a universal standard that can get ambiguous quickly (try sampling reefs on a trip across the Pacific and keeping track of what timezone you're in, or even what hemisphere, once you come back and look at the data). Timezones are a common pitfall, as names we use locally may not apply universally (everyone wants Eastern Standard Time it seems). The [full list](https://en.wikipedia.org/wiki/List_of_tz_database_time_zones) is long and can also be printed in R with:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"str(OlsonNames())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Dates are weird because there are a mixture of units of uneven size - 12 months, with a range of days in them that vary by year are not easy to follow. Let's say you want to know how many days there are between a series of sampling dates. Calculating this by hand would be a nightmare to have to do unless there was a date object to work it out for you. For example:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"sample_dates = as.Date(c('2010-07-22', '2011-04-20', '2012-10-06', '2013-09-16', '2014-11-01', '2015-12-09', '2016-10-23', '2017-01-01', '2018-02-19'))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Inter-date differences\n",
"diff(sample_dates)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
"# Task 1\n",
"---\n",
"\n",
"Figure out how many days passed between each of the assassinations of JFK, Malcolm X, MLK, and Bobby Kennedy in the 1960's."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# Your answer here (feel free to add cells to complete your answer)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Dates can also be created, not just manipulated, using basic R functions such as `seq()`, which makes a sequence of things:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"my_dates = seq(date1, length=20, by='week')\n",
"my_dates"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"diff(my_dates)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In most date systems, dates are really stored as integers, with some specific day in history being day zero. Excel famously uses 1900 but [wrongly identifying it as a leap year](https://en.wikipedia.org/wiki/Leap_year_bug) so as to maintain the Microsoft obsession with backward compatability. In R, 1 January 1970 is year zero, following the tradition of [Unix](https://www.wired.com/2001/09/unix-tick-tocks-to-a-billion/).\n",
"\n",
"To look at the integer day for a datetime:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"unclass(my_dates)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"diff(unclass(my_dates))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In addition to turning dates into numbers, `as.Date()` can also turn numbers into words. A few convenience functions:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"weekdays(sample_dates)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"months(sample_dates)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"quarters(sample_dates)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"There is also this odd thing:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"julian(sample_dates)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Which returns the days since time zero:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"unclass(sample_dates)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"'Julian' here refers to the Julian calendar declared by Julias Caesar in 46 BC, use of which has continued from early adoption by astronomers due to the coincidence of three astronomical cycles on Monday, January 1, 4713 BC (which preceded any dates in recorded history)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Dates *and* times\n",
"\n",
"If you also have times in your data (datetime data), you can create a `POSIX` object. The name POSIX is an acronym for [Portable Operating System Interface](https://en.wikipedia.org/wiki/POSIX) (not sure where the 'X' came from) which is a set of standards for maintaining compatability of computer systems. POSIX notation adds additional arguments to how things are specified:\n",
"\n",
"| Code | Meaning | Code | Meaning |\n",
"| ------------- |:-------------:|:-------------:|:-------------:|\n",
"| %a | Abbreviated weekday | %A | Full weekday |\n",
"| %b | Abbreviated month | %B | Full month |\n",
"| %c | Locale-specific date and time | %d | Decimal date |\n",
"| %H | Decimal hours (24 hour) | %I | Decimal hours (12 hour) |\n",
"| %j | Decimal day of the year | %m | Decimal month |\n",
"| %M | Decimal minute | %p | Locale-specific AM/PM |\n",
"| %S | Decimal second | %U | Decimal week of the year (starting on Sunday) |\n",
"| %w | Decimal Weekday (0=Sunday) | %W | Decimal week of the year (starting on Monday) |\n",
"| %x | Locale-specific Date | %X | Locale-specific Time |\n",
"| %y | 2-digit year | %Y | 4-digit year |\n",
"| %z | Offset from GMT | %Z | Time zone (character) |\n",
"\n",
"Reflecting the various time components and conventions that people use gobally.\n",
"\n",
"In R, the POSIX conversions for datetime objects is handled by two functions:\n",
"\n",
"1. `as.POSIXct()` catenate time - which creates an atomic object of the number of seconds since time zero\n",
"2. `as.POSIXlt()` list time - which is a list of time attributes\n",
"\n",
"We can have a look at thse things by creating two, seemingly identical objects:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"time1 = as.POSIXct(\"2013-07-24 23:55:26\")\n",
"time2 = as.POSIXlt(\"2013-07-24 23:55:26\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"time1"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"time2"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"unclass(time1)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"unclass(time2)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Because lists have more overhead computationally, unless you need the mixed categories, the best course is to stick with using `as.POSIXct`, where all the conversions are handled behind the scenes."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"`POSIXct` objects work a little more intutively than the `as.Date` objects above:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"time1-time2"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The fact that these work with seconds mean you can add to them coherently, provided you convert to seconds first:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"time1"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"time1+24*60*60"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
"# Task 2\n",
"---\n",
"\n",
"Creat two `POSIXct` objects to calculate the number of **decimal hours** between the launch and landing of Apollo 11. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# Your answer here (feel free to add cells to complete your answer)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Importantly, `POSIXct` objects will keep track of daylight savings time, which is applied willy-nilly among provinces, states, and countries."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"as.POSIXct(\"2013-03-10 08:32:07\") - as.POSIXct(\"2013-03-09 23:55:26\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Note this will use the local timezone on the computer unless otherwise specfied, and that POSIX-style timezones are a HUGE pain to use, so be forewarned it might take some time to track down the correct POSIX timezone. Best practice is to just use GMT (or equivalently UMT) when you're recording scientific data that needs a time component. Greenwich Mean Time doesn't change with the seasons and is the time at the [Shepherd Gate Clock](https://www.rmg.co.uk/see-do/we-recommend/attractions/shepherd-gate-clock), at the Royal Observatory in Greenwich, UK. This is worth visiting if you're in London, as it is anchor location of the prime meridian, against which all time and longitude is anchored globally. It also houses John Harrison's chronometers, which are the basis of how longitude was established. "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## `strptime()`\n",
"\n",
"Finally there is also the `strptime` function, which is an internal workhorse function to take a string and convert it into a time data type:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# Create data frame\n",
"events = data.frame(\n",
" time=c(\"2014-01-23 14:28:21\",\"2014-01-23 14:28:55\",\"2014-01-23 14:29:02\",\"2014-01-23 14:31:18\"),\n",
" speed=c(2.0,2.2,3.4,5.5))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"events"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"summary(events)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# Change time to an actual time object\n",
"events$time = strptime(events$time,\"%Y-%m-%d %H:%M:%S\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"summary(events)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The problem with strptime is that it makes some assumptions that might mess things up for you if they go undetected. For example:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Create some times\n",
"early = strptime(\"2000-01-01 00:00:00\",\"%Y-%m-%d %H:%M:%S\")\n",
"late1 = strptime(\"2000-01-01 00:00:20\",\"%Y-%m-%d %H:%M:%S\")\n",
"\n",
"early-late1"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Create some other times\n",
"early = strptime(\"2000-01-01 00:00:00\",\"%Y-%m-%d %H:%M:%S\")\n",
"late2 = strptime(\"2000-01-01 1:00:00\",\"%Y-%m-%d %H:%M:%S\")\n",
"\n",
"early-late2"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This might mess you up if you're scripting to extract times for example:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"as.numeric(early-late1)\n",
"as.numeric(early-late2)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## System internals\n",
"\n",
"\n",
"R will also provide information about what it is borrowing from your operating system to decide what time it is:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"Sys.time()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Notice what time zone is specified, and how this might (or might not) differ from what your current local time is. That is given by"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"date()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Temporal dataframes\n",
"\n",
"Most often, we won't create dates and times by hand. Instead we will import them from a flat file. [Here](http://www.urban-climate.net/content/data/9-data) we can download daily wind and rainfall data for London. "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
"# Task 3\n",
"---\n",
"\n",
"Download weather data for Southwark, London for 2016 and, using the Date.and.Time column timestamp, calculate the average length of time (in hours) between gale force (i.e. >34 kts) maximum gust records."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# Your answer here (feel free to add cells to complete your answer)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Plotting time"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Once a datetime object has been created, it becomes easier to plot as well."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"datex = data.frame('time'=as.POSIXct(c(\"2009-03-07 12:00\", \"2009-03-08 12:00\", \"2009-03-28 12:00\", \"2009-03-29 12:00\", \"2009-10-24 12:00\", \"2009-10-25 12:00\", \"2009-10-31 12:00\", \"2009-11-01 12:00\")), 'values' = c(2, 4, 2, 6, 3, 6, 1, 1))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"datex"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"scrolled": false
},
"outputs": [],
"source": [
"plot(datex$time, datex$values, type='l', ylim=c(0,10), xlab='Date', ylab='Value')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
"# Task 4\n",
"---\n",
"\n",
"Using the Southwark 2016 data, plot wind gusts through time. Use `abline()` to plot a line at >34 kts."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# Your answer here (feel free to add cells to complete your answer)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Time filtering"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Frequently we like to select key dates in a dataset, and filtering by date or by time is required. Fortunately we can apply the same kinds of boolean operators as we saw in Lecture 2:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"date_seq = seq(c(ISOdate(2000,3,20)), by = \"DSTday\", length.out = 31)\n",
"date_seq\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"date_seq[date_seq>\"2000-03-24\" & date_seq<\"2000-04-04\"]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
"# Task 5\n",
"---\n",
"\n",
"Again, using the Southwark 2016 data, plot the **maximum daily** wind gust speed for the month of January."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# Your answer here (feel free to add cells to complete your answer)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# What have you learned and what's next?\n",
"\n",
"The point of today's lab was to learn about dates and times, and how to do them right\n",
"\n",
"**You should at this point be comfortable:**\n",
" 1. Understanding that local customs make time hard\n",
" 2. Knowing how to create a date or time object\n",
" 3. Performing operations on date or time objects\n",
" 4. Filtering time\n",
"\n",
"Next week we will work with words - DNA is a string!\n",
"\n",
"\n",
"---\n",
"# ** A bientôt ** !"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "R",
"language": "R",
"name": "ir"
},
"language_info": {
"codemirror_mode": "r",
"file_extension": ".r",
"mimetype": "text/x-r-source",
"name": "R",
"pygments_lexer": "r",
"version": "3.4.3"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.