Timezone Presentation given at HON

Introduction

Like all those possessing a library, Aurelian was aware that he was guilty of not knowing his in its entirety.

-- Jorge Luis Borges

Setup

knitr::opts_chunk$set(echo = TRUE, fig.width=12, fig.height=9, warning=FALSE)
options(width=116)

suppressPackageStartupMessages(library(dplyr))
suppressPackageStartupMessages(library(lubridate))

Datetime formats

Formatting dates

Dates and datetimes can come to you in a variety of formats and schemes. Below is a brief, incomplete list of examples. We'll address the funky syntax in a later section, on the assumption that you are not familiar with it.

Let's look at one datetime, July 4, 2016, 9pm, rendered in different formats:

  ## First, make a single datetime object.
  ## Again, we'll address the funky second argument later -- for now, it is a convenience I am using
dt4th <- strptime("7/4/2016 09:00:00 PM", "%m/%d/%Y %r")

  ## Write it back out in the format it originally came in
strftime(dt4th, "%m/%d/%Y %r")

## [1] "07/04/2016 09:00:00 PM"

  ## US date format, military time
strftime(dt4th, "%m/%d/%Y %T")

## [1] "07/04/2016 21:00:00"

  ## European date convention
strftime(dt4th, "%d/%m/%Y %T")

## [1] "04/07/2016 21:00:00"

  ## ISO localtime
strftime(dt4th, "%FT%T")

## [1] "2016-07-04T21:00:00"

  ## With words, with day of week
strftime(dt4th, "%A, %B %d, %Y")

## [1] "Monday, July 04, 2016"

  ## With words, European style, 0-padded
strftime(dt4th, "%d %B, %Y, %I %p")

## [1] "04 July, 2016, 09 PM"

  ## An epoch time, which is the number of seconds since midnight UTC, Jan 1, 1970
strftime(dt4th, "%s")

## [1] "1467684000"

  ## Default system rendering of date to console
dt4th

## [1] "2016-07-04 21:00:00 CDT"

There are many ways to express dates, and they can be found in text, in logfiles, and in pretty much any data you have to deal with. However, much of the variety comes simply from rearranging the parts of the date and time, using different separators, and using either numbers or words for months, and the like. And so, a mini-language was devised to describe the different parts of a date. It is used both when formatting a datetime for output (as above) and when parsing a string of a known format, which we will get to. That mini-language is expressed with the %-letter tokens above, and the separators appear as-is.

For example, here is a short, incomplete list of what those specifiers above mean:

Token Meaning
%m Month as decimal number (01-12).
%d Day of the month as decimal number (01-31).
%Y Year as decimal number (4 digits, like '2016').
%I Hours as decimal number (01-12).
%p AM or PM string.
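
For instance, combining just the tokens in this table (using the dt4th object created earlier; the output follows from the examples above):

strftime(dt4th, "%m/%d/%Y %I %p")

## [1] "07/04/2016 09 PM"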

That mini-language comes from the old C strptime/strftime functions, which exist to simplify this problem. You can see a reference to it here:

http://man7.org/linux/man-pages/man3/strptime.3.html

That page has a guide to how to interpret the special % formatting tokens. Also, this can be found in the help pages of your language of choice. For sure, both R and Python implement this spec for reading and formatting dates and datetimes with their implementation of strptime/strftime.

Parsing datetimes

Those formatting codes work both ways. Used as above, you can control the format of the rendered representation of your datetime. But you can also use them to describe the format of a string that represents a time, and let your language figure out how to translate that into its own internal representation of that datetime. You saw a hint of that at the beginning when I first created the dt4th object with strptime().

This is highly convenient. In the absence of this, or not knowing about it, you might resort to regular expressions to pull apart the portions of the datetime string to get the day, month, year, hour, minute, and second, and then put together a bunch of complex logic to construct your date. Don't do this! That is reinventing a wheel that people a few generations back already invented and hardened, and you will likely end up with error-prone ersatz code.

Let's try the reverse of above: parse various strings into datetimes, and show that they give you the same things:

  ## A common format
  ##
  ## In R, typing the name at the console will return a formatted value
  ## of the thing the name points to, and if it is a certain class, there
  ## are default formats that are used.  We will use this property to show
  ## that we can parse from different string structures and get the same
  ## underlying object.
dt_string <- "7/4/2016 09:00:00 PM"
dt_obj <- strptime(dt_string, "%m/%d/%Y %r")
dt_obj

## [1] "2016-07-04 21:00:00 CDT"

  ## European, military time
dt_string <- "04/07/2016 21:00:00"
dt_obj <- strptime(dt_string, "%d/%m/%Y %T")
dt_obj

## [1] "2016-07-04 21:00:00 CDT"

  ## With words
dt_string <- "July 04, 2016 21:00:00"
dt_obj <- strptime(dt_string, "%B %d, %Y %T")
dt_obj

## [1] "2016-07-04 21:00:00 CDT"

  ## A Unix epoch time
dt_string <- "1467684000"
dt_obj <- strptime(dt_string, "%s")
dt_obj

## [1] "2016-07-04 21:00:00 CDT"

Note that they all return a date in the same format. Under the covers, there is a datetime object that your programming language represents in some internal data structure, but when it shows you the date, it formats it in some default format that is usually easy for humans to read.

Again, use this power to parse dates from ingested data when you can. Operations such as:

  • Finding minutes, hours, or days between dates.
  • Snapping datetime stamps to top of the hour, midnight of the day, or first of the month.
  • Checking if a datetime is between two other datetimes
  • Making a sequence of dates or datetimes at constant intervals (minute, hour, day, etc.)

are usually easier, and less error prone, with datetime objects than with strings. A short sketch of these follows.
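
Here is a minimal sketch of those operations, using base R plus lubridate (loaded in Setup); trunc(), floor_date(), and the datetime method for seq() are the only pieces not shown elsewhere in this writeup:

  ## Two datetime objects to work with
a <- as.POSIXct("2016-07-04 21:00:00")
b <- as.POSIXct("2016-07-06 09:30:00")

  ## Hours between the two datetimes
difftime(b, a, units="hours")

  ## Snap b to the top of its hour (base R), and to the first of its month (lubridate)
trunc(b, "hours")
lubridate::floor_date(b, "month")

  ## Is b between a and some later datetime?
a < b & b < as.POSIXct("2016-07-10 00:00:00")

  ## A sequence of datetimes at a constant one-hour interval
seq(a, by="1 hour", length.out=4)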

For R, there is a great package (among many) called lubridate that makes a lot of this easy. A vignette on that can be found here:

https://cran.r-project.org/web/packages/lubridate/vignettes/lubridate.html

Here you can find even more convenience functions that hide some of the complexity of this mini-language and bundle up some of its concepts for easy parsing. Using those tools is its own presentation topic. However, I'll show one convenience function to give the flavor:

lubridate::ymd_hms("2016-07-04T21:00:00")

## [1] "2016-07-04 21:00:00 UTC"

lubridate::ymd_hms("2016/07/04 09:00:00 PM")

## [1] "2016-07-04 21:00:00 UTC"

Note how you didn't need to specify the full mini-language % format string, with components and separators and whether the time is AM/PM or 24-hour. This particular convenience function expects a date in roughly 'year-month-day hour-minute-second' order and works out the separators and time specifics itself. Very handy.
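
lubridate also has sibling parsers for the other common component orders; for example:

  ## US month-day-year order, date only
lubridate::mdy("7/4/2016")

  ## European day-month-year order, with a time
lubridate::dmy_hms("04/07/2016 21:00:00")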

Internal structures, briefly

It is worth a quick aside to see what is meant by "internal datetime representation" as distinct from the formatted string you see. Mainly, I will show that on Linux and related OSes, there are two common but different representations.

Unix and its flavors have a standard called POSIX, which helps keep the flavors standardized in many respects so that it isn't too hard to make those systems compatible with each other. One thing POSIX prescribes is internal datetime representations, and there are actually two standards.

The first, and less used, is called POSIXlt. It internally stores the datetime as a list with slots for the different parts of the datetime. Let's take a look:

dt4th <- strptime("7/4/2016 09:00:00 PM", "%m/%d/%Y %r")

dt4th %>% as.POSIXlt() %>% unclass()

## $sec
## [1] 0
## 
## $min
## [1] 0
## 
## $hour
## [1] 21
## 
## $mday
## [1] 4
## 
## $mon
## [1] 6
## 
## $year
## [1] 116
## 
## $wday
## [1] 1
## 
## $yday
## [1] 185
## 
## $isdst
## [1] 1
## 
## $zone
## [1] "CDT"
## 
## $gmtoff
## [1] NA

You can see that there is a slot for each piece of information: the day, month, year, timezone, isdst ("is this datetime in DST?"), etc. That is the underlying structure of information that must be manipulated to give you your final formatted datetime string.

A second, more compact structure is POSIXct:

dt4th <- strptime("7/4/2016 09:00:00 PM", "%m/%d/%Y %r")

dt4th %>% as.POSIXct() %>% unclass()

## [1] 1467684000
## attr(,"tzone")
## [1] ""

Note that this is a single number: the number of seconds between this datetime and midnight UTC of January 1, 1970. It is from this single number that functions must produce your nicely formatted datetime string.
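
Going the other direction makes for a nice check: handing as.POSIXct() that bare second count, plus the epoch origin and a timezone, reconstructs the datetime we have been working with.

as.POSIXct(1467684000, origin="1970-01-01", tz="America/Chicago")

## [1] "2016-07-04 21:00:00 CDT"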

There are other formats, such as Microsoft's, which is common in Excel: the number of days since Jan 1, 1900, with the decimal part corresponding to a fraction of a day, which can be converted to hours, minutes, seconds, etc. You can find a bit more on that here: https://support.microsoft.com/en-us/kb/214094

The rub:

In this system, the serial number 1 represents 1/1/1900 12:00:00 a.m. Times are stored as decimal numbers between .0 and .99999, where .0 is 00:00:00 and .99999 is 23:59:59. The date integers and time decimal fractions can be combined to create numbers that have a decimal and an integer portion. For example, the number 32331.06 represents the date and time 7/7/1988 1:26:24 a.m.
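
A quick sketch to verify that quoted example in R. Note the origin of 1899-12-30 rather than 1900-01-01: it compensates both for serial number 1 being Jan 1, 1900 and for Excel incorrectly treating 1900 as a leap year.

  ## The integer part of the serial number is the date...
as.Date(floor(32331.06), origin="1899-12-30")

## [1] "1988-07-07"

  ## ...and the fractional part is the time of day:
  ## 0.06 * 86400 = 5184 seconds = 1 hour, 26 minutes, 24 seconds past midnight.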

And, to parse a raw Microsoft datetime in R, which I knew was in GMT but wanted to see in the 'America/New_York' timezone, I did the following. It gets a little ahead of the timezone discussion below, but it is worth having the example somewhere...

  ## Microsoft datetime
dt_ms <- 41140
dt_ms %>%
  as.Date(origin = "1899-12-30") %>%
  {.POSIXct(unclass(.)*86400, tz="GMT")} %>%
  lubridate::force_tz("America/New_York")

## [1] "2012-08-19 EDT"

You might see why standards are desired.

Timezones

OK, the main event... dealing with timezones. And, hand-in-glove, daylight savings time.

Timezones are complicated beasts, and easy to get wrong. To know what local time an event happened at for someone in timezone A when it was recorded in timezone B's local time, some adding or subtracting of hours has to happen. People have tried to tame this complexity with schemes that do the arithmetic for you, since doing it explicitly is error prone. I'll try to lay out how that can generally work here.

Quick quiz

  • Is Central Standard Time a timezone?
  • What timezone is Arizona in?
  • Are Southern hemisphere regions that roughly line up in longitude with Northern hemisphere regions often in the same timezone?

Answers:

  • No.
  • By Olson database identifier, 'America/Phoenix'. What does that mean though?
  • No.

Basics, and brief overview

The earth is a sphere that spins, and the sun shines on it. This makes for complications.

People like to work in local time. Noon is generally thought of as when the sun is directly overhead. Being a spinning sphere, at any given moment it is noon at some place on earth and not at others; on the opposite side of the globe from noon, it is actually midnight. So when you tell someone that something happened at noon, you have to specify where as well, so a person can translate it into a meaningful time for themselves.

So there is a tension. We want to work with local times, but how local is good enough? When it is noon for you, 100 miles away solar noon may come a few minutes later. Should people there operate on a different local time than you, offset by those few minutes? The convention is no: we break the earth roughly into 24 zones and consider everybody in the same zone to have the same time, regardless of whether the sun is directly overhead at that zone's noon. It will be close enough to overhead to call it noon.

These 24 zones are close to, but not quite, timezones. The thing that makes a region a timezone is this grouping plus how it deals with daylight savings time. For example, Arizona does not observe DST, so for part of the year you could consider it to be in the Mountain Time Zone, and for the part of the year when everybody else in the Mountain Time Zone observes DST, Arizona's clocks match the Pacific Time Zone instead. You can see how this complicates things.

So, with this desire to have local times, part of the solution is to use a single reference time that everyone can refer to, stating their local time relative to that standard time. That is what GMT (Greenwich Mean Time), a.k.a. UTC (Coordinated Universal Time), is for. The way I think of this is that UTC is where the sun is directly overhead at noon, every day. I need to specify 'every day' because under Daylight Savings Time we shift the clocks ahead, which means that when you are in DST, the sun is directly overhead at 1pm. UTC never observes DST, so every day the sun is overhead at noon. All other times can be thought of as referencing this time, saying how much time you need to add or subtract from it to get the equivalent local time you care about for your area. For example, in Minneapolis, MN, if it is 3am on July 4, 2016, then I know that I am 5 hours behind UTC. So, to specify my datetime with its UTC offset, I write:

2016-07-04T03:00:00-05:00

Meaning: the local time is 3am on July 4, 2016, and the trailing -05:00 says that this local time is 5 hours behind UTC; adding those 5 hours back gives the UTC instant, 8am on July 4, 2016.
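
Base R can parse this offset notation with the %z input token (written without the colon in the offset); the result should come back rendered in whatever tz you ask for:

strptime("2016-07-04T03:00:00-0500", "%Y-%m-%dT%H:%M:%S%z", tz="UTC")

## Expected: "2016-07-04 08:00:00 UTC"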

We are close to a timezone now...

The part that makes an area a timezone is incorporating whether, and then how, that area observes DST.

What is a timezone, and what isn't

From: http://stackoverflow.com/tags/timezone/info :

A time zone is a region on Earth that has a uniform, legally mandated standard time. It is often represented by an identifier, such as "America/Los_Angeles". Do not mistake an offset such as -07:00 as a time zone. A time zone is much more than that. Please read the entire tag wiki for details.

A timezone is effectively your offset from UTC during non-DST time, your offset during DST, and when you start and stop observing DST.

This is the first distinction that bit me. I live in the Midwest of the US, and I thought CST (Central Standard Time) was a timezone. It isn't. CST is a constant offset of 6 hours from UTC (CST is 6 hours behind UTC). But when we are in Daylight Savings Time, we are offset from UTC by only 5 hours. The things that make two points belong to the same timezone, then, are:

  • Daylight Savings Time starts and stops at exactly the same instant in time for the two points.
  • The offsets from UTC are identical both in and out of DST.

So, for me, if two points observe CST and CDT over exactly the same ranges of time, they are in the same timezone.
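
You can watch the offset move with the %Z (abbreviation) and %z (offset) output tokens. A quick sketch for America/Chicago, one timestamp in winter and one in summer:

strftime(as.POSIXct("2016-01-15 12:00:00", tz="America/Chicago"), "%Z %z")

## Expected: "CST -0600"

strftime(as.POSIXct("2016-07-15 12:00:00", tz="America/Chicago"), "%Z %z")

## Expected: "CDT -0500"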

So, specifying a timezone has a lot of parts. POSIX has a standard format for specifying these offsets and the times at which they take effect, which you can read about here: http://www.ibm.com/developerworks/aix/library/au-aix-posix/ .

If you read the above and understand the convention, then cool. But since this is a bit involved, a guy named Arthur David Olson came up with a way to index all of these different formats under common names to make this a bit easier. For example, from the above link, the POSIX spec for the timezone of Minneapolis, MN would be:

CST6CDT,M3.2.0/2:00:00,M11.1.0/2:00:00

You can read the link above to decipher this. However, a lot of areas use this exact same spec, so Olson grouped everything that uses it under the tag name:

America/Chicago

So there is a database that he maintained where you could look up America/Chicago and you'd get back the POSIX spec that is usable by computer operating systems.

Again, that POSIX spec, and hence America/Chicago, are considered timezones.
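
R ships with a copy of this database, and base R's OlsonNames() will list the tags your system knows about:

  ## All Olson/IANA tag names known to R (several hundred of them)
length(OlsonNames())

  ## Confirm the one we care about is present
grep("America/Chicago", OlsonNames(), value=TRUE)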

A Practical Example: logs with local timestamps spanning DST changes

So, here is a problem I have had, and have had to overcome.

What happens when timestamps in logfiles are recorded in local time, and they span DST changes?

So, at the spring-forward DST change, logs will just have a gap of an hour: when, at 2am, you turn your clock ahead to 3am, there will be no timestamps in the 2am hour. Not a hard problem.
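
You can see that gap directly when parsing: 2:30am simply does not exist in America/New_York on 2015-03-08, and lubridate (in my experience) returns NA with a "failed to parse" warning:

ymd_hms("2015-03-08 02:30:00", tz="America/New_York")

## [1] NA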

However, at the fall-back change, the 1am-2am hour timestamps repeat: when you turn the clock back from 2am to 1am, you will see times increasing from 1am to 2am, and then timestamps starting again at 1am.

How can you deal with this?

Let's simulate the problem. But first, let's make some observations... The data we want to look at looks roughly like this:

2015-03-08 00:15:00

In our case, we know that this timestamp originated in the America/New_York timezone (Olson name). That means that during standard time it is 5 hours behind UTC, and during DST it is 4 hours behind UTC. We also know that this is the date on which this timezone springs forward, at 2am.

First, let's verify that if we put this time into an R datetime object, it parses correctly and jibes with what we think its UTC offset is.

  ## Force the local timestamp to have the known timezone
dt <- ymd_hms("2015-03-08 00:15:00", tz="America/New_York")
print(dt)

## [1] "2015-03-08 00:15:00 EST"

  ## What is this timestamp in UTC time?
dt %>% with_tz("UTC") %>% print()

## [1] "2015-03-08 05:15:00 UTC"

  ## check if this datetime stamp is in DST yet.
dst(dt)

## [1] FALSE

So far, so good. This timestamp is 15 minutes past midnight, so we are still on EST, not yet into DST (which is signified by 'EDT' rather than 'EST'). And when we look at its time in UTC, it is 5 hours ahead, as we expect. For the record, with_tz() is a handy lubridate function that takes a timestamp with a known timezone and returns the same instant expressed in the supplied timezone (here, "UTC"). Very nice.
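
Since force_tz() appeared in the Microsoft example earlier and with_tz() appears here, it is worth contrasting the two:

dt <- ymd_hms("2015-03-08 00:15:00", tz="America/New_York")

  ## with_tz(): same instant, shown in a different timezone
with_tz(dt, "UTC")

## [1] "2015-03-08 05:15:00 UTC"

  ## force_tz(): same clock reading, stamped with a different timezone (a different instant)
force_tz(dt, "UTC")

## [1] "2015-03-08 00:15:00 UTC"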

Let's check the fall-back end of things. The following UTC times correspond to the first time we hit 1am local on the fall-back day, and the second time we hit it, after we turn back the clocks:

ymd_hms("2015-11-01 05:00:00", tz="UTC") %>% with_tz("America/New_York")

## [1] "2015-11-01 01:00:00 EDT"

ymd_hms("2015-11-01 06:00:00", tz="UTC") %>% with_tz("America/New_York")

## [1] "2015-11-01 01:00:00 EST"

Note that these both give 1am, but the first is in EDT, and the second in EST. So this is great.

However, our problem is a little more subtle. We have an America/New_York kind of timestamp. When we get the timestamp "2015-11-01 01:00:00", how do we know whether it is the first 1am we encounter, or the second one, after we turn the clocks back?

In fact, if we try this with R, it makes an assumption:

dt <- ymd_hms("2015-11-01 01:00:00", tz="America/New_York")
print(dt)

## [1] "2015-11-01 01:00:00 EST"

  ## Are we in DST?
dst(dt)

## [1] FALSE

It assumes it is the second pass at 1am. Well, that's a problem, because our data has timestamps like the following:

  ## Datetimes, 15min increments, that span DST breaks
dt_str <- c(
    # Spring forward
   "2015-03-08 00:00:00"
  , "2015-03-08 00:15:00"
  , "2015-03-08 00:30:00"
  , "2015-03-08 00:45:00"
  , "2015-03-08 01:00:00"
  , "2015-03-08 01:15:00"
  , "2015-03-08 01:30:00"
  , "2015-03-08 01:45:00"
  , "2015-03-08 03:00:00"
  , "2015-03-08 03:15:00"
  , "2015-03-08 03:30:00"
  , "2015-03-08 03:45:00"
  , "2015-03-08 04:00:00"
    # Fall back
  , "2015-11-01 00:00:00"
  , "2015-11-01 00:15:00"
  , "2015-11-01 00:30:00"
  , "2015-11-01 00:45:00"
  , "2015-11-01 01:00:00"
  , "2015-11-01 01:15:00"
  , "2015-11-01 01:30:00"
  , "2015-11-01 01:45:00"
  , "2015-11-01 01:00:00"
  , "2015-11-01 01:15:00"
  , "2015-11-01 01:30:00"
  , "2015-11-01 01:45:00"
  , "2015-11-01 02:00:00"
  , "2015-11-01 02:15:00"
)

dat <- data.frame(dt_str, stringsAsFactors=FALSE) %>%
  ## http://stackoverflow.com/questions/15230446/cumulative-sequence-of-occurrences-of-values 
  ## Add a sequence number, make sure we keep track of row order
  mutate(rownum = sequence(n())) %>%
  select(rownum, dt_str)

dat %>% print.data.frame()

##    rownum              dt_str
## 1       1 2015-03-08 00:00:00
## 2       2 2015-03-08 00:15:00
## 3       3 2015-03-08 00:30:00
## 4       4 2015-03-08 00:45:00
## 5       5 2015-03-08 01:00:00
## 6       6 2015-03-08 01:15:00
## 7       7 2015-03-08 01:30:00
## 8       8 2015-03-08 01:45:00
## 9       9 2015-03-08 03:00:00
## 10     10 2015-03-08 03:15:00
## 11     11 2015-03-08 03:30:00
## 12     12 2015-03-08 03:45:00
## 13     13 2015-03-08 04:00:00
## 14     14 2015-11-01 00:00:00
## 15     15 2015-11-01 00:15:00
## 16     16 2015-11-01 00:30:00
## 17     17 2015-11-01 00:45:00
## 18     18 2015-11-01 01:00:00
## 19     19 2015-11-01 01:15:00
## 20     20 2015-11-01 01:30:00
## 21     21 2015-11-01 01:45:00
## 22     22 2015-11-01 01:00:00
## 23     23 2015-11-01 01:15:00
## 24     24 2015-11-01 01:30:00
## 25     25 2015-11-01 01:45:00
## 26     26 2015-11-01 02:00:00
## 27     27 2015-11-01 02:15:00

Again, the timestamps in the 1am hour repeat in November: the first set are from DST, and the second set are not.

How can you deal with this?

Option 1: Filter out potential problem records

If you have no reliable way of determining which pass of the 1am-2am hour a record comes from, and it is important that you know which pass you are dealing with, it is best to just delete those observations.

To do this, however, you need to identify the day on which the fall back happens.

date_turn_back_clocks <- function(tzn, yr) {

    ## From the given year and timezone, construct a vector of all of the days in
    ## the given year and timezone, but as timestamps of noon of those days.
  diy_vec <- seq(ymd_hms(sprintf("%s-01-01T12:00:00", yr), tz=tzn)
    , ymd_hms(sprintf("%s-12-31T12:00:00", yr), tz=tzn)
    , by="day")

    ## Vector along days just made, indicating if noon of date is in dst or not.
  isdst <- dst(diy_vec)

    ## Short-circuit out if no dst, return an NA value in that case.
  if(!any(isdst)) return(NA_integer_)

    ## rle gives information about runs.  Here we find the runs and calculate the
    ## start and end index of those runs found by rle.
    ## See: http://masterr.org/r/how-to-find-consecutive-repeats-in-r/
    ## for some help.  I made the calc for runs_start_index.
  runs <- rle(isdst)
  runs_end_index <- cumsum(runs$length)
  runs_start_index <- runs_end_index %>% lag() %>% ifelse(is.na(.), 0, .) %>% {. + 1}

    ## Index of first day after DST ends
    ## logic: since for a given year, DST is in the opposite part of the year
    ## in the southern hemisphere, we have to be careful how we ID the first
    ## day after DST.  In the northern hemisphere, DST will be a consecutive
    ## block of days in the middle of the year (isdst will have an unbroken run
    ## of TRUE values).  In the Southern Hemisphere, DST will span the new year,
    ## so Jan 1 will be in DST, then the middle of the year will have a run
    ## of isdst == FALSE, and then a run of isdst == TRUE at end of year.
    ## To ID this, find the runs of FALSEs in the year (1 run in the S. Hemi., 2 in
    ## the N. Hemi.), and then find the first index of the last run of FALSEs
    ## in the year.
    ## Ex: general pattern of days that are in dst (T) or not in dst (F)
    ##  N. Hemi:
    ##  Jan ... Mar ... Nov ... Dec
    ##  F F ... F T ... T F ... F F
    ##
    ##  S. Hemi:
    ##  Jan ... Mar ... Nov ... Dec
    ##  T T ... T F ... F T ... T T
    ##
    ## The first index of the last FALSE run is the day on which the clocks rolled back:
    ## we check the DST status of noon of each day, and on the first day of that
    ## last FALSE run, the clocks rolled back earlier that morning.
  first_day_after_dst_index <- runs_start_index[runs$values == FALSE] %>% rev() %>% .[1]

    ## Return that date
  diy_vec[first_day_after_dst_index] %>% as.Date()
}

  ## Memoise the function, so repeated calls to the same year and timezone are
  ## just a hash lookup and not the whole computation given in the function definition.
date_turn_back_clocks <- memoise::memoise(date_turn_back_clocks)

  ## And Vectorize it
date_turn_back_clocks <- Vectorize(date_turn_back_clocks)

  # America/Chicago
date_turn_back_clocks(tzn="America/Chicago", yr=2011) %>% as.Date(origin="1970-01-01")

## America/Chicago 
##    "2011-11-06"

  # No DST
date_turn_back_clocks(tzn="Africa/Johannesburg", yr=2011) %>% as.Date(origin="1970-01-01")

## Africa/Johannesburg 
##                  NA

  # S. Hemi, spring is the end of DST
date_turn_back_clocks(tzn="Pacific/Easter", yr=2011) %>% as.Date(origin="1970-01-01")

## Pacific/Easter 
##   "2011-05-08"

So now we have a function to identify the day of the year on which DST ends in a given timezone. Let's use it to filter out the records we don't want (the 1am hour on the fall-back day):

dat %>%
  mutate(dt_tz = ymd_hms(dt_str, tz="America/New_York")) %>%
  mutate(end_dst_date = date_turn_back_clocks(tz(dt_tz), year(dt_tz)) %>% as.Date(origin="1970-01-01")) %>%
  filter(!((as.Date(dt_tz) == end_dst_date) & (hour(dt_tz) == 1))) %>%
  ungroup()

##    rownum              dt_str               dt_tz end_dst_date
## 1       1 2015-03-08 00:00:00 2015-03-08 00:00:00   2015-11-01
## 2       2 2015-03-08 00:15:00 2015-03-08 00:15:00   2015-11-01
## 3       3 2015-03-08 00:30:00 2015-03-08 00:30:00   2015-11-01
## 4       4 2015-03-08 00:45:00 2015-03-08 00:45:00   2015-11-01
## 5       5 2015-03-08 01:00:00 2015-03-08 01:00:00   2015-11-01
## 6       6 2015-03-08 01:15:00 2015-03-08 01:15:00   2015-11-01
## 7       7 2015-03-08 01:30:00 2015-03-08 01:30:00   2015-11-01
## 8       8 2015-03-08 01:45:00 2015-03-08 01:45:00   2015-11-01
## 9       9 2015-03-08 03:00:00 2015-03-08 03:00:00   2015-11-01
## 10     10 2015-03-08 03:15:00 2015-03-08 03:15:00   2015-11-01
## 11     11 2015-03-08 03:30:00 2015-03-08 03:30:00   2015-11-01
## 12     12 2015-03-08 03:45:00 2015-03-08 03:45:00   2015-11-01
## 13     13 2015-03-08 04:00:00 2015-03-08 04:00:00   2015-11-01
## 14     14 2015-11-01 00:00:00 2015-11-01 00:00:00   2015-11-01
## 15     15 2015-11-01 00:15:00 2015-11-01 00:15:00   2015-11-01
## 16     16 2015-11-01 00:30:00 2015-11-01 00:30:00   2015-11-01
## 17     17 2015-11-01 00:45:00 2015-11-01 00:45:00   2015-11-01
## 18     26 2015-11-01 02:00:00 2015-11-01 02:00:00   2015-11-01
## 19     27 2015-11-01 02:15:00 2015-11-01 02:15:00   2015-11-01

Option 2: Cast the first and second passes of the fall-back overlap hour to the right UTC times

In order to do this, you have to be able to take a row and determine whether it is from the first pass over the 1am hour or the second. If these are local timestamps, you won't be able to use the raw timestamps to determine this, as they look identical between the first and second passes. You will have to ascertain the pass from the order in which records occur in the logfile. So the following assumptions are made in this solution:

  • Log records are read in time-sequence order -- that is, records occurring earlier in the file also occur earlier in time, regardless of the timestamp.
  • Records exist in both the first pass of the overlap hour and the second pass.
  • At least one record in the second pass has an earlier timestamp than some timestamp in the first pass.

These are all assumptions the code needs in order to determine whether a given log record's timestamp occurs during the first pass at the overlap hour or the second.

dat %>%
  mutate(dt_tz = ymd_hms(dt_str, tz="America/New_York")) %>%
  mutate(end_dst_date = date_turn_back_clocks(tz(dt_tz), year(dt_tz)) %>% as.Date(origin="1970-01-01")) %>%
  mutate(fallback_hour = ifelse((as.Date(dt_tz) == end_dst_date) & (hour(dt_tz) == 1), TRUE, FALSE )) %>%
  group_by(end_dst_date, fallback_hour) %>%
  do((function(df) {
    if (!any(df$fallback_hour)) {
        ## Normal, just cast to UTC
      df$dt_utc <- with_tz(df$dt_tz, "UTC")
    } else {
        ## Have to determine if timestamp is 1st or 2nd pass through
        ## Lagged differences between successive timestamps are positive except
        ## at the moment the clock turns back; a running min (capped at 0 by the
        ## extra argument passed to min) goes negative there and stays negative,
        ## flagging every row of the second pass.
      dt_tz_lag <- dplyr::lag(df$dt_tz)
      second_pass <- purrr::accumulate(df$dt_tz - dt_tz_lag, min, na.rm=TRUE, 0) %>%
        {ifelse((. < 0) & !is.na(.), TRUE, FALSE)}

        ## If second pass, with_tz works correctly by its assumptions (R parsed the
        ## ambiguous hour as the second, EST pass); in the first pass we need to
        ## subtract an hour.
      df$dt_utc <- ifelse(second_pass
                          , with_tz(df$dt_tz, "UTC")
                          , with_tz(df$dt_tz, "UTC") - hours(1)) %>%
                      as.POSIXct(origin="1970-01-01")
    }

    df
  })(.)) %>%
  ungroup() %>%
  arrange(rownum) %>%
  print.data.frame()

##    rownum              dt_str               dt_tz end_dst_date fallback_hour              dt_utc
## 1       1 2015-03-08 00:00:00 2015-03-08 00:00:00   2015-11-01         FALSE 2015-03-08 05:00:00
## 2       2 2015-03-08 00:15:00 2015-03-08 00:15:00   2015-11-01         FALSE 2015-03-08 05:15:00
## 3       3 2015-03-08 00:30:00 2015-03-08 00:30:00   2015-11-01         FALSE 2015-03-08 05:30:00
## 4       4 2015-03-08 00:45:00 2015-03-08 00:45:00   2015-11-01         FALSE 2015-03-08 05:45:00
## 5       5 2015-03-08 01:00:00 2015-03-08 01:00:00   2015-11-01         FALSE 2015-03-08 06:00:00
## 6       6 2015-03-08 01:15:00 2015-03-08 01:15:00   2015-11-01         FALSE 2015-03-08 06:15:00
## 7       7 2015-03-08 01:30:00 2015-03-08 01:30:00   2015-11-01         FALSE 2015-03-08 06:30:00
## 8       8 2015-03-08 01:45:00 2015-03-08 01:45:00   2015-11-01         FALSE 2015-03-08 06:45:00
## 9       9 2015-03-08 03:00:00 2015-03-08 03:00:00   2015-11-01         FALSE 2015-03-08 07:00:00
## 10     10 2015-03-08 03:15:00 2015-03-08 03:15:00   2015-11-01         FALSE 2015-03-08 07:15:00
## 11     11 2015-03-08 03:30:00 2015-03-08 03:30:00   2015-11-01         FALSE 2015-03-08 07:30:00
## 12     12 2015-03-08 03:45:00 2015-03-08 03:45:00   2015-11-01         FALSE 2015-03-08 07:45:00
## 13     13 2015-03-08 04:00:00 2015-03-08 04:00:00   2015-11-01         FALSE 2015-03-08 08:00:00
## 14     14 2015-11-01 00:00:00 2015-11-01 00:00:00   2015-11-01         FALSE 2015-11-01 04:00:00
## 15     15 2015-11-01 00:15:00 2015-11-01 00:15:00   2015-11-01         FALSE 2015-11-01 04:15:00
## 16     16 2015-11-01 00:30:00 2015-11-01 00:30:00   2015-11-01         FALSE 2015-11-01 04:30:00
## 17     17 2015-11-01 00:45:00 2015-11-01 00:45:00   2015-11-01         FALSE 2015-11-01 04:45:00
## 18     18 2015-11-01 01:00:00 2015-11-01 01:00:00   2015-11-01          TRUE 2015-11-01 05:00:00
## 19     19 2015-11-01 01:15:00 2015-11-01 01:15:00   2015-11-01          TRUE 2015-11-01 05:15:00
## 20     20 2015-11-01 01:30:00 2015-11-01 01:30:00   2015-11-01          TRUE 2015-11-01 05:30:00
## 21     21 2015-11-01 01:45:00 2015-11-01 01:45:00   2015-11-01          TRUE 2015-11-01 05:45:00
## 22     22 2015-11-01 01:00:00 2015-11-01 01:00:00   2015-11-01          TRUE 2015-11-01 06:00:00
## 23     23 2015-11-01 01:15:00 2015-11-01 01:15:00   2015-11-01          TRUE 2015-11-01 06:15:00
## 24     24 2015-11-01 01:30:00 2015-11-01 01:30:00   2015-11-01          TRUE 2015-11-01 06:30:00
## 25     25 2015-11-01 01:45:00 2015-11-01 01:45:00   2015-11-01          TRUE 2015-11-01 06:45:00
## 26     26 2015-11-01 02:00:00 2015-11-01 02:00:00   2015-11-01         FALSE 2015-11-01 07:00:00
## 27     27 2015-11-01 02:15:00 2015-11-01 02:15:00   2015-11-01         FALSE 2015-11-01 07:15:00

Note that in November, on the day the clocks turn back, the first pass through the 1am hour maps to the correct UTC hour, and the second pass through 1am maps to the next UTC hour, blending correctly with what 2am standard time becomes.

