Timezone Presentation given at HON

Introduction

Like all those possessing a library, Aurelian was aware that he was guilty of not knowing his in its entirety.

-- Jorge Luis Borges

Setup

knitr::opts_chunk$set(echo = TRUE, fig.width=12, fig.height=9, warning=FALSE)
options(width=116)

suppressPackageStartupMessages(library(dplyr))
suppressPackageStartupMessages(library(lubridate))

Datetime formats

Formatting dates

Dates and datetimes can come to you in a variety of formats and schemes. Below is a brief, incomplete list of examples. We'll address the funky syntax in a later section, on the assumption that you are not familiar with it.

Let's look at one datetime, July 4, 2016, 9pm, rendered in different formats:

  ## First, make a single datetime object.
  ## Again, we'll address the funky second argument later -- for now, it is a convenience I am using
dt4th <- strptime("7/4/2016 09:00:00 PM", "%m/%d/%Y %r")

  ## Write it back out in the format it originally came in
strftime(dt4th, "%m/%d/%Y %r")

## [1] "07/04/2016 09:00:00 PM"

  ## US date format, military time
strftime(dt4th, "%m/%d/%Y %T")

## [1] "07/04/2016 21:00:00"

  ## European date convention
strftime(dt4th, "%d/%m/%Y %T")

## [1] "04/07/2016 21:00:00"

  ## ISO localtime
strftime(dt4th, "%FT%T")

## [1] "2016-07-04T21:00:00"

  ## With words, with day of week
strftime(dt4th, "%A, %B %d, %Y")

## [1] "Monday, July 04, 2016"

  ## With words, European style, 0-padded
strftime(dt4th, "%d %B, %Y, %I %p")

## [1] "04 July, 2016, 09 PM"

  ## An epoch time, which is the number of seconds since midnight UTC, Jan 1, 1970
strftime(dt4th, "%s")

## [1] "1467684000"

  ## Default system rendering of date to console
dt4th

## [1] "2016-07-04 21:00:00 CDT"

There are many ways to express dates, and they can be found in text, in logfiles, and in pretty much any data you have to deal with. However, much of the variety comes simply from rearranging the parts of the date and time, using different separators, and using either numbers or words for months, and the like. And so, a mini-language was devised to describe the different parts of a date. It is used both when formatting a datetime for output (as above) and when parsing a string of a known format, which we will get to. That mini-language is expressed with the %-letter tokens above, and the separators appear as-is.

For example, here is a short, incomplete list of what those specifiers above mean:

Token Meaning
%m Month as decimal number (01-12).
%d Day of the month as decimal number (01-31).
%Y Year as decimal number (4 digits, like '2016').
%I Hours as decimal number (01-12).
%p AM or PM string.
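
For instance, combining just the tokens in this table (using the dt4th object created earlier; the output follows from the examples above):

strftime(dt4th, "%m/%d/%Y %I %p")

## [1] "07/04/2016 09 PM"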

That mini-language comes from the old C strptime/strftime functions, which exist to simplify this problem. You can see a reference to it here:

http://man7.org/linux/man-pages/man3/strptime.3.html

That page has a guide to how to interpret the special % formatting tokens. Also, this can be found in the help pages of your language of choice. For sure, both R and Python implement this spec for reading and formatting dates and datetimes with their implementation of strptime/strftime.

Parsing datetimes

Those formatting codes work both ways. Used as above, you can control the format of the rendered representation of your datetime. But you can also use them to describe the format of a string that represents a time, and let your language figure out how to translate that into its own internal representation of that datetime. You saw a hint of that at the beginning when I first created the dt4th object with strptime().

This is highly convenient. In the absence of this, or not knowing about it, you might resort to regular expressions to pull apart the portions of the datetime string to get the day, month, year, hour, minute, and second, and then put together a bunch of complex logic to construct your date. Don't do this! That is reinventing a wheel that people a few generations back already invented and hardened, and you will likely end up with error-prone ersatz code.

Let's try the reverse of above: parse various strings into datetimes, and show that they give you the same things:

  ## A common format
  ##
  ## In R, typing the name at the console will return a formatted value
  ## of the thing the name points to, and if it is a certain class, there
  ## are default formats that are used.  We will use this property to show
  ## that we can parse from different string structures and get the same
  ## underlying object.
dt_string <- "7/4/2016 09:00:00 PM"
dt_obj <- strptime(dt_string, "%m/%d/%Y %r")
dt_obj

## [1] "2016-07-04 21:00:00 CDT"

  ## European, military time
dt_string <- "04/07/2016 21:00:00"
dt_obj <- strptime(dt_string, "%d/%m/%Y %T")
dt_obj

## [1] "2016-07-04 21:00:00 CDT"

  ## With words
dt_string <- "July 04, 2016 21:00:00"
dt_obj <- strptime(dt_string, "%B %d, %Y %T")
dt_obj

## [1] "2016-07-04 21:00:00 CDT"

  ## A Unix epoch time
dt_string <- "1467684000"
dt_obj <- strptime(dt_string, "%s")
dt_obj

## [1] "2016-07-04 21:00:00 CDT"

Note that they all return a date in the same format. Under the covers, there is a datetime object that your programming language represents in some internal data structure, but when it shows you the date, it formats it in some default format that is usually easy for humans to read.

Again, use this power to parse dates from ingested data when you can. Operations such as:

  • Finding minutes, hours, or days between dates.
  • Snapping datetime stamps to top of the hour, midnight of the day, or first of the month.
  • Checking if a datetime is between two other datetimes
  • Making a sequence of dates or datetimes at constant intervals (minute, hour, day, etc.)

are usually easier, and less error prone, with datetime objects than with strings. A short sketch of these follows.
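
Here is a minimal sketch of those operations, using base R plus lubridate (loaded in Setup); trunc(), floor_date(), and the datetime method for seq() are the only pieces not shown elsewhere in this writeup:

  ## Two datetime objects to work with
a <- as.POSIXct("2016-07-04 21:00:00")
b <- as.POSIXct("2016-07-06 09:30:00")

  ## Hours between the two datetimes
difftime(b, a, units="hours")

  ## Snap b to the top of its hour (base R), and to the first of its month (lubridate)
trunc(b, "hours")
lubridate::floor_date(b, "month")

  ## Is b between a and some later datetime?
a < b & b < as.POSIXct("2016-07-10 00:00:00")

  ## A sequence of datetimes at a constant one-hour interval
seq(a, by="1 hour", length.out=4)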

For R, there is a great package (among many) called lubridate that makes a lot of this easy. A vignette on that can be found here:

https://cran.r-project.org/web/packages/lubridate/vignettes/lubridate.html

Here you can find even more convenience functions that hide some of the complexity of this mini-language and bundle up some of its concepts for easy parsing. Using those tools is its own presentation topic. However, I'll show one convenience function to give the flavor:

lubridate::ymd_hms("2016-07-04T21:00:00")

## [1] "2016-07-04 21:00:00 UTC"

lubridate::ymd_hms("2016/07/04 09:00:00 PM")

## [1] "2016-07-04 21:00:00 UTC"

Note how you didn't need to specify the full mini-language % format string, with components and separators and whether the time is AM/PM or 24-hour. This particular convenience function expects a date in roughly 'year-month-day hour-minute-second' order and works out the separators and time specifics itself. Very handy.
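
lubridate also has sibling parsers for the other common component orders; for example:

  ## US month-day-year order, date only
lubridate::mdy("7/4/2016")

  ## European day-month-year order, with a time
lubridate::dmy_hms("04/07/2016 21:00:00")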

Internal structures, briefly

It is worth a quick aside to see what is meant by "internal datetime representation" as distinct from the formatted string you see. Mainly, I will show that on Linux and related OSes, there are two common but different representations.

Unix and its flavors have a standard called POSIX, which helps keep the flavors standardized in many respects so that it isn't too hard to make those systems compatible with each other. One thing POSIX prescribes is internal datetime representations, and there are actually two standards.

The first, and less used, is called POSIXlt. It internally stores the datetime as a list with slots for the different parts of the datetime. Let's take a look:

dt4th <- strptime("7/4/2016 09:00:00 PM", "%m/%d/%Y %r")

dt4th %>% as.POSIXlt() %>% unclass()

## $sec
## [1] 0
## 
## $min
## [1] 0
## 
## $hour
## [1] 21
## 
## $mday
## [1] 4
## 
## $mon
## [1] 6
## 
## $year
## [1] 116
## 
## $wday
## [1] 1
## 
## $yday
## [1] 185
## 
## $isdst
## [1] 1
## 
## $zone
## [1] "CDT"
## 
## $gmtoff
## [1] NA

You can see that there is a slot for each piece of information: the day, month, year, timezone, isdst ("is this datetime in DST?"), etc. That is the underlying structure of information that must be manipulated to give you your final formatted datetime string.

A second, more compact structure is POSIXct:

dt4th <- strptime("7/4/2016 09:00:00 PM", "%m/%d/%Y %r")

dt4th %>% as.POSIXct() %>% unclass()

## [1] 1467684000
## attr(,"tzone")
## [1] ""

Note that this is a single number: the number of seconds between this datetime and midnight UTC of January 1, 1970. It is from this single number that functions must produce your nicely formatted datetime string.
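
Going the other direction makes for a nice check: handing as.POSIXct() that bare second count, plus the epoch origin and a timezone, reconstructs the datetime we have been working with.

as.POSIXct(1467684000, origin="1970-01-01", tz="America/Chicago")

## [1] "2016-07-04 21:00:00 CDT"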

There are other formats, such as Microsoft's, which is common in Excel: the number of days since Jan 1, 1900, with the decimal part corresponding to a fraction of a day, which can be converted to hours, minutes, seconds, etc. You can find a bit more on that here: https://support.microsoft.com/en-us/kb/214094

The rub:

In this system, the serial number 1 represents 1/1/1900 12:00:00 a.m. Times are stored as decimal numbers between .0 and .99999, where .0 is 00:00:00 and .99999 is 23:59:59. The date integers and time decimal fractions can be combined to create numbers that have a decimal and an integer portion. For example, the number 32331.06 represents the date and time 7/7/1988 1:26:24 a.m.
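
A quick sketch to verify that quoted example in R. Note the origin of 1899-12-30 rather than 1900-01-01: it compensates both for serial number 1 being Jan 1, 1900 and for Excel incorrectly treating 1900 as a leap year.

  ## The integer part of the serial number is the date...
as.Date(floor(32331.06), origin="1899-12-30")

## [1] "1988-07-07"

  ## ...and the fractional part is the time of day:
  ## 0.06 * 86400 = 5184 seconds = 1 hour, 26 minutes, 24 seconds past midnight.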

And, to parse a raw Microsoft datetime in R, which I knew was in GMT but wanted to see in the 'America/New_York' timezone, I did the following. It gets a little ahead of the timezone discussion below, but it is worth having the example somewhere...

  ## Microsoft datetime
dt_ms <- 41140
dt_ms %>%
  as.Date(origin = "1899-12-30") %>%
  {.POSIXct(unclass(.)*86400, tz="GMT")} %>%
  lubridate::force_tz("America/New_York")

## [1] "2012-08-19 EDT"

You might see why standards are desired.

Timezones

OK, the main event... dealing with timezones. And, hand-in-glove, daylight savings time.

Timezones are complicated beasts, and easy to get wrong. To know what local time an event happened at for someone in timezone A when it was recorded in timezone B's local time, some adding or subtracting of hours has to happen. People have tried to tame this complexity with schemes that do the arithmetic for you, since doing it explicitly is error prone. I'll try to lay out how that can generally work here.

Quick quiz

  • Is Central Standard Time a timezone?
  • What timezone is Arizona in?
  • Are Southern hemisphere regions that roughly line up in longitude with Northern hemisphere regions often in the same timezone?

Answers:

  • No.
  • By Olson database identifier, 'America/Phoenix'. What does that mean though?
  • No.

Basics, and brief overview

The earth is a sphere that spins, and the sun shines on it. This makes for complications.

People like to work in local time. Noon is generally thought of as when the sun is directly overhead. Being a spinning sphere, at any given moment it is noon at some place on earth and not at others; on the opposite side of the globe from noon, it is actually midnight. So when you tell someone that something happened at noon, you have to specify where as well, so a person can translate it into a meaningful time for themselves.

So there is a tension. We want to work with local times, but how local is good enough? When it is noon for you, 100 miles away solar noon may come a few minutes later. Should people there operate on a different local time than you, offset by those few minutes? The convention is no: we break the earth roughly into 24 zones and consider everybody in the same zone to have the same time, regardless of whether the sun is directly overhead at that zone's noon. It will be close enough to overhead to call it noon.

These 24 zones are close to, but not quite, timezones. The thing that makes a region a timezone is this grouping plus how it deals with daylight savings time. For example, Arizona does not observe DST, so for part of the year you could consider it to be in the Mountain Time Zone, and for the part of the year when everybody else in the Mountain Time Zone observes DST, Arizona's clocks match the Pacific Time Zone instead. You can see how this complicates things.

So, with this desire to have local times, part of the solution is to use a single reference time that everyone can refer to, stating their local time relative to that standard time. That is what GMT (Greenwich Mean Time), a.k.a. UTC (Coordinated Universal Time), is for. The way I think of this is that UTC is where the sun is directly overhead at noon, every day. I need to specify 'every day' because under Daylight Savings Time we shift the clocks ahead, which means that when you are in DST, the sun is directly overhead at 1pm. UTC never observes DST, so every day the sun is overhead at noon. All other times can be thought of as referencing this time, saying how much time you need to add or subtract from it to get the equivalent local time you care about for your area. For example, in Minneapolis, MN, if it is 3am on July 4, 2016, then I know that I am 5 hours behind UTC. So, to specify my datetime with its UTC offset, I write:

2016-07-04T03:00:00-05:00

Meaning: the local time is 3am on July 4, 2016, and the trailing -05:00 says that this local time is 5 hours behind UTC; adding those 5 hours back gives the UTC instant, 8am on July 4, 2016.
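
Base R can parse this offset notation with the %z input token (written without the colon in the offset); the result should come back rendered in whatever tz you ask for:

strptime("2016-07-04T03:00:00-0500", "%Y-%m-%dT%H:%M:%S%z", tz="UTC")

## Expected: "2016-07-04 08:00:00 UTC"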

We are close to a timezone now...

The part that makes an area a timezone is incorporating whether, and then how, that area observes DST.

What is a timezone, and what isn't

From: http://stackoverflow.com/tags/timezone/info :

A time zone is a region on Earth that has a uniform, legally mandated standard time. It is often represented by an identifier, such as "America/Los_Angeles". Do not mistake an offset such as -07:00 as a time zone. A time zone is much more than that. Please read the entire tag wiki for details.

A timezone is effectively your offset from UTC during non-DST time, your offset during DST, and when you start and stop observing DST.

This is the first distinction that bit me. I live in the Midwest of the US, and I thought CST (Central Standard Time) was a timezone. It isn't. CST is a constant offset of 6 hours from UTC (CST is 6 hours behind UTC). But when we are in Daylight Savings Time, we are offset from UTC by only 5 hours. The things that make two points belong to the same timezone, then, are:

  • Daylight Savings Time starts and stops at exactly the same instant in time for the two points.
  • The offsets from UTC are identical both in and out of DST.

So, for me, if two points observe CST and CDT over exactly the same ranges of time, they are in the same timezone.
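
You can watch the offset move with the %Z (abbreviation) and %z (offset) output tokens. A quick sketch for America/Chicago, one timestamp in winter and one in summer:

strftime(as.POSIXct("2016-01-15 12:00:00", tz="America/Chicago"), "%Z %z")

## Expected: "CST -0600"

strftime(as.POSIXct("2016-07-15 12:00:00", tz="America/Chicago"), "%Z %z")

## Expected: "CDT -0500"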

So, specifying a timezone has a lot of parts. POSIX has a standard format for specifying these offsets and the times at which they take effect, which you can read about here: http://www.ibm.com/developerworks/aix/library/au-aix-posix/ .

If you read the above and understand the convention, then cool. But since this is a bit involved, a guy named Arthur David Olson came up with a way to index all of these different formats under common names to make this a bit easier. For example, from the above link, the POSIX spec for the timezone of Minneapolis, MN would be:

CST6CDT,M3.2.0/2:00:00,M11.1.0/2:00:00

You can read the link above to decipher this. However, a lot of areas use this exact same spec, so Olson grouped everything that uses it under the tag name:

America/Chicago

So there is a database that he maintained where you could look up America/Chicago and you'd get back the POSIX spec that is usable by computer operating systems.

Again, that POSIX spec, and hence America/Chicago, are considered timezones.
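
R ships with a copy of this database, and base R's OlsonNames() will list the tags your system knows about:

  ## All Olson/IANA tag names known to R (several hundred of them)
length(OlsonNames())

  ## Confirm the one we care about is present
grep("America/Chicago", OlsonNames(), value=TRUE)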

A Practical Example: logs with local timestamps spanning DST changes

So, here is a problem I have had, and have had to overcome.

What happens when timestamps in logfiles are recorded in local time, and they span DST changes?

So, at the spring-forward DST change, logs will just have a gap of an hour: when, at 2am, you turn your clock ahead to 3am, there will be no timestamps in the 2am hour. Not a hard problem.
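
You can see that gap directly when parsing: 2:30am simply does not exist in America/New_York on 2015-03-08, and lubridate (in my experience) returns NA with a "failed to parse" warning:

ymd_hms("2015-03-08 02:30:00", tz="America/New_York")

## [1] NA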

However, at the fall-back change, the 1am-2am hour timestamps repeat: when you turn the clock back from 2am to 1am, you will see times increasing from 1am to 2am, and then timestamps starting again at 1am.

How can you deal with this?

Let's simulate the problem. But first, let's make some observations... The data we want to look at looks roughly like this:

2015-03-08 00:15:00

In our case, we know that this timestamp originated in the America/New_York timezone (Olson name). That means that during standard time it is 5 hours behind UTC, and during DST it is 4 hours behind UTC. We also know that this is the date on which this timezone springs forward, at 2am.

First, let's verify that if we put this time into an R datetime object, it parses correctly and jibes with what we think its UTC offset is.

  ## Force the local timestamp to have the known timezone
dt <- ymd_hms("2015-03-08 00:15:00", tz="America/New_York")
print(dt)

## [1] "2015-03-08 00:15:00 EST"

  ## What is this timestamp in UTC time?
dt %>% with_tz("UTC") %>% print()

## [1] "2015-03-08 05:15:00 UTC"

  ## check if this datetime stamp is in DST yet.
dst(dt)

## [1] FALSE

So far, so good. This timestamp is 15 minutes past midnight, so we are still on EST, not yet into DST (which is signified by 'EDT' rather than 'EST'). And when we look at its time in UTC, it is 5 hours ahead, as we expect. For the record, with_tz() is a handy lubridate function that takes a timestamp with a known timezone and returns the same instant expressed in the supplied timezone (here, "UTC"). Very nice.
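
Since force_tz() appeared in the Microsoft example earlier and with_tz() appears here, it is worth contrasting the two:

dt <- ymd_hms("2015-03-08 00:15:00", tz="America/New_York")

  ## with_tz(): same instant, shown in a different timezone
with_tz(dt, "UTC")

## [1] "2015-03-08 05:15:00 UTC"

  ## force_tz(): same clock reading, stamped with a different timezone (a different instant)
force_tz(dt, "UTC")

## [1] "2015-03-08 00:15:00 UTC"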

Let's check the fall-back end of things. The following UTC times correspond to the first time we hit 1am local on the fall-back day, and the second time we hit it, after we turn back the clocks:

ymd_hms("2015-11-01 05:00:00", tz="UTC") %>% with_tz("America/New_York")

## [1] "2015-11-01 01:00:00 EDT"

ymd_hms("2015-11-01 06:00:00", tz="UTC") %>% with_tz("America/New_York")

## [1] "2015-11-01 01:00:00 EST"

Note that these both give 1am, but the first is in EDT, and the second in EST. So this is great.

However, our problem is a little more subtle. We have an America/New_York kind of timestamp. When we get the timestamp "2015-11-01 01:00:00", how do we know whether it is the first 1am we encounter, or the second one, after we turn the clocks back?

In fact, if we try this with R, it makes an assumption:

dt <- ymd_hms("2015-11-01 01:00:00", tz="America/New_York")
print(dt)

## [1] "2015-11-01 01:00:00 EST"

  ## Are we in DST?
dst(dt)

## [1] FALSE

It assumes it is the second pass at 1am. Well, that's a problem, because our data has timestamps like the following:

  ## Datetimes, 15min increments, that span DST breaks
dt_str <- c(
    # Spring forward
   "2015-03-08 00:00:00"
  , "2015-03-08 00:15:00"
  , "2015-03-08 00:30:00"
  , "2015-03-08 00:45:00"
  , "2015-03-08 01:00:00"
  , "2015-03-08 01:15:00"
  , "2015-03-08 01:30:00"
  , "2015-03-08 01:45:00"
  , "2015-03-08 03:00:00"
  , "2015-03-08 03:15:00"
  , "2015-03-08 03:30:00"
  , "2015-03-08 03:45:00"
  , "2015-03-08 04:00:00"
    # Fall back
  , "2015-11-01 00:00:00"
  , "2015-11-01 00:15:00"
  , "2015-11-01 00:30:00"
  , "2015-11-01 00:45:00"
  , "2015-11-01 01:00:00"
  , "2015-11-01 01:15:00"
  , "2015-11-01 01:30:00"
  , "2015-11-01 01:45:00"
  , "2015-11-01 01:00:00"
  , "2015-11-01 01:15:00"
  , "2015-11-01 01:30:00"
  , "2015-11-01 01:45:00"
  , "2015-11-01 02:00:00"
  , "2015-11-01 02:15:00"
)

dat <- data.frame(dt_str, stringsAsFactors=FALSE) %>%
  ## http://stackoverflow.com/questions/15230446/cumulative-sequence-of-occurrences-of-values 
  ## Add a sequence number, make sure we keep track of row order
  mutate(rownum = sequence(n())) %>%
  select(rownum, dt_str)

dat %>% print.data.frame()

##    rownum              dt_str
## 1       1 2015-03-08 00:00:00
## 2       2 2015-03-08 00:15:00
## 3       3 2015-03-08 00:30:00
## 4       4 2015-03-08 00:45:00
## 5       5 2015-03-08 01:00:00
## 6       6 2015-03-08 01:15:00
## 7       7 2015-03-08 01:30:00
## 8       8 2015-03-08 01:45:00
## 9       9 2015-03-08 03:00:00
## 10     10 2015-03-08 03:15:00
## 11     11 2015-03-08 03:30:00
## 12     12 2015-03-08 03:45:00
## 13     13 2015-03-08 04:00:00
## 14     14 2015-11-01 00:00:00
## 15     15 2015-11-01 00:15:00
## 16     16 2015-11-01 00:30:00
## 17     17 2015-11-01 00:45:00
## 18     18 2015-11-01 01:00:00
## 19     19 2015-11-01 01:15:00
## 20     20 2015-11-01 01:30:00
## 21     21 2015-11-01 01:45:00
## 22     22 2015-11-01 01:00:00
## 23     23 2015-11-01 01:15:00
## 24     24 2015-11-01 01:30:00
## 25     25 2015-11-01 01:45:00
## 26     26 2015-11-01 02:00:00
## 27     27 2015-11-01 02:15:00

Again, the timestamps in the 1am hour repeat in November: the first set are from DST, and the second set are not.

How can you deal with this?

Option 1: Filter out potential problem records

If you have no reliable way of determining which pass of the 1am-2am hour a record comes from, and it is important that you know which pass you are dealing with, it is best to just delete those observations.

To do this, however, you need to identify the day on which the fall back happens.

date_turn_back_clocks <- function(tzn, yr) {

    ## From the given year and timezone, construct a vector of all of the days in
    ## the given year and timezone, but as timestamps of noon of those days.
  diy_vec <- seq(ymd_hms(sprintf("%s-01-01T12:00:00", yr), tz=tzn)
    , ymd_hms(sprintf("%s-12-31T12:00:00", yr), tz=tzn)
    , by="day")

    ## Vector along days just made, indicating if noon of date is in dst or not.
  isdst <- dst(diy_vec)

    ## Short-circuit out if no dst, return an NA value in that case.
  if(!any(isdst)) return(NA_integer_)

    ## rle gives information about runs.  Here we find the runs and calculate the
    ## start and end index of those runs found by rle.
    ## See: http://masterr.org/r/how-to-find-consecutive-repeats-in-r/
    ## for some help.  I made the calc for runs_start_index.
  runs <- rle(isdst)
  runs_end_index <- cumsum(runs$length)
  runs_start_index <- runs_end_index %>% lag() %>% ifelse(is.na(.), 0, .) %>% {. + 1}

    ## Index of first day after DST ends
    ## logic: since for a given year, DST is in the opposite part of the year
    ## in the southern hemisphere, we have to be careful how we ID the first
    ## day after DST.  In the northern hemisphere, DST will be a consecutive
    ## block of days in the middle of the year (isdst will have an unbroken run
    ## of TRUE values).  In the Southern Hemisphere, DST will span the new year,
    ## so Jan 1 will be in DST, then the middle of the year will have a run
    ## of isdst == FALSE, and then a run of isdst == TRUE at end of year.
    ## To ID this, find the runs of FALSEs in the year (1 run in the S. Hemi., 2 in
    ## the N. Hemi.), and then find the first index of the last run of FALSEs
    ## in the year.
    ## Ex: general pattern of days that are in dst (T) or not in dst (F)
    ##  N. Hemi:
    ##  Jan ... Mar ... Nov ... Dec
    ##  F F ... F T ... T F ... F F
    ##
    ##  S. Hemi:
    ##  Jan ... Mar ... Nov ... Dec
    ##  T T ... T F ... F T ... T T
    ##
    ## The first index of the last FALSE run is the day on which the clocks rolled back:
    ## we check the DST status of noon of each day, and on the first day of that
    ## last FALSE run, the clocks rolled back earlier that morning.
  first_day_after_dst_index <- runs_start_index[runs$values == FALSE] %>% rev() %>% .[1]

    ## Return that date
  diy_vec[first_day_after_dst_index] %>% as.Date()
}

  ## Memoise the function, so repeated calls to the same year and timezone are
  ## just a hash lookup and not the whole computation given in the function definition.
date_turn_back_clocks <- memoise::memoise(date_turn_back_clocks)

  ## And Vectorize it
date_turn_back_clocks <- Vectorize(date_turn_back_clocks)

  # America/Chicago
date_turn_back_clocks(tzn="America/Chicago", yr=2011) %>% as.Date(origin="1970-01-01")

## America/Chicago 
##    "2011-11-06"

  # No DST
date_turn_back_clocks(tzn="Africa/Johannesburg", yr=2011) %>% as.Date(origin="1970-01-01")

## Africa/Johannesburg 
##                  NA

  # S. Hemi, spring is the end of DST
date_turn_back_clocks(tzn="Pacific/Easter", yr=2011) %>% as.Date(origin="1970-01-01")

## Pacific/Easter 
##   "2011-05-08"

So now we have a function to identify the day of the year on which DST ends in a given timezone. Let's use it to filter out the records we don't want (the 1am hour on the fall-back day):

dat %>%
  mutate(dt_tz = ymd_hms(dt_str, tz="America/New_York")) %>%
  mutate(end_dst_date = date_turn_back_clocks(tz(dt_tz), year(dt_tz)) %>% as.Date(origin="1970-01-01")) %>%
  filter(!((as.Date(dt_tz) == end_dst_date) & (hour(dt_tz) == 1))) %>%
  ungroup()

##    rownum              dt_str               dt_tz end_dst_date
## 1       1 2015-03-08 00:00:00 2015-03-08 00:00:00   2015-11-01
## 2       2 2015-03-08 00:15:00 2015-03-08 00:15:00   2015-11-01
## 3       3 2015-03-08 00:30:00 2015-03-08 00:30:00   2015-11-01
## 4       4 2015-03-08 00:45:00 2015-03-08 00:45:00   2015-11-01
## 5       5 2015-03-08 01:00:00 2015-03-08 01:00:00   2015-11-01
## 6       6 2015-03-08 01:15:00 2015-03-08 01:15:00   2015-11-01
## 7       7 2015-03-08 01:30:00 2015-03-08 01:30:00   2015-11-01
## 8       8 2015-03-08 01:45:00 2015-03-08 01:45:00   2015-11-01
## 9       9 2015-03-08 03:00:00 2015-03-08 03:00:00   2015-11-01
## 10     10 2015-03-08 03:15:00 2015-03-08 03:15:00   2015-11-01
## 11     11 2015-03-08 03:30:00 2015-03-08 03:30:00   2015-11-01
## 12     12 2015-03-08 03:45:00 2015-03-08 03:45:00   2015-11-01
## 13     13 2015-03-08 04:00:00 2015-03-08 04:00:00   2015-11-01
## 14     14 2015-11-01 00:00:00 2015-11-01 00:00:00   2015-11-01
## 15     15 2015-11-01 00:15:00 2015-11-01 00:15:00   2015-11-01
## 16     16 2015-11-01 00:30:00 2015-11-01 00:30:00   2015-11-01
## 17     17 2015-11-01 00:45:00 2015-11-01 00:45:00   2015-11-01
## 18     26 2015-11-01 02:00:00 2015-11-01 02:00:00   2015-11-01
## 19     27 2015-11-01 02:15:00 2015-11-01 02:15:00   2015-11-01

Option 2: Cast the first and second passes of the fall-back overlap hour to the right UTC times

In order to do this, you have to be able to take a row and determine whether it is from the first pass over the 1am hour or the second. If these are local timestamps, you won't be able to use the raw timestamps to determine this, as they look identical between the first and second passes. You will have to ascertain the pass from the order in which records occur in the logfile. So the following assumptions are made in this solution:

  • Log records are read in time-sequence order -- that is, records occurring earlier in the file also occur earlier in time, regardless of the timestamp.
  • Records exist in both the first pass of the overlap hour and the second pass.
  • At least one record in the second pass has an earlier timestamp than some timestamp in the first pass.

These are all assumptions the code needs in order to determine whether a given log record's timestamp occurs during the first pass at the overlap hour or the second.

dat %>%
  mutate(dt_tz = ymd_hms(dt_str, tz="America/New_York")) %>%
  mutate(end_dst_date = date_turn_back_clocks(tz(dt_tz), year(dt_tz)) %>% as.Date(origin="1970-01-01")) %>%
  mutate(fallback_hour = ifelse((as.Date(dt_tz) == end_dst_date) & (hour(dt_tz) == 1), TRUE, FALSE )) %>%
  group_by(end_dst_date, fallback_hour) %>%
  do((function(df) {
    if (!any(df$fallback_hour)) {
        ## Normal, just cast to UTC
      df$dt_utc <- with_tz(df$dt_tz, "UTC")
    } else {
        ## Have to determine if timestamp is 1st or 2nd pass through
        ## Lagged differences between successive timestamps are positive except
        ## at the moment the clock turns back; a running min (capped at 0 by the
        ## extra argument passed to min) goes negative there and stays negative,
        ## flagging every row of the second pass.
      dt_tz_lag <- dplyr::lag(df$dt_tz)
      second_pass <- purrr::accumulate(df$dt_tz - dt_tz_lag, min, na.rm=TRUE, 0) %>%
        {ifelse((. < 0) & !is.na(.), TRUE, FALSE)}

        ## If second pass, with_tz works correctly by its assumptions (R parsed the
        ## ambiguous hour as the second, EST pass); in the first pass we need to
        ## subtract an hour.
      df$dt_utc <- ifelse(second_pass
                          , with_tz(df$dt_tz, "UTC")
                          , with_tz(df$dt_tz, "UTC") - hours(1)) %>%
                      as.POSIXct(origin="1970-01-01")
    }

    df
  })(.)) %>%
  ungroup() %>%
  arrange(rownum) %>%
  print.data.frame()

##    rownum              dt_str               dt_tz end_dst_date fallback_hour              dt_utc
## 1       1 2015-03-08 00:00:00 2015-03-08 00:00:00   2015-11-01         FALSE 2015-03-08 05:00:00
## 2       2 2015-03-08 00:15:00 2015-03-08 00:15:00   2015-11-01         FALSE 2015-03-08 05:15:00
## 3       3 2015-03-08 00:30:00 2015-03-08 00:30:00   2015-11-01         FALSE 2015-03-08 05:30:00
## 4       4 2015-03-08 00:45:00 2015-03-08 00:45:00   2015-11-01         FALSE 2015-03-08 05:45:00
## 5       5 2015-03-08 01:00:00 2015-03-08 01:00:00   2015-11-01         FALSE 2015-03-08 06:00:00
## 6       6 2015-03-08 01:15:00 2015-03-08 01:15:00   2015-11-01         FALSE 2015-03-08 06:15:00
## 7       7 2015-03-08 01:30:00 2015-03-08 01:30:00   2015-11-01         FALSE 2015-03-08 06:30:00
## 8       8 2015-03-08 01:45:00 2015-03-08 01:45:00   2015-11-01         FALSE 2015-03-08 06:45:00
## 9       9 2015-03-08 03:00:00 2015-03-08 03:00:00   2015-11-01         FALSE 2015-03-08 07:00:00
## 10     10 2015-03-08 03:15:00 2015-03-08 03:15:00   2015-11-01         FALSE 2015-03-08 07:15:00
## 11     11 2015-03-08 03:30:00 2015-03-08 03:30:00   2015-11-01         FALSE 2015-03-08 07:30:00
## 12     12 2015-03-08 03:45:00 2015-03-08 03:45:00   2015-11-01         FALSE 2015-03-08 07:45:00
## 13     13 2015-03-08 04:00:00 2015-03-08 04:00:00   2015-11-01         FALSE 2015-03-08 08:00:00
## 14     14 2015-11-01 00:00:00 2015-11-01 00:00:00   2015-11-01         FALSE 2015-11-01 04:00:00
## 15     15 2015-11-01 00:15:00 2015-11-01 00:15:00   2015-11-01         FALSE 2015-11-01 04:15:00
## 16     16 2015-11-01 00:30:00 2015-11-01 00:30:00   2015-11-01         FALSE 2015-11-01 04:30:00
## 17     17 2015-11-01 00:45:00 2015-11-01 00:45:00   2015-11-01         FALSE 2015-11-01 04:45:00
## 18     18 2015-11-01 01:00:00 2015-11-01 01:00:00   2015-11-01          TRUE 2015-11-01 05:00:00
## 19     19 2015-11-01 01:15:00 2015-11-01 01:15:00   2015-11-01          TRUE 2015-11-01 05:15:00
## 20     20 2015-11-01 01:30:00 2015-11-01 01:30:00   2015-11-01          TRUE 2015-11-01 05:30:00
## 21     21 2015-11-01 01:45:00 2015-11-01 01:45:00   2015-11-01          TRUE 2015-11-01 05:45:00
## 22     22 2015-11-01 01:00:00 2015-11-01 01:00:00   2015-11-01          TRUE 2015-11-01 06:00:00
## 23     23 2015-11-01 01:15:00 2015-11-01 01:15:00   2015-11-01          TRUE 2015-11-01 06:15:00
## 24     24 2015-11-01 01:30:00 2015-11-01 01:30:00   2015-11-01          TRUE 2015-11-01 06:30:00
## 25     25 2015-11-01 01:45:00 2015-11-01 01:45:00   2015-11-01          TRUE 2015-11-01 06:45:00
## 26     26 2015-11-01 02:00:00 2015-11-01 02:00:00   2015-11-01         FALSE 2015-11-01 07:00:00
## 27     27 2015-11-01 02:15:00 2015-11-01 02:15:00   2015-11-01         FALSE 2015-11-01 07:15:00

Note that in November, on the day the clocks turn back, the first pass through the 1am hour maps to the correct UTC hour, and the second pass through 1am maps to the next UTC hour, blending correctly with what 2am standard time becomes.

