@jonathaneunice
Last active August 29, 2015 14:18

Dirty Data Still Dirty
Sad Panda Still Sad

The New York Stock Exchange (NYSE) historical records contain multiple, absurdly novice errors. It's dirty data, referring to days that don't exist, and giving multiple inconsistent answers for over a month of trading days. And this is the data set of record, from the world's leading financial market--a massively well-resourced organization for which information management and sharing is the core mission. If they can't, or don't, or don't care enough to get it right...just wow.

NYSE: Why you make panda cry?

TL;DR

I've been looking through NYSE trading days for an analysis / visualization project. You would think, in 2015, that NYSE historical records would have long since been well-scrubbed and cross-verified. Being an optimist, you might even think they'd be easy to get and consume. A REST API returning JSON, say. Or at least some form of XML. You'd be wrong.

My starting point was https://www.nyse.com/data/transactions-statistics-data-library. Look under "NYSE DAILY SHARE VOLUME IN NYSE LISTED ISSUES".

screencap

I'm really just looking for which dates were actual trading days, and what was the volume on those dates. Simple!

Yet in what should be a straightforward historical record, there are 13 files, provided in three distinct file types (.dat, .prn, and .html). Hope you're ready to do a little ad hoc parsing!

Once parsed, there are two dates that are clearly wrong. Unless, you know, there really were 80 months in 1950, or November had 31 days in 1900.
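Catching this kind of thing takes very little code. Here's a minimal sketch, not the actual script used for this analysis: it assumes each parsed record has already been reduced to a `(year, month, day)` tuple, and the sample rows are made up to mimic the two kinds of error described above.

```python
from datetime import date

def invalid_dates(rows):
    """Return the (year, month, day) tuples that are not real calendar dates."""
    bad = []
    for year, month, day in rows:
        try:
            date(year, month, day)  # raises ValueError for impossible dates
        except ValueError:
            bad.append((year, month, day))
    return bad

# Made-up rows for illustration; the tuple layout is an assumption.
sample = [(1950, 8, 1), (1950, 80, 1), (1900, 11, 31), (1900, 11, 30)]
print(invalid_dates(sample))  # → [(1950, 80, 1), (1900, 11, 31)]
```

Letting `datetime.date` do the validation means leap years and month lengths come for free, rather than from a hand-rolled table.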

I also found 32 "duplicates": Dates for which multiple, inconsistent volume figures were given:

1928: 12/26 12/27 12/28
1944: 09/11
1945: 08/01 08/02 08/03 08/06 08/07 08/08 08/09 08/10 08/13 08/14 08/17
      08/20 08/21 08/22 08/23 08/24 08/27 08/28 08/29 08/30 08/31
1951: 12/18
1954: 12/27
1963: 04/15 09/30
1999: 06/16
2003: 02/24 03/24
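A check like this one can be sketched as follows. It's an illustration under assumptions, not the code behind the list above: records are assumed to be `(date_string, volume)` pairs, and the volumes shown are invented, not real NYSE figures.

```python
from collections import defaultdict

def inconsistent_dates(records):
    """Return {date: sorted distinct volumes} for dates with more than one distinct volume."""
    volumes = defaultdict(set)
    for day, volume in records:
        volumes[day].add(volume)
    return {day: sorted(vals) for day, vals in volumes.items() if len(vals) > 1}

# Invented volumes for illustration only.
records = [
    ("1944-09-11", 500),
    ("1944-09-11", 650),  # same date, different volume: flagged
    ("1944-09-12", 700),
    ("1944-09-12", 700),  # exact repeat: redundant but consistent, not flagged
]
print(inconsistent_dates(records))  # → {'1944-09-11': [500, 650]}
```

Collecting volumes into a set per date separates harmless exact repeats from the real problem, which is the same date carrying two different answers.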

It's very much a truism that when you do data analysis and visualization:

ALL DATA IS DIRTY.

NEVER TRUST IT.

Always clean it up, homogenize its formats, and do whatever consistency and "does this make sense??" cross-checks are feasible.
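As a rough sketch of what "homogenize and sanity-check" can mean in practice, here are two small helpers. Everything specific in them is an assumption: the date layouts in `DATE_FORMATS` and the comma-separated volume style are guesses at what mixed files might contain, not a description of the actual NYSE formats.

```python
from datetime import datetime

# Assumed input layouts; the real files may well use others.
DATE_FORMATS = ("%m/%d/%Y", "%Y%m%d", "%Y-%m-%d")

def normalize_date(text):
    """Try each assumed layout; return an ISO date string, or None if none fits."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue  # wrong layout or impossible date; try the next one
    return None

def normalize_volume(text):
    """Drop thousands separators; return a positive int, or None if it fails the check."""
    try:
        value = int(text.strip().replace(",", ""))
    except ValueError:
        return None
    return value if value > 0 else None

print(normalize_date("12/26/1928"), normalize_date("19281226"))
print(normalize_date("11/31/1900"))   # impossible date
print(normalize_volume("1,234,567"), normalize_volume("n/a"))
```

Returning `None` instead of raising keeps the bad rows around for inspection, which is the point: you want a list of what's wrong, not a crash on the first oddity.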

But still! It's 2015! We're decades into the "Internet is for information" age. You'd think that organizations that provide data sets of record--data on which investors, economists, and policy makers depend--would provide clean, consistent, easily-consumed information. JSON, or CSV, or something! But we get multi-part, multi-format, parse-it-all-yourself, errors-akimbo data instead!

Truth be told, I've only barely started digging into it. This is just Pass 1. There might be--read: assuredly are--other errors still lurking. In some cases, it would require near-heroic efforts to cross-check. How do you check the market volume for a given day, for instance? You'd go to the source of record, the market itself... OH. Yeah. Already there. This is a financial market. They keep the world's financial books. If they can't get it right... Abandon all hope ye who enter here?

Yet here we are, with dozens of clearly identifiable errors and inconsistent values given.

Please, people! Clean your data sets! Provide them in simple, consistent formats! Especially when you're the source of record. Do your job!

Until such time, users of the data: Caveat lector.

Colophon

Author: Jonathan Eunice, aka @jeunice on the Twitters

Image: That panda image seems ubiquitous on the web. I used reverse image search to seek its true author / originating source, but there's no associated metadata and I couldn't identify an authoritative source or authorship.

The Real Pandas Involved: pandas.pydata.org

