Cleaning Data for Effective Data Science
Doing the other 80% of the work
In order for something to become clean, something else must become dirty.
–Imbesi's Law of the Conservation of Filth
It is something of a truism in data science, data analysis, or machine learning
that most of the work needed to do your actual work lies in cleaning your data.
The subtitle of this work alludes to a commonly assigned percentage. A keynote
speaker I listened to at a data science conference a few years ago made a joke—
perhaps one already widely repeated by the time he told it—about talking with a
colleague of his. The colleague complained that data cleaning took up half of
her time, in response to which the speaker expressed astonishment that it could
be as little as 50%.
Without worrying too much about assigning a precise percentage, in my experience
working as a technologist and data scientist, I have found that the bulk of what
I do is preparing my data for the statistical analyses, machine learning models,
or nuanced visualizations to which I would like to apply it. Although hopeful
executives, or technical managers a bit removed from the daily work, tend to have
an eternal optimism that the next set of data the organization acquires will be
clean and easy to work with, I have yet to find that to be true in my concrete
experience.
Certainly, some data is better and some is worse. But all data is dirty, at least
within a very small margin of error in the tally. Even data sets that have been
published, carefully studied, and that are widely distributed as canonical examples
for statistics textbooks or software libraries, generally have a moderate number
of data integrity problems. Even with our best pre-processing, the attainable
goal is to make our data less dirty; making it fully clean remains a utopian
aspiration.
By all means we should distinguish data quality from data utility. These descriptions
are roughly orthogonal to each other. Data can be dirty (up to a point) but still be
enormously useful. Data can be (relatively) clean but have little purpose, or at least
not be fit for purpose. Concerns about the choice of measurements to collect, or about
possible selection bias, or other methodological or scientific questions are mostly
outside the scope of this book. A fair number of the techniques I present can
aid in evaluating the utility of data, but there is often no mechanical method for
remedying systemic issues. For example, statistics and other analyses may reveal—
or at least strongly suggest—the unreliability of a certain data field. But the
techniques in this book cannot generally automatically fix that unreliable data or
collect better data.
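
To make that last distinction concrete, here is a minimal sketch, not code from the book itself, of how simple summary statistics might flag a field as suspect. It assumes a pandas DataFrame, and the function name `suspect_columns`, the thresholds, and the toy data are all illustrative choices rather than prescriptions. Checks like these can point at an unreliable field, but they cannot by themselves repair the data or collect better measurements.

```python
# A minimal sketch of flagging possibly unreliable fields with simple
# statistics.  The thresholds below are illustrative assumptions only.
import numpy as np
import pandas as pd

def suspect_columns(df: pd.DataFrame,
                    max_missing=0.2,     # flag if >20% of values are missing
                    max_mode_share=0.5,  # flag if one value dominates
                    max_zscore=6.0):     # flag if extreme outliers appear
    """Return a dict mapping column name -> list of reasons for suspicion."""
    reasons = {}
    for col in df.columns:
        flags = []
        series = df[col]
        missing = series.isna().mean()
        if missing > max_missing:
            flags.append(f"{missing:.0%} missing values")
        nonnull = series.dropna()
        if len(nonnull) and (nonnull.value_counts(normalize=True).iloc[0]
                             > max_mode_share):
            flags.append("a single value dominates the column")
        if pd.api.types.is_numeric_dtype(nonnull) and nonnull.std() > 0:
            z = np.abs((nonnull - nonnull.mean()) / nonnull.std())
            if (z > max_zscore).any():
                flags.append("extreme outliers by z-score")
        if flags:
            reasons[col] = flags
    return reasons

# Example: flags the 'temperature' column of this toy data for missing values.
toy = pd.DataFrame({"temperature": [21.5, 22.0, np.nan, 21.8, 999.0, np.nan],
                    "station": ["A", "B", "A", "C", "B", "A"]})
print(suspect_columns(toy))
```

A report like this only surfaces candidates for closer inspection; deciding whether a flagged field is genuinely unreliable, and what to do about it, remains a judgment about the data's utility rather than a mechanical cleaning step.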