Last active February 26, 2020 20:44
Cleaning Data for Effective Data Science
Doing the other 80% of the work

In order for something to become clean, something else must become dirty.
–Imbesi's Law of the Conservation of Filth
It is something of a truism in data science, data analysis, or machine learning that most of the work needed to do your actual work lies in cleaning your data. The subtitle of this work alludes to a commonly assigned percentage. A keynote speaker I listened to at a data science conference a few years ago made a joke—perhaps one already widely repeated by the time he told it—about a conversation with a colleague of his. The colleague complained that data cleaning took up half of her time, in response to which the speaker expressed astonishment that it could be as little as 50%.
Without worrying too much about assigning a precise percentage, in my experience working as a technologist and data scientist, I have found that the bulk of what I do is preparing my data for the statistical analyses, machine learning models, or nuanced visualizations I would like to use it for. Although hopeful executives, or technical managers a bit removed from the daily work, tend to hold an eternal optimism that the next set of data the organization acquires will be clean and easy to work with, I have yet to find that true in my concrete experience.
Certainly, some data is better and some is worse. But all data is dirty, at least within a very small margin of error in the tally. Even data sets that have been published, carefully studied, and widely distributed as canonical examples for statistics textbooks or software libraries generally have a moderate number of data integrity problems. Even after our best pre-processing, the attainable goal is to make our data less dirty; making it fully clean remains a utopian aspiration.
By all means we should distinguish data quality from data utility. The two are roughly orthogonal to each other. Data can be dirty (up to a point) but still be enormously useful. Data can be (relatively) clean but have little purpose, or at least not be fit for purpose. Concerns about the choice of measurements to collect, about possible selection bias, or about other methodological or scientific questions are mostly outside the scope of this book. A fair number of the techniques I present can aid in evaluating the utility of data, but there is often no mechanical method of remedying systemic issues. For example, statistics and other analyses may reveal—or at least strongly suggest—the unreliability of a certain data field. But the techniques in this book cannot, in general, automatically fix that unreliable data or collect better data.
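To make that last point concrete, here is a minimal, hypothetical sketch in Python of how plain statistics can flag a suspect field without being able to repair it. The data set, the sentinel value, and the threshold are all invented for illustration; the screen is a standard median-absolute-deviation check, not a technique the book prescribes.

```python
import statistics

# Hypothetical measurements; -999 is a typical undocumented sentinel
# for "missing" that silently poisons naive summary statistics.
heights_cm = [172.0, 168.5, 181.2, -999.0, 175.3, 169.9, -999.0, 177.8]

def suspect_values(values, k=3.0):
    """Return values lying more than k scaled median-absolute-deviations
    from the median: a robust, purely mechanical screen."""
    med = statistics.median(values)
    mad = statistics.median(abs(v - med) for v in values)
    # The factor 1.4826 makes the MAD comparable to a standard
    # deviation when the data is roughly normal.
    return [v for v in values if abs(v - med) > k * 1.4826 * mad]

flagged = suspect_values(heights_cm)
print(flagged)  # the -999.0 sentinels stand out
```

Note what the screen does not do: it cannot tell us whether the flagged values are sentinels, typos, or genuine outliers, and it certainly cannot recover the true heights. That judgment, and any remediation, remains a human task.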