Cleaning Data for Effective Data Science
Doing the other 80% of the work
In order for something to become clean, something else must become dirty.
–Imbesi's Law of the Conservation of Filth
It is something of a truism in data science, data analysis, or machine learning
that most of the work needed to do your actual work lies in cleaning your data.
The subtitle of this work alludes to a commonly assigned percentage. A keynote
speaker I listened to at a data science conference a few years ago made a joke—
perhaps one already widely repeated by the time he told it—about talking with a
colleague of his. The colleague complained that data cleaning took up half of
her time, in response to which the speaker expressed astonishment that it could
be as little as 50%.
Without worrying too much about assigning a precise percentage, in my experience
working as a technologist and data scientist, I have found that the bulk of what
I do is preparing my data for the statistical analyses, machine learning models,
or nuanced visualizations to which I would like to apply it. Although hopeful
executives, or technical managers a bit removed from the daily work, tend to have
an eternal optimism that the next set of data the organization acquires will be
clean and easy to work with, I have yet to find that to be true in my concrete
experience.
Certainly, some data is better and some is worse. But all data is dirty, at least
within a very small margin of error in the tally. Even data sets that have been
published, carefully studied, and that are widely distributed as canonical examples
for statistics textbooks or software libraries, generally have a moderate number
of data integrity problems. Even with our best pre-processing, the attainable
goal is to make our data less dirty; making it fully clean remains a utopian
aspiration.
By all means we should distinguish data quality from data utility. These descriptions
are roughly orthogonal to each other. Data can be dirty (up to a point) but still be
enormously useful. Data can be (relatively) clean but have little purpose, or at least
not be fit for purpose. Concerns about the choice of measurements to collect, or about
possible selection bias, or other methodological or scientific questions are mostly
outside the scope of this book. A fair number of the techniques I present can
aid in evaluating the utility of data, but there is often no mechanical method for
remedying systemic issues. For example, statistics and other analyses may reveal—
or at least strongly suggest—the unreliability of a certain data field. But the
techniques in this book cannot generally automatically fix that unreliable data or
collect better data.
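
To make that last distinction concrete, here is a minimal sketch, not code from the book itself, of how simple summary statistics might flag a field as suspect. It assumes a pandas DataFrame, and the function name `suspect_columns`, the thresholds, and the toy data are all illustrative choices rather than prescriptions. Checks like these can point at an unreliable field, but they cannot by themselves repair the data or collect better measurements.

```python
# A minimal sketch of flagging possibly unreliable fields with simple
# statistics.  The thresholds below are illustrative assumptions only.
import numpy as np
import pandas as pd

def suspect_columns(df: pd.DataFrame,
                    max_missing=0.2,     # flag if >20% of values are missing
                    max_mode_share=0.5,  # flag if one value dominates
                    max_zscore=6.0):     # flag if extreme outliers appear
    """Return a dict mapping column name -> list of reasons for suspicion."""
    reasons = {}
    for col in df.columns:
        flags = []
        series = df[col]
        missing = series.isna().mean()
        if missing > max_missing:
            flags.append(f"{missing:.0%} missing values")
        nonnull = series.dropna()
        if len(nonnull) and (nonnull.value_counts(normalize=True).iloc[0]
                             > max_mode_share):
            flags.append("a single value dominates the column")
        if pd.api.types.is_numeric_dtype(nonnull) and nonnull.std() > 0:
            z = np.abs((nonnull - nonnull.mean()) / nonnull.std())
            if (z > max_zscore).any():
                flags.append("extreme outliers by z-score")
        if flags:
            reasons[col] = flags
    return reasons

# Example: flags the 'temperature' column of this toy data for missing values.
toy = pd.DataFrame({"temperature": [21.5, 22.0, np.nan, 21.8, 999.0, np.nan],
                    "station": ["A", "B", "A", "C", "B", "A"]})
print(suspect_columns(toy))
```

A report like this only surfaces candidates for closer inspection; deciding whether a flagged field is genuinely unreliable, and what to do about it, remains a judgment about the data's utility rather than a mechanical cleaning step.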