@DavidMertz
Created April 26, 2021 17:09
Alex Martelli on _Cleaning Data_
I made it -- going far deeper into the text than I had planned to do on a
"first skim" because the text is worth it. Here's my review, which I do not
believe I shall need to edit.
I started reviewing this text with very high expectations.
First, I know the author, and I know he thinks sharply and writes engagingly,
convincingly, and clearly, to present his thinking.
Second, I'm convinced that the subject is vital, yet apparently sadly neglected
in the literature and academic courses -- I've seen offhand estimates
(applications of Pareto's Law, maybe?) that 80% of the work of a typical data
engineer (I shun the phrase "data scientist": the use of "data science" instead
of "data engineering" in the book's title is my only substantial quibble with
the book!-) is acquiring, verifying, cleaning, and preparing data, and only 20%
is actually applying statistics, machine learning, or other forms of analytics.
Whether that Pareto-like 80% estimate is accurate or somewhat inflated, the
parts of the job covered by this book are nevertheless the majority of how we
spend our time and effort.
And yet, all fellow practitioners I've discussed this with have learned most of
what this book teaches "on the job" by trial and error and/or mentoring by more
experienced peers, rather than academic courses, seminars, tutorials, or books.
Typically, starting on a book (or most anything else in life) with very high
expectations is a recipe for disappointment, as the book (or whatever) may fall
short of your high hopes for it.
This book, for me, was an exception: my high expectations were vastly
surpassed, indeed had "rings run around" them, by what the book delivered to
me. If I had to rate the book on a scale from 1 to 5 stars, I would refuse...
because even 6 stars would not be enough!-)
The book is highly pragmatic yet quite usefully structured and sequenced. I met
almost all the topics I would expect (among the few exceptions, minor ones such
as protocol buffers as an important data format, and the use of wget and curl
rather than a couple of funky text-only browsers to fetch HTML from the web for
scraping)... PLUS, I've actually LEARNED stuff I could have used myself in the
course of a few years of professional practice... but didn't realize on my own,
and was never taught (for example, usable heuristics to detect -- and possibly
correct -- sampling bias in one's input data sets).
Some of the issues the book covers are elementary to intermediate, some of the
others are wickedly advanced (such as t-SNE -- the book does not delve into its
theoretical underpinnings, it just shows how to use relevant Python and R
packages)... but both kinds will stand you in good stead as a data-engineering
practitioner, by whatever name you refer to your craft ("analytics", "business
intelligence", "data science", or whatever!-).
I could, of course, add some quibbles to try to show I'm not a fanboy...
For example, the author's observation that "edit distance" for strings is not
transitive (if the distance between A and B is 5, and that between B and C is
5, then the distance between A and C can be anything between 0 and 10) also
trivially applies to Euclidean distance on a plane (or a sphere, AKA Haversine
distance)... yet nobody's ever argued that Euclidean/Haversine distances
between points on a plane or sphere are not a proper metric (!).
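To make the arithmetic behind that quibble concrete, here is a minimal sketch, assuming the plain Levenshtein definition of edit distance (insertions, deletions, and substitutions each cost 1). The function name and the sample strings are mine, chosen purely for illustration; they are not taken from the book.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance via the classic row-by-row dynamic program.

    Assumes unit cost for insertion, deletion, and substitution.
    """
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


# Illustrative strings (hypothetical): A and B are 5 edits apart.
A = "aaaaaaaaaa"
B = "aaaaabbbbb"

# Each candidate C is also 5 edits from B, yet its distance to A varies.
for C in ("aaaaaaaaaa", "aaaaaccccc", "bbbbbbbbbb"):
    print(C, levenshtein(A, B), levenshtein(B, C), levenshtein(A, C))
```

With these strings, the A-to-C distance comes out as 0, 5, and 10 respectively while both A-to-B and B-to-C stay at 5, which is exactly the range the triangle inequality allows for any metric.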
But, that would be supererogatory on my part! I'll rather close with the
practical observation that I'm strongly urging every colleague I interact with,
who has anything to do with data processing (with my job at Google, that's most
of them:-), to buy and study the book: it WILL be well worth their time and
energy!