Created
April 26, 2021 17:09
Alex Martelli on _Cleaning Data_

I made it -- going far deeper into the text than I had planned to do on a
"first skim" because the text is worth it. Here's my review, which I do not
believe I shall need to edit.

I started reviewing this text with very high expectations.
First, I know the author, and I know he thinks sharply and writes engagingly,
convincingly, and clearly, to present his thinking.

Second, I'm convinced that the subject is vital, yet apparently sadly neglected
in the literature and academic courses -- I've seen offhand estimates
(applications of Pareto's Law, maybe?) that 80% of the work of a typical data
engineer (I shun the phrase "data scientist": the use of "data science" instead
of "data engineering" in the book's title is my only substantial quibble with
the book!-) is acquiring, verifying, cleaning, and preparing data, and only 20%
is actually applying statistics, machine learning, or other forms of analytics.
Whether that Pareto-like 80% estimate is accurate or just a bit generous, the
parts of the job covered by this book nevertheless account for the majority of
our time and effort.

And yet, all fellow practitioners I've discussed this with have learned most of
what this book teaches "on the job" by trial and error and/or mentoring by more
experienced peers, rather than academic courses, seminars, tutorials, or books.

Typically, starting on a book (or most anything else in life) with very high
expectations is a recipe for disappointment, as the book (or whatever) may fall
short of your high hopes for it.

This book, for me, was an exception: my high expectations were vastly
surpassed; indeed, what the book delivered ran rings around them! If I
had to rate the book on a scale from 1 to 5 stars, I would refuse... because
even 6 stars would not be enough!-)

The book is highly pragmatic yet quite usefully structured and sequenced. I met
almost all the topics I would expect (the few exceptions are minor ones, such
as protocol buffers as an important data format, and the use of wget and curl,
rather than a couple of funky text-only browsers, to fetch HTML from the web
for scraping)... PLUS, I've actually LEARNED stuff I could have used myself in
the course of a few years of professional practice... but didn't realize on my
own, and was never taught (for example, usable heuristics to detect -- and
possibly correct -- sampling bias in one's input data sets).

Some of the issues the book covers are elementary to intermediate; others are
wickedly advanced (such as t-SNE -- the book does not delve into its
theoretical underpinnings, it just shows how to use relevant Python and R
packages)... but both kinds will stand you in good stead as a data-engineering
practitioner, by whatever name you refer to your craft ("analytics", "business
intelligence", "data science", or whatever!-).

I could, of course, add some quibbles to try to show I'm not a fanboy...
For example, the author's observation that "edit distance" for strings is not
transitive (if the distance between A and B is 5, and that between B and C is
5, then the distance between A and C can be anything between 0 and 10) also
trivially applies to Euclidean distance on a plane (or, on a sphere, to
great-circle, AKA Haversine, distance)... yet nobody's ever argued that
Euclidean/Haversine distances between points on a plane or sphere are not
proper metrics (!).
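
The arithmetic behind that quibble can be checked with a minimal Python sketch. Everything in it is illustrative rather than taken from the book: the `edit_distance` helper is a plain Levenshtein implementation, and the strings and points are hypothetical values chosen only so that both "legs" have length 5. `math.dist` requires Python 3.8 or later.

```python
import math


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance (insert, delete, substitute), two-row DP."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        prev = curr
    return prev[-1]


# Hypothetical strings: B is 5 edits from A, and each C is 5 edits from B.
A, B = "aaaaaaaaaa", "bbbbbaaaaa"
C_SAME, C_FAR = "aaaaaaaaaa", "bbbbbbbbbb"

string_cases = [(edit_distance(A, B), edit_distance(B, c), edit_distance(A, c))
                for c in (C_SAME, C_FAR)]

# Hypothetical points on a plane with the same 5-and-5 setup.
P, Q = (0.0, 0.0), (5.0, 0.0)
R_SAME, R_FAR = (0.0, 0.0), (10.0, 0.0)

point_cases = [(math.dist(P, Q), math.dist(Q, r), math.dist(P, r))
               for r in (R_SAME, R_FAR)]

for d_ab, d_bc, d_ac in string_cases + point_cases:
    # Triangle inequality: the only constraint linking the three distances.
    assert abs(d_ab - d_bc) <= d_ac <= d_ab + d_bc
    print(d_ab, d_bc, d_ac)
```

In both the string and the point cases the A-to-C distance comes out as 0 in one example and 10 in the other, which is all the triangle inequality promises for any metric.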

But, that would be supererogatory on my part! I'd rather close with the
practical observation that I'm strongly urging every colleague I interact with,
who has anything to do with data processing (with my job at Google, that's most
of them:-), to buy and study the book: it WILL be well worth their time and
energy!