30Oct2015
4th one they've done
web data at scale:
free data extractor tool: URL -> CSV/JSON; paid data feeds
1M websites so far, 500k users
ex. validating US business address data (~25% are wrong, 7% are missing)
ex. a book publisher running ~3M queries/day
Lyft grew 5x in < 1 year
A/B testing pricing algorithms is hard: can't easily bucket drivers & passengers
actually doing less testing now, more simulating
"like SimCity, but visualizing is different"
run hundreds of times, because results are variable from individual runs
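A toy sketch of why the repeated runs matter: a made-up supply/demand simulation (parameters and matching rule are my own illustration, nothing like Lyft's actual simulator) where any single run is noisy, so you read out the distribution over hundreds of runs:

```python
import random
import statistics

def simulate_once(n_drivers=100, n_passengers=120, surge=1.2, seed=None):
    """One toy marketplace run: random supply/demand under a pricing knob."""
    rng = random.Random(seed)
    # Higher prices thin out demand a little and attract a bit more supply.
    demand = sum(rng.random() < 1.0 / surge for _ in range(n_passengers))
    supply = sum(rng.random() < min(1.0, 0.8 * surge) for _ in range(n_drivers))
    rides = min(demand, supply)
    return rides * surge  # crude revenue proxy

# Any single run is noisy, so run it hundreds of times and look at the spread.
results = [simulate_once(seed=i) for i in range(500)]
print(statistics.mean(results), statistics.stdev(results))
```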
- 100% on AWS
- ~40 microservices
- Redshift, DynamoDB, MongoDB & caching
- mostly python
- NLP on ratings/feedback
Redshift for exploratory analysis (he expressed surprise that more people aren't using it yet)
~21M buyers, 1.4M sellers; unique items, no bar codes
goal: make the big marketplace feel smaller
Personalized models for recommendations: 1. item, 2. user, 3. shop
Matrix Factorization approach: decompose & create scoring for new items users haven't seen before
Locality-Sensitive Hashing; Latent Dirichlet Allocation
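A minimal sketch of the matrix-factorization scoring idea on a toy interaction matrix (the rank, data, and library calls are my own illustration, not Etsy's pipeline):

```python
import numpy as np

# Toy interaction matrix: rows = users, cols = items (1 = favorited/purchased).
R = np.array([
    [1, 0, 1, 0, 0],
    [0, 1, 0, 1, 0],
    [1, 1, 0, 0, 1],
    [0, 0, 1, 1, 0],
], dtype=float)

# Low-rank decomposition via truncated SVD: R ~ U * S * Vt.
k = 2
U, s, Vt = np.linalg.svd(R, full_matrices=False)
user_factors = U[:, :k] * s[:k]   # (n_users, k)
item_factors = Vt[:k, :].T        # (n_items, k)

# Score every item for user 0, then rank the ones they haven't interacted with.
scores = user_factors[0] @ item_factors.T
unseen = np.where(R[0] == 0)[0]
ranked = unseen[np.argsort(-scores[unseen])]
print(ranked, scores[ranked])
```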
A/B testing to compare
readout: business metrics
age-test: newer models with more data always do better than older models with less data
match celebrities with brands, get better reach
Deep Learning: still mostly supervised classification; ~20 years old, but it works better now because of scale
make very large neural networks, feed them lots of labeled data as (x, y) pairs
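A minimal sketch of that supervised (x, y) recipe: a tiny one-hidden-layer network trained by gradient descent on toy labels (nothing Baidu-specific, just the general idea at miniature scale):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))                        # features
y = (X[:, 0] * X[:, 1] > 0).astype(float)[:, None]   # labels

W1 = rng.normal(scale=0.5, size=(2, 16)); b1 = np.zeros(16)
W2 = rng.normal(scale=0.5, size=(16, 1)); b2 = np.zeros(1)

for _ in range(2000):
    h = np.tanh(X @ W1 + b1)                  # hidden layer
    p = 1 / (1 + np.exp(-(h @ W2 + b2)))      # predicted probability
    grad_logits = (p - y) / len(X)            # cross-entropy gradient
    grad_h = grad_logits @ W2.T * (1 - h**2)  # backprop through tanh
    W2 -= 0.5 * h.T @ grad_logits; b2 -= 0.5 * grad_logits.sum(0)
    W1 -= 0.5 * X.T @ grad_h;      b1 -= 0.5 * grad_h.sum(0)

print("train accuracy:", ((p > 0.5) == y).mean())
```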
computer vision improvements
can now input (image, question) --> "the bus is red"
monetizes really well
duLight: wearable camera for blind people, has a speaker, face recognition, can read labels (all in Chinese)
Speech recognition: 100k hours' worth of data, 95% accurate recognition
older algorithms plateau, but deep learning keeps improving with more data
example project: DrizzyAI (an actual Twitter account), a bot that scraped lyrics with import.io and used the node library twist to get a stream of tweets
The first 9 months as the only Data Scientist at her company:
- no data dictionary: no description of variables or their context
- designed metrics, unified terminology
- data unusable as raw
- created a pipeline
- no infrastructure for experiments
- no integration with DevOps or engineering
A/B tests: too many ways to look at the data (seasonality, non-normal distributions, high-dimensional data)
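One way to sidestep the non-normality problem is a rank-based test instead of a plain t-test; a sketch on made-up lognormal metrics (my illustration, not the speaker's actual method):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
# Skewed (non-normal) metric, e.g. revenue per user, in control vs. treatment.
control = rng.lognormal(mean=3.0, sigma=1.0, size=5000)
treatment = rng.lognormal(mean=3.05, sigma=1.0, size=5000)

# A t-test assumes roughly normal data; a rank-based test does not.
t_stat, t_p = stats.ttest_ind(control, treatment, equal_var=False)
u_stat, u_p = stats.mannwhitneyu(control, treatment, alternative="two-sided")
print(f"Welch t-test p={t_p:.4f}, Mann-Whitney p={u_p:.4f}")
```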
building smarter analytics products:
- anomalies as input
- personalized to the user and use-case
- filtered and ranked by impact
- usable: UX best practices
- ability to dig deeper
Case Study on Payment Processing:
- multiple dimensions: banks, transactions, payment mechanisms
- hundreds of thousands of time series
- filter on location, rank by impact ($) (see the sketch after this list)
- bins, last 45 days trend
- showed an example of tracking down a credit union in Brazil having problems processing payments
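A rough sketch of the filter-and-rank-by-impact step with pandas; the columns and numbers are hypothetical, not the real payments data:

```python
import pandas as pd

# Hypothetical anomaly table: one row per flagged (bank, country) time series.
anomalies = pd.DataFrame({
    "bank":        ["A", "B", "C", "D"],
    "country":     ["BR", "BR", "US", "US"],
    "failed_txns": [1200, 300, 80, 40],
    "avg_value":   [55.0, 40.0, 90.0, 25.0],
})

anomalies["impact_usd"] = anomalies["failed_txns"] * anomalies["avg_value"]
# Filter on location, then rank by dollar impact so the biggest problem surfaces first.
top = (anomalies[anomalies["country"] == "BR"]
       .sort_values("impact_usd", ascending=False))
print(top)
```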
emphasized that the financial value of data assets is underappreciated
SaaS BI (data -> viz)
most companies use < 40% of the data they collect
83M mobile users & ~83M reviews
~68% of searches are from mobile
live in 32 countries
originally thought they'd just connect users with local businesses, but we eat 3x a day! so they ended up focusing on food
"100 best restaurants":
- what does best mean?
- What's a restaurant?
- for what? for whom? where?
- locals vs. tourist reviews
- popularity vs. quality
Wilson score as an estimate of the confidence interval
changed to Dirichlet in 2015, weighted toward the newest reviews; smoothing by adding pseudocounts
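A small sketch of the pseudocount idea: put a Dirichlet prior over the 1-5 star buckets and rank by the smoothed expected rating (the prior strength here is arbitrary, not Yelp's production value, and this omits the recency weighting):

```python
import numpy as np

def smoothed_score(star_counts, prior=1.0):
    """Expected star rating under a Dirichlet prior: add `prior` pseudocounts
    to each of the 1..5 star buckets so sparse restaurants aren't over-ranked."""
    counts = np.asarray(star_counts, dtype=float) + prior
    probs = counts / counts.sum()
    return probs @ np.arange(1, 6)

# A place with two 5-star reviews vs. one with hundreds of mostly-positive reviews.
print(smoothed_score([0, 0, 0, 0, 2]))        # pulled toward the prior mean of 3
print(smoothed_score([5, 10, 40, 200, 300]))  # barely moved by the pseudocounts
```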
conclusions: location matters to reviewers, as much as food
- JSON for everything
- maps
- pandas cufflinks
- matplotlib is bad
- now Dashboards with React
- create extractors, combine extractors, schedule extraction
- depends on the number of data tables, and how much pre- and post-processing
- XPaths: can customize rather than letting import.io guess for you
- avoid hard-coded indexing; better to anchor on tags that will still work if the page is redesigned (see the sketch after this list)
- Regex
- Required Column anchor: example of Amazon's different pages for the same book on Kindle vs. hard copy
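A quick illustration of the brittle-vs-robust XPath point, using lxml directly rather than import.io's extractor (the HTML and attribute names are invented):

```python
from lxml import html

# Toy page; in practice this comes from the site you're extracting.
page = html.fromstring("""
<div>
  <span class="price">$12.99</span>
  <h1 itemprop="name">Example Book</h1>
</div>
""")

# Brittle: relies on element order, breaks if the page is redesigned.
brittle = page.xpath("//div/span[1]/text()")

# More robust: anchor on a semantic attribute that survives layout changes.
title = page.xpath("//*[@itemprop='name']/text()")
price = page.xpath("//*[contains(@class, 'price')]/text()")
print(brittle, title, price)
```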
Quality of data:
- completeness (is each record complete)
- coverage (# of expected records)
- data type tests and format tests (see the sketch after this list)
- manual sampling to confirm (+) controls
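A pandas sketch of those checks on a toy extract (column names, expected row count, and the price format are all assumptions):

```python
import pandas as pd

df = pd.DataFrame({
    "title": ["Book A", "Book B", None],
    "price": ["12.99", "9.50", "N/A"],
})

# Completeness: is each record fully populated?
completeness = df.notna().all(axis=1).mean()

# Coverage: did we get roughly the number of records we expected?
expected_rows = 3
coverage = len(df) / expected_rows

# Type/format test: prices should look like decimal numbers.
price_ok = df["price"].astype(str).str.match(r"^\d+(\.\d{1,2})?$")

print(f"complete: {completeness:.0%}, coverage: {coverage:.0%}, bad prices: {(~price_ok).sum()}")
```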
Post-processing:
- anomaly detection (see the sketch after this list)
- variance, variability, and noise
- normalization
- confidence score
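A rough post-processing sketch: rolling z-score anomaly detection against recent noise, min-max normalization, and a crude confidence score (the window and thresholds are arbitrary choices, not from the talk):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
series = pd.Series(rng.normal(100, 5, size=60))
series.iloc[45] = 160  # inject an anomaly

# Anomaly detection: rolling z-score against recent variability/noise.
rolling_mean = series.rolling(14, min_periods=5).mean()
rolling_std = series.rolling(14, min_periods=5).std()
zscore = (series - rolling_mean) / rolling_std
anomalies = zscore.abs() > 3

# Normalization: rescale to [0, 1] so different series are comparable.
normalized = (series - series.min()) / (series.max() - series.min())

# Crude confidence score: how far past the threshold the point sits.
confidence = (zscore.abs() / 3).clip(upper=1.0)
print(series[anomalies], confidence.max())
```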