30Oct2015
4th one they've done
web data at scale:
free data extractor tool: URL -> CSV/JSON; paid data feeds
1M websites so far, 500k users
ex. validating US business address data (~25% are wrong, 7% are missing)
ex. a book publisher running ~3M queries/day
Lyft grew 5x in < 1 year
A/B testing pricing algorithms is hard: can't easily bucket drivers & passengers
actually doing less testing now, more simulating
"like SimCity, but visualizing is different"
run hundreds of times, because results are variable from individual runs
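A toy sketch of why the repeated runs matter: a made-up supply/demand simulation (parameters and matching rule are my own illustration, nothing like Lyft's actual simulator) where any single run is noisy, so you read out the distribution over hundreds of runs:

```python
import random
import statistics

def simulate_once(n_drivers=100, n_passengers=120, surge=1.2, seed=None):
    """One toy marketplace run: random supply/demand under a pricing knob."""
    rng = random.Random(seed)
    # Higher prices thin out demand a little and attract a bit more supply.
    demand = sum(rng.random() < 1.0 / surge for _ in range(n_passengers))
    supply = sum(rng.random() < min(1.0, 0.8 * surge) for _ in range(n_drivers))
    rides = min(demand, supply)
    return rides * surge  # crude revenue proxy

# Any single run is noisy, so run it hundreds of times and look at the spread.
results = [simulate_once(seed=i) for i in range(500)]
print(statistics.mean(results), statistics.stdev(results))
```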
- 100% on AWS
- ~40 microservices
- Redshift, DynamoDB, MongoDB & caching
- mostly python
- NLP on ratings/feedback
Redshift for exploratory analysis (he expressed surprise that more people aren't using it yet)
~21M buyers, 1.4M sellers; unique items, no bar codes
goal: make the big marketplace feel smaller
Personalized models for recommendations: 1. item, 2. user, 3. shop
Matrix Factorization approach: decompose & create scoring for new items users haven't seen before
Locality-Sensitive Hashing; Latent Dirichlet Allocation
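A minimal sketch of the matrix-factorization scoring idea on a toy interaction matrix (the rank, data, and library calls are my own illustration, not Etsy's pipeline):

```python
import numpy as np

# Toy interaction matrix: rows = users, cols = items (1 = favorited/purchased).
R = np.array([
    [1, 0, 1, 0, 0],
    [0, 1, 0, 1, 0],
    [1, 1, 0, 0, 1],
    [0, 0, 1, 1, 0],
], dtype=float)

# Low-rank decomposition via truncated SVD: R ~ U * S * Vt.
k = 2
U, s, Vt = np.linalg.svd(R, full_matrices=False)
user_factors = U[:, :k] * s[:k]   # (n_users, k)
item_factors = Vt[:k, :].T        # (n_items, k)

# Score every item for user 0, then rank the ones they haven't interacted with.
scores = user_factors[0] @ item_factors.T
unseen = np.where(R[0] == 0)[0]
ranked = unseen[np.argsort(-scores[unseen])]
print(ranked, scores[ranked])
```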
A/B testing to compare
readout: business metrics
age-test: newer models with more data always do better than older models with less data
match celebrities with brands, get better reach
Deep Learning: still mostly supervised classification; ~20 years old, but it works better now because of scale
make very large neural networks, feed them lots of labeled data as (x, y) pairs
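A minimal sketch of that supervised (x, y) recipe: a tiny one-hidden-layer network trained by gradient descent on toy labels (nothing Baidu-specific, just the general idea at miniature scale):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))                        # features
y = (X[:, 0] * X[:, 1] > 0).astype(float)[:, None]   # labels

W1 = rng.normal(scale=0.5, size=(2, 16)); b1 = np.zeros(16)
W2 = rng.normal(scale=0.5, size=(16, 1)); b2 = np.zeros(1)

for _ in range(2000):
    h = np.tanh(X @ W1 + b1)                  # hidden layer
    p = 1 / (1 + np.exp(-(h @ W2 + b2)))      # predicted probability
    grad_logits = (p - y) / len(X)            # cross-entropy gradient
    grad_h = grad_logits @ W2.T * (1 - h**2)  # backprop through tanh
    W2 -= 0.5 * h.T @ grad_logits; b2 -= 0.5 * grad_logits.sum(0)
    W1 -= 0.5 * X.T @ grad_h;      b1 -= 0.5 * grad_h.sum(0)

print("train accuracy:", ((p > 0.5) == y).mean())
```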
computer vision improvements
can now input (image, question) --> "the bus is red"
monetizes really well
duLight: wearable camera for blind people, has a speaker, face recognition, can read labels (all in Chinese)
Speech recognition: 100k hours' worth of data, 95% accurate recognition
older algorithms plateau, but deep learning keeps improving with more data
example project: DrizzyAI (an actual Twitter account), a bot that scraped lyrics with import.io and used the node library twist to get a stream of tweets
The first 9 months as the only Data Scientist at her company:
- no data dictionary: no description of variables or their context
- designed metrics, unified terminology
- data unusable as raw
- created a pipeline
- no infrastructure for experiments
- no integration with DevOps or engineering
A/B tests: too many ways to look at the data (seasonality, non-normal distributions, high-dimensional data)
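One way to sidestep the non-normality problem is a rank-based test instead of a plain t-test; a sketch on made-up lognormal metrics (my illustration, not the speaker's actual method):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
# Skewed (non-normal) metric, e.g. revenue per user, in control vs. treatment.
control = rng.lognormal(mean=3.0, sigma=1.0, size=5000)
treatment = rng.lognormal(mean=3.05, sigma=1.0, size=5000)

# A t-test assumes roughly normal data; a rank-based test does not.
t_stat, t_p = stats.ttest_ind(control, treatment, equal_var=False)
u_stat, u_p = stats.mannwhitneyu(control, treatment, alternative="two-sided")
print(f"Welch t-test p={t_p:.4f}, Mann-Whitney p={u_p:.4f}")
```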
building smarter analytics products:
- anomalies as input
- personalized to the user and use-case
- filtered and ranked by impact
- usable: UX best practices
- ability to dig deeper
Case Study on Payment Processing:
- multiple dimensions: banks, transactions, payment mechanisms
- hundreds of thousands of time series
- filter on location, rank by impact ($) (see the sketch after this list)
- bins, last 45 days trend
- showed an example of tracking down a credit union in Brazil having problems processing payments
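A rough sketch of the filter-and-rank-by-impact step with pandas; the columns and numbers are hypothetical, not the real payments data:

```python
import pandas as pd

# Hypothetical anomaly table: one row per flagged (bank, country) time series.
anomalies = pd.DataFrame({
    "bank":        ["A", "B", "C", "D"],
    "country":     ["BR", "BR", "US", "US"],
    "failed_txns": [1200, 300, 80, 40],
    "avg_value":   [55.0, 40.0, 90.0, 25.0],
})

anomalies["impact_usd"] = anomalies["failed_txns"] * anomalies["avg_value"]
# Filter on location, then rank by dollar impact so the biggest problem surfaces first.
top = (anomalies[anomalies["country"] == "BR"]
       .sort_values("impact_usd", ascending=False))
print(top)
```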
emphasized that the financial value of data assets is underappreciated
SaaS BI (data -> viz)
most companies use < 40% of the data they collect
83M mobile users & ~83M reviews
~68% of searches are from mobile
live in 32 countries
originally thought they'd just connect users with local businesses, but we eat 3x a day! so they ended up focusing on food
"100 best restaurants":
- what does best mean?
- What's a restaurant?
- for what? for whom? where?
- locals vs. tourist reviews
- popularity vs. quality
Wilson score as an estimate of the confidence interval
changed to Dirichlet in 2015, weighted toward the newest reviews; smoothing by adding pseudocounts
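A small sketch of the pseudocount idea: put a Dirichlet prior over the 1-5 star buckets and rank by the smoothed expected rating (the prior strength here is arbitrary, not Yelp's production value, and this omits the recency weighting):

```python
import numpy as np

def smoothed_score(star_counts, prior=1.0):
    """Expected star rating under a Dirichlet prior: add `prior` pseudocounts
    to each of the 1..5 star buckets so sparse restaurants aren't over-ranked."""
    counts = np.asarray(star_counts, dtype=float) + prior
    probs = counts / counts.sum()
    return probs @ np.arange(1, 6)

# A place with two 5-star reviews vs. one with hundreds of mostly-positive reviews.
print(smoothed_score([0, 0, 0, 0, 2]))        # pulled toward the prior mean of 3
print(smoothed_score([5, 10, 40, 200, 300]))  # barely moved by the pseudocounts
```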
conclusions: location matters to reviewers, as much as food
- JSON for everything
- maps
- pandas cufflinks
- matplotlib is bad
- now Dashboards with React
- create extractors, combine extractors, schedule extraction
- depends on the number of data tables, and how much pre- and post-processing
- XPaths: can customize rather than letting import.io guess for you
- avoid hard-coded indexing; better to anchor on tags that will still work if the page is redesigned (see the sketch after this list)
- Regex
- Required Column anchor: example of Amazon's different pages for the same book on Kindle vs. hard copy
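A quick illustration of the brittle-vs-robust XPath point, using lxml directly rather than import.io's extractor (the HTML and attribute names are invented):

```python
from lxml import html

# Toy page; in practice this comes from the site you're extracting.
page = html.fromstring("""
<div>
  <span class="price">$12.99</span>
  <h1 itemprop="name">Example Book</h1>
</div>
""")

# Brittle: relies on element order, breaks if the page is redesigned.
brittle = page.xpath("//div/span[1]/text()")

# More robust: anchor on a semantic attribute that survives layout changes.
title = page.xpath("//*[@itemprop='name']/text()")
price = page.xpath("//*[contains(@class, 'price')]/text()")
print(brittle, title, price)
```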
Quality of data:
- completeness (is each record complete)
- coverage (# of expected records)
- data type tests and format tests (see the sketch after this list)
- manual sampling to confirm (+) controls
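A pandas sketch of those checks on a toy extract (column names, expected row count, and the price format are all assumptions):

```python
import pandas as pd

df = pd.DataFrame({
    "title": ["Book A", "Book B", None],
    "price": ["12.99", "9.50", "N/A"],
})

# Completeness: is each record fully populated?
completeness = df.notna().all(axis=1).mean()

# Coverage: did we get roughly the number of records we expected?
expected_rows = 3
coverage = len(df) / expected_rows

# Type/format test: prices should look like decimal numbers.
price_ok = df["price"].astype(str).str.match(r"^\d+(\.\d{1,2})?$")

print(f"complete: {completeness:.0%}, coverage: {coverage:.0%}, bad prices: {(~price_ok).sum()}")
```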
Post-processing:
- anomaly detection (see the sketch after this list)
- variance, variability, and noise
- normalization
- confidence score
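A rough post-processing sketch: rolling z-score anomaly detection against recent noise, min-max normalization, and a crude confidence score (the window and thresholds are arbitrary choices, not from the talk):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
series = pd.Series(rng.normal(100, 5, size=60))
series.iloc[45] = 160  # inject an anomaly

# Anomaly detection: rolling z-score against recent variability/noise.
rolling_mean = series.rolling(14, min_periods=5).mean()
rolling_std = series.rolling(14, min_periods=5).std()
zscore = (series - rolling_mean) / rolling_std
anomalies = zscore.abs() > 3

# Normalization: rescale to [0, 1] so different series are comparable.
normalized = (series - series.min()) / (series.max() - series.min())

# Crude confidence score: how far past the threshold the point sits.
confidence = (zscore.abs() / 3).clip(upper=1.0)
print(series[anomalies], confidence.max())
```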