notes from Extract Conf

30 Oct 2015

4th one they've done

David White of import.io

web data at scale:

free data extractor tool: URL -> CSV/JSON; paid data feeds

1M websites so far, 500k users

ex. validating US business address data (~25% are wrong, 7% are missing)

ex. a book publisher running ~3M queries/day

Owen Thomas interviewed Chris Lambert of Lyft

Lyft grew 5x in < 1 year

A/B testing pricing algorithms is hard: can't easily bucket drivers & passengers

actually doing less testing now, more simulating

"like SimCity, but visualizing is different"

run hundreds of times, because results are variable from individual runs
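
Toy sketch of why they aggregate: a single stochastic run of a marketplace sim is noisy, so you average over hundreds of runs. All the numbers here (match probability, driver/rider counts) are made up, not Lyft's model:

```python
import random
import statistics

def simulate_match_rate(n_drivers=100, n_riders=120, p_match=0.8):
    """One noisy run of a toy ride marketplace (hypothetical numbers)."""
    free_drivers = n_drivers
    matched = 0
    for _ in range(n_riders):
        if free_drivers > 0 and random.random() < p_match:
            matched += 1
            free_drivers -= 1
    return matched / n_riders

# single runs vary, so aggregate over hundreds of them
runs = [simulate_match_rate() for _ in range(500)]
print(f"mean={statistics.mean(runs):.3f}, stdev={statistics.stdev(runs):.3f}")
```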

  • 100% on AWS
  • ~40 microservices
  • Redshift, DynamoDB, MongoDB & caching
  • mostly python
  • NLP on ratings/feedback

Redshift for exploratory analysis (he expressed surprise that more people aren't using it yet)

Kamelia Aryafar from Etsy

~21M buyers, 1.4M sellers; unique items, no bar codes; goal: make the big marketplace feel smaller

Personalized models for recommendations:

  1. item
  2. user
  3. shop

Matrix Factorization approach: decompose & create scoring for new items users haven't seen before
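
A minimal sketch of the matrix-factorization idea (truncated SVD on a toy user-item matrix; the notes don't specify Etsy's actual model, so this is just the general technique):

```python
import numpy as np

rng = np.random.default_rng(0)
R = rng.integers(0, 6, size=(20, 30)).astype(float)  # toy users x items scores

# decompose into low-rank user and item factors
U, s, Vt = np.linalg.svd(R, full_matrices=False)
k = 5                                # latent dimensions (arbitrary here)
user_factors = U[:, :k] * s[:k]
item_factors = Vt[:k, :].T

# score user 0 against item 7, even with no prior interaction between them
print(user_factors[0] @ item_factors[7])
```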

Locality-Sensitive Hashing, Latent Dirichlet Allocation
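
Minimal random-hyperplane LSH sketch, just to show the mechanic (dimensions and plane count are arbitrary, not Etsy's setup):

```python
import numpy as np

rng = np.random.default_rng(42)
dim, n_planes = 64, 16
planes = rng.standard_normal((n_planes, dim))  # random hyperplanes

def signature(v):
    """Bit signature: which side of each hyperplane the vector falls on."""
    return tuple((planes @ v) > 0)

item = rng.standard_normal(dim)
near = item + 0.05 * rng.standard_normal(dim)  # slightly perturbed copy
far = rng.standard_normal(dim)                 # unrelated vector

print(signature(item) == signature(near))  # usually True: same bucket
print(signature(item) == signature(far))   # usually False
```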

A/B testing to compare

readout: business metrics

age-test: newer models with more data always do better than older models with less data

Matt Kruger at WhoSay

match celebrities with brands, get better reach

Andrew Ng of baidu

Deep Learning: still mostly supervised classification. The ideas are ~20 years old, but it works better now because of scale: make very large neural networks and feed them lots of labeled data as (x, y) pairs.
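
Toy version of that supervised (x, y) workflow, using scikit-learn on synthetic data (nothing like Baidu scale, just the shape of it):

```python
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

# labeled (x, y) pairs -- synthetic here; at scale these would be millions
# of images or audio clips with human labels
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

clf = MLPClassifier(hidden_layer_sizes=(64, 64), max_iter=500, random_state=0)
clf.fit(X, y)
print(clf.score(X, y))
```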

computer vision improvements

can now input (image, question) --> "the bus is red"

monetizes really well

duLight: wearable camera for blind people, has a speaker, face recognition, can read labels (all in Chinese)

Speech recognition: 100k hours' worth of data, 95% accurate recognition. Older algorithms plateau, but deep learning improves with more data.

Matt Ellsworth

example project: DrizzyAI (an actual Twitter account) is a bot; scraped lyrics with import.io, used the Node library "twist" to get a stream of tweets
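
The stream side in Python instead of Node would look something like this with tweepy's 3.x streaming API (credentials and track terms are placeholders; my sketch, not the talk's code):

```python
import tweepy

class LyricBotListener(tweepy.StreamListener):
    def on_status(self, status):
        print(status.text)  # a real bot would reply with a generated lyric here

# placeholder credentials from a Twitter developer account
auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
auth.set_access_token("ACCESS_TOKEN", "ACCESS_SECRET")

stream = tweepy.Stream(auth=auth, listener=LyricBotListener())
stream.filter(track=["drake"])  # stream tweets matching a track term
```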

Gabriela de Queiroz at Sharethrough

The first 9 months as the only Data Scientist at her company:

  1. no data dictionary: no description of variables or their context
     • designed metrics, unified terminology
  2. data unusable in raw form
     • created a pipeline
  3. no infrastructure for experiments
  4. no integration with DevOps or engineering

Paul Ellwood at Netflix

A/B tests: too many ways to look at the data (seasonality, non-normal distributions, high-dimensional data)

building smarter analytics products:

  • anomalies as input
  • personalized to the user and use-case
  • filtered and ranked by impact
  • usable: UX best practices
  • ability to dig deeper

Case Study on Payment Processing (sketch after this list):

  • multiple dimensions: banks, transactions, payment mechanisms
  • hundreds of thousands of time series
  • filter on location, rank of impact ($)
  • bins, last 45 days trend
  • showed an example of tracking down a credit union in Brazil having problems processing payments
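
A minimal sketch of the per-series flagging idea: compare today's value to a trailing 45-day baseline (the window matches the "last 45 days" note above; the z-score threshold and data are assumptions):

```python
import numpy as np
import pandas as pd

def flag_anomaly(series: pd.Series, window: int = 45, z_thresh: float = 3.0) -> bool:
    """Flag one time series if today's value falls far outside its trailing baseline."""
    baseline = series.iloc[-(window + 1):-1]   # last `window` days, excluding today
    z = (series.iloc[-1] - baseline.mean()) / (baseline.std() + 1e-9)
    return abs(z) > z_thresh

# fake payment-success series: stable, then a sudden drop on the last day
rng = np.random.default_rng(1)
rates = pd.Series(np.append(rng.normal(0.97, 0.005, 60), 0.80))
print(flag_anomaly(rates))  # True
```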

Sumeet Howe of GoodData

emphasized how much companies underappreciate the financial value of their data assets

SAAS - BI (data -> viz)

most companies use < 40% of the data they collect

Travis Brooks from Yelp

83M mobile users & ~83M reviews

~68% of searches are from mobile

live in 32 countries

originally thought they'd just connect users with local businesses, but we eat 3x a day! So they ended up focusing on food.

"100 best restaurants":

  • what does best mean?
  • What's a restaurant?
  • for what? for whom? where?
  • locals vs. tourist reviews
  • popularity vs. quality

Wilson score as an estimate of the confidence interval; changed to Dirichlet in 2015, weighted toward the newest reviews; smoothing by adding pseudocounts
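
For reference, minimal versions of both pieces: the Wilson lower bound for a proportion, plus pseudocount smoothing of star counts (the z value and flat prior are my assumptions, not Yelp's parameters):

```python
import math

def wilson_lower_bound(pos: int, n: int, z: float = 1.96) -> float:
    """Lower bound of the Wilson score interval for a binomial proportion."""
    if n == 0:
        return 0.0
    phat = pos / n
    centre = phat + z * z / (2 * n)
    margin = z * math.sqrt((phat * (1 - phat) + z * z / (4 * n)) / n)
    return (centre - margin) / (1 + z * z / n)

def smoothed_mean(star_counts, pseudocounts=(1, 1, 1, 1, 1)):
    """Dirichlet-style smoothing: add pseudocounts to observed 1-5 star counts."""
    counts = [c + p for c, p in zip(star_counts, pseudocounts)]
    return sum((i + 1) * c for i, c in enumerate(counts)) / sum(counts)

print(wilson_lower_bound(90, 100))      # ~0.83, vs. the raw 0.90
print(smoothed_mean([0, 0, 1, 2, 10]))  # ~4.2, pulled toward the prior from raw ~4.7
```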

conclusions: location matters to reviewers as much as food

Matt Sundquist of Plotly

  • JSON for everything
  • maps
  • pandas + cufflinks (pandas bindings for Plotly)
  • matplotlib is bad
  • now Dashboards with React

Ignacio Elola of Import.io

  • create extractors, combine extractors, schedule extraction
  • depends on the number of data tables, and how much pre- and post-processing
  • XPaths: can customize rather than letting import.io guess for you
  • avoid hard-coding/indexing; better to use tags that will still work if the page is redesigned (see the sketch after this list)
  • Regex
  • Required Column anchor: example of Amazon's different pages for the same book on Kindle vs. hard copy
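
Quick lxml illustration of the hard-coded-index pitfall (HTML and attribute names invented for the example):

```python
from lxml import html

doc = html.fromstring("""
<html><body>
  <div id="promo">Deal of the day</div>
  <div><span itemprop="price">$12.99</span></div>
</body></html>""")

# brittle: the positional index breaks as soon as the promo div moves or disappears
print(doc.xpath("/html/body/div[2]/span/text()"))     # ['$12.99'] -- for now

# robust: a semantic attribute is more likely to survive a redesign
print(doc.xpath("//span[@itemprop='price']/text()"))  # ['$12.99']
```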

Quality of data (sketch after this list):

  • completeness (is each record complete)
  • coverage (# of expected records)
  • data type tests, and format tests
  • manual sampling to confirm (+) controls
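
A rough pandas version of these checks (the `price` column and expected row count are hypothetical):

```python
import pandas as pd

def quality_report(df: pd.DataFrame, expected_rows: int) -> dict:
    """Rough completeness/coverage/type checks over an extracted table."""
    return {
        # completeness: share of records with no missing fields
        "completeness": float((~df.isna().any(axis=1)).mean()),
        # coverage: records received vs. records expected
        "coverage": len(df) / expected_rows,
        # type test: share of `price` values that parse as numbers
        "price_numeric": float(pd.to_numeric(df["price"], errors="coerce").notna().mean()),
    }

df = pd.DataFrame({"title": ["a", "b", None], "price": ["9.99", "oops", "12.50"]})
print(quality_report(df, expected_rows=4))
```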

Post-processing (sketch after this list):

  • anomaly detection
  • variance, variability, and noise
  • normalization
  • confidence score
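
And a tiny sketch of the normalization and confidence-score steps (z-score normalization plus an invented noise-based confidence heuristic):

```python
import numpy as np

values = np.array([9.99, 10.05, 10.01, 25.00, 10.02])  # one obvious outlier

# normalization: z-scores put different feeds on a common scale
z = (values - values.mean()) / values.std()

# confidence score (invented heuristic): penalize high relative variability
confidence = 1.0 / (1.0 + values.std() / values.mean())

print(np.round(z, 2), round(confidence, 2))
```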