@sanand0
Last active August 29, 2015 14:23

Automating news discovery in real-time

Talk outline

  • How media works
    • There's a difference in positioning: in-depth vs breaking news
    • Talent crunch and margin pressures: not enough staff to 'break news'
    • Sources of breaking news: agencies, in-house, competition, social media
    • Increasingly, social media is a dominant source
  • How can we source social media data at scale?
    • Twitter vs Facebook vs Google Trends vs ...: accessibility vs reach
    • Streaming in real-time (importance of sub-second responses for TV)
    • Parallel extraction: sockets & threads -- importance of async (and why Node.js is better than Python 2)
    • Storage: JSON and the coming of age of RDBMSs (and why Postgres is as good as MongoDB)
    • Distributed scraping -- building a headless browser farm
    • Client-side scraper farms as alternatives -- building Chrome plugins
  • Filtering sources for insights
    • Why traditional entity extraction fails
    • Fuzzy matching in the Indian context: key-collision vs distance-based methods
    • How visuals help flexibly identify topic clusters -- k-means and beyond
    • Determining the importance and relevance of topics
    • Manual vs automated filtering -- negative lists
  • Structure of the final solution -- what it looked like, and what it resulted in
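
The outline contrasts key-collision and distance-based fuzzy matching without showing either. The sketch below is one possible illustration, not the method used in the talk: `fingerprint` builds a key-collision key (lowercase, punctuation removed, unique tokens sorted), and `edit_distance` is a plain Levenshtein distance. The function names and the sample names are assumptions made up for this example.

```python
import re
from collections import defaultdict


def fingerprint(name):
    """Key-collision key: lowercase, drop punctuation, sort unique tokens."""
    tokens = re.sub(r"[^\w\s]", " ", name.lower()).split()
    return " ".join(sorted(set(tokens)))


def cluster_by_key(names):
    """Group names whose fingerprints collide."""
    groups = defaultdict(list)
    for name in names:
        groups[fingerprint(name)].append(name)
    return dict(groups)


def edit_distance(a, b):
    """Levenshtein distance using two rows of the dynamic-programming table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        prev = curr
    return prev[-1]


# Illustrative sample only, not data from the talk.
names = ["Narendra Modi", "Modi, Narendra", "narendra modi", "Narender Modi"]

# Key collision merges reordered and re-cased variants into one group.
clusters = cluster_by_key(names)

# The spelling variant gets a different key, so it needs a distance measure.
distance = edit_distance(fingerprint("Narendra Modi"), fingerprint("Narender Modi"))
```

Key collision is cheap (one pass, one dictionary lookup per name) but only catches variants that normalise to the same key. Distance-based matching catches spelling variants such as transliteration differences, at the cost of pairwise comparisons and a threshold that has to be tuned.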