Automating news discovery in real-time

Talk outline

  • How media works
    • There's a difference in positioning: in-depth vs breaking news
    • Talent crunch and margin pressure: not enough staff to 'break news'
    • Sources of breaking news: agencies, in-house, competition, social media
    • Increasingly, social media is a dominant source
  • How can we source social media data at scale?
    • Twitter vs Facebook vs Google Trends vs ...: accessibility vs reach
    • Streaming in real-time (importance of sub-second responses for TV)
    • Parallel extraction: Sockets & threads -- importance of async (and why node.js is better than Python 2)
    • Storage: JSON and the coming of age of RDBMSs (and why Postgres is as good as MongoDB)
    • Distributed scraping -- building a headless browser farm
    • Client-side scraper farms as alternatives -- building Chrome plugins
  • Filtering sources for insights
    • Why traditional entity extraction fails
    • Fuzzy matching in the Indian context: key-collision vs distance-based methods
    • How visuals help flexibly identify topic clusters -- k-means and beyond
    • Determining the importance and relevance of topics
    • Manual vs automated filtering -- negative-lists
  • Structure of the final solution -- what it looked like, and what it resulted in
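
The "key-collision vs distance-based" contrast in the fuzzy matching item can be illustrated with a minimal sketch. It is not the talk's implementation. The function names (`fingerprint`, `levenshtein`, `same_entity`), the threshold of 2 edits, and the example names are all assumptions made for illustration. A key-collision method maps each string to a normalised key and treats equal keys as the same entity. A distance-based method counts the edits needed to turn one string into another.

```python
import re
import unicodedata


def fingerprint(name: str) -> str:
    """Key-collision key: lowercase, drop accents and punctuation, sort unique tokens."""
    ascii_name = (
        unicodedata.normalize("NFKD", name).encode("ascii", "ignore").decode("ascii")
    )
    tokens = re.sub(r"[^\w\s]", " ", ascii_name.lower()).split()
    return " ".join(sorted(set(tokens)))


def levenshtein(a: str, b: str) -> int:
    """Distance-based measure: minimum number of single-character edits."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def same_entity(a: str, b: str, max_distance: int = 2) -> bool:
    """Hypothetical rule: try the cheap key collision first, then fall back to distance."""
    key_a, key_b = fingerprint(a), fingerprint(b)
    return key_a == key_b or levenshtein(key_a, key_b) <= max_distance


if __name__ == "__main__":
    # Word order and punctuation differ: caught by key collision.
    print(same_entity("Modi, Narendra", "narendra modi"))
    # Transliteration variant: caught by edit distance.
    print(same_entity("Arvind Kejriwal", "Arvind Kejrival"))
    # Shared surname only: should not match.
    print(same_entity("Rahul Gandhi", "Sonia Gandhi"))
```

Key collision is cheap because keys can be hashed and grouped in one pass, but it misses spelling variants. Edit distance catches those variants at the cost of pairwise comparisons, so the sketch tries the key first.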