sanand0/automating-news-discovery-in-real-time.md

Automating news discovery in real-time

Talk outline

How media works

There's a difference in positioning: in-depth vs breaking news
Talent crunch and margin pressures: not enough staff to 'break news'
Sources of breaking news: agencies, in-house, competition, social media
Increasingly, social media is a dominant source


How can we source social media data at scale?

Twitter vs Facebook vs Google Trends vs ...: accessibility vs reach
Streaming in real-time (importance of sub-second responses for TV)
Parallel extraction: sockets & threads -- importance of async (and why Node.js is better than Python 2)
Storage: JSON and the coming of age of RDBMSs (and why Postgres is as good as MongoDB)
Distributed scraping -- building a headless browser farm
Client-side scraper farms as alternatives -- building Chrome plugins
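
The point about async extraction above can be illustrated with a minimal sketch. It uses Python 3's `asyncio` rather than the Node.js stack the talk refers to, and the source names and latencies are hypothetical stand-ins for real API calls. The idea it shows: when requests are issued concurrently, total wait time is close to the slowest source, not the sum of all sources.

```python
import asyncio
import time

# Hypothetical sources with simulated latencies in seconds (assumptions,
# not measurements); each stands in for a real network request.
SOURCES = {"twitter": 0.10, "facebook": 0.20, "google_trends": 0.15}


async def fetch(source: str, delay: float) -> dict:
    """Pretend to call a remote API; the sleep stands in for network wait."""
    await asyncio.sleep(delay)
    return {"source": source, "items": [f"{source}-headline"]}


async def fetch_all(sources: dict) -> list:
    """Issue all requests concurrently; results come back in input order."""
    tasks = [fetch(name, delay) for name, delay in sources.items()]
    return await asyncio.gather(*tasks)


def run() -> tuple:
    """Run the concurrent fetch and report how long it took overall."""
    start = time.perf_counter()
    results = asyncio.run(fetch_all(SOURCES))
    elapsed = time.perf_counter() - start
    return results, elapsed


if __name__ == "__main__":
    results, elapsed = run()
    print([r["source"] for r in results], f"{elapsed:.2f}s")
```

A sequential loop over the same three sources would wait roughly the sum of the delays; the concurrent version waits roughly as long as the slowest one.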


Filtering sources for insights

Why traditional entity extraction fails
Fuzzy matching in the Indian context: key-collision vs distance-based methods
How visuals help flexibly identify topic clusters -- k-means and beyond
Determining the importance and relevance of topics
Manual vs automated filtering -- negative-lists
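
The contrast between key-collision and distance-based fuzzy matching mentioned above can be sketched as follows. This is an illustrative sketch, not the talk's implementation: the fingerprint recipe (lowercase, strip accents and punctuation, sort unique tokens) is one common key-collision method, Levenshtein distance is one common distance measure, and the place names are examples chosen here.

```python
import re
import unicodedata
from collections import defaultdict


def fingerprint(name: str) -> str:
    """Key-collision key: lowercase, drop accents/punctuation, sort unique tokens."""
    text = unicodedata.normalize("NFKD", name)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return " ".join(sorted(set(text.split())))


def cluster_by_key(names: list) -> dict:
    """Group names whose fingerprints collide (cheap: one pass over the data)."""
    groups = defaultdict(list)
    for name in names:
        groups[fingerprint(name)].append(name)
    return dict(groups)


def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    # Word order and punctuation differences collapse to the same key.
    print(cluster_by_key(["New Delhi", "Delhi, New", "NEW DELHI."]))
    # Transliteration variants have different keys, but a small edit distance.
    a, b = "Thiruvananthapuram", "Tiruvanantapuram"
    print(fingerprint(a) == fingerprint(b), levenshtein(a.lower(), b.lower()))
```

Key collision is fast but only catches variants that normalise to exactly the same key; distance-based matching also catches spelling and transliteration variants, at the cost of pairwise comparisons and a threshold that has to be tuned.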


Structure of the final solution -- what it looked like, and what it resulted in