@jaredhirsch
Created August 4, 2015 00:17
activity feed in a hurry

What do we need?

  1. Scrape a page
  2. Store the result in a relational DB
  3. Also store the result in a full-text index
  4. Expose a search API
  5. On a search hit, return the full record and its search score

How do we do this quickly / easily?

  1. Scraping: ReaderMode
  • This is in Firefox (FF) already, and it works decently enough
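// Assumes a privileged (chrome) scope such as the Browser Console.
Components.utils.import('resource://gre/modules/ReaderMode.jsm');

let result;
let url = 'http://time.com/3983182/inflatable-minion-dublin';
ReaderMode.downloadAndParseDocument(url)
  .then(function(x) { result = x; }, console.error);

// The promise resolves asynchronously; inspect result once it has settled.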
result.url; // "http://time.com/3983182/inflatable-minion-dublin"
result.title; // "Giant Inflatable Minion Causes Chaos on Dublin Road"
result.byline; // "Sarah Begley							@SCBegley"
result.excerpt; // "The balloon got loose from a fairground"
result.length; // 906
result.dir; // undefined (??? todo)
result.content; // the content is messy: a lot of HTML is still in it,
                // see https://gist.github.com/6a68/d074fcff51cc39aa7e11
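Since result.content still carries a lot of HTML, it probably needs flattening to plain text before it goes into a full-text index. A minimal sketch, assuming a naive regex pass is good enough for a first cut; `stripTags` is a hypothetical helper, not part of ReaderMode, and it does not decode entities or handle script/style bodies:

```javascript
// Hypothetical helper: crude tag stripper for indexing purposes only.
// It drops anything that looks like a tag and collapses whitespace.
function stripTags(html) {
  return html
    .replace(/<[^>]*>/g, ' ')  // replace each tag with a space
    .replace(/\s+/g, ' ')      // collapse runs of whitespace
    .trim();
}
```

In a browser context, parsing the markup and reading textContent would likely be more robust than a regex.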

  2. Store the result in a relational DB
  3. Also store the result in a full-text index
  4. Expose a search API
  5. On a search hit, fetch the corresponding record & search score
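The notes do not pick a database or an indexing library, so here is only a rough sketch of how steps 2 to 5 fit together, using in-memory stand-ins. Everything named here (`records`, `index`, `save`, `search`, `tokenize`) is hypothetical, and the score is a plain term count rather than anything a real full-text engine would compute:

```javascript
// Hypothetical in-memory stand-ins for the two stores.
const records = new Map(); // id -> full record (the "relational DB")
const index = new Map();   // term -> Map(id -> occurrences) (the "full-text index")
let nextId = 1;

// Lowercase the text and split it into alphanumeric terms.
function tokenize(text) {
  return text.toLowerCase().match(/[a-z0-9]+/g) || [];
}

// Steps 2 and 3: store the record, then index its title and excerpt.
function save(article) {
  const id = nextId++;
  records.set(id, article);
  const text = [article.title, article.excerpt].join(' ');
  for (const term of tokenize(text)) {
    if (!index.has(term)) index.set(term, new Map());
    const postings = index.get(term);
    postings.set(id, (postings.get(id) || 0) + 1);
  }
  return id;
}

// Steps 4 and 5: look up each query term, sum the counts per record,
// and return the full records with their scores, best match first.
function search(query) {
  const scores = new Map();
  for (const term of tokenize(query)) {
    const postings = index.get(term);
    if (!postings) continue;
    for (const [id, count] of postings) {
      scores.set(id, (scores.get(id) || 0) + count);
    }
  }
  return Array.from(scores, ([id, score]) => ({ record: records.get(id), score }))
    .sort((a, b) => b.score - a.score);
}
```

Calling `save(result)` with the ReaderMode output and then `search('minion')` would return an array of `{ record, score }` objects. Swapping the two Maps for a real database and index should leave the `save`/`search` shape unchanged.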