@phillipoertel
Last active August 29, 2015 13:57
# Extract article metadata via the Embedly Extract API.
require 'embedly'

key     = ENV['KEY'] # Embedly API key, read from the environment
timeout = 60         # request timeout in seconds
client  = Embedly::API.new(key: key, timeout: timeout)

url = 'http://de.qantara.de/inhalt/zivilgesellschaftliche-initiativen-im-libanon-ich-bin-kein-maertyrer'
p client.extract(url: url)
# Result. Note: instead of the requested Qantara article, the response describes 80legs' crawler info page (compare the original_url and url fields below):
#[#<Embedly::EmbedlyObject provider_url="http://www.80legs.com", description="What is 008? If you've come to this page, then you're probably interested in learning more about our web crawler, identified as user-agent \"008\". 008 runs on a grid computing platform that consists of several thousand computers, which is why you may see our web crawler access your site from many different IP addresses.", embeds=[], safe=true, provider_display="www.80legs.com", related=[{"score"=>0.5730392336845398, "description"=>"Twenty-five years ago today, I filed the proposal for what was to become the World Wide Web. My boss dubbed it 'vague but exciting'. Luckily, he thought enough of the idea to allow me to quietly work on it on the side.", "title"=>"{title}", "url"=>"http://www.webat25.org/", "thumbnail_height"=>171, "thumbnail_url"=>"http://webat25.org/assets/img/logo.png", "thumbnail_width"=>240}, {"score"=>0.5435097813606262, "description"=>"In 1989, Tim Berners-Lee, a software engineer, sat in his small office at CERN, the European Organization for Nuclear Research near Geneva and started work on a new system called the World Wide Web. On Wednesday, that project, now simply called the web, will celebrate its 25th anniversary, and Mr. Berners-Lee is looking ahead at the next 25.", "title"=>"As the Web Turns 25, Its Creator Talks About Its Future", "url"=>"http://bits.blogs.nytimes.com/2014/03/11/as-the-world-wide-web-turns-25-fear-about-its-future/", "thumbnail_height"=>338, "thumbnail_url"=>"http://graphics8.nytimes.com/images/2014/03/11/technology/bits-web25-slide-20U8/bits-web25-slide-20U8-videoSixteenByNine600.png", "thumbnail_width"=>600}], favicon_url=nil, authors=[], images=[], cache_age=77636, lead=nil, language="English", original_url="http://de.qantara.de/inhalt/zivilgesellschaftliche-initiativen-im-libanon-ich-bin-kein-maertyrer", url="http://www.80legs.com/webcrawler.html", media=#<Embedly::EmbedlyObject>, title="80legs - Most Powerful Web Crawler Ever", offset=nil, content="<div>\n<h2>What is 008?</h2>\n<p>If you've come to this page, then you're probably interested in learning more about our web crawler, identified as user-agent \"008\".</p>\n<p>008 runs on a grid computing platform that consists of several thousand computers, which is why you may see our web crawler access your site from many different IP addresses.</p>\n<h2>Why is 008 crawling my website?</h2>\n<p>008 is the user-agent used by 80legs, a web crawling service provider. 80legs allows its users to design and run custom web crawls. So, if 008 is crawling your website, it means that one or more 80legs users created a web crawl that went (eventually) to your website.</p>\n<p>People use 80legs for a variety of reasons, including providing data to their own search engines, monitoring trends in online opinions, and <a href=\"http://www.80legs.com/who-uses-80legs.html\">other interesting applications</a>.</p>\n<h2>Help us crawl your website properly</h2>\n<p>If you feel that 008 is crawling your website too quickly, please <a href=\"http://www.80legs.com/contact-linked.html\">let us know</a> what an appropriate crawl rate is. If you'd like us to stop crawling your website, the best thing to do is to block our web crawler using the <a href=\"http://en.wikipedia.org/wiki/robots.txt\">robots.txt specification</a>. To do this, add the following to your robots.txt:</p>\n<pre>\n User-agent: 008\n Disallow: /</pre>\n<p>If you block 008 using robots.txt, you will see crawl requests die down gradually, rather than immediately. 
# This happens because of our distributed architecture. Our computers only periodically receive robots.txt information for domains they are crawling.</p>\n<h2>Blocking us by IP address</h2>\n<p><strong>Blocking our web crawler by IP address will not work</strong>. Due to the distributed nature of our infrastructure, we have thousands of constantly changing IP addresses. We strongly recommend you don't try to block our web crawler by IP address, as you'll most likely spend several hours of futile effort and be in a very bad mood at the end of it. You really should just include us in your robots.txt or <a href=\"http://www.80legs.com/contact.html\">contact us directly</a>.</p>\n<h2>Learn more</h2>\n<p>To read more about the inner workings of 008, please <a href=\"http://wiki.80legs.com/FAQ#Howdoes80legscrawlwebpages\">visit our wiki</a>.</p>\n<p>To learn more about 80legs, please check out the rest of the site. If you'd like to ask us any questions, please <a href=\"http://www.80legs.com/contact-linked.html\">contact us</a>.</p>\n</div>", entities=[], favicon_colors=nil, keywords=[{"score"=>93, "name"=>"008"}, {"score"=>74, "name"=>"crawling"}, {"score"=>50, "name"=>"80legs"}, {"score"=>50, "name"=>"crawler"}, {"score"=>43, "name"=>"txt"}, {"score"=>39, "name"=>"web"}, {"score"=>30, "name"=>"user-agent"}, {"score"=>27, "name"=>"robots"}, {"score"=>23, "name"=>"website"}, {"score"=>20, "name"=>"block"}], published=nil, provider_name="80legs", type="html">]
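
# A minimal sketch of reading individual fields from the response (reuses the
# client and url defined above; note this makes a second API call). As the
# dump above suggests, extract returns an array of Embedly::EmbedlyObject
# instances, one per requested URL, with the fields exposed as reader methods:
obj = client.extract(url: url).first

puts obj.title        # => "80legs - Most Powerful Web Crawler Ever"
puts obj.provider_url # => "http://www.80legs.com"
puts obj.language     # => "English"

# Nested values such as keywords come back as plain arrays of hashes:
obj.keywords.each { |kw| puts "#{kw['name']} (#{kw['score']})" }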
@BunHouth commented Apr 9, 2015

Hello, how can I get an API key?
