@berlinbrown
Created March 16, 2013 18:44
More web crawler fun, how goes the crawling
Octane Crawler is a fun/safe/friendly crawler. I am barreling along at about one request every 10-15 seconds per host, so I am gathering about 100 requests a day.
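For illustration, here is a minimal sketch of what a 10-15 second per-host delay could look like. This is not the Octane Crawler code; the `HostThrottle` class, its method names, and the injectable `clock` parameter are all hypothetical and exist only for this example.

```python
import random
import time
from urllib.parse import urlsplit


class HostThrottle:
    """Tracks the earliest time each host may be requested again.

    Hypothetical helper for illustration; not part of Octane Crawler.
    """

    def __init__(self, min_delay=10.0, max_delay=15.0, clock=time.monotonic):
        self.min_delay = min_delay
        self.max_delay = max_delay
        self.clock = clock          # injectable so the logic can be tested
        self.next_allowed = {}      # host -> earliest allowed request time

    def _host(self, url):
        return urlsplit(url).hostname or ""

    def wait_time(self, url):
        """Seconds to wait before this URL's host may be hit again."""
        ready_at = self.next_allowed.get(self._host(url), 0.0)
        return max(0.0, ready_at - self.clock())

    def record_request(self, url):
        """Call right after a request; schedules the next allowed slot."""
        delay = random.uniform(self.min_delay, self.max_delay)
        self.next_allowed[self._host(url)] = self.clock() + delay
```

A crawl loop would call `time.sleep(throttle.wait_time(url))` before each fetch and `throttle.record_request(url)` after it. Requests to different hosts do not block each other, only repeat visits to the same host.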
mysql> select count(1) from bot_crawler_links;
+----------+
| count(1) |
+----------+
|     4746 |
+----------+
1 row in set (0.01 sec)
More Notes:
http://berlin2research.com/
http://code.google.com/p/octane-crawler/
Here are some of the more popular hosts, with the number of links for each (the tail end of the result, sorted by count):
| blogs.detroitnews.com | 13 |
| videocafe.crooksandliars.com | 14 |
| whos.amung.us | 14 |
| www.realclearworld.com | 15 |
| supremecourt.c-span.org | 15 |
| www.bbc.co.uk | 17 |
| www.townhall.com | 19 |
| diversity.mit.edu | 21 |
| blueamerica.crooksandliars.com | 27 |
| www.realclearreligion.org | 28 |
| www.wikidot.com | 28 |
| blogs.reuters.com | 29 |
| www.marco.org | 31 |
| www.edx.org | 32 |
| www.detroitnews.com | 32 |
| www.blogger.com | 33 |
| www.realclearpolitics.com | 33 |
| npr.org | 33 |
| www.abcnews.com | 34 |
| www.wendymcelroy.com | 35 |
| www.publicagenda.org | 38 |
| creativecommons.org | 38 |
| www.hlntv.com | 40 |
| www.foxbusiness.com | 40 |
| ureport.foxnews.com | 41 |
| techcrunch.com | 43 |
| www.c-spanvideo.org | 43 |
| www.npr.org | 45 |
| www.africanews.com | 46 |
| reuters.com | 47 |
| web.mit.edu | 48 |
| www.japantoday.com | 52 |
| www.wired.com | 52 |
| wiki.creativecommons.org | 52 |
| latino.foxnews.com | 53 |
| news.bbc.co.uk | 53 |
| cspan.org | 54 |
| bloomberg.com | 54 |
| www.deadline.com | 56 |
| cnn.com | 56 |
| blog.markwatson.com | 57 |
| mises.org | 59 |
| www.huffingtonpost.com | 59 |
| wordpress.org | 60 |
| www.economist.com | 65 |
| ocw.mit.edu | 66 |
| www.johnthavis.com | 68 |
| www.newscientist.com | 73 |
| www.anncoulter.com | 76 |
| www.foxnews.com | 78 |
| www.hooktheory.com | 79 |
| www.amazon.com | 85 |
| cdn.breitbart.com | 91 |
| jamescarlin.wikidot.com | 93 |
| betterimmigration.com | 93 |
| www.theverge.com | 101 |
| www.nytimes.com | 102 |
| abcnews.go.com | 103 |
| crooksandliars.com | 104 |
| www.c-span.org | 106 |
| www.guardian.co.uk | 119 |
| dailyanarchist.com | 129 |
| www.breitbart.com | 285 |
| www.usatoday.com | 340 |
+-------------------------------------------+-------+
578 rows in set (0.01 sec)
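The listing above is a per-host link count. The query that produced it is not shown here, so as a rough sketch only, the same kind of tally can be computed in Python from a list of URLs. The function name `count_links_by_host` and the sample URLs are made up for the example.

```python
from collections import Counter
from urllib.parse import urlsplit


def count_links_by_host(urls):
    """Return (host, count) pairs sorted by count ascending, then host."""
    counts = Counter()
    for url in urls:
        host = urlsplit(url).hostname
        if host:                      # skip strings with no host part
            counts[host] += 1
    return sorted(counts.items(), key=lambda item: (item[1], item[0]))


if __name__ == "__main__":
    # Made-up sample input, not data from the crawl.
    sample = [
        "http://www.usatoday.com/news",
        "http://www.usatoday.com/sports",
        "http://www.npr.org/",
        "http://npr.org/",
    ]
    for host, count in count_links_by_host(sample):
        print(f"| {host} | {count} |")
```

Like the listing, this treats `www.npr.org` and `npr.org` as separate hosts; merging them would need an extra normalization step.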
@berlinbrown

And then here is the size of the crawler file store, not much data:

16K ./dataoctanecrawl/pXwwwbbccomXy_robots_ignore
8.0K ./dataoctanecrawl/pXwwwvividseatscomXy_robots_ignore
24K ./dataoctanecrawl/pXenradiovaticanavaXy
8.0K ./dataoctanecrawl/pXtravelcnncomXy_robots_ignore
128K ./dataoctanecrawl/pXcaumiteduXy
8.0K ./dataoctanecrawl/pXcoretracwordpressorgXy_robots_ignore
566M ./dataoctanecrawl
566M .