@cederigo
Last active December 27, 2015 11:29

Crawl.js - Experiment 3

  • Place: Opennebula cluster unine.ch
  • Date: 11.5.2013

Setup

  • Wikipedia-languages: ar,af,be
  • worker-vms: 2
  • crawlers (url-blocks): 4
  • hashing: simple (md5 on whole url)
  • virtual latency: none
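The "simple" hashing setting (md5 on the whole url) combined with 4 url-blocks could work roughly as in the sketch below. This is an assumption about the scheme, not the actual Crawl.js code: the helper name `urlBlock` and the choice of reducing the first 4 digest bytes modulo the block count are hypothetical.

```javascript
const crypto = require('crypto');

// Hypothetical helper (not taken from Crawl.js): assign a URL to one of
// `numBlocks` url-blocks by hashing the whole URL string with md5.
function urlBlock(url, numBlocks) {
  const digest = crypto.createHash('md5').update(url, 'utf8').digest();
  // Assumption: interpret the first 4 digest bytes as an unsigned
  // 32-bit integer and reduce it modulo the number of blocks.
  return digest.readUInt32BE(0) % numBlocks;
}

// The same URL always lands in the same block, so a crawler can decide
// locally whether a discovered link belongs to it.
console.log(urlBlock('http://af.wikipedia.org/wiki/Tuisblad', 4));
```

Because the whole URL is hashed, pages of the same host spread across all blocks; a host-based hash would instead keep each host inside a single block.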

## Seed urls

## worker-vm

  • cpu: 1 vcpu, 2cpus
  • ram: 2048 MB
  • details: ./vms/lshw_worker_001.txt

## results notes

  • expected pages: number of files under articles/, counted with `find articles/ -type f | wc -l`
  • crawled pages: every page fetched, including error responses (404, ...)
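For reference, the `find articles/ -type f | wc -l` count can be reproduced from Node with a small recursive walk. The sketch below is an illustration only; the function name `countFiles` is hypothetical, and the `articles` directory is assumed to exist relative to the working directory.

```javascript
const fs = require('fs');
const path = require('path');

// Hypothetical helper: count regular files below `dir`, recursively.
// Like `find -type f`, it does not count directories or symlinks.
function countFiles(dir) {
  let count = 0;
  for (const entry of fs.readdirSync(dir, { withFileTypes: true })) {
    if (entry.isDirectory()) {
      count += countFiles(path.join(dir, entry.name));
    } else if (entry.isFile()) {
      count += 1;
    }
  }
  return count;
}

// Only run when the (assumed) articles directory is present.
if (fs.existsSync('articles')) {
  console.log('expected pages:', countFiles('articles'));
}
```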

expected pages: 362'748
crawled pages: 397'406

crawl started:
crawl ended:
crawl duration:

pages / sec:
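
The duration and pages / sec fields follow directly from the start and end timestamps and the crawled page count. The sketch below shows the arithmetic; the function names are hypothetical, and the timestamps in the usage example are made-up placeholders, not measurements from this experiment. Only the two page counts are taken from the results above.

```javascript
// Hypothetical helper: derive crawl duration and throughput.
function crawlStats(startedAt, endedAt, crawledPages) {
  const durationSec = (endedAt.getTime() - startedAt.getTime()) / 1000;
  if (!(durationSec > 0)) {
    throw new RangeError('crawl must end after it started');
  }
  return { durationSec, pagesPerSec: crawledPages / durationSec };
}

// Hypothetical helper: how many more pages were crawled than expected.
function overshoot(expectedPages, crawledPages) {
  const extra = crawledPages - expectedPages;
  return { extra, ratio: extra / expectedPages };
}

// Placeholder timestamps (NOT the real crawl times): one hour, 3600 pages.
const demo = crawlStats(
  new Date('2000-01-01T00:00:00Z'),
  new Date('2000-01-01T01:00:00Z'),
  3600
);
console.log(demo.durationSec, demo.pagesPerSec);

// Page counts reported above: 362'748 expected, 397'406 crawled.
console.log(overshoot(362748, 397406));
```

With the reported counts, the crawl fetched 34'658 more pages than expected (about 9.6% more), which fits the note that crawled pages include error responses.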

## crawler config

https://gist.github.com/cederigo/7317124
