mrflip / datasets.md
Created August 9, 2012 20:01
Overview of Datasets

== Overview of Datasets ==

The examples in this book use the "Chimpmark" datasets: a set of freely redistributable datasets, converted to simple standard formats, with traceable provenance and documented schemas. They are the same datasets used in the upcoming Chimpmark Challenge big-data benchmark. The datasets are:

  • Wikipedia English-language Article Corpus (wikipedia_corpus; 38 GB, 619 million records, 4 billion tokens) -- the full text of every English-language Wikipedia article, in

  • Wikipedia Pagelink Graph (wikipedia_pagelinks; ) --

  • Wikipedia Pageview Stats (wikipedia_pageviews; 2.3 TB, about 250 billion records (FIXME: verify num records)) -- hour-by-hour pageview

acolyer / service-checklist.md
Last active July 10, 2024 05:13
Internet Scale Services Checklist


A checklist for designing and developing internet-scale services, inspired by James Hamilton's 2007 paper "On Designing and Deploying Internet-Scale Services."

Basic tenets

  • Does the design expect failures to happen regularly and handle them gracefully? (a minimal sketch follows this list)
  • Have we kept things as simple as possible?
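
The first tenet, expecting failures to happen regularly and handling them gracefully, usually shows up in code as bounded retries, backoff with jitter, and graceful degradation. The sketch below is one minimal illustration of those patterns in Python; it is not taken from the checklist or from Hamilton's paper, and all names (`TransientError`, `fetch_profile`, `get_profile`) are hypothetical.

```python
# Minimal sketch, assuming a flaky downstream dependency: retry transient
# errors with capped exponential backoff plus jitter, then degrade gracefully
# instead of failing the whole request. Names here are hypothetical.
import random
import time


class TransientError(Exception):
    """Stands in for a timeout or 5xx from a downstream dependency."""


def fetch_profile(user_id: str) -> dict:
    # Placeholder dependency call; fails ~70% of the time to simulate flakiness.
    if random.random() < 0.7:
        raise TransientError("downstream timed out")
    return {"user_id": user_id, "name": "example"}


def get_profile(user_id: str, attempts: int = 4, base_delay: float = 0.1) -> dict:
    """Retry transient failures; if the retry budget is exhausted, return a
    degraded result rather than propagating the error to the caller."""
    for attempt in range(attempts):
        try:
            return fetch_profile(user_id)
        except TransientError:
            if attempt == attempts - 1:
                break
            # Exponential backoff (0.1s, 0.2s, 0.4s, ...) with full jitter,
            # capped so retries cannot pile up unbounded delays.
            delay = min(2.0, base_delay * (2 ** attempt))
            time.sleep(random.uniform(0, delay))
    # Graceful degradation: serve a stub instead of an error page.
    return {"user_id": user_id, "name": None, "degraded": True}


if __name__ == "__main__":
    print(get_profile("42"))
```

A production service would typically pair this with request timeouts and a circuit breaker so that retries cannot amplify an outage.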