mrflip / datasets.md
Created August 9, 2012 20:01
Overview of Datasets

== Overview of Datasets ==

The examples in this book use the "Chimpmark" datasets: a set of freely redistributable datasets, converted to simple standard formats, with traceable provenance and documented schemas. They are the same datasets used in the upcoming Chimpmark Challenge big-data benchmark. The datasets are:

  • Wikipedia English-language Article Corpus (wikipedia_corpus; 38 GB, 619 million records, 4 billion tokens) -- the full text of every English-language Wikipedia article, in

  • Wikipedia Pagelink Graph (wikipedia_pagelinks; ) --

  • Wikipedia Pageview Stats (wikipedia_pageviews; 2.3 TB, about 250 billion records (FIXME: verify num records)) -- hour-by-hour pageview

acolyer / service-checklist.md
Last active July 10, 2024 05:13
Internet Scale Services Checklist


A checklist for designing and developing internet-scale services, inspired by James Hamilton's 2007 paper "On Designing and Deploying Internet-Scale Services."

Basic tenets

  • Does the design expect failures to happen regularly and handle them gracefully? (a minimal sketch follows this list)
  • Have we kept things as simple as possible?
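
The first tenet, expecting failures to happen regularly and handling them gracefully, usually shows up in code as bounded retries, backoff with jitter, and graceful degradation. The sketch below is one minimal illustration of those patterns in Python; it is not taken from the checklist or from Hamilton's paper, and all names (`TransientError`, `fetch_profile`, `get_profile`) are hypothetical.

```python
# Minimal sketch, assuming a flaky downstream dependency: retry transient
# errors with capped exponential backoff plus jitter, then degrade gracefully
# instead of failing the whole request. Names here are hypothetical.
import random
import time


class TransientError(Exception):
    """Stands in for a timeout or 5xx from a downstream dependency."""


def fetch_profile(user_id: str) -> dict:
    # Placeholder dependency call; fails ~70% of the time to simulate flakiness.
    if random.random() < 0.7:
        raise TransientError("downstream timed out")
    return {"user_id": user_id, "name": "example"}


def get_profile(user_id: str, attempts: int = 4, base_delay: float = 0.1) -> dict:
    """Retry transient failures; if the retry budget is exhausted, return a
    degraded result rather than propagating the error to the caller."""
    for attempt in range(attempts):
        try:
            return fetch_profile(user_id)
        except TransientError:
            if attempt == attempts - 1:
                break
            # Exponential backoff (0.1s, 0.2s, 0.4s, ...) with full jitter,
            # capped so retries cannot pile up unbounded delays.
            delay = min(2.0, base_delay * (2 ** attempt))
            time.sleep(random.uniform(0, delay))
    # Graceful degradation: serve a stub instead of an error page.
    return {"user_id": user_id, "name": None, "degraded": True}


if __name__ == "__main__":
    print(get_profile("42"))
```

A production service would typically pair this with request timeouts and a circuit breaker so that retries cannot amplify an outage.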