@edsu
Created December 2, 2011 10:34
Example of using wget's WARC functionality
.-(ed@curry 05:24:32) ~
`-->wget --recursive --warc-file=c4lj.warc.gz http://journal.code4lib.org
FINISHED --2011-12-02 05:17:11--
Total wall clock time: 19m 24s
Downloaded: 1524 files, 99M in 4m 17s (395 KB/s)
.-(ed@curry 05:45:22) ~
`-->ls -lh c4lj.warc.gz.warc.gz
-rw-rw-r-- 1 ed ed 85M 2011-12-02 05:17 c4lj.warc.gz.warc.gz
.-(ed@curry 05:45:30) ~
`-->zcat c4lj.warc.gz.warc.gz | head -n300
WARC/1.0
WARC-Type: warcinfo
Content-Type: application/warc-fields
WARC-Date: 2011-12-02T09:57:47Z
WARC-Record-ID: <urn:uuid:1d0fdadf-94ea-4b4f-9aa4-fa055cd8f2c0>
WARC-Filename: c4lj.warc.gz.warc.gz
WARC-Block-Digest: sha1:32POUMA7CZE3T7DDGE7PALB5HTSE2S6Z
Content-Length: 257
software: Wget/1.13.4-2575 (linux-gnu)
format: WARC File Format 1.0
conformsTo: http://bibnum.bnf.fr/WARC/WARC_ISO_28500_version1_latestdraft.pdf
robots: classic
wget-arguments: "--recursive" "--warc-file=c4lj.warc.gz" "http://journal.code4lib.org"
WARC/1.0
WARC-Type: request
WARC-Target-URI: http://journal.code4lib.org/
Content-Type: application/http;msgtype=request
WARC-Date: 2011-12-02T09:57:48Z
WARC-Record-ID: <urn:uuid:29724054-6514-4e65-8d8c-06025dbf1b70>
WARC-IP-Address: 152.19.134.41
WARC-Warcinfo-ID: <urn:uuid:1d0fdadf-94ea-4b4f-9aa4-fa055cd8f2c0>
WARC-Block-Digest: sha1:6NHMBYY6DWIVIUQSSCHCD4ILASYOC2KF
Content-Length: 125
GET / HTTP/1.1
User-Agent: Wget/1.13.4-2575 (linux-gnu)
Accept: */*
Host: journal.code4lib.org
Connection: Keep-Alive
WARC/1.0
WARC-Type: response
WARC-Record-ID: <urn:uuid:4b3ff038-3be2-41fa-8426-93ef273d59eb>
WARC-Warcinfo-ID: <urn:uuid:1d0fdadf-94ea-4b4f-9aa4-fa055cd8f2c0>
WARC-Concurrent-To: <urn:uuid:29724054-6514-4e65-8d8c-06025dbf1b70>
WARC-Target-URI: http://journal.code4lib.org/
WARC-Date: 2011-12-02T09:57:48Z
WARC-IP-Address: 152.19.134.41
WARC-Block-Digest: sha1:XMKE56HN3RJ4QDGPZ2EUPXDMB2KKEEYE
WARC-Payload-Digest: sha1:2VCRMGAXDI4AML6UNJDND34HR6LHVGJV
Content-Type: application/http;msgtype=response
Content-Length: 14987
HTTP/1.1 200 OK
Date: Fri, 02 Dec 2011 09:57:48 GMT
Server: Apache
X-Pingback: http://journal.code4lib.org/xmlrpc.php
Keep-Alive: timeout=5, max=200
Connection: Keep-Alive
Transfer-Encoding: chunked
Content-Type: text/html; charset=UTF-8
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" dir="ltr" lang="en-US">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
<title>The Code4Lib Journal</title>
<meta name="generator" content="WordPress 3.0.4" /> <!-- leave this for stats -->
<link rel="shortcut icon" href="http://journal.code4lib.org/wp-content/themes/c4lj/images/favicon.ico" />
<link rel="stylesheet" href="http://journal.code4lib.org/wp-content/themes/c4lj/style.css" type="text/css" media="screen, print" />
<!--[if lte IE 7]>
<link rel="stylesheet" href="http://journal.code4lib.org/wp-content/themes/c4lj/fix-ie7.css" type="text/css" media="screen" />
<![endif]-->
<!--[if lte IE 6]>
<link rel="stylesheet" href="http://journal.code4lib.org/wp-content/themes/c4lj/fix-ie6.css" type="text/css" media="screen" />
<![endif]-->
<link rel="stylesheet" href="http://journal.code4lib.org/wp-content/themes/c4lj/print.css" type="text/css" media="print" />
<link rel="alternate" type="application/rss+xml" title="The Code4Lib Journal Syndication Feed" href="http://feeds.feedburner.com/c4lj" />
<link rel="pingback" href="http://journal.code4lib.org/xmlrpc.php" />
<link rel='stylesheet' id='contact-form-7-css' href='http://journal.code4lib.org/wp-content/plugins/contact-form-7/styles.css?ver=2.4.2' type='text/css' media='all' />
<script type='text/javascript' src='http://journal.code4lib.org/wp-includes/js/jquery/jquery.js?ver=1.4.2'></script>
<link rel="EditURI" type="application/rsd+xml" title="RSD" href="http://journal.code4lib.org/xmlrpc.php?rsd" />
<link rel="wlwmanifest" type="application/wlwmanifest+xml" href="http://journal.code4lib.org/wp-includes/wlwmanifest.xml" />
<link rel='index' title='The Code4Lib Journal' href='http://journal.code4lib.org' />
<meta name="generator" content="WordPress 3.0.4" />
<!-- unAPI -->
<link rel="unapi-server" type="application/xml" title="unAPI" href="http://journal.code4lib.org/wp-content/plugins/unapi/server.php"/>
<link rel="stylesheet" type="text/css" href="http://journal.code4lib.org/wp-content/plugins/wp-recaptcha/recaptcha.css" /> <link type="text/css" rel="stylesheet" href="http://journal.code4lib.org/wp-content/plugins/syntaxhighlighter/files/SyntaxHighlighter.css"></link>
</head>
<body>
<div id="page">
<div id="header">
<div id="headerbackground">
<h1><a href="http://journal.code4lib.org/"><img src="http://journal.code4lib.org/wp-content/themes/c4lj/images/logo.png" alt="The Code4Lib Journal" /></a></h1>
<h2 id="issn">ISSN 1940-5758</h2>
</div>
</div>
<div id="meta">
<form method="get" id="searchform" action="http://journal.code4lib.org/">
<div>
<input type="text" value="" name="s" id="s" tabindex="1" />
<input type="submit" value="Search" id="searchsubmit" tabindex="2" />
</div>
</form>
<div id="archives">
<h2>Current Issue</h2>
<ul>
<li><a href="http://journal.code4lib.org/issues/issue15">Issue 15, 2011-10-31</a></li>
</ul>
<h2>Previous Issues</h2>
<ul>
<li><a href="http://journal.code4lib.org/issues/issue14">Issue 14, 2011-07-25</a></li><li><a href="http://journal.code4lib.org/issues/issue13">Issue 13, 2011-04-11</a></li><li><a href="http://journal.code4lib.org/issues/issue12">Issue 12, 2010-12-21</a></li><li><a href="http://journal.code4lib.org/issues/issue11">Issue 11, 2010-09-21</a></li> <li><a href="/issues">Older Issues</a></li>
</ul>
</div>
<div id="about">
<h2>About</h2>
<ul>
<li class="page_item page-item-5"><a href="http://journal.code4lib.org/mission" title="Mission">Mission</a></li>
<li class="page_item page-item-6"><a href="http://journal.code4lib.org/editorial-committee" title="Editorial Committee">Editorial Committee</a></li>
<li class="page_item page-item-8"><a href="http://journal.code4lib.org/process-and-structure" title="Process and Structure">Process and Structure</a></li>
<li><a href="http://code4lib.org/">Code4Lib</a></li>
</ul>
</div>
<div id="forauthors">
<h2>For Authors</h2>
<ul>
<li class="page_item page-item-4"><a href="http://journal.code4lib.org/call-for-submissions" title="Call for Submissions">Call for Submissions</a></li>
<li class="page_item page-item-7"><a href="http://journal.code4lib.org/article-guidelines" title="Article Guidelines">Article Guidelines</a></li>
</ul>
</div>
</div>
<div id="content" class="listpage">
<h1 class="pagetitle">Issue 15</h1>
<div class="article" id="post-5989">
<h2 class="articletitle"><a href="http://journal.code4lib.org/articles/5989">Editorial Introduction</a></h2>
<p class="author">Tod A. Olson</p>
<div class="abstract">
<p>This Hallowe’en finds our contributors working away like (benign) mad scientists, assembling and deploying their creations to bring services and information in novel ways to their patrons and staff, approaching their work with a vital sprit of invention and discovery. </p>
</div>
</div> <div class="article" id="post-5994">
<h2 class="articletitle"><a href="http://journal.code4lib.org/articles/5994">Controlled Terms or Free Terms? A JavaScript Library to Utilize Subject Headings and Thesauri on the Web</a></h2>
<p class="author">Shun Nagaya, Yutaka Hayashi, Shuhei Otani and Keizo Itabashi </p>
<div class="abstract">
<p>There are two types of keywords used as metadata: controlled terms and free terms. Free terms have the advantage that metadata creators can freely select keywords, but there also exists a disadvantage that the information retrieval recall ratio might be reduced. The recall ratio can be improved by using controlled terms. But creating and maintaining controlled vocabularies has an enormous cost. In addition, many existing controlled vocabularies are published in formats less suitable for programming. We introduce a JavaScript library called “covo.js” that enables us to make use of controlled vocabularies as metadata for the organization of web pages.</p>
</div>
</div> <div class="article" id="post-5876">
<h2 class="articletitle"><a href="http://journal.code4lib.org/articles/5876">Best Practices for a University Laptop Lending Program</a></h2>
<p class="author">Pamela Buzzard and Travis Teetor</p>
<div class="abstract">
<p>The University of Arizona Libraries currently circulates over three hundred pieces of equipment including laptops, netbooks, projectors and iPads. This article describes the best practices and workflows we have developed since 2003 to create a laptop/equipment lending program that is efficient and mindful of financial resources and that our student body loves and continues to support.</p>
</div>
</div> <div class="article" id="post-6004">
<h2 class="articletitle"><a href="http://journal.code4lib.org/articles/6004">Using Amazon Mechanical Turk to Transcribe Historical Handwritten Documents</a></h2>
<p class="author">Andrew S.I.D. Lang and Joshua Rio-Ross</p>
<div class="abstract">
<p>The developing “information age” is continually unraveling new ways of discovering, presenting and sharing information. Most new academic material is digitally formatted upon its creation and is thus easy to find and query. However, there remains a good deal of material from times prior to the “information age” that has yet to be converted to digital form. Much of this material can be found in library collections—whether academic, public or private—and thus remains available only to a limited number of locals or willing-and-able sojourners. Using OCR technology, most typeset documents can be digitized and made available online; and there are several projects underway to do exactly this. However, there remains little to be done for handwritten materials. Those who own collections of handwritten documents are increasingly wanting to make the content thereof available to the general public. Unfortunately, traditional transcription models typically prove to be expensive or inefficient and pdf snapshots are not searchable. We have developed a model for digital transcription using Google Docs and Amazon&#8217;s Mechanical Turk. Using this model, one can use an online workforce to efficiently transcribe handwritten texts and perform quality control at a cost much lower than professional transcription services. To illustrate the model we used Amazon’s Mechanical Turk to transcribe and then proofread the Frederick Douglass Diary which we have made available on a public searchable wiki. The total cost of transcription and proofreading for the 72 page diary was less than $25.00 with some pages being transcribed and proofread for as little as $0.04. Our results show that using Amazon’s Mechanical Turk holds great promise for providing an affordable transcription method for hand-written historical documents making them easily sharable and fully searchable.</p>
</div>
</div> <div class="article" id="post-5832">
<h2 class="articletitle"><a href="http://journal.code4lib.org/articles/5832">Lessons in Public Touchscreen Development</a></h2>
<p class="author">Andreas K. Orphanides</p>
<div class="abstract">
<p>In October 2010, the NCSU Libraries debuted its first public touchscreen information kiosk, designed to provide on-demand access to useful and commonly consulted real-time displays of library information. This article presents a description of the hardware and software development process, as well as the rationale behind a variety of design and implementation decisions. This article also provides an analysis of usage of the touchscreen since its debut, including a numerical analysis of most popular content areas, and a heatmap-based analysis of user interaction patterns with the kiosk&#39;s interface components.</p>
</div>
</div> <div class="article" id="post-5859">
<h2 class="articletitle"><a href="http://journal.code4lib.org/articles/5859">An Android/LAMP Mobile In/Out Board Based on Wi-Fi Fingerprinting</a></h2>
<p class="author">Keith Kelley, Karlis Kaugars, Scott Garrison</p>
<div class="abstract">
<p>Library technology and other professionals with diverse skills must be able to locate each other during the workday, in order to most responsively serve their clients. While staff often carry cellular phones, contact can be especially challenging given the constant, highly mobile nature of library work, especially on larger campuses with variable cellular phone service. Western Michigan University (WMU) Libraries has developed an Android/LAMP application that library staff may use on their increasingly prevalent Wi-Fi enabled mobile devices to “check in” at various locations where they do work, so that their colleagues may locate them as needed. The application takes advantage of WMU’s widespread Wi-Fi network, a set of free platform and software development tools and open standards, and methods from the computer science literature, and overcomes GPS and telephony limitations. This article describes the application, which is based on Wi-Fi fingerprinting, and suggests how other developers could use it and new methods from the computer science literature as starting points to create their own applications.</p>
</div>
</div> <div class="article" id="post-5913">
<h2 class="articletitle"><a href="http://journal.code4lib.org/articles/5913">Open Access Publishing with Drupal</a></h2>
<p class="author">Nina McHale</p>
<div class="abstract">
<p>In January 2009, the Colorado Association of Libraries (CAL) suspended publication of its print quarterly journal, Colorado Libraries, as a cost-saving measure in a time of fiscal uncertainty. Printing and mailing the journal to its 1300 members cost CAL more than $26,000 per year. Publication of the journal was placed on an indefinite hiatus until the editorial staff proposed an online, open access format a year later. The benefits to migrating to open access included: significantly lower costs; a green platform; instant availability of content; a greater level of access to users with disabilities; and a higher level of visibility of the journal and the association. The editorial staff chose Drupal, including the E-journal module, and while Drupal is notorious for its steep learning curve—which exacerbated delays to content that had been created before the publishing hiatus—the fourth electronic issue was published recently at coloradolibrariesjournal.org. This article will discuss both the benefits and challenges of transitioning to an open access model and the choice Drupal as a platform over other more established journal software options.</p>
</div>
</div> </div>
<div id="footer">
<p id="login"><a href="http://journal.code4lib.org/wp-login.php">Log in</a></p>
<p id="copyright">This work is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by/3.0/us/">Creative Commons Attribution 3.0 United States License</a>.<br /><a rel="license" href="http://creativecommons.org/licenses/by/3.0/us/"><img alt="Creative Commons License" src="http://i.creativecommons.org/l/by/3.0/us/80x15.png" /></a></p>
</div>
<script type='text/javascript' src='http://journal.code4lib.org/wp-content/plugins/contact-form-7/jquery.form.js?ver=2.47'></script>
<script type='text/javascript' src='http://journal.code4lib.org/wp-content/plugins/contact-form-7/scripts.js?ver=2.4.2'></script>
<!-- SyntaxHighlighter Stuff -->
<script type="text/javascript" src="http://journal.code4lib.org/wp-content/plugins/syntaxhighlighter/files/shCore.js"></script>
<script type="text/javascript">
dp.SyntaxHighlighter.ClipboardSwf = 'http://journal.code4lib.org/wp-content/plugins/syntaxhighlighter/files/clipboard.swf';
dp.SyntaxHighlighter.HighlightAll('code');
</script>
</div>
<script src="http://www.google-analytics.com/urchin.js" type="text/javascript">
</script>
<script type="text/javascript">
_uacct = "UA-3211381-1";
urchinTracker();
</script>
</body>
</html>
WARC/1.0
WARC-Type: request
WARC-Target-URI: http://journal.code4lib.org/robots.txt
Content-Type: application/http;msgtype=request
WARC-Date: 2011-12-02T09:57:48Z
WARC-Record-ID: <urn:uuid:04d4206b-8967-4c97-82d4-b5b61decfb16>
WARC-IP-Address: 152.19.134.41
WARC-Warcinfo-ID: <urn:uuid:1d0fdadf-94ea-4b4f-9aa4-fa055cd8f2c0>
WARC-Block-Digest: sha1:IZ5PNJYOIN637BWISEB7K6A4C4UWAGXX
Content-Length: 135
GET /robots.txt HTTP/1.1
User-Agent: Wget/1.13.4-2575 (linux-gnu)
Accept: */*
Host: journal.code4lib.org
Connection: Keep-Alive
WARC/1.0
WARC-Type: response
WARC-Record-ID: <urn:uuid:c6260cf5-ad3b-464d-ac19-bf2620f73b49>
WARC-Warcinfo-ID: <urn:uuid:1d0fdadf-94ea-4b4f-9aa4-fa055cd8f2c0>
WARC-Concurrent-To: <urn:uuid:04d4206b-8967-4c97-82d4-b5b61decfb16>
WARC-Target-URI: http://journal.code4lib.org/robots.txt
WARC-Date: 2011-12-02T09:57:48Z
WARC-IP-Address: 152.19.134.41
WARC-Block-Digest: sha1:QBYTS6LHTJH3E7EKGH5X7GJEUFJLG5N5
WARC-Payload-Digest: sha1:YR6M6GSJYJGMLBBEGCVHLRZO6SISSJAS
Content-Type: application/http;msgtype=response
Content-Length: 265
HTTP/1.1 200 OK
Date: Fri, 02 Dec 2011 09:57:48 GMT
Server: Apache
X-Pingback: http://journal.code4lib.org/xmlrpc.php
Content-Length: 24
Keep-Alive: timeout=5, max=199
Connection: Keep-Alive
Content-Type: text/plain; charset=utf-8
User-agent: *
Disallow:
WARC/1.0
WARC-Type: request
WARC-Target-URI: http://journal.code4lib.org/wp-content/themes/c4lj/images/favicon.ico
Content-Type: application/http;msgtype=request
WARC-Date: 2011-12-02T09:57:49Z
WARC-Record-ID: <urn:uuid:989acd57-36cc-4a89-a475-2e4ebfb57a82>
WARC-IP-Address: 152.19.134.41
WARC-Warcinfo-ID: <urn:uuid:1d0fdadf-94ea-4b4f-9aa4-fa055cd8f2c0>
WARC-Block-Digest: sha1:YVY3MKIPQ7HJ3DE5VH5CCBUNSWHLXUWU
Content-Length: 205
GET /wp-content/themes/c4lj/images/favicon.ico HTTP/1.1
Referer: http://journal.code4lib.org/
User-Agent: Wget/1.13.4-2575 (linux-gnu)
Accept: */*
Host: journal.code4lib.org
Connection: Keep-Alive
WARC/1.0
WARC-Type: response
WARC-Record-ID: <urn:uuid:bc2abae2-a18c-4dc3-a009-306f696bfa31>
WARC-Warcinfo-ID: <urn:uuid:1d0fdadf-94ea-4b4f-9aa4-fa055cd8f2c0>
WARC-Concurrent-To: <urn:uuid:989acd57-36cc-4a89-a475-2e4ebfb57a82>
WARC-Target-URI: http://journal.code4lib.org/wp-content/themes/c4lj/images/favicon.ico
WARC-Date: 2011-12-02T09:57:49Z
WARC-IP-Address: 152.19.134.41
WARC-Block-Digest: sha1:PXAJ5HKX7OU5TVWGQJSVOPPXPXTUQUEZ
WARC-Payload-Digest: sha1:GES2HLPJAKPJCMTTNBGV3JAUE4KSRFJ6
Content-Type: application/http;msgtype=response
Content-Length: 615
HTTP/1.1 200 OK
Date: Fri, 02 Dec 2011 09:57:49 GMT
Server: Apache
Last-Modified: Fri, 30 Jan 2009 19:44:33 GMT
ETag: "e18ec7-13e-461b86f1eaa40"
Accept-Ranges: bytes
Content-Length: 318
Keep-Alive: timeout=5, max=198
Connection: Keep-Alive
Content-Type: text/plain; charset=ISO-8859-1
\00\00\00\00\00\00\00(\00\00\00\00\00(\00\00\00\00\00\00 \00\00\00\00\00\00\00\00\00\00\00\00\00 \00\00 \00\00\00\00\00\00\00\00\00\C9\CB\D4\00\FA\F9\F6\00\F2\F3\F5\00\E4\E5\EA\00\F5\F3\EB\00\E1\DA\C3\00v|\93\00\84\89\9E\00\A0\A4\B4\00\EB\E7\D7\00\AE\9C^\00$-R\00\FF\FF\FF\00\00\00\00\00\00\00\00\00\00\00\00\00\BB\BB\BB\BB\BB\BB\BB\BB\BC\CC\CC\CC\CC\CC\CC˼ʬ\CC\CCʬ˼\CA\\CC\CCI\AC˼\CA˻̬˼\CA\CC̼̬˼\9A \00\B0,\A9˼\AA+\BB\BB<\AA˼\AA\CB|\BC̪˼\9A̰\BC̩˼\CA\CC;\BC̬˼\CA\CCȼ̬˼ʜ\CCl\AC˼ʬ\CC\CC\AC˼\CC\CC\CC\CC\CC\CC˻\BB\BB\BB\BB\BB\BB\BB\00\00y\00\00\00y\00\00\00e\00\00\00A\00\00\00l\00\00\00i\00\00\00s\00\00\00e\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00xq\00\00t\00\00\00\00\00\00\00\00
WARC/1.0
WARC-Type: request
WARC-Target-URI: http://journal.code4lib.org/wp-content/themes/c4lj/style.css
etc ...
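
The dump above is a sequence of WARC records: a `WARC/1.0` version line, named headers, a blank line, and then a block whose size is given by `Content-Length`. As a rough sketch of how such a file could be walked programmatically instead of paging through `zcat`, the Python below reads records using only the standard library. The function names `iter_warc_records` and `summarize` are hypothetical and not part of wget or of any WARC library. The parser is deliberately simplified: it assumes well-formed records, ignores header continuation lines, and keeps only the last value when a header name repeats.

```python
import gzip


def iter_warc_records(stream):
    """Yield (headers, block) pairs from an uncompressed WARC byte stream.

    Hypothetical helper for illustration; assumes well-formed records.
    """
    while True:
        line = stream.readline()
        if not line:
            return  # end of stream
        if not line.strip():
            continue  # blank lines separate one record from the next
        if not line.startswith(b"WARC/"):
            raise ValueError("expected a WARC version line, got %r" % line)

        headers = {}
        for raw in iter(stream.readline, b""):
            raw = raw.rstrip(b"\r\n")
            if not raw:
                break  # an empty line ends the header section
            name, _, value = raw.partition(b":")
            headers[name.decode("utf-8")] = value.strip().decode("utf-8")

        # The block is exactly Content-Length bytes and may be binary.
        block = stream.read(int(headers["Content-Length"]))
        yield headers, block


def summarize(path):
    """Print record type, target URI, and block size for each record."""
    # gzip.open reads concatenated gzip members as one continuous stream.
    with gzip.open(path, "rb") as stream:
        for headers, block in iter_warc_records(stream):
            print(headers["WARC-Type"],
                  headers.get("WARC-Target-URI", "-"),
                  len(block))
```

Calling `summarize("c4lj.warc.gz.warc.gz")` on the file listed above would print one line per record, starting with the `warcinfo` record, which has no target URI.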