
@Smerity
Created July 8, 2014 01:38
Extract just the text from Common Crawl WARC WET files
# To run: python just_text.py > text
from __future__ import print_function

from glob import glob

import warc

# List any of the WET files found in the data folder
warc_files = glob('data/*.wet.gz')

# Process each of the WET files we found
files_processed = 0
for fn in warc_files:
    f = warc.open(fn)
    for record in f:
        # Skip records without a target URI (e.g. the leading warcinfo record)
        url = record.header.get('warc-target-uri', None)
        if not url:
            continue
        text = record.payload.read()
        # Print the URL, then the extracted text, then two blank separator lines
        print(url)
        print(text)
        print()
        print()
    files_processed += 1
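
If the `warc` package is not available, the same extraction can be approximated with only the standard library. The following is a minimal sketch, not part of the original gist: the helper names `iter_wet_records` and `iter_wet_file` are hypothetical, and it assumes each record is a `WARC/` version line, `Name: value` header lines, a blank line, and then exactly `Content-Length` bytes of payload.

```python
import gzip
import io


def iter_wet_records(stream):
    """Yield (url, text) pairs from a binary stream of uncompressed WARC records.

    Hypothetical helper: assumes well-formed records with a Content-Length header.
    """
    while True:
        line = stream.readline()
        if not line:
            return  # end of stream
        if not line.startswith(b"WARC/"):
            continue  # skip the blank lines that separate records
        # Read "Name: value" header lines up to the blank line ending the headers
        headers = {}
        for raw in iter(stream.readline, b""):
            raw = raw.strip()
            if not raw:
                break
            name, _, value = raw.partition(b":")
            headers[name.strip().lower()] = value.strip()
        # Content-Length gives the payload size in bytes
        payload = stream.read(int(headers.get(b"content-length", 0)))
        url = headers.get(b"warc-target-uri")
        if url:  # records without a target URI are skipped, as in the script above
            yield url.decode("utf-8"), payload.decode("utf-8", errors="replace")


def iter_wet_file(path):
    """Open a gzipped WET file and yield its (url, text) pairs."""
    with gzip.open(path, "rb") as stream:
        for pair in iter_wet_records(stream):
            yield pair
```

Usage mirrors the loop above: `for url, text in iter_wet_file(fn): ...` for each `fn` returned by `glob('data/*.wet.gz')`.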