Created July 8, 2014 01:38
Extract just the text from Common Crawl WARC WET files
# To run: python just_text.py > text
###
from glob import glob
#
import warc

# List any of the WARC files found in the data folder
warc_files = glob('data/*.wet.gz')

# Process each of the WARC files we found
files_processed = 0
for fn in warc_files:
    f = warc.open(fn)
    for record in f:
        url = record.header.get('warc-target-uri', None)
        if not url:
            continue
        text = record.payload.read()
        #
        print url
        print text
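
The script above uses Python 2 print statements and the third-party warc library. For readers on Python 3, the same loop can be sketched with only the standard library. This is a minimal sketch, not the author's method: it assumes each WET record follows the usual WARC layout (a version line, header lines, a blank line, then exactly Content-Length bytes of payload), and the names iter_wet_records and extract_text are hypothetical helpers introduced here for illustration.

```python
import gzip
from glob import glob


def iter_wet_records(stream):
    """Yield (headers, payload_bytes) for each record in a binary WARC/WET stream.

    Assumption: records look like "WARC/1.0", header lines, a blank line,
    then Content-Length bytes of payload.
    """
    while True:
        line = stream.readline()
        if not line:
            return  # end of stream
        if not line.strip():
            continue  # blank separator lines between records
        # `line` is now the version line (e.g. b"WARC/1.0"); read the headers
        headers = {}
        while True:
            line = stream.readline()
            if not line.strip():
                break  # blank line (or end of stream) ends the header block
            name, _, value = line.decode('utf-8', 'replace').partition(':')
            headers[name.strip().lower()] = value.strip()
        length = int(headers.get('content-length', 0))
        yield headers, stream.read(length)


def extract_text(stream):
    """Yield (url, text) for records that carry a WARC-Target-URI header."""
    for headers, payload in iter_wet_records(stream):
        url = headers.get('warc-target-uri')
        if not url:
            continue  # e.g. the leading warcinfo record has no target URI
        yield url, payload.decode('utf-8', 'replace')


if __name__ == '__main__':
    # Same data folder layout as the original script
    for fn in glob('data/*.wet.gz'):
        with gzip.open(fn, 'rb') as f:
            for url, text in extract_text(f):
                print(url)
                print(text)
```

Header names are lowercased so the lookup matches the 'warc-target-uri' key used in the original script, and the payload is read by byte count rather than by line so that blank lines inside the extracted text do not end a record early.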