Eric Harley ericharley

ericharley / doit.py
Created Nov 9, 2018
Python for Common Crawl
import csv
import gzip
import requests
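try:
    from StringIO import StringIO  # Python 2
except ImportError:
    from io import BytesIO as StringIO  # Python 3: gzip payloads are bytes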
# Parameters
prefix = 'https://commoncrawl.s3.amazonaws.com/'
fileout_extension = "pdf"
def get_file(warc_filename, warc_record_offset, warc_record_length, content_digest):
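The body of get_file is not shown in this listing. As a rough sketch only, and not the author's implementation, one common way to pull a single record out of a Common Crawl WARC file is an HTTP Range request covering the record's offset and length, followed by gunzipping the returned bytes. The helper names byte_range_header, decode_record and fetch_record are hypothetical, the sketch targets Python 3, and it leaves out content_digest and fileout_extension because their use is not visible here.

```python
import gzip

# Same value as `prefix` in the snippet above.
PREFIX = "https://commoncrawl.s3.amazonaws.com/"


def byte_range_header(offset, length):
    """Build an HTTP Range header covering one record (the end byte is inclusive)."""
    offset, length = int(offset), int(length)
    return {"Range": "bytes={}-{}".format(offset, offset + length - 1)}


def decode_record(raw):
    """Gunzip one record and split it into WARC headers, HTTP headers and payload.

    Assumes a response record, where a blank line ends each header block.
    The payload may still carry the CRLFs that terminate the WARC record.
    """
    data = gzip.decompress(raw)
    warc_headers, _, rest = data.partition(b"\r\n\r\n")
    http_headers, _, body = rest.partition(b"\r\n\r\n")
    return warc_headers, http_headers, body


def fetch_record(warc_filename, warc_record_offset, warc_record_length):
    """Hypothetical fetch step: download and decode one record (needs network)."""
    import requests  # imported lazily so the helpers above work without it

    resp = requests.get(
        PREFIX + warc_filename,
        headers=byte_range_header(warc_record_offset, warc_record_length),
        timeout=60,
    )
    resp.raise_for_status()
    return decode_record(resp.content)
```

The offset and length arguments are converted with int() on the assumption that they arrive as strings from a CSV index, which the csv import suggests. Keeping the range arithmetic and the decoding in separate functions lets both be checked without any network access.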