@ericharley
ericharley / doit.py
Created November 9, 2018 20:00
Python for Common Crawl
import csv
import gzip
import requests
from io import BytesIO  # Python 3 replacement for the Python 2-only StringIO module
# Parameters
prefix = 'https://commoncrawl.s3.amazonaws.com/'
fileout_extension = "pdf"
def get_file(warc_filename, warc_record_offset, warc_record_length, content_digest):
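The preview cuts off before the body of `get_file`, so the author's implementation is not shown here. As a rough illustration of what the offset and length parameters are typically used for, the sketch below fetches one record with an HTTP byte-range request and decompresses it. Everything in it is an assumption rather than the gist's code: the helper names `byte_range_header`, `decompress_record`, and `fetch_warc_record` are hypothetical, it assumes each WARC record is stored as its own gzip member, and it swaps `requests` for the standard library's `urllib.request` so that it runs without extra dependencies.

```python
import gzip
import urllib.request

# Same prefix as in the gist above.
PREFIX = 'https://commoncrawl.s3.amazonaws.com/'


def byte_range_header(offset, length):
    """Build an HTTP Range header covering `length` bytes starting at `offset`."""
    # HTTP byte ranges are inclusive on both ends, hence the -1.
    return {'Range': 'bytes={}-{}'.format(offset, offset + length - 1)}


def decompress_record(raw):
    """Decompress one gzip member (assumption: each record is gzipped on its own)."""
    return gzip.decompress(raw)


def fetch_warc_record(warc_filename, offset, length, prefix=PREFIX):
    """Hypothetical helper: download and decompress a single WARC record."""
    request = urllib.request.Request(
        prefix + warc_filename,
        headers=byte_range_header(offset, length),
    )
    with urllib.request.urlopen(request) as response:
        return decompress_record(response.read())
```

Keeping the header construction and the decompression in separate functions means both can be checked without network access; only `fetch_warc_record` touches the remote server.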
ericharley / gist:bd653fcf8228cba43979c97d6efcf8da
Created September 19, 2022 16:15
Quickly extract all links from a web page using the browser console
// source https://towardsdatascience.com/quickly-extract-all-links-from-a-web-page-using-javascript-and-the-browser-console-49bb6f48127b
var x = document.querySelectorAll("a");
var myarray = [];
for (var i = 0; i < x.length; i++) {
    var nametext = x[i].textContent;
    // Collapse runs of whitespace so each link label fits on one line.
    var cleantext = nametext.replace(/\s+/g, ' ').trim();
    var cleanlink = x[i].href;
    myarray.push([cleantext, cleanlink]);
}
function make_table() {