Steve McLaughlin stevemclaugh

Character Encoding is Tricky: The Big Difference Between Python 2.7 and Python 3

In Python 2, a "str" object (i.e., a text string) is a sequence of 8-bit bytes, typically holding ASCII or "extended ASCII" text, while a Python 3 "str" is a sequence of Unicode code points by default. An 8-bit encoding has 256 options for each unit in a string of characters, while Unicode has room for roughly 1.1 million code points (1,114,112 in total), though the standard only includes ~140,000 characters at the moment.
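To make the difference concrete, here is a minimal Python 3 sketch. The sample word and the variable names are illustrative assumptions, not part of the original note. It shows that a "str" counts characters, that the encoded bytes can be longer, and what happens when bytes are decoded with the wrong encoding.

```python
# Python 3: str holds Unicode code points; bytes holds raw encoded data.
text = "Aim\u00e9"  # "Aimé", written with an escape so the source file's own encoding doesn't matter

utf8_bytes = text.encode("utf-8")      # é becomes two bytes in UTF-8
latin1_bytes = text.encode("latin-1")  # é is a single byte in Latin-1 (an "extended ASCII" encoding)

print(len(text), len(utf8_bytes), len(latin1_bytes))

# Decoding with the wrong encoding produces garbled text ("mojibake").
mojibake = utf8_bytes.decode("latin-1")
print(mojibake)

# Decoding with the right encoding round-trips cleanly.
round_trip = utf8_bytes.decode("utf-8")
```

In Python 2 the same literal without a "u" prefix would be a byte string, which is why code that mixes the two types tends to break when ported.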

Every character in a text file is represented by a number from zero to some maximum value, expressed in binary 1s and 0s. Old-school ASCII (the plainest of plain text) is a 7-bit encoding format. Because 2^7 = 128, 7-bit ASCII gives you a maximum of 128 possible characters. The capital letter "A" corresponds to 65 in decimal, or 1000001 in binary. "B" is decimal 66, or 1000010, and so on.
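The mapping described above can be checked with Python's built-in ord(), chr(), and format(). This is a small sketch; the variable names are just for illustration.

```python
# Map a few characters to their ASCII numbers and 7-bit binary forms.
rows = [(ch, ord(ch), format(ord(ch), "07b")) for ch in "AB"]
for ch, number, bits in rows:
    print(ch, number, bits)

# chr() goes the other way, from number to character.
letter = chr(65)

# Seven bits give 2**7 distinct values.
ascii_size = 2 ** 7
```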

Each number in a given character encoding format is called a "code point." Here's a handy table of ASCII code points: https://upload.wikimedia.org/wikipedia/commons/thumb/1/1b/ASCII-T

import mnist
import numpy as np
x_train, y_train, x_data, y_data = mnist.load()
x_filtered = []
y_filtered = []
for i in range(len(x_data)):
    x_temp = x_data[i]
    y_temp = y_data[i]
stevemclaugh / A List of Poets
Last active May 16, 2019 21:45
A list of poets recommended by humans. Thanks to my FB friends for help! (Comment for orthography errors.)
Abbas ibn Firnas
Abū Nuwās
Aimé Césaire
Al Berto
Al-Mu‘Allaqāt
Alan Davies
Aleister Crowley
Alexandre O'Neill
Alli Warren
Amaranth Borsuk
# Save this file as 'Dockerfile' in an empty directory
# and build the image with the following command:
# docker build -t wgettor .
FROM alpine
RUN apk update && \
apk add wget curl tor privoxy supervisor
RUN echo "forward-socks5 / 127.0.0.1:9050 ." >> /etc/privoxy/config

Convert all MP3s in a directory to mono 16/44.1 WAV files

cd /path/to/directory/

for file in *.mp3; do
  ffmpeg -i "$file" -acodec pcm_s16le -ac 1 "$(basename "$file" .mp3).wav"
done
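For anyone who would rather drive the same conversion from Python, here is a rough sketch that builds the same ffmpeg arguments as the loop above. The helper names build_ffmpeg_command and convert_directory are hypothetical, and the sketch assumes ffmpeg is installed and on the PATH. Passing the arguments as a list avoids shell quoting problems with filenames that contain spaces.

```python
import subprocess
from pathlib import Path


def build_ffmpeg_command(mp3_path):
    """Return the ffmpeg argument list for one MP3 (hypothetical helper)."""
    mp3_path = Path(mp3_path)
    wav_path = mp3_path.with_suffix(".wav")
    # Same flags as the shell loop: 16-bit PCM audio, one (mono) channel.
    return ["ffmpeg", "-i", str(mp3_path),
            "-acodec", "pcm_s16le", "-ac", "1", str(wav_path)]


def convert_directory(directory):
    """Convert every .mp3 in `directory`; assumes ffmpeg is on the PATH."""
    for mp3_path in sorted(Path(directory).glob("*.mp3")):
        subprocess.run(build_ffmpeg_command(mp3_path), check=True)
```

Calling convert_directory("/path/to/directory/") would then process the folder one file at a time, stopping at the first file ffmpeg fails on.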
stevemclaugh / Gazette_of_India_scrape.py
Last active December 30, 2021 17:58
Scraping The Gazette of India with Selenium + ChromeDriver in Python
#!/usr/bin/python3
from selenium import webdriver
import time
import random
import os
import csv
url = 'http://egazette.bih.nic.in/SearchAdvanceGazette.aspx'

Install MongoDB and Python wrapper (Ubuntu Linux)

sudo apt-get install -y mongodb-org
pip install pymongo
pip3 install pymongo

Start MongoDB daemon

Download a list of URLs

wget --wait=0.2 --random-wait --no-check-certificate --page-requisites -erobots=off --tries="inf" -c --user-agent="Mozilla/5.0 (Windows NT 6.1; WOW64; rv:41.0) Gecko/20100101 Firefox/41.0" -i /path/to/list_of_urls.txt

Recursively download a full website

wget -r --wait=0.2 --random-wait --no-check-certificate --page-requisites -erobots=off --tries="inf" -c --user-agent="Mozilla/5.0 (Windows NT 6.1; WOW64; rv:41.0) Gecko/20100101 Firefox/41.0" http://principalhand.org

Scraping a page with a headless browser in Python: Selenium WebDriver + PhantomJS

Install dependencies in the bash shell

pip3 install -U selenium

# macOS
brew install phantomjs

Install MongoDB and Python wrapper

sudo apt-get install -y mongodb-org
python3 -m pip install -U pymongo

Start MongoDB daemon