Dan Nguyen dannguyen

@dannguyen
dannguyen / tx-dp-regex-religion.py
Last active April 26, 2024 10:03
Scraping and parsing the last words of Texas executed inmates for religious words; an exercise in webscraping and regexes
"""
Filter Texas executed inmates by whether any of their last words fit in a
list of words commonly associated with religion.
A quick demonstration of the overall patterns in web-scraping, including
using an HTML parser to navigate the DOM and the use of regexes for
hand-entered values. Does none of the file-caching/management that you should
be doing for such a task.
"""
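The filtering step described in the docstring above could look roughly like the sketch below. The word list, the function names, and the sample statements are assumptions for illustration; the gist's actual list and scraping code are not shown in this preview.

```python
import re

# Hypothetical word list; the gist's real list is not shown in this preview.
RELIGIOUS_WORDS = ["God", "Jesus", "Lord", "Allah", "pray", "heaven"]

# The leading \b keeps "Lord" from matching inside "landlord".
# The trailing \w* lets "pray" also match "prayer" and "praying"
# (it also lets "God" match "goddess", which may or may not be wanted).
PATTERN = re.compile(
    r"\b(?:" + "|".join(map(re.escape, RELIGIOUS_WORDS)) + r")\w*\b",
    re.IGNORECASE,
)


def has_religious_words(text):
    """Return True if any word from the list appears in the text."""
    return PATTERN.search(text) is not None


def filter_statements(statements):
    """Keep only the statements that contain at least one listed word."""
    return [s for s in statements if has_religious_words(s)]
```

Compiling one alternation up front avoids looping over the word list for every statement, and `re.IGNORECASE` copes with inconsistent capitalization in hand-entered text.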
@dannguyen
dannguyen / fetch_ghstars.md
Last active April 10, 2024 19:25
fetch_ghstars.py: quick CLI script to fetch from Github API all of a user's starred repos and save it as raw JSON and wrangled CSV

fetch_ghstars.py: quick CLI script to fetch and collate from Github API all of a user's starred repos

  • Requires Python 3.6+
  • Creates a subdir 'ghstars-USERNAME' at the current working directory
  • The raw JSON of each page request is saved as: 01.json, 02.json ... 0n.json
  • A flattened, filtered CSV is also created: wrangled.csv
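The file layout in the list above can be sketched as follows. This is not the script itself: the helper names and the choice of columns are assumptions, although the four field names do appear in GitHub's repository objects.

```python
import csv
import io


def page_filename(page_number):
    """Zero-padded name for the raw JSON of one page request, e.g. 01.json."""
    return f"{page_number:02d}.json"


# Hypothetical subset of columns; the script's real selection is not shown.
FIELDS = ["full_name", "html_url", "language", "stargazers_count"]


def wrangle(pages):
    """Flatten a list of pages (each a list of repo dicts) into CSV text."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    for page in pages:
        for repo in page:
            # Keep only the chosen fields; missing ones become empty cells.
            writer.writerow({key: repo.get(key) for key in FIELDS})
    return out.getvalue()
```

Keeping the raw pages on disk and deriving the CSV from them means the wrangling step can be rerun with different columns without hitting the API again.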

Example usage:

@dannguyen
dannguyen / pypy-print.py
Created February 4, 2016 16:49
the python print function
import sys


def print_(*args, **kwargs):
    """The new-style print function from py3k."""
    fp = kwargs.pop("file", sys.stdout)
    if fp is None:
        return

    def write(data):
        # basestring exists only in Python 2; this is backport code
        if not isinstance(data, basestring):
            data = str(data)
        fp.write(data)

    want_unicode = False
@dannguyen
dannguyen / schemacrawler-sqlite-macos-howto.md
Last active January 21, 2024 15:32
How to use schemacrawler to generate schema diagrams for SQLite from the commandline (Mac OS)
@dannguyen
dannguyen / README.md
Last active December 28, 2023 15:21
Using Python 3.x and Google Cloud Vision API to OCR scanned documents to extract structured data

Using Python 3 + Google Cloud Vision API's OCR to extract text from photos and scanned documents

Just a quickie test in Python 3 (using Requests) to see if Google Cloud Vision can be used to effectively OCR a scanned data table and preserve its structure, in the way that products such as ABBYY FineReader can OCR an image and provide Excel-ready output.

The short answer: No. While Cloud Vision provides bounding polygon coordinates in its output, it doesn't provide them at the word or region level, which would be needed to calculate the data delimiters.
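For reference, a request of the kind tested here could be assembled roughly as below. This is a minimal sketch, not the gist's code: the function name is an assumption, and only the payload is built so that nothing is sent over the network.

```python
import base64

# Public REST endpoint for Cloud Vision image annotation.
ENDPOINT = "https://vision.googleapis.com/v1/images:annotate"


def build_ocr_request(image_bytes):
    """Build the JSON body for a single-image TEXT_DETECTION request."""
    return {
        "requests": [
            {
                # The API expects the image as base64-encoded text.
                "image": {"content": base64.b64encode(image_bytes).decode("ascii")},
                "features": [{"type": "TEXT_DETECTION"}],
            }
        ]
    }
```

With Requests, the body would typically be posted with something like `requests.post(ENDPOINT, params={"key": api_key}, json=build_ocr_request(data))`, where `api_key` and `data` are placeholders for your own key and image bytes.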

On the other hand, the OCR quality is pretty good, if you just need to identify text anywhere in an image, without regard to its physical coordinates. I've included two examples:

1. A low-resolution photo of road signs

@dannguyen
dannguyen / wget-snapshotpage.md
Last active December 25, 2023 20:57
Use wget to snapshot a page and its necessary visual dependencies

Use wget to mirror a single page and its visible dependencies (images, styles)

Graphic via State of Florida CFO Vendor Payment Search (flair.myfloridacfo.com)

This is a quick command I use to snapshot webpages that have a fun image I want to keep for my own collection of WTFViz. Why not just right-click and save the image? Oftentimes, the webpage in which the image is embedded contains necessary context, such as captions and links to important documentation, just in case you forget what exactly that fun graphic was trying to explain.
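The command itself is not shown in this preview. As a rough sketch of the idea, the Python helper below assembles a wget invocation from standard wget options for fetching a page along with its images and styles. The helper name, the default directory, and the particular combination of flags are assumptions, not necessarily what the gist uses.

```python
import shlex


def snapshot_command(url, dest_dir="snapshot"):
    """Build a wget argument list for saving one page and its assets."""
    return [
        "wget",
        "--page-requisites",   # also fetch images, CSS and other page assets
        "--convert-links",     # rewrite links so the saved copy works offline
        "--adjust-extension",  # add .html/.css suffixes where they are missing
        "--span-hosts",        # allow assets served from other hosts (CDNs)
        "--directory-prefix", dest_dir,
        url,
    ]


# Show the command as it would be typed in a shell.
print(" ".join(shlex.quote(arg) for arg in snapshot_command("https://example.com/page.html")))
```

Passing the list to `subprocess.run` would execute it; building it as a list rather than a string avoids shell-quoting problems with unusual URLs.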

@dannguyen
dannguyen / ec2-centos-ruby-rvm-nginx-passenger.md
Last active November 27, 2023 15:43
Setting up Ruby 1.9.3 stable, RVM, nginx, passenger on Amazon Linux AMI (CentOS)

Ruby 1.9.3 stable, RVM, nginx, passenger on Amazon Linux AMI (CentOS, 03-2013)

This combines the instructions on a few different tutorials:

@dannguyen
dannguyen / guardian-articles-day-api.md
Last active November 23, 2023 12:28
How to use The Guardian's API to download article data for content analysis (in Python 3.x)

The Guardian offers an API as deep and robust as the New York Times Article API when it comes to content analysis.

The Guardian's API offers more than "1.7 million pieces of content", with published items as far back as 1999. You can register as a developer here, which gets you 5,000 API hits a day and an API key that looks something like this:

zzzyyyyy-9a9z-999z-z999-9e8a83922516

The Guardian has a handy interactive explorer for tweaking the query parameters.
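A query for one day's worth of content might be built along these lines. This is a sketch under assumptions: the function name and defaults are made up for illustration, while the endpoint and the `from-date`, `to-date`, `page`, `page-size` and `api-key` parameters come from the Guardian's public Content API.

```python
from urllib.parse import urlencode

SEARCH_ENDPOINT = "https://content.guardianapis.com/search"


def search_url(api_key, day, page=1, page_size=50):
    """Build a search URL limited to a single day (YYYY-MM-DD)."""
    params = {
        "from-date": day,
        "to-date": day,
        "page": page,
        "page-size": page_size,
        "api-key": api_key,
    }
    return SEARCH_ENDPOINT + "?" + urlencode(params)
```

Requesting one day at a time and stepping through `page` keeps each response small, which helps when you are working within a daily quota of API hits.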

@dannguyen
dannguyen / aws-textract-sample-readme.md
Last active October 30, 2023 05:49
A gist of AWS Textract sample/demo data for easy reference and preview, in case you're curious how well Amazon does when it comes to pdf-to-csv

AWS Textract -- sample document image and data from the official demo

AWS Textract is now out of closed beta. You can read the features page here, and you can also read about its limits here (e.g. no handwriting). Basically, if you've ever had to deal with the hell of getting structured data out of a PDF (scanned image or not), Textract is aiming for your business:

This short gist contains some of my brief observations about Textract and its demo, as well as direct links to the most relevant and important files, such as the Textract demo sample image and the resulting data files from Textract's API. If you have an AWS account, I h

@dannguyen
dannguyen / catdrawer-youtube-to-gif-README.md
Last active September 20, 2023 21:02
Using youtube-dl and gifify from the command-line to make a cat gif