Chris Zubak-Skees (chriszs)

@dannguyen
dannguyen / README.md
Last active May 17, 2024 02:07
Using Python 3.x and Google Cloud Vision API to OCR scanned documents to extract structured data

Using Python 3 + Google Cloud Vision API's OCR to extract text from photos and scanned documents

Just a quickie test in Python 3 (using Requests) to see if Google Cloud Vision can be used to effectively OCR a scanned data table and preserve its structure, in the way that products such as ABBYY FineReader can OCR an image and provide Excel-ready output.
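For context, here is a minimal sketch of the kind of request such a test makes with Requests against the Vision API's `images:annotate` REST endpoint. It assumes an API key in a `GOOGLE_API_KEY` environment variable and a local image file; the file name is a placeholder and the gist's actual script may differ.

```python
import base64
import os

import requests

API_URL = "https://vision.googleapis.com/v1/images:annotate"

def ocr_image(path, api_key):
    """Send one image to Cloud Vision's TEXT_DETECTION feature and return the response JSON."""
    with open(path, "rb") as f:
        content = base64.b64encode(f.read()).decode("utf-8")
    payload = {
        "requests": [
            {
                "image": {"content": content},
                "features": [{"type": "TEXT_DETECTION"}],
            }
        ]
    }
    resp = requests.post(API_URL, params={"key": api_key}, json=payload)
    resp.raise_for_status()
    return resp.json()

if __name__ == "__main__":
    # "scanned-table.jpg" is a placeholder file name, not from the gist
    result = ocr_image("scanned-table.jpg", os.environ["GOOGLE_API_KEY"])
    # fullTextAnnotation.text is the flattened OCR text for the whole image
    print(result["responses"][0]["fullTextAnnotation"]["text"])
```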

The short answer: No. While Cloud Vision provides bounding polygon coordinates in its output, it doesn't provide them at the word or region level, which would be needed to calculate the data delimiters.

On the other hand, the OCR quality is pretty good if you just need to identify text anywhere in an image, without regard to its physical coordinates. I've included two examples:

#### 1. A low-resolution photo of road signs

@veltman
veltman / README.md
Created October 10, 2016 16:08
Geosupport w/ JS and node-ffi

Geocoding 10,000 addresses a second with NYC's Geosupport library and Node FFI

Following on Chris Whong's excellent writeup of how to make calls directly to NYC's Geosupport client and this first attempt at generalizing it, here's a way that let me geocode about 10,000 addresses a second on Ubuntu using Node FFI.

Note: this assumes Ubuntu - other Linux distributions will probably work but may need adjustments.

First, install the basics:

# Update package lists, then install Node and unzip (if needed)
sudo apt-get update && sudo apt-get install -y nodejs npm unzip
@mbostock
mbostock / .block
Last active November 13, 2016 21:45
U.S. Atlas, Redux [UNLISTED]
license: bsd-3-clause
@duner
duner / README.md
Last active April 28, 2022 19:48
Twitter Archive to JSON

If you download your personal Twitter archive, you don't quite get the data as JSON, but as a series of .js files, one for each month (these are meant to replicate the Twitter API responses for the front-end part of the downloadable archive).

But if you want to use the data in those files (which is far richer than the CSV data) for analysis or an app, just run this script.

Run sh ./twitter-archive-to-json.sh in the same directory as the /tweets folder that comes with the archive download, and you'll get two files:

  • tweets.json — a JSON list of the objects
  • tweets_dict.json — a JSON dictionary where each Tweet's key is its id_str

You'll also get a /json-tweets directory which has the individual JSON files for each month of tweets.
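For comparison, here is a rough Python equivalent of that conversion. This is not the gist's shell script; it assumes the classic archive layout in which each monthly .js file under /tweets contains a JSON array preceded by a single JavaScript assignment on its first line.

```python
import glob
import json

tweets = []
for path in sorted(glob.glob("tweets/*.js")):
    with open(path, encoding="utf-8") as f:
        raw = f.read()
    # Drop the leading "Grailbird.data.tweets_YYYY_MM =" assignment,
    # keeping just the JSON array that follows it.
    tweets.extend(json.loads(raw[raw.index("["):]))

# tweets.json: a flat JSON list of tweet objects
with open("tweets.json", "w", encoding="utf-8") as f:
    json.dump(tweets, f, indent=2)

# tweets_dict.json: a dictionary keyed by each tweet's id_str
with open("tweets_dict.json", "w", encoding="utf-8") as f:
    json.dump({t["id_str"]: t for t in tweets}, f, indent=2)
```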

@thomaswilburn
thomaswilburn / index.js
Last active July 22, 2017 21:23
ASP page scraper with comments
// Built-in modules
var fs = require("fs");
var url = require("url");
// Loaded from NPM
var csv = require("csv"); // CSV parsing and stringifying
var $ = require("cheerio"); // jQuery-like DOM library
var async = require("async"); // Easier concurrency utils
var request = require("request"); // Make HTTP requests simply
@emanuelfeld
emanuelfeld / gi-lf
Last active April 24, 2017 14:26
As a pre-commit script, automatically add files larger than a given size to your repository's .git/info/exclude file
#!/bin/bash
# Save as .git/hooks/pre-commit (and make it executable) to run before every commit.
# set max file size to include (in MB)
max_size_mb=100
max_size_b="$(($max_size_mb * 1000000))c"
git_dir="$(git rev-parse --show-toplevel)"
git_exclude=$git_dir/.git/info/exclude
files="$(find $git_dir -path $git_dir/.git -prune -o -type f -size +$max_size_b -print | sed "s%$git_dir/%%g" | sed "s/\ /\\\ /g")"
# Append any oversized files that aren't already listed in .git/info/exclude
echo "$files" | while read -r f; do
    [ -n "$f" ] && ! grep -qxF "$f" "$git_exclude" && echo "$f" >> "$git_exclude"
done
# Never block the commit itself
exit 0
from collections import Counter

import pandas as pd

df = pd.read_hdf('training.h5')
g = df.groupby('slug')

def get_sample(slug):
    # return all rows for one slug (.loc replaces the long-deprecated .ix)
    return df.loc[g.groups[slug]]
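Usage is then just a lookup by slug (the 'some-slug' value here is a placeholder, not from the original snippet):

```python
sample = get_sample('some-slug')  # all training rows for that slug
print(len(sample), 'rows')
```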
@tmcw
tmcw / optimization.md
Last active February 14, 2021 14:38
Optimization

Correctly prioritizing and targeting performance problems and optimization opportunities is one of the hardest things to master in programming. There are a lot of ways to do it wrong: by prematurely optimizing non-bottlenecks, or preferring fast solutions to clear solutions, or measuring problems incorrectly.

I'll try to summarize what I've learned about doing this right.

First, don't optimize until there's an issue. And issues should be defined as application issues: performance problems that are either detectable by the users (lag) or endanger the platform – i.e. problems that cause downtime, like out-of-memory issues. Until there's an issue, don't think about performance at all: just solve the problem at hand, which is "creating value for the end-user," or some less-corporate translation of the same.

Second, only optimize with instruments. By instruments, I mean technology that lets you decipher which sub-part of the stack is the bottleneck. Let's say you see slowness around fetching…
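To make "optimize with instruments" concrete, here is a small illustration (mine, not part of the gist) using Python's built-in cProfile to see where the time actually goes before touching any code:

```python
import cProfile
import pstats

def slow_parse(lines):
    # Deliberately wasteful: builds the result by repeated string concatenation
    out = ""
    for line in lines:
        out += line.upper()
    return out

def run():
    lines = ["x" * 50 for _ in range(20000)]
    slow_parse(lines)

# Profile the call and print the functions where time was actually spent,
# rather than guessing at the bottleneck.
cProfile.run("run()", "profile.out")
stats = pstats.Stats("profile.out")
stats.sort_stats("cumulative").print_stats(5)
```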

@sindresorhus
sindresorhus / esm-package.md
Last active June 9, 2024 17:19
Pure ESM package

The package that linked you here is now pure ESM. It cannot be require()'d from CommonJS.

This means you have the following choices:

  1. Use ESM yourself. (preferred)
    Use import foo from 'foo' instead of const foo = require('foo') to import the package. You also need to put "type": "module" in your package.json and more. Follow the guide below.
  2. If the package is used in an async context, you could use await import(…) from CommonJS instead of require(…).
  3. Stay on the existing version of the package until you can move to ESM.