# Google Vision for OCR
This is a step-by-step guide to using Google Vision to identify and recognize text in document images. There are many ways to do OCR; this is the best method I have found so far.

The general documentation for Vision can be found here: https://cloud.google.com/vision/
Before using any of the scripts below, you'll need to create a Google Cloud account, create a project, and enable the Vision and Natural Language APIs on that project. In APIs & services, open the credentials tab to create credentials: select service account key, your project name, and JSON, then click on create.
A file should download to your machine. If you get stuck, there's more information here: https://cloud.google.com/docs/authentication/api-keys
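
Google's client libraries usually locate a service account key through the `GOOGLE_APPLICATION_CREDENTIALS` environment variable. The following is a minimal sketch of setting that variable from Python; the function name `use_service_account_key` and the example path are assumptions made for illustration, not part of the original scripts.

```python
import json
import os
from pathlib import Path


def use_service_account_key(key_path):
    """Point Google client libraries at a downloaded service account key.

    Hypothetical helper: resolves the path, checks that the file looks like
    a service account key, and exports GOOGLE_APPLICATION_CREDENTIALS.
    """
    path = Path(key_path).expanduser().resolve()
    with path.open(encoding="utf-8") as handle:
        key = json.load(handle)
    # Service account key files carry "type": "service_account".
    if key.get("type") != "service_account":
        raise ValueError(f"{path} does not look like a service account key")
    os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = str(path)
    return path
```

Calling `use_service_account_key("~/keys/vision-key.json")` (an assumed location) before creating a Vision client would let the client pick up the credentials without the path being hard-coded elsewhere.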

Move the JSON key to a safe place and remember the path to that file. I find it helpful to navigate in the terminal to the directory containing the file and then enter pwd. This will show the location of the file (suc

@apjanco
apjanco / spaCy.md
Created March 4, 2019 20:30
Workshop proposal for DH2019

Introduction to natural language processing for DH research with spaCy - A fast and accessible library that integrates modern machine learning technology.

This half-day tutorial will introduce DH scholars to spaCy, a free and open-source library for text analysis. Developed by Matthew Honnibal and Ines Montani in Berlin, spaCy offers a suite of tools for applied natural language processing (NLP) that are fast and practical and that allow for quick experimentation and evaluation of language models. These tools make it possible for individual scholars to quickly train models that can infer customized categories in named entity recognition tasks, match phrases, and visualize model performance. While comparable to the Natural Language Toolkit (NLTK), spaCy offers neural network models, integrated word vectors, dependency parsing, and a variety of new features that are not available elsewhere. Participants will learn how to use spaCy for common research tasks in the Digital Humanities and gain an understanding of how

# https://gist.github.com/zupo/5849843
import argparse
import os
import shutil

N = 1000000  # the number of files in each subfolder


def make_files_list(abs_dirname):
    files = []
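
The fragment above breaks off before any files are moved. Below is a minimal sketch of the task its imports and the constant `N` point to: moving the files in one directory into numbered subfolders holding at most a fixed number of files each. The function name `split_into_subfolders` and the numbering scheme are assumptions, not a reconstruction of the original script.

```python
import shutil
from pathlib import Path


def split_into_subfolders(dirname, files_per_folder):
    """Move files in dirname into subfolders named 1, 2, 3, ...

    Hypothetical helper: each subfolder receives at most
    files_per_folder files, taken in sorted filename order.
    """
    root = Path(dirname)
    # List the files first so newly created subfolders are never picked up.
    files = sorted(p for p in root.iterdir() if p.is_file())
    created = []
    for index, path in enumerate(files):
        subfolder = root / str(index // files_per_folder + 1)
        if subfolder not in created:
            subfolder.mkdir(exist_ok=True)
            created.append(subfolder)
        shutil.move(str(path), str(subfolder / path.name))
    return created
```

With five files and `files_per_folder=2`, this sketch would create three subfolders holding two, two, and one file.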

Schedule survey of features most relevant to work with TEI

9:00-10:45

  • Intro to spaCy (Andy)
    • Linguistic Features
    • Rule-based Matching
    • NER (w/ pre-trained models)
    • displaCy
  • Comparison of available models
apjanco / PH_proposal.md
Last active April 20, 2020 18:02
Proposal for Find all the Places in Text with the World-Historical Gazetteer

Programming Historian Lesson Proposal

If you are interested in writing a lesson and submitting it to the Programming Historian, please fill in this form to give the Editorial Board enough detail to comment on your idea. If you are experiencing difficulty with the form you can contact our Managing Editor directly:

  • English: Anandi Silva Knuppel (anandi.silva.knuppel@emory.edu)
  • Spanish: Maria José Afanador Llach (mj.afanador28@uniandes.co)
  • French: Sofia Papastamkou (spapastamkou@gmail.com)

About You

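[Unit]
Description=gunicorn daemon
After=network.target

[Service]
# Replace each bracketed placeholder with a real value before installing the unit.
User=[your_user, say www-data]
Group=www-data
WorkingDirectory=[path to app directory]
Environment="PATH=[myvenv/bin]"
# Run the ASGI object "app" from main.py with four Uvicorn workers behind a Unix socket.
ExecStart=[myvenv/bin/gunicorn] --access-logfile - --workers 4 -k uvicorn.workers.UvicornWorker --bind unix:/tmp/myapp.sock main:app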
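
The `main:app` argument in the unit above tells gunicorn to import an object named `app` from a module named `main` in the working directory. As a minimal sketch of what such a `main.py` could contain, here is a bare ASGI callable with no framework; the response text is an assumption, and a real application would more likely come from a framework such as FastAPI or Starlette.

```python
async def app(scope, receive, send):
    """Answer every HTTP request with a short plain-text body."""
    if scope["type"] != "http":
        # This sketch ignores lifespan and websocket scopes.
        return
    await send({
        "type": "http.response.start",
        "status": 200,
        "headers": [(b"content-type", b"text/plain; charset=utf-8")],
    })
    await send({
        "type": "http.response.body",
        "body": b"Hello from gunicorn",
    })
```

Because the callable follows the ASGI interface, the `uvicorn.workers.UvicornWorker` class named in `ExecStart` should be able to serve it unchanged.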

Andrew Janco Professional Interactions

9 Rich Freedman, presentation at DH2019, regular CRIM project meetings
4 Darin Hayton, regular project meetings, GreekPal
1 Yvette Granata, consultation
3 Kathryne Corbin, consultation, class instruction on Omeka
2 Jane Chandlee, taught class session on applied NLP
8 Nimisha Ladva, writing program instruction sessions
4 Sarah Watson, S.C. Kaplan, regular project meetings, Books of Duchesses
8 Jake Culbertson, regular co-teaching