@dannguyen
dannguyen / README.md
Last active May 17, 2024 02:07
Using Python 3.x and Google Cloud Vision API to OCR scanned documents to extract structured data

Using Python 3 + Google Cloud Vision API's OCR to extract text from photos and scanned documents

Just a quickie test in Python 3 (using Requests) to see if Google Cloud Vision can be used to effectively OCR a scanned data table and preserve its structure, in the way that products such as ABBYY FineReader can OCR an image and provide Excel-ready output.

The short answer: No. While Cloud Vision provides bounding polygon coordinates in its output, it doesn't provide them at the word or region level, which would be needed to then calculate the data delimiters.
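For reference, here is a minimal sketch of what such a test might look like. It assumes the public `images:annotate` REST endpoint with an API key and a single `TEXT_DETECTION` feature; the helper names (`build_ocr_request`, `extract_full_text`, `ocr_image`) are made up for illustration and are not part of any library.

```python
import base64

# Public REST endpoint for Cloud Vision image annotation.
VISION_ENDPOINT = "https://vision.googleapis.com/v1/images:annotate"


def build_ocr_request(image_bytes, max_results=1):
    """Build the JSON body for a single-image TEXT_DETECTION request."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return {
        "requests": [
            {
                "image": {"content": encoded},
                "features": [{"type": "TEXT_DETECTION", "maxResults": max_results}],
            }
        ]
    }


def extract_full_text(response_json):
    """Return the first text annotation's description, or "" if none exists."""
    responses = response_json.get("responses") or [{}]
    annotations = responses[0].get("textAnnotations", [])
    return annotations[0]["description"] if annotations else ""


def ocr_image(path, api_key):
    """Send one image file to Cloud Vision and return the detected text."""
    import requests  # third-party; imported here so the helpers above work without it

    with open(path, "rb") as f:
        body = build_ocr_request(f.read())
    resp = requests.post(VISION_ENDPOINT, params={"key": api_key}, json=body)
    resp.raise_for_status()
    return extract_full_text(resp.json())
```

Calling `ocr_image("signs.jpg", api_key)` (a hypothetical file name) would return the text as one string, which is enough for the "identify text anywhere in an image" use case but says nothing about table structure.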

On the other hand, the OCR quality is pretty good if you just need to identify text anywhere in an image, without regard to its physical coordinates. I've included two examples:

#### 1. A low-resolution photo of road signs

import React from "react";
import { render } from "react-dom";
const ParentComponent = React.createClass({
getDefaultProps: function() {
console.log("ParentComponent - getDefaultProps");
},
getInitialState: function() {
console.log("ParentComponent - getInitialState");
return { text: "" };
@primaryobjects
primaryobjects / classifytext.R
Last active August 9, 2020 19:54
Simple example of classifying text in R with machine learning (text-mining library, caret, and bayesian generalized linear model). Classify. tfidf tdm term document matrix
library(caret)
library(tm)
# Training data.
data <- c('Cats like to chase mice.', 'Dogs like to eat big bones.')
corpus <- VCorpus(VectorSource(data))
# Create a document term matrix.
tdm <- DocumentTermMatrix(corpus, list(removePunctuation = TRUE, stopwords = TRUE, stemming = TRUE, removeNumbers = TRUE))
@kane-thornwyrd
kane-thornwyrd / exemple.js
Last active April 17, 2019 03:19
How to browse and tweak objects using a string path. 😄 (requires Underscore.js for `_.isString`)
var target = {
foo: {
bar: {
baz: [
'madness'
]
}
}
};
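The gist preview is cut off before its path helpers, so here is a rough Python sketch of the same idea: walking a nested structure with a dotted string path. The names `get_path` and `set_path` and the `.` separator are assumptions for illustration, not the gist's actual API.

```python
def get_path(obj, path, sep="."):
    """Follow a dotted path such as 'foo.bar.baz.0' through dicts and lists."""
    current = obj
    for key in path.split(sep):
        if isinstance(current, list):
            current = current[int(key)]  # list segments are numeric indexes
        else:
            current = current[key]
    return current


def set_path(obj, path, value, sep="."):
    """Replace the value at a dotted path; every parent must already exist."""
    *parents, last = path.split(sep)
    container = get_path(obj, sep.join(parents), sep) if parents else obj
    if isinstance(container, list):
        container[int(last)] = value
    else:
        container[last] = value


target = {"foo": {"bar": {"baz": ["madness"]}}}
print(get_path(target, "foo.bar.baz.0"))
set_path(target, "foo.bar.baz.0", "sparta")
print(target)
```

A missing key or out-of-range index raises `KeyError` or `IndexError` here; returning a default instead would be an equally reasonable design.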
@mbejda
mbejda / 10000-MTV-Music-Artists-page-1.csv
Last active May 16, 2024 01:48
10,000 MTV's Top Music Artists. Great dataset for machine learning, research and analysis. (name,facebook,twitter,website,genre,mtv).
name,facebook,twitter,website,genre,mtv
Adele,http://www.facebook.com/9770929278,http://www.twitter.com/officialadele,,Pop,http://www.mtv.com/artists/adele/biography
Joey + Rory,http://www.facebook.com/15044507815,http://www.twitter.com/joeyandrory,,Country,http://www.cmt.com/artists/joey-rory/biography
Draaco Aventura,http://www.facebook.com/856796091053581,http://www.twitter.com/DraacoAventura,http://www.bandpage.com/draacoaventura,Pop Latino,http://www.mtv.com/artists/draaco-aventura/biography
Justin Bieber,http://www.facebook.com/309570926875,http://www.twitter.com/justinbieber,http://www.justinbiebermusic.com,Pop,http://www.mtv.com/artists/justin-bieber/biography
Peer van Mladen,http://www.facebook.com/264487966,http://www.twitter.com/Predrag_Jugovic,http://pejaintergroup.eu/Peer_van_Mladen.html,House,http://www.mtv.com/artists/peer-van-mladen/biography
Chris Janson,http://www.facebook.com/296647641825,http://www.twitter.com/janson_chris,http://www.chrisjanson.com,Country,http://www.cmt.com/a
@thiloplanz
thiloplanz / Zero_knowledge_db.md
Last active June 2, 2024 16:40
Zero-knowledge databases

Zero knowledge databases

The idea

The idea is to provide a database as a service to end users in such a way that no one except the user herself can access the data, not even the hosting provider or the database administrator.

Advantages

  • A privacy- and/or security-conscious user will have more trust in such a setup.
  • The service provider cannot be coerced into releasing the data they were entrusted with, and cannot be held responsible for the content they store.
@glamp
glamp / customer-segmentation.py
Last active April 30, 2020 13:40
Analysis for customer segmentation blog post
import pandas as pd
# http://blog.yhathq.com/static/misc/data/WineKMC.xlsx
df_offers = pd.read_excel("./WineKMC.xlsx", sheet_name=0)
df_offers.columns = ["offer_id", "campaign", "varietal", "min_qty", "discount", "origin", "past_peak"]
df_offers.head()
df_transactions = pd.read_excel("./WineKMC.xlsx", sheet_name=1)
df_transactions.columns = ["customer_name", "offer_id"]
df_transactions['n'] = 1
df_transactions.head()
@PurpleBooth
PurpleBooth / README-Template.md
Last active June 18, 2024 13:10
A template to make good README.md

Project Title

One Paragraph of project description goes here

Getting Started

These instructions will get you a copy of the project up and running on your local machine for development and testing purposes. See deployment for notes on how to deploy the project on a live system.

Prerequisites

@alediaferia
alediaferia / tiny_uploader.js
Last active November 27, 2022 01:36
A tiny snippet for reading files chunk by chunk in plain JavaScript
/*
Copyright (c) 2015-2020 Alessandro Diaferia
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
@mininao
mininao / Basic algorithm
Created May 22, 2015 20:56
LED lighting algorithms
for (int i = 0; i < length; ++i)
{
struct LED current_led = leds[i];
if(current_led.color.red != 0) {
on(current_led.pos, true, false,false);
delayMicroseconds(current_led.color.red);
off(current_led.pos);
}