Skip to content

Instantly share code, notes, and snippets.

View PureMath86's full-sized avatar

Bryan Arnold PureMath86

View GitHub Profile
@PureMath86
PureMath86 / pdf_table_with Tesseract
Created September 27, 2017 22:23 — forked from jaganadhg/pdf_table_with Tesseract
Extract Data from PDF table using Python Image. Image Magick and tesseract
#Refer http://craiget.com/extracting-table-data-from-pdfs-with-ocr/
import Image, ImageOps
import subprocess, sys, os, glob
# minimum run of adjacent pixels to call something a line
H_THRESH = 300
V_THRESH = 300
def get_hlines(pix, w, h):
"""Get start/end pixels of lines containing horizontal runs of at least THRESH black pix"""

API workthough

  1. Open a browser

    # start an instance of firefox with selenium-webdriver
    driver = Selenium::WebDriver.for :firefox
    # :chrome -> chrome
    # :ie     -> iexplore
    
  • Go to a specified URL
# List unique values in a DataFrame column
# h/t @makmanalp for the updated syntax!
df['Column Name'].unique()
# Convert Series datatype to numeric (will error if column has non-numeric values)
# h/t @makmanalp
pd.to_numeric(df['Column Name'])
# Convert Series datatype to numeric, changing non-numeric values to NaN
# h/t @makmanalp for the updated syntax!
@PureMath86
PureMath86 / pdf-report.py
Created November 13, 2018 23:20 — forked from chris1610/pdf-report.py
PDF Report Generation - pbpython.com
"""
Generate PDF reports from data included in several Pandas DataFrames
From pbpython.com
"""
from __future__ import print_function
import pandas as pd
import numpy as np
import argparse
from jinja2 import Environment, FileSystemLoader
from weasyprint import HTML