@b-meson
Created May 11, 2016 17:08
1505 analysis using OCR
b_meson [3:22 PM]
I'm basically splitting the documents here: https://github.com/freddymartinez9/miscfoiamirror/tree/master/1505depositsanalysis, then doing a manual OCR on each document to get a text file out of it, and then comparing the PDF and the text file side by side to get an average expenditure
(yes they say deposits but it's really an outgoing expense)
so there is a python program here, checkreduction.py, that has these two lines
```
foia_response = '1505deposists1.49.pdf'
png_response = '1505deposists1.49.png'
```
where it reads those two files in and spits out a text file that is just the expenses, which we have to compare side by side to the PDF.
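Once the amounts have been checked against the PDF, the averaging step could look roughly like the sketch below. It is not part of checkreduction.py; the function names `parse_amount` and `average_expenditure` and the sample values are made up for illustration, and it assumes the verified amounts are strings like `$1,110.00`, possibly wrapped in parentheses.

```python
from decimal import Decimal


def parse_amount(text):
    """Convert a verified amount string such as '($1,110.00)' to a Decimal.

    Assumption: amounts use '$', thousands commas, and optional parentheses.
    """
    cleaned = text.strip().strip("()").replace("$", "").replace(",", "")
    return Decimal(cleaned)


def average_expenditure(amounts):
    """Return the mean of a non-empty list of amount strings."""
    values = [parse_amount(a) for a in amounts]
    return sum(values) / len(values)


# Hypothetical, already-verified values (not taken from the FOIA documents).
verified = ["$400.00", "$1,110.00", "($235.00)"]
print(average_expenditure(verified))
```

Decimal is used instead of float so that dollar amounts are not subject to binary rounding.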
b_meson [11:38 AM]
so I went ahead and ran the check analysis on
```
foia_response = '1505deposists2.1.pdf'
png_response = '1505deposists2.1.png'
```
```
python checksreductions.py
[('Departmeni', False), ('&', False), ('null', False), ('to', False), ('null', False), ('null', False), ('null', False), ('3303.92', False), ('3135.73', False), ('3150.42', False), ('393.95', False), ('3154.55', False), ('372.07', False), ('3400.00', False), ('30.00', False), ('30.00', False), ('3437.51', False), ('314,023.30', False), ('35,535.00', False), ('33,200.00', False), ('3193.35', False), ('3235.00', False), ('3235.00', False), ('31,110.00', False), ('3575.00', False), ('31,354.22', False), ('3574.47', False), ('3500.00', False), ('3235.00', False), ('325.00', False), ('342,032.02', False), ('3150000.00', False), ('3129509.30', False), ('33,500.00', False), ('3340.00', False), ('31,533.00', False), ('32,334.17', False), ('charges', False), ('3400.00', False), ('31,353.00', False), ('3121.50', False), ('3143.50', False), ('35,313.75', False), ('null', False), ('of', False)]
Outputting data in 1505deposists2.1
```
Immediately I can see one error: the first result, 3303.92, should be 308.92. This is because OCR can't deal with parentheses or $ and confuses them with 3's and 6's.
the "False" is whether they are flagged for review; almost all of these should be flagged. I'm fixing that in the code.
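To illustrate the leading-3 problem, here is a minimal sketch that proposes a guess for tokens where a `$` may have been read as a `3`. It is not taken from the script; the names `AMOUNT`, `looks_like_amount`, and `guess_dollar_sign` are hypothetical, and the regex is only an assumption about what an amount looks like in the OCR output.

```python
import re

# Assumed shape of an amount token: digits with optional commas, two decimals.
AMOUNT = re.compile(r"^\d[\d,]*\.\d{2}$")


def looks_like_amount(token):
    """True if the token has the assumed digits-and-decimals shape."""
    return bool(AMOUNT.match(token))


def guess_dollar_sign(token):
    """Guess the amount if a leading '3' was really a misread '$'.

    Returns the guess as a string, or None when the token does not fit.
    """
    if looks_like_amount(token) and token[0] == "3":
        rest = token[1:]
        if looks_like_amount(rest):
            return "$" + rest
    return None


for token in ["Departmeni", "null", "3303.92", "314,023.30", "charges"]:
    print(token, "->", guess_dollar_sign(token))
```

For `3303.92` the guess is `$303.92`, which still differs from the 308.92 in the PDF, so a guess like this only narrows down what to look at; the side-by-side check is still needed.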
lincolnb [11:49 AM]
ok makes sense. under what conditions does that evaluate to ‘true’?
b_meson [11:53 AM]
well in set 1, it was ``if (amounts[:2]=='(0' or amounts=='($1,') or amounts[0]=='3'`` (line 35 in checkreduction.py), because it would be true when the first two characters were rendering as (0 or as an empty set with one dollar. the last or condition is a work in progress to get it to trigger locally. it's not working...
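As a standalone check, the quoted condition can be wrapped in a small function and run against a few tokens from the output above. This is only a sketch, not the code from checkreduction.py: the name `needs_review` is hypothetical, and `amount[:1]` replaces `amounts[0]` so an empty string does not raise an IndexError.

```python
def needs_review(amount):
    """Standalone version of the condition quoted in the chat.

    Flags tokens starting with '(0', equal to '($1,', or starting with '3'.
    """
    return (amount[:2] == "(0" or amount == "($1,") or amount[:1] == "3"


tokens = ["Departmeni", "null", "3303.92", "372.07", "(0"]
flags = [(t, needs_review(t)) for t in tokens]
print(flags)
# [('Departmeni', False), ('null', False), ('3303.92', True), ('372.07', True), ('(0', True)]
```

In isolation the third clause does flag tokens that start with 3, so whatever keeps the flags at False in the real run is probably somewhere else in the script and is not visible from this conversation.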
lincolnb [11:54 AM]
alright. so what action should be taken when the OCR has missed something?
b_meson [11:57 AM]
well the "true" flags are the ones that need manual verification, which means comparing the PDF to the text output and fixing the amounts manually. unfortunately it looks like _everything_ should be triggering based on this code, so I'm trying to hack on that.