@jxramos
Last active June 27, 2022 16:59
Compiling a bunch of resources that serve or discuss receipt digitization and processing.

Receipt Processing

Neat

They used to sell an all-in-one scanner product; they now appear to focus on software and cloud processing solutions: https://www.neat.com/track-receipts/

Their software can drive scanners that support the TWAIN API: https://en.wikipedia.org/wiki/TWAIN

They shifted their focus from individuals to small businesses, likely because that market segment is larger. They used to have an offline solution that bundled the scanner and software in one package. Now they seem to be more cloud oriented.

https://www.amazon.com/NeatReceipts-Mobile-Scanner-Digital-Filing/dp/B001CQFRPO

Lots of folks complain that the hardware and software have gone into a legacy state and are no longer supported. Others complain about being redirected to subscription-based cloud solutions, driver downloads being removed, etc.

Interesting feature to map to the IRS tax categories used in Schedules A, B, and C.

Microsoft

Microsoft flexing some prebuilt AI models shared publicly through proprietary tools.

https://docs.microsoft.com/en-us/ai-builder/prebuilt-receipt-processing

Demo video Receipt Processing using AI Builder in Power Apps

There seem to be two routes of usage: Power Apps and Power Automate.

Use the receipt processing prebuilt model in Power Automate: https://docs.microsoft.com/en-us/ai-builder/flow-receipt-processing This is a cloud-based solution: you upload the receipt artifact and it pushes the parsed results to an online Excel spreadsheet.

ScanSnap

ScanSnap Receipts scanning demo on YouTube that shows a bit of the hardware and software. It walks through the parsed columns, and in the high-definition video you can read the receipt text scrolling in the app and see that it mostly matches the parsed data. Not a perfect job, but it looks like a good starting point. https://www.youtube.com/watch?v=yuaToPhDT34

Official Product Page
Hardware: https://www.fujitsu.com/us/products/computing/peripheral/scanners/soho/
Software: https://www.fujitsu.com/us/products/computing/peripheral/scanners/soho/sshome/
Specifications: https://www.fujitsu.com/us/products/computing/peripheral/scanners/soho/sshome/#tab-b-03

Has assignable configurations per document source that remember which fields go into the file names and other data fields. Configurations can be recalled from the touch screen to swap to other settings, so it is essentially a semi-automated approach.

Interesting comparison to VueScan. Many people lost support for their old scanners when 32-bit support was dropped. Folks began to look elsewhere, and one of the classic scanning apps, VueScan, is evaluated in this article: https://tidbits.com/2019/12/02/vuescan-not-the-scansnap-replacement-youre-looking-for/

Tesseract

Google-sponsored open source OCR project (originally developed at HP). It has gone through many iterations and refinements, and now uses an LSTM-based deep learning engine to improve recognition.

Receipt applications with Tesseract: https://stackoverflow.com/questions/31633403/tesseract-receipt-scanning-advice-needed

11 questions tagged both receipt and ocr. https://stackoverflow.com/search?q=%5Breceipt%5D%5Bocr%5D&searchOn=3

Ability to train your own custom model for your own language. Differentiates between fixed-width and non-fixed-width fonts.

Nanonets

Great blog post about receipt digitization workflow, processes, and background. https://nanonets.com/blog/receipt-ocr/

Mentions downloading the Tesseract binaries and installing the Python pytesseract bindings. Shows a full example that gets into parsing the recognized text using regex patterns.
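A minimal sketch of the regex parsing step described above, not the blog's actual code. The sample text, the patterns, and the `parse_receipt` helper are all made up for illustration; in a real run the text would come from something like `pytesseract.image_to_string(Image.open("receipt.png"))`, and real receipts will need patterns tuned to their layout.

```python
import re
from decimal import Decimal

# Hypothetical OCR output standing in for the string pytesseract would return.
SAMPLE_TEXT = """\
CORNER MARKET
06/27/2022 16:59
MILK 2%        3.49
BREAD          2.99
SUBTOTAL       6.48
TAX            0.52
TOTAL          7.00
"""

# Assumed formats: MM/DD/YYYY dates and "<label> <amount with two decimals>" lines.
DATE_RE = re.compile(r"\b(\d{2}/\d{2}/\d{4})\b")
LINE_RE = re.compile(r"^(?P<label>.+?)\s+(?P<amount>\d+\.\d{2})$")
SUMMARY_LABELS = {"SUBTOTAL", "TAX", "TOTAL"}


def parse_receipt(text):
    """Split OCR text into a date, line items, and summary amounts."""
    date_match = DATE_RE.search(text)
    result = {
        "date": date_match.group(1) if date_match else None,
        "items": [],
        "summary": {},
    }
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # header lines, timestamps, OCR noise
        label = match.group("label").strip()
        amount = Decimal(match.group("amount"))
        if label.upper() in SUMMARY_LABELS:
            result["summary"][label.upper()] = amount
        else:
            result["items"].append((label, amount))
    return result


receipt = parse_receipt(SAMPLE_TEXT)
print(receipt["date"], receipt["items"], receipt["summary"])
```

Amounts are kept as `Decimal` rather than `float` so that sums of line items can be compared exactly against the printed subtotal.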

Goes through various models that explore this space and then in the end redirects you to a pretrained OCR API you can purchase.

"The rule-based methods rely heavily on the predefined template rules to extract information from specific invoice layouts." This was an interesting bit, similar to how I parse the ecommerce sites I frequent: I develop custom scrapers for them to extract the line item purchase data I'm after. So automate the most frequent stores, and then whatever is left over is doable for manual entry.
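The template idea above can be sketched roughly as follows: keep one set of rules per known store and fall back to manual entry for anything unrecognized. The store names, patterns, and the `extract_items` helper are hypothetical, chosen only to show the dispatch; they do not come from the Nanonets post.

```python
import re

# Hypothetical per-store templates: a pattern that detects the store and a
# pattern that matches one line item in that store's receipt layout.
TEMPLATES = [
    {
        "store": "Corner Market",
        "detect": re.compile(r"CORNER MARKET", re.IGNORECASE),
        "item": re.compile(r"^(?P<name>.+?)\s+(?P<price>\d+\.\d{2})$"),
    },
    {
        "store": "Hardware Depot",
        "detect": re.compile(r"HARDWARE DEPOT", re.IGNORECASE),
        "item": re.compile(
            r"^(?P<qty>\d+)\s*x\s+(?P<name>.+?)\s+@\s*(?P<price>\d+\.\d{2})$"
        ),
    },
]


def extract_items(text):
    """Return (store, items) using the first template whose store matches."""
    for template in TEMPLATES:
        if not template["detect"].search(text):
            continue
        items = []
        for line in text.splitlines():
            match = template["item"].match(line.strip())
            if match:
                items.append(match.groupdict())
        return template["store"], items
    # Unknown layout: nothing extracted, leave it for manual entry.
    return None, []


known = "HARDWARE DEPOT\n2 x WOOD SCREWS @ 4.25\n1 x HAMMER @ 12.00\n"
unknown = "SOME CAFE\nLATTE 4.50\n"
print(extract_items(known))
print(extract_items(unknown))
```

Adding a store means adding one entry to `TEMPLATES`, which fits the approach of automating the most frequent stores first.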

It looks like a subscription-based cloud solution: https://nanonets.com/pricing/

Discussion on Hacker News: https://news.ycombinator.com/item?id=21843342

Invoice2Data

Earlier this year I was working on hybrid PDFs[1] that embed a full XML invoice. Standardized and promoted by the German and French.[2] One more thing to hide. https://news.ycombinator.com/item?id=18383558

Wave Financial

https://www.waveapps.com/receipts https://en.wikipedia.org/wiki/Wave_Financial

Abbyy

https://www.abbyy.com/cloud-ocr-sdk/ was mentioned by someone looking at its R package: https://cran.r-project.org/web/packages/abbyyR/index.html https://github.com/soodoku/abbyyR

Zoho Expense

https://www.pcmag.com/reviews/zoho-expense Zoho Expense: Overview

Various sources

General PDF Extraction Stuff

https://news.ycombinator.com/item?id=18199708
https://pdftables.com/
https://camelot-py.readthedocs.io/en/master/
https://tomassetti.me/how-to-convert-a-pdf-to-excel/
