@grahamc
Last active June 26, 2020 22:17

I have a problem and I'd like to apply some machine learning to it. I've done a lot of ML-adjacent work, but only in speech processing, and that was approximately one eternity ago. Could y'all suggest some ML library bricks I could snap together? I'd prefer Python's ecosystem here.

I've scanned all my (vaguely worthwhile) incoming mail since last November. Each letter gets scanned front and back and made into a single PDF, and then I apply OCR to extract the text and make it searchable. I now have several hundred scans, thousands of pages in total, stored as PDFs.

The vast majority of this mail comes from a handful of senders. Each sender's mail looks visually the same and shares text across all the letters they send. I'd like to cluster similar mail together based on both the text and the image, so I could select a piece of mail by how it looks and see related mail which is almost certainly of the same type.
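To make the "see related mail" idea concrete, here is a minimal sketch of the text half only, using nothing but the Python standard library. It weights words with TF-IDF and ranks the other letters by cosine similarity to a chosen one. The sender names, the sample strings, and the helper names (`tokenize`, `tfidf_vectors`, `cosine`, `related`) are all made up for illustration; real input would be the OCR text pulled out of each PDF.

```python
import math
import re
from collections import Counter

# Hypothetical stand-ins for the OCR text of four scanned letters.
letters = [
    "Acme Mortgage monthly statement principal interest escrow payment due",
    "Acme Mortgage escrow analysis statement principal interest",
    "Northwind Health insurance premium bill member id amount due",
    "Northwind Health insurance explanation of benefits member id claim",
]


def tokenize(text):
    """Lowercase the text and keep runs of letters as tokens."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_vectors(texts):
    """Return one sparse {word: weight} TF-IDF vector per text."""
    token_lists = [tokenize(t) for t in texts]
    n_docs = len(texts)
    doc_freq = Counter()
    for tokens in token_lists:
        doc_freq.update(set(tokens))  # count each word once per document
    vectors = []
    for tokens in token_lists:
        term_freq = Counter(tokens)
        vectors.append({
            # Smoothed inverse document frequency: rarer words weigh more.
            word: count * (math.log((1 + n_docs) / (1 + doc_freq[word])) + 1)
            for word, count in term_freq.items()
        })
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(weight * b.get(word, 0.0) for word, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def related(texts, query_index):
    """Indices of the other texts, most similar to the query first."""
    vectors = tfidf_vectors(texts)
    query = vectors[query_index]
    scored = [
        (cosine(query, vec), i)
        for i, vec in enumerate(vectors)
        if i != query_index
    ]
    return [i for _, i in sorted(scored, reverse=True)]


print(related(letters, 0))
```

This ignores how the page looks entirely; an image-based similarity would have to be computed separately and combined with the text score.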

For example, my mortgage letter looks like this:

and my health insurance bills always have this logo:

So ... ahem, any suggestions for things to look at? Right now I'm eyeing scikit-learn's clustering and image/text feature extraction functions.
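As a starting point on the scikit-learn side, the text half might look roughly like the sketch below: `TfidfVectorizer` turns the OCR text into feature vectors and `KMeans` groups them. This is only a sketch under assumptions. The sample strings and sender names are invented, and `n_clusters=2` is picked to match the toy data; with real mail the number of senders isn't known up front, so it would need tuning or a different clustering algorithm.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical stand-ins for the OCR text of four scanned letters.
ocr_texts = [
    "Acme Mortgage monthly statement principal interest escrow payment due",
    "Acme Mortgage escrow analysis statement principal interest",
    "Northwind Health insurance premium bill member id amount due",
    "Northwind Health insurance explanation of benefits member id claim",
]

# Turn each letter's text into a TF-IDF feature vector.
vectorizer = TfidfVectorizer(stop_words="english")
features = vectorizer.fit_transform(ocr_texts)

# Group the letters; n_clusters is an assumption that fits this toy data.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(features)

for label, text in zip(labels, ocr_texts):
    print(label, text[:40])
```

Image features (for example, from downscaled page thumbnails) aren't covered here; they'd have to be extracted separately and concatenated with the text features before clustering.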
