I have a problem and I'd like to apply some machine learning to it. I have done a lot of ML-adjacent work, but only in speech processing and that was approximately 1 eternity ago. Maybe y'all could make some suggestions on some ML library bricks (I'm thinking I'd prefer Python's ecosystem here) I could snap together?
I've scanned all my (vaguely worthwhile) incoming mail since last November. Each letter gets scanned front and back and made into a single PDF, and then I apply OCR to extract the text and make it searchable. I now have several hundred scans and thousands of pages represented as PDFs.
The vast majority of this mail comes from a handful of senders. Each sender's mail looks visually consistent and shares text across all of the letters they send. I'd like to cluster like mail together based on the text and image, so I could select a piece of mail by how it looks and see related mail that is almost certainly of the same type.
For example, my mortgage letter looks like this:
and my health insurance bills always have this logo:
So ... ahem, any suggestions on things to look at? Right now I'm eyeing scikit-learn's clustering and image/text feature extraction functions.
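For the text half, this is roughly the shape I have in mind. It's only a sketch and I haven't run it on my real scans. The `cluster_letters` helper, the sample strings, and the 1.2 distance threshold are all made up for illustration, and it assumes the OCR text is already sitting in plain Python strings. The image side isn't covered here at all.

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.feature_extraction.text import TfidfVectorizer


def cluster_letters(texts, distance_threshold=1.2):
    """Group OCR'd letters by text similarity; returns one cluster label per letter.

    Hypothetical helper: the name and the default threshold are my own guesses.
    """
    # TF-IDF rows are L2-normalised by default, so the Euclidean distance between
    # two letters runs from 0 (same wording) to sqrt(2) (no words in common).
    vectors = TfidfVectorizer().fit_transform(texts)

    # n_clusters=None plus a distance threshold means I don't have to know
    # up front how many senders there are.
    model = AgglomerativeClustering(
        n_clusters=None,
        distance_threshold=distance_threshold,
        linkage="average",
    )
    # AgglomerativeClustering wants a dense array, not a sparse matrix.
    return model.fit_predict(vectors.toarray())


# Stand-in OCR output, invented for this example (not my actual mail).
sample_texts = [
    "mortgage statement principal interest escrow payment due",
    "mortgage statement escrow balance principal payment",
    "health insurance premium bill claim coverage",
    "health insurance claim premium coverage notice",
]

labels = cluster_letters(sample_texts)
for label, text in zip(labels, sample_texts):
    print(label, text)
```

The threshold is the knob I'd expect to fiddle with: lower splits senders into finer groups, higher lumps more mail together. Converting to a dense array is fine for a few hundred letters but probably wouldn't scale much beyond that.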