Detecting similar and identical images using perceptual hashes
A couple of my hobbies are travelling and photography. I love to take pictures and experiment with photography. Usually after my trips, I just copy the photos to either my iPad or a couple of my external hard disks. After 10 years, I have over 200K photos distributed across several disks and machines. I had to find a way to organize these photos and create a workflow for future maintenance. In this post I want to address one of the issues I had to solve: **finding duplicate images**.
First, I needed to work out what exactly counts as a duplicate image. Analysing my photos, I found a couple of interesting things:
- Identical images: There were multiple copies of the same photo in different directories with different names.
- Similar images: I usually bracket important pictures (with exposure or flash compensation), so I have photos that visually appear to be the same but may be a little darker/lighter based on e