What is image hashing?
An image hash is a small fingerprint for an image.
A normal file hash, like MD5 or SHA256, answers this question:
Are these files exactly the same?
A perceptual image hash answers a more useful visual question:
Do these images look similar?
Why not just use MD5?
Imagine two screenshots that look almost the same.
Maybe one has a different timestamp. Maybe the browser rendered one pixel differently. Maybe the image was resized or lightly compressed.
With MD5, even a one-bit change produces a completely different hash.
Original image -> 1a2b3c4d...
One pixel changed -> 9f8e7d6c...
That is useful for exact duplicates, but useless for visual similarity.
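A quick way to see this for yourself is to hash two byte strings that differ by a single bit. This is only a sketch: the bytes below are a stand-in for real image data, and the variable names are made up for illustration.

```python
import hashlib

# Stand-in for image bytes; any byte string shows the same effect.
original = bytes(range(256)) * 4

# Copy the data and flip a single bit in one byte.
changed = bytearray(original)
changed[100] ^= 0x01

h1 = hashlib.md5(original).hexdigest()
h2 = hashlib.md5(bytes(changed)).hexdigest()

# Count how many of the 128 digest bits differ between the two hashes.
bits_different = bin(int(h1, 16) ^ int(h2, 16)).count("1")

print(h1)
print(h2)
print(bits_different, "of 128 bits differ")
```

The two digests share no useful structure, so comparing them tells you only "same file" or "not the same file".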
Perceptual hashes are different
Perceptual hashes are designed so similar-looking images produce similar-looking hashes.
Original image -> f123456789abcdef
Similar image -> f123456789abcdee
Different image -> 0011aa99bb44cc22
The hashes are not meant to be secret. They are meant to be compared.
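To make that concrete, here is a minimal sketch of one simple perceptual hash, often called an average hash. The text above does not name a specific method, so treat this choice as an assumption. The sketch also assumes the image has already been shrunk to an 8x8 grid of grayscale values (0-255); real tools do that resizing step first. The sample grids are made up.

```python
def average_hash(pixels):
    """Return a 64-bit hash (16 hex chars) for an 8x8 grid of grayscale values."""
    flat = [value for row in pixels for value in row]
    mean = sum(flat) / len(flat)
    bits = 0
    for value in flat:
        # One bit per pixel: 1 if brighter than the average, else 0.
        bits = (bits << 1) | (1 if value > mean else 0)
    return f"{bits:016x}"

# Hypothetical image: dark on the left, bright on the right.
original = [[x * 32 for x in range(8)] for _ in range(8)]

# Same image with one pixel nudged slightly brighter.
similar = [row[:] for row in original]
similar[3][3] += 5

# Unrelated image: dark at the top, bright at the bottom.
different = [[y * 32 for _ in range(8)] for y in range(8)]

print(average_hash(original))   # 0f0f0f0f0f0f0f0f
print(average_hash(similar))    # 0f0f0f0f0f0f0f0f
print(average_hash(different))  # 00000000ffffffff
```

The small pixel change does not move any value across the average, so the hash stays the same, while the unrelated image lands far away.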
Hamming distance
The difference between two hashes is measured by their Hamming distance.
That is just a count of how many bit positions differ.
Distance 0      exactly the same hash
Distance 1-5    probably very similar
Distance 6-10   possibly related
Distance 11+    often different
The exact numbers depend on the image type, hash method and your tolerance for false matches.
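The bit count and the rough bands above can be sketched in a few lines. The function names are made up for illustration, and the thresholds are just the example numbers from the list, not fixed rules. The sample hashes are the ones used earlier.

```python
def hamming_distance(hash_a, hash_b):
    """Count the differing bits between two equal-length hex hash strings."""
    if len(hash_a) != len(hash_b):
        raise ValueError("hashes must be the same length")
    # XOR leaves a 1 wherever the two hashes disagree; count those 1s.
    return bin(int(hash_a, 16) ^ int(hash_b, 16)).count("1")

def describe(distance):
    """Map a distance to a rough band; tune these thresholds for your data."""
    if distance == 0:
        return "exactly the same hash"
    if distance <= 5:
        return "probably very similar"
    if distance <= 10:
        return "possibly related"
    return "often different"

original = "f123456789abcdef"
similar = "f123456789abcdee"
different = "0011aa99bb44cc22"

d_similar = hamming_distance(original, similar)
d_different = hamming_distance(original, different)

print(d_similar, describe(d_similar))      # 1 probably very similar
print(d_different, describe(d_different))  # 38 often different
```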
The idiot version
Small distance = images probably look alike.
Large distance = images probably look different.
That is the core idea.
