Finding duplicates using SHA256 and getting inconsistent results

I am trying to find duplicate audio files in a folder structure containing just over 20,000 files. Some dupes will have identical names, some will not, although extensions are consistent across dupe sets. I know that about 80%-90% of files will have dupes. The majority of the files are small, in the range of 5MB to 50MB, but several hundred of them are between 0.5GB and 1.5GB. I’m using a Surface Pro 6 i7 with Windows 10 Pro and 8 GB of RAM.

These 20,000 files were consolidated onto an SSD from multiple backup or archive sets located on different hard drives. Each set (with its own folder structure) is now in its own subfolder, and these are collectively under a single top-level pool folder.

To find the duplicates, I use a program called TreeSize, comparing files by SHA256 checksum, and I run it on the top-level pool folder.

When I run the SHA256 comparison on the folder, I get a slightly different dupe count each time: 19668, 19204, 19671, 19675, 19669, 19673. I got each of the last two numbers twice.

All of these numbers are plausible: when I run a comparison based on name, size, and date, the result is in the same ballpark, 19621, as expected. That number is consistent every time I run it, but I realize that this method checks for something very different.

I wonder what could explain the slightly different SHA256 results when run against the same set of files. I find the same slight variations when I run the files through MD5.
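One way to narrow this down independently of TreeSize would be to hash every file more than once and list the files whose digest is not stable. The following is only a minimal sketch, assuming Python 3 is available; the function names `sha256_of` and `unstable_files` are made up for illustration and have nothing to do with how TreeSize works internally. Note that a second read of a small file may be served from the operating system's cache, so a clean result here does not rule out an intermittent read problem.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        # Read in chunks so that 1.5 GB files do not have to fit in RAM.
        for block in iter(lambda: f.read(chunk_size), b""):
            h.update(block)
    return h.hexdigest()


def unstable_files(root, passes=2):
    """Hash each file `passes` times; report files that fail or change."""
    unstable = []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file():
            continue
        try:
            digests = {sha256_of(path) for _ in range(passes)}
        except OSError as exc:
            # Unreadable or locked files are reported, not silently skipped.
            unstable.append((path, f"read error: {exc}"))
            continue
        if len(digests) > 1:
            unstable.append((path, "digest changed between passes"))
    return unstable


if __name__ == "__main__":
    # "pool" is a placeholder for the top-level folder.
    for path, reason in unstable_files("pool"):
        print(path, reason)
```

If this lists any files, they would be the first candidates to check for corruption or read errors; if it lists none, the variation more likely comes from how the tool scans or counts than from the hashes themselves.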

Is it the size of the dataset? The size of certain files? Some files may be corrupted, but would that affect a comparison test’s ability to read them consistently?

Should I run the comparison on smaller subsets first, weed out the dupes in each subset, and then regroup? Or should I use another tool?
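For a cross-check with another method, the grouping could be reproduced in a short script. This is a rough sketch, assuming Python 3; `find_duplicates` and `sha256_of` are hypothetical names, and counting every file that belongs to a duplicate set is only my assumption about how a "dupe count" might be defined, not necessarily how TreeSize counts. Files are grouped by size first, so only files that share a size with another file get hashed.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk_size), b""):
            h.update(block)
    return h.hexdigest()


def find_duplicates(root):
    """Return lists of paths whose size and SHA-256 digest both match."""
    # Step 1: group by size; a file with a unique size cannot have a dupe.
    by_size = defaultdict(list)
    for path in Path(root).rglob("*"):
        if path.is_file():
            by_size[path.stat().st_size].append(path)

    # Step 2: hash only the files that share their size with another file.
    by_digest = defaultdict(list)
    for size, paths in by_size.items():
        if len(paths) < 2:
            continue
        for path in paths:
            by_digest[(size, sha256_of(path))].append(path)

    return [sorted(group) for group in by_digest.values() if len(group) > 1]


if __name__ == "__main__":
    # "pool" is a placeholder for the top-level folder.
    groups = find_duplicates("pool")
    print("duplicate sets:", len(groups))
    print("files in duplicate sets:", sum(len(g) for g in groups))
```

Running this a few times and comparing the two printed numbers between runs would show whether the variation also appears outside TreeSize.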

Any thoughts on this?

Thank you!

Edited by GGiF, Today, 3:00 p.m.



Steven L. Nielsen