Approach 1: Group all files by the hash code of their bytes.
Use a Map<Integer, List<Path>> to represent the groups
(note that Arrays.hashCode returns an int, so the key type is Integer, not Long).
Here, the key is the hash code of all the bytes of a file.
The algorithm:
For each file (including files in subfolders) of the given folder:
    bytes = Files.readAllBytes(file);
    hashCode = Arrays.hashCode(bytes);
    map.computeIfAbsent(hashCode, k -> new ArrayList<>()).add(file);
In this way you can form the groups. Do not worry about collisions for now; if you finish every other task, then worry about them :D How can you quickly detect collisions with very high probability, without reading and comparing the files?
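The grouping step above can be sketched as follows. This is a minimal illustration, not the required implementation; the class and method names are mine, and it assumes Java 16+ (for Stream.toList).

```java
import java.io.IOException;
import java.nio.file.*;
import java.util.*;

public class HashGrouper {
    // Groups all regular files under root by the hash code of their bytes.
    // Arrays.hashCode returns an int, hence the Integer key type.
    public static Map<Integer, List<Path>> groupByHash(Path root) throws IOException {
        Map<Integer, List<Path>> groups = new HashMap<>();
        try (var paths = Files.walk(root)) {
            for (Path p : paths.filter(Files::isRegularFile).toList()) {
                int hash = Arrays.hashCode(Files.readAllBytes(p));
                // computeIfAbsent creates the list on the first file with this hash
                groups.computeIfAbsent(hash, k -> new ArrayList<>()).add(p);
            }
        }
        return groups;
    }
}
```

Note the computeIfAbsent call: a plain map.get(hashCode).add(path) would throw a NullPointerException the first time a hash is seen.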
Approach 2: Group all paths by file size. Two files cannot be duplicates if they have different sizes, right?
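Grouping by size might look like this; a sketch under my own naming, using Files.size so no file content is read at this stage (assumes Java 16+ for Stream.toList):

```java
import java.io.IOException;
import java.nio.file.*;
import java.util.*;

public class SizeGrouper {
    // Groups all regular files under root by their size in bytes.
    // Only file metadata is read here, which makes this pass very cheap.
    public static Map<Long, List<Path>> groupBySize(Path root) throws IOException {
        Map<Long, List<Path>> groups = new HashMap<>();
        try (var paths = Files.walk(root)) {
            for (Path p : paths.filter(Files::isRegularFile).toList()) {
                groups.computeIfAbsent(Files.size(p), k -> new ArrayList<>()).add(p);
            }
        }
        return groups;
    }
}
```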
So once we form the groups, within each group every two files are either duplicates, or have different content of the same size. We start building the duplicate group of the first file in the size group.
for file from 2..group.length
    if (Arrays.equals(getBytes(firstFile), getBytes(file)))   (1)
        groups.get(firstFile).add(file); // the current file is a duplicate of the first file
    else
        // the current file is not a duplicate of the first file;
        // we later build its own group from the remaining files
(Note that byte[].equals compares references, not contents, so Arrays.equals is needed here.)
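The loop above can be sketched as a method that partitions one size group into groups of identical content. The names are mine; it repeatedly takes the first remaining file, collects its duplicates, and continues with the leftovers:

```java
import java.io.IOException;
import java.nio.file.*;
import java.util.*;

public class GroupSplitter {
    // Partitions a list of same-size files into groups of identical content.
    // Take the first file, compare every other file's bytes against it,
    // then repeat with whatever did not match.
    public static List<List<Path>> splitByContent(List<Path> sameSize) throws IOException {
        List<List<Path>> result = new ArrayList<>();
        List<Path> remaining = new ArrayList<>(sameSize);
        while (!remaining.isEmpty()) {
            Path first = remaining.remove(0);
            byte[] firstBytes = Files.readAllBytes(first);
            List<Path> group = new ArrayList<>(List.of(first));
            List<Path> rest = new ArrayList<>();
            for (Path p : remaining) {
                if (Arrays.equals(firstBytes, Files.readAllBytes(p))) {
                    group.add(p);   // duplicate of first
                } else {
                    rest.add(p);    // will seed or join a later group
                }
            }
            result.add(group);
            remaining = rest;
        }
        return result;
    }
}
```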
If you run into performance problems, (1) can be optimized: instead of reading all the bytes of both files up front, read and compare them in chunks. If two files are different, the difference usually appears early in their bytes.
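A chunked comparison might be sketched like this; the 8 KiB buffer size is an arbitrary choice of mine, and it assumes Java 9+ (for InputStream.readNBytes and the range overload of Arrays.equals):

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.*;
import java.util.Arrays;

public class ChunkedCompare {
    // Compares two files chunk by chunk, so a difference early in the
    // files is detected without reading either file fully.
    public static boolean sameContent(Path a, Path b) throws IOException {
        if (Files.size(a) != Files.size(b)) return false; // cheap size check first
        try (InputStream ia = Files.newInputStream(a);
             InputStream ib = Files.newInputStream(b)) {
            byte[] bufA = new byte[8192];
            byte[] bufB = new byte[8192];
            int n;
            while ((n = ia.readNBytes(bufA, 0, bufA.length)) > 0) {
                int m = ib.readNBytes(bufB, 0, bufB.length);
                if (m != n || !Arrays.equals(bufA, 0, n, bufB, 0, m)) {
                    return false; // stop at the first differing chunk
                }
            }
            return true;
        }
    }
}
```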