import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class Solution {
    public List<List<String>> findDuplicate(String[] paths) {
        // Map from file content to every full path that holds that content.
        Map<String, List<String>> map = new HashMap<>();
        for (String path : paths) {
            String[] pathArr = path.split(" ");
            String dir = pathArr[0] + '/';
            for (int i = 1; i < pathArr.length; i++) {
                // Each token after the directory looks like "name.txt(content)".
                int start = pathArr[i].indexOf('(');
                String fileName = dir + pathArr[i].substring(0, start);
                String content = pathArr[i].substring(start + 1, pathArr[i].length() - 1);
                map.putIfAbsent(content, new ArrayList<>());
                map.get(content).add(fileName);
            }
        }
        // Keep only contents that appear in more than one file.
        List<List<String>> res = new ArrayList<>();
        for (String key : map.keySet()) {
            if (map.get(key).size() > 1) {
                res.add(map.get(key));
            }
        }
        return res;
    }
}
An alternative solution that builds the result list incrementally instead of filtering at the end:

public class Solution {
    public List<List<String>> findDuplicate(String[] paths) {
        HashMap<String, Integer> map = new HashMap<>();         // content -> occurrence count
        HashMap<String, String> mapPath = new HashMap<>();      // content -> first file seen
        HashMap<String, Integer> resultIndex = new HashMap<>(); // content -> row in result
        List<List<String>> result = new ArrayList<>();
        for (int i = 0; i < paths.length; i++) {
            String[] parts = paths[i].split(" ");
            String dir = parts[0];
            for (int j = 1; j < parts.length; j++) {
                String file = parts[j];
                // Separate content and filename
                int sIndex = file.indexOf('(');
                String fname = file.substring(0, sIndex);
                String content = file.substring(sIndex + 1, file.length() - 1);
                if (map.containsKey(content)) {
                    // Duplicate file found
                    if (map.get(content) == 1) {
                        // Second occurrence: open a new result row holding both files.
                        int rIndex = result.size();
                        ArrayList<String> newRow = new ArrayList<>();
                        newRow.add(mapPath.get(content)); // add the first file
                        newRow.add(dir + "/" + fname);
                        result.add(newRow);
                        resultIndex.put(content, rIndex);
                        map.put(content, map.get(content) + 1);
                    } else {
                        // Third or later occurrence: append to the existing row.
                        result.get(resultIndex.get(content)).add(dir + "/" + fname);
                    }
                } else {
                    map.put(content, 1);
                    mapPath.put(content, dir + "/" + fname);
                }
            }
        } // end for
        return result;
    }
}
Follow-up questions:
1. Imagine you are given a real file system; how would you search for files, DFS or BFS?
In general, BFS will use more memory than DFS. However, BFS can take advantage of the locality of files inside directories, and will therefore probably be faster.
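For illustration, here is a minimal BFS sketch over a real directory tree using java.io.File (the class and method names are my own, not part of the original solution):

import java.io.File;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

public class FileSearchBfs {
    // BFS visits directories level by level, so files that share a directory
    // are handed to the caller close together in traversal order.
    public static List<File> listFilesBfs(File root) {
        List<File> files = new ArrayList<>();
        Queue<File> queue = new ArrayDeque<>();
        queue.add(root);
        while (!queue.isEmpty()) {
            File[] children = queue.poll().listFiles();
            if (children == null) continue; // not a directory, or not readable
            for (File child : children) {
                if (child.isDirectory()) {
                    queue.add(child);
                } else {
                    files.add(child);
                }
            }
        }
        return files;
    }
}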
2. If the file content is very large (GB level), how would you modify your solution?
In a real-life solution we would not hash the entire file content, since that is not practical. Instead, we would first group all files by size; files with different sizes are guaranteed to be different. We would then hash a small part of the files with equal sizes (using MD5, for example), and only if the hashes match would we compare the files byte by byte.
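A hedged sketch of that pipeline (the 4 KB prefix size and all names are illustrative assumptions; the final stage is noted but not implemented):

import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.*;

public class DuplicateFilter {
    // Stage 1: group files by size; a file whose size is unique cannot be a duplicate.
    public static Map<Long, List<Path>> groupBySize(List<Path> files) throws IOException {
        Map<Long, List<Path>> bySize = new HashMap<>();
        for (Path p : files) {
            bySize.computeIfAbsent(Files.size(p), k -> new ArrayList<>()).add(p);
        }
        bySize.values().removeIf(group -> group.size() < 2); // drop unique sizes
        return bySize;
    }

    // Stage 2: hash only a small prefix of each same-size file (MD5, as above).
    public static String prefixHash(Path p) throws IOException, NoSuchAlgorithmException {
        MessageDigest md5 = MessageDigest.getInstance("MD5");
        try (InputStream in = Files.newInputStream(p)) {
            byte[] buf = new byte[4096]; // prefix size is an arbitrary choice
            int read = in.read(buf);
            if (read > 0) md5.update(buf, 0, read);
        }
        return Base64.getEncoder().encodeToString(md5.digest());
    }

    // Stage 3: files agreeing on size and prefix hash are finally compared
    // byte by byte (e.g. Files.mismatch on Java 12+).
}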
3. If you can only read the file 1 KB at a time, how would you modify your solution?
This doesn't change the solution: we can build the hash from the 1 KB chunks, and then read the entire file only if a full byte-by-byte comparison is required.
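A sketch of that incremental hashing (names are illustrative): MessageDigest accumulates state across update() calls, so the whole file never has to fit in memory.

import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Base64;

public class ChunkedHash {
    public static String hashInChunks(Path file) throws IOException, NoSuchAlgorithmException {
        MessageDigest md5 = MessageDigest.getInstance("MD5");
        try (InputStream in = Files.newInputStream(file)) {
            byte[] chunk = new byte[1024]; // the 1 KB read limit from the question
            int read;
            while ((read = in.read(chunk)) != -1) {
                md5.update(chunk, 0, read); // digest state accumulates across chunks
            }
        }
        return Base64.getEncoder().encodeToString(md5.digest());
    }
}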
4. What is the time complexity of your modified solution? What are its most time-consuming and memory-consuming parts, and how would you optimize them?
Time complexity is O(n^2 * k), since in the worst case we might need to compare every file to all the others; n is the number of files and k is the file size. The byte-by-byte comparison is the most time-consuming part and the maps of sizes and hashes the main memory cost; the size and prefix-hash filters above are the optimization, since they keep most files out of the expensive comparison.
5. How do you make sure the duplicated files you find are not false positives?
We apply several filters in sequence: file size, hashing, and finally a byte-by-byte comparison.
A more concise version of the first approach, collecting the duplicate groups with streams:

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class Solution {
    public static List<List<String>> findDuplicate(String[] paths) {
        Map<String, List<String>> map = new HashMap<>();
        for (String path : paths) {
            String[] tokens = path.split(" ");
            for (int i = 1; i < tokens.length; i++) {
                // tokens[0] is the directory; each later token is "name(content)".
                String file = tokens[i].substring(0, tokens[i].indexOf('('));
                String content = tokens[i].substring(tokens[i].indexOf('(') + 1, tokens[i].indexOf(')'));
                map.putIfAbsent(content, new ArrayList<>());
                map.get(content).add(tokens[0] + "/" + file);
            }
        }
        // Keep only groups with more than one file.
        return map.values().stream().filter(e -> e.size() > 1).collect(Collectors.toList());
    }
}
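For reference, a quick driver using LeetCode's example input (assuming the static version above sits in a class named Solution; group order depends on HashMap iteration and may differ):

public class Demo {
    public static void main(String[] args) {
        String[] paths = {
            "root/a 1.txt(abcd) 2.txt(efgh)",
            "root/c 3.txt(abcd)",
            "root/c/d 4.txt(efgh)",
            "root 4.txt(efgh)"
        };
        // Expected groups: [root/a/1.txt, root/c/3.txt] and
        // [root/a/2.txt, root/c/d/4.txt, root/4.txt]
        System.out.println(Solution.findDuplicate(paths));
    }
}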