import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class Solution {
    public List<List<String>> findDuplicate(String[] paths) {
        // Map from file content to every full path that holds that content.
        Map<String, List<String>> map = new HashMap<>();
        for (String path : paths) {
            String[] pathArr = path.split(" ");
            String dir = pathArr[0] + '/';
            for (int i = 1; i < pathArr.length; i++) {
                // Each token after the directory looks like "name.txt(content)".
                int start = pathArr[i].indexOf('(');
                String fileName = dir + pathArr[i].substring(0, start);
                String content = pathArr[i].substring(start + 1, pathArr[i].length() - 1);
                map.putIfAbsent(content, new ArrayList<>());
                map.get(content).add(fileName);
            }
        }
        // Keep only contents that appear in more than one file.
        List<List<String>> res = new ArrayList<>();
        for (String key : map.keySet()) {
            if (map.get(key).size() > 1) {
                res.add(map.get(key));
            }
        }
        return res;
    }
}
An alternative solution that builds the result list incrementally instead of filtering at the end:

public class Solution {
    public List<List<String>> findDuplicate(String[] paths) {
        HashMap<String, Integer> map = new HashMap<>();         // content -> occurrence count
        HashMap<String, String> mapPath = new HashMap<>();      // content -> first file seen
        HashMap<String, Integer> resultIndex = new HashMap<>(); // content -> row in result
        List<List<String>> result = new ArrayList<>();
        for (int i = 0; i < paths.length; i++) {
            String[] parts = paths[i].split(" ");
            String dir = parts[0];
            for (int j = 1; j < parts.length; j++) {
                String file = parts[j];
                // Separate content and filename
                int sIndex = file.indexOf('(');
                String fname = file.substring(0, sIndex);
                String content = file.substring(sIndex + 1, file.length() - 1);
                if (map.containsKey(content)) {
                    // Duplicate file found
                    if (map.get(content) == 1) {
                        // Second occurrence: open a new result row holding both files.
                        int rIndex = result.size();
                        ArrayList<String> newRow = new ArrayList<>();
                        newRow.add(mapPath.get(content)); // add the first file
                        newRow.add(dir + "/" + fname);
                        result.add(newRow);
                        resultIndex.put(content, rIndex);
                        map.put(content, map.get(content) + 1);
                    } else {
                        // Third or later occurrence: append to the existing row.
                        result.get(resultIndex.get(content)).add(dir + "/" + fname);
                    }
                } else {
                    map.put(content, 1);
                    mapPath.put(content, dir + "/" + fname);
                }
            }
        } // end for
        return result;
    }
}
Follow-up questions:
1. Imagine you are given a real file system; how would you search for files, DFS or BFS?
In general, BFS will use more memory than DFS. However, BFS can take advantage of the locality of files inside directories, and will therefore probably be faster.
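For illustration, here is a minimal BFS sketch over a real directory tree using java.io.File (the class and method names are my own, not part of the original solution):

import java.io.File;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

public class FileSearchBfs {
    // BFS visits directories level by level, so files that share a directory
    // are handed to the caller close together in traversal order.
    public static List<File> listFilesBfs(File root) {
        List<File> files = new ArrayList<>();
        Queue<File> queue = new ArrayDeque<>();
        queue.add(root);
        while (!queue.isEmpty()) {
            File[] children = queue.poll().listFiles();
            if (children == null) continue; // not a directory, or not readable
            for (File child : children) {
                if (child.isDirectory()) {
                    queue.add(child);
                } else {
                    files.add(child);
                }
            }
        }
        return files;
    }
}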
2. If the file content is very large (GB level), how would you modify your solution?
In a real-life solution we would not hash the entire file content, since that is not practical. Instead, we would first group all files by size; files with different sizes are guaranteed to be different. We would then hash a small part of the files with equal sizes (using MD5, for example), and only if the hashes match would we compare the files byte by byte.
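A hedged sketch of that pipeline (the 4 KB prefix size and all names are illustrative assumptions; the final stage is noted but not implemented):

import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.*;

public class DuplicateFilter {
    // Stage 1: group files by size; a file whose size is unique cannot be a duplicate.
    public static Map<Long, List<Path>> groupBySize(List<Path> files) throws IOException {
        Map<Long, List<Path>> bySize = new HashMap<>();
        for (Path p : files) {
            bySize.computeIfAbsent(Files.size(p), k -> new ArrayList<>()).add(p);
        }
        bySize.values().removeIf(group -> group.size() < 2); // drop unique sizes
        return bySize;
    }

    // Stage 2: hash only a small prefix of each same-size file (MD5, as above).
    public static String prefixHash(Path p) throws IOException, NoSuchAlgorithmException {
        MessageDigest md5 = MessageDigest.getInstance("MD5");
        try (InputStream in = Files.newInputStream(p)) {
            byte[] buf = new byte[4096]; // prefix size is an arbitrary choice
            int read = in.read(buf);
            if (read > 0) md5.update(buf, 0, read);
        }
        return Base64.getEncoder().encodeToString(md5.digest());
    }

    // Stage 3: files agreeing on size and prefix hash are finally compared
    // byte by byte (e.g. Files.mismatch on Java 12+).
}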
3. If you can only read the file 1 KB at a time, how would you modify your solution?
This doesn't change the solution: we can build the hash from the 1 KB chunks, and then read the entire file only if a full byte-by-byte comparison is required.
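A sketch of that incremental hashing (names are illustrative): MessageDigest accumulates state across update() calls, so the whole file never has to fit in memory.

import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Base64;

public class ChunkedHash {
    public static String hashInChunks(Path file) throws IOException, NoSuchAlgorithmException {
        MessageDigest md5 = MessageDigest.getInstance("MD5");
        try (InputStream in = Files.newInputStream(file)) {
            byte[] chunk = new byte[1024]; // the 1 KB read limit from the question
            int read;
            while ((read = in.read(chunk)) != -1) {
                md5.update(chunk, 0, read); // digest state accumulates across chunks
            }
        }
        return Base64.getEncoder().encodeToString(md5.digest());
    }
}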
4. What is the time complexity of your modified solution? What are its most time-consuming and memory-consuming parts, and how would you optimize them?
Time complexity is O(n^2 * k), since in the worst case we might need to compare every file to all the others; n is the number of files and k is the file size. The byte-by-byte comparison is the most time-consuming part and the maps of sizes and hashes the main memory cost; the size and prefix-hash filters above are the optimization, since they keep most files out of the expensive comparison.
5. How do you make sure the duplicated files you find are not false positives?
We apply several filters in sequence: file size, hashing, and finally a byte-by-byte comparison.
A more concise version of the first approach, collecting the duplicate groups with streams:

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class Solution {
    public static List<List<String>> findDuplicate(String[] paths) {
        Map<String, List<String>> map = new HashMap<>();
        for (String path : paths) {
            String[] tokens = path.split(" ");
            for (int i = 1; i < tokens.length; i++) {
                // tokens[0] is the directory; each later token is "name(content)".
                String file = tokens[i].substring(0, tokens[i].indexOf('('));
                String content = tokens[i].substring(tokens[i].indexOf('(') + 1, tokens[i].indexOf(')'));
                map.putIfAbsent(content, new ArrayList<>());
                map.get(content).add(tokens[0] + "/" + file);
            }
        }
        // Keep only groups with more than one file.
        return map.values().stream().filter(e -> e.size() > 1).collect(Collectors.toList());
    }
}
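For reference, a quick driver using LeetCode's example input (assuming the static version above sits in a class named Solution; group order depends on HashMap iteration and may differ):

public class Demo {
    public static void main(String[] args) {
        String[] paths = {
            "root/a 1.txt(abcd) 2.txt(efgh)",
            "root/c 3.txt(abcd)",
            "root/c/d 4.txt(efgh)",
            "root 4.txt(efgh)"
        };
        // Expected groups: [root/a/1.txt, root/c/3.txt] and
        // [root/a/2.txt, root/c/d/4.txt, root/4.txt]
        System.out.println(Solution.findDuplicate(paths));
    }
}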