@jiaaro
Created April 15, 2011 18:34
Finds the largest duplicate text block in a file
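#!/usr/bin/env python
import sys

# Read the file named by the last command-line argument.
with open(sys.argv[-1], "r") as f:
    orig_txt = f.readlines()

# Compare lines with leading and trailing whitespace removed.
txt = [line.strip() for line in orig_txt]

match_len = 0
# Holds [occurrence 1 start, occurrence 1 end,
#        occurrence 2 start, occurrence 2 end] once a match is found.
longest_match = None

for i in range(len(txt)):
    for j in range(i + 1, len(txt)):
        if txt[i] != txt[j]:
            continue

        # Lines i and j match; count how long the two runs stay identical.
        matching_lines = 1
        try:
            while txt[i + matching_lines] == txt[j + matching_lines]:
                matching_lines += 1
        except IndexError:
            pass  # the second run reached the end of the file

        if matching_lines > match_len:
            match_len = matching_lines
            # Indexes are 0-based; each end index is exclusive.
            longest_match = [
                i, i + matching_lines,
                j, j + matching_lines,
            ]

print("Longest match: %s lines" % match_len)
if longest_match is None:
    print("No duplicate lines found")
else:
    print("Lines %s - %s and %s - %s match" % tuple(longest_match))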
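The same search can be wrapped in a function so it can be tried on an in-memory list instead of a file. This is only a sketch, assuming Python 3; the name `longest_duplicate_block` and its return shape are hypothetical and not part of the original gist. Like the script, it strips each line before comparing and does not prevent the two runs from overlapping.

```python
def longest_duplicate_block(lines):
    """Return (length, first_start, second_start) for the longest repeated run.

    Hypothetical helper, not part of the original script. Starts are 0-based
    indexes into `lines`; the result is (0, None, None) when nothing repeats.
    """
    txt = [line.strip() for line in lines]
    best = (0, None, None)
    for i in range(len(txt)):
        for j in range(i + 1, len(txt)):
            if txt[i] != txt[j]:
                continue
            n = 1
            # Extend while the later run stays in bounds and both runs match.
            while j + n < len(txt) and txt[i + n] == txt[j + n]:
                n += 1
            if n > best[0]:
                best = (n, i, j)
    return best


sample = ["a", "b", "c", "x", "a", "b", "c", "y"]
print(longest_duplicate_block(sample))  # (3, 0, 4)
```

The explicit bounds check stands in for the script's `try`/`except IndexError`. The nested loops make this quadratic or worse in the number of lines, which is fine for small files but slow for large ones.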