@RichLogan
Last active March 30, 2016 10:41
Remove duplicate rows per column
import os
import sys

# Zero-based index of the column used to detect duplicates.
column = int(input("Which column do you want to test on? (Starting from 0): "))
delimiter = input("How is your file delimited? ")
duplicate_count = 0

# Write the result next to the input file as <name>_fixed.csv.
output_path = os.path.splitext(sys.argv[1])[0] + "_fixed.csv"

with open(sys.argv[1], 'r') as input_file, open(output_path, 'w') as output_file:
    seen = set()
    for line in input_file:
        # Drop the line ending so the last column compares cleanly.
        row = line.rstrip('\r\n').split(delimiter)
        # Keep only the first row seen for each value in the chosen column.
        if row[column] not in seen:
            output_file.write(line)
            seen.add(row[column])
        else:
            duplicate_count += 1

print("Found " + str(duplicate_count) + " duplicate rows")
print("Output file at: " + output_path)
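The script above splits each line on the raw delimiter string, so a quoted field that contains the delimiter (for example "Smith, J" in a comma-separated file) is cut into two columns. A minimal sketch of the same first-occurrence-wins logic built on the standard csv module follows. The function name dedupe_rows and the sample data are hypothetical and not part of the original gist.

```python
import csv
import io


def dedupe_rows(lines, column, delimiter=","):
    """Return (kept_rows, duplicate_count), keeping the first row per key.

    `lines` is any iterable of text lines, such as an open file.
    """
    seen = set()
    kept = []
    duplicates = 0
    for row in csv.reader(lines, delimiter=delimiter):
        if not row:
            # csv.reader yields an empty list for a blank line; skip it.
            continue
        key = row[column]
        if key in seen:
            duplicates += 1
        else:
            seen.add(key)
            kept.append(row)
    return kept, duplicates


# Hypothetical sample: the quoted names contain the delimiter.
sample = io.StringIO('id,name\n1,"Smith, J"\n2,"Doe, A"\n1,"Smith, John"\n')
rows, dupes = dedupe_rows(sample, column=0)
print(rows)
print("Found " + str(dupes) + " duplicate rows")
```

As in the script, a row with fewer fields than the chosen column index raises an IndexError. To write the kept rows back out, csv.writer with the same delimiter would re-quote fields where needed.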