@Haaroon
Created July 6, 2020 15:18
Extract from gzip, remove duplicates from CSV but keep the header, then gzip the result
# Extract a CSV file from a gzip archive while keeping the original archive:
#   zcat file_compressed.csv.gz
# Keep only the first two columns and save them:
#   cut -d',' -f1,2 > file_uncompressed.csv
zcat file_compressed.csv.gz | cut -d',' -f1,2 > file_uncompressed.csv
# Since it's a CSV with a header, keep the header line:
#   head -n1 file_uncompressed.csv
# Take the rest of the file starting at line 2 (+2), sort it, and use uniq
# to remove duplicates:
#   tail -n +2 file_uncompressed.csv | sort | uniq
# gzip the result and save it into a new file:
#   gzip > file_compressed_no_dups.csv.gz
(head -n1 file_uncompressed.csv && tail -n +2 file_uncompressed.csv | sort | uniq) | gzip > file_compressed_no_dups.csv.gz
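The two commands above can also be sketched as a single pipeline with no intermediate file. This is a variation, not the gist's original method: the function name `dedupe_csv_gz` and its two arguments are hypothetical, `gzip -dc` stands in for `zcat`, `sort -u` stands in for `sort | uniq`, and `LC_ALL=C` is an assumed choice to make the sort order byte-wise and reproducible. As with the original, `cut -d','` assumes no field contains a quoted comma.

```shell
# Hypothetical helper: $1 = input .csv.gz, $2 = output .csv.gz
dedupe_csv_gz() {
    gzip -dc "$1" | cut -d',' -f1,2 | {
        # Read the header line first so it is never sorted with the data.
        if IFS= read -r header; then
            printf '%s\n' "$header"
            # Everything left on stdin is data: sort it and drop duplicates.
            LC_ALL=C sort -u
        fi
    } | gzip > "$2"
}
```

Usage would look like `dedupe_csv_gz file_compressed.csv.gz file_compressed_no_dups.csv.gz`. The original archive is left untouched, as in the two-step version.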