@deeTEEcee
Created August 11, 2022 07:56
Scratch notes + code for a csvdiff tool
"""
How to compare 2 csv files such that I can identify:
* Added/Modified/Removed rows, and for modified rows, specifically which headers changed
There are two ways to look at diffs:
1. Line-by-line diffs
Line-by-line diffs are dumb: a changed row typically shows up as one removed line plus one added line, so they can't tell what a "Modified" item is.
2. Diffs with primary keys.
If we analyze the two csv files and know which headers act as the primary key, we can identify the "Modified" set based
on primary keys, which could be a single header or the joining of multiple headers.
High level logic (csv1, csv2 where csv2 is the newer one):
1. Have 2 arrays of dicts (one dict per row) and assume the headers match. A primary key is created for each row by simply joining the values of the key headers together.
2. Process both csv files and fill them into the data structure from #1.
3. Iterate through csv1 and find items whose primary key doesn't exist in csv2. These represent "Deletions".
4. Iterate through csv2 and find items whose primary key was not in csv1. These represent "Additions".
5. During #4, when a primary key matches in both files, compare each header's value to see which ones changed. These represent "Modifications".
Questions:
1. What is the processing time? Does this work for up to a million rows each for csv1 and csv2?
O(n * m), where n is the number of rows and m is the number of headers, assuming primary key lookups are constant time (e.g. rows indexed in a dict by key). Assume both csv files are relatively similar.
2. How to optimize?
"""
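The high-level logic in the notes above could be sketched roughly as follows. This is a minimal sketch, not the gist author's implementation: the names `load_rows` and `diff_csv`, the `key_headers` parameter, and the sample data are all hypothetical. It also swaps the "array of dicts" for a dict indexed by primary key (so lookups stay constant time) and uses a tuple of key values instead of a joined string, which avoids accidental collisions such as `("a", "bc")` vs `("ab", "c")`.

```python
import csv
import io


def load_rows(text, key_headers):
    """Index CSV rows by a primary key built from key_headers (hypothetical helper)."""
    rows = {}
    for row in csv.DictReader(io.StringIO(text)):
        # Assumption: keys are unique; a duplicate key overwrites the earlier row.
        key = tuple(row[h] for h in key_headers)
        rows[key] = row
    return rows


def diff_csv(old_text, new_text, key_headers):
    """Return (added keys, removed keys, {key: [changed headers]})."""
    old = load_rows(old_text, key_headers)
    new = load_rows(new_text, key_headers)

    # Step 3: keys in the old file that are missing from the new one.
    removed = [key for key in old if key not in new]
    # Step 4: keys in the new file that were not in the old one.
    added = [key for key in new if key not in old]

    # Step 5: for matching keys, record which headers differ.
    modified = {}
    for key, new_row in new.items():
        if key in old:
            changed = [h for h in new_row if old[key].get(h) != new_row[h]]
            if changed:
                modified[key] = changed
    return added, removed, modified


# Made-up sample data, with "id" assumed to be the primary key.
OLD = "id,name,email\n1,Ann,ann@example.com\n2,Bob,bob@example.com\n"
NEW = "id,name,email\n1,Ann,ann@new.example.com\n3,Cy,cy@example.com\n"

added, removed, modified = diff_csv(OLD, NEW, ["id"])
print("added:", added)        # added: [('3',)]
print("removed:", removed)    # removed: [('2',)]
print("modified:", modified)  # modified: {('1',): ['email']}
```

Each row is visited once and each header compared once per matching row, which lines up with the O(n * m) estimate in the notes, though both files are held in memory at the same time.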