Skip to content

Instantly share code, notes, and snippets.

@disulfidebond
Last active March 12, 2024 13:00
Show Gist options
  • Save disulfidebond/e6e2d96a76e56f650ecc53a679151774 to your computer and use it in GitHub Desktop.
Save disulfidebond/e6e2d96a76e56f650ecc53a679151774 to your computer and use it in GitHub Desktop.
Excel File Manipulation Cookbook
Notes: A hash symbol (#) denotes a comment that is not part of any code or Excel formula entry
The following is a cookbook of solutions to common problems with data file manipulation.
When possible, solutions for fixes within Excel and outside of Excel are provided.
I. Problem: There are duplicates in a column, but you don't want to lose row ordering
Solution Within Excel: Pick a column with unique identifiers, or create a new column with unique identifiers in each row with the command:
=1 # add this to the first cell in the column
=[row]+1 # copy and paste this this to each subsequent cell in the column
# where [row] is the previous row entry
Then, sort using the unique entries, and select "Conditional Highlighting" (the specific location varies across Excel Versions),
and select "Highlight Duplicates". Visually scan and remove or modify any duplicates, being careful to maintain ordering.
Solution Outside of Excel: Create a hash that assigns the column in question as the key, and all other entries as a string that is the value.
There are multiple ways to do this; methods in Bash and Python are provided:
#!/usr/bin/bash
# assumes entries are tab-delimited, the column in question is column 3, read from the file entries.txt
awk '{val[$3]++;} END{
for (var in val)
print var, "had ", val[var]," entries"
} entries.txt
# will output each value and the number of times it appears, i.e. cells with more than one are duplicates
# #################################
#!/usr/bin/python
# assumes entries are comma-separated, the column in question is column 3, read from the file entries.txt
l = []
with open('entries.txt') as f:
for i in f:
i = i.rstrip('\r\n')
l.append(i)
d = dict()
for i in l:
s = i.split(',')
k = s[2]
if k not in d:
d[k] = i
else:
print('Found duplicate: ' + str(k))
print('Job Done!')
# outputs only the entries that are duplicates
II. Problem: There are duplicates in columns in two different files or spreadsheets,
or there are unique values only present in one dataset but not the other.
You essentially want to perform an Inner Join SQL lookup.
Solution Within Excel (1): There are two ways to do this. If you'd like to manually see which values are excluded/included:
Sort both sets of data. The specific method isn't important, such as ascending value, descending value, or increasing alphabetical, but be consistent across both sets.
Note: If there is no column with similarly sorted values between both datasets, stop; you will not be able to perform an Inner Join,
or it will not provide you with the expected outcome.
Next, copy all data to one spreadsheet, making certain that all columns are justified to the top row. Then, to the far right of the data, enter the following formula:
=IF([row1] = [row2],TRUE,FALSE)
# where [row1] is the first cell in the first sorted column that is being compared,
# and [row2[ is the first cell in the second sorted column that is being compared
It should read true, or there should be a good reason that it does not. Then, copy and paste this value to all remaining cells in the column.
Finally, visually scan the column for cells with FALSE. You can either reorder the column to continue the verification,
or continue visually scanning, depending on your needs.
Solution Within Excel (2): Use VLOOKUP
First, get a list of the values that are present in both tables. Enter the following code in a spare section of the spreadsheet:
=IFERROR(AGGREGATE(15,6,Sheet2!$A$1:$A$5000/(COUNTIF(Sheet1!$A$1:$A$5000,Sheet2!$A$1:$A$5000)>0),ROW(1:1)),"")
# replace A and Sheet1 with the expected sheet and column/row names
Then, use the VLOOKUP command to return matching values, note that in this example, numbers are specified and not text:
IF(A2<>"",VLOOKUP(A2,Sheet1!A:B,2,FALSE),"")
# replace A and Sheet1 with the expected sheet and column/row names
source: https://stackoverflow.com/questions/35164745/how-to-inner-join-in-excel-eg-using-vlookup
Solution Outside Excel : Use hash table
Follow the steps described in I, but instead create a list or array from one dataset,
and a hash with the other. Due to the increasing complexity of the task, Bash would not be recommended.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment