disulfidebond/FileManipulationCookbook

## FileManipulationCookbook
Notes: A hash symbol (#) denotes a comment that is not part of any code or Excel formula entry
The following is a cookbook of solutions to common problems with data file manipulation.
When possible, solutions for fixes within Excel and outside of Excel are provided.

I. Problem: There are duplicates in a column, but you don't want to lose row ordering

Solution Within Excel:  Pick a column with unique identifiers, or create a new column with unique identifiers in each row with the command:

        =1 # add this to the first cell in the column
        =[row]+1 # copy and paste this this to each subsequent cell in the column
        # where [row] is the previous row entry

Then, sort using the unique entries, and select "Conditional Highlighting" (the specific location varies across Excel Versions),
and select "Highlight Duplicates".  Visually scan and remove or modify any duplicates, being careful to maintain ordering.

Solution Outside of Excel: Create a hash that assigns the column in question as the key, and all other entries as a string that is the value.
There are multiple ways to do this; methods in Bash and Python are provided:

        #!/usr/bin/bash

        # assumes entries are tab-delimited, the column in question is column 3, read from the file entries.txt

        awk '{val[$3]++;} END{
          for (var in val)
            print var, "had ", val[var]," entries"
        } entries.txt

        # will output each value and the number of times it appears, i.e. cells with more than one are duplicates


        # #################################

        #!/usr/bin/python

        # assumes entries are comma-separated, the column in question is column 3, read from the file entries.txt
        l = []
        with open('entries.txt') as f:
            for i in f:
              i = i.rstrip('\r\n')
              l.append(i)
        d = dict()
        for i in l:
          s = i.split(',')
          k = s[2]
          if k not in d:
            d[k] = i
          else:
            print('Found duplicate: ' + str(k))

        print('Job Done!')
        # outputs only the entries that are duplicates

II. Problem: There are duplicates in columns in two different files or spreadsheets,
or there are unique values only present in one dataset but not the other.

You essentially want to perform an Inner Join SQL lookup.

Solution Within Excel (1):  There are two ways to do this.  If you'd like to manually see which values are excluded/included:

Sort both sets of data.  The specific method isn't important, such as ascending value, descending value, or increasing alphabetical, but be consistent across both sets.
Note: If there is no column with similarly sorted values between both datasets, stop; you will not be able to perform an Inner Join,
or it will not provide you with the expected outcome.

Next, copy all data to one spreadsheet, making certain that all columns are justified to the top row.  Then, to the far right of the data, enter the following formula:

        =IF([row1] = [row2],TRUE,FALSE)
        # where [row1] is the first cell in the first sorted column that is being compared,
        # and [row2[ is the first cell in the second sorted column that is being compared

It should read true, or there should be a good reason that it does not.  Then, copy and paste this value to all remaining cells in the column.

Finally, visually scan the column for cells with FALSE.  You can either reorder the column to continue the verification,
or continue visually scanning, depending on your needs.

Solution Within Excel (2):  Use VLOOKUP

First, get a list of the values that are present in both tables.  Enter the following code in a spare section of the spreadsheet:

        =IFERROR(AGGREGATE(15,6,Sheet2!$A$1:$A$5000/(COUNTIF(Sheet1!$A$1:$A$5000,Sheet2!$A$1:$A$5000)>0),ROW(1:1)),"")
        # replace A and Sheet1 with the expected sheet and column/row names

Then, use the VLOOKUP command to return matching values, note that in this example, numbers are specified and not text:

        IF(A2<>"",VLOOKUP(A2,Sheet1!A:B,2,FALSE),"")
        # replace A and Sheet1 with the expected sheet and column/row names

source: https://stackoverflow.com/questions/35164745/how-to-inner-join-in-excel-eg-using-vlookup

Solution Outside Excel :  Use hash table

Follow the steps described in I, but instead create a list or array from one dataset,
and a hash with the other.  Due to the increasing complexity of the task, Bash would not be recommended.
	Notes: A hash symbol (#) denotes a comment that is not part of any code or Excel formula entry
	The following is a cookbook of solutions to common problems with data file manipulation.
	When possible, solutions for fixes within Excel and outside of Excel are provided.

	I. Problem: There are duplicates in a column, but you don't want to lose row ordering

	Solution Within Excel: Pick a column with unique identifiers, or create a new column with unique identifiers in each row with the command:

	=1 # add this to the first cell in the column
	=[row]+1 # copy and paste this this to each subsequent cell in the column
	# where [row] is the previous row entry

	Then, sort using the unique entries, and select "Conditional Highlighting" (the specific location varies across Excel Versions),
	and select "Highlight Duplicates". Visually scan and remove or modify any duplicates, being careful to maintain ordering.

	Solution Outside of Excel: Create a hash that assigns the column in question as the key, and all other entries as a string that is the value.
	There are multiple ways to do this; methods in Bash and Python are provided:

	#!/usr/bin/bash

	# assumes entries are tab-delimited, the column in question is column 3, read from the file entries.txt

	awk '{val[$3]++;} END{
	for (var in val)
	print var, "had ", val[var]," entries"
	} entries.txt

	# will output each value and the number of times it appears, i.e. cells with more than one are duplicates


	# #################################

	#!/usr/bin/python

	# assumes entries are comma-separated, the column in question is column 3, read from the file entries.txt
	l = []
	with open('entries.txt') as f:
	for i in f:
	i = i.rstrip('\r\n')
	l.append(i)
	d = dict()
	for i in l:
	s = i.split(',')
	k = s[2]
	if k not in d:
	d[k] = i
	else:
	print('Found duplicate: ' + str(k))

	print('Job Done!')
	# outputs only the entries that are duplicates

	II. Problem: There are duplicates in columns in two different files or spreadsheets,
	or there are unique values only present in one dataset but not the other.

	You essentially want to perform an Inner Join SQL lookup.

	Solution Within Excel (1): There are two ways to do this. If you'd like to manually see which values are excluded/included:

	Sort both sets of data. The specific method isn't important, such as ascending value, descending value, or increasing alphabetical, but be consistent across both sets.
	Note: If there is no column with similarly sorted values between both datasets, stop; you will not be able to perform an Inner Join,
	or it will not provide you with the expected outcome.

	Next, copy all data to one spreadsheet, making certain that all columns are justified to the top row. Then, to the far right of the data, enter the following formula:

	=IF([row1] = [row2],TRUE,FALSE)
	# where [row1] is the first cell in the first sorted column that is being compared,
	# and [row2[ is the first cell in the second sorted column that is being compared

	It should read true, or there should be a good reason that it does not. Then, copy and paste this value to all remaining cells in the column.

	Finally, visually scan the column for cells with FALSE. You can either reorder the column to continue the verification,
	or continue visually scanning, depending on your needs.

	Solution Within Excel (2): Use VLOOKUP

	First, get a list of the values that are present in both tables. Enter the following code in a spare section of the spreadsheet:

	=IFERROR(AGGREGATE(15,6,Sheet2!$A$1:$A$5000/(COUNTIF(Sheet1!$A$1:$A$5000,Sheet2!$A$1:$A$5000)>0),ROW(1:1)),"")
	# replace A and Sheet1 with the expected sheet and column/row names

	Then, use the VLOOKUP command to return matching values, note that in this example, numbers are specified and not text:

	IF(A2<>"",VLOOKUP(A2,Sheet1!A:B,2,FALSE),"")
	# replace A and Sheet1 with the expected sheet and column/row names

	source: https://stackoverflow.com/questions/35164745/how-to-inner-join-in-excel-eg-using-vlookup

	Solution Outside Excel : Use hash table

	Follow the steps described in I, but instead create a list or array from one dataset,
	and a hash with the other. Due to the increasing complexity of the task, Bash would not be recommended.