Skip to content

Instantly share code, notes, and snippets.

@philippgeisler
Created July 15, 2015 21:01
Show Gist options
  • Star 0 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save philippgeisler/6b3ece35f1647fa5a1ca to your computer and use it in GitHub Desktop.
Save philippgeisler/6b3ece35f1647fa5a1ca to your computer and use it in GitHub Desktop.
Awk script to fix header values/fill in missing headers in csv/tsv data created by export from WYSIWYG spreadsheets with joined cells.
#!/usr/local/bin/awk -f
BEGIN {
# adjust the following variables according to your needs:
FS = "\t" # the field seperator, "\t" for tab seperated, "," for comma seperated data
first = 2 # number of the first header line to be processed
last = 4 # number of the last header line to be processed
string_delimiter = "\"" # character used for delimiting strings in the input file; set to ""
# if none is used (empty string)
# Please note that the script will not work as expected if there are additional characters/
# spaces between string_delimiters and field seperators in the input file.
}
{
current = ""
previous = ""
if ( NR >= first && NR <= last ) {
for (i = 1; i <= NF; i++) {
current = $i
gsub(string_delimiter,"", current)
if ( current == "" && column_header[i] == "" ) {
current = previous
} else {
previous = current
}
if ( column_header[i] == "" ) {
column_header[i] = current
} else {
if ( current != "" ) {
column_header[i] = column_header[i] ": " current
}
}
}
}
if ( NR == last ) {
for (i = 1; i < length(column_header); i++) {
printf("%s%s%s%s", string_delimiter, column_header[i], string_delimiter, FS)
}
printf("%s%s%s\n", string_delimiter, column_header[i], string_delimiter)
}
if ( NR > last ) {
print
}
}
@philippgeisler
Copy link
Author

Let me try to better explain what this script does by example.

Let’s say you have a nice table in a spreadsheet program like Excel and by joining cells have arranged your headers to mirror the hierarchical nature of the data in the layout, e.g. like so:

     Transport        |
---------|------------|
 private |   public   |
---------|-----|------|
   car   | bus | taxi |

Looks nice, but when exported to CSV, the result will (may) look like this:

"Transport","",""
"private","public",""
"car","bus","taxi"

Which is quite undesireable for further processing in other programs, esp. on the command line.

This Awk script will concatenate the appropriate fields from the three rows into one row while preserving the hierarchy. The result in this example would be:

"Transport: private: car","Transport: public: bus","Transport: public: taxi"

All following rows, the data, will be left untouched.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment