@alex9311
Last active October 25, 2016 16:29
## Bash script to convert a segmented Hive table to a single CSV

### Hive Default Delimiters to CSV

##### Reasoning

Hive's default delimiters are:

Field Delimiter => Control-A ('\001')
Collection Item Delimiter => Control-B ('\002')
Map Key Delimiter => Control-C ('\003')
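To see what these bytes look like in practice, here is a quick sketch (the sample row is made up; it assumes a `cat` that supports `-v`, which renders control characters in caret notation):

```shell
# Emit one fake "row": two fields separated by \001; the second field is a
# map whose entries are separated by \002, with \003 between key and value.
printf 'id1\001k1\003v1\002k2\003v2\n' | cat -v
# cat -v renders the control bytes in caret notation: id1^Ak1^Cv1^Bk2^Cv2
```

This caret-notation form (`^A`, `^B`, `^C`) is what the sed commands below match against.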

There are ways to change these delimiters when exporting tables, but sometimes you still end up stuck converting a default-delimited export to CSV.

Here's a quick bash script that handles a table export split across multiple segment files with the default delimiters, and produces a single CSV file.

The script assumes the segments all follow the naming convention `000*_0`.

```bash
INDIRECTORY="path/to/input/directory"
for f in "$INDIRECTORY"/000*_0; do
	echo "Processing $f file.."
	# cat -v renders the control bytes as visible ^A, ^B, ^C sequences,
	# which the sed commands below then rewrite into quoted CSV.
	cat -v "$f" |
		LC_ALL=C sed -e 's/^/"/' |
		LC_ALL=C sed -e 's/\^A/","/g' |
		LC_ALL=C sed -e 's/\^C\^B/"":"""",""/g' |
		LC_ALL=C sed -e 's/\^B/"",""/g' |
		LC_ALL=C sed -e 's/\^C/"":""/g' |
		LC_ALL=C sed -e 's/$/"/' > "$f-temp"
done
echo "you,can,echo,your,header,here,if,you,like" > "$INDIRECTORY/final_output.csv"
cat "$INDIRECTORY"/*-temp >> "$INDIRECTORY/final_output.csv"
rm "$INDIRECTORY"/*-temp
```
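A quick way to sanity-check the pipeline is to run the same sed chain against a throwaway segment file built with `printf` (the directory and the sample row below are made up for the test):

```shell
# Build a throwaway input directory with one fake segment file.
INDIRECTORY=$(mktemp -d)
printf 'ec329063\0012016-01-21\001hash\003\002currenturl\003google\002adblock\003true\001dxbr5b0v87\n' \
    > "$INDIRECTORY/000000_0"

# Same conversion chain as the script above.
for f in "$INDIRECTORY"/000*_0; do
    cat -v "$f" |
        LC_ALL=C sed -e 's/^/"/' |
        LC_ALL=C sed -e 's/\^A/","/g' |
        LC_ALL=C sed -e 's/\^C\^B/"":"""",""/g' |
        LC_ALL=C sed -e 's/\^B/"",""/g' |
        LC_ALL=C sed -e 's/\^C/"":""/g' |
        LC_ALL=C sed -e 's/$/"/' > "$f-temp"
done
echo "id,date,event-map,col4" > "$INDIRECTORY/final_output.csv"
cat "$INDIRECTORY"/*-temp >> "$INDIRECTORY/final_output.csv"
rm "$INDIRECTORY"/*-temp
cat "$INDIRECTORY/final_output.csv"
```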

##### Example

Suppose our Hive table with one row looks like this:

| id | date | event-map | col4 |
| --- | --- | --- | --- |
| ec329063 | 1/21/16 | {"hash":"","currenturl":"google","adblock":"true"} | dxbr5b0v87 |

The raw Hive output with default delimiters (shown with the control bytes made visible, as `cat -v` would render them) would then be:

```
ec329063^A2016-01-21^Ahash^C^Bcurrenturl^Cgoogle^Badblock^Ctrue^Adxbr5b0v87
```

Once the script runs, you'll have a CSV output:

```
"ec329063","2016-01-21","hash"":"""",""currenturl"":""google"",""adblock"":""true","dxbr5b0v87"
```