@vergenzt · Last active February 12, 2020
Split a large CSV file into chunks of rows when fields may contain newlines (i.e., a naive line-based split won't work)
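
To see why a plain line-based split (e.g. `split -l`) is unsafe here, consider a record whose quoted field contains a newline (hypothetical data): the record spans two physical lines, so splitting on line count can cut it in half. Converting to newline-delimited JSON first, as the script below does, puts each record back onto a single line.

# Hypothetical illustration: one CSV record spanning two physical lines.
printf 'id,notes\n1,"line one\nline two"\n' > example.csv
wc -l example.csv                            # 3 physical lines, but only 1 data row
split -l 2 example.csv example_              # example_ab now starts mid-record
csvjson --stream --no-inference example.csv  # one JSON object, on a single line
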
#!/usr/bin/env bash
# Uses the macOS `osascript` file-picker for convenience.
# Depends on `coreutils` (for `gsplit`) and `csvkit` (for `csvjson`/`in2csv`).
# Number of data rows per output chunk.
ROWS_PER_FILE=20000

# Ask for the combined CSV via a native macOS dialog; exit if the user cancels.
FILE="$(osascript -e 'POSIX path of (choose file with prompt "Choose combined CSV file")')"
if [ -z "$FILE" ]; then exit; fi

# Name chunks after the input file (without extension) and work in its directory.
BASE="$(basename "$FILE" | cut -d. -f1)"
cd "$(dirname "$FILE")" || exit

echo "Splitting $FILE into chunks of $ROWS_PER_FILE rows..."
cat "$FILE" \
| csvjson --stream --no-inference --snifflimit 0 \
| gsplit -d --additional-suffix=.json -l $ROWS_PER_FILE -u - "${BASE}_"
# Convert each NDJSON chunk back to CSV (in2csv re-derives the header from the
# JSON keys, so every chunk gets its own header row), then drop the JSON file.
for chunk_json in "${BASE}"_*.json; do
  chunk_csv="$(basename "$chunk_json" .json).csv"
  in2csv -f ndjson --no-inference "$chunk_json" > "$chunk_csv"
  echo "Processed $chunk_csv"
  rm "$chunk_json"
done
echo "Done!"