@heuermh
Created January 11, 2023 17:39
Convert FASTQ to Parquet with zstd compression via DuckDB
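#!/bin/bash
# Stop at the first failing step so later steps never run on missing input.
set -e

# Step 1: flatten FASTQ to tab-delimited text (description, sequence, quality), one read per line.
echo "converting FASTQ to tab-delimited text format, one read per line..."
dsh-bio fastq-to-text -i seqkit-benchmark-data/dataset_C.fq -o seqkit-benchmark-data/dataset_C.txt
echo "dataset_C.txt:"
head -n 2 seqkit-benchmark-data/dataset_C.txt

# Step 2: write the SQL that loads the text file into a table and exports it as zstd-compressed Parquet.
echo "CREATE TABLE reads(description VARCHAR, sequence VARCHAR, quality VARCHAR);" > convert.sql
echo "COPY reads FROM 'seqkit-benchmark-data/dataset_C.txt' (AUTO_DETECT TRUE);" >> convert.sql
echo "COPY reads TO 'dataset_C-zstd.parquet' (FORMAT 'PARQUET', CODEC 'ZSTD');" >> convert.sql

# Step 3: run the SQL against an on-disk DuckDB database file.
echo "converting text format to Parquet with zstd compression via duckdb..."
duckdb dataset_C.duckdb < convert.sql

# Step 4: report the sizes of the Parquet file and the DuckDB database.
echo "file sizes:"
du -h dataset*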
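
The script above hard-codes the dataset_C paths in its three SQL statements. As a rough sketch of how the same statements could be generated for other files, the helper below prints them for a given input and output path. The function name `make_convert_sql` and its two arguments are hypothetical and not part of the original gist; the SQL itself is copied from the script. It assumes the paths contain no single quotes.

```shell
#!/bin/bash
# Hypothetical helper (not in the original gist): print the three SQL
# statements from the script for an arbitrary input text file ($1) and
# output Parquet file ($2). Paths must not contain single quotes.
make_convert_sql() {
  local input_txt="$1"
  local output_parquet="$2"
  cat <<EOF
CREATE TABLE reads(description VARCHAR, sequence VARCHAR, quality VARCHAR);
COPY reads FROM '${input_txt}' (AUTO_DETECT TRUE);
COPY reads TO '${output_parquet}' (FORMAT 'PARQUET', CODEC 'ZSTD');
EOF
}

# Usage (assumes duckdb is on the PATH; file names are placeholders):
#   make_convert_sql reads.txt reads-zstd.parquet | duckdb reads.duckdb
```

Piping the statements to duckdb on standard input mirrors the `duckdb dataset_C.duckdb < convert.sql` call in the script, without leaving a convert.sql file behind.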

heuermh commented Jan 11, 2023

$ ./duckdb-to-parquet.sh
converting FASTQ to tab-delimited text format, one read per line...
dataset_C.txt:
K00137:236:H7NLVBBXX:6:1126:29721:23241 1:N:0	TGGTAGGGAGTTGAGTAGCATGGGTATAGTATAGTGTCATGATGCCAGATTTTAAAAAAAATACTGGAGACAGTCAGCTTATTTATCAGAAAGGTTTATT	```eeiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiieiiiiiiiiiiiiiiiiiiiieiiiii
K00137:236:H7NLVBBXX:6:1117:32410:45906 1:N:0	NCATTCATTATCTCAGCACCGGCATCACGCACGCGGTCTACATAACGGCCCGGCTCGGCCACCATCATGTGGACATCCAGAGGTTTTTCGGCAATGGTGC	B``eeiiiiiiiiiiiieiiiiiiiieiiiiiiiiiiiiiiiiiiiiiiiieii`i`eii[eiiiieVeeieiii[`ei``e[L``eiiiii`i`i`[ei
converting text format to Parquet with zstd compression via duckdb...
100% ▕████████████████████████████████████████████████████████████▏
100% ▕████████████████████████████████████████████████████████████▏
file sizes:
465M	dataset_C-zstd.parquet
608M	dataset_C.duckdb
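
The two dataset_C.txt lines in the output above show the intermediate format: description, sequence, and quality separated by tabs, with the leading "@" of the FASTQ header dropped. For readers without dsh-bio installed, the awk sketch below produces the same three-column layout. It is not the dsh-bio implementation, the function name `fastq_to_tsv` is made up for illustration, and it assumes strict four-line FASTQ records (no wrapped sequence or quality lines).

```shell
#!/bin/bash
# Sketch only, not the dsh-bio implementation: flatten four-line FASTQ
# records into "description<TAB>sequence<TAB>quality", one read per line.
# Reads the files given as arguments, or standard input if none are given.
fastq_to_tsv() {
  awk '
    NR % 4 == 1 { desc = substr($0, 2) }   # header line, leading "@" removed
    NR % 4 == 2 { seq = $0 }               # sequence line
    NR % 4 == 0 { printf "%s\t%s\t%s\n", desc, seq, $0 }  # quality line ends the record
  ' "$@"
}

# Usage (paths taken from the script; output file name is a placeholder):
#   fastq_to_tsv seqkit-benchmark-data/dataset_C.fq > dataset_C.txt
```

The third line of each record (the "+" separator) is skipped on purpose, since the table in the script has no column for it.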