Created January 11, 2023 17:39
Convert fastq to Parquet with zstd compression via duckdb
#!/bin/bash
echo "converting FASTQ to tab-delimited text format, one read per line..."
dsh-bio fastq-to-text -i seqkit-benchmark-data/dataset_C.fq -o seqkit-benchmark-data/dataset_C.txt
echo "dataset_C.txt:"
head -n 2 seqkit-benchmark-data/dataset_C.txt
echo "CREATE TABLE reads(description VARCHAR, sequence VARCHAR, quality VARCHAR);" > convert.sql
echo "COPY reads FROM 'seqkit-benchmark-data/dataset_C.txt' (AUTO_DETECT TRUE);" >> convert.sql
echo "COPY reads TO 'dataset_C-zstd.parquet' (FORMAT 'PARQUET', CODEC 'ZSTD');" >> convert.sql
echo "converting text format to Parquet with zstd compression via duckdb..."
duckdb dataset_C.duckdb < convert.sql
echo "file sizes:"
du -h dataset*
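
The first step of the script flattens each FASTQ record into one tab-delimited line so that DuckDB can load it as a three-column table. For readers without dsh-bio installed, the idea can be sketched with plain awk. This is an illustration only, not how dsh-bio implements `fastq-to-text`: the helper name `fastq_to_tsv`, the file names `example.fq` and `example.txt`, and the choice to drop the leading `@` from the description are all assumptions, and the sketch assumes strict four-line records (no wrapped sequence or quality lines).

```shell
#!/bin/bash
# Hypothetical helper (not part of dsh-bio): flatten four-line FASTQ records
# into tab-delimited description, sequence, quality columns.
# Reads the files given as arguments, or stdin when there are none.
fastq_to_tsv() {
  awk '
    NR % 4 == 1 { desc = substr($0, 2) }          # header line, leading @ dropped (assumption)
    NR % 4 == 2 { seq = $0 }                      # sequence line
    NR % 4 == 0 { print desc "\t" seq "\t" $0 }   # quality line completes the record
  ' "$@"
}

# Tiny made-up two-read input, for illustration only.
printf '@read1 sample\nACGT\n+\nIIII\n@read2\nGGCC\n+\n!!!!\n' > example.fq
fastq_to_tsv example.fq > example.txt
cat example.txt
```

Each output line carries the same three columns that the `CREATE TABLE reads(...)` statement in the script declares, which is why a file in this shape can be handed to `COPY reads FROM`.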
Author: heuermh, commented Jan 11, 2023