Compress with dsh-bio compress-gfa1 (this parses and validates GFA 1.0 records before writing) to Block Gzipped (BGZF, .bgz), Bzip2 (.bz2), Gzip (.gz), and Zstandard (.zst) formats
gunzip human__pan.AF0__18.gfa.gz
dsh-bio compress-gfa1 -i human__pan.AF0__18.gfa -o human__pan.AF0__18.gfa.bgz
dsh-bio compress-gfa1 -i human__pan.AF0__18.gfa -o human__pan.AF0__18.gfa.bz2
dsh-bio compress-gfa1 -i human__pan.AF0__18.gfa -o human__pan.AF0__18.gfa.gz
dsh-bio compress-gfa1 -i human__pan.AF0__18.gfa -o human__pan.AF0__18.gfa.zst
Compare file sizes
$ du -h *
546M human__pan.AF0__18.gfa
868M human__pan.AF0__18.gfa.bgz
80M human__pan.AF0__18.gfa.bz2
100M human__pan.AF0__18.gfa.gz
309M human__pan.AF0__18.gfa.zst
(What happened to the .bgz file? It came out larger than the uncompressed input. One plausible explanation: BGZF compresses data in small, independently readable blocks to support random access, and per-block framing overhead can outweigh any compression savings on short records. A sketch of the effect follows.)
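A minimal sketch of that effect, assuming htsjdk's BlockCompressedOutputStream (dsh-bio depends on htsjdk for BGZF) and deliberately forcing one block per record; the file name and record content here are illustrative only, not an inspection of dsh-bio's writer:

import htsjdk.samtools.util.BlockCompressedOutputStream
import java.io.File

// Illustrative only: force one BGZF block per record by flushing after
// every write. Each block carries its own gzip header and deflate framing,
// so many tiny blocks can make the output larger than the input.
val out = new BlockCompressedOutputStream(new File("tiny-blocks.bgz"))
for (i <- 0 until 100000) {
  out.write(s"L\t$i\t+\t${i + 1}\t+\t32M\n".getBytes)
  out.flush() // deflates and emits the current (tiny) block
}
out.close()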
Compare the streaming filter-segments-by-sequence-length operation across formats; a minimal sketch of the filter itself follows the timings
$ time dsh-bio filter-gfa1 --length 24 -i human__pan.AF0__18.gfa > /dev/null
real 0m34.717s
user 1m34.369s
sys 0m4.148s
$ time dsh-bio filter-gfa1 --length 24 -i human__pan.AF0__18.gfa.bgz > /dev/null
real 1m7.400s
user 2m3.822s
sys 0m18.442s
$ time dsh-bio filter-gfa1 --length 24 -i human__pan.AF0__18.gfa.bz2 > /dev/null
real 0m33.617s
user 1m25.485s
sys 0m4.045s
$ time dsh-bio filter-gfa1 --length 24 -i human__pan.AF0__18.gfa.gz > /dev/null
real 0m37.181s
user 1m40.044s
sys 0m2.987s
$ time dsh-bio filter-gfa1 --length 24 -i human__pan.AF0__18.gfa.zst > /dev/null
real 0m38.020s
user 1m39.285s
sys 0m4.202s
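For reference, a minimal sketch of the filter itself in plain Scala (not dsh-bio's implementation), assuming --length 24 keeps segments whose sequence is longer than 24 bp, matching the Spark queries later on this page:

import scala.io.Source

// Stream GFA 1.0 lines and count segment (S) records whose sequence
// exceeds the length threshold. GFA 1.0 fields are tab-separated;
// a * sequence means the sequence is absent.
val minLength = 24
val kept = Source.fromFile("human__pan.AF0__18.gfa").getLines().count { line =>
  line.startsWith("S\t") && {
    val fields = line.split("\t")
    fields.length > 2 && fields(2) != "*" && fields(2).length > minLength
  }
}
println(s"$kept segments with sequence longer than $minLength bp")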
Create a path-with-traversals GFA 1.0 file
cat human__pan.AF0__18.gfa | dsh-bio traverse-paths | dsh-bio truncate-paths > human__pan.AF0__18.traversals.gfa
For example, the P record's segment names and overlaps have been truncated to *, while each step along the path is now an explicit T (traversal) record:
$ grep "^P.*$" human__pan.AF0__18.traversals.gfa
P 18 * *
$ grep "^T.*$" human__pan.AF0__18.traversals.gfa | head
T 18 0 261306388 + 261306389 + 32M
T 18 1 261306389 + 261306390 + 32M
T 18 2 261306390 + 261306391 + 32M
T 18 3 261306391 + 261306392 + 32M
T 18 4 261306392 + 261306393 + 32M
T 18 5 261306393 + 261306394 + 32M
T 18 6 261306394 + 261306395 + 32M
T 18 7 261306395 + 261306396 + 32M
T 18 8 261306396 + 261306397 + 32M
T 18 9 261306397 + 261306398 + 32M
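For reference, a minimal sketch of parsing these T (traversal) records, a dsh-bio extension to GFA 1.0, assuming the field layout shown above: path name, ordinal, source segment and orientation, target segment and orientation, and overlap in CIGAR form:

case class Traversal(pathName: String, ordinal: Int,
                     source: String, sourceOrientation: String,
                     target: String, targetOrientation: String,
                     overlap: String)

// GFA fields are tab-separated; lines that are not T records yield None.
def parseTraversal(line: String): Option[Traversal] =
  line.split("\t") match {
    case Array("T", path, ord, src, srcO, tgt, tgtO, overlap, _*) =>
      Some(Traversal(path, ord.toInt, src, srcO, tgt, tgtO, overlap))
    case _ => None
  }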
Compress with dsh-bio compress-gfa1 to Block Gzipped (BGZF, .bgz), Bzip2 (.bz2), Gzip (.gz), and Zstandard (.zst) formats
dsh-bio compress-gfa1 -i human__pan.AF0__18.traversals.gfa -o human__pan.AF0__18.traversals.gfa.bgz
dsh-bio compress-gfa1 -i human__pan.AF0__18.traversals.gfa -o human__pan.AF0__18.traversals.gfa.bz2
dsh-bio compress-gfa1 -i human__pan.AF0__18.traversals.gfa -o human__pan.AF0__18.traversals.gfa.gz
dsh-bio compress-gfa1 -i human__pan.AF0__18.traversals.gfa -o human__pan.AF0__18.traversals.gfa.zst
Compare file sizes
$ du -h *
689M human__pan.AF0__18.traversals.gfa
1.2G human__pan.AF0__18.traversals.gfa.bgz
103M human__pan.AF0__18.traversals.gfa.bz2
130M human__pan.AF0__18.traversals.gfa.gz
422M human__pan.AF0__18.traversals.gfa.zst
Transform GFA 1.0 to generic Gfa1Record records in Parquet format
spark-submit \
--conf spark.sql.parquet.compression.codec=gzip \
--driver-memory 30G \
--class com.github.heuermh.adam.gfa.Gfa1ToDataframe \
adam-gfa_2.11-0.4.0-SNAPSHOT.jar \
human__pan.AF0__18.gfa \
human__pan.AF0__18.gzip.parquet
spark-submit \
--conf spark.sql.parquet.compression.codec=snappy \
--driver-memory 30G \
--class com.github.heuermh.adam.gfa.Gfa1ToDataframe \
adam-gfa_2.11-0.4.0-SNAPSHOT.jar \
human__pan.AF0__18.gfa \
human__pan.AF0__18.snappy.parquet
Compare file sizes
$ du -h *.parquet
101M human__pan.AF0__18.gzip.parquet
193M human__pan.AF0__18.snappy.parquet
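For comparison, the Parquet output can be re-encoded with a different codec without rerunning the transform; a minimal sketch in spark-shell (the zstd codec and output path are illustrative, and zstd availability depends on the Spark and Parquet versions in use):

// Read the existing Parquet and rewrite it with another compression codec.
spark.read
  .parquet("human__pan.AF0__18.gzip.parquet")
  .write
  .option("compression", "zstd")
  .parquet("human__pan.AF0__18.zstd.parquet")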
Compare the filter-segments-by-sequence-length operation on the generic records
$ INPUT=/data/graph-assembly/benchmarks/human__pan.AF0__18.gzip.parquet \
spark-shell \
--driver-memory 30G \
--jars adam-gfa_2.11-0.4.0-SNAPSHOT.jar \
-i filterGfa1Records.scala
Filtered human__pan.AF0__18.gzip.parquet to 1796551 records in 4237037551 ns or 4.237 seconds
$ INPUT=/data/graph-assembly/benchmarks/human__pan.AF0__18.snappy.parquet \
spark-shell \
--driver-memory 30G \
--jars adam-gfa_2.11-0.4.0-SNAPSHOT.jar \
-i filterGfa1Records.scala
Filtered human__pan.AF0__18.snappy.parquet to 1796551 records in 4264487979 ns or 4.264 seconds
filterGfa1Records.scala:
import org.slf4j.LoggerFactory
val logger = LoggerFactory.getLogger("filterGfa1Records")

// raise the log level so the timing line below shows up in spark-shell output
import org.apache.log4j.{ Level, Logger }
Logger.getLogger("filterGfa1Records").setLevel(Level.INFO)

// the input Parquet path is passed via the INPUT environment variable
val inputPath = Option(System.getenv("INPUT"))
if (inputPath.isEmpty) {
  logger.error("INPUT environment variable is required")
  System.exit(1)
}

// spark-shell imports spark.implicits._ automatically; repeated here to make
// the Dataset conversion below explicit
import spark.implicits._
import com.github.heuermh.adam.gfa.sql.gfa1.Gfa1Record

// time the filter: count segment (S) records with sequence longer than 24 bp
val before = System.nanoTime
val count = spark
  .read
  .parquet(inputPath.get)
  .as[Gfa1Record]
  .where("recordType = 'S' and length(sequence) > 24")
  .count
val elapsed = System.nanoTime - before

logger.info("Filtered %s to %d records in %d ns or %.3f seconds".format(inputPath.get, count, elapsed, elapsed / 1.0e9))
System.exit(0)
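Since the generic layout stores all record types in a single table (hence the recordType = 'S' predicate above), a quick breakdown by record type can be a useful sanity check; a minimal sketch in spark-shell:

// Count records per GFA 1.0 record type in the generic table.
spark.read
  .parquet("human__pan.AF0__18.gzip.parquet")
  .groupBy("recordType")
  .count
  .show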
Transform GFA 1.0 to specific Link, Path, Segment, and Traversal records in Parquet format
spark-submit \
--conf spark.sql.parquet.compression.codec=gzip \
--driver-memory 30G \
--class com.github.heuermh.adam.gfa.Gfa1ToDataframes \
adam-gfa_2.11-0.4.0-SNAPSHOT.jar \
human__pan.AF0__18.traversals.gfa \
human__pan.AF0__18.traversals.gzip
spark-submit \
--conf spark.sql.parquet.compression.codec=snappy \
--driver-memory 30G \
--class com.github.heuermh.adam.gfa.Gfa1ToDataframes \
adam-gfa_2.11-0.4.0-SNAPSHOT.jar \
human__pan.AF0__18.traversals.gfa \
human__pan.AF0__18.traversals.snappy
Compare file sizes
$ du -h *.parquet
36M human__pan.AF0__18.gzip.links.parquet
17M human__pan.AF0__18.gzip.paths.parquet
49M human__pan.AF0__18.gzip.segments.parquet
16K human__pan.AF0__18.gzip.traversals.parquet
77M human__pan.AF0__18.snappy.links.parquet
31M human__pan.AF0__18.snappy.paths.parquet
87M human__pan.AF0__18.snappy.segments.parquet
16K human__pan.AF0__18.snappy.traversals.parquet
Total of about 102M for gzip and 195M for snappy (the traversals files contribute only 16K each).
Compare the filter-segments-by-sequence-length operation on the specific Segment records
$ INPUT=human__pan.AF0__18.gzip.segments.parquet \
spark-shell \
--driver-memory 30G \
--jars adam-gfa_2.11-0.4.0-SNAPSHOT.jar \
-i filterSegments.scala
Filtered human__pan.AF0__18.gzip.segments.parquet to 1796551 segments in 3616933420 ns or 3.617 seconds
$ INPUT=human__pan.AF0__18.snappy.segments.parquet \
spark-shell \
--driver-memory 30G \
--jars adam-gfa_2.11-0.4.0-SNAPSHOT.jar \
-i filterSegments.scala
Filtered human__pan.AF0__18.snappy.segments.parquet to 1796551 segments in 3710054247 ns or 3.710 seconds
filterSegments.scala:
import org.slf4j.LoggerFactory
val logger = LoggerFactory.getLogger("filterSegments")

// raise the log level so the timing line below shows up in spark-shell output
import org.apache.log4j.{ Level, Logger }
Logger.getLogger("filterSegments").setLevel(Level.INFO)

// the input Parquet path is passed via the INPUT environment variable
val inputPath = Option(System.getenv("INPUT"))
if (inputPath.isEmpty) {
  logger.error("INPUT environment variable is required")
  System.exit(1)
}

// spark-shell imports spark.implicits._ automatically; repeated here to make
// the Dataset conversion below explicit
import spark.implicits._
import com.github.heuermh.adam.gfa.sql.gfa1.Segment

// time the filter: count segments with sequence longer than 24 bp
val before = System.nanoTime
val count = spark
  .read
  .parquet(inputPath.get)
  .as[Segment]
  .where("length(sequence) > 24")
  .count
val elapsed = System.nanoTime - before

logger.info("Filtered %s to %d segments in %d ns or %.3f seconds".format(inputPath.get, count, elapsed, elapsed / 1.0e9))
System.exit(0)
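Beyond a simple count, the segments table supports richer queries with no extra parsing; a minimal sketch of a segment sequence length histogram in spark-shell, bucketed by powers of ten (using only the sequence column exercised above):

// Bucket segment sequence lengths by order of magnitude and count each bucket.
// Segments without a sequence produce a null bucket.
spark.read
  .parquet("human__pan.AF0__18.gzip.segments.parquet")
  .selectExpr("length(sequence) as len")
  .selectExpr("cast(pow(10, floor(log10(len))) as long) as bucket")
  .groupBy("bucket")
  .count
  .orderBy("bucket")
  .show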