I set it to write 10MM synthetic records, with fastavro and avro, and then read them back in, each side reading what it wrote, and then verify that the read PCollection
s are equal (via a CoGroupByKey
).
The fastavro side is 3.5x faster:
The fastavro side is 6.3x faster:
Here's the total resource metrics:
The total size of the files written to disk is 87.9MiB, but appears to be ≈1.38GiB uncompressed; here's the CoGroupByKey
's stats:
For some reason when I run it locally (with DirectRunner
), the glob I'm using in ReadAllFromAvro
is picking up 0 files, so the CoGroupByKey
and check
logic are effectively no-op'ing; I haven't figured out why that is yet.
However, running with DataflowRunner
against GCS behaves as expected. I tried adding a time.sleep
in between the two pipelines to see whether the filesystem (I'm on macOS) is racing itself, but that didn't fix it, so I'm thinking maybe it has to do with the handling of *
-globs on different filesystems?
I still need to add this; afaict I don't get this for free just by being an integration test.