Notes on the integration test:
I set it to write 10MM synthetic records, once with fastavro and once with avro, then read them back in (each side reading what it wrote), and verify that the read
PCollections are equal (via a
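The equality check is order-insensitive: the two PCollections should hold the same records regardless of order. As a plain-Python stand-in for that comparison (the helper and sample records below are illustrative, not the actual test code):

```python
from collections import Counter

def records_equal(left, right):
    """Order-insensitive comparison of two collections of records.

    Records are dicts, so we make them hashable by freezing each
    one into a tuple of sorted (key, value) pairs, then compare
    multisets so duplicates are counted correctly.
    """
    freeze = lambda rec: tuple(sorted(rec.items()))
    return Counter(map(freeze, left)) == Counter(map(freeze, right))

fastavro_side = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]
avro_side = [{"id": 2, "name": "b"}, {"id": 1, "name": "a"}]
print(records_equal(fastavro_side, avro_side))  # → True
```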
"Write" pipeline: 10MM records
The fastavro side is 3.5x faster:
"Read" pipeline: 10MM records
The fastavro side is 6.3x faster:
Here are the total resource metrics:
The total size of the files written to disk is 87.9MiB, but the data appears to be ≈1.38GiB uncompressed; here's the
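Back-of-the-envelope, those two sizes imply roughly a 16x compression ratio (my arithmetic from the figures above, not a measured number):

```python
# Rough compression ratio implied by the sizes above.
compressed_mib = 87.9
uncompressed_mib = 1.38 * 1024  # 1.38 GiB expressed in MiB

ratio = uncompressed_mib / compressed_mib
print(f"{ratio:.1f}x")  # → 16.1x
```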
ReadAllFromAvro works in
DataflowRunner, but not for me locally via DirectRunner
For some reason, when I run it locally (with
DirectRunner), the glob I'm using in
ReadAllFromAvro picks up 0 files, so the
check logic is effectively a no-op; I haven't figured out why yet.
However, running with
DataflowRunner against GCS behaves as expected. I tried adding a
time.sleep between the two pipelines to see whether the filesystem (I'm on macOS) was racing itself, but that didn't fix it, so I suspect it has to do with how
*-globs are handled on different filesystems?
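One way to narrow this down is to check whether the glob itself matches anything on the local filesystem, independent of Beam. A quick sketch (the shard-style filenames are made up for illustration):

```python
import glob
import os
import tempfile

# Write a few files that mimic sharded pipeline output, then
# confirm a *-glob actually matches them on the local filesystem.
with tempfile.TemporaryDirectory() as tmp:
    for i in range(3):
        path = os.path.join(tmp, f"records-0000{i}-of-00003.avro")
        with open(path, "wb") as f:
            f.write(b"placeholder")

    pattern = os.path.join(tmp, "records-*")
    matches = glob.glob(pattern)
    print(len(matches))  # → 3
```

If plain glob.glob matches here but Beam's file matching does not, the problem is likely in how the pattern is being interpreted by the filesystem abstraction rather than in the files themselves.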
Doesn't clean up temporary files
I still need to add this; AFAICT I don't get cleanup for free just by being an integration test.
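One way to get that cleanup would be to register it in setUp, so the scratch directory is removed even when the test body fails. A sketch (the class, method, and attribute names are hypothetical, not the actual test):

```python
import os
import shutil
import tempfile
import unittest

class AvroIOIntegrationTest(unittest.TestCase):
    """Hypothetical shape of the integration test, with cleanup added."""

    def setUp(self):
        # Each run gets its own scratch directory; addCleanup registers
        # its removal to run after the test, pass or fail.
        self.output_dir = tempfile.mkdtemp(prefix="avro_it_")
        self.addCleanup(shutil.rmtree, self.output_dir, ignore_errors=True)

    def test_write_then_read(self):
        # The write/read pipelines would target self.output_dir here.
        self.assertTrue(os.path.isdir(self.output_dir))

# Run the test case once to show the cleanup fires.
result = unittest.TestResult()
test = AvroIOIntegrationTest("test_write_then_read")
test.run(result)
print(result.wasSuccessful(), os.path.exists(test.output_dir))  # → True False
```

addCleanup is safer than tearDown here because registered cleanups run even if setUp itself partially fails, and ignore_errors=True keeps a missing directory from masking a real test failure.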