ryan-williams/fastavro_it.md

## fastavro_it.md

      
    Raw
  

              fastavro_it.md
            
          
    Notes on the integration test, fastavro_it_test

Benchmarks

I set it to write 10MM synthetic records, with fastavro and avro, and then read them back in, each side reading what it wrote, and then verify that the read PCollections are equal (via a CoGroupByKey).
"Write" pipeline: 10MM records

The fastavro side is 3.5x faster:

"Read" pipeline: 10MM records

The fastavro side is 6.3x faster:

Here's the total resource metrics:

The total size of the files written to disk is 87.9MiB, but appears to be ≈1.38GiB uncompressed; here's the CoGroupByKey's stats:

Issues

Glob-ReadAllFromAvro works in DataflowRunner, not for me locally via DirectRunner

For some reason when I run it locally (with DirectRunner), the glob I'm using in ReadAllFromAvro is picking up 0 files, so the CoGroupByKey and check logic are effectively no-op'ing; I haven't figured out why that is yet.
However, running with DataflowRunner against GCS behaves as expected. I tried adding a time.sleep in between the two pipelines to see whether the filesystem (I'm on macOS) is racing itself, but that didn't fix it, so I'm thinking maybe it has to do with the handling of *-globs on different filesystems?
Doesn't clean up temporary files

I still need to add this; afaict I don't get this for free just by being an integration test.