Skip to content

Instantly share code, notes, and snippets.

@ryan-williams
Created June 26, 2018 03:29
Show Gist options
  • Save ryan-williams/ede5ae61605e7ba6aa655071858ef52b to your computer and use it in GitHub Desktop.
Save ryan-williams/ede5ae61605e7ba6aa655071858ef52b to your computer and use it in GitHub Desktop.
Benchmarks and info about WIP Beam fastavro integration test

Notes on the integration test, fastavro_it_test

Benchmarks

I set it to write 10MM synthetic records, with fastavro and avro, and then read them back in, each side reading what it wrote, and then verify that the read PCollections are equal (via a CoGroupByKey).

"Write" pipeline: 10MM records

The fastavro side is 3.5x faster:

"Read" pipeline: 10MM records

The fastavro side is 6.3x faster:

Here's the total resource metrics:

The total size of the files written to disk is 87.9MiB, but appears to be ≈1.38GiB uncompressed; here's the CoGroupByKey's stats:

Issues

Glob-ReadAllFromAvro works in DataflowRunner, not for me locally via DirectRunner

For some reason when I run it locally (with DirectRunner), the glob I'm using in ReadAllFromAvro is picking up 0 files, so the CoGroupByKey and check logic are effectively no-op'ing; I haven't figured out why that is yet.

However, running with DataflowRunner against GCS behaves as expected. I tried adding a time.sleep in between the two pipelines to see whether the filesystem (I'm on macOS) is racing itself, but that didn't fix it, so I'm thinking maybe it has to do with the handling of *-globs on different filesystems?

Doesn't clean up temporary files

I still need to add this; afaict I don't get this for free just by being an integration test.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment