A partial implementation of a way to write a batch Table to Parquet at a path in Flink, tested on 1.10.
skritch commented Aug 19, 2020

Wrote this because we were on Flink 1.10 and wanted to infer the datatype of the output entirely from the existing Table, without a known AvroSchema. Flink 1.11's Parquet format doesn't seem to allow this, and while I'll admit it might be a bad idea, it was the easiest way to handle a schema derived from an external source without having to keep the output schema in sync with that source.
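Roughly, the schema inference can be sketched with flink-avro's AvroSchemaConverter; this is not the gist's actual code, and avroSchemaOf is a made-up helper name:

  import org.apache.avro.Schema
  import org.apache.flink.api.common.typeinfo.TypeInformation
  import org.apache.flink.formats.avro.typeutils.AvroSchemaConverter
  import org.apache.flink.table.api.Table
  import org.apache.flink.types.Row

  // Derive the Avro schema from whatever the Table currently carries,
  // so the output schema tracks the upstream source automatically.
  def avroSchemaOf(table: Table): Schema = {
    val rowType: TypeInformation[Row] = table.getSchema.toRowType
    AvroSchemaConverter.convertToSchema(rowType)
  }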

There might be a way to wrap all this into one of Flink's interfaces; no clue.

Some SQL datatypes may not be handled in this example, and given the use of GenericRecord it might be slow.
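The Row-to-GenericRecord conversion is where datatypes can fall through the cracks. A minimal version (again a sketch, with rowToRecord a made-up name) just copies fields by position and assumes each value is already in a representation Avro accepts:

  import org.apache.avro.Schema
  import org.apache.avro.generic.{GenericData, GenericRecord}
  import org.apache.flink.types.Row

  // Copy Row fields into a GenericRecord positionally. Real code needs
  // per-type handling (timestamps, decimals, nested rows, ...); this only
  // works for fields Avro accepts as-is.
  def rowToRecord(schema: Schema, row: Row): GenericRecord = {
    val record = new GenericData.Record(schema)
    var i = 0
    while (i < row.getArity) {
      record.put(i, row.getField(i))
      i += 1
    }
    record
  }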

Requires roughly these dependencies (sbt):

  "org.apache.flink" %% "flink-parquet" % flinkVersion,
  "org.apache.flink" % "flink-avro" % flinkVersion,
  "org.apache.parquet" % "parquet-avro" % "1.11.0",
  "org.apache.flink" %% "flink-hadoop-compatibility" % flinkVersion,
  "org.apache.flink" % "flink-shaded-hadoop-2-uber" % "2.8.3-10.0",
