
@vectorijk
Last active August 14, 2018 06:32
GSoC 2018 wrap-up - TPC-H on Apache Beam SQL

GSoC 2018 wrap-up

Student: Kai Jiang (jiangkai@gmail.com)

Mentor: Kenneth Knowles

Progress

https://docs.google.com/spreadsheets/d/12iO0vnPWJC-SFp1dBXd_iClf2ERjewl6IRAC2Z0AzdY/edit#gid=0

  • Lists TPC-H performance results on Spark, Flink, and Dataflow
  • Lists the features that Beam SQL does not yet support

Pull requests opened: https://github.com/apache/beam/pulls/vectorijk

Branch with the TPC-H batch test suite for Beam SQL: https://github.com/vectorijk/beam/tree/tpch

Future Work

Benchmark Beam SQL on Spark, on both a standalone cluster and YARN

  • Compare against Spark SQL as a baseline
  • Investigate Beam SQL issues with the Spark runner

Benchmark Beam SQL on Flink, on both a standalone cluster and YARN

CI (Jenkins regression tests)

Conclusion

I would like to thank my mentor, Kenneth, for this opportunity. It was a pleasure to work with the Beam community and the SQL team.
