@vectorijk
Last active August 14, 2018 06:32
GSoC 2018 wrap-up - TPC-H on Apache Beam SQL

GSoC 2018 wrap-up

Student: Kai Jiang (jiangkai@gmail.com)

Mentor: Kenneth Knowles

Progress

https://docs.google.com/spreadsheets/d/12iO0vnPWJC-SFp1dBXd_iClf2ERjewl6IRAC2Z0AzdY/edit#gid=0

  • List TPC-H performance results on Spark, Flink, and Dataflow
  • List the features that Beam SQL does not yet support

PRs opened: https://github.com/apache/beam/pulls/vectorijk

Branch with the TPC-H batch test suite for Beam SQL: https://github.com/vectorijk/beam/tree/tpch

Future Work

Benchmark Beam SQL on Spark, both on a standalone cluster and on YARN

  • Compare against Spark SQL as the baseline
  • Beam SQL issues with the Spark runner

Benchmark Beam SQL on Flink, both on a standalone cluster and on YARN

CI (Jenkins regression tests)

Conclusion

I would like to thank my mentor Kenneth for this opportunity. It was a pleasure to work with the Beam community and the SQL team.
