@SingularBunny
Created November 20, 2019 18:24

Apache Spark Unit Testing Part 2 - Spark SQL

This is the second part of an article series on using the test classes from the Spark repository for unit testing. The Spark SQL package has four subprojects, each of which has its own test classes.

For testing your own Spark jobs, we will discuss only three of them: core, catalyst and hive.

Dependencies

https://gist.github.com/fc7f476de6a531819dcacff2e30406e4

1.1 Spark SQL Execution Unit Testing

Suites extending SharedSparkSession share resources (e.g. the SparkSession) across their tests. The trait initializes the Spark session in its beforeAll() implementation before the automatic thread snapshot is performed, so the audit code could fail to report threads leaked by that shared session. The behavior is overridden here so that the snapshot is taken before the Spark session is initialized. Extends SQLTestUtils and SharedSparkSessionBase.

SharedSparkSessionBase is a helper trait for SQL test suites in which all tests share a single TestSparkSession.

TestSparkSession is a special SparkSession prepared for testing.

SQLTestUtils is a helper trait that should be extended by all SQL test suites within the Spark code base. It allows subclasses to plug in a custom SQLContext. It comes with test data prepared in advance, as well as all the implicit conversions used extensively by DataFrames. To use the implicit methods, import testImplicits._ instead of going through the SQLContext. Subclasses should not create SQLContexts in the test suite constructor, which is prone to leaving multiple overlapping SparkContexts in the same JVM. Extends SparkFunSuite, SQLTestUtilsBase and PlanTestBase.

SQLTestUtilsBase is a helper trait that can be extended by all external SQL test suites. It allows subclasses to plug in a custom SQLContext. To use the implicit methods, import testImplicits._ instead of going through the SQLContext. Subclasses should not create SQLContexts in the test suite constructor, which is prone to leaving multiple overlapping SparkContexts in the same JVM. Extends SQLTestData and PlanTestBase. Contains: https://gist.github.com/f4fd95ecb18fce528670600dad188226

testImplicits is a helper object for importing SQL implicits. Note that the alternative of importing spark.implicits._ is not possible here, because the SQLContext is created immediately before the first test is run, while the implicits import is needed in the constructor.

SQLTestData is a collection of sample data used in SQL tests.

QueryTest is a great framework for checking results inside the SQL package. It contains a large number of DataFrame and Dataset assertions and checks.

Example

https://gist.github.com/1d6cbf9dd7aed3a9a0dcea8ef79448e8
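As a complementary sketch of how these pieces fit together, the suite below mixes SharedSparkSession into QueryTest, imports testImplicits._ and verifies a DataFrame with checkAnswer. It assumes a Spark 3.x build with the spark-sql, spark-catalyst and spark-core test-jars on the test classpath. The suite name WordLengthSuite and its data are hypothetical and are not taken from the Spark code base.

```scala
import org.apache.spark.sql.{QueryTest, Row}
import org.apache.spark.sql.test.SharedSparkSession

// Hypothetical suite: all tests share the session created by SharedSparkSession.
class WordLengthSuite extends QueryTest with SharedSparkSession {
  // Must be testImplicits._ here: the shared session does not exist yet
  // when the suite constructor runs, so spark.implicits._ cannot be used.
  import testImplicits._

  test("length of each word") {
    val words = Seq("spark", "sql").toDF("word")

    // checkAnswer compares the collected rows with the expected ones.
    checkAnswer(words.selectExpr("length(word)"), Seq(Row(5), Row(3)))
  }
}
```

Nothing in the suite creates or stops a session explicitly, which is the point of sharing it through the trait.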

Manages a local spark SparkSession variable, correctly stopping it after each test.

Example

https://gist.github.com/ec410c40ac41d82bd448ad13024c9db7

Metrics Testing

The base for statistics test cases that we want to include both in the hive module (for verifying behavior when using the Hive external catalog) and in the sql/core module.

An object with useful methods for columnar-based test cases. Example: https://gist.github.com/a801afcbc73fc3bdfd20749c4a7cd4f4

File Based Tests

A helper trait that provides convenient facilities for file-based data source testing. Specifically, it is used for Parquet and Orc testing. It can be used to write tests that are shared between Parquet and Orc.

Orc Testing

Used for testing with data in the Orc file format.

Parquet Testing

A helper trait that provides convenient facilities for Parquet testing. NOTE: Considering classes Tuple1 ... Tuple22 all extend Product, it would be more convenient to use tuples rather than special case classes when writing test cases/suites. Especially, Tuple1.apply can be used to easily wrap a single type/value.
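The note about tuples can be illustrated with plain Scala, without any Spark dependency. This is a minimal sketch; the object name Tuple1Example is hypothetical.

```scala
object Tuple1Example {
  // Tuple1.apply lifts each plain value into a one-field Product,
  // so no dedicated single-field case class is needed.
  val rows = (1 to 3).map(Tuple1.apply)

  // Tuples with more fields are Products as well.
  val pairs = (1 to 3).map(i => (i, i.toString))

  def main(args: Array[String]): Unit = {
    rows.foreach(r => println(s"${r.productArity} field: ${r._1}"))
    pairs.foreach(p => println(s"${p.productArity} fields: ${p._1}, ${p._2}"))
  }
}
```

Sequences of Product values like these are the kind of test data the note has in mind.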

Helper class for testing Parquet compatibility.

The following can be used to check how Spark executes SQL queries.

Base class for writing tests for individual physical operators. For an example of how this class's test helper methods can be used, see SortSuite. Extends SparkFunSuite. Companion object contains helper methods for writing tests of individual physical operators.

Checks whether the generated queue has an appropriate size; otherwise JIT optimization might not work.

This object aims to integrate various UDF test cases so that Scala UDFs, Python UDFs and Scalar Pandas UDFs can be tested in SBT and Maven tests. The available UDFs are special: each one is a UDF wrapped by casts. The input column is cast to string, the UDF returns the strings as they are, and the output column is then cast back to the type of the input column. In this way, the UDF is virtually a no-op. Note that, because of this implementation limitation, complex types such as map, array and struct do not work with these UDFs, because they cannot be the same after the cast round trip. To register a Scala UDF in SQL: https://gist.github.com/c62bbcec2da1517ec6bce313d8d3c51d

To register a Python UDF in SQL: https://gist.github.com/895f7d1ebe75c0b64c3dcdc46299443c

To register a Scalar Pandas UDF in SQL: https://gist.github.com/8bfd882d405dde89ca367abf81732fbc

To use it in the Scala API and SQL: https://gist.github.com/9195efdee2daede5009d2c889109b0b7
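The cast round trip described above can be sketched with the public Spark SQL API alone. This only illustrates the idea and is not the code of the test object itself. The names CastRoundTripUdf and noOp are hypothetical, and a local Spark installation on the classpath is assumed.

```scala
import org.apache.spark.sql.{Column, SparkSession}
import org.apache.spark.sql.functions.{col, udf}
import org.apache.spark.sql.types.{DataType, StringType}

object CastRoundTripUdf {
  // Identity UDF over strings: returns its input unchanged.
  private val identityString = udf((s: String) => s)

  // Cast to string, pass through the identity UDF, cast back to the original type.
  def noOp(input: Column, originalType: DataType): Column =
    identityString(input.cast(StringType)).cast(originalType)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("cast-round-trip")
      .getOrCreate()
    try {
      import spark.implicits._
      val df = Seq(1, 2, 3).toDF("id")
      val idType = df.schema("id").dataType

      val result = df.select(noOp(col("id"), idType).as("id"))

      // For a simple type, both the type and the values survive the round trip.
      assert(result.schema("id").dataType == idType)
      assert(result.as[Int].collect().sorted.sameElements(Array(1, 2, 3)))
    } finally {
      spark.stop()
    }
  }
}
```

The sketch uses an integer column because, as noted above, complex types do not survive the trip through a string.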

Streaming DataFrames and streaming Datasets Testing

A framework for implementing tests for streaming queries and sources. A test consists of a set of steps (expressed as a StreamAction) that are executed in order, blocking as necessary to let the stream catch up. For example, the following adds some data to a stream, blocking until it can verify that the correct values are eventually produced. https://gist.github.com/7f8216daf83c33f3bcee7894c51f29ec

Note that while we do sleep to allow the other thread to progress without spinning, StreamAction checks should not depend on the amount of time spent sleeping. Instead they should check the actual progress of the stream before verifying the required test condition. Currently it is assumed that all streaming queries will eventually complete in 10 seconds to avoid hanging forever in the case of failures. However, individual suites can change this by overriding streamingTimeout. Extends QueryTest with SharedSparkSession.
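A minimal sketch of a suite that overrides streamingTimeout is shown below. It assumes Spark 3.x with the Spark SQL test-jars on the test classpath. The suite name IncrementStreamSuite and the chosen timeout value are hypothetical.

```scala
import org.apache.spark.sql.execution.streaming.MemoryStream
import org.apache.spark.sql.streaming.StreamTest
import org.scalatest.time.SpanSugar._

// Hypothetical suite built on StreamTest.
class IncrementStreamSuite extends StreamTest {
  import testImplicits._

  // Give each step more time than the framework default before it is reported as failed.
  override val streamingTimeout = 120.seconds

  test("increments every input value") {
    val input = MemoryStream[Int]
    val incremented = input.toDS().map(_ + 1)

    // Each step is a StreamAction. The checks wait for the stream to catch up
    // instead of relying on a fixed amount of sleeping.
    testStream(incremented)(
      AddData(input, 1, 2, 3),
      CheckAnswer(2, 3, 4),
      AddData(input, 10),
      CheckAnswer(2, 3, 4, 11))
  }
}
```

The second CheckAnswer lists every row produced so far, not only the rows of the latest batch.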

Extends StreamTest. In addition, it stores states.

Used for streaming tests; it allows checking whether the stream is waiting on the clock at the expected times.

1.2 Catalyst Unit Testing

PlanTest extends SparkFunSuite and PlanTestBase. It contains no other code and simply mixes in the two traits. When you don't need SparkFunSuite, PlanTestBase can be used on its own.

PlanTestBase is the base class for plan tests. Extends PredicateHelper with SQLHelper.

Example

https://gist.github.com/66706216608457fc7ecda05332086e2c

An extension of PlanTest with some useful methods. It also creates two plan analyzers: one case-sensitive and one case-insensitive.

A few helper functions for expression evaluation testing. Mix in this trait to use them. It is used mainly for Catalyst development.

1.3 Spark Hive Unit Testing

Base class for Spark Hive unit tests. Extends SparkFunSuite.

A locally running test instance of Spark's Hive execution engine. Data from testTables will be automatically loaded whenever a query is run over those tables. Calling reset will delete all tables and other state in the database, leaving the database in a "clean" state. TestHive is a singleton object version of this class, because instantiating multiple copies of the Hive metastore seems to lead to weird non-deterministic failures. Therefore, the execution of test cases that rely on TestHive must be serialized. Extends SQLContext.

A builder that creates a HiveClient for test purposes.

Example

https://gist.github.com/363e450309d982d621ad2dcbaaff8819

Allows the creation of tests that execute the same query against both Hive and Catalyst and compare the results. The "golden" results from Hive are cached in, and retrieved from, both the classpath and answerCache to speed up testing. See the documentation of the public vals in this class for information on how test execution can be configured using system properties. Extends SparkFunSuite.

A framework for running the query tests that are listed as a set of text files. Test Suites that derive from this class must provide a map of testCaseName to testCaseFiles that should be included. Additionally, there is support for whitelisting and blacklisting tests as development progresses. Extends HiveComparisonTest.
