This is the second part of the article series about how to use classes from the Spark repository for unit testing. The Spark SQL package has four sub-projects, each of which has its own test classes:
In the context of testing your own Spark jobs, we will discuss only three of them (core, catalyst, hive).
https://gist.github.com/fc7f476de6a531819dcacff2e30406e4
Suites extending SharedSparkSession share resources (e.g. the SparkSession) in their tests. That trait initializes the Spark session in its beforeAll() implementation before the automatic thread snapshot is performed, so the audit code could fail to report threads leaked by that shared session. The behavior is overridden here to take the snapshot before the Spark session is initialized. Extends SQLTestUtils and SharedSparkSessionBase.
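A minimal sketch of a suite built on this trait, assuming a project that depends on Spark's test artifacts (the suite name and the data are hypothetical):

```scala
import org.apache.spark.SparkFunSuite
import org.apache.spark.sql.functions.sum
import org.apache.spark.sql.test.SharedSparkSession

// Hypothetical suite: SharedSparkSession provides a `spark` session
// that is shared across all tests in this class and cleaned up for us.
class SharedSessionExampleSuite extends SparkFunSuite with SharedSparkSession {
  test("shared session is usable") {
    val total = spark.range(1, 4).toDF("id")
      .agg(sum("id"))
      .collect().head.getLong(0)
    assert(total === 6) // 1 + 2 + 3
  }
}
```

Because the session is shared, tests in such a suite should avoid mutating global state (e.g. registering temp views without cleaning them up).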
Helper trait for SQL test suites where all tests share a single TestSparkSession.
A special SparkSession prepared for testing.
Helper trait that should be extended by all SQL test suites within the Spark code base. This allows subclasses to plug in a custom SQLContext. It comes with test data prepared in advance, as well as all the implicit conversions used extensively by DataFrames. To use the implicit methods, import testImplicits._ instead of going through the SQLContext. Subclasses should not create SQLContexts in the test suite constructor, which is prone to leaving multiple overlapping SparkContexts in the same JVM. Extends SparkFunSuite, SQLTestUtilsBase and PlanTestBase.
Helper trait that can be extended by all external SQL test suites. This allows subclasses to plug in a custom SQLContext. To use the implicit methods, import testImplicits._ instead of going through the SQLContext. Subclasses should not create SQLContexts in the test suite constructor, which is prone to leaving multiple overlapping SparkContexts in the same JVM. Extends SQLTestData and PlanTestBase. Contains:
https://gist.github.com/f4fd95ecb18fce528670600dad188226
A helper object for importing SQL implicits. Note that the alternative of importing spark.implicits._ is not possible here, because we create the SQLContext immediately before the first test is run, but the implicits import is needed in the constructor.
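In practice this looks like the following (a minimal sketch; the suite name and data are hypothetical):

```scala
import org.apache.spark.sql.QueryTest
import org.apache.spark.sql.test.SharedSparkSession

class ImplicitsExampleSuite extends QueryTest with SharedSparkSession {
  // testImplicits delegates to the lazily created session, so it is
  // safe to import in the suite body, unlike spark.implicits._
  import testImplicits._

  test("toDF via testImplicits") {
    val df = Seq(("a", 1), ("b", 2)).toDF("key", "value")
    assert(df.count() === 2)
  }
}
```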
A collection of sample data used in SQL tests.
A great framework for checking results inside the SQL package. It contains a large number of DataFrame and Dataset assertions and checks.
https://gist.github.com/1d6cbf9dd7aed3a9a0dcea8ef79448e8
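A sketch of how its checkAnswer helper is typically used (the suite name and data are hypothetical; checkAnswer collects the result and compares it with the expected rows regardless of row order):

```scala
import org.apache.spark.sql.{QueryTest, Row}
import org.apache.spark.sql.test.SharedSparkSession

class FilterCheckSuite extends QueryTest with SharedSparkSession {
  import testImplicits._

  test("filter keeps only even values") {
    val df = Seq(1, 2, 3, 4).toDF("n").filter($"n" % 2 === 0)
    // Fails with a readable diff of the two row sets if they differ.
    checkAnswer(df, Seq(Row(2), Row(4)))
  }
}
```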
Manages a local spark SparkSession variable, correctly stopping it after each test.
https://gist.github.com/ec410c40ac41d82bd448ad13024c9db7
The base for statistics test cases that we want to include in both the hive module (for verifying behavior when using the Hive external catalog) as well as in the sql/core module.
An object with useful methods for columnar-based cases. Example: https://gist.github.com/a801afcbc73fc3bdfd20749c4a7cd4f4
A helper trait that provides convenient facilities for file-based data source testing. Specifically, it is used for Parquet and Orc testing. It can be used to write tests that are shared between Parquet and Orc.
Used for testing with data in the ORC file format.
A helper trait that provides convenient facilities for Parquet testing.
NOTE: Considering that classes Tuple1 ... Tuple22 all extend Product, it is more convenient to use tuples rather than special case classes when writing test cases/suites. In particular, Tuple1.apply can be used to easily wrap a single type/value.
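For example, instead of defining a one-field case class just to hold single-column test data, a Tuple1 works (a sketch; the case class name is hypothetical):

```scala
// Every TupleN extends Product, which Spark's encoders rely on,
// so tuples can serve as row types for test data.
case class SingleColumn(value: Int)          // the verbose alternative

val viaCaseClass = Seq(SingleColumn(1), SingleColumn(2))
val viaTuple1    = Seq(Tuple1(1), Tuple1(2)) // same shape, no extra class

// In a suite mixing in testImplicits, both convert to a DataFrame:
//   viaTuple1.toDF("value")
```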
Helper class for testing Parquet compatibility.
Can be used to check how Spark executes SQL queries.
Base class for writing tests for individual physical operators. For an example of how this class's test helper methods can be used, see SortSuite. Extends SparkFunSuite. The companion object contains helper methods for writing tests of individual physical operators.
Checks whether the generated code has an appropriate size; otherwise, JIT optimization might not work.
This object aims to integrate various UDF test cases so that Scala UDFs, Python UDFs and Scalar Pandas UDFs can be tested in SBT & Maven tests. The available UDFs are special: each defines a UDF wrapped by casts. The input column is cast to string, the UDF returns the string as is, and the output column is cast back to the input column's type. In this way, the UDF is virtually a no-op. Note that, due to this implementation limitation, complex types such as map, array and struct types do not work with these UDFs, because they cannot stay the same after the cast round trip. To register a Scala UDF in SQL: https://gist.github.com/c62bbcec2da1517ec6bce313d8d3c51d
To register Python UDF in SQL: https://gist.github.com/895f7d1ebe75c0b64c3dcdc46299443c
To register Scalar Pandas UDF in SQL: https://gist.github.com/8bfd882d405dde89ca367abf81732fbc
To use it in Scala API and SQL: https://gist.github.com/9195efdee2daede5009d2c889109b0b7
A framework for implementing tests for streaming queries and sources. A test consists of a set of steps (expressed as a StreamAction) that are executed in order, blocking as necessary to let the stream catch up. For example, the following adds some data to a stream, blocking until it can verify that the correct values are eventually produced.
https://gist.github.com/7f8216daf83c33f3bcee7894c51f29ec
Note that while we do sleep to allow the other thread to progress without spinning, StreamAction checks should not depend on the amount of time spent sleeping. Instead they should check the actual progress of the stream before verifying the required test condition. Currently it is assumed that all streaming queries will eventually complete in 10 seconds to avoid hanging forever in the case of failures. However, individual suites can change this by overriding streamingTimeout. Extends QueryTest with SharedSparkSession.
Extends StreamTest. In addition, it stores states.
Used in streaming tests to allow checking whether the stream is waiting on the clock at expected times.
Extends SparkFunSuite and PlanTestBase.
There is no other code, just a mixin of the two traits. When you don't need SparkFunSuite, PlanTestBase alone can be used.
Base class for plan tests. Extends PredicateHelper with SQLHelper.
https://gist.github.com/66706216608457fc7ecda05332086e2c
Extension of PlanTest with some useful methods. It also creates two plan analyzers: a case-sensitive one and a case-insensitive one.
A few helper functions for expression evaluation testing. Mix in this trait to use them. Used mostly for Catalyst development.
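A sketch of its typical use in a Catalyst expression suite (the suite name is hypothetical; checkEvaluation comes from the trait):

```scala
import org.apache.spark.SparkFunSuite
import org.apache.spark.sql.catalyst.expressions.{Add, ExpressionEvalHelper, Literal}
import org.apache.spark.sql.types.IntegerType

class AddEvalSuite extends SparkFunSuite with ExpressionEvalHelper {
  test("Add evaluates correctly") {
    // checkEvaluation runs the expression through both the interpreted
    // and the code-generated evaluation paths and compares the results.
    checkEvaluation(Add(Literal(1), Literal(2)), 3)
    checkEvaluation(Add(Literal.create(null, IntegerType), Literal(2)), null)
  }
}
```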
Base class for Spark Hive unit tests. Extends SparkFunSuite.
A locally running test instance of Spark's Hive execution engine. Data from testTables will be automatically loaded whenever a query is run over those tables. Calling reset will delete all tables and other state in the database, leaving the database in a "clean" state. TestHive is a singleton object version of this class, because instantiating multiple copies of the Hive metastore seems to lead to weird non-deterministic failures. Therefore, the execution of test cases that rely on TestHive must be serialized. Extends SQLContext.
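A minimal sketch of working with the singleton (src is one of the predefined test tables in the Spark code base; treat the specifics as illustrative):

```scala
import org.apache.spark.sql.hive.test.TestHive

// TestHive lazily starts a local Hive-backed session; the test
// tables are loaded on first use.
val firstRow = TestHive.sql("SELECT key, value FROM src LIMIT 1").collect()

// reset() drops all tables and other state, returning the metastore
// to a clean slate between suites.
TestHive.reset()
```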
Builder that creates a HiveClient for test purposes.
https://gist.github.com/363e450309d982d621ad2dcbaaff8819
Allows the creation of tests that execute the same query against both Hive and Catalyst, comparing the results. The "golden" results from Hive are cached in and retrieved from both the classpath and answerCache to speed up testing. See the documentation of the public vals in this class for information on how test execution can be configured using system properties. Extends SparkFunSuite.
A framework for running the query tests that are listed as a set of text files. Test suites that derive from this class must provide a map of testCaseName to testCaseFiles that should be included. Additionally, there is support for whitelisting and blacklisting tests as development progresses. Extends HiveComparisonTest.