SingularBunny/blog.md Secret

Apache Spark Unit Testing Part 2 - Spark SQL

This is the second part of an article series on using the test classes from the Spark repository for unit testing.
The Spark SQL package has four subprojects, each with its own test classes:

sql-core
sql-catalyst
sql-hive-thriftserver
sql-hive

For testing your own Spark jobs, we will discuss only three of them (core, catalyst, hive).
Dependencies

https://gist.github.com/fc7f476de6a531819dcacff2e30406e4
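
For readers building with sbt, the following is a minimal sketch of what such a dependency block could look like. The version numbers and the choice of sbt are assumptions, not taken from the linked gist; the key point is the `tests` classifier, which selects the published test-jars containing the helper classes described below. Depending on the Spark version, further artifacts may be needed.

```scala
// build.sbt (sketch). Versions are assumptions; match them to the Spark
// version your jobs actually run on.
val sparkVersion = "3.5.1"

libraryDependencies ++= Seq(
  // Regular Spark SQL and Hive artifacts.
  "org.apache.spark" %% "spark-sql"  % sparkVersion % Provided,
  "org.apache.spark" %% "spark-hive" % sparkVersion % Test,

  // Test-jars: the "tests" classifier carries SparkFunSuite, QueryTest,
  // SharedSparkSession, PlanTest, TestHiveSingleton and friends.
  "org.apache.spark" %% "spark-core"     % sparkVersion % Test classifier "tests",
  "org.apache.spark" %% "spark-catalyst" % sparkVersion % Test classifier "tests",
  "org.apache.spark" %% "spark-sql"      % sparkVersion % Test classifier "tests",
  "org.apache.spark" %% "spark-hive"     % sparkVersion % Test classifier "tests",

  // The Spark test classes are built on ScalaTest (version is an assumption).
  "org.scalatest" %% "scalatest" % "3.2.17" % Test
)
```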
1.1 Spark SQL Execution Unit Testing

SharedSparkSession

Suites extending SharedSparkSession share resources (e.g. the SparkSession) across their tests. The shared session is initialized in beforeAll(), before the automatic thread snapshot would normally be taken, so the thread-audit code could fail to report threads leaked by that session. SharedSparkSession therefore overrides the behavior so that the snapshot is taken before the session is initialized. Extends SQLTestUtils and SharedSparkSessionBase.
SharedSparkSessionBase

Helper trait for SQL test suites where all tests share a single TestSparkSession.
TestSparkSession

A special SparkSession prepared for testing.
SQLTestUtils

Helper trait that should be extended by all SQL test suites within the Spark code base. It allows subclasses to plug in a custom SQLContext, and it comes with test data prepared in advance as well as all the implicit conversions used extensively by DataFrames. To use the implicit methods, import testImplicits._ instead of importing them through the SQLContext. Subclasses should not create SQLContexts in the test suite constructor, because this is prone to leaving multiple overlapping SparkContexts in the same JVM. Extends SparkFunSuite, SQLTestUtilsBase and PlanTestBase.
SQLTestUtilsBase

Helper trait that can be extended by all external SQL test suites. It allows subclasses to plug in a custom SQLContext. To use the implicit methods, import testImplicits._ instead of importing them through the SQLContext. Subclasses should not create SQLContexts in the test suite constructor, because this is prone to leaving multiple overlapping SparkContexts in the same JVM. Extends SQLTestData and PlanTestBase. Contains:
https://gist.github.com/f4fd95ecb18fce528670600dad188226
A helper object for importing SQL implicits. Note that the alternative of importing spark.implicits._ is not possible here. This is because we create the SQLContext immediately before the first test is run, but the implicits import is needed in the constructor.
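
As a rough illustration of how these helpers fit together, here is a minimal sketch of a suite that imports testImplicits._ and uses withTempView and withSQLConf, both of which clean up after their block. The suite name, view name and data are hypothetical, and the sketch assumes a Spark 3.x test-jar on the test classpath.

```scala
import org.apache.spark.SparkFunSuite
import org.apache.spark.sql.test.SharedSparkSession

// Hypothetical suite; the shared session is created lazily by the trait,
// so nothing touches Spark in the constructor.
class AdultsViewSketchSuite extends SparkFunSuite with SharedSparkSession {
  // Brings toDF/toDS and the encoders into scope.
  import testImplicits._

  test("filters adults through a temporary view") {
    // withTempView drops the view when the block finishes.
    withTempView("people") {
      Seq(("Ann", 31), ("Bob", 17))
        .toDF("name", "age")
        .createOrReplaceTempView("people")

      // withSQLConf sets the option for the block and restores it afterwards.
      withSQLConf("spark.sql.shuffle.partitions" -> "1") {
        val adults = spark
          .sql("SELECT name FROM people WHERE age >= 18")
          .as[String]
          .collect()
        assert(adults.toSeq === Seq("Ann"))
      }
    }
  }
}
```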
SQLTestData

A collection of sample data used in SQL tests.
QueryTest

A convenient base class for checking query results in the SQL package. It provides a large number of DataFrame and Dataset assertions and checks.
Example

https://gist.github.com/1d6cbf9dd7aed3a9a0dcea8ef79448e8
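
As an inline sketch of the same idea, the suite below combines QueryTest with SharedSparkSession and uses checkAnswer to compare a DataFrame against expected rows. The suite name and data are hypothetical. checkAnswer does not require the rows in a particular order unless the query itself sorts them.

```scala
import org.apache.spark.sql.{QueryTest, Row}
import org.apache.spark.sql.test.SharedSparkSession

// Hypothetical suite name; assumes the Spark 3.x test-jars are available.
class WordCountSketchSuite extends QueryTest with SharedSparkSession {
  import testImplicits._

  test("counts occurrences per word") {
    val words = Seq("a", "b", "a").toDF("word")

    // count() produces a LongType column, so the expected values are Longs.
    checkAnswer(
      words.groupBy("word").count(),
      Seq(Row("a", 2L), Row("b", 1L)))
  }
}
```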
LocalSparkSession

Manages a local SparkSession variable named spark, stopping it correctly after each test.
Example

https://gist.github.com/ec410c40ac41d82bd448ad13024c9db7
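
A minimal sketch of how this trait might be used: each test builds its own session and assigns it to the spark variable, and the trait stops it afterwards. The suite name, master URL and configuration values are assumptions chosen for illustration.

```scala
import org.apache.spark.SparkFunSuite
import org.apache.spark.sql.{LocalSparkSession, SparkSession}

// Hypothetical suite name.
class LocalSessionSketchSuite extends SparkFunSuite with LocalSparkSession {

  test("each test builds and owns its session") {
    // Assign to the `spark` variable managed by the trait;
    // it is stopped and reset after the test.
    spark = SparkSession.builder()
      .master("local[2]")
      .appName("local-session-sketch")
      .config("spark.ui.enabled", "false")
      .getOrCreate()

    assert(spark.range(5).count() === 5L)
  }
}
```

This fits tests that need a specially configured session per test, at the cost of starting Spark more often than SharedSparkSession does.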
Metrics Testing


SQLMetricsTestUtils


InputOutputMetricsHelper


StatisticsCollectionTestBase

The base for statistics test cases that we want to include in both the hive module (for verifying behavior when using the Hive external catalog) as well as in the sql/core module.
ColumnarTestUtils

An object with helper methods for columnar-based test cases.
Example
https://gist.github.com/a801afcbc73fc3bdfd20749c4a7cd4f4
File Based Tests

FileBasedDataSourceTest

A helper trait that provides convenient facilities for file-based data source testing. Specifically, it is used for Parquet and Orc testing. It can be used to write tests that are shared between Parquet and Orc.
Orc Testing


OrcTest


Used for testing with data in the ORC file format.
Parquet Testing


ParquetTest


A helper trait that provides convenient facilities for Parquet testing.
NOTE: Considering classes Tuple1 ... Tuple22 all extend Product, it would be more convenient to use tuples rather than special case classes when writing test cases/suites. Especially, Tuple1.apply can be used to easily wrap a single type/value.


ParquetCompatibilityTest


Helper class for testing Parquet compatibility.
DataSourceTest

Can be used to check how Spark executes SQL queries.
SparkPlanTest

Base class for writing tests for individual physical operators. For an example of how this class's test helper methods can be used, see SortSuite. Extends SparkFunSuite. Companion object contains helper methods for writing tests of individual physical operators.
BenchmarkQueryTest

Checks that the generated code has an appropriate size; otherwise JIT optimization might not work.
IntegratedUDFTestUtils

This object aims to integrate various UDF test cases so that Scala UDFs, Python UDFs and Scalar Pandas UDFs can be tested in SBT and Maven tests. The available UDFs are special: each one is a UDF wrapped in casts. The input column is cast to a string, the UDF returns the string as is, and the output column is then cast back to the input column's type. In this way the UDF is virtually a no-op. Note that, because of this implementation, complex types such as map, array and struct do not work with these UDFs, since they do not survive the cast round trip unchanged. To register Scala UDF in SQL:
https://gist.github.com/c62bbcec2da1517ec6bce313d8d3c51d
To register Python UDF in SQL:
https://gist.github.com/895f7d1ebe75c0b64c3dcdc46299443c
To register Scalar Pandas UDF in SQL:
https://gist.github.com/8bfd882d405dde89ca367abf81732fbc
To use it in Scala API and SQL:
https://gist.github.com/9195efdee2daede5009d2c889109b0b7
Streaming DataFrames and streaming Datasets Testing


StreamTest


A framework for implementing tests for streaming queries and sources. A test consists of a set of steps (expressed as a StreamAction) that are executed in order, blocking as necessary to let the stream catch up.  For example, the following adds some data to a stream, blocking until it can verify that the correct values are eventually produced.
https://gist.github.com/7f8216daf83c33f3bcee7894c51f29ec
Note that while we do sleep to allow the other thread to progress without spinning, StreamAction checks should not depend on the amount of time spent sleeping.  Instead they should check the actual progress of the stream before verifying the required test condition. Currently it is assumed that all streaming queries will eventually complete in 10 seconds to avoid hanging forever in the case of failures. However, individual suites can change this by overriding streamingTimeout.  Extends QueryTest with SharedSparkSession.
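
To show how the pieces fit into a complete suite, here is a minimal sketch built around MemoryStream, AddData and CheckAnswer. The suite name and data are hypothetical, and the MemoryStream import path is the one used in Spark 2.x and 3.x, which is an assumption for other versions. CheckAnswer compares against everything the query has produced so far, which is why the second check lists all four values.

```scala
import org.apache.spark.sql.execution.streaming.MemoryStream
import org.apache.spark.sql.streaming.StreamTest

// Hypothetical suite name; StreamTest already brings in QueryTest
// and SharedSparkSession.
class IncrementStreamSketchSuite extends StreamTest {
  import testImplicits._

  test("adds one to every element") {
    // In-memory source that the test can push data into.
    val inputData = MemoryStream[Int]
    val mapped = inputData.toDS().map(_ + 1)

    // Each StreamAction runs in order and blocks until the stream catches up.
    testStream(mapped)(
      AddData(inputData, 1, 2, 3),
      CheckAnswer(2, 3, 4),
      AddData(inputData, 10),
      CheckAnswer(2, 3, 4, 11)
    )
  }
}
```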


StateStoreMetricsTest


Extends StreamTest and adds assertions on state store metrics, such as the number of state rows.


StreamManualClock


A manual clock for streaming tests that allows checking whether the stream is waiting on the clock at the expected times.
1.2 Catalyst Unit Testing

PlanTest

Extends SparkFunSuite and PlanTestBase.
It contains no code of its own and simply mixes in the two. If you don't need SparkFunSuite, you can use PlanTestBase on its own.
PlanTestBase

Base trait for plan tests. Extends PredicateHelper with SQLHelper.
Example

https://gist.github.com/66706216608457fc7ecda05332086e2c
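
As an inline sketch of a typical plan test, the suite below runs a single optimizer rule through a small RuleExecutor and compares the result with an expected plan using comparePlans. The suite name, the choice of the CombineFilters rule and the test relation are assumptions made for illustration; the attributes are built explicitly rather than through the DSL's symbol shortcuts.

```scala
import org.apache.spark.sql.catalyst.dsl.expressions._
import org.apache.spark.sql.catalyst.dsl.plans._
import org.apache.spark.sql.catalyst.expressions.AttributeReference
import org.apache.spark.sql.catalyst.optimizer.CombineFilters
import org.apache.spark.sql.catalyst.plans.PlanTest
import org.apache.spark.sql.catalyst.plans.logical.{LocalRelation, LogicalPlan}
import org.apache.spark.sql.catalyst.rules.RuleExecutor
import org.apache.spark.sql.types.IntegerType

// Hypothetical suite name.
class CombineFiltersSketchSuite extends PlanTest {

  // A tiny optimizer that applies only the rule under test.
  object Optimize extends RuleExecutor[LogicalPlan] {
    val batches = Batch("Combine filters", FixedPoint(10), CombineFilters) :: Nil
  }

  private val a = AttributeReference("a", IntegerType)()
  private val b = AttributeReference("b", IntegerType)()
  private val testRelation = LocalRelation(a, b)

  test("adjacent filters are merged into one") {
    val original = testRelation.where(a > 1).where(b < 5).analyze
    val expected = testRelation.where(a > 1 && b < 5).analyze

    // comparePlans normalizes both plans before comparing them.
    comparePlans(Optimize.execute(original), expected)
  }
}
```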
AnalysisTest

An extension of PlanTest with some useful methods. It also creates two plan analyzers: one case-sensitive and one case-insensitive.
ExpressionEvalHelper

A few helper functions for expression evaluation testing. Mix in this trait to use them. It is mainly used for Catalyst development.
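
A minimal sketch of the trait in use, assuming the built-in Upper expression as the subject under test (the suite name is hypothetical). checkEvaluation evaluates the expression through several execution paths and compares each result with the expected value.

```scala
import org.apache.spark.SparkFunSuite
import org.apache.spark.sql.catalyst.expressions.{ExpressionEvalHelper, Literal, Upper}

// Hypothetical suite name.
class UpperSketchSuite extends SparkFunSuite with ExpressionEvalHelper {

  test("Upper upper-cases a string literal") {
    // The expected value is given as a plain Scala value.
    checkEvaluation(Upper(Literal("spark")), "SPARK")
  }
}
```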
1.3 Spark Hive Unit Testing

TestHiveSingleton

Base class for Spark Hive unit tests. Extends SparkFunSuite.
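
A rough sketch of a suite built on TestHiveSingleton, which exposes the shared TestHive session as spark. The suite name, table name and data are hypothetical, and the sketch assumes the spark-hive artifact and its test-jar are on the test classpath. The table is dropped explicitly because the shared Hive state outlives the test.

```scala
import org.apache.spark.sql.{QueryTest, Row}
import org.apache.spark.sql.hive.test.TestHiveSingleton

// Hypothetical suite name.
class HiveTableSketchSuite extends QueryTest with TestHiveSingleton {

  test("reads back a row written to a Hive table") {
    spark.sql("DROP TABLE IF EXISTS sketch_kv")
    try {
      spark.sql("CREATE TABLE sketch_kv (key INT, value STRING)")
      spark.sql("INSERT INTO sketch_kv VALUES (1, 'one')")

      checkAnswer(
        spark.sql("SELECT value FROM sketch_kv WHERE key = 1"),
        Row("one"))
    } finally {
      // Leave the shared metastore clean for other suites.
      spark.sql("DROP TABLE IF EXISTS sketch_kv")
    }
  }
}
```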
TestHiveContext

A locally running test instance of Spark's Hive execution engine. Data from testTables will be automatically loaded whenever a query is run over those tables. Calling reset will delete all tables and other state in the database, leaving the database in a "clean" state. TestHive is a singleton object version of this class, because instantiating multiple copies of the Hive metastore seems to lead to weird non-deterministic failures. Therefore, the execution of test cases that rely on TestHive must be serialized. Extends SQLContext.
HiveClientBuilder

A builder that creates a HiveClient for test purposes.
Example

https://gist.github.com/363e450309d982d621ad2dcbaaff8819
HiveComparisonTest

Allows the creation of tests that execute the same query against both Hive and Catalyst and compare the results. The "golden" results from Hive are cached in, and retrieved from, both the classpath and answerCache to speed up testing. See the documentation of the public vals in this class for information on how test execution can be configured using system properties. Extends SparkFunSuite.
HiveQueryFileTest

A framework for running query tests that are listed as a set of text files. Test suites that derive from this class must provide a map of testCaseName to testCaseFiles that should be included. Additionally, there is support for whitelisting and blacklisting tests as development progresses. Extends HiveComparisonTest.