@Fokko
Created April 17, 2019 06:42
C02VF05JHV2T:flink-gcs-fs fdriesprong$ spark-shell
2019-04-17 08:38:30 WARN NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Spark context Web UI available at http://c02vf05jhv2t.localdomain:4040
Spark context available as 'sc' (master = local[*], app id = local-1555483115980).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.4.0
      /_/
Using Scala version 2.11.12 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_192)
Type in expressions to have them evaluated.
Type :help for more information.
scala> import java.sql.Date
import java.sql.Date
scala> case class Record(dates: Date)
defined class Record
scala> val record = Record(new Date(2019, 4, 17))
warning: there was one deprecation warning; re-run with -deprecation for details
record: Record = Record(3919-05-17)
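The date 3919-05-17 is exactly what the deprecation warning above hints at: the deprecated java.sql.Date(int, int, int) constructor inherits java.util.Date semantics, where the year argument is an offset from 1900 and the month is zero-based. So new Date(2019, 4, 17) means May 17, 3919, not April 17, 2019. A minimal plain-Java sketch:

```java
import java.sql.Date;

public class DeprecatedDateDemo {
    @SuppressWarnings("deprecation")
    public static void main(String[] args) {
        // Deprecated constructor: year is relative to 1900, month is 0-based.
        // 2019 + 1900 = 3919, month index 4 = May.
        Date d = new Date(2019, 4, 17);
        System.out.println(d); // prints 3919-05-17
    }
}
```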
scala> val data = Array(record, record, record, record, record)
data: Array[Record] = Array(Record(3919-05-17), Record(3919-05-17), Record(3919-05-17), Record(3919-05-17), Record(3919-05-17))
scala> val distData = sc.parallelize(data)
distData: org.apache.spark.rdd.RDD[Record] = ParallelCollectionRDD[0] at parallelize at <console>:27
scala> val df = distData.toDF()
df: org.apache.spark.sql.DataFrame = [dates: date]
scala> df.show()
+----------+
| dates|
+----------+
|3919-05-17|
|3919-05-17|
|3919-05-17|
|3919-05-17|
|3919-05-17|
+----------+
scala> df.printSchema
root
|-- dates: date (nullable = true)
scala> df.coalesce(1).write.parquet("/tmp/dates/")
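As an aside: the non-deprecated way to build that date would be java.sql.Date.valueOf, which takes the literal year, month, and day via a yyyy-mm-dd string or a java.time.LocalDate. A sketch:

```java
import java.sql.Date;
import java.time.LocalDate;

public class DateValueOfDemo {
    public static void main(String[] args) {
        // valueOf takes literal values: no 1900 offset, no zero-based month.
        Date fromString = Date.valueOf("2019-04-17");
        Date fromLocalDate = Date.valueOf(LocalDate.of(2019, 4, 17));
        System.out.println(fromString);     // 2019-04-17
        System.out.println(fromLocalDate);  // 2019-04-17
    }
}
```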
When we open a new Spark shell and read the data back:
C02VF05JHV2T:flink-gcs-fs fdriesprong$ spark-shell
2019-04-17 08:36:09 WARN NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Spark context Web UI available at http://c02vf05jhv2t.localdomain:4040
Spark context available as 'sc' (master = local[*], app id = local-1555482975355).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.4.0
      /_/
Using Scala version 2.11.12 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_192)
Type in expressions to have them evaluated.
Type :help for more information.
scala> spark.read.parquet("/tmp/dates/").show()
+----------+
| dates|
+----------+
|3870-05-26|
|3870-05-26|
|3870-05-26|
|3870-05-26|
|3870-05-26|
+----------+
scala> spark.read.parquet("/tmp/dates/").printSchema()
root
|-- dates: date (nullable = true)
When we look at the Parquet file:
C02VF05JHV2T:flink-gcs-fs fdriesprong$ parquet-tools schema /tmp/dates/part-00000-619f59f1-0f50-4a00-a119-d2767a32107d-c000.snappy.parquet
message spark_schema {
optional int32 dates (DATE);
}
C02VF05JHV2T:flink-gcs-fs fdriesprong$ parquet-tools head /tmp/dates/part-00000-619f59f1-0f50-4a00-a119-d2767a32107d-c000.snappy.parquet
dates = 711993
dates = 711993
dates = 711993
dates = 711993
dates = 711993
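Parquet stores a DATE as an int32 count of days since the Unix epoch (1970-01-01). Interpreting the raw value 711993 under the ISO proleptic Gregorian calendar of java.time gives back exactly the date that was written, which suggests the file itself is consistent and the shift to 3870-05-26 appears on the read path. A quick check, assuming java.time's ISO chronology:

```java
import java.time.LocalDate;
import java.time.temporal.ChronoUnit;

public class EpochDayCheck {
    public static void main(String[] args) {
        // Decode the raw Parquet int32 as days since 1970-01-01 (ISO calendar).
        LocalDate stored = LocalDate.ofEpochDay(711993);
        System.out.println(stored); // 3919-05-17, matching what was written

        // The value Spark 2.4 shows after reading the file back:
        LocalDate readBack = LocalDate.of(3870, 5, 26);
        System.out.println(ChronoUnit.DAYS.between(readBack, stored) + " days apart");
    }
}
```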