Created April 17, 2019 06:42
C02VF05JHV2T:flink-gcs-fs fdriesprong$ spark-shell
2019-04-17 08:38:30 WARN  NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Spark context Web UI available at http://c02vf05jhv2t.localdomain:4040
Spark context available as 'sc' (master = local[*], app id = local-1555483115980).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.4.0
      /_/

Using Scala version 2.11.12 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_192)
Type in expressions to have them evaluated.
Type :help for more information.
scala> import java.sql.Date
import java.sql.Date

scala> case class Record(dates: Date)
defined class Record

scala> val record = Record(new Date(2019, 4, 17))
warning: there was one deprecation warning; re-run with -deprecation for details
record: Record = Record(3919-05-17)

scala> val data = Array(record, record, record, record, record)
data: Array[Record] = Array(Record(3919-05-17), Record(3919-05-17), Record(3919-05-17), Record(3919-05-17), Record(3919-05-17))
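The 3919-05-17 values come from the deprecated `Date(int, int, int)` constructor that `java.sql.Date` inherits from `java.util.Date`: the year argument is an offset from 1900 and the month argument is zero-based, so `new Date(2019, 4, 17)` means May 17 of the year 3919. A minimal Java sketch of the same behavior, with `Date.valueOf` as the unambiguous alternative:

```java
import java.sql.Date;

public class DateDemo {
    public static void main(String[] args) {
        // Deprecated constructor: year is an offset from 1900, month is 0-based.
        @SuppressWarnings("deprecation")
        Date d = new Date(2019, 4, 17);        // 1900 + 2019 = 3919, month 4 = May
        System.out.println(d);                 // 3919-05-17

        // Unambiguous alternative: parse an ISO-8601 date string.
        Date ok = Date.valueOf("2019-04-17");
        System.out.println(ok);                // 2019-04-17
    }
}
```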
scala> val distData = sc.parallelize(data)
distData: org.apache.spark.rdd.RDD[Record] = ParallelCollectionRDD[0] at parallelize at <console>:27

scala> val df = distData.toDF()
df: org.apache.spark.sql.DataFrame = [dates: date]

scala> df.show()
+----------+
|     dates|
+----------+
|3919-05-17|
|3919-05-17|
|3919-05-17|
|3919-05-17|
|3919-05-17|
+----------+

scala> df.printSchema
root
 |-- dates: date (nullable = true)

scala> df.coalesce(1).write.parquet("/tmp/dates/")
When we open a new Spark shell and read the data back:
C02VF05JHV2T:flink-gcs-fs fdriesprong$ spark-shell
2019-04-17 08:36:09 WARN  NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Spark context Web UI available at http://c02vf05jhv2t.localdomain:4040
Spark context available as 'sc' (master = local[*], app id = local-1555482975355).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.4.0
      /_/

Using Scala version 2.11.12 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_192)
Type in expressions to have them evaluated.
Type :help for more information.
scala> spark.read.parquet("/tmp/dates/").show()
+----------+
|     dates|
+----------+
|3870-05-26|
|3870-05-26|
|3870-05-26|
|3870-05-26|
|3870-05-26|
+----------+

scala> spark.read.parquet("/tmp/dates/").printSchema()
root
 |-- dates: date (nullable = true)
When we look at the Parquet file:
C02VF05JHV2T:flink-gcs-fs fdriesprong$ parquet-tools schema /tmp/dates/part-00000-619f59f1-0f50-4a00-a119-d2767a32107d-c000.snappy.parquet
message spark_schema {
  optional int32 dates (DATE);
}

C02VF05JHV2T:flink-gcs-fs fdriesprong$ parquet-tools head /tmp/dates/part-00000-619f59f1-0f50-4a00-a119-d2767a32107d-c000.snappy.parquet
dates = 711993
dates = 711993
dates = 711993
dates = 711993
dates = 711993
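Parquet's DATE logical type is an int32 holding the number of days since the Unix epoch (1970-01-01), so every row stores 711993. Decoding that value with java.time, which uses the proleptic Gregorian calendar, yields 3919-05-17, the value that was originally written; this suggests the file on disk is intact and the discrepancy appears on Spark's read path. A quick check:

```java
import java.time.LocalDate;

public class ParquetDateCheck {
    public static void main(String[] args) {
        // Parquet DATE = int32 days since 1970-01-01, proleptic Gregorian.
        LocalDate decoded = LocalDate.ofEpochDay(711993);
        System.out.println(decoded);  // 3919-05-17
    }
}
```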