@Fokko
Created April 17, 2019 06:42
C02VF05JHV2T:flink-gcs-fs fdriesprong$ spark-shell
2019-04-17 08:38:30 WARN NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Spark context Web UI available at http://c02vf05jhv2t.localdomain:4040
Spark context available as 'sc' (master = local[*], app id = local-1555483115980).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.4.0
      /_/
Using Scala version 2.11.12 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_192)
Type in expressions to have them evaluated.
Type :help for more information.
scala> import java.sql.Date
import java.sql.Date
scala> case class Record(dates: Date)
defined class Record
scala> val record = Record(new Date(2019, 4, 17))
warning: there was one deprecation warning; re-run with -deprecation for details
record: Record = Record(3919-05-17)
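The date 3919-05-17 is exactly what the deprecation warning above hints at: the deprecated java.sql.Date(int, int, int) constructor inherits java.util.Date semantics, where the year argument is an offset from 1900 and the month is zero-based. So new Date(2019, 4, 17) means May 17, 3919, not April 17, 2019. A minimal plain-Java sketch:

```java
import java.sql.Date;

public class DeprecatedDateDemo {
    @SuppressWarnings("deprecation")
    public static void main(String[] args) {
        // Deprecated constructor: year is relative to 1900, month is 0-based.
        // 2019 + 1900 = 3919, month index 4 = May.
        Date d = new Date(2019, 4, 17);
        System.out.println(d); // prints 3919-05-17
    }
}
```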
scala> val data = Array(record, record, record, record, record)
data: Array[Record] = Array(Record(3919-05-17), Record(3919-05-17), Record(3919-05-17), Record(3919-05-17), Record(3919-05-17))
scala> val distData = sc.parallelize(data)
distData: org.apache.spark.rdd.RDD[Record] = ParallelCollectionRDD[0] at parallelize at <console>:27
scala> val df = distData.toDF()
df: org.apache.spark.sql.DataFrame = [dates: date]
scala> df.show()
+----------+
| dates|
+----------+
|3919-05-17|
|3919-05-17|
|3919-05-17|
|3919-05-17|
|3919-05-17|
+----------+
scala> df.printSchema
root
|-- dates: date (nullable = true)
scala> df.coalesce(1).write.parquet("/tmp/dates/")
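As an aside: the non-deprecated way to build that date would be java.sql.Date.valueOf, which takes the literal year, month, and day via a yyyy-mm-dd string or a java.time.LocalDate. A sketch:

```java
import java.sql.Date;
import java.time.LocalDate;

public class DateValueOfDemo {
    public static void main(String[] args) {
        // valueOf takes literal values: no 1900 offset, no zero-based month.
        Date fromString = Date.valueOf("2019-04-17");
        Date fromLocalDate = Date.valueOf(LocalDate.of(2019, 4, 17));
        System.out.println(fromString);     // 2019-04-17
        System.out.println(fromLocalDate);  // 2019-04-17
    }
}
```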
When we open a new Spark shell and read the data back:
C02VF05JHV2T:flink-gcs-fs fdriesprong$ spark-shell
2019-04-17 08:36:09 WARN NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Spark context Web UI available at http://c02vf05jhv2t.localdomain:4040
Spark context available as 'sc' (master = local[*], app id = local-1555482975355).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.4.0
      /_/
Using Scala version 2.11.12 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_192)
Type in expressions to have them evaluated.
Type :help for more information.
scala> spark.read.parquet("/tmp/dates/").show()
+----------+
| dates|
+----------+
|3870-05-26|
|3870-05-26|
|3870-05-26|
|3870-05-26|
|3870-05-26|
+----------+
scala> spark.read.parquet("/tmp/dates/").printSchema()
root
|-- dates: date (nullable = true)
When we look at the Parquet file:
C02VF05JHV2T:flink-gcs-fs fdriesprong$ parquet-tools schema /tmp/dates/part-00000-619f59f1-0f50-4a00-a119-d2767a32107d-c000.snappy.parquet
message spark_schema {
optional int32 dates (DATE);
}
C02VF05JHV2T:flink-gcs-fs fdriesprong$ parquet-tools head /tmp/dates/part-00000-619f59f1-0f50-4a00-a119-d2767a32107d-c000.snappy.parquet
dates = 711993
dates = 711993
dates = 711993
dates = 711993
dates = 711993
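Parquet stores a DATE as an int32 count of days since the Unix epoch (1970-01-01). Interpreting the raw value 711993 under the ISO proleptic Gregorian calendar of java.time gives back exactly the date that was written, which suggests the file itself is consistent and the shift to 3870-05-26 appears on the read path. A quick check, assuming java.time's ISO chronology:

```java
import java.time.LocalDate;
import java.time.temporal.ChronoUnit;

public class EpochDayCheck {
    public static void main(String[] args) {
        // Decode the raw Parquet int32 as days since 1970-01-01 (ISO calendar).
        LocalDate stored = LocalDate.ofEpochDay(711993);
        System.out.println(stored); // 3919-05-17, matching what was written

        // The value Spark 2.4 shows after reading the file back:
        LocalDate readBack = LocalDate.of(3870, 5, 26);
        System.out.println(ChronoUnit.DAYS.between(readBack, stored) + " days apart");
    }
}
```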