@ad1happy2go
Last active May 17, 2023 15:53
# Copy the Avro schema and the source data from S3 to HDFS
hadoop fs -cp s3://rxusandbox-us-west-2/testcases/stocks/data/schema.avsc /tmp/
hadoop fs -cp s3://rxusandbox-us-west-2/testcases/stocks/data/source /tmp/source_parquet

# Timestamp suffix so each run writes to a fresh table name and base path
NOW=$(date '+%Y%m%dt%H%M%S')
# Run HoodieDeltaStreamer: ingest the DFS source into a COPY_ON_WRITE Hudi table and sync it to Hive
bin/spark-submit --master yarn --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \
~/hudi-utilities-bundle_2.12-0.13.0.jar --target-base-path /tmp/deltastreamertest/stocks${NOW} \
--target-table stocks${NOW} --table-type COPY_ON_WRITE --base-file-format PARQUET \
--source-class org.apache.hudi.utilities.sources.JsonDFSSource \
--source-ordering-field ts --payload-class org.apache.hudi.common.model.DefaultHoodieRecordPayload \
--schemaprovider-class org.apache.hudi.utilities.schema.FilebasedSchemaProvider \
--hoodie-conf hoodie.deltastreamer.schemaprovider.source.schema.file=/tmp/schema.avsc \
--hoodie-conf hoodie.deltastreamer.schemaprovider.target.schema.file=/tmp/schema.avsc \
--op UPSERT --enable-sync --spark-master yarn \
--hoodie-conf hoodie.deltastreamer.source.dfs.root=/tmp/source_parquet \
--hoodie-conf hoodie.datasource.write.recordkey.field=symbol \
--hoodie-conf hoodie.datasource.write.partitionpath.field=date --hoodie-conf hoodie.datasource.write.precombine.field=ts \
--hoodie-conf hoodie.datasource.write.keygenerator.type=SIMPLE --hoodie-conf hoodie.datasource.write.hive_style_partitioning=false \
--hoodie-conf hoodie.metadata.enable=true \
--hoodie-conf hoodie.datasource.hive_sync.mode=hms \
--hoodie-conf hoodie.datasource.hive_sync.skip_ro_suffix=true \
--hoodie-conf hoodie.datasource.hive_sync.ignore_exceptions=false \
--hoodie-conf hoodie.datasource.hive_sync.auto_create_database=true \
--hoodie-conf hoodie.datasource.hive_sync.database=default \
--hoodie-conf hoodie.datasource.hive_sync.partition_fields=date \
--hoodie-conf hoodie.datasource.hive_sync.partition_extractor_class=org.apache.hudi.hive.SlashEncodedDayPartitionValueExtractor \
--hoodie-conf hoodie.datasource.hive_sync.sync_as_datasource=true --hoodie-conf hoodie.datasource.hive_sync.sync_comment=true
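For reference, the same write configuration can also be expressed as Hudi datasource options for a plain Spark job (`df.write.format("hudi").options(**hudi_options)`). This is a minimal sketch, not taken from the gist: the table name reuses this run's example suffix, and `hoodie.datasource.write.operation` / `hoodie.datasource.write.table.type` stand in for the `--op` and `--table-type` flags; a live SparkSession and input DataFrame are assumed elsewhere.

```python
# Hudi write options mirroring the --hoodie-conf flags above.
# Only builds the dict; pass it to df.write.format("hudi").options(**hudi_options).
table_name = "stocks20230517t095743"  # example ${NOW}-suffixed name from this run

hudi_options = {
    "hoodie.table.name": table_name,
    "hoodie.datasource.write.operation": "upsert",            # replaces --op UPSERT
    "hoodie.datasource.write.table.type": "COPY_ON_WRITE",    # replaces --table-type
    "hoodie.datasource.write.recordkey.field": "symbol",
    "hoodie.datasource.write.partitionpath.field": "date",
    "hoodie.datasource.write.precombine.field": "ts",
    "hoodie.datasource.write.keygenerator.type": "SIMPLE",
    "hoodie.datasource.write.hive_style_partitioning": "false",
    "hoodie.metadata.enable": "true",
    # Hive sync settings mirroring the hive_sync.* flags (--enable-sync)
    "hoodie.datasource.hive_sync.enable": "true",
    "hoodie.datasource.hive_sync.mode": "hms",
    "hoodie.datasource.hive_sync.database": "default",
    "hoodie.datasource.hive_sync.table": table_name,
    "hoodie.datasource.hive_sync.partition_fields": "date",
    "hoodie.datasource.hive_sync.partition_extractor_class":
        "org.apache.hudi.hive.SlashEncodedDayPartitionValueExtractor",
}
```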
0: jdbc:hive2://localhost:10000> show tables;
INFO : Compiling command(queryId=hive_20230517100039_99ce3157-4da7-45f7-a53e-05f3c2236d7b): show tables
INFO : Concurrency mode is disabled, not creating a lock manager
INFO : Semantic Analysis Completed (retrial = false)
INFO : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:tab_name, type:string, comment:from deserializer)], properties:null)
INFO : EXPLAIN output for queryid hive_20230517100039_99ce3157-4da7-45f7-a53e-05f3c2236d7b : STAGE DEPENDENCIES:
  Stage-0 is a root stage [DDL]
  Stage-1 depends on stages: Stage-0 [FETCH]

STAGE PLANS:
  Stage: Stage-0
    Show Table Operator:
      Show Tables
        database name: default
        result file: file:/mnt/tmp/hive/e19c7def-27eb-493c-b1f2-4583464acbd6/hive_2023-05-17_10-00-39_071_6164531810747429971-2/-local-10000

  Stage: Stage-1
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink
INFO : Completed compiling command(queryId=hive_20230517100039_99ce3157-4da7-45f7-a53e-05f3c2236d7b); Time taken: 0.006 seconds
INFO : Concurrency mode is disabled, not creating a lock manager
INFO : Executing command(queryId=hive_20230517100039_99ce3157-4da7-45f7-a53e-05f3c2236d7b): show tables
INFO : Starting task [Stage-0:DDL] in serial mode
INFO : Completed executing command(queryId=hive_20230517100039_99ce3157-4da7-45f7-a53e-05f3c2236d7b); Time taken: 0.005 seconds
INFO : OK
INFO : Concurrency mode is disabled, not creating a lock manager
+------------------------+
|        tab_name        |
+------------------------+
| hudi_trips_cow_ext     |
| stocks20230517t095743  |
| table1                 |
| test_hudi_pyspark      |
+------------------------+
4 rows selected (0.029 seconds)
0: jdbc:hive2://localhost:10000> select * from stocks20230517t095743 limit 2;
INFO : Compiling command(queryId=hive_20230517100043_cd41b059-f1da-497e-969e-a612f1ce245b): select * from stocks20230517t095743 limit 2
INFO : Concurrency mode is disabled, not creating a lock manager
INFO : Semantic Analysis Completed (retrial = false)
INFO : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:stocks20230517t095743._hoodie_commit_time, type:string, comment:null), FieldSchema(name:stocks20230517t095743._hoodie_commit_seqno, type:string, comment:null), FieldSchema(name:stocks20230517t095743._hoodie_record_key, type:string, comment:null), FieldSchema(name:stocks20230517t095743._hoodie_partition_path, type:string, comment:null), FieldSchema(name:stocks20230517t095743._hoodie_file_name, type:string, comment:null), FieldSchema(name:stocks20230517t095743.volume, type:bigint, comment:null), FieldSchema(name:stocks20230517t095743.ts, type:string, comment:null), FieldSchema(name:stocks20230517t095743.symbol, type:string, comment:null), FieldSchema(name:stocks20230517t095743.year, type:int, comment:null), FieldSchema(name:stocks20230517t095743.month, type:string, comment:null), FieldSchema(name:stocks20230517t095743.high, type:double, comment:null), FieldSchema(name:stocks20230517t095743.low, type:double, comment:null), FieldSchema(name:stocks20230517t095743.key, type:string, comment:null), FieldSchema(name:stocks20230517t095743.close, type:double, comment:null), FieldSchema(name:stocks20230517t095743.open, type:double, comment:null), FieldSchema(name:stocks20230517t095743.day, type:string, comment:null), FieldSchema(name:stocks20230517t095743.date, type:string, comment:null)], properties:null)
INFO : EXPLAIN output for queryid hive_20230517100043_cd41b059-f1da-497e-969e-a612f1ce245b : STAGE DEPENDENCIES:
  Stage-0 is a root stage [FETCH]

STAGE PLANS:
  Stage: Stage-0
    Fetch Operator
      limit: 2
      Partition Description:
          Partition
            input format: org.apache.hudi.hadoop.HoodieParquetInputFormat
            output format: org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat
            partition values:
              date 2018-08-31
            properties:
              bucket_count 0
              column.name.delimiter ,
              columns _hoodie_commit_time,_hoodie_commit_seqno,_hoodie_record_key,_hoodie_partition_path,_hoodie_file_name,volume,ts,symbol,year,month,high,low,key,close,open,day
              columns.comments
              columns.types string:string:string:string:string:bigint:string:string:int:string:double:double:string:double:double:string
              file.inputformat org.apache.hudi.hadoop.HoodieParquetInputFormat
              file.outputformat org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat
              hoodie.query.as.ro.table false
              location hdfs://ip-172-31-23-26.us-east-2.compute.internal:8020/tmp/deltastreamertest/stocks20230517t095743/2018/08/31
              name default.stocks20230517t095743
              numFiles 1
              partition_columns date
              partition_columns.types string
              path /tmp/deltastreamertest/stocks20230517t095743
              serialization.ddl struct stocks20230517t095743 { string _hoodie_commit_time, string _hoodie_commit_seqno, string _hoodie_record_key, string _hoodie_partition_path, string _hoodie_file_name, i64 volume, string ts, string symbol, i32 year, string month, double high, double low, string key, double close, double open, string day}
              serialization.format 1
              serialization.lib org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
              totalSize 440885
              transient_lastDdlTime 1684317558
            serde: org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe

            input format: org.apache.hudi.hadoop.HoodieParquetInputFormat
            output format: org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat
            properties:
              EXTERNAL TRUE
              bucket_count 0
              column.name.delimiter ,
              columns _hoodie_commit_time,_hoodie_commit_seqno,_hoodie_record_key,_hoodie_partition_path,_hoodie_file_name,volume,ts,symbol,year,month,high,low,key,close,open,day
              columns.comments
              columns.types string:string:string:string:string:bigint:string:string:int:string:double:double:string:double:double:string
              file.inputformat org.apache.hudi.hadoop.HoodieParquetInputFormat
              file.outputformat org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat
              hoodie.query.as.ro.table false
              last_commit_time_sync 20230517095901508
              location hdfs://ip-172-31-23-26.us-east-2.compute.internal:8020/tmp/deltastreamertest/stocks20230517t095743
              name default.stocks20230517t095743
              partition_columns date
              partition_columns.types string
              path /tmp/deltastreamertest/stocks20230517t095743
              serialization.ddl struct stocks20230517t095743 { string _hoodie_commit_time, string _hoodie_commit_seqno, string _hoodie_record_key, string _hoodie_partition_path, string _hoodie_file_name, i64 volume, string ts, string symbol, i32 year, string month, double high, double low, string key, double close, double open, string day}
              serialization.format 1
              serialization.lib org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
              spark.sql.sources.provider hudi
              spark.sql.sources.schema.numPartCols 1
              spark.sql.sources.schema.numParts 1
              spark.sql.sources.schema.part.0 {"type":"struct","fields":[{"name":"_hoodie_commit_time","type":"string","nullable":true,"metadata":{}},{"name":"_hoodie_commit_seqno","type":"string","nullable":true,"metadata":{}},{"name":"_hoodie_record_key","type":"string","nullable":true,"metadata":{}},{"name":"_hoodie_partition_path","type":"string","nullable":true,"metadata":{}},{"name":"_hoodie_file_name","type":"string","nullable":true,"metadata":{}},{"name":"volume","type":"long","nullable":false,"metadata":{}},{"name":"ts","type":"string","nullable":false,"metadata":{}},{"name":"symbol","type":"string","nullable":false,"metadata":{}},{"name":"year","type":"integer","nullable":false,"metadata":{}},{"name":"month","type":"string","nullable":false,"metadata":{}},{"name":"high","type":"double","nullable":false,"metadata":{}},{"name":"low","type":"double","nullable":false,"metadata":{}},{"name":"key","type":"string","nullable":false,"metadata":{}},{"name":"close","type":"double","nullable":false,"metadata":{}},{"name":"open","type":"double","nullable":false,"metadata":{}},{"name":"day","type":"string","nullable":false,"metadata":{}},{"name":"date","type":"string","nullable":false,"metadata":{}}]}
              spark.sql.sources.schema.partCol.0 date
              transient_lastDdlTime 1684317556
            serde: org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
            name: default.stocks20230517t095743
          name: default.stocks20230517t095743
      Processor Tree:
        TableScan
          alias: stocks20230517t095743
          GatherStats: false
          Select Operator
            expressions: _hoodie_commit_time (type: string), _hoodie_commit_seqno (type: string), _hoodie_record_key (type: string), _hoodie_partition_path (type: string), _hoodie_file_name (type: string), volume (type: bigint), ts (type: string), symbol (type: string), year (type: int), month (type: string), high (type: double), low (type: double), key (type: string), close (type: double), open (type: double), day (type: string), date (type: string)
            outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7, _col8, _col9, _col10, _col11, _col12, _col13, _col14, _col15, _col16
            Limit
              Number of rows: 2
              ListSink
INFO : Completed compiling command(queryId=hive_20230517100043_cd41b059-f1da-497e-969e-a612f1ce245b); Time taken: 0.073 seconds
INFO : Concurrency mode is disabled, not creating a lock manager
INFO : Executing command(queryId=hive_20230517100043_cd41b059-f1da-497e-969e-a612f1ce245b): select * from stocks20230517t095743 limit 2
INFO : Completed executing command(queryId=hive_20230517100043_cd41b059-f1da-497e-969e-a612f1ce245b); Time taken: 0.001 seconds
INFO : OK
INFO : Concurrency mode is disabled, not creating a lock manager
+--------------------------------------------+---------------------------------------------+-------------------------------------------+-----------------------------------------------+----------------------------------------------------+-------------------------------+---------------------------+-------------------------------+-----------------------------+------------------------------+-----------------------------+----------------------------+----------------------------+------------------------------+-----------------------------+----------------------------+-----------------------------+
| stocks20230517t095743._hoodie_commit_time | stocks20230517t095743._hoodie_commit_seqno | stocks20230517t095743._hoodie_record_key | stocks20230517t095743._hoodie_partition_path | stocks20230517t095743._hoodie_file_name | stocks20230517t095743.volume | stocks20230517t095743.ts | stocks20230517t095743.symbol | stocks20230517t095743.year | stocks20230517t095743.month | stocks20230517t095743.high | stocks20230517t095743.low | stocks20230517t095743.key | stocks20230517t095743.close | stocks20230517t095743.open | stocks20230517t095743.day | stocks20230517t095743.date |
+--------------------------------------------+---------------------------------------------+-------------------------------------------+-----------------------------------------------+----------------------------------------------------+-------------------------------+---------------------------+-------------------------------+-----------------------------+------------------------------+-----------------------------+----------------------------+----------------------------+------------------------------+-----------------------------+----------------------------+-----------------------------+
| 20230517095901508 | 20230517095901508_0_0 | SSTI | 2018/08/31 | f86c1334-32d6-438a-947f-3cfc2071f97a-0_0-22-32_20230517095901508.parquet | 4359 | 2018-08-31 10:58:00 | SSTI | 2018 | 08 | 57.07 | 56.9 | SSTI_2018-08-31 10 | 57.07 | 56.9 | 31 | 2018-08-31 |
| 20230517095901508 | 20230517095901508_0_1 | VICR | 2018/08/31 | f86c1334-32d6-438a-947f-3cfc2071f97a-0_0-22-32_20230517095901508.parquet | 1144 | 2018-08-31 10:58:00 | VICR | 2018 | 08 | 61.7 | 61.7 | VICR_2018-08-31 10 | 61.7 | 61.7 | 31 | 2018-08-31 |
+--------------------------------------------+---------------------------------------------+-------------------------------------------+-----------------------------------------------+----------------------------------------------------+-------------------------------+---------------------------+-------------------------------+-----------------------------+------------------------------+-----------------------------+----------------------------+----------------------------+------------------------------+-----------------------------+----------------------------+-----------------------------+
2 rows selected (0.229 seconds)
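Note how `_hoodie_partition_path` comes back as `2018/08/31` while the Hive partition value is `date=2018-08-31`: the table uses slash-encoded day partitioning, and the `SlashEncodedDayPartitionValueExtractor` configured above recovers the `yyyy-MM-dd` value from the `yyyy/mm/dd` path during Hive sync. A minimal illustrative re-implementation of that mapping (a sketch, not Hudi's actual code):

```python
def extract_partition_value(partition_path: str) -> str:
    """Map a slash-encoded day path such as '2018/08/31' to a Hive
    partition value '2018-08-31', mimicking what Hudi's
    SlashEncodedDayPartitionValueExtractor produces during Hive sync.
    Illustrative sketch only."""
    year, month, day = partition_path.strip("/").split("/")
    return f"{year}-{month}-{day}"

# e.g. the partition seen in the query output above
print(extract_partition_value("2018/08/31"))  # → 2018-08-31
```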
0: jdbc:hive2://localhost:10000>