Last active
May 17, 2023 15:53
-
-
Save ad1happy2go/b51bf6d131a838da1c40b69f9afae1fb to your computer and use it in GitHub Desktop.
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
hadoop fs -cp s3://rxusandbox-us-west-2/testcases/stocks/data/schema.avsc /tmp/ | |
hadoop fs -cp s3://rxusandbox-us-west-2/testcases/stocks/data/source /tmp/source_parquet | |
NOW=$(date '+%Y%m%dt%H%M%S') | |
bin/spark-submit --master yarn --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \ | |
~/hudi-utilities-bundle_2.12-0.13.0.jar --target-base-path /tmp/deltastreamertest/stocks${NOW} \ | |
--target-table stocks${NOW} --table-type COPY_ON_WRITE --base-file-format PARQUET \ | |
--source-class org.apache.hudi.utilities.sources.JsonDFSSource \ | |
--source-ordering-field ts --payload-class org.apache.hudi.common.model.DefaultHoodieRecordPayload \ | |
--schemaprovider-class org.apache.hudi.utilities.schema.FilebasedSchemaProvider \ | |
--hoodie-conf hoodie.deltastreamer.schemaprovider.source.schema.file=/tmp/schema.avsc \ | |
--hoodie-conf hoodie.deltastreamer.schemaprovider.target.schema.file=/tmp/schema.avsc \ | |
--op UPSERT --enable-sync --spark-master yarn \ | |
--hoodie-conf hoodie.deltastreamer.source.dfs.root=/tmp/source_parquet \ | |
--hoodie-conf hoodie.datasource.write.recordkey.field=symbol \ | |
--hoodie-conf hoodie.datasource.write.partitionpath.field=date --hoodie-conf hoodie.datasource.write.precombine.field=ts \ | |
--hoodie-conf hoodie.datasource.write.keygenerator.type=SIMPLE --hoodie-conf hoodie.datasource.write.hive_style_partitioning=false \ | |
--hoodie-conf hoodie.metadata.enable=true \ | |
--hoodie-conf hoodie.datasource.hive_sync.mode=hms \ | |
--hoodie-conf hoodie.datasource.hive_sync.skip_ro_suffix=true \ | |
--hoodie-conf hoodie.datasource.hive_sync.ignore_exceptions=false \ | |
--hoodie-conf hoodie.datasource.hive_sync.auto_create_database=true \ | |
--hoodie-conf hoodie.datasource.hive_sync.database=default \ | |
--hoodie-conf hoodie.datasource.hive_sync.partition_fields=date \ | |
--hoodie-conf hoodie.datasource.hive_sync.partition_extractor_class=org.apache.hudi.hive.SlashEncodedDayPartitionValueExtractor \ | |
--hoodie-conf hoodie.datasource.hive_sync.sync_as_datasource=true --hoodie-conf hoodie.datasource.hive_sync.sync_comment=true | |
0: jdbc:hive2://localhost:10000> show tables; | |
INFO : Compiling command(queryId=hive_20230517100039_99ce3157-4da7-45f7-a53e-05f3c2236d7b): show tables | |
INFO : Concurrency mode is disabled, not creating a lock manager | |
INFO : Semantic Analysis Completed (retrial = false) | |
INFO : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:tab_name, type:string, comment:from deserializer)], properties:null) | |
INFO : EXPLAIN output for queryid hive_20230517100039_99ce3157-4da7-45f7-a53e-05f3c2236d7b : STAGE DEPENDENCIES: | |
Stage-0 is a root stage [DDL] | |
Stage-1 depends on stages: Stage-0 [FETCH] | |
STAGE PLANS: | |
Stage: Stage-0 | |
Show Table Operator: | |
Show Tables | |
database name: default | |
result file: file:/mnt/tmp/hive/e19c7def-27eb-493c-b1f2-4583464acbd6/hive_2023-05-17_10-00-39_071_6164531810747429971-2/-local-10000 | |
Stage: Stage-1 | |
Fetch Operator | |
limit: -1 | |
Processor Tree: | |
ListSink | |
INFO : Completed compiling command(queryId=hive_20230517100039_99ce3157-4da7-45f7-a53e-05f3c2236d7b); Time taken: 0.006 seconds | |
INFO : Concurrency mode is disabled, not creating a lock manager | |
INFO : Executing command(queryId=hive_20230517100039_99ce3157-4da7-45f7-a53e-05f3c2236d7b): show tables | |
INFO : Starting task [Stage-0:DDL] in serial mode | |
INFO : Completed executing command(queryId=hive_20230517100039_99ce3157-4da7-45f7-a53e-05f3c2236d7b); Time taken: 0.005 seconds | |
INFO : OK | |
INFO : Concurrency mode is disabled, not creating a lock manager | |
+------------------------+ | |
| tab_name | | |
+------------------------+ | |
| hudi_trips_cow_ext | | |
| stocks20230517t095743 | | |
| table1 | | |
| test_hudi_pyspark | | |
+------------------------+ | |
4 rows selected (0.029 seconds) | |
0: jdbc:hive2://localhost:10000> select * from stocks20230517t095743 limit 2; | |
INFO : Compiling command(queryId=hive_20230517100043_cd41b059-f1da-497e-969e-a612f1ce245b): select * from stocks20230517t095743 limit 2 | |
INFO : Concurrency mode is disabled, not creating a lock manager | |
INFO : Semantic Analysis Completed (retrial = false) | |
INFO : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:stocks20230517t095743._hoodie_commit_time, type:string, comment:null), FieldSchema(name:stocks20230517t095743._hoodie_commit_seqno, type:string, comment:null), FieldSchema(name:stocks20230517t095743._hoodie_record_key, type:string, comment:null), FieldSchema(name:stocks20230517t095743._hoodie_partition_path, type:string, comment:null), FieldSchema(name:stocks20230517t095743._hoodie_file_name, type:string, comment:null), FieldSchema(name:stocks20230517t095743.volume, type:bigint, comment:null), FieldSchema(name:stocks20230517t095743.ts, type:string, comment:null), FieldSchema(name:stocks20230517t095743.symbol, type:string, comment:null), FieldSchema(name:stocks20230517t095743.year, type:int, comment:null), FieldSchema(name:stocks20230517t095743.month, type:string, comment:null), FieldSchema(name:stocks20230517t095743.high, type:double, comment:null), FieldSchema(name:stocks20230517t095743.low, type:double, comment:null), FieldSchema(name:stocks20230517t095743.key, type:string, comment:null), FieldSchema(name:stocks20230517t095743.close, type:double, comment:null), FieldSchema(name:stocks20230517t095743.open, type:double, comment:null), FieldSchema(name:stocks20230517t095743.day, type:string, comment:null), FieldSchema(name:stocks20230517t095743.date, type:string, comment:null)], properties:null) | |
INFO : EXPLAIN output for queryid hive_20230517100043_cd41b059-f1da-497e-969e-a612f1ce245b : STAGE DEPENDENCIES: | |
Stage-0 is a root stage [FETCH] | |
STAGE PLANS: | |
Stage: Stage-0 | |
Fetch Operator | |
limit: 2 | |
Partition Description: | |
Partition | |
input format: org.apache.hudi.hadoop.HoodieParquetInputFormat | |
output format: org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat | |
partition values: | |
date 2018-08-31 | |
properties: | |
bucket_count 0 | |
column.name.delimiter , | |
columns _hoodie_commit_time,_hoodie_commit_seqno,_hoodie_record_key,_hoodie_partition_path,_hoodie_file_name,volume,ts,symbol,year,month,high,low,key,close,open,day | |
columns.comments | |
columns.types string:string:string:string:string:bigint:string:string:int:string:double:double:string:double:double:string | |
file.inputformat org.apache.hudi.hadoop.HoodieParquetInputFormat | |
file.outputformat org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat | |
hoodie.query.as.ro.table false | |
location hdfs://ip-172-31-23-26.us-east-2.compute.internal:8020/tmp/deltastreamertest/stocks20230517t095743/2018/08/31 | |
name default.stocks20230517t095743 | |
numFiles 1 | |
partition_columns date | |
partition_columns.types string | |
path /tmp/deltastreamertest/stocks20230517t095743 | |
serialization.ddl struct stocks20230517t095743 { string _hoodie_commit_time, string _hoodie_commit_seqno, string _hoodie_record_key, string _hoodie_partition_path, string _hoodie_file_name, i64 volume, string ts, string symbol, i32 year, string month, double high, double low, string key, double close, double open, string day} | |
serialization.format 1 | |
serialization.lib org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe | |
totalSize 440885 | |
transient_lastDdlTime 1684317558 | |
serde: org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe | |
input format: org.apache.hudi.hadoop.HoodieParquetInputFormat | |
output format: org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat | |
properties: | |
EXTERNAL TRUE | |
bucket_count 0 | |
column.name.delimiter , | |
columns _hoodie_commit_time,_hoodie_commit_seqno,_hoodie_record_key,_hoodie_partition_path,_hoodie_file_name,volume,ts,symbol,year,month,high,low,key,close,open,day | |
columns.comments | |
columns.types string:string:string:string:string:bigint:string:string:int:string:double:double:string:double:double:string | |
file.inputformat org.apache.hudi.hadoop.HoodieParquetInputFormat | |
file.outputformat org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat | |
hoodie.query.as.ro.table false | |
last_commit_time_sync 20230517095901508 | |
location hdfs://ip-172-31-23-26.us-east-2.compute.internal:8020/tmp/deltastreamertest/stocks20230517t095743 | |
name default.stocks20230517t095743 | |
partition_columns date | |
partition_columns.types string | |
path /tmp/deltastreamertest/stocks20230517t095743 | |
serialization.ddl struct stocks20230517t095743 { string _hoodie_commit_time, string _hoodie_commit_seqno, string _hoodie_record_key, string _hoodie_partition_path, string _hoodie_file_name, i64 volume, string ts, string symbol, i32 year, string month, double high, double low, string key, double close, double open, string day} | |
serialization.format 1 | |
serialization.lib org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe | |
spark.sql.sources.provider hudi | |
spark.sql.sources.schema.numPartCols 1 | |
spark.sql.sources.schema.numParts 1 | |
spark.sql.sources.schema.part.0 {"type":"struct","fields":[{"name":"_hoodie_commit_time","type":"string","nullable":true,"metadata":{}},{"name":"_hoodie_commit_seqno","type":"string","nullable":true,"metadata":{}},{"name":"_hoodie_record_key","type":"string","nullable":true,"metadata":{}},{"name":"_hoodie_partition_path","type":"string","nullable":true,"metadata":{}},{"name":"_hoodie_file_name","type":"string","nullable":true,"metadata":{}},{"name":"volume","type":"long","nullable":false,"metadata":{}},{"name":"ts","type":"string","nullable":false,"metadata":{}},{"name":"symbol","type":"string","nullable":false,"metadata":{}},{"name":"year","type":"integer","nullable":false,"metadata":{}},{"name":"month","type":"string","nullable":false,"metadata":{}},{"name":"high","type":"double","nullable":false,"metadata":{}},{"name":"low","type":"double","nullable":false,"metadata":{}},{"name":"key","type":"string","nullable":false,"metadata":{}},{"name":"close","type":"double","nullable":false,"metadata":{}},{"name":"open","type":"double","nullable":false,"metadata":{}},{"name":"day","type":"string","nullable":false,"metadata":{}},{"name":"date","type":"string","nullable":false,"metadata":{}}]} | |
spark.sql.sources.schema.partCol.0 date | |
transient_lastDdlTime 1684317556 | |
serde: org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe | |
name: default.stocks20230517t095743 | |
name: default.stocks20230517t095743 | |
Processor Tree: | |
TableScan | |
alias: stocks20230517t095743 | |
GatherStats: false | |
Select Operator | |
expressions: _hoodie_commit_time (type: string), _hoodie_commit_seqno (type: string), _hoodie_record_key (type: string), _hoodie_partition_path (type: string), _hoodie_file_name (type: string), volume (type: bigint), ts (type: string), symbol (type: string), year (type: int), month (type: string), high (type: double), low (type: double), key (type: string), close (type: double), open (type: double), day (type: string), date (type: string) | |
outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7, _col8, _col9, _col10, _col11, _col12, _col13, _col14, _col15, _col16 | |
Limit | |
Number of rows: 2 | |
ListSink | |
INFO : Completed compiling command(queryId=hive_20230517100043_cd41b059-f1da-497e-969e-a612f1ce245b); Time taken: 0.073 seconds | |
INFO : Concurrency mode is disabled, not creating a lock manager | |
INFO : Executing command(queryId=hive_20230517100043_cd41b059-f1da-497e-969e-a612f1ce245b): select * from stocks20230517t095743 limit 2 | |
INFO : Completed executing command(queryId=hive_20230517100043_cd41b059-f1da-497e-969e-a612f1ce245b); Time taken: 0.001 seconds | |
INFO : OK | |
INFO : Concurrency mode is disabled, not creating a lock manager | |
+--------------------------------------------+---------------------------------------------+-------------------------------------------+-----------------------------------------------+----------------------------------------------------+-------------------------------+---------------------------+-------------------------------+-----------------------------+------------------------------+-----------------------------+----------------------------+----------------------------+------------------------------+-----------------------------+----------------------------+-----------------------------+ | |
| stocks20230517t095743._hoodie_commit_time | stocks20230517t095743._hoodie_commit_seqno | stocks20230517t095743._hoodie_record_key | stocks20230517t095743._hoodie_partition_path | stocks20230517t095743._hoodie_file_name | stocks20230517t095743.volume | stocks20230517t095743.ts | stocks20230517t095743.symbol | stocks20230517t095743.year | stocks20230517t095743.month | stocks20230517t095743.high | stocks20230517t095743.low | stocks20230517t095743.key | stocks20230517t095743.close | stocks20230517t095743.open | stocks20230517t095743.day | stocks20230517t095743.date | | |
+--------------------------------------------+---------------------------------------------+-------------------------------------------+-----------------------------------------------+----------------------------------------------------+-------------------------------+---------------------------+-------------------------------+-----------------------------+------------------------------+-----------------------------+----------------------------+----------------------------+------------------------------+-----------------------------+----------------------------+-----------------------------+ | |
| 20230517095901508 | 20230517095901508_0_0 | SSTI | 2018/08/31 | f86c1334-32d6-438a-947f-3cfc2071f97a-0_0-22-32_20230517095901508.parquet | 4359 | 2018-08-31 10:58:00 | SSTI | 2018 | 08 | 57.07 | 56.9 | SSTI_2018-08-31 10 | 57.07 | 56.9 | 31 | 2018-08-31 | | |
| 20230517095901508 | 20230517095901508_0_1 | VICR | 2018/08/31 | f86c1334-32d6-438a-947f-3cfc2071f97a-0_0-22-32_20230517095901508.parquet | 1144 | 2018-08-31 10:58:00 | VICR | 2018 | 08 | 61.7 | 61.7 | VICR_2018-08-31 10 | 61.7 | 61.7 | 31 | 2018-08-31 | | |
+--------------------------------------------+---------------------------------------------+-------------------------------------------+-----------------------------------------------+----------------------------------------------------+-------------------------------+---------------------------+-------------------------------+-----------------------------+------------------------------+-----------------------------+----------------------------+----------------------------+------------------------------+-----------------------------+----------------------------+-----------------------------+ | |
2 rows selected (0.229 seconds) | |
0: jdbc:hive2://localhost:10000> |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment