@garystafford
Last active December 30, 2021 22:31
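# Name of the S3 bucket backing the data lake (replace the placeholder before running)
export DATA_LAKE_BUCKET="<your_data_lake_bucket_name>"

# Artworks data: single bulk insert into a Hudi Merge-on-Read (MoR) table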
spark-submit \
--jars /usr/lib/spark/jars/spark-avro.jar,/usr/lib/hudi/hudi-utilities-bundle.jar \
--conf spark.sql.catalogImplementation=hive \
--class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer /usr/lib/hudi/hudi-utilities-bundle.jar \
--table-type MERGE_ON_READ \
--source-ordering-field __source_ts_ms \
--props "s3://${DATA_LAKE_BUCKET}/hudi/deltastreamer_artworks_apicurio_mor.properties" \
--source-class org.apache.hudi.utilities.sources.AvroDFSSource \
--target-base-path "s3://${DATA_LAKE_BUCKET}/moma/artworks_mor/" \
--target-table moma_mor.artworks \
--schemaprovider-class org.apache.hudi.utilities.schema.SchemaRegistryProvider \
--enable-sync \
--op BULK_INSERT \
--filter-dupes