@alexeykudinkin
Created August 18, 2022 02:36
Ingesting from Apache Pulsar to Apache Hudi
################################################################################
# Step 1: Start Pulsar in standalone mode (using Docker)
################################################################################
docker run -it \
-p 6650:6650 \
-p 8080:8080 \
--mount source=pulsardata,target=/pulsar/data \
--mount source=pulsarconf,target=/pulsar/conf \
apachepulsar/pulsar:2.10.1 \
bin/pulsar standalone
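The standalone broker takes a little while to start accepting requests, so it can help to wait for it before moving on. The helper below is a minimal sketch and not part of the original gist: `retry_until` is a hypothetical name, and it simply re-runs whatever command it is given until that command succeeds or the attempts run out.

```shell
# Hypothetical helper (not from the original gist): run a command repeatedly
# until it succeeds, giving up after a fixed number of attempts.
#   $1 = maximum number of attempts
#   $2 = seconds to sleep after each failed attempt
#   remaining arguments = the command to run
retry_until() {
  retry_attempts=$1
  retry_delay=$2
  shift 2
  retry_i=0
  while [ "$retry_i" -lt "$retry_attempts" ]; do
    if "$@"; then
      return 0
    fi
    retry_i=$((retry_i + 1))
    sleep "$retry_delay"
  done
  return 1
}
```

Since `docker run -it` stays in the foreground, this would be run from a second terminal, for example as `retry_until 30 2 curl -sf http://localhost:8080/admin/v2/clusters`, assuming `curl` is installed and the admin port is mapped to 8080 as in Step 1.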
################################################################################
# Step 4: Ingest using Hudi's DeltaStreamer utility
################################################################################
export TOPIC_NAME=stonks_prod
./bin/spark-submit \
--master 'local[2]' \
--deploy-mode client \
--packages io.streamnative.connectors:pulsar-spark-connector_2.12:3.1.1.4 \
--class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \
~/code/github/apache/hudi/packaging/hudi-utilities-bundle/target/hudi-utilities-bundle_2.12-0.13.0-SNAPSHOT.jar \
--table-type COPY_ON_WRITE \
--source-class org.apache.hudi.utilities.sources.PulsarSource \
--source-ordering-field ts \
--target-base-path "file:///tmp/pulsar/${TOPIC_NAME}" \
--target-table "${TOPIC_NAME}" \
--hoodie-conf hoodie.datasource.write.recordkey.field=key \
--hoodie-conf hoodie.datasource.write.partitionpath.field=date \
--hoodie-conf hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.SimpleKeyGenerator \
--hoodie-conf "hoodie.deltastreamer.source.pulsar.topic=${TOPIC_NAME}" \
--hoodie-conf hoodie.deltastreamer.source.pulsar.offset.autoResetStrategy=EARLIEST \
--hoodie-conf hoodie.deltastreamer.source.pulsar.endpoint.service.url=pulsar://localhost:6650 \
--hoodie-conf hoodie.deltastreamer.source.pulsar.endpoint.admin.url=http://localhost:8080
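Once the DeltaStreamer run finishes, a quick sanity check is to look at the table's timeline under the target base path. The sketch below is not part of the original gist: `count_hudi_commits` is a hypothetical helper, and it assumes the pre-1.0 Hudi layout in which completed commits of a COPY_ON_WRITE table show up as `<instant>.commit` files directly under `.hoodie`.

```shell
# Hypothetical helper (not from the original gist): print the number of
# completed commits found in a Hudi table's timeline directory.
# Assumes completed commits are stored as <instant>.commit files directly
# under <base path>/.hoodie (pre-1.0 layout, COPY_ON_WRITE table).
#   $1 = table base path on the local filesystem
count_hudi_commits() {
  hoodie_dir="$1/.hoodie"
  if [ ! -d "$hoodie_dir" ]; then
    echo "no .hoodie directory under $1" >&2
    return 1
  fi
  find "$hoodie_dir" -maxdepth 1 -type f -name '*.commit' | wc -l | tr -d ' '
}
```

With the settings above this might be called as `count_hudi_commits "/tmp/pulsar/$TOPIC_NAME"`; a count of at least 1 suggests that a batch was committed, while 0 likely means no commit completed.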