<section>
<h1>WTF Kafka Connect?!</h1>
<h2>Grant Sherrick</h2>
</section>
<section id="where-we-began">
<h2>We started by trying to get our data from Kafka to S3.</h2>
<h2 class="fragment">We ran into a few issues...</h2>
<ul>
<li class="fragment">the Dockerfile</li>
<li class="fragment">Data Flushing</li>
<li class="fragment">Persistence</li>
<li class="fragment">Partitioning</li>
<li class="fragment">Heap Size</li>
</ul>
</section>
<section id="dockerfile">
<h2>When debugging a new application, it's helpful to know how it's running.</h2>
<a href="https://hub.docker.com/r/confluentinc/cp-kafka-connect/">FROM: confluentinc/cp-kafka-connect</a><br>
<a class="fragment" href="https://docs.confluent.io/3.2.1/installation/docker/docs/development.html">How to put on your Kafka Dockers one leg at a time!</a>
</section>
<section id="flushing">
<h2>Connectors output <code class="hljs cpp" style="display: initial;">flush.size</code> records per file...</h2>
<br>
<h3 class="fragment">What happens when <code class="hljs cpp" style="display: initial;">flush.size - 1</code> records exist on a topic?</h3>
<br>
<h3 class="fragment">What happens when <code class="hljs cpp" style="display: initial;">flush.size - 1</code> records exist on each partition?</h3>
</section>
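<section data-markdown id="flushing-sketch">
<script type="text/template">
A minimal sketch of why both questions matter, assuming (as the second question hints) that the sink keeps one buffer per topic partition and only writes a file once that buffer holds `flush.size` records. `simulate_flush` and the sample records are hypothetical, not connector code.

```python
from collections import defaultdict


def simulate_flush(records, flush_size):
    """Buffer records per partition; emit a file when a buffer holds flush_size records.

    `records` is an iterable of (partition, value) pairs. Returns the files
    written and whatever is still buffered (and therefore not yet in S3).
    """
    buffers = defaultdict(list)
    files = []
    for partition, value in records:
        buffers[partition].append(value)
        if len(buffers[partition]) == flush_size:
            files.append((partition, buffers.pop(partition)))
    return files, dict(buffers)


# Three partitions, flush.size = 3, two records on each partition.
records = [(p, f"msg-{p}-{i}") for p in range(3) for i in range(2)]
files, pending = simulate_flush(records, flush_size=3)
print(len(files), "files written,", sum(len(v) for v in pending.values()), "records waiting")
```

Under that assumption, six records have been consumed and nothing has reached S3 yet.
</script>
</section>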
<section id="rotate-schedule">
<h2>Let's have this output to S3 more regularly...</h2>
<ul>
<li><a href="https://docs.confluent.io/3.2.1/connect/connect-storage-cloud/kafka-connect-s3/docs/configuration_options.html">To the docs!</a></li>
<li class="fragment"><a href="https://github.com/confluentinc/kafka-connect-storage-cloud/issues/27">WTF?</a></li>
<li class="fragment"><a href="https://docs.confluent.io/3.2.1/connect/connect-hdfs/docs/configuration_options.html#connector">To the other docs!</a></li>
</ul>
</section>
<section id="persistence">
<h2>We didn't quite understand connector persistence when we started working with Kafka Connect</h2>
<ul class="fragment">
<li>Connectors are persistent across Kafka Connect restarts.</li>
<li>Connector offsets are persistent across Kafka Connect restarts.</li>
<li>Connectors can be deleted or updated; connector names are unique.</li>
<li>Connectors can be paused; the paused state is also persistent.</li>
</ul>
</section>
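<section data-markdown id="persistence-sketch">
<script type="text/template">
The operations above map onto the Kafka Connect REST API. This sketch only builds the requests and does not send them. The worker address, the connector name `s3-sink`, and the abbreviated config are assumptions for illustration; a real config update must carry the connector's full configuration.

```python
import json
import urllib.request

CONNECT_URL = "http://localhost:8083"  # assumed address of a Connect worker


def connector_request(method, name, action=None, config=None):
    """Build (but do not send) a Kafka Connect REST request for one connector."""
    path = f"/connectors/{name}" + (f"/{action}" if action else "")
    body = json.dumps(config).encode() if config is not None else None
    headers = {"Content-Type": "application/json"} if body else {}
    return urllib.request.Request(CONNECT_URL + path, data=body, headers=headers, method=method)


pause = connector_request("PUT", "s3-sink", "pause")
resume = connector_request("PUT", "s3-sink", "resume")
# Placeholder config: a real one also needs topics, bucket, format, and so on.
update = connector_request("PUT", "s3-sink", "config", {"flush.size": "1000"})
delete = connector_request("DELETE", "s3-sink")

# To actually send one: urllib.request.urlopen(pause)
for req in (pause, resume, update, delete):
    print(req.get_method(), req.full_url)
```

Because the connector name is the key in every URL, re-submitting a config under an existing name updates that connector rather than creating a second one.
</script>
</section>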
<section id="partitioning1">
<h2>Hive-style partitioning is the default storage layout for the S3 Connector.</h2>
<pre><code>/topics/health-metrics/year=2018/month=02/day=27</code></pre>
<p>Partitions should be used to reduce the number of folders that commonly used queries need to traverse.</p>
</section>
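<section data-markdown id="partitioning1-sketch">
<script type="text/template">
A rough illustration of how a record timestamp could turn into the `key=value` directories shown above. `daily_partition_path` is a hypothetical helper, not the connector's implementation, and it assumes UTC and a `topics` top-level directory.

```python
from datetime import datetime, timezone


def daily_partition_path(topic, timestamp_ms, topics_dir="topics"):
    """Return a Hive-style key=value directory for a record timestamp (UTC)."""
    ts = datetime.fromtimestamp(timestamp_ms / 1000, tz=timezone.utc)
    return f"/{topics_dir}/{topic}/year={ts:%Y}/month={ts:%m}/day={ts:%d}"


# 2018-02-27 12:00:00 UTC in epoch milliseconds
print(daily_partition_path("health-metrics", 1519732800000))
```

A query that filters on `year`, `month`, and `day` only has to read the matching directories.
</script>
</section>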
<section id="partitioning2">
<h2>Custom partitioning can be performed on any record field or Kafka-related data</h2>
<h4>This requires extending the DefaultPartitioner</h4>
</section>
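<section data-markdown id="partitioning2-sketch">
<script type="text/template">
The real partitioner is a Java class inside the connector. This Python sketch only illustrates the logic such a class would carry: pick a field from the record and encode it as a `key=value` directory. All function names and the sample record are hypothetical.

```python
def encode_partition(record, field):
    """Map a record (a plain dict here) to a key=value directory segment."""
    if field not in record:
        raise KeyError(f"record has no field {field!r} to partition on")
    return f"{field}={record[field]}"


def object_prefix(topic, record, field, topics_dir="topics"):
    """Join the topic directory with the encoded partition segment."""
    return f"/{topics_dir}/{topic}/{encode_partition(record, field)}"


print(object_prefix("health-metrics", {"host": "web-01", "cpu": 0.42}, "host"))
```

Failing loudly on a missing field is a design choice in this sketch; a production partitioner might route such records to a fallback directory instead.
</script>
</section>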
<section id="heap-size">
<h2>We've had a few issues with overrunning the heap size.</h2>
<h4 class="fragment">This is dependent on a great number of things:</h4>
<ul class="fragment">
<li>Num Topic Partitions/Tasks</li>
<li>Message Size</li>
<li>Other Connectors</li>
<li>s3.part.size (records cannot be split!)</li>
</ul>
<pre class="fragment"><code class="hljs nohighlight">Heap max size >= s3.part.size * num active partitions + (at least) 100 MB</code></pre>
</section>
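<section data-markdown id="heap-size-sketch">
<script type="text/template">
The rule of thumb above, worked through with illustrative numbers. The 25 MiB part size and 8 active partitions are assumptions for the example, and the overhead is treated as 100 MiB; `min_heap_bytes` is a hypothetical helper.

```python
MIB = 1024 * 1024


def min_heap_bytes(s3_part_size, active_partitions, overhead=100 * MIB):
    """Lower bound from the slide: one part-sized buffer per active partition, plus overhead."""
    return s3_part_size * active_partitions + overhead


# Example: 25 MiB parts, 8 active partitions
needed = min_heap_bytes(25 * MIB, 8)
print(f"-Xmx{needed // MIB}m or larger")
```

This is a floor, not a target: large messages and other connectors on the same worker add to it.
</script>
</section>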
<section id="conclusion">
<h2>Thanks!</h2>
</section>
<section id="useful-links">
<h2>Some Useful Links:</h2>
<ul>
<li><a href="https://speakerdeck.com/rmoff/real-time-data-integration-at-scale-with-kafka-connect-dublin-apache-kafka-meetup-04-jul-2017">Some in depth Kafka Connect slides</a></li>
<li><a href="http://kafka.apache.org/documentation.html#connect_transforms"><b>Solid</b> docs on Kafka Connect Transforms</a></li>
<li><a href="https://www.confluent.io/blog/simplest-useful-kafka-connect-data-pipeline-world-thereabouts-part-3/">An Example of Using Transforms IRL</a></li>
<li><a href="https://stackoverflow.com/questions/44014975/kafka-consumer-api-vs-stream-api">Streams vs Consumers vs Producers</a></li>
<li><a href="https://www.confluent.io/wp-content/uploads/confluent-kafka-definitive-guide-complete.pdf">The definitive guide to Kafka (a whole chapter on Kafka Connect!)</a></li>
<li><a href="https://cwiki.apache.org/confluence/display/Hive/Design#Design-HiveDataModel">On Kafka Connect (a.k.a. Hive) Partitions</a></li>
</ul>
</section>