The old days

  • An hourly rotated log system
  • Issues:
    • End-of-file (EOF) marker generation and propagation
    • Manual intervention required to continue after errors

First streaming architecture

https://labs.spotify.com/2016/02/25/spotifys-event-delivery-the-road-to-the-cloud-part-i/

  • Key requirement: to deliver complete data with a predictable latency and make it available to our developers via a well-defined interface.
  • Event (structured data) as the unit of streamed information.
  • syslog as the event source
  • Completeness vs. latency trade-off (Dataflow)
  • Hourly buckets
  • Avro format at the destination (Hadoop); see the schema sketch after this list
  • Check out Apache Crunch
  • Issues:
    • Tightly coupled system
    • Transmitting the data across DCs weakens the system
    • ...
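
As a concrete illustration of "Avro format at the destination", here is a minimal sketch of an event schema and its (de)serialization in Python using the fastavro library. The "SongPlayed" record and its fields are hypothetical, not taken from the posts.

```python
# Hypothetical Avro schema for a "song-played" event; field names are illustrative.
from io import BytesIO

from fastavro import parse_schema, schemaless_reader, schemaless_writer

SONG_PLAYED_SCHEMA = parse_schema({
    "namespace": "com.example.events",
    "type": "record",
    "name": "SongPlayed",
    "fields": [
        {"name": "user_id", "type": "string"},
        {"name": "track_id", "type": "string"},
        {"name": "timestamp", "type": "long"},  # epoch millis, set by the producer
    ],
})


def encode(event: dict) -> bytes:
    """Serialize one event to Avro binary (no container-file header)."""
    buf = BytesIO()
    schemaless_writer(buf, SONG_PLAYED_SCHEMA, event)
    return buf.getvalue()


def decode(payload: bytes) -> dict:
    """Deserialize one Avro-encoded event."""
    return schemaless_reader(BytesIO(payload), SONG_PLAYED_SCHEMA)
```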

The second approach

https://labs.spotify.com/2016/03/03/spotifys-event-delivery-the-road-to-the-cloud-part-ii/
https://labs.spotify.com/2016/03/10/spotifys-event-delivery-the-road-to-the-cloud-part-iii/
Reliable export of Pub/Sub streams to Cloud Storage

  • Introduce a queue for:
    • Low latency
    • Reliability
    • Persistence of undelivered messages, even in the face of errors on other systems (such as Hadoop)
  • Add structure to data early on
  • One topic per event type
  • From Kafka to Google Cloud Pub/Sub.
    • To avoid Kafka's instabilities
    • Load testing
    • Google Cloud clients are auto-generated and not necessarily efficient. Luckily, the exposed API is clear enough to write your own.
    • Batch and compress data (see the publisher sketch after this list).
  • From Hadoop + Hive to GCS + BigQuery
    • Reduce the operational burden (have others solve hard problems for us)
  • Dataflow:
    • Offers batch + streaming; Crunch is batch only
    • Windowing to partition streams by time (see the windowing sketch after this list).
    • GroupByKey works in memory
    • Watermark feature to close a window
  • Avro
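
To make the windowing and watermark points concrete, the following is a minimal Apache Beam sketch (the open-source model behind Dataflow) of hourly fixed windows over a Pub/Sub stream, written in Python. The subscription name, the event fields, and the `write_hourly_bucket` sink are hypothetical; this only illustrates the model, not the pipeline described in the posts.

```python
# Illustrative only: hourly fixed windows closed by the watermark.
# Subscription name, event fields, and the bucket writer are hypothetical.
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.transforms.window import FixedWindows


def write_hourly_bucket(kv):
    """Hypothetical sink: write one (event_type, events) group for one hour to GCS."""
    event_type, events = kv
    # e.g. serialize the events to an Avro file under
    # gs://<final-bucket>/<event_type>/<window-hour>/...
    return len(events)


def run():
    options = PipelineOptions(streaming=True)
    with beam.Pipeline(options=options) as pipeline:
        (
            pipeline
            | "Read" >> beam.io.ReadFromPubSub(
                subscription="projects/my-project/subscriptions/song-played")
            | "Parse" >> beam.Map(json.loads)  # or Avro decoding, as sketched earlier
            | "KeyByType" >> beam.Map(lambda e: (e["event_type"], e))
            # Hourly buckets: by default a window's pane is emitted once the
            # watermark passes the end of the window.
            | "HourlyWindows" >> beam.WindowInto(FixedWindows(60 * 60))
            | "GroupByKey" >> beam.GroupByKey()
            | "WriteBucket" >> beam.Map(write_hourly_bucket)
        )


if __name__ == "__main__":
    run()
```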

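On the producer side, here is a sketch of where batching and compression fit, using the standard google-cloud-pubsub Python client. The project id, the per-event-type topic naming, the zlib compression, and the `encoding` attribute are assumptions; the posts describe writing a custom client rather than using this one.

```python
# Sketch only: publish compressed event payloads to a per-event-type topic,
# letting the client batch messages. Project and topic names are hypothetical.
import zlib

from google.cloud import pubsub_v1

PROJECT_ID = "my-project"  # hypothetical

publisher = pubsub_v1.PublisherClient(
    batch_settings=pubsub_v1.types.BatchSettings(
        max_messages=500,       # flush after 500 messages...
        max_bytes=1024 * 1024,  # ...or 1 MiB of buffered payload...
        max_latency=0.05,       # ...or 50 ms, whichever comes first
    )
)


def publish(event_type: str, payload: bytes):
    """Compress one serialized event and publish it to its event-type topic."""
    topic_path = publisher.topic_path(PROJECT_ID, event_type)  # one topic per event type
    future = publisher.publish(
        topic_path,
        data=zlib.compress(payload),
        encoding="zlib",  # attribute so consumers know how to decompress
    )
    return future  # future.result() returns the Pub/Sub message id
```
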
Key takeaways

  • One topic per data type
  • Only ACK when the data is in the final destination (the final GCS bucket); see the consumer sketch after this list.
  • Bucket the data
  • Each consumer is auto-scaled based on CPU usage to optimize resource utilization.
  • Data may arrive late. When should we close a bucket, then?
  • Data, once saved, is immutable (no backfilling should occur for late events).
  • Reduce the operational burden:
    • Recover automatically from errors where possible; otherwise you can't scale
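
Finally, a minimal sketch of the "only ACK when the data is in the final destination" rule, using the google-cloud-pubsub and google-cloud-storage Python clients. The subscription, bucket, and object naming are hypothetical, and the decompression step assumes the producer compressed payloads as in the publisher sketch above. Nacking on failure leaves the message in the subscription, which is what makes automatic recovery from errors possible.

```python
# Sketch only: acknowledge a Pub/Sub message only after its payload has been
# written to the final GCS bucket. Names below are hypothetical.
import zlib

from google.cloud import pubsub_v1, storage

PROJECT_ID = "my-project"
SUBSCRIPTION = "song-played-export"
FINAL_BUCKET = "final-events-bucket"

storage_client = storage.Client()
subscriber = pubsub_v1.SubscriberClient()
subscription_path = subscriber.subscription_path(PROJECT_ID, SUBSCRIPTION)


def callback(message):
    """Write the event to the final GCS bucket, then (and only then) ack."""
    try:
        payload = zlib.decompress(message.data)  # assumes zlib, as in the publisher sketch
        blob = storage_client.bucket(FINAL_BUCKET).blob(
            f"song-played/{message.message_id}.avro")
        blob.upload_from_string(payload)  # the write to the final destination
        message.ack()                     # ack only after the write succeeded
    except Exception:
        message.nack()                    # redeliver later; no data is lost


# Blocks until cancelled; in production this would be one of many auto-scaled consumers.
future = subscriber.subscribe(subscription_path, callback=callback)
future.result()
```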