The old days

  • An hourly rotated log system
  • Issues:
    • End-of-file (EOF) marker generation and propagation
    • Manual intervention required to continue after errors

First streaming architecture

https://labs.spotify.com/2016/02/25/spotifys-event-delivery-the-road-to-the-cloud-part-i/

  • Key requirement: to deliver complete data with a predictable latency and make it available to our developers via a well-defined interface.
  • Event (structured data) as the unit of streamed information.
  • syslog as the event source
  • Completeness vs. latency trade-off (Dataflow)
  • Hourly buckets
  • Avro format at the destination (Hadoop); see the schema sketch after this list
  • Check out Apache Crunch
  • Issues:
    • Tightly coupled system
    • Transmitting the data across DCs weakens the system
    • ...
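
As a concrete illustration of "Avro format at the destination", here is a minimal sketch of an event schema and its (de)serialization in Python using the fastavro library. The "SongPlayed" record and its fields are hypothetical, not taken from the posts.

```python
# Hypothetical Avro schema for a "song-played" event; field names are illustrative.
from io import BytesIO

from fastavro import parse_schema, schemaless_reader, schemaless_writer

SONG_PLAYED_SCHEMA = parse_schema({
    "namespace": "com.example.events",
    "type": "record",
    "name": "SongPlayed",
    "fields": [
        {"name": "user_id", "type": "string"},
        {"name": "track_id", "type": "string"},
        {"name": "timestamp", "type": "long"},  # epoch millis, set by the producer
    ],
})


def encode(event: dict) -> bytes:
    """Serialize one event to Avro binary (no container-file header)."""
    buf = BytesIO()
    schemaless_writer(buf, SONG_PLAYED_SCHEMA, event)
    return buf.getvalue()


def decode(payload: bytes) -> dict:
    """Deserialize one Avro-encoded event."""
    return schemaless_reader(BytesIO(payload), SONG_PLAYED_SCHEMA)
```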

The second approach

https://labs.spotify.com/2016/03/03/spotifys-event-delivery-the-road-to-the-cloud-part-ii/
https://labs.spotify.com/2016/03/10/spotifys-event-delivery-the-road-to-the-cloud-part-iii/
Reliable export of Pub/Sub streams to Cloud Storage

  • Introduce a queue for:
    • Low latency
    • Reliability
    • Persistence of undelivered messages, even in the face of errors on other systems (such as Hadoop)
  • Add structure to data early on
  • One topic per event type
  • From Kafka to Google Cloud Pub/Sub.
    • To avoid Kafka's instabilities
    • Load testing
    • Google Cloud clients are auto-generated and not necessarily efficient. Luckily, the exposed API is clear enough to write your own.
    • Batch and compress data (see the publisher sketch after this list).
  • From Hadoop + Hive to GCS + BigQuery
    • Reduce the operational burden (have others solve hard problems for us)
  • Dataflow:
    • Offers batch + streaming; Crunch is batch only
    • Windowing to partition streams by time (see the windowing sketch after this list).
    • GroupByKey works in memory
    • Watermark feature to close a window
  • Avro
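
To make the windowing and watermark points concrete, the following is a minimal Apache Beam sketch (the open-source model behind Dataflow) of hourly fixed windows over a Pub/Sub stream, written in Python. The subscription name, the event fields, and the `write_hourly_bucket` sink are hypothetical; this only illustrates the model, not the pipeline described in the posts.

```python
# Illustrative only: hourly fixed windows closed by the watermark.
# Subscription name, event fields, and the bucket writer are hypothetical.
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.transforms.window import FixedWindows


def write_hourly_bucket(kv):
    """Hypothetical sink: write one (event_type, events) group for one hour to GCS."""
    event_type, events = kv
    # e.g. serialize the events to an Avro file under
    # gs://<final-bucket>/<event_type>/<window-hour>/...
    return len(events)


def run():
    options = PipelineOptions(streaming=True)
    with beam.Pipeline(options=options) as pipeline:
        (
            pipeline
            | "Read" >> beam.io.ReadFromPubSub(
                subscription="projects/my-project/subscriptions/song-played")
            | "Parse" >> beam.Map(json.loads)  # or Avro decoding, as sketched earlier
            | "KeyByType" >> beam.Map(lambda e: (e["event_type"], e))
            # Hourly buckets: by default a window's pane is emitted once the
            # watermark passes the end of the window.
            | "HourlyWindows" >> beam.WindowInto(FixedWindows(60 * 60))
            | "GroupByKey" >> beam.GroupByKey()
            | "WriteBucket" >> beam.Map(write_hourly_bucket)
        )


if __name__ == "__main__":
    run()
```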

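On the producer side, here is a sketch of where batching and compression fit, using the standard google-cloud-pubsub Python client. The project id, the per-event-type topic naming, the zlib compression, and the `encoding` attribute are assumptions; the posts describe writing a custom client rather than using this one.

```python
# Sketch only: publish compressed event payloads to a per-event-type topic,
# letting the client batch messages. Project and topic names are hypothetical.
import zlib

from google.cloud import pubsub_v1

PROJECT_ID = "my-project"  # hypothetical

publisher = pubsub_v1.PublisherClient(
    batch_settings=pubsub_v1.types.BatchSettings(
        max_messages=500,       # flush after 500 messages...
        max_bytes=1024 * 1024,  # ...or 1 MiB of buffered payload...
        max_latency=0.05,       # ...or 50 ms, whichever comes first
    )
)


def publish(event_type: str, payload: bytes):
    """Compress one serialized event and publish it to its event-type topic."""
    topic_path = publisher.topic_path(PROJECT_ID, event_type)  # one topic per event type
    future = publisher.publish(
        topic_path,
        data=zlib.compress(payload),
        encoding="zlib",  # attribute so consumers know how to decompress
    )
    return future  # future.result() returns the Pub/Sub message id
```
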
Key takeaways

  • One topic per data type
  • Only ACK when the data is in the final destination (the final GCS bucket); see the consumer sketch after this list.
  • Bucket the data
  • Each consumer is auto-scaled based on CPU usage to optimize resource utilization.
  • Data may arrive late. When should we close a bucket, then?
  • Data, once saved, is immutable (no backfilling should occur for late events).
  • Reduce the operational burden:
    • Recover automatically from errors where possible; otherwise you can't scale
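
Finally, a minimal sketch of the "only ACK when the data is in the final destination" rule, using the google-cloud-pubsub and google-cloud-storage Python clients. The subscription, bucket, and object naming are hypothetical, and the decompression step assumes the producer compressed payloads as in the publisher sketch above. Nacking on failure leaves the message in the subscription, which is what makes automatic recovery from errors possible.

```python
# Sketch only: acknowledge a Pub/Sub message only after its payload has been
# written to the final GCS bucket. Names below are hypothetical.
import zlib

from google.cloud import pubsub_v1, storage

PROJECT_ID = "my-project"
SUBSCRIPTION = "song-played-export"
FINAL_BUCKET = "final-events-bucket"

storage_client = storage.Client()
subscriber = pubsub_v1.SubscriberClient()
subscription_path = subscriber.subscription_path(PROJECT_ID, SUBSCRIPTION)


def callback(message):
    """Write the event to the final GCS bucket, then (and only then) ack."""
    try:
        payload = zlib.decompress(message.data)  # assumes zlib, as in the publisher sketch
        blob = storage_client.bucket(FINAL_BUCKET).blob(
            f"song-played/{message.message_id}.avro")
        blob.upload_from_string(payload)  # the write to the final destination
        message.ack()                     # ack only after the write succeeded
    except Exception:
        message.nack()                    # redeliver later; no data is lost


# Blocks until cancelled; in production this would be one of many auto-scaled consumers.
future = subscriber.subscribe(subscription_path, callback=callback)
future.result()
```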