@jkingsman
Created June 26, 2020 03:32
Kinesis for Log Processing
  • Kinesis is streaming data management and analytics
    • Ingestion through Kinesis Streams & Kinesis Firehose
    • Analysis through Kinesis Data Analytics
  • Kinesis Streams
    • Stream management system
    • Capacity measured in shards; each shard handles 1 MB/s (or 1,000 records/s) of writes and 2 MB/s of reads
    • Producers write data into stream
      • Can be an EC2 instance, application, server, IoT device, whatever
    • Consumers receive the data
      • Can be EC2, Lambda, etc.
    • Each consumer instance reads from specific shards (with the Kinesis Client Library, each shard is handled by one worker per application)
    • Consumers can store data, aggregated data, results, whatever they want in other services (S3, Redshift, etc.)
  • Kinesis Firehose
    • Stream buffering/concatenation/transformation is permitted!
    • Loads data into Redshift/S3/Elasticsearch/Splunk in near real time
  • Can load into Streams or Firehose via:
    • HTTPS API calls (PutRecord/PutRecords for Streams, PutRecord/PutRecordBatch for Firehose)
    • Kinesis Producer Library (KPL, within code)
    • Kinesis Agent (log file monitoring)
      • Also handles rotation, retry, etc.
  • Data can be encrypted at all times with KMS
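
As an illustration of the HTTPS API path for loading a stream, here is a minimal Python sketch of batching log lines into `PutRecords` calls and resending only the records that failed. It assumes a client object shaped like boto3's Kinesis client (for example `boto3.client("kinesis")`); the helper names `chunked` and `put_log_lines` and the retry count are hypothetical, and this is not the Kinesis Producer Library or the Kinesis Agent.

```python
MAX_BATCH = 500  # PutRecords accepts at most 500 records per call


def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def put_log_lines(client, stream_name, entries, max_attempts=3):
    """Send (partition_key, line) pairs to a stream, resending failed records.

    `client` is assumed to expose boto3's Kinesis `put_records` call.
    Returns the entries that still failed after `max_attempts` tries.
    """
    failed = []
    for batch in chunked(list(entries), MAX_BATCH):
        pending = batch
        for _ in range(max_attempts):
            response = client.put_records(
                StreamName=stream_name,
                Records=[
                    {"Data": line.encode("utf-8"), "PartitionKey": key}
                    for key, line in pending
                ],
            )
            if response.get("FailedRecordCount", 0) == 0:
                pending = []
                break
            # Results come back in request order; failed ones carry an ErrorCode.
            pending = [
                entry
                for entry, result in zip(pending, response["Records"])
                if "ErrorCode" in result
            ]
        failed.extend(pending)
    return failed
```

The partition key decides which shard a record lands on, so a high-cardinality key such as a host name or request ID spreads load more evenly. A production version would likely sleep with backoff between attempts; that is left out here to keep the sketch short.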

| Streams | Firehose |
| --- | --- |
| Customizable but more complicated | Easy and simple |
| Ideal for building custom analysis applications | Ideal for dumping into storage for third party analysis |
| Shards must be provisioned by customer | Service autoscales to meet demand |
| Data available to consumers in under a second | Streamed data available within ~60s |
| No pre-manipulation of the data | Data transformation & batching permitted (via Lambda) |
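
To make the Firehose transformation step concrete, here is a rough Python sketch of a Lambda handler in the shape Firehose expects: each incoming record carries a `recordId` and base64 `data`, and each outgoing record echoes the `recordId` with a `result` of `Ok`, `Dropped`, or `ProcessingFailed`. The `level` and `processed` fields and the drop-DEBUG rule are made up for illustration.

```python
import base64
import json


def _result(record, status, data=None):
    """Build one output entry; unchanged input data is echoed back by default."""
    return {
        "recordId": record["recordId"],
        "result": status,
        "data": record["data"] if data is None else data,
    }


def handler(event, context):
    """Tag JSON log records, drop DEBUG ones, and flag anything unparsable."""
    output = []
    for record in event["records"]:
        try:
            payload = json.loads(base64.b64decode(record["data"]))
        except ValueError:
            payload = None
        if not isinstance(payload, dict):
            # Not a JSON object: let Firehose route it to its error output.
            output.append(_result(record, "ProcessingFailed"))
        elif payload.get("level") == "DEBUG":  # hypothetical field and rule
            output.append(_result(record, "Dropped"))
        else:
            payload["processed"] = True  # hypothetical marker field
            line = json.dumps(payload) + "\n"
            encoded = base64.b64encode(line.encode("utf-8")).decode("ascii")
            output.append(_result(record, "Ok", encoded))
    return {"records": output}
```

Appending a newline to each transformed record keeps the objects that Firehose concatenates into S3 readable one line at a time.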
  • Kinesis Analytics
    • Query data in real time with SQL
    • Store output as a different stream, to S3/ES/etc.
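
Because Streams capacity has to be provisioned in shards, a back-of-the-envelope sizing helper can be useful. The Python sketch below assumes the per-shard limits AWS documents for provisioned streams (1 MB/s or 1,000 records/s in, 2 MB/s out shared by standard consumers) and assumes every consumer reads the full stream; the function name `shards_needed` is hypothetical.

```python
import math

WRITE_MB_PER_SHARD = 1.0        # ingest limit per shard, MB/s
WRITE_RECORDS_PER_SHARD = 1000  # ingest limit per shard, records/s
READ_MB_PER_SHARD = 2.0         # read limit per shard, MB/s (shared by consumers)


def shards_needed(write_mb_per_s, records_per_s, consumers=1):
    """Estimate the shard count that satisfies write and read limits."""
    # Assumption: each consumer application reads every byte written.
    read_mb_per_s = write_mb_per_s * consumers
    return max(
        1,  # a stream always has at least one shard
        math.ceil(write_mb_per_s / WRITE_MB_PER_SHARD),
        math.ceil(records_per_s / WRITE_RECORDS_PER_SHARD),
        math.ceil(read_mb_per_s / READ_MB_PER_SHARD),
    )
```

For example, 3.5 MB/s of logs at 2,000 records/s with one consumer is bound by write bandwidth, while many small records or several consumers can make the record-rate or read limit the binding one instead.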
Athena for Log Processing
  • Interactive query tool for S3
  • Uses Presto (a distributed SQL engine) to run queries and Apache Hive DDL to define and alter schemas
  • Projects a schema onto the data at read time (schema-on-read)
  • Since the data already lives in S3, there is no need to load or aggregate it first; Athena acts as a bolt-on query engine
  • Can query encrypted S3 objects
  • Data protected with TLS/HTTPS in flight
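
As a sketch of schema-on-read in practice, the Python below defines an external table over JSON log files already sitting in S3 and runs a statement through a client shaped like boto3's Athena client (for example `boto3.client("athena")`). The table name `app_logs`, its columns, the bucket paths, and the helper `run_query` are all hypothetical; the SerDe assumes one JSON object per line.

```python
import time

# Hive-style DDL: describes files in place, nothing is loaded or copied.
CREATE_LOGS_TABLE = """
CREATE EXTERNAL TABLE IF NOT EXISTS app_logs (
  ts string,
  level string,
  message string
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://example-log-bucket/app/'
"""

TERMINAL_STATES = {"SUCCEEDED", "FAILED", "CANCELLED"}


def run_query(client, sql, database, output_location, poll_seconds=1.0):
    """Start an Athena query and poll until it reaches a terminal state.

    Returns (query_execution_id, final_state).
    """
    started = client.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )
    query_id = started["QueryExecutionId"]
    while True:
        execution = client.get_query_execution(QueryExecutionId=query_id)
        state = execution["QueryExecution"]["Status"]["State"]
        if state in TERMINAL_STATES:
            return query_id, state
        time.sleep(poll_seconds)
```

After the table exists, the same helper could run something like `SELECT level, count(*) FROM app_logs GROUP BY level`; results land as files under the output location, which is itself an S3 path.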