jkingsman/kinesis.md

## kinesis.md

      
Kinesis for Log Processing


Kinesis is an AWS service family for streaming data management and analytics

Ingestion through Kinesis Streams & Kinesis Firehose
Analysis through Kinesis Data Analytics


Kinesis Streams

Stream management system
Capacity is measured in shards; each shard supports 1 MB/s (or 1,000 records/s) of writes and 2 MB/s of reads
Producers write data into the stream

Can be an EC2 instance, application, server, IoT device, whatever


Consumers receive the data

Can be EC2, Lambda, etc.


Each consumer reads from a particular shard
Consumers can store raw data, aggregates, results, or anything else in other services (S3, Redshift, etc.)
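
A rough sizing sketch for the shard model above. It assumes the per-shard limits of 1 MB/s or 1,000 records/s for writes and 2 MB/s for reads; the function name and the sample workloads are made up for illustration.

```python
import math

# Assumed per-shard limits for a provisioned stream.
WRITE_MB_PER_SHARD = 1.0         # ingest throughput per shard
WRITE_RECORDS_PER_SHARD = 1000   # ingest records per second per shard
READ_MB_PER_SHARD = 2.0          # read throughput per shard


def shards_needed(write_mb_s, records_s, read_mb_s):
    """Return the smallest shard count that satisfies all three limits."""
    by_write = math.ceil(write_mb_s / WRITE_MB_PER_SHARD)
    by_records = math.ceil(records_s / WRITE_RECORDS_PER_SHARD)
    by_read = math.ceil(read_mb_s / READ_MB_PER_SHARD)
    # A stream always has at least one shard.
    return max(1, by_write, by_records, by_read)


if __name__ == "__main__":
    # Hypothetical workload: 3.5 MB/s in, 2,000 records/s, 7 MB/s out.
    print(shards_needed(3.5, 2000, 7.0))
```

Whichever limit is hit first drives the shard count, so a stream of many tiny records can need more shards than its byte rate alone suggests.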


Kinesis Firehose

Buffering, concatenation, and transformation of the stream are supported
Loads data into Redshift, S3, Elasticsearch, or Splunk in near real time
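
To illustrate what buffering and concatenation mean here, a toy model that flushes a batch once it is big enough or old enough. This is a sketch of the idea only, not how Firehose is implemented; the class name, thresholds, and `deliver` callback are all hypothetical.

```python
import time


class BufferedDelivery:
    """Toy size-or-interval buffer; a stand-in for the concept, not the service."""

    def __init__(self, deliver, max_bytes=1024 * 1024, max_seconds=60.0,
                 clock=time.monotonic):
        self.deliver = deliver          # called with one concatenated bytes blob
        self.max_bytes = max_bytes      # flush once this many bytes are buffered
        self.max_seconds = max_seconds  # or once the oldest record is this old
        self.clock = clock
        self.buffer = []
        self.size = 0
        self.opened_at = None

    def put(self, record: bytes):
        if self.opened_at is None:
            self.opened_at = self.clock()
        self.buffer.append(record)
        self.size += len(record)
        self.flush_if_due()

    def flush_if_due(self):
        if not self.buffer:
            return
        too_big = self.size >= self.max_bytes
        too_old = self.clock() - self.opened_at >= self.max_seconds
        if too_big or too_old:
            self.flush()

    def flush(self):
        if self.buffer:
            # Concatenate the buffered records into a single delivery.
            self.deliver(b"".join(self.buffer))
        self.buffer, self.size, self.opened_at = [], 0, None
```

In this sketch the age check only runs when `put` or `flush_if_due` is called, so a real caller would also need a timer to flush a quiet buffer.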


Data can be loaded into Streams or Firehose via:

HTTPS PUT requests
Kinesis Producer Library (within code)
Kinesis Agent (log file monitoring)

The agent also handles file rotation, retries, etc.
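
A minimal producer-side sketch for the API route into Streams. It only builds request dicts shaped like the keyword arguments of boto3's `put_records` call and does not contact AWS. The stream name, the `host` partition-key field, and the function name are assumptions for illustration; only the 500-records-per-call limit is enforced.

```python
import json

MAX_RECORDS_PER_CALL = 500  # PutRecords accepts at most 500 records per request


def build_put_records_batches(stream_name, events, key_field="host"):
    """Yield keyword-argument dicts shaped for boto3's kinesis put_records()."""
    batch = []
    for event in events:
        batch.append({
            # One JSON document per record, newline-terminated.
            "Data": (json.dumps(event) + "\n").encode("utf-8"),
            # Records with the same partition key land on the same shard.
            "PartitionKey": str(event.get(key_field, "unknown")),
        })
        if len(batch) == MAX_RECORDS_PER_CALL:
            yield {"StreamName": stream_name, "Records": batch}
            batch = []
    if batch:
        yield {"StreamName": stream_name, "Records": batch}
```

With boto3 installed and credentials configured, each request could be sent with `boto3.client("kinesis").put_records(**request)`; the response's `FailedRecordCount` says how many records need a retry.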


Data can be encrypted at all times with KMS


Streams	Firehose
Customizable but more complicated	Easy and simple
Ideal for building custom analysis applications	Ideal for dumping into storage for third party analysis
Shards must be provisioned by customer	Service autoscales to meet demand
Data available to consumers in under a second	Streamed data available within ~60s
No pre-manipulation of the data	Data transformation & batching permitted (via Lambda)
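
A sketch of the Lambda transformation step on the Firehose side. Firehose hands the function a batch of base64-encoded records and expects each one back with its `recordId`, a `result` of `Ok`, `Dropped`, or `ProcessingFailed`, and the (possibly rewritten) data. The JSON log shape and the rule of dropping DEBUG lines are hypothetical.

```python
import base64
import json


def _decode(data_b64):
    """Decode one record into a dict, raising ValueError on bad input."""
    payload = json.loads(base64.b64decode(data_b64))
    if not isinstance(payload, dict):
        raise ValueError("expected a JSON object")
    return payload


def lambda_handler(event, context):
    """Return exactly one output entry per input record."""
    output = []
    for record in event["records"]:
        entry = {"recordId": record["recordId"], "data": record["data"]}
        try:
            payload = _decode(record["data"])
        except ValueError:
            # Hand unparseable input back untouched and flag it as failed.
            entry["result"] = "ProcessingFailed"
            output.append(entry)
            continue
        level = str(payload.get("level", "INFO")).upper()
        if level == "DEBUG":  # hypothetical filter rule
            entry["result"] = "Dropped"
            output.append(entry)
            continue
        # Normalize the level and re-encode as newline-terminated JSON.
        payload["level"] = level
        line = (json.dumps(payload) + "\n").encode("utf-8")
        entry["data"] = base64.b64encode(line).decode("ascii")
        entry["result"] = "Ok"
        output.append(entry)
    return {"records": output}
```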


Kinesis Analytics

Query data in real time with SQL
Write output to another stream, or to S3, Elasticsearch, etc.
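
Kinesis Analytics expresses this kind of query in SQL; the sketch below shows the same idea in plain Python instead, as a count per HTTP status over fixed one-minute windows. The record shape `(timestamp_seconds, status)` and the function name are assumptions for illustration.

```python
from collections import Counter


def tumbling_counts(records, window_seconds=60):
    """Count records per (window start, status) pair."""
    counts = Counter()
    for ts, status in records:
        # Align each timestamp to the start of its fixed-size window.
        window_start = int(ts) // window_seconds * window_seconds
        counts[(window_start, status)] += 1
    return dict(counts)


if __name__ == "__main__":
    sample = [(0, 200), (10, 200), (59, 500), (60, 200), (125, 500)]
    for key, count in sorted(tumbling_counts(sample).items()):
        print(key, count)
```

A streaming engine would emit each window's counts as the window closes rather than after the whole input has been read.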


Athena for Log Processing


Interactive query tool for S3
Uses Presto (a distributed SQL engine) for queries and Apache Hive for schema manipulation
Allows projection of a schema at read time
Since data is in S3, no need to load or aggregate data -- a bolt-on query engine
Can query encrypted S3 objects
Data protected with TLS/HTTPS in flight
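
To make "projection of a schema at read time" concrete, a small Python sketch: the raw log text stays as stored, and column names and types are applied only while reading. This mimics the idea, not Athena itself; the tab-separated layout, the column names, and the function name are hypothetical.

```python
import csv
import io

# Hypothetical schema: (column name, converter) pairs applied only when reading.
SCHEMA = [("ts", str), ("status", int), ("bytes", int), ("path", str)]


def read_with_schema(raw_text, schema=SCHEMA, delimiter="\t"):
    """Yield one dict per line, typing the fields at read time."""
    for row in csv.reader(io.StringIO(raw_text), delimiter=delimiter):
        if len(row) != len(schema):
            continue  # skip lines that do not fit the projected schema
        yield {name: convert(value)
               for (name, convert), value in zip(schema, row)}


if __name__ == "__main__":
    raw = ("2024-01-01T00:00:00Z\t200\t512\t/index.html\n"
           "2024-01-01T00:00:01Z\t500\t0\t/api\n")
    # The "query": paths of requests that returned a server error.
    print([r["path"] for r in read_with_schema(raw) if r["status"] >= 500])
```

Changing `SCHEMA` changes how the same stored bytes are interpreted, with no reload step.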