2022-07-20
Real-time data processing using Kinesis and Lambda
Intro: I'm Sarah (Twitter: serverlesssarah), working at the LEGO Group.
Today's talk is written up on Medium - implementation details there.
- why data streams
- architecture
- overview of Kinesis
examples:
- clickstream from a website, for analytics
- take data and personalise marketing emails based on what people do on the website
- IoT devices - e.g. an F1 race car: ingest sensor data and analyse it so the team can improve the car
This talk: web dev examples
Events:
- user logged in
- product A added to cart
Logic ✨
Output:
- ad campaigns
- product performance
- A/B testing
AWS here, but concepts apply elsewhere
SQS is more queueing; Kinesis is more streaming.
You can have multiple consumers on a Kinesis stream - multiple services can ingest event A. SQS doesn't allow this.
Kinesis allows replaying records - if a system goes down, you can replay.
Large capacity: you can batch thousands of records with Kinesis, but not with SQS.
The Kinesis family: Data Analytics, Data Firehose, Data Streams, Video Streams.
Firehose and Streams often get mixed up
Firehose is for loading data (e.g. into a data lake).
Streams is more about how other services might want to ingest the data.
Events come in from the website, are streamed to consumers via Kinesis Data Streams, processed in Lambda consumers, and then directed on to EventBridge, S3, or wherever.
A simple data stream is broken down into shards. Within a shard you get in-order delivery; between shards there are no ordering guarantees.
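That routing can be sketched in code: Kinesis takes the MD5 hash of each record's partition key and maps it into one shard's hash key range. A minimal Python sketch, assuming the shards split the 128-bit hash space evenly (real streams can have uneven ranges after resharding):

```python
import hashlib

def shard_for_key(partition_key: str, num_shards: int) -> int:
    # Treat the MD5 digest of the partition key as a 128-bit integer,
    # then map it into one of num_shards equally sized hash key ranges.
    hash_key = int(hashlib.md5(partition_key.encode("utf-8")).hexdigest(), 16)
    range_size = 2 ** 128 // num_shards
    return min(hash_key // range_size, num_shards - 1)

# The same key always routes to the same shard, which is what gives
# you per-key (e.g. per-user) ordering.
assert shard_for_key("user-42", 4) == shard_for_key("user-42", 4)
```

Because a key always lands on the same shard, all of one user's events stay in order relative to each other.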
Scaling bottlenecks:
- too many writes to data stream
- too many reads
Too many writes -> need to consider more shards.
Per shard: 1 MB/s or 1,000 records/s of writes. Reads: 2 MB/s or 5 reads/s (up to 10,000 records per read).
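Those write limits turn shard sizing into back-of-envelope arithmetic. A rough sketch (the helper name is ours; the constants are the per-shard limits quoted above):

```python
import math

# Per-shard write limits quoted above: 1 MB/s or 1,000 records/s.
WRITE_MB_PER_S = 1.0
WRITE_RECORDS_PER_S = 1000

def shards_needed(in_mb_per_s: float, in_records_per_s: float) -> int:
    # Whichever write limit binds first determines the shard count.
    return max(math.ceil(in_mb_per_s / WRITE_MB_PER_S),
               math.ceil(in_records_per_s / WRITE_RECORDS_PER_S),
               1)

print(shards_needed(3.5, 2500))  # 4 - the 3.5 MB/s requirement dominates
```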
Metrics are your friend: enhanced shard-level metrics.
With a single shard, the stream-level metrics you get out of the box are fine.
Set up CloudWatch alarms on the metrics we discussed; then you can find the shard causing the problem.
Increasing the number of shards: we need a partition key. We might want each user's records to land in a single shard.
Bad partitioning: e.g. partitioning on event id rather than user id.
Hot/cold shards: if you partition on something "lumpy", one shard can get overloaded while others sit underworked.
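The hot-shard effect is easy to demonstrate: compare a high-cardinality key (user id) against a "lumpy" low-cardinality one (event type). A toy simulation with made-up event volumes, using simple MD5-mod routing rather than Kinesis's real hash key ranges:

```python
import hashlib
from collections import Counter

def shard_of(key: str, shards: int) -> int:
    # Toy stand-in for Kinesis partition-key routing.
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16) % shards

# Partitioning on user id: 10,000 distinct keys spread evenly over 4 shards.
even = Counter(shard_of(f"user-{i}", 4) for i in range(10_000))

# Partitioning on event type: only two distinct keys, so at most two shards
# ever receive traffic, and one shard takes all 9,000 "login" events.
lumpy = Counter(shard_of(t, 4) for t in ["login"] * 9_000 + ["purchase"] * 1_000)

print(sorted(even.values(), reverse=True))   # four roughly equal counts
print(sorted(lumpy.values(), reverse=True))  # one hot shard, the rest cold
```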
You pay for uptime of shards.
But there's on-demand mode, which promises to be the all-serverless method - I'm not so sure.
For me, serverless means pay-as-you-go. On-demand charges you for convenience, and you always pay for at least a single shard.
It's a fairly new offering; lots of people are sticking with provisioned mode at the moment.
Still, Kinesis is the best way of doing data streams in a serverless architecture.
Problem: Lambda can't keep up with the data stream. That means:
- data not flowing in real time
- if data lives in the data stream longer than the retention period (24 hrs by default), it's lost (!)
Metrics are your friend ✨
Kinesis: GetRecords.IteratorAgeMilliseconds - you want it as close to zero as possible. Lambda: Duration.
Increase the parallelisation factor on the event source mapping: one shard can be read by up to 10 Lambdas. Processing order is still maintained - a sequence number is assigned to each event.
Horizontal scaling.
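One way that fan-out can preserve ordering is to pin every partition key to a single worker - records for a key then arrive at that worker in stream order. A sketch of the idea (not the actual event source mapping internals):

```python
import hashlib
from collections import defaultdict

def fan_out(records, factor=10):
    # records: list of (partition_key, payload) in stream order.
    # Each key is pinned to one of `factor` workers, so per-key order holds.
    workers = defaultdict(list)
    for seq, (key, payload) in enumerate(records):
        worker = int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16) % factor
        workers[worker].append((seq, key, payload))
    return workers

records = [("user-1", "login"), ("user-2", "login"),
           ("user-1", "add-to-cart"), ("user-1", "checkout")]
workers = fan_out(records)

# All of user-1's events sit on one worker, still in sequence order.
user1 = [p for w in workers.values() for _, k, p in w if k == "user-1"]
assert user1 == ["login", "add-to-cart", "checkout"]
```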
Beyond that, boost memory, which also boosts CPU.
Also batch your records: if you have millions of records coming in, you don't want millions of invocations. Batch them up instead.
Two conditions to tweak: batch size and batch window; whichever is met first invokes the Lambda.
(Q: batch window with parallelisation?)
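The "whichever is met first" rule can be sketched as a flush condition. This simulates flushing over a list of timestamped events, rather than Lambda's real timer-driven behaviour, and the function name is ours:

```python
def invocations(events, batch_size, batch_window_s):
    # events: list of (arrival_time_seconds, payload).
    # Flush the current batch when either the size limit is hit or the
    # window has elapsed - whichever comes first.
    batches, batch, window_start = [], [], 0.0
    for t, payload in events:
        if batch and t - window_start >= batch_window_s:
            batches.append(batch)   # window expired first
            batch = []
        if not batch:
            window_start = t
        batch.append(payload)
        if len(batch) >= batch_size:
            batches.append(batch)   # size limit hit first
            batch = []
    if batch:
        batches.append(batch)
    return batches

# Three events trickle in: the 2 s window flushes before the size of 100.
print(invocations([(0.0, "a"), (1.0, "b"), (6.0, "c")], 100, 2.0))
# -> [['a', 'b'], ['c']]
```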
There is documentation on how to automate the process, but it requires you to implement your own infra: CloudWatch alarms, EventBridge, Lambda, etc.
Spiky workloads - maybe on-demand is for you; make sure Lambda can cope with the load.
If you have constant flow, you might not have to tweak.
Resolving Bottlenecks of Lambda Triggered By Kinesis - part 1/2 on medium
Q: With increased parallelisation, how does it maintain order?
A: I think each record is assigned a sequence number?
AWS: I don't think you can guarantee ordering with parallelisation within a shard, but there are ways you can manage it - you may have to create the sequence number as part of the producer? Not sure on the nuance here.
Q: Presumably Kinesis comes into its own with large quantities of data and batching. When is it too much for EventBridge?
A: EventBridge is about orchestration of the system, not high-volume data streams. Not sure there's any hard limit on EventBridge.
A: Also, EventBridge is a bit more expensive.
AWS: there's about half a second of latency on EventBridge.
Q: Are batches contiguous within parallelisation?
A: Yes.
Q: Multiple consumers for one stream - I had a problem a few years ago where I couldn't define more than five consumers. Has that been fixed?
A: Still a limitation, I think.
Next talk: PowerShell on AWS Lambda. Twitter: julian_wood; Senior Developer Advocate.
"serverless is someone else's servers" - same with wireless! there are loads of wires behind the scenes!
Previously there was only Windows PowerShell.
Now there's PowerShell Core:
- cross-platform
- scripting language
- command shell
- automation and configuration tool and framework
Unix people pooh-pooh powershell.
How do you get data out of your systems? Traditionally, parsing files with sed/awk. In PowerShell, everything is already an object.
PowerShell loves reacting to events!
Why Lambda?
- run code without provisioning infra or servers
- pricing
There's been an existing .NET-based solution since 2018, using the .NET runtime. It worked well: it piggybacked on .NET being available in Lambda and compiled PowerShell code into .NET binaries.
But it could only return the last output from the pipeline, so people had to structure their code differently.
And you couldn't see your code in the Lambda console.
Managed runtimes:
- Node.js
- Python
- .NET
- Ruby
- Go
- Java
All on Amazon Linux.
Custom runtimes:
- COBOL
- Fortran
- Erlang
- PHP
- different OSes
You can use Lambda to run any code you can conceive of.
"isn't it really hard?"
Share code between functions: use Lambda layers.
- Create a blank bootstrap function on Amazon Linux 2
- The custom runtime is implemented with a Lambda layer
- The layer adds PowerShell to my function
- Create a file examplehandler.ps1
- Paste PowerShell code into it
- Specify the file as the handler
- Test with a test event
Next, you'll want to do some actual AWS stuff, using the AWS Tools for PowerShell:
- edit the code to:
  - import the AWS Tools
  - get all regions
- add a Lambda layer with the AWS Tools for PowerShell (for the dependency)
This is one function. If I had 1,000 functions, I'd use an IaC tool and share the same layer across many functions.
The custom runtime is just PowerShell files.
Lambda layers: the PowerShell custom runtime layer, plus module layers.
Build the function from: the PowerShell custom runtime layer, module layers, and your code.
Building the layers and function:
- using Linux or WSL
- using the Docker CLI
- using PowerShell for Windows
AWS SAM to test locally.
AWS CLI / SAM to deploy to Lambda.