2022-07-20
Real-time data processing using Kinesis and Lambda
Intro: I'm Sarah (Twitter: serverlesssarah), working at the LEGO Group.
Today's talk is written up on Medium - implementation details there.
- why data streams
- architecture
- overview of Kinesis
examples:
- clickstream from a website, for analytics
- take data and personalise marketing emails based on what people do on the website
- IoT devices - e.g. an F1 race car: ingest sensor data and analyse it so the team can improve the car
This talk: web dev examples
Events:
- user logged in
- product A added to cart
Logic ✨
Output:
- ad campaigns
- product performance
- A/B testing
AWS here, but concepts apply elsewhere
SQS is more queueing; Kinesis is more streaming.
You can have multiple consumers on a Kinesis stream - multiple services can ingest event A. SQS doesn't allow this.
Kinesis allows replaying records - if a system goes down, you can replay.
Large capacity: you can batch thousands of records with Kinesis, but not with SQS.
The Kinesis family: Data Analytics, Data Firehose, Data Streams, Video Streams.
Firehose and Streams often get mixed up
Firehose is for loading data (e.g. into a data lake).
Streams is more about how other services might want to ingest the data.
Events come in from the website, are streamed to consumers via Kinesis Data Streams, processed in Lambda consumers, and then directed on to EventBridge, S3, or wherever.
A simple data stream is broken down into shards. Within a shard you get in-order delivery; between shards there are no ordering guarantees.
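That routing can be sketched in code: Kinesis takes the MD5 hash of each record's partition key and maps it into one shard's hash key range. A minimal Python sketch, assuming the shards split the 128-bit hash space evenly (real streams can have uneven ranges after resharding):

```python
import hashlib

def shard_for_key(partition_key: str, num_shards: int) -> int:
    # Treat the MD5 digest of the partition key as a 128-bit integer,
    # then map it into one of num_shards equally sized hash key ranges.
    hash_key = int(hashlib.md5(partition_key.encode("utf-8")).hexdigest(), 16)
    range_size = 2 ** 128 // num_shards
    return min(hash_key // range_size, num_shards - 1)

# The same key always routes to the same shard, which is what gives
# you per-key (e.g. per-user) ordering.
assert shard_for_key("user-42", 4) == shard_for_key("user-42", 4)
```

Because a key always lands on the same shard, all of one user's events stay in order relative to each other.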
Scaling bottlenecks:
- too many writes to data stream
- too many reads
Too many writes -> need to consider more shards.
Per shard: 1 MB/s or 1,000 records/s of writes. Reads: 2 MB/s or 5 reads/s (up to 10,000 records per read).
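Those write limits turn shard sizing into back-of-envelope arithmetic. A rough sketch (the helper name is ours; the constants are the per-shard limits quoted above):

```python
import math

# Per-shard write limits quoted above: 1 MB/s or 1,000 records/s.
WRITE_MB_PER_S = 1.0
WRITE_RECORDS_PER_S = 1000

def shards_needed(in_mb_per_s: float, in_records_per_s: float) -> int:
    # Whichever write limit binds first determines the shard count.
    return max(math.ceil(in_mb_per_s / WRITE_MB_PER_S),
               math.ceil(in_records_per_s / WRITE_RECORDS_PER_S),
               1)

print(shards_needed(3.5, 2500))  # 4 - the 3.5 MB/s requirement dominates
```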
Metrics are your friend: enhanced shard-level metrics.
With a single shard, the stream-level metrics you get out of the box are fine.
Set up CloudWatch alarms on the metrics we discussed; then you can find the shard causing the problem.
Increasing the number of shards: we need a partition key. We might want each user's records to land in a single shard.
Bad partitioning: e.g. partitioning on event id rather than user id.
Hot/cold shards: if you partition on something "lumpy", one shard can get overloaded while others sit underworked.
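The hot-shard effect is easy to demonstrate: compare a high-cardinality key (user id) against a "lumpy" low-cardinality one (event type). A toy simulation with made-up event volumes, using simple MD5-mod routing rather than Kinesis's real hash key ranges:

```python
import hashlib
from collections import Counter

def shard_of(key: str, shards: int) -> int:
    # Toy stand-in for Kinesis partition-key routing.
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16) % shards

# Partitioning on user id: 10,000 distinct keys spread evenly over 4 shards.
even = Counter(shard_of(f"user-{i}", 4) for i in range(10_000))

# Partitioning on event type: only two distinct keys, so at most two shards
# ever receive traffic, and one shard takes all 9,000 "login" events.
lumpy = Counter(shard_of(t, 4) for t in ["login"] * 9_000 + ["purchase"] * 1_000)

print(sorted(even.values(), reverse=True))   # four roughly equal counts
print(sorted(lumpy.values(), reverse=True))  # one hot shard, the rest cold
```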
You pay for uptime of shards.
But there's on-demand mode, which promises to be the all-serverless method - I'm not so sure.
For me, serverless means pay-as-you-go. On-demand charges you for convenience, and you always pay for at least a single shard.
It's a fairly new offering; lots of people are sticking with provisioned mode at the moment.
Still, Kinesis is the best way of doing data streams in a serverless architecture.
Problem: Lambda can't keep up with the data stream. That means:
- data not flowing in real time
- if data lives in the data stream longer than the retention period (24 hrs by default), it's lost (!)
Metrics are your friend ✨
Kinesis: GetRecords.IteratorAgeMilliseconds - you want it as close to zero as possible. Lambda: Duration.
Increase the parallelisation factor on the event source mapping: one shard can be read by up to 10 Lambdas. Processing order is still maintained - a sequence number is assigned to each event.
Horizontal scaling.
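One way that fan-out can preserve ordering is to pin every partition key to a single worker - records for a key then arrive at that worker in stream order. A sketch of the idea (not the actual event source mapping internals):

```python
import hashlib
from collections import defaultdict

def fan_out(records, factor=10):
    # records: list of (partition_key, payload) in stream order.
    # Each key is pinned to one of `factor` workers, so per-key order holds.
    workers = defaultdict(list)
    for seq, (key, payload) in enumerate(records):
        worker = int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16) % factor
        workers[worker].append((seq, key, payload))
    return workers

records = [("user-1", "login"), ("user-2", "login"),
           ("user-1", "add-to-cart"), ("user-1", "checkout")]
workers = fan_out(records)

# All of user-1's events sit on one worker, still in sequence order.
user1 = [p for w in workers.values() for _, k, p in w if k == "user-1"]
assert user1 == ["login", "add-to-cart", "checkout"]
```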
Beyond that, boost memory, which also boosts CPU.
Also batch your records: if you have millions of records coming in, you don't want millions of invocations. Batch them up instead.
Two conditions to tweak: batch size and batch window; whichever is met first invokes the Lambda.
(Q: batch window with parallelisation?)
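The "whichever is met first" rule can be sketched as a flush condition. This simulates flushing over a list of timestamped events, rather than Lambda's real timer-driven behaviour, and the function name is ours:

```python
def invocations(events, batch_size, batch_window_s):
    # events: list of (arrival_time_seconds, payload).
    # Flush the current batch when either the size limit is hit or the
    # window has elapsed - whichever comes first.
    batches, batch, window_start = [], [], 0.0
    for t, payload in events:
        if batch and t - window_start >= batch_window_s:
            batches.append(batch)   # window expired first
            batch = []
        if not batch:
            window_start = t
        batch.append(payload)
        if len(batch) >= batch_size:
            batches.append(batch)   # size limit hit first
            batch = []
    if batch:
        batches.append(batch)
    return batches

# Three events trickle in: the 2 s window flushes before the size of 100.
print(invocations([(0.0, "a"), (1.0, "b"), (6.0, "c")], 100, 2.0))
# -> [['a', 'b'], ['c']]
```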
There is documentation on how to automate the process, but it requires you to implement your own infra: CloudWatch alarms, EventBridge, Lambda, etc.
Spiky workloads - maybe on-demand is for you; make sure Lambda can cope with the load.
If you have constant flow, you might not have to tweak.
Resolving Bottlenecks of Lambda Triggered By Kinesis - part 1/2 on medium
Q: With increased parallelisation, how does it maintain order?
A: I think each record is assigned a sequence number?
AWS: I don't think you can guarantee ordering with parallelisation within a shard, but there are ways you can manage it - you may have to create the sequence number as part of the producer? Not sure on the nuance here.
Q: Presumably Kinesis comes into its own with large quantities of data and batching. When is it too much for EventBridge?
A: EventBridge is about orchestration of the system, not high-volume data streams. Not sure there's any hard limit on EventBridge.
A: Also, EventBridge is a bit more expensive.
AWS: there's about half a second of latency on EventBridge.
Q: Are batches contiguous within parallelisation?
A: Yes.
Q: Multiple consumers for one stream - I had a problem a few years ago where I couldn't define more than five consumers. Has that been fixed?
A: Still a limitation, I think.
Next talk: PowerShell on AWS Lambda. Twitter: julian_wood; Senior Developer Advocate.
"serverless is someone else's servers" - same with wireless! there are loads of wires behind the scenes!
Previously there was only Windows PowerShell.
Now there's PowerShell Core:
- cross-platform
- scripting language
- command shell
- automation and configuration tool and framework
Unix people pooh-pooh powershell.
How do you get data out of your systems? Traditionally, parsing files with sed/awk. In PowerShell, everything is already an object.
PowerShell loves reacting to events!
Why Lambda?
- run code without provisioning infra or servers
- pricing
There's been an existing .NET-based solution since 2018, using the .NET runtime. It worked well: it piggybacked on .NET being available in Lambda and compiled PowerShell code into .NET binaries.
But it could only return the last output from the pipeline, so people had to structure their code differently.
And you couldn't see your code in the Lambda console.
Managed runtimes:
- Node.js
- Python
- .NET
- Ruby
- Go
- Java
All on Amazon Linux.
Custom runtimes:
- COBOL
- Fortran
- Erlang
- PHP
- different OSes
You can use Lambda to run any code you can conceive of.
"isn't it really hard?"
Share code between functions: use Lambda layers.
- Create a blank bootstrap function on Amazon Linux 2
- The custom runtime is implemented with a Lambda layer
- The layer adds PowerShell to my function
- Create a file examplehandler.ps1
- Paste PowerShell code into it
- Specify the file as the handler
- Test with a test event
Next, you'll want to do some actual AWS stuff, using the AWS Tools for PowerShell:
- edit the code to:
  - import the AWS Tools
  - get all regions
- add a Lambda layer with the AWS Tools for PowerShell (for the dependency)
This is one function. If I had 1,000 functions, I'd use an IaC tool and share the same layer across many functions.
The custom runtime is just PowerShell files.
Lambda layers: the PowerShell custom runtime layer, plus module layers.
Build the function from: the PowerShell custom runtime layer, module layers, and your code.
Building the layers and function:
- using Linux or WSL
- using the Docker CLI
- using PowerShell for Windows
AWS SAM to test locally.
AWS CLI / SAM to deploy to Lambda.