@philandstuff
Created July 21, 2022 07:33
Serverless London July 2022 notes

Serverless London Meetup

2022-07-20

Sarah, Lego group, Kinesis

Real-time data processing using kinesis and lambda

Intro: I'm Sarah, (twitter: serverlesssarah), work at Lego group.

I've written up today's talk on Medium; the implementation detail is there.

Schedule

  • why data streams
  • architecture
  • overview of kinesis

Why data streams?

examples:

  • click stream from website for analytics
  • take data and personalize marketing emails based on what people do on the website
  • IOT devices - eg F1 racecar, taking in sensor data, analysing so team can improve car

This talk: web dev examples

how

Events:

  • user logged in
  • product A added to cart

Logic ✨

Output:

  • ad campaigns
  • product performance
  • A/B testing

leverage the cloud

AWS here, but concepts apply elsewhere

SQS more queueing, Kinesis more streaming

Can have multiple consumers on a kinesis stream. Multiple services can ingest event A - SQS doesn't allow this.

Kinesis allows replaying records - if system goes down, you can replay.

Large capacity: you can batch thousands of records with Kinesis, not with SQS

But which kinesis? 🤔

  • Data Analytics
  • Data Firehose
  • Data Streams
  • Video Streams

Firehose and Streams often get mixed up

Firehose is loading data (eg into a data lake)

Streams is more about how other services might want to ingest the data

Architecture

Events come in from the website and are streamed to consumers with Kinesis Data Streams. Lambda consumers process them and forward the results to EventBridge, S3 or wherever.
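A minimal sketch of what a Lambda consumer in this architecture might look like. It relies on the standard shape of the Kinesis event that Lambda delivers (a `Records` list with base64-encoded `kinesis.data`); the assumption that each payload is JSON, and the event fields used in the usage note, are illustrative and not from the talk.

```python
import base64
import json


def handler(event, context):
    """Decode each Kinesis record in the batch and return the parsed payloads."""
    payloads = []
    for record in event["Records"]:
        # Kinesis record data arrives base64-encoded inside the Lambda event.
        raw = base64.b64decode(record["kinesis"]["data"])
        # Assumption: producers write JSON documents.
        payloads.append(json.loads(raw))
    # Forwarding to EventBridge, S3, etc. would happen here (omitted).
    return payloads
```

Calling `handler` with a hand-built event containing one record whose data is the base64 of `{"type": "user_logged_in"}` returns a one-element list with that dictionary.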

Simple data stream: broken down into shards. Within shard, in-order delivery; multiple shards have no order guarantees between shards.

Scaling bottlenecks:

  • too many writes to data stream
  • too many reads

Too many writes -> need to consider more shards.

Per shard, writes: up to 1 MB/s or 1,000 records/s. Reads: up to 2 MB/s, at most 5 GetRecords calls/s (with a maximum of 10,000 records per call).
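The write limits above turn into a quick back-of-the-envelope shard estimate. This is a rough sketch, not a sizing tool: the function name is made up, 1 MB is treated as 10^6 bytes, and it assumes traffic is spread evenly across shards (which a lumpy partition key will break).

```python
import math

# Per-shard write limits quoted in the talk.
MAX_WRITE_BYTES_PER_S = 1_000_000  # assumption: 1 MB taken as 10^6 bytes
MAX_WRITE_RECORDS_PER_S = 1_000


def shards_needed(records_per_s, avg_record_bytes):
    """Estimate the shard count needed to absorb a steady, evenly spread write load."""
    by_count = math.ceil(records_per_s / MAX_WRITE_RECORDS_PER_S)
    by_bytes = math.ceil(records_per_s * avg_record_bytes / MAX_WRITE_BYTES_PER_S)
    # Whichever limit is hit first decides; a stream always has at least one shard.
    return max(1, by_count, by_bytes)
```

For example, 3,000 small records/s is bound by the record-count limit, while 500 records/s of 5 KB each is bound by the byte limit; both come out at 3 shards.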

Metrics are your friend: enhanced shard-level metrics.

For a single shard, you get stream-level metrics out of the box which is fine.

Set up CloudWatch alarms on the metrics we discussed. Then you can find the shard causing the problem.
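Once shard-level metrics are enabled, spotting the problem shard is mostly a comparison against the per-shard write limit. A small sketch, assuming the per-minute `IncomingBytes` sums for each shard have already been fetched from CloudWatch into a dictionary (the fetching is omitted, and the function name and 80% threshold are arbitrary choices for illustration):

```python
# Per-shard write limit from the talk (1 MB/s), expressed per one-minute datapoint.
WRITE_LIMIT_BYTES_PER_MIN = 1_000_000 * 60


def hot_shards(incoming_bytes_by_shard, threshold=0.8):
    """Return shard ids whose busiest minute exceeded `threshold` of the write limit."""
    limit = WRITE_LIMIT_BYTES_PER_MIN * threshold
    return sorted(
        shard_id
        for shard_id, datapoints in incoming_bytes_by_shard.items()
        if datapoints and max(datapoints) > limit
    )
```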

Increasing the number of shards: we need a partition key to decide which shard each record goes to. We might want each user's records to be in a single shard.

Bad partitioning: eg on event id rather than user id.

Hot/cold shards: if you partition on something that is "lumpy", one shard can get overloaded but others are underworked.
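To see why the choice of key matters: Kinesis hashes the partition key with MD5 into a 128-bit integer and routes the record to the shard whose hash key range contains it. The sketch below imitates that locally, assuming the shards split the hash space evenly (true for a freshly created stream, not necessarily after resharding). The function names and the example keys are made up.

```python
import hashlib
from collections import Counter

HASH_SPACE = 2 ** 128  # MD5 yields a 128-bit integer


def shard_for_key(partition_key, shard_count):
    """Map a partition key to a shard index, assuming evenly split hash ranges."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    hash_value = int.from_bytes(digest, "big")
    return hash_value * shard_count // HASH_SPACE


def shard_load(partition_keys, shard_count):
    """Count how many records land on each shard for a sequence of keys."""
    return Counter(shard_for_key(key, shard_count) for key in partition_keys)
```

The same key always lands on the same shard, which is what keeps one user's events in order. Feeding `shard_load` a lumpy stream (say 90 records from one very active key plus 10 from distinct users) shows one shard taking at least 90 of the 100 records, however many shards there are.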

You pay for uptime of shards.

There is also on-demand mode, which promises to be the fully serverless option, but I'm not so sure.

But is kinesis serverless?

For me: serverless is pay-as-you-go. But on-demand charges you for convenience, and you always pay for a single shard.

Fairly new offering. Lots of people sticking with provisioned at the moment.

It's the best way of doing data streams in a serverless architecture.

Challenges: the second bottleneck

Lambda can't keep up with data stream. Means:

  • data not flowing in realtime
  • if data lives in the data stream longer than the retention period (24 hrs by default) then it is lost (!)

Metrics are your friend ✨

Kinesis: GetRecords.IteratorAgeMilliseconds. Want as close to zero as possible. Lambda: Duration.
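A rough way to reason about how much time a growing iterator age leaves you. Assuming steady rates and an empty backlog at the start, after t hours the consumer is reading a record written at t × (out / in), so the iterator age is t × (1 − out / in); data loss starts when that reaches the retention period. The function name is hypothetical and real traffic is rarely this steady.

```python
def hours_until_data_loss(records_in_per_s, records_out_per_s, retention_hours=24.0):
    """Hours until the oldest unread record exceeds retention, or None if the consumer keeps up."""
    if records_out_per_s >= records_in_per_s:
        return None  # iterator age does not grow, so nothing expires unread
    # Fraction of each elapsed hour by which the consumer falls further behind.
    lag_growth = 1.0 - records_out_per_s / records_in_per_s
    return retention_hours / lag_growth
```

So a Lambda consumer that processes half of what comes in has about 48 hours before records start expiring with the default 24-hour retention.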

Increase the parallelisation factor on the event source mapping: one shard can be read by up to 10 concurrent Lambda invocations. Order of processing is still maintained - a sequence number is assigned to each event.

Horizontal scaling.

Beyond that, boost memory which also boosts CPU.

Also batch your records. If you have millions of records coming in, you don't want to invoke millions of times. Batch up instead.

Two conditions to tweak: batch size & batch window; whichever is met first will invoke the lambda.
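The tuning knobs mentioned so far (parallelisation, batch size, batch window) all live on the Lambda event source mapping. A sketch that builds and range-checks the arguments for boto3's `create_event_source_mapping`; the helper name, the defaults and the choice of `LATEST` as starting position are assumptions for illustration, and the actual AWS call is left as a comment so the block runs without credentials.

```python
def kinesis_mapping_kwargs(stream_arn, function_name, batch_size=100,
                           batch_window_s=0, parallelization_factor=1):
    """Build keyword arguments for a Kinesis event source mapping."""
    if not 1 <= batch_size <= 10_000:
        raise ValueError("batch_size must be between 1 and 10000")
    if not 0 <= batch_window_s <= 300:
        raise ValueError("batch_window_s must be between 0 and 300")
    if not 1 <= parallelization_factor <= 10:
        raise ValueError("parallelization_factor must be between 1 and 10")
    return {
        "EventSourceArn": stream_arn,
        "FunctionName": function_name,
        "StartingPosition": "LATEST",  # assumption: only new records matter
        "BatchSize": batch_size,
        "MaximumBatchingWindowInSeconds": batch_window_s,
        "ParallelizationFactor": parallelization_factor,
    }


# With boto3 installed and credentials configured, this could be applied as:
#   import boto3
#   boto3.client("lambda").create_event_source_mapping(**kinesis_mapping_kwargs(...))
```

A larger batch size with a non-zero window trades a little latency for far fewer invocations; the Lambda is invoked as soon as either condition is met.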

(Q: batch window with parallelisation?)

There are documents on how to automate the process. But requires you to implement own infra: CloudWatch alarms, eventbridge, lambda etc.

Spiky workloads - maybe on-demand is for you, and ensure lambda can cope with load.

If you have constant flow, you might not have to tweak.

Resolving Bottlenecks of Lambda Triggered By Kinesis - part 1/2 on medium

Q&A

With increased parallelisation, how does it maintain order?

A: I think each is assigned a sequence number?

AWS: I don't think you can guarantee ordering with parallelisation within a shard. There are ways you can manage it, but you may have to create a sequence number as part of the producer? Not sure on the nuance here.


Presumably kinesis comes into its own with large quantities of data and batching. When is it too much for eventbridge?

A: eventbridge is about orchestration of system, not for high streams of data. Not sure there's any hard limit on eventbridge.

A: Also eventbridge is a bit more expensive.

AWS: half-second latency on eventbridge.


Are batches contiguous within parallelisation?

A: yes


Multiple consumers for one stream. I had a problem a few years ago where I couldn't define more than five consumers. Has that been fixed?

A: still a limitation I think.

Powershell on lambda

twitter - julian_wood ; Senior developer advocate

"serverless is someone else's servers" - same with wireless! there are loads of wires behind the scenes!

even easier way to run Lambda functions written in powershell

Powershell

Previously only windows powershell

Now PowerShell core

  • cross-platform
  • scripting language
  • command shell
  • automation and configuration tool and framework.

Unix people pooh-pooh powershell.

How do you get data from your systems? Parsing files with sed/awk. Powershell: everything is an object already.

Powershell loves reacting to events!

Why Lambda?

  • run code without provisioning infra or servers
  • pricing

PowerShell runtime for lambda

Existing .NET solution since 2018 - used .NET runtime. Worked well. Piggybacked on .NET available in Lambda and compiles PS code into .NET binaries.

Could only return the last output from the pipeline, so people had to structure their code differently.

Couldn't see code in lambda console.

lambda runtimes

Managed runtimes

  • node
  • python
  • .NET
  • ruby
  • Go
  • Java

All on Amazon linux.

Custom runtimes

  • COBOL
  • fortran
  • erlang
  • php
  • different OS

Can use Lambda to run any code you can conceive of
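What a custom runtime actually does is small: it polls the Lambda Runtime API for the next event, runs the handler, and posts the result back. The sketch below shows one iteration of that loop in Python rather than PowerShell, purely to illustrate the protocol. The endpoint paths and the `Lambda-Runtime-Aws-Request-Id` header are part of the Runtime API; `http_get` and `http_post` are hypothetical injected helpers standing in for real HTTP calls, and a real bootstrap would read the host from the `AWS_LAMBDA_RUNTIME_API` environment variable and loop forever.

```python
import json

API_VERSION = "2018-06-01"


def process_one_invocation(runtime_api, handler, http_get, http_post):
    """Fetch one event from the Lambda Runtime API, run the handler, post the outcome."""
    base = f"http://{runtime_api}/{API_VERSION}/runtime/invocation"
    # http_get is assumed to return (headers, body_text) for the given URL.
    headers, body = http_get(f"{base}/next")
    request_id = headers["Lambda-Runtime-Aws-Request-Id"]
    try:
        result = handler(json.loads(body))
    except Exception as exc:
        # Report handler failures to the invocation's error endpoint.
        error = {"errorMessage": str(exc), "errorType": type(exc).__name__}
        http_post(f"{base}/{request_id}/error", json.dumps(error))
    else:
        http_post(f"{base}/{request_id}/response", json.dumps(result))
    return request_id
```

Injecting the HTTP helpers keeps the sketch runnable locally with fakes; a runtime layer would wire in a real HTTP client.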

"isn't it really hard?"

Share code between functions

Use lambda layers

Demo

  • Create blank bootstrap function on Amazon Linux 2
  • Custom runtime implemented with lambda layer
  • the layer adds powershell to my function
  • create file examplehandler.ps1
  • paste powershell code into it
  • specify the file as the handler
  • test with test event

If you're doing anything with PowerShell on Lambda, you probably want to do some AWS stuff, which means using the AWS Tools for PowerShell.

  • edit the code to
    • import AWS tools
    • get all regions

adding lambda layer with the AWS tools for powershell (for the dependency)

This is one function. If I had 1k functions, I would use an IaC tool, use the same layer across many functions.

What just happened

Custom runtime: just powershell files.

Lambda layers: PowerShell custom runtime lambda layer, modules.

Build function with: Powershell custom runtime lambda layer, module layers, code.

Deploying

building layers and function

  • using linux or WSL
  • using docker CLI
  • using PowerShell for Windows

AWS SAM to test locally

AWS CLI / SAM to deploy to Lambda
