@dannycoates
AWS Lambda for ETL

Experimenting with AWS Lambda for ETL

A lot of us are interested in doing more analysis of our service logs, so I thought I'd share an experiment I'm doing with Sync. The main idea is to transform the raw logs into something that's nice to query and build reports from in Redshift.

The Pipeline

[Pipeline diagram]

Logs make their way into an S3 bucket (let's call it the 'raw' bucket) where we've got a lambda listening for new data. This lambda reads the raw gzipped heka protobuf data, does some transformation, and writes a new file to a different S3 bucket (the 'processed' bucket) in a Redshift-friendly format (like JSON or CSV). Another lambda listening on the processed bucket loads that data into Redshift.
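
For reference, here's a minimal sketch of the S3 notification event the transform lambda receives. Only the fields the handler below actually reads are shown, and the bucket and key values are placeholders:

// Minimal sketch of an S3 "ObjectCreated" notification as the lambda sees it.
// Only the fields read by the handler below are shown; values are placeholders.
var exampleEvent = {
	Records: [
		{
			s3: {
				bucket: { name: 'sync-raw-logs' },          // the 'raw' bucket
				object: { key: '2016-01-01/part-0001.gz' }  // the new object that fired the event
			}
		}
	]
}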

A lambda can run for no longer than 5 minutes, so the raw files need to be small enough to finish processing in that time. For Sync that means about 400MB of uncompressed data (~60MB gzipped) depending on the transform logic.

Here's a complete example of a transform lambda that simply converts the data from protobuf to JSON. It finishes in about 170 seconds running on the largest lambda size (1536MB).

var Zlib = require('zlib')
var AWS = require('aws-sdk')
var HekaDecodeStream = require('heka-decode-stream')
var through = require('through2')

var s3 = new AWS.S3()

exports.handler = function (event, context) {
	// The S3 event names the raw object that just landed.
	var sourceBucket = event.Records[0].s3.bucket.name
	var sourceKey = event.Records[0].s3.object.key
	var outputBucket = sourceBucket + '-json'
	var outputKey = 'json-' + sourceKey

	// Stream the raw object: gunzip -> decode heka protobuf -> transform to JSON lines -> gzip.
	var output = s3.getObject({ Bucket: sourceBucket, Key: sourceKey }).createReadStream()
		.pipe(Zlib.createGunzip())
		.pipe(HekaDecodeStream())
		.pipe(through.obj(function (o, _, cb) {
			// TODO: do transformation here
			cb(null, JSON.stringify(o) + '\n')
		}))
		.pipe(Zlib.createGzip())

	// Upload the gzipped output stream to the processed bucket, then end the lambda.
	s3.upload({ Bucket: outputBucket, Key: outputKey, Body: output },
		function (err, data) {
			context.done(err, data)
		})
}

To load the processed files into Redshift I used the aws-lambda-redshift-loader. Configuring it wasn't quite as straightforward as the README made it out to be, but after that it works as advertised. I used COPY options of gzip json 'auto'.

Cleanup & Error Recovery

The Redshift loader can send SNS notifications on successful and failed loads. For successes, we can use another lambda to delete the processed S3 files (sketched below); failures can get reported via email or some other channel.
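
Here's a rough sketch of what that cleanup lambda could look like, assuming the success notification's SNS message is JSON that includes the loaded file's bucket and key. The exact message format depends on how the loader is configured, so treat the parsing here as a placeholder:

var AWS = require('aws-sdk')
var s3 = new AWS.S3()

// Hypothetical cleanup lambda subscribed to the loader's "success" SNS topic.
// Assumes the message body is JSON carrying the loaded file's bucket and key;
// adjust the parsing to whatever the loader actually sends.
exports.handler = function (event, context) {
	var message = JSON.parse(event.Records[0].Sns.Message)
	s3.deleteObject({ Bucket: message.bucket, Key: message.key },
		function (err, data) {
			context.done(err, data)
		})
}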

Transform lambda failures can trigger a CloudWatch alarm that we can watch with yet another lambda, or handle some other way.

Development / Deployment Workflow

I haven't spent any time on this yet. So far I've been using the aws cli to deploy and test changes. We'll need something better to push changes through dev, stage and production.

Notes

Lambdas are priced by memory size in 100ms time slices. A larger lambda means both more memory and a faster CPU. In my light testing for these tasks, the fastest lambda ended up costing the least because its processing speed outpaced the price premium.
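
As a rough sanity check, here's the back-of-the-envelope math for the 170-second run above, assuming the published Lambda rate of $0.00001667 per GB-second (verify against current pricing):

// Back-of-the-envelope cost for one 1536MB lambda running ~170 seconds,
// assuming $0.00001667 per GB-second.
var gbSeconds = (1536 / 1024) * 170        // = 255 GB-seconds
var costPerFile = gbSeconds * 0.00001667   // ≈ $0.0043 per file
var fullDay = costPerFile * 300            // ≈ $1.28 for the 300+ files in a full day (see below)
console.log(costPerFile.toFixed(5), fullDay.toFixed(2))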

Lambda logging via console.log() goes to CloudWatch Logs. The web UI works OK for looking at a single test run, but it's painful when a bunch of lambdas are running.

It's super fun to kick off a transform of an entire day's data (>20GB compressed) across 300+ lambdas and have the whole thing done in about five minutes! I have seen some individual lambda runtimes double when fanning out like that, though.

Overall I really like this approach so far. It's nice to write the transform logic in my native tongue (JavaScript), the code can stay small, there are no servers or cron jobs to manage, and there's a lot of flexibility in how lambdas are linked together. With a good deployment workflow I think lambdas are a promising prospect.

@MatthewDaniels

Nice write-up. I have been experimenting with Lambda for ETL (and various other things) too. Have you taken a look at Apex (http://apex.run/) or the Serverless Framework (https://serverless.com/)? These are great tools for managing Lambda projects. They wrap the CLI to handle provisioning, uploading, and updating Lambdas (and, in the case of Serverless, almost anything else in AWS land). They really helped me get environments set up with config descriptors instead of doing everything myself on the CLI or in the AWS Console UI (sigh).
