@dannycoates
AWS Lambda for ETL

Experimenting with AWS Lambda for ETL

A lot of us are interested in doing more analysis of our service logs, so I thought I'd share an experiment I'm doing with Sync. The main idea is to transform the raw logs into something that's nice to query and build reports from in Redshift.

The Pipeline

[Pipeline diagram]

Logs make their way into an S3 bucket (let's call it the 'raw' bucket) where we've got a lambda listening for new data. This lambda reads the raw gzipped heka protobuf data, does some transformation, and writes a new file to a different S3 bucket (the 'processed' bucket) in a Redshift-friendly format (like JSON or CSV). Another lambda listening on the processed bucket loads that data into Redshift.
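
For reference, here's a minimal sketch of the S3 notification event the transform lambda receives. Only the fields the handler below actually reads are shown, and the bucket and key values are placeholders:

// Minimal sketch of an S3 "ObjectCreated" notification as the lambda sees it.
// Only the fields read by the handler below are shown; values are placeholders.
var exampleEvent = {
	Records: [
		{
			s3: {
				bucket: { name: 'sync-raw-logs' },          // the 'raw' bucket
				object: { key: '2016-01-01/part-0001.gz' }  // the new object that fired the event
			}
		}
	]
}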

A lambda can run for no longer than 5 minutes, so the raw files need to be small enough to finish processing in that time. For Sync that means about 400MB of uncompressed data (~60MB gzipped) depending on the transform logic.

Here's a complete example of a transform lambda that simply converts the data from protobuf to JSON. It finishes in about 170 seconds running on the largest lambda size (1536MB).

var Zlib = require('zlib')
var AWS = require('aws-sdk')
var HekaDecodeStream = require('heka-decode-stream')
var through = require('through2')

var s3 = new AWS.S3()

exports.handler = function (event, context) {
	// The S3 event names the raw object that just landed.
	var sourceBucket = event.Records[0].s3.bucket.name
	var sourceKey = event.Records[0].s3.object.key
	var outputBucket = sourceBucket + '-json'
	var outputKey = 'json-' + sourceKey

	// Stream the raw object: gunzip -> decode heka protobuf -> transform to JSON lines -> gzip.
	var output = s3.getObject({ Bucket: sourceBucket, Key: sourceKey }).createReadStream()
		.pipe(Zlib.createGunzip())
		.pipe(HekaDecodeStream())
		.pipe(through.obj(function (o, _, cb) {
			// TODO: do transformation here
			cb(null, JSON.stringify(o) + '\n')
		}))
		.pipe(Zlib.createGzip())

	// Upload the gzipped output stream to the processed bucket, then end the lambda.
	s3.upload({ Bucket: outputBucket, Key: outputKey, Body: output },
		function (err, data) {
			context.done(err, data)
		})
}

To load the processed files into Redshift I used the aws-lambda-redshift-loader. Configuring it wasn't quite as straightforward as the README made it out to be, but after that it works as advertised. I used COPY options of gzip json 'auto'.

Cleanup & Error Recovery

The Redshift loader can send SNS notifications on successful and failed loads. For successes, we can use another lambda to delete the processed S3 files (sketched below); failures can get reported via email or some other channel.
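
Here's a rough sketch of what that cleanup lambda could look like, assuming the success notification's SNS message is JSON that includes the loaded file's bucket and key. The exact message format depends on how the loader is configured, so treat the parsing here as a placeholder:

var AWS = require('aws-sdk')
var s3 = new AWS.S3()

// Hypothetical cleanup lambda subscribed to the loader's "success" SNS topic.
// Assumes the message body is JSON carrying the loaded file's bucket and key;
// adjust the parsing to whatever the loader actually sends.
exports.handler = function (event, context) {
	var message = JSON.parse(event.Records[0].Sns.Message)
	s3.deleteObject({ Bucket: message.bucket, Key: message.key },
		function (err, data) {
			context.done(err, data)
		})
}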

Transform lambda failures can trigger a CloudWatch alarm that we can watch with yet another lambda, or handle some other way.

Development / Deployment Workflow

I haven't spent any time on this yet. So far I've been using the aws cli to deploy and test changes. We'll need something better to push changes through dev, stage and production.

Notes

Lambdas are priced by memory size in 100ms time slices. A larger lambda means both more memory and a faster CPU. In my light testing for these tasks, the fastest lambda ended up costing the least because its processing speed outpaced the price premium.
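
As a rough sanity check, here's the back-of-the-envelope math for the 170-second run above, assuming the published Lambda rate of $0.00001667 per GB-second (verify against current pricing):

// Back-of-the-envelope cost for one 1536MB lambda running ~170 seconds,
// assuming $0.00001667 per GB-second.
var gbSeconds = (1536 / 1024) * 170        // = 255 GB-seconds
var costPerFile = gbSeconds * 0.00001667   // ≈ $0.0043 per file
var fullDay = costPerFile * 300            // ≈ $1.28 for the 300+ files in a full day (see below)
console.log(costPerFile.toFixed(5), fullDay.toFixed(2))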

Lambda logging via console.log() goes to CloudWatch Logs. The web UI works OK for looking at a single test run, but it's painful when a bunch of lambdas are running.

It's super fun to kick off a transform of an entire day's data (>20GB compressed) across 300+ lambdas and have the whole thing done in about five minutes! I have seen some individual lambda runtimes double when fanning out like that, though.

Overall I really like this approach so far. It's nice to write the transform logic in my native tongue (JavaScript), the code can stay small, there are no servers or cron jobs to manage, and there's a lot of flexibility in how lambdas are linked together. With a good deployment workflow I think lambdas are a promising prospect.

@MatthewDaniels

Nice write-up. I have been experimenting with Lambda for ETL (and various other things) too. Have you taken a look at Apex (http://apex.run/) or the Serverless Framework (https://serverless.com/)? These are great tools for managing Lambda projects. They wrap the CLI to handle provisioning, uploading, and updating Lambdas (and, in the case of Serverless, almost anything else in AWS land). They really helped me get environments set up with config descriptors instead of doing everything myself on the CLI or in the AWS Console UI (sigh).
