@brad-jones
Created August 11, 2021 11:44
Ideas about a new ETL Framework / SDK / Service

ETL Framework Thingo

Ok so my issue is essentially the same one that the steampipe tool has set out to solve.

Everyone that has worked with cloud service providers understands how hours can be wasted bouncing back and forth between tools to answer simple questions about your environment.

https://steampipe.io/blog/introducing-steampipe

Why not just use steampipe then? Because it's too damn new.

The idea is amazing and works well, right up until it doesn't for your particular use case. For example, I wanted to query the patch compliance of my SSM-managed instances, but that data is not currently exposed through the AWS plugin.

There is a current open issue turbot/steampipe-plugin-aws#323 & this PR turbot/steampipe-plugin-aws#359.

And that's great, but it doesn't really help me today. Sure, I could check out that PR & build stuff manually, but then what's the next thing I'm going to run into? I have already discovered that I can't connect the GitHub plugin to GitHub Enterprise.

Why not contribute to steampipe and help make it awesome? Because IMO there is always going to be some other data source that you would need to build a plugin for.

Given that aws ssm list-resource-compliance-summaries already gives me the information I need, I want to just use that directly somehow in an easy, maintainable & scalable way. Data projects often start out simple and then grow more complex.
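As a sketch of what "use that directly" could look like: a tiny function that answers the patch-compliance question straight from the CLI's JSON output. The payload below is a trimmed, hypothetical example shaped like the real `aws ssm list-resource-compliance-summaries` response (`ResourceComplianceSummaryItems` with a `Status` per resource); field names beyond those two are assumptions.

```python
import json

def count_noncompliant(payload: str) -> int:
    """Count resources reporting NON_COMPLIANT in an
    `aws ssm list-resource-compliance-summaries` style payload."""
    data = json.loads(payload)
    return sum(
        1
        for item in data.get("ResourceComplianceSummaryItems", [])
        if item.get("Status") == "NON_COMPLIANT"
    )

# A trimmed, hypothetical payload shaped like the real CLI output.
sample = json.dumps({
    "ResourceComplianceSummaryItems": [
        {"ResourceId": "i-0abc", "Status": "NON_COMPLIANT"},
        {"ResourceId": "i-0def", "Status": "COMPLIANT"},
        {"ResourceId": "i-0123", "Status": "NON_COMPLIANT"},
    ]
})

print(count_noncompliant(sample))  # 2
```

In the real tool the payload would come from running the CLI as a child process rather than a hard-coded string.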

The real power of steampipe is that it allows us to join many data sets & that's what this framework needs to make happen, but for simple CLI calls like the example above.

Another reason I am considering building this new tool is because I was already thinking about putting something like https://github.com/graphile/postgraphile in front of steampipe.

Given that JSON is the primary data format we will be dealing with, why fight it and stick it inside an SQL db, as good as Postgres is... let's put that JSON data into a NoSQL db, perhaps even something embedded. Then we stick a GraphQL server in front of that.

And the final part of the journey that a tool like steampipe misses out is the actual reporting / visualization. I want some sort of semi-declarative way of producing reports & dashboards as thats the thing my boss wants to see.

Another tool that I am inspired by is https://github.com/Jeffail/benthos. benthos is a very interesting project, but after playing with it I decided it was really designed for eventing and not the sort of reporting I need to do today.

Extract

Step 1. we need to extract some data from somewhere.

At a high level there are 3 types of data sources that we need to support.

  • The Local Filesystem
  • Child Processes
  • Network Requests
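The three source types above could sit behind one tiny interface, so the rest of the pipeline never cares where bytes came from. A minimal sketch, with class and method names that are entirely hypothetical:

```python
import subprocess
import urllib.request
from pathlib import Path

class FileSource:
    """Extract from the local filesystem."""
    def __init__(self, path: str):
        self.path = Path(path)

    def extract(self) -> bytes:
        return self.path.read_bytes()

class ProcessSource:
    """Extract by running a child process (e.g. the aws CLI)."""
    def __init__(self, *argv: str):
        self.argv = argv

    def extract(self) -> bytes:
        return subprocess.run(
            self.argv, capture_output=True, check=True
        ).stdout

class HttpSource:
    """Extract via a network request."""
    def __init__(self, url: str):
        self.url = url

    def extract(self) -> bytes:
        with urllib.request.urlopen(self.url) as resp:
            return resp.read()

print(ProcessSource("echo", "hello").extract())
```

Anything with an `extract() -> bytes` method would plug in, which is also where consumer-defined sources could slot in later.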

The extract step should provide generous caching such that transforming & loading can be run many times over without hitting the same data sources over and over again.

As well, concerns like rate limiting & retry functionality should be provided by the tool.
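Both of those concerns compose naturally as wrappers around an extract call. A sketch, assuming an in-memory cache (a real tool would persist it to disk with a TTL) and simple exponential backoff:

```python
import functools
import hashlib
import json
import time

def retry(times: int = 3, delay: float = 0.01):
    """Retry a flaky extract call a few times before giving up."""
    def wrap(fn):
        @functools.wraps(fn)
        def inner(*args, **kwargs):
            for attempt in range(times):
                try:
                    return fn(*args, **kwargs)
                except Exception:
                    if attempt == times - 1:
                        raise
                    time.sleep(delay * (2 ** attempt))  # backoff
        return inner
    return wrap

_cache: dict[str, bytes] = {}

def cached(fn):
    """Memoise extract results so transform & load can be re-run
    many times without re-hitting the data source."""
    @functools.wraps(fn)
    def inner(*args):
        key = hashlib.sha256(
            json.dumps([fn.__name__, *args]).encode()
        ).hexdigest()
        if key not in _cache:
            _cache[key] = fn(*args)
        return _cache[key]
    return inner

calls = 0

@cached
@retry(times=3)
def flaky_extract(source: str) -> bytes:
    global calls
    calls += 1
    if calls < 2:
        raise ConnectionError("transient failure")
    return f"data from {source}".encode()

print(flaky_extract("api"), calls)  # first call retries once
print(flaky_extract("api"), calls)  # second call is a cache hit
```

Rate limiting would be one more wrapper of the same shape, e.g. a token bucket around `inner`.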

Transform

Step 2. we need to be able to transform that extracted data into JSON.

The tool will provide out of the box transformations for simple things like:

  • xml
  • csv
  • hcl
  • toml
  • ini
  • etc...

More complex transformations (eg: html scraping) should be able to be provided by the consumer of the tool in their projects without having to build some standalone plugin first.

Although that complex transformation should still be able to be shared and distributed to others if desired.
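One way to support consumer-defined transforms without a plugin build step is a plain registry that project code populates at import time; a packaged transform is then just a module that registers itself. Everything here (the registry, the decorator, the example transform) is hypothetical:

```python
# Registry of named transforms; consumers add their own at import time.
TRANSFORMS = {}

def transform(name: str):
    """Decorator registering a transform under a name."""
    def register(fn):
        TRANSFORMS[name] = fn
        return fn
    return register

@transform("upper")  # a trivial stand-in for e.g. an html scraper
def upper(raw: bytes) -> dict:
    return {"value": raw.decode().upper()}

print(TRANSFORMS["upper"](b"hello"))  # {'value': 'HELLO'}
```

Sharing a complex transform then means publishing that one module, not building a standalone plugin binary.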

Load

Step 3. load the extracted & transformed data into some sort of real or embedded database.

The ETL process should be allowed to be recursive in nature. You will often find that after loading a set of data, a list of Auto Scaling Groups perhaps, that you will then need to extract, transform & load say a list of the EC2 instances attached to those ASGs.

You might be saying: why not just download a list of all the ASGs & a list of all the EC2 instances, and then join those two datasets in the database?

This is essentially what steampipe does such a good job of.

And this could be totally valid for smaller datasets, but for larger ones, like say the public GitHub.com, you can't feasibly download all the repository metadata.
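The recursive load described above can be sketched in a few lines: loading one dataset (the ASGs) triggers extraction of the dependent dataset (the EC2 instances attached to each ASG). The `fetch_*` functions are fake stand-ins for real AWS calls, and the in-memory `db` dict stands in for the embedded database:

```python
def fetch_asgs():
    # Stand-in for an AWS call listing Auto Scaling Groups.
    return [{"name": "web", "instance_ids": ["i-1", "i-2"]},
            {"name": "api", "instance_ids": ["i-3"]}]

def fetch_instance(instance_id):
    # Stand-in for describing a single EC2 instance.
    return {"id": instance_id, "state": "running"}

db = {"asgs": [], "instances": []}

def load_asgs():
    for asg in fetch_asgs():
        db["asgs"].append(asg)
        # Recurse: each loaded ASG spawns further extract & load work,
        # so we only ever fetch the instances we actually care about.
        for iid in asg["instance_ids"]:
            db["instances"].append(fetch_instance(iid))

load_asgs()
print(len(db["asgs"]), len(db["instances"]))  # 2 3
```

This is what makes the large-dataset case tractable: you only extract the children of records you already loaded, instead of downloading the whole universe up front.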

Candidates:

Query

Step 4. once we have the data in some sort of db, we query that db.

What about GraphQL? Maybe we don't need that after all... although perhaps GraphQL just becomes the normalised query language which would allow us to easily swap out the underlying datastore.

This is where we answer our questions that we originally set out to answer.

  • How many EC2 instances need to be patched?
  • Are there any IAM roles with * permissions?
  • Which team owns those lambdas?
  • etc...
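Once the loaded JSON is in an embedded db, questions like these collapse into one-line queries. A sketch using stdlib sqlite3 as the embedded store (the table shape and sample rows are made up for illustration):

```python
import sqlite3

# In-memory stand-in for the embedded database the load step filled.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE instances (id TEXT, patch_status TEXT)")
con.executemany(
    "INSERT INTO instances VALUES (?, ?)",
    [("i-1", "NON_COMPLIANT"),
     ("i-2", "COMPLIANT"),
     ("i-3", "NON_COMPLIANT")],
)

# "How many EC2 instances need to be patched?"
(needs_patching,) = con.execute(
    "SELECT COUNT(*) FROM instances WHERE patch_status = 'NON_COMPLIANT'"
).fetchone()
print(needs_patching)  # 2
```

If GraphQL became the normalised query layer, this SQL would just be what one resolver compiles down to; swapping sqlite for another embedded store would not change the question being asked.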

Visualize

Step 5. great we have answers but my boss wants a pretty graph not a JSON dump.

Although maybe my boss might want a JSON dump too...

So this is where we use some sort of frontend charting lib to produce fancy reports & dashboards.

Ideally defining & building these reports / dashboards will be as declarative as possible. I really do not want to have to deal with CSS, JS & HTML unless I really want to.

The output from this stage would be static, similar in principle to a static site generator. To refresh the data you run the entire thing again and replace the output.

But if we are using GraphQL in the query stage, maybe we can offer two modes of operation: a simple static mode as described, but then also a daemon mode with a real-time dashboard... in daemon mode we deploy as a Docker container, in static mode we deploy to S3 / GitHub Pages / or similar.
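Static mode is essentially "substitute query results into a template and write the file out". A sketch, assuming a made-up result dict (a real tool would feed it a chart spec for the frontend charting lib rather than raw HTML):

```python
from string import Template

# Hypothetical static report template; re-running the pipeline
# regenerates and replaces this output, static-site-generator style.
PAGE = Template("""<html><body>
<h1>Patch Compliance</h1>
<p>$noncompliant of $total instances need patching.</p>
</body></html>""")

def render_report(results: dict) -> str:
    return PAGE.substitute(results)

html = render_report({"noncompliant": 2, "total": 3})
print("need patching" in html)  # True
```

Daemon mode would serve the same template freshly rendered per request instead of writing it to S3 / GitHub Pages.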

Candidates:

What this tool is not

It is not designed to replace things like AWS Glue/Athena & other big data technologies.

This tool is designed to help answer simple to mildly complex questions easily and quickly. The effort required to get an answer with this tool should be less than a days work.
