@brad-jones
Created August 11, 2021 11:44
Ideas about a new ETL Framework / SDK / Service

ETL Framework Thingo

Ok so my issue is essentially the same one that the steampipe tool has set out to solve.

Everyone that has worked with cloud service providers understands how hours can be wasted bouncing back and forth between tools to answer simple questions about your environment.

https://steampipe.io/blog/introducing-steampipe

Why not just use steampipe then? Because it's too damn new.

The idea is amazing and works well, right up until it doesn't for your particular use case. For example, I wanted to query the patch compliance of my SSM-managed instances, but that data is not currently exposed through the AWS plugin.

There is a current open issue turbot/steampipe-plugin-aws#323 & this PR turbot/steampipe-plugin-aws#359.

And that's great, but it doesn't really help me today. Sure, I could check out that PR & build stuff manually, but then what's the next thing I'm going to run into? I have already discovered that I can't connect the GitHub plugin to GitHub Enterprise.

Why not contribute to steampipe and help make it awesome? Because IMO there is always going to be some other data source that you would need to build a plugin for.

Given that aws ssm list-resource-compliance-summaries already gives me the information I need, I want to just use that directly somehow in an easy, maintainable & scalable way. Data projects often start out simple and then grow more complex.
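As a sketch of what "use that directly" could look like: a tiny function that answers the patch-compliance question straight from the CLI's JSON output. The payload below is a trimmed, hypothetical example shaped like the real `aws ssm list-resource-compliance-summaries` response (`ResourceComplianceSummaryItems` with a `Status` per resource); field names beyond those two are assumptions.

```python
import json

def count_noncompliant(payload: str) -> int:
    """Count resources reporting NON_COMPLIANT in an
    `aws ssm list-resource-compliance-summaries` style payload."""
    data = json.loads(payload)
    return sum(
        1
        for item in data.get("ResourceComplianceSummaryItems", [])
        if item.get("Status") == "NON_COMPLIANT"
    )

# A trimmed, hypothetical payload shaped like the real CLI output.
sample = json.dumps({
    "ResourceComplianceSummaryItems": [
        {"ResourceId": "i-0abc", "Status": "NON_COMPLIANT"},
        {"ResourceId": "i-0def", "Status": "COMPLIANT"},
        {"ResourceId": "i-0123", "Status": "NON_COMPLIANT"},
    ]
})

print(count_noncompliant(sample))  # 2
```

In the real tool the payload would come from running the CLI as a child process rather than a hard-coded string.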

The real power of steampipe is that it allows us to join many data sets & that's what this framework needs to make happen, but for simple CLI calls like the example above.

Another reason I am considering building this new tool is because I was already thinking about putting something like https://github.com/graphile/postgraphile in front of steampipe.

Given that JSON is the primary data format we will be dealing with, why fight it and stick it inside an SQL db, as good as Postgres is... let's put that JSON data into a NoSQL db, perhaps even something embedded. Then we stick a GraphQL server in front of that.

And the final part of the journey that a tool like steampipe misses out is the actual reporting / visualization. I want some sort of semi-declarative way of producing reports & dashboards as thats the thing my boss wants to see.

Another tool that I am inspired by is https://github.com/Jeffail/benthos. benthos is a very interesting project, but after playing with it I decided it was really designed for eventing and not the sort of reporting I need to do today.

Extract

Step 1. we need to extract some data from somewhere.

At a high level there are 3 types of data sources that we need to support.

  • The Local Filesystem
  • Child Processes
  • Network Requests
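The three source types above could sit behind one tiny interface, so the rest of the pipeline never cares where bytes came from. A minimal sketch, with class and method names that are entirely hypothetical:

```python
import subprocess
import urllib.request
from pathlib import Path

class FileSource:
    """Extract from the local filesystem."""
    def __init__(self, path: str):
        self.path = Path(path)

    def extract(self) -> bytes:
        return self.path.read_bytes()

class ProcessSource:
    """Extract by running a child process (e.g. the aws CLI)."""
    def __init__(self, *argv: str):
        self.argv = argv

    def extract(self) -> bytes:
        return subprocess.run(
            self.argv, capture_output=True, check=True
        ).stdout

class HttpSource:
    """Extract via a network request."""
    def __init__(self, url: str):
        self.url = url

    def extract(self) -> bytes:
        with urllib.request.urlopen(self.url) as resp:
            return resp.read()

print(ProcessSource("echo", "hello").extract())
```

Anything with an `extract() -> bytes` method would plug in, which is also where consumer-defined sources could slot in later.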

The extract step should provide generous caching such that transforming & loading can be run many times over without hitting the same data sources over and over again.

As well, concerns like rate limiting & retry functionality should be provided by the tool.
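Both of those concerns compose naturally as wrappers around an extract call. A sketch, assuming an in-memory cache (a real tool would persist it to disk with a TTL) and simple exponential backoff:

```python
import functools
import hashlib
import json
import time

def retry(times: int = 3, delay: float = 0.01):
    """Retry a flaky extract call a few times before giving up."""
    def wrap(fn):
        @functools.wraps(fn)
        def inner(*args, **kwargs):
            for attempt in range(times):
                try:
                    return fn(*args, **kwargs)
                except Exception:
                    if attempt == times - 1:
                        raise
                    time.sleep(delay * (2 ** attempt))  # backoff
        return inner
    return wrap

_cache: dict[str, bytes] = {}

def cached(fn):
    """Memoise extract results so transform & load can be re-run
    many times without re-hitting the data source."""
    @functools.wraps(fn)
    def inner(*args):
        key = hashlib.sha256(
            json.dumps([fn.__name__, *args]).encode()
        ).hexdigest()
        if key not in _cache:
            _cache[key] = fn(*args)
        return _cache[key]
    return inner

calls = 0

@cached
@retry(times=3)
def flaky_extract(source: str) -> bytes:
    global calls
    calls += 1
    if calls < 2:
        raise ConnectionError("transient failure")
    return f"data from {source}".encode()

print(flaky_extract("api"), calls)  # first call retries once
print(flaky_extract("api"), calls)  # second call is a cache hit
```

Rate limiting would be one more wrapper of the same shape, e.g. a token bucket around `inner`.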

Transform

Step 2. we need to be able to transform that extracted data into JSON.

The tool will provide out of the box transformations for simple things like:

  • xml
  • csv
  • hcl
  • toml
  • ini
  • etc...

More complex transformations (eg: html scraping) should be able to be provided by the consumer of the tool in their projects without having to build some standalone plugin first.

Although that complex transformation should still be able to be shared and distributed to others if desired.
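One way to support consumer-defined transforms without a plugin build step is a plain registry that project code populates at import time; a packaged transform is then just a module that registers itself. Everything here (the registry, the decorator, the example transform) is hypothetical:

```python
# Registry of named transforms; consumers add their own at import time.
TRANSFORMS = {}

def transform(name: str):
    """Decorator registering a transform under a name."""
    def register(fn):
        TRANSFORMS[name] = fn
        return fn
    return register

@transform("upper")  # a trivial stand-in for e.g. an html scraper
def upper(raw: bytes) -> dict:
    return {"value": raw.decode().upper()}

print(TRANSFORMS["upper"](b"hello"))  # {'value': 'HELLO'}
```

Sharing a complex transform then means publishing that one module, not building a standalone plugin binary.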

Load

Step 3. load the extracted & transformed data into some sort of real or embedded database.

The ETL process should be allowed to be recursive in nature. You will often find that after loading a set of data, a list of Auto Scaling Groups perhaps, that you will then need to extract, transform & load say a list of the EC2 instances attached to those ASGs.

You might be saying: why not just download a list of all the ASGs & a list of all the EC2 instances, and then join those two datasets in the database?

This is essentially what steampipe does such a good job of.

And this could be totally valid for smaller datasets, but for larger ones, like say the public GitHub.com, you can't feasibly download all the repository metadata.
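The recursive load described above can be sketched in a few lines: loading one dataset (the ASGs) triggers extraction of the dependent dataset (the EC2 instances attached to each ASG). The `fetch_*` functions are fake stand-ins for real AWS calls, and the in-memory `db` dict stands in for the embedded database:

```python
def fetch_asgs():
    # Stand-in for an AWS call listing Auto Scaling Groups.
    return [{"name": "web", "instance_ids": ["i-1", "i-2"]},
            {"name": "api", "instance_ids": ["i-3"]}]

def fetch_instance(instance_id):
    # Stand-in for describing a single EC2 instance.
    return {"id": instance_id, "state": "running"}

db = {"asgs": [], "instances": []}

def load_asgs():
    for asg in fetch_asgs():
        db["asgs"].append(asg)
        # Recurse: each loaded ASG spawns further extract & load work,
        # so we only ever fetch the instances we actually care about.
        for iid in asg["instance_ids"]:
            db["instances"].append(fetch_instance(iid))

load_asgs()
print(len(db["asgs"]), len(db["instances"]))  # 2 3
```

This is what makes the large-dataset case tractable: you only extract the children of records you already loaded, instead of downloading the whole universe up front.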

Candidates:

Query

Step 4. once we have the data in some sort of db, we query that db.

What about GraphQL? Maybe we don't need that after all... although perhaps GraphQL just becomes the normalised query language which would allow us to easily swap out the underlying datastore.

This is where we answer our questions that we originally set out to answer.

  • How many EC2 instances need to be patched?
  • Are there any IAM roles with * permissions?
  • Which team owns those lambdas?
  • etc...
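Once the loaded JSON is in an embedded db, questions like these collapse into one-line queries. A sketch using stdlib sqlite3 as the embedded store (the table shape and sample rows are made up for illustration):

```python
import sqlite3

# In-memory stand-in for the embedded database the load step filled.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE instances (id TEXT, patch_status TEXT)")
con.executemany(
    "INSERT INTO instances VALUES (?, ?)",
    [("i-1", "NON_COMPLIANT"),
     ("i-2", "COMPLIANT"),
     ("i-3", "NON_COMPLIANT")],
)

# "How many EC2 instances need to be patched?"
(needs_patching,) = con.execute(
    "SELECT COUNT(*) FROM instances WHERE patch_status = 'NON_COMPLIANT'"
).fetchone()
print(needs_patching)  # 2
```

If GraphQL became the normalised query layer, this SQL would just be what one resolver compiles down to; swapping sqlite for another embedded store would not change the question being asked.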

Visualize

Step 5. great we have answers but my boss wants a pretty graph not a JSON dump.

Although maybe my boss might want a JSON dump too...

So this is where we use some sort of frontend charting lib to produce fancy reports & dashboards.

Ideally defining & building these reports / dashboards will be as declarative as possible. I really do not want to have to deal with CSS, JS & HTML unless I really want to.

The output from this stage would be static, similar in principle to a static site generator. To refresh the data you run the entire thing again and replace the output.

But if we are using GraphQL in the query stage, maybe we can offer two modes of operation: a simple static mode as described, but then also a daemon mode with a real-time dashboard... in daemon mode we deploy as a Docker container, in static mode we deploy to S3 / GitHub Pages / or similar.
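Static mode is essentially "substitute query results into a template and write the file out". A sketch, assuming a made-up result dict (a real tool would feed it a chart spec for the frontend charting lib rather than raw HTML):

```python
from string import Template

# Hypothetical static report template; re-running the pipeline
# regenerates and replaces this output, static-site-generator style.
PAGE = Template("""<html><body>
<h1>Patch Compliance</h1>
<p>$noncompliant of $total instances need patching.</p>
</body></html>""")

def render_report(results: dict) -> str:
    return PAGE.substitute(results)

html = render_report({"noncompliant": 2, "total": 3})
print("need patching" in html)  # True
```

Daemon mode would serve the same template freshly rendered per request instead of writing it to S3 / GitHub Pages.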

Candidates:

What this tool is not

It is not designed to replace things like AWS Glue/Athena & other big data technologies.

This tool is designed to help answer simple to mildly complex questions easily and quickly. The effort required to get an answer with this tool should be less than a days work.
