Skip to content

Instantly share code, notes, and snippets.

@rtimmons
Last active April 6, 2022 18:24
Show Gist options
  • Star 0 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save rtimmons/d5cbe70be8f829fb95fa2ccbee9f3c5b to your computer and use it in GitHub Desktop.
Save rtimmons/d5cbe70be8f829fb95fa2ccbee9f3c5b to your computer and use it in GitHub Desktop.

Lunch & Learn: SQL and ETLs for Presto and MARS

Blurb

Presto lets us trade APIs for queryable data-structures. Many APIs are already in Presto including Jira, TIPS, and DAG data. MARS uses Presto to run batch-procesing jobs against these data-structures. SQL is the (unfortunate) lingua-franca in this land.

This is a new and foreign situation for some of us, but it's not scary. If you know some MQL or even basic Python you can pick up Presto's SQL easily to quickly explore data within dev-prod and the rest of the company. And MARS is both simpler and more powerful than you might guess.

In a few days everyone in dev-prod and STAR will automatically have access to Presto. Let's hit the ground running.

I'll do a quick end-to-end demo of exploring the Jira, TIPS, and DAG data catalogs. Then we'll do a whirlwind tour of SQL (including what makes Presto's SQL not so terrible). Then I'll put that SQL into a multi-step MARS job. Then I'll use the resulting data in JupyterHub. Then I'll give a quick description of how we'll keep this new kingdom tidy.

If you are keen on following along, check out the DBeaver onboarding page: https://github.com/10gen/mongo-presto/blob/master/docs/configuration/dbeaver/connection.md

Setup

  • ensure on VPN
  • ensure can access MongoDB Cloud here

Slides

SQL, Presto, MARS, and Jupyter (and Jira!)

 ryan.timmons@
 2021-12-03

Recording here. Good luck following that guy talking.

Pretend Goal: Data Consistency TIPS vs Cedar

Cedar thinks it's sent some data to TIPS but TIPS thinks otherwise. Both can't be correct. Let's see.

Agenda:

  1. SQL and Presto
  2. Marsing
  3. "Policies" and getting started
  4. Using this in Jupyterhub

Access and DBeaver

🦫🦫 install & configure 🦫🦫

Elephants: SQL????

Yup. No excuses. Our offerings aren't ready yet. BUT: Presto does give us a back-door to using MQL. Stay tuned.

SQL: Tables

SELECT
  *
FROM
  evergreen_cedar_atlas.cedar.perf_results AS cedars
WHERE
  cedars.created_at < DATE_ADD('day', -1, CURRENT_TIMESTAMP)
  AND
  CARDINALITY(rollups.stats) > 2;

Note the "data" column. Cmd+Click on the object. Tab to switch view.

data[1].cedar_perf_result_id

How the heck? And how do we use it.

Let's first talk about what's going on behind the scenes here. Where did this data come from and how does presto know what it looks like?

How Presto Prestoes

Sneak-Peek of MARS:

  • defines a dag of jobs
  • jobs take data from one place and put it another with some config in between
  • sources are many including atlas
  • destinations are many including atlas (but usually it's presto)
  • other destinations too? hold on we'll find out.

Our Atlas-Backed Catalogs

These use the mongodb connector. Analytics nodes with presto user accounts and roles facilitating the connections.

  • dev_prod_build_baron_atlas (DAG)
  • dev_prod_performance_atlas (TIPS)
  • evergreen_cedar_atlas (EVG) (but test results are wip)
  • patch_build_optimizer_atlas (STM)
  • tpm_build_failures_atlas (TPMs)
  • evergreen_metrics_atlas (ignore!)
  • Missing: evergreen app (for now)

Note: Security is a thing (Mana)

Our AWS Catalogs

In awsdatacatalog. These use the hive connector.

  • dev_prod_staging
  • evergreen_base (private!)
  • dev_prod_live

Alation

How do you know what's what? WIP!

SQL: Presto Data Types

Looking back at the results of the select query from before.

SELECT
  *
FROM
  dev_prod_performance_atlas.expanded_metrics.time_series

Note the "data" column. Cmd+Click on the object. Tab to switch view.

data[1].cedar_perf_result_id

SQL: Unnest

SELECT
  data
FROM
  dev_prod_performance_atlas.expanded_metrics.time_series
select series.*, datas.*
from 
    dev_prod_performance_atlas.expanded_metrics.time_series as series,
    unnest(data) as datas
with x (variant, tasks) as (
  values
  	('variant1', array['task-1a', 'task-1b']),
  	('variant2', array['task-2a', 'task-2b'])
)
select  *
from    x;

select 	variant,
		unnested.newname
from 	
	x,
	unnest(tasks) as unnested(newname)
;

SQL: Scope and Subqueries

select series.*, datas.*
from 
    dev_prod_performance_atlas.expanded_metrics.time_series as series,
    unnest(data) as datas
WITH series_points AS (
   select series.*, datas.*
   from 
     dev_prod_performance_atlas.expanded_metrics.time_series as series,
     unnest(data) as datas
)
SELECT *
FROM   series_points

SQL: Cartesian Joins

with 
tips_series as (
    select series.*, datas.*
    from 
        dev_prod_performance_atlas.expanded_metrics.time_series as series,
        unnest(data) as datas
),
cedar_points as (
    select cedars._id, stats.*
    from evergreen_cedar_atlas.cedar.perf_results as cedars,
    UNNEST(cedars.rollups."stats") as stats
    where cedars.created_at < date_add('day', -1, current_timestamp)
)
select  *
from    tips_series ts,
        cedar_points cs
where   ts.cedar_perf_result_id = cs._id;

SQL: "Natural" (left/right inner) joins

select  *
from    tips_series ts,
        cedar_points cs
where   ts.cedar_perf_result_id = cs._id;

-- vs

SELECT  *
FROM    
    tips_series      AS ts
    LEFT JOIN 
	    cedar_points AS cs
	ON ts.cedar_perf_result_id = cs._id;

SQL: Outer Joins

select *
from a, b
where a_id = b_id

              .───────.   .───────.
           ,─'         ,─'         '─.
         ,'          ,'   `.          `.
        β•±           β•±       β•²           β•²
β”Œβ”€β”€β”€β”  ;           ;         :           :   β”Œβ”€β”€β”€β”
β”‚ a β”‚  β”‚     ?     β”‚             ?       β”‚   β”‚ b β”‚
β””β”€β”€β”€β”˜  β”‚           β”‚    β–²    β”‚           β”‚   β””β”€β”€β”€β”˜
       :           :    β”‚    ;           ;
        β•²           β•²   β”‚   β•±           β•±
         β•²           β•²  β”‚  β•±           β•±
          `.          `.β”‚,'          ,'
            '─.       ,─│─.       ,─'
               `─────'  β”‚  `─────'
                        β”‚
                β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
                β”‚ a_id = b_id  β”‚
                β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

In a but not b?

select *
from
  a left outer join b
  on a_id = b_id
and b_id is null

SQL: The solution

with 
tips_series as (
    select series.*, datas.*
    from 
        dev_prod_performance_atlas.expanded_metrics.time_series as series,
        unnest(data) as datas
),
cedar_points as (
    select cedars._id, stats.*
    from evergreen_cedar_atlas.cedar.perf_results as cedars,
    UNNEST(cedars.rollups."stats") as stats
    where cedars.created_at < date_add('day', -1, current_timestamp)
)
select  *
from    cedar_points cs
    left outer join time_series ts on ts.cedar_perf_result_id = cs._id
where ts.cedar_perf_result_id is null

SQL: Relational Calculus

select *
from a
where not exists (
  select 1
  from b
  where b_id = a_id
)

Useful when CBO isn't doing what you want. Or when calc is just clearer.

SQL: Relational Calculus

In a but not b?

select *
from
  a left outer join b
  on a_id = b_id
and b_id is null
select *
from a
where not exists (
  select 1
  from b
  where b_id = a_id
);

Warning

  • Joining across Atlas clusters is often very slow.

  • So let's not run the query "hot" against the cluster

    MOVING ON: Marsing

Agenda:

  1. SQL and Presto
  2. Marsing
  3. "Policies" and getting started
  4. Using this in Jupyterhub

MARS

Sneak-Peek of MARS:

  • defines a dag of jobs
  • jobs take data from one place and put it another with some config in between
  • sources are many including atlas
  • destinations are many including atlas (but usually it's presto)
  • ...and ADL
  • AND THEN
  • runs them based on triggers like cron or webhooks etc
  • allows for alerting and retry management, etc

MARS for Snapshots

https://mars.dataplatform.prod.corp.mongodb.com/app/job/c22837b4-45b0-43e2-a7d2-124079f9cef9

MARS for Reporting

  • Complex DAGs and output destinations
  • AirFlow?
  • Python?

ETLs Repo

rtimmons/mongo-etls (STILL VERY WIP and will create a 10gen repo when it's for reals)

Agenda:

  1. SQL and Presto
  2. Marsing
  3. "Policies" and getting started
  4. Using this in Jupyterhub

Design and Policies

Notable Policies:

  • All Data is Exposed (even if ugly)
  • Analytics Nodes are Private
  • Views Data is Stable; Changes to Views Follow a Process
  • Services are the APIs for OLTP
  • Presto is the API for OLAP

JypterHub

Example notebook -> Or on docs

When?

  • Everyone in {dev-prod, Kelsey's org} has access to Presto now. If you are not in these groups and want access, please slack me ryan.timmons.
  • Kick the tires and try things out.
  • If you want to create mars jobs to materialize data from atlas or whatever please let me know.

ANYWAY

  • "Policies" are going out for leads review on Monday
  • Once approved, we'll set the repo and documentation up and start rolling it out for official use.
  • TLDR: Hopefully production-ready by January.

RCS

My Reference: NWA2H57UT3ZZ117PTJPY1
@quick(Lunch & Learn SQL and ETLs for Presto and MARS) @ft

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment