
Scaling out Tableau Extracts - Building a distributed, multi-node MPP Hyper Cluster

Tableau's Hyper database (the engine behind "Extracts") is a great piece of technology; it's one of the reasons people are obsessed with Tableau analytics. However, being a single-node database server, it has its limits in performance and data size - limits we are going to break in this blog post. By adding multi-node capacity, partitioning and distributed query processing, Hyper will act like a real enterprise MPP data warehouse, running on modern infrastructure like kubernetes.

By the end of this document, you will know how to build an MPP version of the Hyper database and how to design parallel ingestion into it.

Background and Fundamentals

Tableau's Hyper database was built from scratch with de facto standard features (LLVM code generation, in-memory capabilities, columnar data store, etc.) and with Postgres network protocol and SQL dialect compatibility. It is a pretty neat database: fast, convenient and standard. With the new Extract API you can issue a full set of Postgres-like SQL statements, including COPY to move data in and out in bulk. No doubt, with minor tweaking we should be able to access Hyper with libpq-based applications like psql and the Postgres ODBC driver.

MPP, Shared-nothing

What is missing here is horizontal scalability: leveraging multiple worker nodes to speed up individual queries and to parallelize data ingestion. Traditionally, this is what the MPP (Massively Parallel Processing) architecture provides: multiple database worker nodes with dedicated, partial datasets, plus aggregator/master node(s) that combine the results from the workers. Imagine the following SQL statement:

SELECT sum(profit), state FROM sales GROUP BY state

In a shared-nothing world the query first hits the aggregator node, which dispatches it to the worker nodes:

SELECT sum(profit), state FROM sales GROUP BY state -- the aggregator executes this on worker1, which holds the East Coast states
--
SELECT sum(profit), state FROM sales GROUP BY state -- the aggregator executes this on worker2, which holds the West Coast states

Workers execute the queries on their partial set of data and return the aggregated results to the master node. The master node then combines the aggregated values in an additional step. Again, most of the work happens on the workers, while the master merely aggregates the aggregates. Pretty simple. This is how Teradata, Yellowbrick, Vertica, Hadoop Hive, Spark, Redshift and Greenplum work.
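To make that combine step concrete, here is an illustrative sketch (not any planner's actual output) of what the master effectively computes over the workers' partial results:

-- each UNION ALL branch stands for one worker's pre-aggregated shard
SELECT sum(partial_profit) AS profit, state
FROM (
  SELECT sum(profit) AS partial_profit, state FROM sales GROUP BY state -- worker1's shard
  UNION ALL
  SELECT sum(profit) AS partial_profit, state FROM sales GROUP BY state -- worker2's shard
) partials
GROUP BY state;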

But what would it take to convert Hyper Database to this architecture?

How are we going to do this?

We need a few things here to make this conversion happen:

  1. Build worker nodes that are generic Hyper Database services running independently of each other
    • Create a docker image from Hyper Database that is remotely accessible,
    • Deploy it on kubernetes as a Service to manage its elasticity
  2. Build a master node that will act as an aggregator. Postgres 11 has database-link-like functionality (foreign data wrappers) that can push queries down to other Postgres databases (and Hyper acts like a Postgres too!).
    • Deploy Postgres 11 on kubernetes,
    • Set up foreign data wrapper for hyper,
    • Import and synchronize metadata across hyper workers and Postgres master nodes
  3. Validate that aggregation happens on the shared-nothing worker nodes

Building Hyper Docker image

Hyper Database is part of all Tableau products: Desktop, Server, Online, Prep and the standalone Extract API package. The easiest way is to start from an empty docker base image like centos, download the Extract API, initialize a new hyper database with hyperd configure, and start the service with hyperd start. The final Dockerfile looks like this:

FROM centos:latest

MAINTAINER "Tamas Foldi" <tfoldi@starschema.net>

ADD https://downloads.tableau.com/tssoftware/extractapi-linux-x86_64-2019-2-2.tar.gz /tmp

RUN mkdir /hyper && \
    tar xvzf /tmp/extractapi-linux-x86_64-2019-2-2.tar.gz --strip 1 -C /hyper && \
    /hyper/bin/hyper/hyperd configure --log-dir /hyper/ -d /hyper/hyper.hyper --no-password --no-ssl --init-user tableau_internal_user

EXPOSE 7483

CMD /hyper/bin/hyper/hyperd start --log-dir /hyper/ -d /hyper/hyper.hyper --skip-license --no-password --no-ssl --listen-connection tab.tcp://0.0.0.0:7483

The image is published to Docker Hub, so there is no need to build it yourself.
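Pulling it (with the same release tag used below) is a one-liner:

$ docker pull tfoldi/tableau-hyperd-docker:release

After starting the image, we can easily connect to it: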

$ docker run -d -t tfoldi/tableau-hyperd-docker:release
490ea4faea9704683c35a78809b964737d3fca4156ffda3838b1fa54443b5d64
$ docker inspect 490ea4faea9704683c35a78809b964737d3fca4156ffda3838b1fa54443b5d64 | grep IPAddress
            "SecondaryIPAddresses": null,
            "IPAddress": "172.17.0.2",
                    "IPAddress": "172.17.0.2",
$ psql --host 172.17.0.2 --port 7483 --user tableau_internal_user
psql (9.2.24, server 9.1.0)
WARNING: psql version 9.2, server version 9.1.
         Some psql features might not work.
Type "help" for help.

tableau_internal_user=# create table foo (bar integer);
CREATE TABLE

We can easily start a new hyper daemon and connect to its standard port. Now let's deploy it on kubernetes with the following Service and Deployment configuration:

Service definition

apiVersion: v1
kind: Service
metadata:
  name: hyperd-mpp
  labels:
    app: hyperd-mpp
    role: worker
spec:
  ports:
  - port: 7483
    protocol: TCP
  selector:
    app: hyperd-mpp
    role: worker

Deployment/Pod definition. The replicas key defines how many hyperd workers should run at a given time.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: hyperd-mpp-deployment
  labels:
    app: hyperd-mpp
    role: worker
spec:
  replicas: 2
  selector:
    matchLabels:
      app: hyperd-mpp
      role: worker
  template:
    metadata:
      labels:
        app: hyperd-mpp
        role: worker
    spec:
      containers:
      - name: hyperd-worker
        image: tfoldi/tableau-hyperd-docker
        ports:
        - containerPort: 7483

If all looks good, we can apply these and have our Hyper workers up and running.
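A minimal apply step, assuming the two manifests above are saved as hyperd-service.yaml and hyperd-deployment.yaml (the filenames are my own):

$ kubectl apply -f hyperd-service.yaml
$ kubectl apply -f hyperd-deployment.yaml

Checking the result: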

[tfoldi@kompi]% kubectl get all -l app=hyperd-mpp,role=worker
NAME                                        READY   STATUS    RESTARTS   AGE
pod/hyperd-mpp-deployment-c5fd49756-ccg9p   1/1     Running   0          20h
pod/hyperd-mpp-deployment-c5fd49756-x6q77   1/1     Running   0          20h

NAME                 TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)    AGE
service/hyperd-mpp   ClusterIP   10.100.200.35   <none>        7483/TCP   20h

NAME                                    READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/hyperd-mpp-deployment   2/2     2            2           20h

NAME                                              DESIRED   CURRENT   READY   AGE
replicaset.apps/hyperd-mpp-deployment-c5fd49756   2         2         2       20h

Lovely. Now let's move to the more interesting part: how to add an aggregator on top of this.

The Aggregator / Master node

The first step is to deploy a Postgres 11 container to our kubernetes cluster with its own Service and Deployment definitions.
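For reference, a minimal sketch of such a Deployment (the postgres:11 image, the demo credential and the role label are my assumptions; the app=hyperd-mpp label matches the selector used below):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: postgres-master
  labels:
    app: hyperd-mpp
    role: master
spec:
  replicas: 1
  selector:
    matchLabels:
      app: hyperd-mpp
      role: master
  template:
    metadata:
      labels:
        app: hyperd-mpp
        role: master
    spec:
      containers:
      - name: postgres-master
        image: postgres:11
        env:
        - name: POSTGRES_PASSWORD   # demo-only credential
          value: changeme
        ports:
        - containerPort: 5432

The matching Service is analogous to the workers' one, except that it exposes port 5432 through a LoadBalancer. If all looks good, we should have a similar topology in our kubernetes cluster: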

$ kubectl get all -l app=hyperd-mpp
NAME                                        READY   STATUS    RESTARTS   AGE
pod/hyperd-mpp-deployment-c5fd49756-ccg9p   1/1     Running   0          20h
pod/hyperd-mpp-deployment-c5fd49756-x6q77   1/1     Running   0          20h
pod/postgres-master-69f8d75f9c-bltwx        1/1     Running   0          20h

NAME                      TYPE           CLUSTER-IP      EXTERNAL-IP                                                                  PORT(S)          AGE
service/hyperd-mpp        ClusterIP      10.100.200.35   <none>                                                                       7483/TCP         20h
service/postgres-master   LoadBalancer   10.100.43.171   a52bd342.kubeschema   5432:31679/TCP   19h

NAME                                    READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/hyperd-mpp-deployment   2/2     2            2           20h
deployment.apps/postgres-master         1/1     1            1           20h

NAME                                              DESIRED   CURRENT   READY   AGE
replicaset.apps/hyperd-mpp-deployment-c5fd49756   2         2         2       20h
replicaset.apps/postgres-master-69f8d75f9c        1         1         1       20h

Testing the system

Our excellent MPP infrastructure is up; we have two hyper workers and a Postgres master. The next step is to configure and test it. Let's add some test data to the hyper nodes first. For my very basic first test, I will distribute the data by date (date is the shard key): worker1 will have the data for 2016, while worker2 will have 2017.

Creating the tables

To create a table on one of the nodes, connect to its console, open psql and issue a CREATE TABLE statement.

$ kubectl exec -it hyperd-mpp-deployment-c5fd49756-ccg9p -- /bin/sh
sh-4.2# psql --host localhost --port 7483 --user tableau_internal_user
tableau_internal_user=# CREATE TABLE temperatures_2017(
    at      date,
    city    text,
    mintemp integer,
    maxtemp integer
);
CREATE TABLE

Then do it for 2016 on the other worker.
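For completeness, the matching statement (identical schema, only the year differs):

tableau_internal_user=# CREATE TABLE temperatures_2016(
    at      date,
    city    text,
    mintemp integer,
    maxtemp integer
);
CREATE TABLE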

We can insert the data directly from here, or later through the master node. The typical MPP data ingestion flow is to extract the data in parallel from upstream systems and load it into the workers independently. As you can imagine, running 10-20 COPY statements in parallel on different workers can speed up the extract creation process by 10-20x.
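A minimal sketch of such a parallel load, assuming one CSV file per shard (the file names and worker IPs are placeholders):

$ psql --host <worker1-ip> --port 7483 --user tableau_internal_user \
    -c "\copy temperatures_2016 FROM 'temps_2016.csv' WITH (FORMAT csv)" &
$ psql --host <worker2-ip> --port 7483 --user tableau_internal_user \
    -c "\copy temperatures_2017 FROM 'temps_2017.csv' WITH (FORMAT csv)" &
$ wait

Both COPY statements run at the same time, each loading only its own worker.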

Linking workers with the master/aggregator

Now it's time to tell our aggregator where its hyper workers are. After logging on to our Postgres 11 master, we add the postgres_fdw foreign data wrapper shipped with Postgres, define the remote servers and set up user mappings. The whole process is described in the excellent postgres_fdw documentation for linking two Postgres databases; since Hyper acts like a Postgres, the very same steps apply.

% psql --host 100.10.10.23  -U postgres --password
postgres=# CREATE EXTENSION postgres_fdw;
CREATE EXTENSION
postgres=# CREATE SERVER hyper1 FOREIGN DATA WRAPPER postgres_fdw
    OPTIONS (host '192.168.143.173', port '7483', dbname '/hyper/hyper.hyper');
CREATE SERVER
postgres=# CREATE USER MAPPING FOR postgres SERVER hyper1 OPTIONS (user 'tableau_internal_user');
CREATE USER MAPPING
-- repeat for hyper2 database
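Spelled out, the hyper2 definition is identical except for the host (the second worker pod's IP below is a placeholder):

postgres=# CREATE SERVER hyper2 FOREIGN DATA WRAPPER postgres_fdw
    OPTIONS (host '<worker2-ip>', port '7483', dbname '/hyper/hyper.hyper');
CREATE SERVER
postgres=# CREATE USER MAPPING FOR postgres SERVER hyper2 OPTIONS (user 'tableau_internal_user');
CREATE USER MAPPING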

The next step is to create the table that external consumers (= Tableau dashboards) will query. Let's call our table temperatures and add temperatures_2016 and temperatures_2017 from the hyper workers as partitions of this table.

postgres=# CREATE TABLE temperatures (
    at      date,
    city    text,
    mintemp integer,
    maxtemp integer
)
PARTITION BY RANGE (at);
CREATE TABLE
postgres=# CREATE FOREIGN TABLE temperatures_2016
    PARTITION OF temperatures
    FOR VALUES FROM ('2016-01-01') TO ('2017-01-01')
    SERVER hyper1;
CREATE FOREIGN TABLE
postgres=# CREATE FOREIGN TABLE temperatures_2017
    PARTITION OF temperatures
    FOR VALUES FROM ('2017-01-01') TO ('2018-01-01')
    SERVER hyper2;
CREATE FOREIGN TABLE

Everything seems fine, time to add a few records:

postgres=# INSERT INTO temperatures (at, city, mintemp, maxtemp) VALUES ('2016-08-03', 'London', 63, 73);
INSERT 0 1
postgres=# INSERT INTO temperatures VALUES ('2017-08-03', 'London', 63, 90);
INSERT 0 1

Time to test if we can issue a single SELECT statement that collects the records from the underlying hyper workers:

postgres=# select * from temperatures;
     at     |  city  | mintemp | maxtemp
------------+--------+---------+---------
 2016-08-03 | London |      63 |      73
 2017-08-03 | London |      63 |      90
(2 rows)
postgres=# explain (verbose, costs off) select * from temperatures;
                                                     QUERY PLAN
--------------------------------------------------------------------------------------------------------------------
 Append
   ->  Foreign Scan on public.temperatures_2016
         Output: temperatures_2016.at, temperatures_2016.city, temperatures_2016.mintemp, temperatures_2016.maxtemp
         Remote SQL: SELECT at, city, mintemp, maxtemp FROM public.temperatures_2016
   ->  Foreign Scan on public.temperatures_2017
         Output: temperatures_2017.at, temperatures_2017.city, temperatures_2017.mintemp, temperatures_2017.maxtemp
         Remote SQL: SELECT at, city, mintemp, maxtemp FROM public.temperatures_2017
(7 rows)

As we can see, Postgres splits the query into two separate foreign scans, then combines them into a single result set.

However, we have an issue with aggregation pushdown:

postgres=# explain (verbose, costs off) select avg(maxtemp),city from temperatures group by 2;
                                  QUERY PLAN
------------------------------------------------------------------------------
 HashAggregate
   Output: avg(temperatures_2016.maxtemp), temperatures_2016.city
   Group Key: temperatures_2016.city
   ->  Append
         ->  Foreign Scan on public.temperatures_2016
               Output: temperatures_2016.city, temperatures_2016.maxtemp
               Remote SQL: SELECT city, maxtemp FROM public.temperatures_2016
         ->  Foreign Scan on public.temperatures_2017
               Output: temperatures_2017.city, temperatures_2017.maxtemp
               Remote SQL: SELECT city, maxtemp FROM public.temperatures_2017
(10 rows)

As we can see, when issuing aggregated queries on a partitioned table backed by foreign tables, the aggregation is not executed on the workers. This is a showstopper for us: given the use case, we cannot do all the aggregation on the master node. We expect billions of rows on each worker, and streaming them to the master would eliminate all the performance gain we are after. There is some hope, though: if we execute aggregate statements directly on the foreign tables, Postgres correctly pushes them down to the workers:

postgres=# explain (verbose, costs off) select avg(maxtemp),city from temperatures_2017 group by 2;
                                    QUERY PLAN
----------------------------------------------------------------------------------
 Foreign Scan
   Output: (avg(maxtemp)), city
   Relations: Aggregate on (public.temperatures_2017)
   Remote SQL: SELECT avg(maxtemp), city FROM public.temperatures_2017 GROUP BY 2
(4 rows)

Tableau Union

It seems we need to change strategy here: instead of using partitioned tables, we can build the union from the foreign tables directly in Tableau.

[Image]

However, foreign tables are not visible in Tableau Desktop by default. This is a product issue: Tableau's schema inspection queries do not include foreign tables. A quick workaround is to create views over these foreign tables:

postgres=# create view v_temperatures_2017 as select * from temperatures_2017;
CREATE VIEW
postgres=# create view v_temperatures_2016 as select * from temperatures_2016;
CREATE VIEW

It seems the workaround works (that's why it's called a workaround).

After quickly building the union, we can see what queries are generated by Tableau and how they get executed by our Hyper MPP Cluster:

postgres=# explain verbose SELECT AVG("t0"."maxtemp") AS "avg:maxtemp:ok",
  "t0"."city" AS "city"
FROM (
  SELECT "t1"."city" AS "city", "t1"."maxtemp" AS "maxtemp"
  FROM (
    SELECT CAST("v_temperatures_2016"."city" AS TEXT) AS "city",
      "v_temperatures_2016"."maxtemp" AS "maxtemp"
    FROM "public"."v_temperatures_2016" "v_temperatures_2016"
  ) "t1"
   UNION  ALL
  SELECT "t2"."city" AS "city", "t2"."maxtemp" AS "maxtemp"
  FROM (
    SELECT CAST("v_temperatures_2017"."city" AS TEXT) AS "city",
      "v_temperatures_2017"."maxtemp" AS "maxtemp"
    FROM "public"."v_temperatures_2017" "v_temperatures_2017"
  ) "t2"
) "t0"
GROUP BY 2;
                                           QUERY PLAN
------------------------------------------------------------------------------------------------
 HashAggregate  (cost=356.50..359.00 rows=200 width=64)
   Output: avg(temperatures_2016.maxtemp), temperatures_2016.city
   Group Key: temperatures_2016.city
   ->  Append  (cost=100.00..342.85 rows=2730 width=36)
         ->  Foreign Scan on public.temperatures_2016  (cost=100.00..150.95 rows=1365 width=36)
               Output: temperatures_2016.maxtemp, temperatures_2016.city
               Remote SQL: SELECT city, maxtemp FROM public.temperatures_2016
         ->  Foreign Scan on public.temperatures_2017  (cost=100.00..150.95 rows=1365 width=36)
               Output: temperatures_2017.maxtemp, temperatures_2017.city
               Remote SQL: SELECT city, maxtemp FROM public.temperatures_2017
(10 rows)

Still not the plan we want: Tableau does not push the aggregation into the individual union branches, so the raw rows still stream to the master node and the HashAggregate runs there.

Tableau Custom SQL

With Tableau Custom SQL we can control exactly which queries get executed. No surprise here: by setting a proper custom SQL, our MPP cluster runs the right queries with the right pushdown.

postgres=# explain verbose SELECT CAST("Custom SQL Query"."city" AS TEXT) AS "city",
  SUM("Custom SQL Query"."maxtemp") AS "sum:maxtemp:ok"
FROM (
  select sum(maxtemp) maxtemp, sum(mintemp) mintemp,
    count(maxtemp) maxtemp_count, count(mintemp) mintemp_count,
  city, at from temperatures_2016 group by 5,6
  union all
  select sum(maxtemp) maxtemp, sum(mintemp) mintemp,
    count(maxtemp) maxtemp_count, count(mintemp) mintemp_count,
  city, at from temperatures_2017 group by 5,6
) "Custom SQL Query"
GROUP BY 1;
                                                                      QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=284.80..287.30 rows=200 width=64)
   Output: "*SELECT* 1".city, sum("*SELECT* 1".maxtemp)
   Group Key: "*SELECT* 1".city
   ->  Append  (cost=109.60..282.80 rows=400 width=40)
         ->  Subquery Scan on "*SELECT* 1"  (cost=109.60..140.40 rows=200 width=40)
               Output: "*SELECT* 1".city, "*SELECT* 1".maxtemp
               ->  Foreign Scan  (cost=109.60..138.40 rows=200 width=68)
                     Output: (sum(temperatures_2016.maxtemp)), NULL::bigint, NULL::bigint, NULL::bigint, temperatures_2016.city, temperatures_2016.at
                     Relations: Aggregate on (public.temperatures_2016)
                     Remote SQL: SELECT sum(maxtemp), NULL::bigint, city, at FROM public.temperatures_2016 GROUP BY 3, 4
         ->  Subquery Scan on "*SELECT* 2"  (cost=109.60..140.40 rows=200 width=40)
               Output: "*SELECT* 2".city, "*SELECT* 2".maxtemp
               ->  Foreign Scan  (cost=109.60..138.40 rows=200 width=68)
                     Output: (sum(temperatures_2017.maxtemp)), NULL::bigint, NULL::bigint, NULL::bigint, temperatures_2017.city, temperatures_2017.at
                     Relations: Aggregate on (public.temperatures_2017)
                     Remote SQL: SELECT sum(maxtemp), NULL::bigint, city, at FROM public.temperatures_2017 GROUP BY 3, 4
(16 rows)

This is some sort of victory (a partial one, maybe). With a combination of Parameters (even dynamic parameters), we can tackle complex use cases on large data sets. But we have to predefine the aggregate calculations (sum, count) in advance, add all attributes to the SELECT's field list, and define parameters for all quick filters. Doable, but not comfortable. Still, sacrificing convenience for extreme performance in critical use cases - that's what we do all the time.
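As an illustration of why the counts are predefined: AVG cannot be computed as an average of the workers' averages; it has to be rebuilt as sum-of-sums divided by sum-of-counts. A hedged sketch of a Custom SQL that does exactly that over the pushed-down partials:

SELECT city,
       SUM(maxtemp)::numeric / NULLIF(SUM(maxtemp_count), 0) AS avg_maxtemp
FROM (
  SELECT sum(maxtemp) maxtemp, count(maxtemp) maxtemp_count, city
    FROM temperatures_2016 GROUP BY 3
  UNION ALL
  SELECT sum(maxtemp) maxtemp, count(maxtemp) maxtemp_count, city
    FROM temperatures_2017 GROUP BY 3
) partials
GROUP BY 1;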

Conclusion

After spending a few hours on it, I was able to build a shared-nothing, MPP database cluster powered by distributed hyper databases. It supports parallel ingestion and horizontal and vertical scaling, spreading queries across servers in a kubernetes cluster. The only drawback is the custom-SQL-based data source definition in Tableau, since Postgres has limitations on aggregate pushdown for partitioned tables.

In the next article, I will run a few performance measurements to see how this system scales with the number of workers, using real-life use cases.

All sources are uploaded to: https://github.com/tfoldi/tableau-hyperd-docker

If you have any questions, feel free to drop a comment - I am happy to answer.
