Tableau's Hyper Database ("Extract") is a great engine; it's one of the reasons people are obsessed with Tableau analytics. However, being a single-node database server, it has its limits in performance and data size - limits that we are going to break in this blog post. By adding multi-node capacity, partitioning and distributed query processing, Hyper will act like a real enterprise MPP data warehouse running on modern infrastructure like Kubernetes.
By the end of this post, you will know how to build an MPP version of the Hyper database and how to design parallel ingestion for it.
Tableau's Hyper database was built from scratch with de facto standard functionalities (LLVM code generation, in-memory capabilities, columnar data store, etc.) plus Postgres network protocol and SQL dialect compatibility. It is a pretty neat database: fast, convenient and standard. With the new Extract API you can issue a full set of Postgres-like SQL statements, including COPY, to move data in and out in bulk. No doubt, with minor tweaking we should be able to access Hyper with libpq-based applications like psql and psqlODBC.
What is missing here is horizontal scalability: leveraging multiple worker nodes to speed up single queries and to scale data ingestion. Traditionally, this is what the MPP (Massively Parallel Processing) architecture provides: multiple database worker nodes with dedicated, partial datasets, plus aggregator/master node(s) that combine the results from the workers. Imagine the following SQL statement:
SELECT sum(profit), state FROM sales GROUP BY state
In a shared-nothing world the query first hits the aggregator node, which dispatches it to the worker nodes:
SELECT sum(profit), state FROM sales GROUP BY state -- aggregator executes this query on worker1, which holds the East Coast states
--
SELECT sum(profit), state FROM sales GROUP BY state -- aggregator executes this query on worker2, which holds the West Coast states
Workers execute the queries on their partial sets of data and return the aggregated datasets to the master node. The master node then combines the aggregated values as an additional step. Again, most of the work happens on the workers, while the master node merely aggregates the aggregates. Pretty simple. This is how Teradata, Yellowbrick, Vertica, Hadoop Hive, Spark, Redshift and Greenplum work.
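To make the "aggregate the aggregates" step concrete, here is a minimal Python sketch of what the master does with the partial results. The worker result sets below are made-up values, not output from a real cluster; note that when the shard key is something other than state (e.g. a date), rows for the same state can land on several workers, so the master really does have to merge:

```python
from collections import defaultdict

def combine_partials(*worker_results):
    """Master-side step: merge per-worker partial sums into final totals."""
    totals = defaultdict(int)
    for rows in worker_results:
        for state, partial_sum in rows:
            totals[state] += partial_sum
    return dict(totals)

# Hypothetical partial results each worker would return for
# SELECT sum(profit), state FROM sales GROUP BY state
worker1 = [("NY", 120), ("TX", 50)]   # shard 1
worker2 = [("CA", 200), ("TX", 30)]   # shard 2: TX rows live on both shards

print(combine_partials(worker1, worker2))
```

Each worker only ships one row per state, so the master's merge work stays tiny regardless of how many raw rows the workers scanned.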
But what would it take to convert Hyper Database to this architecture?
We need a few things here to make this conversion happen:
- Build worker nodes that are generic Hyper Database services running independently from each other:
- Create a Docker image of Hyper Database that is remotely accessible,
- Deploy it on Kubernetes as a Service to manage its elasticity.
- Build a master node that acts as an aggregator. Postgres 11 has database-link-like functionality (foreign data wrappers) that can push queries down to other Postgres databases - and Hyper acts like a Postgres too!
- Deploy Postgres 11 on Kubernetes,
- Set up the foreign data wrapper for Hyper,
- Import and synchronize metadata across the Hyper workers and the Postgres master node,
- Validate that aggregation happens on the shared-nothing worker nodes.
Hyper Database is part of all Tableau products: Desktop, Server, Online, Prep and the standalone Extract API package. The easiest way is to start from an empty Docker template like centos, download the Extract API, initialize a new Hyper database with hyperd configure and start the service with hyperd start. The final Dockerfile will look like:
FROM centos:latest
MAINTAINER "Tamas Foldi" <tfoldi@starschema.net>
ADD https://downloads.tableau.com/tssoftware/extractapi-linux-x86_64-2019-2-2.tar.gz /tmp
RUN mkdir /hyper && \
tar xvzf /tmp/extractapi-linux-x86_64-2019-2-2.tar.gz --strip 1 -C /hyper && \
/hyper/bin/hyper/hyperd configure --log-dir /hyper/ -d /hyper/hyper.hyper --no-password --no-ssl --init-user tableau_internal_user
EXPOSE 7483
CMD /hyper/bin/hyper/hyperd start --log-dir /hyper/ -d /hyper/hyper.hyper --skip-license --no-password --no-ssl --listen-connection tab.tcp://0.0.0.0:7483
The image is published to Docker Hub, so you can simply pull it without building it yourself. After starting the image we can easily connect to it:
$ docker run -d -t tfoldi/tableau-hyperd-docker:release
490ea4faea9704683c35a78809b964737d3fca4156ffda3838b1fa54443b5d64
$ docker inspect 490ea4faea9704683c35a78809b964737d3fca4156ffda3838b1fa54443b5d64 | grep IPAddress
"SecondaryIPAddresses": null,
"IPAddress": "172.17.0.2",
"IPAddress": "172.17.0.2",
$ psql --host 172.17.0.2 --port 7483 --user tableau_internal_user
psql (9.2.24, server 9.1.0)
WARNING: psql version 9.2, server version 9.1.
Some psql features might not work.
Type "help" for help.
tableau_internal_user=# create table foo (bar integer);
CREATE TABLE
We can easily start a new Hyper daemon and connect to its standard port. Now let's deploy it on Kubernetes with the following Service and Deployment configuration:
Service definition
apiVersion: v1
kind: Service
metadata:
name: hyperd-mpp
labels:
app: hyperd-mpp
role: worker
spec:
ports:
- port: 7483
protocol: TCP
selector:
app: hyperd-mpp
role: worker
Deployment/Pod definition. The replicas key defines how many hyperd workers should run at any given time.
apiVersion: apps/v1
kind: Deployment
metadata:
name: hyperd-mpp-deployment
labels:
app: hyperd-mpp
role: worker
spec:
replicas: 2
selector:
matchLabels:
app: hyperd-mpp
role: worker
template:
metadata:
labels:
app: hyperd-mpp
role: worker
spec:
containers:
- name: hyperd-worker
image: tfoldi/tableau-hyperd-docker
ports:
- containerPort: 7483
If all looks good, we can apply these and have our Hyper workers up and running:
[tfoldi@kompi]% kubectl get all -l app=hyperd-mpp,role=worker
NAME READY STATUS RESTARTS AGE
pod/hyperd-mpp-deployment-c5fd49756-ccg9p 1/1 Running 0 20h
pod/hyperd-mpp-deployment-c5fd49756-x6q77 1/1 Running 0 20h
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
service/hyperd-mpp ClusterIP 10.100.200.35 <none> 7483/TCP 20h
NAME READY UP-TO-DATE AVAILABLE AGE
deployment.apps/hyperd-mpp-deployment 2/2 2 2 20h
NAME DESIRED CURRENT READY AGE
replicaset.apps/hyperd-mpp-deployment-c5fd49756 2 2 2 20h
Lovely. Now let's move on to the more interesting part: how we are going to add an aggregator on top of this.
The first step is to quickly deploy a Postgres 11 container to our Kubernetes cluster using this Service file with this Deployment file. If all looks good, we should have a similar topology in our Kubernetes cluster:
$ kubectl get all -l app=hyperd-mpp
NAME READY STATUS RESTARTS AGE
pod/hyperd-mpp-deployment-c5fd49756-ccg9p 1/1 Running 0 20h
pod/hyperd-mpp-deployment-c5fd49756-x6q77 1/1 Running 0 20h
pod/postgres-master-69f8d75f9c-bltwx 1/1 Running 0 20h
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
service/hyperd-mpp ClusterIP 10.100.200.35 <none> 7483/TCP 20h
service/postgres-master LoadBalancer 10.100.43.171 a52bd342.kubeschema 5432:31679/TCP 19h
NAME READY UP-TO-DATE AVAILABLE AGE
deployment.apps/hyperd-mpp-deployment 2/2 2 2 20h
deployment.apps/postgres-master 1/1 1 1 20h
NAME DESIRED CURRENT READY AGE
replicaset.apps/hyperd-mpp-deployment-c5fd49756 2 2 2 20h
replicaset.apps/postgres-master-69f8d75f9c 1 1 1 20h
Our excellent MPP infrastructure is up; we have two Hyper workers and a Postgres master. The next step is to configure and test it. Let's add some test data to the Hyper nodes first. For my very basic first test, I will distribute the data by date (the shard key): worker1 will hold the data for 2016 while worker2 will hold 2017.
To create a table on one of the nodes, connect to its console, open psql and issue a CREATE TABLE statement:
$ kubectl exec -it hyperd-mpp-deployment-c5fd49756-ccg9p -- /bin/sh
sh-4.2# psql --host localhost --port 7483 --user tableau_internal_user
tableau_internal_user=# CREATE TABLE temperatures_2017( at date,
city text,
mintemp integer,
maxtemp integer
);
CREATE TABLE
Then do the same for 2016 on the other worker.
We can insert the data directly from here, or later through the master node. The typical MPP data ingestion flow is to extract the data in parallel from downstream systems and load it into the workers independently. As you can imagine, running 10-20 COPY statements in parallel on different workers will speed up the extract creation process by 10-20x.
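Such a parallel load needs a routing step that sends each row to the worker owning its shard. Here is a small Python sketch of that step under the date-based sharding used in this post; the shard map and worker names are just labels for illustration, not real endpoints:

```python
from datetime import date

# Hypothetical shard map: each year lives on exactly one worker
SHARD_MAP = {2016: "hyper1", 2017: "hyper2"}

def partition_rows(rows):
    """Split rows into per-worker batches that can be COPY'd in parallel."""
    batches = {worker: [] for worker in SHARD_MAP.values()}
    for row in rows:
        at = row[0]                          # shard key: the 'at' date column
        batches[SHARD_MAP[at.year]].append(row)
    return batches

rows = [
    (date(2016, 8, 3), "London", 63, 73),
    (date(2017, 8, 3), "London", 63, 90),
]
print(partition_rows(rows))
```

Once the rows are batched per worker, each batch can be streamed to its own worker with an independent COPY, which is where the 10-20x parallel speedup comes from.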
Now it's time to tell our aggregator where its Hyper workers are. After logging on to our Postgres 11 instance, we add the default foreign data wrapper to the system, define the remote servers and create user mappings. The whole process is described in this fantastic document, although with another Postgres database instead of Hyper. However, since Hyper acts like Postgres, we can do the same.
% psql --host 100.10.10.23 -U postgres --password
postgres=# CREATE EXTENSION postgres_fdw;
CREATE EXTENSION
postgres=# CREATE SERVER hyper1 FOREIGN DATA WRAPPER postgres_fdw
OPTIONS (host '192.168.143.173', port '7483', dbname '/hyper/hyper.hyper');
CREATE SERVER
postgres=# CREATE USER MAPPING FOR postgres SERVER hyper1 OPTIONS (user 'tableau_internal_user');
CREATE USER MAPPING
-- repeat for hyper2 database
The next step is to create the table for external consumers (= Tableau dashboards). Let's call our table temperatures and add temperatures_2016 and temperatures_2017 from the Hyper workers as partitions of this table.
postgres=# CREATE TABLE temperatures (
at date,
city text,
mintemp integer,
maxtemp integer
)
PARTITION BY RANGE (at);
CREATE TABLE
postgres=# CREATE FOREIGN TABLE temperatures_2016
PARTITION OF temperatures
FOR VALUES FROM ('2016-01-01') TO ('2017-01-01')
SERVER hyper1;
CREATE FOREIGN TABLE
postgres=# CREATE FOREIGN TABLE temperatures_2017
PARTITION OF temperatures
FOR VALUES FROM ('2017-01-01') TO ('2018-01-01')
SERVER hyper2;
CREATE FOREIGN TABLE
Everything seems fine; time to add a few records:
postgres=# INSERT INTO temperatures (at, city, mintemp, maxtemp) VALUES ('2016-08-03', 'London', 63, 73);
INSERT 0 1
postgres=# INSERT INTO temperatures VALUES ('2017-08-03', 'London', 63, 90);
INSERT 0 1
Time to test whether we can issue one single SELECT statement that collects the records from the underlying Hyper workers:
postgres=# select * from temperatures;
at | city | mintemp | maxtemp
------------+--------+---------+---------
2016-08-03 | London | 63 | 73
2017-08-03 | London | 63 | 90
(2 rows)
postgres=# explain (verbose, costs off) select * from temperatures;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------
Append
-> Foreign Scan on public.temperatures_2016
Output: temperatures_2016.at, temperatures_2016.city, temperatures_2016.mintemp, temperatures_2016.maxtemp
Remote SQL: SELECT at, city, mintemp, maxtemp FROM public.temperatures_2016
-> Foreign Scan on public.temperatures_2017
Output: temperatures_2017.at, temperatures_2017.city, temperatures_2017.mintemp, temperatures_2017.maxtemp
Remote SQL: SELECT at, city, mintemp, maxtemp FROM public.temperatures_2017
(7 rows)
As we can see, Postgres splits the query into two separate foreign scans, then combines them into a single result set.
However, we have an issue with aggregation push-down:
postgres=# explain (verbose, costs off) select avg(maxtemp),city from temperatures group by 2;
QUERY PLAN
------------------------------------------------------------------------------
HashAggregate
Output: avg(temperatures_2016.maxtemp), temperatures_2016.city
Group Key: temperatures_2016.city
-> Append
-> Foreign Scan on public.temperatures_2016
Output: temperatures_2016.city, temperatures_2016.maxtemp
Remote SQL: SELECT city, maxtemp FROM public.temperatures_2016
-> Foreign Scan on public.temperatures_2017
Output: temperatures_2017.city, temperatures_2017.maxtemp
Remote SQL: SELECT city, maxtemp FROM public.temperatures_2017
(10 rows)
As we can see, when issuing aggregated queries on a partitioned table backed by foreign tables, the aggregation is not executed on the workers. This is a showstopper for us: given the use case, we cannot do all the aggregation on the master node. We expect billions of rows on each worker; streaming them to the master would eliminate all the performance gain we are hoping for. Still, there is hope: if we execute aggregate statements directly on the foreign tables, Postgres correctly pushes them down to the workers:
postgres=# explain (verbose, costs off) select avg(maxtemp),city from temperatures_2017 group by 2;
QUERY PLAN
----------------------------------------------------------------------------------
Foreign Scan
Output: (avg(maxtemp)), city
Relations: Aggregate on (public.temperatures_2017)
Remote SQL: SELECT avg(maxtemp), city FROM public.temperatures_2017 GROUP BY 2
(4 rows)
It seems we should change strategy here: instead of using partitioned tables, we might need to create a union from the foreign tables directly in Tableau.
However, foreign tables are not visible in Tableau Desktop by default. This is a product issue: Tableau's schema inspection queries do not include foreign tables. A quick workaround is to create views from these foreign tables:
postgres=# create view v_temperatures_2017 as select * from temperatures_2017;
CREATE VIEW
postgres=# create view v_temperatures_2016 as select * from temperatures_2016;
CREATE VIEW
It seems the workaround works (that is why it is called a workaround):
After quickly building the union, we can see what queries are generated by Tableau and how they get executed by our Hyper MPP cluster:
postgres=# explain verbose SELECT AVG("t0"."maxtemp") AS "avg:maxtemp:ok",
"t0"."city" AS "city"
FROM (
SELECT "t1"."city" AS "city", "t1"."maxtemp" AS "maxtemp"
FROM (
SELECT CAST("v_temperatures_2016"."city" AS TEXT) AS "city",
"v_temperatures_2016"."maxtemp" AS "maxtemp"
FROM "public"."v_temperatures_2016" "v_temperatures_2016"
) "t1"
UNION ALL
SELECT "t2"."city" AS "city", "t2"."maxtemp" AS "maxtemp"
FROM (
SELECT CAST("v_temperatures_2017"."city" AS TEXT) AS "city",
"v_temperatures_2017"."maxtemp" AS "maxtemp"
FROM "public"."v_temperatures_2017" "v_temperatures_2017"
) "t2"
) "t0"
GROUP BY 2;
QUERY PLAN
------------------------------------------------------------------------------------------------
HashAggregate (cost=356.50..359.00 rows=200 width=64)
Output: avg(temperatures_2016.maxtemp), temperatures_2016.city
Group Key: temperatures_2016.city
-> Append (cost=100.00..342.85 rows=2730 width=36)
-> Foreign Scan on public.temperatures_2016 (cost=100.00..150.95 rows=1365 width=36)
Output: temperatures_2016.maxtemp, temperatures_2016.city
Remote SQL: SELECT city, maxtemp FROM public.temperatures_2016
-> Foreign Scan on public.temperatures_2017 (cost=100.00..150.95 rows=1365 width=36)
Output: temperatures_2017.maxtemp, temperatures_2017.city
Remote SQL: SELECT city, maxtemp FROM public.temperatures_2017
(10 rows)
Still not the plan we want: Tableau's generated query is again aggregated on the master node, not on the workers.
With Tableau Custom SQL we can control what queries need to be executed. No surprise here: by setting a proper custom SQL, our MPP cluster runs the right queries with the right push-down.
postgres=# explain verbose SELECT CAST("Custom SQL Query"."city" AS TEXT) AS "city",
SUM("Custom SQL Query"."maxtemp") AS "sum:maxtemp:ok"
FROM (
select sum(maxtemp) maxtemp, sum(mintemp) mintemp,
count(maxtemp) maxtemp_count, count(mintemp) mintemp_count,
city, at from temperatures_2016 group by 5,6
union all
select sum(maxtemp) maxtemp, sum(mintemp) mintemp,
count(maxtemp) maxtemp_count, count(mintemp) mintemp_count,
city, at from temperatures_2017 group by 5,6
) "Custom SQL Query"
GROUP BY 1;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------
HashAggregate (cost=284.80..287.30 rows=200 width=64)
Output: "*SELECT* 1".city, sum("*SELECT* 1".maxtemp)
Group Key: "*SELECT* 1".city
-> Append (cost=109.60..282.80 rows=400 width=40)
-> Subquery Scan on "*SELECT* 1" (cost=109.60..140.40 rows=200 width=40)
Output: "*SELECT* 1".city, "*SELECT* 1".maxtemp
-> Foreign Scan (cost=109.60..138.40 rows=200 width=68)
Output: (sum(temperatures_2016.maxtemp)), NULL::bigint, NULL::bigint, NULL::bigint, temperatures_2016.city, temperatures_2016.at
Relations: Aggregate on (public.temperatures_2016)
Remote SQL: SELECT sum(maxtemp), NULL::bigint, city, at FROM public.temperatures_2016 GROUP BY 3, 4
-> Subquery Scan on "*SELECT* 2" (cost=109.60..140.40 rows=200 width=40)
Output: "*SELECT* 2".city, "*SELECT* 2".maxtemp
-> Foreign Scan (cost=109.60..138.40 rows=200 width=68)
Output: (sum(temperatures_2017.maxtemp)), NULL::bigint, NULL::bigint, NULL::bigint, temperatures_2017.city, temperatures_2017.at
Relations: Aggregate on (public.temperatures_2017)
Remote SQL: SELECT sum(maxtemp), NULL::bigint, city, at FROM public.temperatures_2017 GROUP BY 3, 4
(16 rows)
This is some sort of (maybe partial?) victory. With a combination of parameters (even dynamic parameters), we can tackle complex use cases on large data sets. But we have to predefine the aggregate calculations (sum, count) in advance, add all attributes to the SELECT's field list and define parameters for all quick filters. Doable, but not comfortable. Still, sacrificing convenience for extreme performance in critical use cases: that's what we do all the time.
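The reason both sum and count have to travel from the workers is that averages cannot be combined directly: the global average must be rebuilt from partial sums and counts on the master. A quick Python illustration, with invented partials:

```python
def combine_avg(partials):
    """Rebuild a global average from per-worker (sum, count) pairs."""
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count

# worker1: 3 rows summing to 240 (local avg 80)
# worker2: 1 row of 90 (local avg 90)
partials = [(240, 3), (90, 1)]
print(combine_avg(partials))   # correct global average
print((80 + 90) / 2)           # naive average-of-averages gives a different, wrong answer
```

This is exactly why the custom SQL above ships sum(maxtemp) and count(maxtemp) per group instead of avg(maxtemp): with unequal row counts per worker, averaging the per-worker averages would be wrong.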
After spending a few hours, I was able to build a Hyper-powered, distributed, shared-nothing MPP database cluster. It supports parallel ingestion and horizontal and vertical scaling, spreading queries across servers in a Kubernetes cluster. The only drawback is the custom SQL based data source definition in Tableau, since Postgres has some limitations on aggregate push-down for partitioned tables.
In the next article, I will perform a few performance measurements to see how this system scales along with the number of workers using real-life use cases.
All sources are uploaded to: https://github.com/tfoldi/tableau-hyperd-docker
If you have any questions feel free to drop a comment, I am happy to answer.