Scaling out Tableau Extracts – Building a distributed, multi-node MPP Hyper Cluster
August 11, 2019

Tableau Hyper Database (“Extract”) is a great engine; it’s one of the reasons people are obsessed with Tableau analytics. However, being a single-node database server, it has its limits in both performance and data size – limits that we are going to break in this blog post. By adding multi-node capacity, partitioning and distributed query processing, Hyper will behave like a real enterprise MPP data warehouse – running on modern infrastructure like kubernetes.

By the end of this post, you will know how to build an MPP version of the Hyper database and use it with your dashboards.

Background and Fundamentals

Tableau’s Hyper database was built from scratch with de facto standard functionality (LLVM code generation, in-memory capabilities, columnar data store, etc.) and with Postgres network protocol and SQL dialect compatibility. It is a pretty neat database: fast, convenient and standards-based. With the new Extract API you can issue a full set of Postgres-like SQL statements, including COPY, to move data in and out in bulk. No doubt, with minor tweaking we should be able to access Hyper with libpq-based applications like psql and PostgresODBC.

MPP, Shared-nothing

What is missing is horizontal scalability: the ability to leverage multiple worker nodes to speed up single queries and to provide parallel capacity for data ingestion. Traditionally, this is what the MPP (Massively Parallel Processing) architecture provides: multiple database worker nodes, each holding a dedicated partial dataset, plus aggregator/master node(s) that combine the results coming back from the workers. Imagine the following SQL statement:
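(The temperatures table and its columns are made up for this illustration, but they match the test schema used later in the post.)

```sql
-- One logical query against the whole data set; in an MPP setup each worker
-- computes the partial aggregate for its own shard, and the master merges them
SELECT city, avg(temperature) AS avg_temperature
FROM temperatures
GROUP BY city;
```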

In a shared-nothing world the query first hits the aggregator node, which dispatches it to the worker nodes.

The workers execute the query on their partial set of data and return already-aggregated result sets to the master node. The master node then combines these aggregated values as an additional step. Again, most of the work happens on the workers, while the master node merely aggregates the aggregates. Pretty simple. This is how Teradata, Yellowbrick, Vertica, Hadoop Hive, Spark, Redshift or Greenplum work.

But what would it take to convert Hyper Database to this architecture?

How are we going to do this?

We need a few things here to make this conversion happen:

  1. Build worker nodes: generic Hyper Database services running independently from each other
    • Create a docker image from Hyper Database that is remotely accessible,
    • Deploy it on kubernetes as a Service to manage its elasticity
  2. Build a master node that will act as an aggregator. Postgres 11 has database-link-like functionality (foreign data wrappers) that can push queries down to other Postgres databases (and Hyper acts like a Postgres too!).
    • Deploy Postgres 11 on kubernetes,
    • Set up a foreign data wrapper for Hyper,
    • Import and synchronize metadata across the Hyper workers and the Postgres master node
  3. Validate that aggregation happens on the shared-nothing worker nodes

Building Hyper Docker image

Hyper Database is part of all Tableau products: Desktop, Server, Online, Prep, and the standalone Extract API package. The easiest way is to start from a bare docker base image like centos, download the Extract API, initialize a new hyper database with hyperd configure and start the service with hyperd start. The final Dockerfile will look like this:
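Something along these lines – the download URL, install paths and hyperd arguments below are placeholders, so treat this as a sketch and check the repository linked at the end of the post for the exact Dockerfile:

```dockerfile
# Sketch of the worker image; URL, paths and hyperd arguments are illustrative
FROM centos:7

RUN yum install -y wget tar gzip && yum clean all

# Download and unpack the standalone Extract API package that ships hyperd
RUN wget -q -O /tmp/extractapi.tar.gz \
      https://downloads.tableau.com/tssoftware/extractapi-linux-x86_64.tar.gz && \
    mkdir -p /opt/hyper && \
    tar xzf /tmp/extractapi.tar.gz -C /opt/hyper --strip-components=1

# Initialize an empty hyper database, then run the daemon in the foreground,
# listening on its Postgres-compatible port (7483 in this sketch)
RUN mkdir -p /data && /opt/hyper/bin/hyper/hyperd configure /data/hyper
EXPOSE 7483
CMD ["/opt/hyper/bin/hyper/hyperd", "start", "/data/hyper"]
```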

The image is deployed to Docker Hub, so you can simply pull it without building it yourself. After starting the image we can easily connect to it:
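Roughly like this – the image name, port and credentials are illustrative; since hyperd speaks the Postgres wire protocol, a stock psql client is enough to connect:

```bash
# Pull and start a worker locally (image name and port are illustrative)
docker pull tfoldi/tableau-hyperd
docker run -d --name hyper-worker -p 7483:7483 tfoldi/tableau-hyperd

# Connect with a plain Postgres client – hyperd is libpq compatible
psql -h localhost -p 7483 -d hyper.db -U tableau
```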

We can easily start a new hyper daemon and connect to its standard port. Now let’s deploy it on kubernetes with the following Service and Deployment configuration:

Service definition
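Something like the following – names, labels and the port are assumptions; and since each worker will hold its own shard, in practice you address the worker pods individually (for example via a headless Service or one Service per worker):

```yaml
# Sketch of the worker Service; name, labels and port are illustrative
apiVersion: v1
kind: Service
metadata:
  name: hyperd
  labels:
    app: hyperd
spec:
  selector:
    app: hyperd
  ports:
    - name: hyper
      protocol: TCP
      port: 7483
      targetPort: 7483
```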

Deployment/Pod definition. The replicas key defines how many hyperd workers should run at a given time.
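A minimal sketch – the image name and resource settings are assumptions; replicas is the knob that controls the number of Hyper workers:

```yaml
# Sketch of the worker Deployment; image name is illustrative
apiVersion: apps/v1
kind: Deployment
metadata:
  name: hyperd
spec:
  replicas: 2               # how many hyperd workers should run at a given time
  selector:
    matchLabels:
      app: hyperd
  template:
    metadata:
      labels:
        app: hyperd
    spec:
      containers:
        - name: hyperd
          image: tfoldi/tableau-hyperd   # illustrative image name
          ports:
            - containerPort: 7483
```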

If all looks good, we can apply these and have our Hyper workers up and running:
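Assuming the worker manifests above were saved as hyperd-service.yaml and hyperd-deployment.yaml (the file names are just placeholders):

```bash
kubectl apply -f hyperd-service.yaml
kubectl apply -f hyperd-deployment.yaml

# The worker pods should show up shortly
kubectl get pods -l app=hyperd
```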

Lovely. Now let’s move to the more interesting part: how to add an aggregator on top of this.

The Aggregator / Master node

The first step is to quickly deploy a Postgres 11 container to our kubernetes cluster using this Service file and this Deployment file; a minimal sketch of these manifests follows below. If all looks good, we should end up with two hyper worker pods and a Postgres master pod in our kubernetes cluster:
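A minimal sketch of the master manifests – the names, image tag and password handling are assumptions; the actual files are in the repository linked at the end of the post:

```yaml
# Sketch of the master/aggregator; use a Secret for the password in practice
apiVersion: apps/v1
kind: Deployment
metadata:
  name: postgres-master
spec:
  replicas: 1
  selector:
    matchLabels:
      app: postgres-master
  template:
    metadata:
      labels:
        app: postgres-master
    spec:
      containers:
        - name: postgres
          image: postgres:11
          env:
            - name: POSTGRES_PASSWORD
              value: changeme           # illustrative only
          ports:
            - containerPort: 5432
---
apiVersion: v1
kind: Service
metadata:
  name: postgres-master
spec:
  selector:
    app: postgres-master
  ports:
    - port: 5432
      targetPort: 5432
```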

Testing the system

Our excellent MPP infrastructure is up: we have two hyper workers and a Postgres master. The next step is to configure and test it. Let’s add some test data to the hyper nodes first. For my very basic first test, I will distribute the data by date (the shard key). Worker1 will hold data for 2016 while worker2 will hold data for 2017.

Creating the tables

To create a table on one of the nodes, connect to its console, open psql and issue a CREATE TABLE statement.
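For example, on the worker that will hold the 2017 shard – the column list is made up for this walkthrough, and the kubectl command in the comment is illustrative:

```sql
-- Run inside the worker pod, e.g. via:
--   kubectl exec -it <hyperd-pod> -- psql -p 7483 hyper.db
CREATE TABLE temperatures_2017 (
    measured_at date,
    city        text,
    temperature double precision
);
```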

Then do it for 2016 on the other worker.

We can insert the data directly from here, or later through the master node. A typical MPP data ingestion flow is to extract the data in parallel from downstream systems and upload it to the workers independently. As you can imagine, running 10-20 COPY statements in parallel on different workers can speed up the extract creation process by 10-20x.
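A sketch of such a load, run directly on one worker (the file path and CSV options are illustrative):

```sql
-- Bulk-load this worker's shard straight from a local file
COPY temperatures_2017 FROM '/data/temperatures_2017.csv' WITH (FORMAT csv, HEADER true);
```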

Linking workers with the master/aggregator

Now it’s time to tell our aggregator where its hyper workers are. After logging on to our postgres11 pod, we add the default foreign data wrapper (postgres_fdw) to the system, define the remote servers and set up user mappings. The whole process is described in this fantastic document, albeit with another Postgres database instead of Hyper. However, as Hyper behaves like Postgres, we can follow the same steps.
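The gist of it, with illustrative host names, port and credentials (the workers are addressed by their kubernetes service/pod names here):

```sql
-- On the Postgres 11 master: register each Hyper worker as a foreign server
CREATE EXTENSION IF NOT EXISTS postgres_fdw;

CREATE SERVER hyper_worker1 FOREIGN DATA WRAPPER postgres_fdw
    OPTIONS (host 'hyperd-worker1', port '7483', dbname 'hyper.db');
CREATE SERVER hyper_worker2 FOREIGN DATA WRAPPER postgres_fdw
    OPTIONS (host 'hyperd-worker2', port '7483', dbname 'hyper.db');

-- Map the local postgres user to a user on the workers
CREATE USER MAPPING FOR postgres SERVER hyper_worker1 OPTIONS (user 'tableau');
CREATE USER MAPPING FOR postgres SERVER hyper_worker2 OPTIONS (user 'tableau');
```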

The next step is to create the table for external consumers (= Tableau dashboards). Let’s call our table temperatures and add temperatures_2016 and temperatures_2017 from the hyper workers as partitions of this table.
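Roughly like this, using declarative partitioning with foreign tables as partitions (the column list matches the illustrative schema used earlier):

```sql
-- Parent table on the master, range-partitioned by date
CREATE TABLE temperatures (
    measured_at date,
    city        text,
    temperature double precision
) PARTITION BY RANGE (measured_at);

-- Each partition is a foreign table living on one of the Hyper workers
CREATE FOREIGN TABLE temperatures_2016
    PARTITION OF temperatures
    FOR VALUES FROM ('2016-01-01') TO ('2017-01-01')
    SERVER hyper_worker1;

CREATE FOREIGN TABLE temperatures_2017
    PARTITION OF temperatures
    FOR VALUES FROM ('2017-01-01') TO ('2018-01-01')
    SERVER hyper_worker2;
```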

Everything seems fine, time to add a few records:
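A couple of illustrative rows, inserted through the master; Postgres 11 routes each row to the matching partition, i.e. to the right Hyper worker:

```sql
INSERT INTO temperatures VALUES ('2016-07-01', 'Budapest', 31.5);
INSERT INTO temperatures VALUES ('2017-07-01', 'Budapest', 29.0);
```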

Time to test whether we can issue a single SELECT statement that collects the records from the underlying hyper workers:
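Nothing special is needed on the client side; a plain SELECT against the parent table will do, and EXPLAIN shows how it is executed:

```sql
EXPLAIN (VERBOSE)
SELECT * FROM temperatures;
-- Expect one Foreign Scan per partition, i.e. one per Hyper worker
```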

As we can see, Postgres splits the query into two separate foreign scans, then combines them into a single result set.

However, we have an issue here with aggregation push-down:
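The same kind of check, now with an aggregate over the partitioned table (the query itself is illustrative):

```sql
EXPLAIN (VERBOSE)
SELECT city, avg(temperature)
FROM temperatures
GROUP BY city;
-- The aggregate sits on top of the Foreign Scans, i.e. it runs on the master
```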

As we can see, when issuing aggregated queries on a partitioned table backed by foreign tables, the aggregation is not executed on the workers. This is a showstopper for us: given the use case, we cannot have all the aggregation run on the master node. We expect billions of rows on each worker, and streaming them to the master would eliminate all the performance gain we are hoping for. There is some hope, though: if we execute aggregate statements directly on the foreign tables, Postgres correctly pushes them down to the workers:
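For example, aggregating one of the foreign tables directly (postgres_fdw has supported aggregate push-down since Postgres 10):

```sql
EXPLAIN (VERBOSE)
SELECT city, avg(temperature)
FROM temperatures_2017
GROUP BY city;
-- Here the whole aggregate is shipped to the worker as a single foreign query
```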

Tableau Union

It seems we should change strategy here: instead of using partitioned tables, we might need to create a union from the foreign tables directly in Tableau.

However, foreign tables are not visible in Tableau Desktop by default. This is a product issue: Tableau’s schema inspection queries do not include foreign tables. A quick workaround is to create views on top of these foreign tables:
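A trivial pair of pass-through views is enough (the view names are up to you):

```sql
CREATE VIEW v_temperatures_2016 AS SELECT * FROM temperatures_2016;
CREATE VIEW v_temperatures_2017 AS SELECT * FROM temperatures_2017;
```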

It seems the workaround works (this is why it is called a workaround):

After quickly building the union, we can see what queries are generated by Tableau and how they are executed by our Hyper MPP cluster:

Building a union from the two views on foreign tables defined earlier

Still not the plan we want: Tableau does not push the aggregation down to the individual table level.

Tableau Custom SQL

With Tableau Custom SQL we can control exactly which queries are executed. No surprise here: by setting a proper custom SQL, our MPP cluster runs the right queries with the right push-down.
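An illustrative example of such a custom SQL (not the exact query from the screenshots): each inner SELECT is aggregated on its own worker, and the outer SELECT merges the partial results on the master – aggregating the aggregates, just like a classic MPP engine:

```sql
SELECT city,
       sum(row_cnt)                 AS row_cnt,
       sum(temp_sum) / sum(row_cnt) AS avg_temperature
FROM (
    SELECT city, count(*) AS row_cnt, sum(temperature) AS temp_sum
    FROM temperatures_2016
    GROUP BY city
    UNION ALL
    SELECT city, count(*) AS row_cnt, sum(temperature) AS temp_sum
    FROM temperatures_2017
    GROUP BY city
) AS partials
GROUP BY city;
```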

This is some sort of (partial, maybe?) victory. With a combination of parameters (even dynamic parameters), we can tackle complex use cases on large data sets. But we have to predefine the aggregated calculations (sum, count) in advance, we have to add all attributes to the SELECT's field list, and we have to define parameters for all quick filters. Doable, but not comfortable. Still, sacrificing convenience for extreme performance on critical use cases – that’s what we do all the time.

Conclusion

After spending a few hours on it, I was able to build a distributed, shared-nothing MPP database cluster powered by the Hyper database. It supports parallel ingestion and horizontal and vertical scaling, spreading queries across servers in a kubernetes cluster. The only drawback is the custom-SQL-based data source definition in Tableau, as Postgres has some limitations on aggregate push-down for partitioned tables.

In the next article, I will perform a few performance measurements to see how this system scales with the number of workers, using real-life use cases.

All sources are uploaded to: https://github.com/tfoldi/tableau-hyperd-docker

If you have any questions, feel free to drop a comment – I am happy to answer.

Tamás Földi

Director of IT Development at Starschema
Decades of experience with data processing and state-of-the-art programming – from nuclear bomb explosion simulation to distributed file systems, ethical hacking and real-time stream processing. Practically, I have always had great fun with those geeky ones and zeros.
