Tableau Performance: Why Embedded Data Source is Faster than Published
April 16, 2017

It started with one of my colleagues, who managed a Tableau case from our end. His client was constantly dissatisfied: her workbooks were significantly faster with embedded data sources than with published ones. We told her that this is normal, that this is simply how things work, so she’d better accept it and let it go. But no, she wanted an explanation. Well, here it is.

Spoiler: this post is about the Optimize Extract feature in Tableau and how it works internally.

So why is it slower?

There are many things that make published data sources slower than a single workbook with embedded data sources. First of all, published data sources rely on the data server process, which is an additional layer in the pipeline: you hit the VizQL server first, then the request goes to the data server (which can live on a separate host), and then on to tdeserver. This extra hop involves some serialization and deserialization work (although the data itself travels in binary form), but it isn’t supposed to be a huge overhead.

Opening a workbook with a published, extract-based data source

So no, it isn’t this additional proxy step that’s responsible for the performance difference. Usually the reason is extract optimization, which works differently for published and embedded data sources.

Optimize Extract

Have you ever wondered what that magic “Optimize” option in the Extract submenu does? It must be something great, huh? Sometimes during extract generation you can see that it is optimizing. But what exactly happens behind the scenes? Let’s see it in detail. The documentation says that it can do two things:

  • Materialize calculated fields
  • Create acceleration views for relevant quick filters

Now let’s switch to practice mode and check how they work internally.

Materialized Calculated Fields

Let’s start from scratch with a data connection pointing to a Tableau Server repository. Create a connection to the roles table and make an extracted data source from it:

[screenshot: connecting to the roles table in the repository]

With columns like:

[screenshot: the columns of the roles table]

Now we should have all our columns in the TDE. Let’s see what the generated Tableau extract looks like inside. First, check what schemas and tables we have:

As you can see, we have the usual [SYS] schema, where all the extract metadata is stored. Interesting, but not for now. $TableauMetadata in the [Extract] schema is a key-value metadata store that usually holds the data source definition XML. Not that interesting either.

However, we have the [Extract] table in the [Extract] schema as well, and that’s where we find our extracted data. That’s something! Let’s have a look into it:

We can already see the first materialized column, Number of Records, with the value 1 in every row. This column wasn’t in the Postgres repository; the extraction process materialized it for its own convenience.
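
As a side note, this naming convention is baked into the public Extract API too: a TDE can hold exactly one table, and it must be named Extract – which is exactly the [Extract].[Extract] table we see here. A minimal sketch with the Python tableausdk package (the id and name columns are just my guess at the shape of the roles table):

    from tableausdk import *
    from tableausdk.Extract import *

    ExtractAPI.initialize()
    extract = Extract('roles.tde')

    # A .tde holds exactly one table, and the API requires it to be
    # named 'Extract' - this is the [Extract].[Extract] table above.
    schema = TableDefinition()
    schema.addColumn('id', Type.INTEGER)           # assumed columns,
    schema.addColumn('name', Type.UNICODE_STRING)  # mirroring 'roles'
    table = extract.addTable('Extract', schema)

    row = Row(schema)
    row.setInteger(0, 1)
    row.setString(1, 'Publisher')
    table.insert(row)

    extract.close()
    ExtractAPI.cleanup()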

Let’s move on and check what happens when we define a calculated field. Create a new one called Public Name:

Public Name calculation – this will be a new materialized column in the TDE

And now optimize the extract. Optimization always occurs during an extract refresh; however, if you have just added a new calculation and want to materialize it immediately, you need to trigger the optimization manually.

[screenshot: the Extract → Optimize menu item]

After triggering the optimization we can sneak into the newly generated extract and observe a new column (scroll to the right):

We have a new column called Calculation_5834061538592075777 that holds exactly the calculation we just added. It was precalculated and materialized: from now on, there is no need to evaluate the branching calculation on the fly.

This pre-calculation is only supported for row-level calculations; no table calculations or level of detail (LOD) calculations are materialized/optimized. For FIXED LODs I don’t see a fundamental reason why the values couldn’t be materialized – but as of now, it isn’t implemented for any type of LOD.
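
To see why the row-level restriction makes sense, here is a conceptual sketch in plain Python (nothing Tableau-specific, and the Public Name logic is a made-up stand-in for the calculation above): a row-level result depends only on the row itself, so it can be computed once at extract time, while a FIXED LOD result depends on which rows survive the context filters at query time.

    # Conceptual sketch, not Tableau internals.
    rows = [
        {'name': 'guest',     'admin': False},
        {'name': 'publisher', 'admin': False},
        {'name': 'sysadmin',  'admin': True},
    ]

    # Row-level calculation: depends only on the row itself,
    # so it is safe to compute once and store in the extract.
    for r in rows:
        r['Public Name'] = 'Administrator' if r['admin'] else r['name'].title()

    # FIXED-LOD-style calculation: depends on which rows are visible.
    # A context filter applied at query time changes the result, so a
    # value baked in at extract time could simply be wrong.
    visible = [r for r in rows if not r['admin']]  # a context filter
    fixed_count = len(visible)                     # {FIXED : COUNT(...)}-style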

Quick Filter Materialization

Let’s move forward and see what the situation is with filter optimization. This feature is called acceleration views and is defined as:

When a workbook contains filters that are set to show only relevant values, computing the available values for that filter can be an expensive query. For these filters, Tableau must evaluate the other filters in the workbook first and then compute the relevant values based on their filter selections. To speed up these queries, a view can be created that computes the possible filter values and caches them for faster lookup later.

To check it, add a few quick filters to the workbook and click Optimize again.

[screenshot: quick filters added to the workbook]

After another round of optimization we can check what changed in our extract file. As you can see, there is a new schema called [Cache]. Let’s look into it.

As you can see, the combination of quick filters is materialized into a single table. Using this View#0 table, Tableau can quickly display relevant quick filters without querying the full data set. (Entry is a metadata table that defines the contents of each acceleration view table.)
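
Conceptually, an acceleration view is just the set of distinct combinations of the filtered columns, precomputed into a small lookup table. A rough Python sketch of the idea (my simplification, not Tableau’s actual query plan):

    # Rough sketch of an acceleration view for two 'relevant values' filters.
    data = [
        {'region': 'EMEA', 'segment': 'Consumer'},
        {'region': 'EMEA', 'segment': 'Corporate'},
        {'region': 'APAC', 'segment': 'Consumer'},
    ] * 100000  # pretend this is the full extract

    # Built once at optimize time, like View#0 in the [Cache] schema:
    view0 = sorted({(r['region'], r['segment']) for r in data})

    # At query time, the relevant 'segment' values for a selected region
    # come from the tiny cached view instead of a full-extract scan:
    relevant_segments = sorted({s for region, s in view0 if region == 'EMEA'})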

This is all fine. But what happens if we switch to a published data source?

Published Data Source

Now the fun part. After publishing the data source to the Server, the extract preserves the cache schema (the relevant quick filter caches). This is fine.

But after the first extract refresh on the Tableau Server side, the cache schema is gone.

It makes sense: without an associated workbook there are no quick filters, so there is nothing to cache.

On the other hand, every workbook using relevant-values filters will be slower with published data sources, simply because of the missing acceleration views.

Workbook Level Calculations

But this isn’t the only drawback of published data sources. As you might know, there are two types of calculations: one kind is stored in the data source, the other in the workbook layer.

Data source level calculations cannot be edited directly

Workbook level calculations can be edited directly

So we have two types of calculations. Data source level calculations live in the server-side extract, with materialized columns. But newly defined workbook-level calculations cannot be pre-optimized, as the server has no idea about them at extraction time.
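
Here is a tiny sketch of where the cost lands (a conceptual illustration, not Tableau’s code): data source level calculations are known at refresh time and get materialized once, while workbook-level calculations only surface at query time and must be re-evaluated on every query.

    # Conceptual sketch, not Tableau's code.
    def refresh_extract(rows, datasource_calcs):
        # Server side, runs once per extract refresh: datasource-level
        # calculations are materialized into the extract as real columns.
        for r in rows:
            for name, fn in datasource_calcs.items():
                r[name] = fn(r)
        return rows

    def answer_query(rows, workbook_calcs):
        # VizQL side, runs for every query against the published source:
        # workbook-level calculations are evaluated on the fly, every time.
        for r in rows:
            for name, fn in workbook_calcs.items():
                r[name] = fn(r)
        return rows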

Bottom line: if your users build lots of workbook-level calculations on top of published data sources, they’ll be unhappy at some point. This is what happened with my colleague: his client used many advanced calculations on top of IT-supported published data sources. When she made a local copy it was fast, but when she switched back to the published data source it slowed down. No surprises here: this is expected.

Conclusion

To sum up what to keep in mind:

  1. Proxying data source requests comes with a minor penalty. Plus, the data server is an additional process in the flow, with its own problems (like the SAP HANA issue).
  2. Tableau can cache quick filter values (acceleration views), but only for embedded data sources.
  3. Workbook-level calculations never get materialized.

There are even more tricks to speed things up, but maybe next time.

Do you need more information or help to make your system significantly faster? Just drop me a line.

Tamás Földi
Director of IT Development at Starschema

Decades of experience with data processing and state-of-the-art programming: from nuclear bomb explosion simulation to distributed file systems, ethical hacking and real-time stream processing – I have always had great fun with those geeky ones and zeros.

Comments

  • Jonathan Drummey

    Hi Tamas,

    Great post! I especially like the way you went into the subtleties around published data sources and quick filters.

    With regards to your comment about materializing FIXED LODs: about 1/3 of the time when using FIXED LOD expressions it’s the case where I’m creating a new record-level dimension and having a materialized result in the extract would be great. The other 2/3 of the time I’m creating a calculation that I want to be sensitive to parameters and/or context filters that are set at runtime so materializing a FIXED LOD in that situation would only be helpful at best when first opening the workbook, and at worst end up making extracts needlessly larger and take longer to create.

    Jonathan

    • Matthew

      Ah, that makes sense: context filters are likely why these can’t be materialized. I’ve long used the explanation that “row-level deterministic calculations can be materialized in extracts,” so I assumed that FIXED LODs would also be materialized if they didn’t use parameters, user functions, or TODAY/NOW. But changing context can change FIXED LOD values, so I suppose that’s why they’re not persisted.

  • Jeffrey Strauss

    Tamas, thanks for sharing your valued knowledge. I wasn’t aware of the accelerated views for quick filters, this is really good to know.

    How do I get a viewer into the TDE? This is really cool.

    My hunch, which I haven’t been able to fully prove out yet but am close, is that there are more differences than just the proxy, quick filter acceleration and having to compute local calcs. Two items come to mind, maybe you can confirm or call them a myth?

    First one is that when a workbook connects to a published data source, the workbook has its own internal data model, and then there is a data model inside the published data source. Upon reviewing an isolated set of logs, it appears that Tableau Server does an interpret-metadata pass on both of these. It’s quite fast, but if there’s a sizable model, then it may carry a small penalty.

    Second is the “connect to data” phase when looking at record_performance. Part of this is interpreting the metadata. But part of it, I believe, is uncompressing the required columns, then sorting and loading them to fulfill the needs of whatever is calling it. In the case of a workbook with an embedded extract, I think it processes a subset of the columns based on what is on the shelves. But when going through dataserver, I think it does extra processing, because the TDE is being called from the dataserver, which doesn’t necessarily know what the workbook needs until the query is actually fired off for the sheet. Again, it’s fast, but to me this may be more noticeable. Any thoughts on reality here?
