Vacuum Analyze with Talend Open Studio
March 23, 2015

The whole thing started with a requirement to vacuum analyze our production Greenplum appliances. The problem was the timing: we only had two hours to complete the activity. This means we would need to start with the tables that were never vacuumed, then follow up with the other tables ordered by vacuum date, and stop the process precisely after two hours. The following query returns the vacuum analyze SQLs in the correct order (it works for all Postgres clones); we just need to execute them:
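A sketch of such a query, assuming the standard `pg_stat_all_tables` statistics view (present in Greenplum and other Postgres clones; column names may vary between versions). In Postgres, `GREATEST` ignores NULL arguments, so tables that were never vacuumed sort first with `NULLS FIRST`:

```sql
-- Generate VACUUM ANALYZE statements: never-vacuumed tables first,
-- then everything else ordered by the oldest (auto)vacuum date.
SELECT 'VACUUM ANALYZE ' || schemaname || '.' || relname || ';' AS vacuum_sql
FROM pg_stat_all_tables
WHERE schemaname NOT IN ('pg_catalog', 'information_schema')
ORDER BY GREATEST(last_vacuum, last_autovacuum) NULLS FIRST;
```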

It is possible to narrow down the tables/schemas using filters. So we have a SQL script that produces the SQL queries we need to execute, and this is where we hit a problem. We can execute these SQLs with Talend. We can even use multiple threads. But we want some automation that kills the process after a given time.

My first solution looked like this:


Overview of the first version.

Looking at the results we can see that something is not okay. The execution time should be two hours, and it was; the execution should have succeeded, but it failed. Even with a generated fail message, it still appears as a failure.


Failure even if the job was successful

This approach works: you can use it to automatically kill long-running ETL sessions and get notified when it happens. But since we want to run this in production (where we get a notification for every failed job), we need a way that won't generate failures in the logs; if there is a failure in the logs, there must be a real failure. I shared this approach anyway because it is a generic solution: it is simple and can be used anywhere.

As for the second version, I realised we had to get rid of the tDie component. This approach also works with Talend Open Studio.


Overview of the second version

So where is the magic, you may ask? It is hidden inside the tJavaRow component:

After a given time we start to skip rows in the loop. This way, once the deadline passes, the second tFlowMeter won't receive any new data.
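The skip logic can be sketched as a small helper. This is a minimal stand-in for what the tJavaRow body does, with hypothetical names (`TimeBoxedFilter`, `shouldForward`); in the real job the decision simply controls whether the current row is passed on to the next component:

```java
// Minimal sketch of the time-boxed skip logic used in the tJavaRow.
// Rows are forwarded only while we are inside the time budget;
// afterwards the loop keeps iterating but forwards nothing.
public class TimeBoxedFilter {
    private final long startMillis;
    private final long maxRuntimeMillis;

    public TimeBoxedFilter(long startMillis, long maxRuntimeMillis) {
        this.startMillis = startMillis;
        this.maxRuntimeMillis = maxRuntimeMillis;
    }

    /** True while the elapsed time is still under the budget. */
    public boolean shouldForward(long nowMillis) {
        return (nowMillis - startMillis) < maxRuntimeMillis;
    }

    public static void main(String[] args) {
        // Two-hour budget, as in the vacuum scenario.
        TimeBoxedFilter filter = new TimeBoxedFilter(0L, 2L * 60 * 60 * 1000);
        System.out.println(filter.shouldForward(60L * 60 * 1000));     // one hour in
        System.out.println(filter.shouldForward(3L * 60 * 60 * 1000)); // three hours in
    }
}
```

Because rows past the deadline are skipped rather than aborted, the job finishes normally and nothing shows up as a failure in the logs.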


Execution details

I configured it to stop after one minute; however, a few queries were still running, and after those completed the job quickly iterated over the remaining results without passing them on. For comparison, here is how the original version appears in the logs:


Talend's project-level logging

For every project I configure project-level logging; it is one of the best features Talend offers, and it's available in Open Studio as well.

You may ask what the actual executor looks like:


Executor job

Since this job was created in Talend Enterprise we have the option to use threads. The vacuums can be parallelized using the Iterate link's advanced settings with a context variable. Still, connection creation takes considerable time, so we would like to use a shared connection. Unfortunately we need a separate connection for each thread, since the PostgreSQL JDBC driver will serialize parallel requests on a single connection. We can make the connection name unique among the threads:
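One way to do this is to derive the shared-connection name from the current thread. This is a hypothetical sketch (the `ConnectionNames` helper and the `"vacuum_conn"` base name are made up); in Talend the resulting string would be used as the shared connection name so each thread opens and reuses its own connection:

```java
// Hypothetical helper: build a per-thread connection name so each
// Talend thread gets its own PostgreSQL connection instead of
// serializing on a single shared one.
public class ConnectionNames {
    public static String forCurrentThread(String base) {
        // Thread ids are unique among live threads, so each parallel
        // iteration ends up with a distinct shared-connection name.
        return base + "_" + Thread.currentThread().getId();
    }

    public static void main(String[] args) {
        System.out.println(forCurrentThread("vacuum_conn"));
    }
}
```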

The tJava contains only a System.out.println. Since this executor will be called thousands of times every week, project-level logging will be turned on, but we still have the console logs to notify us about errors, which shouldn't happen.

Did you like this article? Follow us on Twitter!

Balazs Gunics
