I ran into a request on the Tableau Community Forum: how to connect to a .csv file located on a remote web server over HTTP. This is not really supported yet, and the soon-to-be-released web connector will be extract only. But that’s why we are here, the brave community, to get things done. In this post I will show you an easy way to deal with remote data providers in Tableau, allowing live connections to CSV, JSON and XML formats.
Preparations
Depending on your distribution, Postgres may or may not be compiled with Python support. Let’s check if our distribution has the Python bindings available separately:
$ apt-cache search postgres | grep python
Apt shows me a list of postgres-related packages. The ones I am interested in are:
postgresql-plpython-9.3 – PL/Python procedural language for PostgreSQL 9.3
postgresql-plpython3-9.3 – PL/Python 3 procedural language for PostgreSQL 9.3
I’ll be using the Python 2 version for this tutorial, so let’s install that one:
$ sudo apt-get install postgresql-plpython
Now let’s create a new database to keep our exploration to ourselves. On my machine, I will call my db ‘pydl_test’.
$ sudo -u postgres createdb pydl_test
Then we need to install the python language itself into the database:
$ sudo createlang plpythonu pydl_test
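If createlang is not available on your system (newer PostgreSQL releases dropped it), the language can also be installed from psql with CREATE EXTENSION; a quick alternative, assuming you run it against pydl_test as a superuser:

$ sudo -u postgres psql -d pydl_test -c 'CREATE EXTENSION plpythonu;'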
Starting out
Let’s create a new SQL file with a skeleton function. I will call mine pgpydl.sql.
CREATE OR REPLACE FUNCTION from_url(IN url text)
  RETURNS text
AS $$
return "Contents of '%s' go here..." % url
$$ LANGUAGE plpythonu VOLATILE;
And let’s test that our Python installation works by starting a psql shell, importing our SQL file and running a small test select.
$ psql -d pydl_test
pydl_test=# \i pgpydl.sql
CREATE FUNCTION
pydl_test=# select from_url(E'http://starschema.net');
                    from_url
------------------------------------------------
 Contents of 'http://starschema.net' go here...
(1 row)
Downloading real data
We’ll use urllib2 from the Python standard library to download the contents. It’s a basic library, but it provides everything we’ll need.
Let’s think for a moment about the requirements of our little function:
- it should download the web page if possible
- if not, then depending on the specific use-case:
  - it should throw an exception which cancels the current transaction, or
  - it should return some default value instead
- if the download is possible, then depending on the specific use-case:
  - it should download the whole response, or
  - we should be able to provide a limit on the amount of data downloaded
So we’ll need to parameterize both error handling and the (existence of and size of the) download limit.
CREATE OR REPLACE FUNCTION from_url(
  -- The URL to download.
  IN url text,
  -- The maximum number of bytes to download.
  -- If 0 then the whole response is returned.
  IN max_bytes integer DEFAULT 0,
  -- Should any errors (like HTTP transport errors)
  -- throw an exception or simply return default_response
  IN should_throw boolean DEFAULT true,
  -- The default response if any errors are found.
  -- Only used when should_throw is set to false
  IN default_response text DEFAULT E''
) RETURNS text
AS $$
# We will use traceback so we get decent error reporting.
import urllib2, traceback

# Either throws an error or returns the default response
# depending on the should_throw parameter of the from_url() function
def on_error():
    if should_throw:
        # plpy.error() throws an exception which stops the current transaction
        plpy.error("Error downloading '{0}'\n {1}".format(url, traceback.format_exc()))
    else:
        return default_response

try:
    response = urllib2.urlopen(url)
    if max_bytes == 0:
        return response.read()
    else:
        return response.read(max_bytes)
except:
    # Log and re-throw the error or return the default response
    return on_error()
$$ LANGUAGE plpythonu VOLATILE;
Now we can test the function.
select from_url(E'https://databoss.starschema.net/wp-content/uploads/2015/04/mock_data.csv');
psql should now show our sample csv file.
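A couple of extra calls to exercise the optional parameters of the function we just defined (the failing URL below is just a made-up example):

-- Download at most 200 bytes of the response
select from_url(E'https://databoss.starschema.net/wp-content/uploads/2015/04/mock_data.csv', 200);

-- Return the default empty string instead of aborting the transaction on errors
select from_url(E'http://no-such-host.invalid/data.csv', 0, false);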
Parsing the CSV
Create a return type
When we parse the CSV, we return rows with multiple columns, so we need a composite type to wrap the data returned by our parsing function.
CREATE TYPE site_user AS (
  id text,
  first_name text,
  last_name text,
  email text,
  country text,
  ip_address text
);
Using the CSV parser
Parsing CSV becomes trivial at this point using Python’s built-in CSV parser; the only thing we have to look out for is that we are returning many instances of this type. So we need to set our return type to SETOF site_user, and return either a list of lists or a generator. We’ll go with the latter.
CREATE OR REPLACE FUNCTION parse_person(
  IN csv_text text
) RETURNS SETOF site_user
AS $$
import csv, traceback
try:
    for row in csv.reader( csv_text.splitlines() ):
        yield( row )
except:
    plpy.error("Error parsing csv {0}".format(traceback.format_exc()))
$$ LANGUAGE plpythonu VOLATILE;
Unlike the from_url() function, the parser must have a unique signature for each parsed datatype, so we need to copy it for each new type we want to parse, but thankfully we only need to change RETURNS SETOF site_user to RETURNS SETOF my_type.
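For example, a copy for a hypothetical composite type called my_type keeps the body byte-for-byte identical and changes only the function name and the return type:

-- Hypothetical copy of parse_person for a composite type called my_type
CREATE OR REPLACE FUNCTION parse_my_type(
  IN csv_text text
) RETURNS SETOF my_type
AS $$
import csv, traceback
try:
    for row in csv.reader( csv_text.splitlines() ):
        yield( row )
except:
    plpy.error("Error parsing csv {0}".format(traceback.format_exc()))
$$ LANGUAGE plpythonu VOLATILE;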
Now let’s test it:
select parse_person(
  from_url(E'https://databoss.starschema.net/wp-content/uploads/2015/04/mock_data.csv')
);

select (first_name, last_name, country)
  from parse_person(
    from_url(E'https://databoss.starschema.net/wp-content/uploads/2015/04/mock_data.csv')
  )
  where (country = 'China');
And we should have a list of full names from China.
From Tableau
We just need to set the Custom SQL to a query that calls our new functions. The results come straight from the web site, immediately, without caching.

This is a simple Postgres workbook based on custom SQL – however, the data is coming from a web service.
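For example, the Custom SQL behind such a workbook can be as simple as this (a sketch using the functions defined above, not copied from the original workbook):

select (u).*
  from parse_person(
    from_url(E'https://databoss.starschema.net/wp-content/uploads/2015/04/mock_data.csv')
  ) as u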
You can even add parameters to construct or modify the URL; this way you can pass parameters to your web service without modifying the Postgres function.
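For instance, with a Tableau string parameter named Country, the Custom SQL could look something like the following – a hypothetical sketch, since the mock data URL does not actually accept a country query parameter:

select (u).*
  from parse_person(
    from_url('https://databoss.starschema.net/wp-content/uploads/2015/04/mock_data.csv?country='
             || <Parameters.Country>)
  ) as u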
Edge cases
There are some edge cases for these methods: in a different CSV file, the columns are separated by semicolons instead of commas, so we need to update our parsing function. But we must create the return type first:
CREATE TYPE sales_data AS (
  varenummer text,
  import_eksport text,
  tid text,
  statistikkvariabel text,
  Utenrikshandel int
);
We use the Python csv module’s Sniffer to figure out the CSV separator and format, and we add a new argument to tell the parser whether it should skip the first line of the CSV data, where the column labels usually are.
CREATE OR REPLACE FUNCTION parse_sales_data(
  IN csv_text text,
  IN skip_first_line boolean DEFAULT true
) RETURNS SETOF sales_data
AS $$
import csv, traceback
try:
    dialect = csv.Sniffer().sniff(csv_text[:1024])
    lines = csv_text.splitlines()
    return csv.reader( (lines[1:] if skip_first_line else lines), dialect )
except:
    plpy.error("Error parsing csv {0}".format(traceback.format_exc()))
$$ LANGUAGE plpythonu VOLATILE;
Also we encounter a small problem:
# select * from from_url(E'http://data.ssb.no/api/v0/dataset/1142.csv?lang=no');
ERROR:  invalid byte sequence for encoding "UTF8": 0xe5 0x72 0x2c
CONTEXT:  while creating return value
PL/Python function "from_url"
It seems that the web server is telling us the content is ISO-8859-1, yet we hand Postgres something that is not valid UTF-8. So we need to fix our downloader function first by modifying the try block to decode the response using the charset the server reports. On top of this, our local test server returns the response as application/octet-stream, which carries no charset information at all, so we need to handle that case too. The changed function looks something like this:
CREATE OR REPLACE FUNCTION from_url(
  -- The URL to download.
  IN url text,
  -- The maximum number of bytes to download.
  -- If 0 then the whole response is returned.
  IN max_bytes integer DEFAULT 0,
  -- Should any errors (like HTTP transport errors)
  -- throw an exception or simply return default_response
  IN should_throw boolean DEFAULT true,
  -- The default response if any errors are found.
  -- Only used when should_throw is set to false
  IN default_response text DEFAULT E'',
  -- The encoding to use if the content-type provided by the server
  -- does not provide a meaningful encoding (like text/csv or
  -- application/octet-stream)
  IN fallback_encoding text DEFAULT E'iso-8859-1'
) RETURNS text
AS $$
import urllib2, traceback

# Either throws an error or returns the default response
# depending on the should_throw parameter of the from_url() function
def on_error():
    if should_throw:
        plpy.error("Error downloading '%s'\n %s" % (url, traceback.format_exc()))
    else:
        return default_response

try:
    response = urllib2.urlopen(url)

    # Sometimes servers return without a charset, handle this case.
    content_type = response.headers['content-type'].split('charset=')
    encoding = content_type[-1] if len(content_type) > 1 else fallback_encoding

    # Limit the read if necessary
    if max_bytes == 0:
        return unicode( response.read(), encoding )
    else:
        return unicode( response.read(max_bytes), encoding )
except:
    return on_error()
$$ LANGUAGE plpythonu VOLATILE;
Now we can use it. This time we want proper columns instead of the composite record type, so we’ll expand sales_data into a real row of fields using the (record).* star syntax:
select (sales_data).*
  from parse_sales_data(
    from_url(E'http://data.ssb.no/api/v0/dataset/1142.csv?lang=no')) as sales_data
  where (Utenrikshandel != 0);
The output of the function can be used directly from SQL tools or Tableau:
NOTE: Python 2’s csv module does not fully support Unicode, so all input should be UTF-8 or printable ASCII to be safe. The csv module documentation has more ways of dealing with different encodings.
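If you have to deal with other encodings anyway, the usual Python 2 workaround is to encode the text to UTF-8 before handing it to csv.reader and decode the fields back afterwards; a minimal sketch, not wired into the functions above:

import csv

def parse_unicode_csv(csv_text):
    # Python 2's csv module is only safe with UTF-8 / ASCII byte strings,
    # so encode the unicode input first...
    utf8_lines = (line.encode('utf-8') for line in csv_text.splitlines())
    for row in csv.reader(utf8_lines):
        # ...and decode every field back to unicode afterwards
        yield [field.decode('utf-8') for field in row]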
Summary
Using database stored procedures as data gateways is not a new or unique idea, but sometimes people forget how easy it is to feed non-database data through a database engine to end systems. This method could work with other database engines like SQL Server or Oracle as well; I used Postgres only because it is free and relatively easy to use.
With small modifications you can parse JSON or XML files, flatten them and return their data to Tableau – with only a few lines of code, as shown above.
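As a rough illustration of the JSON case – assuming the service returned a JSON array of objects with the same fields as site_user; the function name and field access below are just a sketch, not a tested implementation:

CREATE OR REPLACE FUNCTION parse_person_json(
  IN json_text text
) RETURNS SETOF site_user
AS $$
import json, traceback
try:
    # Expect a JSON array of objects like
    # [{"id": "1", "first_name": "...", ...}, ...]
    for rec in json.loads(json_text):
        yield (rec.get('id'), rec.get('first_name'), rec.get('last_name'),
               rec.get('email'), rec.get('country'), rec.get('ip_address'))
except:
    plpy.error("Error parsing json {0}".format(traceback.format_exc()))
$$ LANGUAGE plpythonu VOLATILE;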
The SQL used in this post can be downloaded from here.