When we first started delivering analytic reports to clients, we chose to use Tableau for its flexible, easy-to-use interface and its capacity for quick iterations. However, as our business matured, it became apparent that Tableau was not going to be a solution in the long-term.
From a maintenance perspective, Tableau’s lack of template capabilities meant that every report had its Tableau workbook; if we wanted to make a change across all reports, we needed to open up each file, or go through and recreate each individual file with the new version.
From a usability perspective, our reporting uses complex calculations, many of which are not able to be precalculated and must be computed on the fly. As our dataset grew, we began getting feedback from our clients on the sluggish experience that they were faced with when accessing our reports. If a user changed a filter, the entire dashboard and all the underlying calculations would need to refresh – taking up to a few minutes; this caused many headaches for clients just to get to the data they wanted. In addition, we had to spend hundreds of thousands of dollars on a set of licenses for our company and our end-users since we delivered these reports through Tableau Online.
We explored a few scalable Business Intelligence (BI) tools, but landed on a trial with Periscope Data for reasons I’ll explain below:
- Periscope allows you to embed dashboards and pass a JSON blob back to the dashboard with filter values that you can generate from your site’s front-end.
- Excel downloads from embedded dashboards and pivot abilities within tables.
- The ability to variablize your SQL code and pass variable values in the JSON blob allowed us to create report templates that could read off of tables with the same schema.
- The ability to link to our internal data warehouse to these dashboards allows us to run queries directly on the tables we’re producing – no extracting, caching, or uploading of data needed (our underlying data warehouse has a caching feature that we’ll go into in Part 2).
- Underlying the dashboards was SQL, which we already use every day so no need to learn a new language or syntax.
- We can also add our color palette to keep the branding consistent.
This tool fits our use case quite well, and since it’s a startup, they’re consistently rolling out new features to expand our analytical horizons (they just came out with Python and R capabilities).
However, this task did not come without its challenges. The most substantial task was to recreate the functionality and flexibility of our highly-complex Tableau dashboards using SQL. If any of you have built semi-complex Tableau dashboards, the thought of re-creating those in SQL would give you headaches. Now, imagine working with a panel dataset that uses projection factors to deliver data – we can’t directly sum any of our metrics to get aggregates, there is almost always a subquery involved.
Luckily, we found a simple and elegant solution to this problem that allows us scalability and flexibility. Using Periscope’s text-replacement functionality to variablize our SQL scripts, we have been able to create analysis templates that work on different tables as long as the tables have the same schema (column names and data types).
Replicating Tableau’s functionality took about a month, while the query optimization took about another (due to unforeseen problems we will discuss in Part 2). To deliver 30-40 Tableau workbooks, it took our team of 6 analysts working full-time about 5-10 working days to produce the tables, run the QA, extract and upload, and write summaries. Also, for the dashboards running queries on 1m-10m row tables, Tableau Online could take anywhere from 20s to 1m30s.
Our 4 analysts now produce 80 monthly deliveries at about 20min each (down from about 1h) over a span of a few days (down from up to 2 weeks) – we are currently working to completely automate this process (using Airflow – more to come on this in a future post). In addition, since we could write our queries in SQL, it allowed us the flexibility to optimize our queries to return results in the 10-20s range (down from 20s to 1m30s). However, we still weren’t satisfied with the consistency of our query times and the inelasticity of our underlying compute power.