Problems started after we finished building a self-contained format for data delivery through Periscope on top of a Redshift cluster. Redshift clusters are provisioned with pre-defined compute and storage sizes and billed at a monthly rate until you drop the cluster. Periscope Data strongly recommended Redshift over Snowflake/BigQuery; however, the assumptions behind that recommendation (a cluster running 24/7, and comparisons based only on second-run query speeds) didn't fit our workflows. Our analysts and customers run queries (usually once) or download data very intermittently, and they often hit the same tables at the same time. Granted, comparing these data warehouses is like comparing apples (Redshift) to oranges (Snowflake) and bananas (BigQuery): BigQuery charges by the amount of data scanned, while Snowflake charges by the amount of time you have a warehouse running, so reconciling them against a common use case is very difficult.
After exhaustively optimizing our Periscope queries running on a Redshift cluster by tuning the sort and dist keys of the underlying tables, we still wound up with highly variable query times whenever more than one user hit our database at the same time (concurrency). If two users refreshed our embedded Periscope dashboards, each would be allocated roughly half of our compute (for simplicity, I'm halving it; there are other considerations in how compute power is allocated, but I digress), and if their dashboards issued more than one query at a time (most of ours are a chart and a table, each with its own query), the compute power would be split even further. Scale that model to hundreds of users hitting the cluster at once, and it clogs up and ruins our query times. We'd be right back to the relative Stone Age of the Tableau Online days.
So, since Periscope can connect to both, we decided to try Snowflake and Redshift in parallel in side-by-side dashboards. Instead of sharing one pool of resources, Snowflake lets us segment users into separate compute resources (warehouses) and scale the number of clusters in each warehouse horizontally: if Snowflake senses a lot of queued queries, it can spin up duplicate clusters to handle the queue, effectively auto-scaling the warehouse so the end user can't tell the difference. We currently separate our warehouses by team, with one warehouse dedicated to queries coming from our portal, which is great for cost accounting. Snowflake warehouses can also scale vertically: if you're running queries on really large tables, you can scale up the compute of your warehouse in seconds, so your queries run in a fraction of the time they would on a shared Redshift cluster.
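To make the per-team setup concrete, here is a minimal sketch of the kind of DDL behind it. The warehouse names and sizes are hypothetical, and in practice you would execute these statements through a client such as snowflake-connector-python; the helper here just builds the statement strings, using Snowflake's documented CREATE WAREHOUSE parameters for multi-cluster auto-scaling.

```python
def warehouse_ddl(name: str, size: str = "XSMALL",
                  min_clusters: int = 1, max_clusters: int = 4) -> str:
    """Build a CREATE WAREHOUSE statement for one team's warehouse.

    MIN_CLUSTER_COUNT / MAX_CLUSTER_COUNT let Snowflake spin up
    duplicate clusters when queries queue and shut them down again
    when load drops (horizontal scaling). AUTO_SUSPEND / AUTO_RESUME
    keep the warehouse from billing while idle.
    """
    return (
        f"CREATE WAREHOUSE IF NOT EXISTS {name} "
        f"WAREHOUSE_SIZE = '{size}' "
        f"MIN_CLUSTER_COUNT = {min_clusters} "
        f"MAX_CLUSTER_COUNT = {max_clusters} "
        f"AUTO_SUSPEND = 60 AUTO_RESUME = TRUE"
    )

# One warehouse per team, plus one for the customer portal
# (hypothetical names, for cost-accounting separation).
for wh in ["ANALYTICS_WH", "DATA_SCIENCE_WH", "PORTAL_WH"]:
    print(warehouse_ddl(wh))
```

The point of the separation is that a heavy query from one team queues (or auto-scales) only within that team's warehouse, so it can never starve the portal's queries of compute.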
This flexibility not only solved our concurrency problem with Periscope queries, but also gave our teams a lot of freedom in their workflows. Instead of running a query, leaving for lunch, and coming back to find it's either still running or failed halfway through, our users can scale up their team's warehouse, run their queries in a fraction of the time it'd take on Redshift, and scale it back down before lunch. That flexibility, on top of the fact that all of our data can live in one place instead of being distributed across Redshift clusters, has paid off enormously in terms of workflow and collaboration within the company.
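The scale-up / run / scale-down routine described above can be sketched as a small context manager. This is an illustrative pattern, not our exact tooling: the warehouse name and sizes are hypothetical, the ALTER WAREHOUSE ... SET WAREHOUSE_SIZE statement follows Snowflake's documented syntax, and the cursor is assumed to be anything with an `execute` method (e.g. a snowflake-connector-python cursor). A fake cursor is used here just to show which statements get issued.

```python
from contextlib import contextmanager

@contextmanager
def scaled_warehouse(cursor, warehouse: str, up: str = "LARGE",
                     down: str = "XSMALL"):
    """Temporarily resize a warehouse for a heavy query, then shrink it back."""
    cursor.execute(f"ALTER WAREHOUSE {warehouse} SET WAREHOUSE_SIZE = '{up}'")
    try:
        yield cursor
    finally:
        # Always scale back down, even on failure, so we aren't
        # billed at the larger size after the query finishes.
        cursor.execute(f"ALTER WAREHOUSE {warehouse} SET WAREHOUSE_SIZE = '{down}'")

class _FakeCursor:
    """Stand-in for a real cursor; just records the SQL it is given."""
    def __init__(self):
        self.statements = []
    def execute(self, sql):
        self.statements.append(sql)

cur = _FakeCursor()
with scaled_warehouse(cur, "ANALYTICS_WH") as c:
    c.execute("SELECT COUNT(*) FROM big_table")  # the heavy query
print(cur.statements)
```

Wrapping the resize in `try`/`finally` is the design choice that matters: the scale-down runs even if the query errors out, which is exactly the "failed halfway through" scenario you don't want to pay LARGE-size rates for.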
So, we've settled on Periscope with an underlying Snowflake warehouse that we can scale however and whenever we want. Our query times are consistently in the 3–10 second range, and we have cut our cost by roughly 70%. Amazing setup for scale and flexibility, in my honest [and totally unbiased] opinion.