Visualizing Big Data
This post explores using Tableau Server to visualize data in a Hadoop cluster.
More and more businesses are finding value or savings in replumbing their databases to be deployed on commodity hardware running open-source software like Cloudera Distribution of Hadoop. With tools like Impala, realtime queries of massive datasets become possible. To get the most insight, compelling interactive data exploration and visualization is necessary. We wanted to explore how Tableau works in this regard, and found Tableau Server visualizing data from CDH using Impala proved facile. This combination provides data exploration at the speed of thought with beautiful intuitive visualizations, resulting in a quick front-end for big data.
To demonstrate this, I took the classic AdventureWorks Bike Store data that Microsoft uses to demo a data warehouse in SQL Server (also used in our Endeca Information Discovery demo) and loaded it in our CDH cluster using Impala. I downloaded Tableau Desktop, as well as the Cloudera ODBC Driver for Impala, and spun up a VM running Windows Server to host Tableau Server. After configuring the Impala driver to point at our cluster, I launched Tableau and added a new data source. Tableau makes it pretty simple to choose from a menu of data sources, from a simple CSV to a massive CDH cluster. After selecting the Cloudera Hadoop option, I input our cluster DNS and the port and credentials for Impala. I selected my new Bike Store database and table, and was ready to whip up some visualizations.
Tableau provides tools for ETL, including a pretty nice GUI for simple joins, but since I was trying to denormalize a star schema I did the transformations using impala-shell where I have more control and the operations are more visible. Cluster-side ETL would also help these visualizations run at the speed of thought, even if working with big data at scale.
Using Tableau provides really easy creation of standard charts and even more complicated visualizations like maps can be rendered automagically with geographic attribute detection. I found the date detection to be less dependable. You can add filters to slice and dice on different attributes, and create dashboards to combine several worksheets, or create "Stories" to walk a viewer through a series of visualizations.
Overall, I found some aspects of Tableau a little bit like using Apple products: you get great design and intuitive functionality at the cost of robust configurability. It’s a fantastic tool to get pretty visualizations up and running quickly and intuitively, to add data sources on the fly with little complexity, and to quickly share the results. But in spite of those strengths, Tableau isn't a replacement for enterprise-scale BI offerings.
Interact with the visualization we built and embedded here: