Bringing Data Discovery to Hadoop - Part 3
In our last post, we talked about some of the tools in the Hadoop ecosystem that Oracle Big Data Discovery takes advantage of to do its work -- namely Hive and Spark. In this post, we're going to delve a little deeper into how BDD integrates with data that is already sitting in Hive, how it can write transformed data back to HDFS, and how it can help give users new insights on that data.
BDD ships with a data processing tool that makes imports from Hive easy. Simply point it at the database and table(s) you would like to pull into BDD and the application does the rest. Behind the scenes, the data processing utility launches Spark workers that read the targeted table's data from HDFS and write it to new Avro files. BDD then indexes the data in these files for easy discovery.
Another feature of BDD's data processing is that it can be set to auto-detect new tables created in a Hive database, keeping BDD in sync with Hive. The BDD Hive Table Detector automatically launches a workflow to import a table whenever one is created. BDD doesn't yet support updates to existing tables, but we hope to see that feature in a future release.
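To make the detection idea concrete, here is a minimal Python sketch of the comparison such a detector has to make: which tables exist in Hive but have not yet been imported. This is only an illustration of the concept, not BDD's actual implementation; the function name and the table names are hypothetical.

```python
def find_new_tables(hive_tables, imported_tables):
    """Return the Hive tables that have no imported counterpart yet.

    Both arguments are iterables of table names. A real detector would
    read them from the Hive metastore and from BDD's own catalog; here
    they are plain lists (an assumption made for illustration).
    """
    return sorted(set(hive_tables) - set(imported_tables))


# Hypothetical table names, for illustration only.
in_hive = ["campaign_spending", "committees", "candidates"]
already_imported = ["committees"]

for table in find_new_tables(in_hive, already_imported):
    print(f"would launch an import workflow for: {table}")
```

Note that a comparison by name alone only catches newly created tables. It says nothing about rows added to a table that was already imported, which matches the limitation described above.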
One thing to note: depending on the size of the table, BDD may import only a sample of its data for discovery purposes. By default, the application's record threshold is set to one million. This keeps analysis of a particular collection as interactive as possible while still giving a reasonably dependable and accurate view of the data. For most purposes, the default setting should be enough, but the threshold can be increased if necessary. Ultimately, the right sample size is a balance between an individual's needs and the computing resources available to them.
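As a rough illustration of what a record threshold means in practice, the Python sketch below computes the fraction of a table that would be kept so the sample stays near the threshold. It assumes simple uniform sampling, which may not be how BDD selects records; the function name and parameters are hypothetical.

```python
def sample_fraction(total_records: int, threshold: int = 1_000_000) -> float:
    """Return the fraction of records to keep for a table of a given size.

    Tables at or below the threshold are kept whole (fraction 1.0);
    larger tables are scaled down so roughly `threshold` records remain.
    Assumes uniform random sampling, purely for illustration.
    """
    if total_records < 0:
        raise ValueError("total_records must not be negative")
    if threshold <= 0:
        raise ValueError("threshold must be positive")
    if total_records <= threshold:
        return 1.0
    return threshold / total_records


for rows in (250_000, 1_000_000, 40_000_000):
    print(f"{rows:>12,} records -> keep {sample_fraction(rows):.1%}")
```

Raising the threshold raises the fraction kept for large tables, at the cost of more memory and slower interaction.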
Exporting Back to Hive
A unique component of BDD is its ability to write data back to Hadoop once it is in a state that you are satisfied with or would like to share with other users. We have some campaign funding data to work with as a test case:
The Chicago mayor's race has been getting some attention due to an unexpected underdog forcing incumbent Rahm Emanuel to a runoff. As you can see, the challenger, Chuy Garcia, is wildly out-funded compared to Emanuel:
Creating this application involved pulling campaign spending data for Illinois from electionmoney.org, importing it into BDD, and then joining a couple of tables together and cleaning it all up using the transform tools contained within the application.
Now let's say we wanted to export the results of this work -- these joined, transformed data sets -- for other users to query for themselves in Hive. We can do that with a simple, built-in export feature that can write our denormalized data set back to HDFS.
With a few quick clicks, BDD can create Avro-formatted files, write them to our Hadoop cluster, and then create the corresponding Hive table automatically:
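For readers curious about what an Avro-backed Hive table looks like, the Python sketch below builds the kind of DDL statement that could define one over exported files. It is not the statement BDD generates. It assumes Hive 0.14 or later, where STORED AS AVRO is available, and the database, table, column names, and HDFS path are all hypothetical.

```python
def avro_table_ddl(database, table, columns, location):
    """Build a Hive DDL statement for an external, Avro-backed table.

    `columns` is a list of (name, hive_type) pairs and `location` is the
    HDFS directory holding the Avro files. All values are supplied by the
    caller; nothing here is taken from BDD itself.
    """
    cols = ",\n  ".join(f"`{name}` {hive_type}" for name, hive_type in columns)
    return (
        f"CREATE EXTERNAL TABLE `{database}`.`{table}` (\n"
        f"  {cols}\n"
        f")\n"
        f"STORED AS AVRO\n"
        f"LOCATION '{location}'"
    )


# Hypothetical schema and path, for illustration only.
ddl = avro_table_ddl(
    database="default",
    table="campaign_spending_export",
    columns=[
        ("candidate", "STRING"),
        ("committee", "STRING"),
        ("amount", "DOUBLE"),
    ],
    location="/user/bdd/export/campaign_spending",
)
print(ddl)
```

Because the table is external, dropping it in Hive would remove only the table definition and leave the Avro files in HDFS untouched.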
This particular feature adds a lot of flexibility and opportunity for collaboration in teams whose members span a wide range of skills. You can imagine users on the business side and technical side of a company passing data sets back and forth, sharing insights in a natural way that might have been much more difficult to accomplish in other environments.
That concludes our three-part look at Oracle Big Data Discovery. As we've said before, there is a lot to be excited about and we believe the application offers a viable data discovery solution to organizations running data in Hadoop, as well as those who are interested in creating first-time clusters.
For more information or guidance on how BDD could help your organization, contact us at info [at] ranzal.com.