Bringing Data Discovery To Hadoop - Part 2
The most exciting thing about Oracle Big Data Discovery is its integration with all the latest tools in the Hadoop ecosystem. This includes Spark, which is rapidly supplanting MapReduce as the processing paradigm of choice on distributed architectures. BDD also makes clever use of the tried and tested Hive as a metadata layer, meaning it has a stable foundation on which to build its complex data processing operations.
In our first post of this series, we showcased some of BDD's most handy features, from its streamlined UI to its very flexible data transformation abilities. In this post, we'll delve a little deeper into BDD's underlying mechanics and explain why we think the application might be a great solution for Hadoop users.
Much of the backbone for BDD's data processing operations lies in Hive, which effectively acts as a robust metastore for BDD. While operations on the data itself are not performed using Hive functions (which currently run on MapReduce), Hive is a great way to store and retrieve information about the data: where it lives, what it looks like, and how it's formatted.
For organizations that are already running data in Hive, the integration with BDD couldn't be simpler. The application ships with a data processing tool that can automatically import databases and tables from Hive, all while keeping data types intact. The tool can also sync up with a Hive database so that when new tables are created, a user can automatically work with that data in BDD. If a table is dropped, BDD deletes that particular data set from its index. Currently, the 1.0 version doesn't support updates to existing Hive tables, but we hope to see that feature in an upcoming release.
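To make the sync behaviour concrete, here is a minimal sketch in Python of the add/drop logic described above. It is not BDD's actual implementation: the function name plan_sync and the table names are hypothetical, and it only compares table names, which mirrors the fact that updates to existing tables are not picked up.

```python
def plan_sync(hive_tables, indexed_data_sets):
    """Work out which data sets to add to, or remove from, the index.

    Hypothetical helper: compares table names only, so a changed table
    that keeps its name is not detected (as with updates in version 1.0).
    """
    hive = set(hive_tables)
    indexed = set(indexed_data_sets)
    to_add = sorted(hive - indexed)      # new Hive tables not yet indexed
    to_remove = sorted(indexed - hive)   # indexed data sets whose table was dropped
    return to_add, to_remove


to_add, to_remove = plan_sync(
    ["complaints", "customers", "orders"],  # tables currently in Hive (made-up names)
    ["complaints", "old_logs"],             # data sets currently indexed (made-up names)
)
print(to_add)     # ['customers', 'orders']
print(to_remove)  # ['old_logs']
```

Using set differences keeps the comparison order-independent, and sorting the results just makes the output predictable.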
BDD can also upload data to HDFS and create a new table with that data in Hive to work with. It does this whenever a user uploads a file through the UI. For example, here's what we saw in Hive with the consumer complaints data set from the last post after BDD imported it:
This easy integration with Hive makes BDD a good option both for experienced Hadoop users who are already using Hive and for less technical users.
While Hive provides a solid foundation for BDD's operations, Spark is the workhorse. All data processing operations are run through Spark, which allows BDD to analyze and transform data in-memory. This approach effectively sidesteps the launching of slower MapReduce jobs through Hive and gives the processing engine direct access to the data.
When a user commits a series of transforms to a data set via the BDD UI, those transforms are interpreted into a Groovy script that is then passed to Spark through an Oozie job. Here, we can see how some date strings are converted to datetime objects behind the scenes:
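For readers who want a feel for this kind of transform, the following is a small Python sketch of converting date strings to datetime objects. It is not the Groovy script BDD generates: the column name date_received, the MM/DD/YYYY format, and the helper to_datetime are all assumptions made for illustration.

```python
from datetime import datetime


def to_datetime(value, fmt="%m/%d/%Y"):
    """Parse a date string; return None if it is blank or malformed.

    The default format is an assumption for this sketch, not something
    taken from BDD.
    """
    if value is None or not value.strip():
        return None
    try:
        return datetime.strptime(value.strip(), fmt)
    except ValueError:
        return None


# Hypothetical rows standing in for records in a data set.
rows = [{"date_received": "07/29/2013"}, {"date_received": "n/a"}]
for row in rows:
    row["date_received"] = to_datetime(row["date_received"])
```

Returning None for values that fail to parse, rather than raising, reflects the irregular data this kind of tool is aimed at; a real pipeline might instead log or quarantine such rows.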
After Spark has done its handiwork, the data is then written out to HDFS as a new set of files, serialized and compressed in Avro. The original collection stays intact in another location in case we want to go back to it in the future.
From this point, the data is then loaded into the Dgraph.
The Dgraph is basically an in-memory index, and is what enables the real-time, dynamic exploration of data in BDD. This concept might be familiar to those who have used Oracle Endeca Information Discovery, where the Dgraph also played a key role, and this lineage means BDD inherits some very nice features: quick response, keyword search, impromptu querying, and the ability to unify metrics, structured and unstructured data in a single interface. The biggest difference now is that users have the ability to apply these real-time search and analytic capabilities to data sitting on Hadoop.
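The general idea of an in-memory index that supports keyword search can be illustrated with a toy inverted index in Python. This is only a sketch of the concept and says nothing about how the Dgraph is actually built; the function names and sample records are made up.

```python
from collections import defaultdict


def build_index(records):
    """Map each lowercase token to the set of record ids containing it."""
    index = defaultdict(set)
    for record_id, text in records.items():
        for token in text.lower().split():
            index[token].add(record_id)
    return index


def search(index, query):
    """Return the ids of records that contain every keyword in the query."""
    tokens = query.lower().split()
    if not tokens:
        return set()
    result = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        result &= index.get(token, set())
    return result


# Made-up records, loosely in the spirit of a complaints data set.
records = {
    1: "Incorrect information on credit report",
    2: "Loan servicing payments",
    3: "Credit card billing dispute",
}
index = build_index(records)
print(sorted(search(index, "credit")))         # [1, 3]
print(sorted(search(index, "credit report")))  # [1]
```

Because the whole structure lives in memory, each lookup is a handful of dictionary and set operations, which is the property that makes interactive exploration feel immediate.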
We think the marriage of this kind of discovery application with Hadoop makes a lot of sense. For starters, Hadoop has enabled organizations to store vast amounts of data cheaply without necessarily knowing everything about its structure and contents. BDD, meanwhile, offers a solution to indexing exactly this kind of data -- data that is irregular, inconsistent and varied.
There's also the issue of access. Currently, most data tools in the Hadoop ecosystem require a moderate level of technical knowledge, meaning wide swaths of an organization might have little to no view of all that data on HDFS. BDD offers a system to connect more people to that data, in a way that's straightforward and intuitive.