Bringing Data Discovery To Hadoop - Part 1
We have been anticipating the intersection of big data with data discovery for quite some time. What exactly that will look like in the coming years is still up for debate, but we think Oracle's new Big Data Discovery application provides a window into what true discovery on Hadoop might entail.
We're excited about BDD because it wraps data analysis, transformation, and discovery tools together into a single user interface, all while leveraging the distributed computing horsepower of Hadoop.
BDD's roots clearly extend from Oracle Endeca Information Discovery, and some of the best aspects of that application -- ad-hoc analysis, fast response times, and instructive visualizations -- have made it into this new product. But while BDD has inherited a few of OEID's underpinnings, it's also a complete overhaul in many ways. OEID users would be hard-pressed to find more than a handful of similarities between Endeca and this new offering. Hence, the completely new name.
The biggest difference of course, is that BDD is designed to run on the hottest data platform in use today: Hadoop. It is also cutting edge in that it utilizes the blazingly fast Apache Spark engine to perform all of its data processing. The result is a very flexible tool that allows users to easily upload new data into their Hadoop cluster or, conversely, pull existing data from their cluster onto BDD for exploration and discovery. It also includes a robust set of functions that allows users to test and perform transformations on their data on the fly in order to get it into the best possible working state.
In this post, we'll explore a scenario where we take a basic spreadsheet and upload it to BDD for discovery. In another post, we'll take a look at how BDD takes advantage of Hadoop's distributed architecture and parallel processing power. Later on, we'll see how BDD works with an existing data set in Hive.
We installed our instance of BDD on Cloudera's latest distribution of Hadoop, CDH 5.3. From our perspective, this is a stable platform for BDD to operate on. Cloudera customers also should have a pretty easy time setting up BDD on their existing clusters.
Getting started with BDD is relatively simple. After uploading a new spreadsheet, BDD automatically writes the data to HDFS, then indexes and profiles the data based on some clever intuition:What you see above displays just a little bit of the magic that BDD has to offer. This data comes from the Consumer Financial Protection Bureau, and details four years' worth of consumer complaints to financial services firms. We uploaded the CSV file to BDD in exactly the condition we received it from the bureau's website. After specifying a few simple attributes like the quote character and whether the file contained headers, we pressed "Done" and the application got to work processing the file. BDD then built the charts and graphs displayed above automatically to give us a broad overview of what the spreadsheet contained.
As you can see, BDD does a good job presenting the data to us in broad strokes. Some of the findings we get right from the start are the names of the companies that have the most complaints and the kinds of products consumers are complaining about.
We can also explore any of these fields in more detail if we want to do so:
Now we get an even more detailed view of this date field, and can see how many unique values there are, or if there are any records that have data missing. It also gives us the range of dates in the data. This feature is incredibly helpful for data profiling, but we can go even deeper with refinements.
With just a few clicks on a couple charts, we have now refined our view of the data to a specific company, JPMorgan Chase, and a type of response, "Closed with monetary relief". Remember, we have yet to clean or manipulate the data ourselves, but already we've been able to dissect it in a way that would be difficult to do with a spreadsheet alone. Users of OEID and other discovery applications will probably see a lot of familiar actions here in the way we are drilling down into the records to get a unique view of the data, but users who are unfamiliar with these kinds of tools should find the interface to be easy and intuitive as well.
Another way BDD differentiates itself from some other discovery applications is with the actions available under the "Transform" tab.
Within this section of the application, users have a wealth of common transformation options available to them with just a few clicks. Operations like converting data types, concatenating fields, and getting absolute values now can be done on the fly, with a preview of the results available in near real-time.
BDD also offers more complex transformation functions in its Transformation Editor, which includes features like date parsing, geocoding, HTML formatting and sentiment analysis. All of these are built-in to the application; no plug-ins required. Another nice feature BDD provides is an easy to way group (or bin) attributes by value. For example, we can find all the car-related financing companies and group them into a single category to refine by later on:
Another nice added feature of BDD is the ability to preview the results of a transform before committing the changes to all the data. This allows a user to fine tune their transforms with relative ease and minimal back and forth between data revisions.
Once we're happy with our results, we can commit the transforms to the data, at which point BDD launches a Spark job behind the scenes to apply the changes. From this point, we can design a discovery interface that puts our enriched data set to work.
Included with BDD are a set of dynamic, advanced data visualizations that can turn any data set into something profoundly more intuitive and usable:
The image above is just a sampling of the kind of visual tools BDD has to offer. These charts were built in a matter of minutes, and because much of the ETL process is baked into the application, it's easy to go back and modify your data as needed while you design the graphical elements. This style of workflow is drastically different from workflows of the past, which required the back- and front-ends to be constructed in entirely separate stages, usually in totally different applications. This puts a lot of power into the hands of users across the business, whether they have technical chops or not.
And as we mentioned earlier, since BDD's indexing framework is a close relative to Endeca, it inherits all the same real-time processing and unstructured search capabilities. In other words, digging into your data is simple and highly responsive:
As more and more companies and institutions begin to re-platform their data onto Hadoop, there will be a growing need to effectively explore all of that distributed data. We believe that Oracle's Big Data Discovery offers a wide range of tools to meet that need, and could be a great discovery solution for organizations that are struggling to make sense of the vast stores of information they have sitting on Hadoop.
If you would like to learn more, please contact us at info [at] ranzal.com.
Also be sure to stay tuned for Part 2!