Extending Collaborative Data Discovery - Common Scenarios

Published June 21 2013 by Patrick Rafferty
In our previous post on the Advanced Data Discovery capabilities that we've built out in Oracle Endeca Information Discovery (OEID), we introduced our SmartTagger component extension, which allows users to augment and modify their Endeca data on the fly. To further illustrate the scenarios that this component satisfies, we're going to expand upon a number of the use cases that are driving us to build these types of extensions.

Data Categorization

Categorization is an awfully broad and slightly vague term for a very powerful interaction that we provide to our customers. Essentially, it allows users to encapsulate and abstract any number of fine-grained refinements or data interactions (search, navigation, filtering) into a higher-order construct. That's a big mouthful of terms to swallow, but the easiest way to think about it is "Tagging". I want to enable my users to find sets of records that may or may not share data in common and allow them to specify some common attributes (i.e., Tags) to link them together.

For a sales analytics or prospect tracking application, I may want to find and save all of my prospects who haven't purchased in the last 6 months. Or I may want to find my top 10 customers in a given state, tag them and compare them to another set of customers, such as the top 10 in another state or the best of the rest of my customers.

Another compelling use case revolves around patient and cohort identification. Say I am sifting through a set of records pertaining to patient data, looking for potential candidates for a drug trial. I might start off with a search for patients who have a certain type of cancer. Then, I might search further within that same set of patients for those who have not undergone radiation. Finally, I navigate further down to those located in a particular region, and I've identified a part of my cohort. Now, I want to add a completely different set of patients to act as a control. So, I back out of my navigation state, pick a completely different set of criteria and tag these patients as well.
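To make the cohort scenario a little more concrete, here is a minimal sketch in Python. It is not the SmartTagger or OEID API; the records, the field names, the refine helper and the TagStore class are all hypothetical and exist only to show how successive refinements and tags fit together. Each call to refine stands in for one search or navigation step, and a tag is simply a named set of record ids that can be built up from unrelated navigation states.

```python
from collections import defaultdict

# Hypothetical patient records; every field name here is illustrative only.
PATIENTS = [
    {"id": 1, "diagnosis": "lung cancer", "radiation": False, "region": "Northeast"},
    {"id": 2, "diagnosis": "lung cancer", "radiation": True, "region": "Northeast"},
    {"id": 3, "diagnosis": "lung cancer", "radiation": False, "region": "Midwest"},
    {"id": 4, "diagnosis": "asthma", "radiation": False, "region": "Northeast"},
    {"id": 5, "diagnosis": "diabetes", "radiation": False, "region": "South"},
]


def refine(records, **criteria):
    """Keep only the records that match every field=value pair."""
    return [r for r in records if all(r.get(k) == v for k, v in criteria.items())]


class TagStore:
    """Maps a tag name to the ids of the records carrying that tag."""

    def __init__(self):
        self._tags = defaultdict(set)

    def tag(self, name, records):
        # Tagging is additive: records from any navigation state can join a tag.
        self._tags[name].update(r["id"] for r in records)

    def ids(self, name):
        return sorted(self._tags.get(name, set()))


store = TagStore()

# Drill down step by step, as in the scenario above.
with_cancer = refine(PATIENTS, diagnosis="lung cancer")
no_radiation = refine(with_cancer, radiation=False)
in_region = refine(no_radiation, region="Northeast")
store.tag("trial-cohort", in_region)

# Back out of that navigation state and pick unrelated criteria for the control.
control = refine(PATIENTS, diagnosis="asthma")
store.tag("trial-control", control)
```

Because the tag only stores ids, the two groups need nothing in common in the underlying data; the tag itself is what links them.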

As you can see, this goes way beyond bookmarking and data commonality and enters the realm of true collaborative discovery.  While we've only discussed single user scenarios above, this technique can be employed to enable multiple users working on the same problem to collaborate, share tags and (pun incoming!) tag-team the discovery process.

The above examples are just a small sample of what we're seeing in the field right now, but the possibilities are endless.

Data Removal

The flip side of providing a powerful and readily available discovery tool to your users is that, when done properly, even the most obscure corners of your data repository can be brought to the fore.  This might mean uncovering a cache of old sales records for products your company no longer sells.

One common scenario is the concept of false positives when examining fault management data, such as that related to warranty claims. Say you're an analyst tasked with diagnosing and categorizing warranty claims on your product. In the course of your analysis, you'll be reviewing and detailing claims, searching through structured and unstructured data for potential issues. The ability to quickly identify and exclude "bogus claims" is a huge benefit when trying to "sort the wheat from the chaff" and identify the root cause.

However, by far the most common example of "bad data", which we call "data pollution", is data that users can see but should not be able to. This isn't the result of a flaw in our security models or how we apply entitlements. It's the result of entitlement data being wrong at the source, often without the customer even realizing it.

Think about an enterprise search solution that crawls all of the publicly available document repositories at a given company. This would include hundreds, if not thousands, of computers containing anywhere from 1 to 500 shared folders each. Users are responsible for their own folders and the content that is made available, but what if a user accidentally puts a sensitive document in a shared folder and makes it available to "the world"?

Now, chances are, if nobody knows this sensitive document (say a Salary List) is out there, on some obscure file share, it's probably not going to be found in the normal course of operation.  However, if you bring that content into a search index and make it available for everyone (because of incorrect permissions at the source), the potential for discovering this data is increased dramatically.

Obviously, this is an example where the overwhelming preference is to "surgically" remove this file as soon as possible. Waiting around for the next baseline indexing run, or for the document to be re-crawled so that it picks up the new permissions, isn't really an option. If we're talking about something that's even more sensitive, say a classified document at a government entity that can be improperly accessed, the first instinct may be to just shut down the application until the file is removed.

Here's where a delete process that is secured (only Administrators can access it), readily available and persistent (this record will not come back into the index in its current form) provides a tremendous amount of value.  Your data is removed from the application and it remains secure while your users aren't impacted by an outage caused by one "rogue document" out of 10 (or 100) million.  Sometimes the downside of "Data Discovery" is that you end up with the ability to discover information that you shouldn't.
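As a rough illustration of those three properties, here is a small Python sketch. It is an assumption-laden toy, not how OEID or our extension implements deletion: the Index class, the fingerprint helper, the document fields and the user names are all hypothetical. The idea it shows is that a delete records a fingerprint of the removed version, so a later re-crawl of the same unchanged document is rejected, while a corrected version (for example, with fixed permissions) can be indexed again.

```python
import hashlib


def fingerprint(doc):
    """Hash the fields that define a document's "current form" (an assumption)."""
    payload = doc["content"] + "|" + ",".join(sorted(doc["allowed"]))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class Index:
    """Toy index with an admin-only, persistent delete."""

    def __init__(self, admins):
        self._admins = set(admins)
        self._docs = {}
        self._suppressed = {}  # doc id -> fingerprint of the deleted version

    def ingest(self, doc):
        """Index the document unless this exact version was deleted earlier."""
        if self._suppressed.get(doc["id"]) == fingerprint(doc):
            return False
        self._docs[doc["id"]] = doc
        return True

    def delete(self, doc_id, user):
        """Remove a document immediately; only administrators may do this."""
        if user not in self._admins:
            raise PermissionError(f"{user} may not delete records")
        doc = self._docs.pop(doc_id, None)
        if doc is not None:
            self._suppressed[doc_id] = fingerprint(doc)

    def __contains__(self, doc_id):
        return doc_id in self._docs


index = Index(admins={"admin"})
salary = {"id": "share42/salaries.xls", "content": "salary list", "allowed": ["everyone"]}
index.ingest(salary)

# An administrator removes the rogue document without taking the application down.
index.delete(salary["id"], user="admin")

# A re-crawl of the unchanged file is rejected; the corrected version is accepted.
recrawled = index.ingest(dict(salary))
fixed = index.ingest({**salary, "allowed": ["hr"]})
```

In a real deployment the suppression list would need to live in durable storage so that it survives a full re-index; the in-memory dictionary here is only a stand-in.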

Hopefully, people find this useful in terms of broadening the conversation around Data Discovery.  It's about access to information but in many cases, it's also about enablement and control.  Enable users to share their findings and give administrators fine-grained control over what users can see.

If you want to learn more, don't hesitate to reach out at ranzal.com.
