Tag 100 Times Faster -- Introducing Branchbird's Fast Text Tagger

Published February 14 2014 by Patrick Rafferty
Back to insights

Text Tagging is the process of using a list of keywords to search and annotate unstructured data. This capability is frequently required by Ranzal customers, most notably in the healthcare industry.

Oracle's Endeca Data Integrator provides three different ways to text tag your data "out of the box" .

  • The first is the "Text Tagger – Whitelist" component which is fed a list of keywords and searches your text for exact matches.
  • The second is the "Text Tagger – Regex" component which works similarly but allows for the use of regular expressions to expand the fuzzy matching capabilities when searching the text.
  • The third is using "Endeca’s Text Enrichment" component (OEM'ed from Lexalytics) and supplying a model (keyword list) that takes advantage of the component’s model-based entity extraction.

Ranzal began working on a custom text tagging component due to challenges with the aforementioned components at scale. All of the above text taggers are built to handle tagging with relatively small inputs -- both the size of the supplied dictionary and the number (and size) of documents.

1,000 EMRs 10,000 EMRs 100,000 EMRs 1,000,000 EMRs
Fast Text Tagger (FTT) 250 docs/sec 1,428 docs/sec 4,347 docs/sec 6172 docs/sec
Text Enrichment (TE) 6.5 docs/second 5 docs/second N/A N/A
TE 4 threads 17.5 docs/second 15 docs/second 15 docs/second N/A

In one of our most common use cases, customers analyzing electronic medical records with Endeca need to enrich large amounts of free text (typically physician notes) using a medical ontology such as SNOMED-CT or MeSH. Each of these ontologies has a large number of medical "concepts" and their associated synonyms. For example, the US version of SNOMED-CT contains nearly 150,000 concepts. Unfortunately, the "out of the box" text tagger components do not perform well beyond a couple hundred keywords. To realize slightly better throughput during tagging, Endeca developers have traditionally leveraged the third component listed above -- the  Lexalytics-based "Text Enrichment" component -- which offers better performance than the other options listed above.

However, after extensive use of the "Text Enrichment" component, it became clear that not only was the performance still not acceptable at high scale, the recall of the component was inadequate especially with Electronic Medical Records (EMRs). The Text Enrichment component is NLP-based and relies on accurately parsing sentence structure and word boundaries to tokenize the document before entity extraction begins. EMRs typically have very challenging sentence structure due both to the ad hoc writing style of clinicians at point of entry and the observational metrics embedded in the record. Because of this, Text Enrichment of even small documents at high scale can be prohibitive for simple text tagging. A recent customer of ours, using very high end enterprise equipment, was experiencing 24 hour processing times using Text Enrichment text tagging with SNOMED-CT concepts to process approximately six million EMRs.

To improve both the performance and recall issues, Ranzal set out to build a simple text tagger component for Integrator that would be easy to setup and use. The Ranzal "Fast Text Tagger" was built using a high performance dictionary matching algorithm that ingests the list of terms (and phrases) into a finite state pattern matching machine which can then be used to process the documents. One of the largest benefits of these search algorithms is that the document text only needs to be parsed once to find all possible matches within the supplied dictionary.

The Ranzal Fast Text Tagger is intended to replace the stock "Text Tagger – Whitelist" component and the use of the "Text Enrichment" component for whitelisting. Our text tagger is intended for straight text matching with optional restrictions to allow for matching on word boundaries. If your use cases require more fuzzy-style text matching, then you should continue to use the "Text Tagger – Regexp" at low scale and "Text Enrichment" at higher scales.

Performance Observations

To go further on the metrics shown above, and duplicated here, you can see the remarkable performance of the Ranzal Fast Text Tagger as compared to "Text Enrichment" even when Text Enrichment is configured to consume 4 threads. Furthermore, the rate of the BB FTT tends to increase with the number of documents, before starting to level off near 1 million documents, whereas Text Enrichment stays relatively constant.

1,000 EMRs 10,000 EMRs 100,000 EMRs 1,000,000 EMRs
BB FTT 1 thread 250 docs/sec 1,428 docs/sec 4,347 docs/sec 6172 docs/sec
TE 1 thread 6.5 docs/second 5 docs/second N/A N/A
TE 4 threads 17.5 docs/second 15 docs/second 15 docs/second N/A

As a final performance note, the previously mentioned customer with the 24 hour graph run just for text tagging, the same process was done on this same test harness with the same data in just shy of 20 minutes. It took longer to read the data from disk than it took to stream it all through the Fast Text Tagger.  This implies that, in typical use cases, the Fast Text Tagger will not be a limiting component in your graph. For those of you curious about the benchmarking methods used, please continue below.

Test Runs

We built a graph that could execute the different test configurations sequentially and then compile the results. Shown below are four separate test runs and a screen capture of Integrator at test completion. Below each screen cap is a list of metrics:

  • Match AVG: The average number of concepts extracted over the corpus
  • Total Match: The total number of concepts extracted over the corpus
  • Misses: The number of non-empty EMRs where no concept was found
  • Exec Time: The total execution time of the test configuration

Note that Text Enrichment's poor recall negatively impacts its precision (Match AVG). If you remove the (significant number of) misses, TE has precision nearly as high as our Fast Text Tagger.

Test 1: 1,000 EMRs


Test ID Match AVG Total Match Misses Exec Time
BB FTT (1 thread) 9 9,126 14 4 secs
TE (1 thread) 4 4,876 260 153 secs
TE (4 threads) 4 4,876 260 57 secs

Test 2: 10,000 EMRs


Test ID Match AVG Total Match Misses Exec Time
BB FTT (1 thread) 12 127,617 14 7 secs
TE (1 thread) 5 55,567 3,739 2,010 secs
TE (4 threads) 5 55,567 3,739 675 secs

Test 3: 100,000 EMRs


Test ID Match AVG Total Match Misses Exec Time
BB FTT (1 thread) 13 1,380,258 17 23 secs
TE (4 threads) 5 546,598 38,466 6,555 secs

Test 4: 1,000,000 EMRs


Test ID Match AVG Total Match Misses Exec Time
BB FTT (1 thread) 14 14,834,247 17 162 secs

Benchmarking Notes

Tests conducted on OEID 3.1 using the US SNOMED-CT concept dictionary (148,000 concepts) against authentic Electronic Medical Records. Physical hardware used: PC, 4 core i7 with hyperthreading, 32 GB RAM on SSD drives.

The "Text Tagger – Whitelist" was discarded as unusable for this test setup. "Text Enrichment" with 1 thread was discarded after the 10,000 document run and TE with 4 threads was discarded after the 100,000 document run.

Contact us