OEID 3.0 First Look -- Text Enrichment & Whitespace

Published April 23 2013 by Patrick Rafferty

I recently spent some cycles building my first POC for a potential customer with OEID v3.0.  After running some of the unstructured data through the text enrichment component, I noticed something odd:

[Image: whitespace_prob]

The charts I configured to group by salient terms were displaying a "null" bucket.  This bucket was collecting every record that had not been tagged with a term.  After a bit of investigation, it seems this is expected behavior in v3.0: the Endeca Server now treats empty, yet non-null, attributes as valid and stores them on the Endeca record.  Such attributes are common after using some of the out-of-the-box (OOTB) text enrichment capabilities in 3.0 (tagging, extraction, regex), so a best-practice treatment for this side effect is warranted.
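To make the distinction concrete, here is a rough analogy in plain Java. It is not Endeca Server or Integrator code, and the class name, method name, and sample terms are all made up for illustration. It only shows why an empty string behaves differently from a missing value when records are grouped:

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-Java analogy only; none of this is Endeca Server or Integrator API.
class EmptyBucketDemo {

    // Counts records per term. A null term means "no assignment" and is skipped;
    // an empty string is a real value and therefore gets its own bucket.
    static Map<String, Integer> countByTerm(List<String> terms) {
        Map<String, Integer> buckets = new LinkedHashMap<>();
        for (String term : terms) {
            if (term == null) {
                continue; // nothing assigned, so nothing to group on
            }
            buckets.merge(term, 1, Integer::sum);
        }
        return buckets;
    }

    public static void main(String[] args) {
        // Two tagged records, two records with an empty term, one with no term at all.
        List<String> terms = Arrays.asList("pricing", "", "support", "", null);
        System.out.println(countByTerm(terms));
    }
}
```

In this sketch the two empty-string records end up together under their own key, while the record with no term is left out of the grouping entirely. That is the same shape as the unexpected bucket in the chart.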

The good news is that the workaround was very straightforward.

1) Add a "Reformat" component to the .grf just before the bulk loader, using the same metadata definition on its input and output edges.  On the component's "Source" tab, select "Java Transform Wizard" and give your new transformation class a name like "removeWhitespaces".  This will create a .java source file and a compiled .class file in your Integrator project's ./trans directory (where Integrator expects your Java source code to reside).

[Image: removeWhitespace]

2) Provide the following Java logic in your new "removeWhitespaces" transformation class:

    import org.jetel.component.DataRecordTransform;
    import org.jetel.data.DataField;
    import org.jetel.data.DataRecord;
    import org.jetel.exception.TransformException;
    import org.jetel.metadata.DataFieldType;

    public class removeWhitespaces extends DataRecordTransform {

        @Override
        public int transform(DataRecord[] sources, DataRecord[] targets) throws TransformException {
            for (int i = 0; i < sources.length; i++) {
                DataRecord rec = sources[i];
                for (int j = 0; j < rec.getNumFields(); j++) {
                    DataField field = rec.getField(j);
                    Object value = field.getValue();
                    // Turn empty string values into null so that no empty
                    // assignment is loaded onto the Endeca record.
                    if (field.getMetadata().getDataType() == DataFieldType.STRING
                            && value != null && value.toString().length() == 0) {
                        value = null;
                    }
                    // Copy every field, changed or not, to the output record.
                    targets[i].getField(j).setValue(value);
                }
            }
            return 0; // same return value as the original: send the record on
        }
    }

3) Make sure the name of this new class is specified in the "Transform class" input.  Rerun the .grf that loads your data and... profit!

[Image: whitespace_fix]
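The transform in step 2 only nulls out strings of length zero. If your enrichment output can also contain values made up purely of spaces or tabs (an assumption on my part; the case described above involves empty values only), the check could be pulled out into a small standalone helper along these lines. The class and method names are hypothetical, and the helper deliberately has no dependency on the Integrator classes so it can be tested on its own:

```java
// Hypothetical helper, independent of the Integrator (org.jetel) classes,
// so the rule can be unit-tested outside of a graph run.
class BlankToNull {

    // Returns null for null, empty, or whitespace-only input;
    // otherwise returns the input unchanged.
    static CharSequence normalize(CharSequence value) {
        if (value == null) {
            return null;
        }
        for (int i = 0; i < value.length(); i++) {
            if (!Character.isWhitespace(value.charAt(i))) {
                return value; // at least one visible character: keep as is
            }
        }
        return null; // empty or whitespace-only
    }
}
```

Inside the transformation class, the string branch would then pass the field value through normalize() before setting it on the output record. Whether whitespace-only values should really be dropped depends on your data, so treat this as a sketch rather than a recommendation.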

We look forward to sharing more emerging OEID v3.0 best practices here, and to hearing about your approaches as well.

 

 
