Maligning keyword search as a “blunt instrument” and claiming that it typically retrieves only about 25% of relevant documents, some apparently would happily have this tool banished from the e-discovery toolbox. But reports of the death of keyword search are greatly exaggerated. Despite all the wailing and gnashing of teeth, for now at least keyword lists continue to be used in many cases as the basis to determine which documents get collected, reviewed and produced. While few would argue that keywords alone are the ideal approach to retrieving all the right data “blind” from an opponent’s archives, when constructed thoughtfully and systematically, with input from linguists or other trained specialists, a keyword-based approach can yield good results.
To improve keyword search, we should consider why it often performs poorly. There are a lot of ways the process can go awry. For an activity with the potential to determine something as critical as which documents will or won’t be invoked in a case, the task of keyword development is often delegated to surprisingly underqualified people. There seems to be a prevailing misconception that keyword development is straightforward (we are all highly educated and speak English, after all) and can thus be handled satisfactorily by any attorney familiar with a specific matter and/or industry. But if that were really the case, we wouldn’t be having this discussion.
Those tasked with keyword development often suffer from a type of tunnel vision. People who are focused on a specific case 24/7 are liable to overlook the fact that certain words and phrases that seem obviously relevant and important are actually less than ideal choices for a keyword-based inclusive cull. In my role as a search consultant I have seen, for example: the word “information” proposed as a keyword to retrieve documents in an employment class action suit against a large newspaper company; the surname “Do” proposed as a standalone term among a list of plaintiffs’ last names; and the number “53” proposed in a case involving bank-owned life insurance, to be run over financial documents. A colleague recently saw the single letter “G” with a trailing wildcard in a proposed keyword set. These may seem like howlers cited for comic relief, but I can assure you that in over eight years in this industry I have never, ever seen a proposed keyword list that didn’t contain at least one term that would have been disastrously over-inclusive.
The problem of over-inclusivity is almost always found in tandem with its slightly less evil twin, over-specificity. Highly specific keyword terms are not inherently bad, but they can create a false impression of adequate coverage of a concept. “United States Patent 1,234,567” will certainly capture some references to United States Patent 1,234,567. But it will miss references to “US patent 1,234,567,” “the ‘567 patent,” “patent 1,234,567,” “patent ‘567,” “our new widget patent,” etc.
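The coverage gap described above can be made concrete with a small sketch. It assumes Python's standard `re` module rather than any particular review platform's query syntax, and the names `EXACT_PHRASE`, `PATENT_VARIANTS`, `exact_hits` and `variant_hits` are hypothetical, introduced only for illustration.

```python
import re

# The reference forms listed in the paragraph above.
REFERENCES = [
    "United States Patent 1,234,567",
    "US patent 1,234,567",
    "the ‘567 patent",
    "patent 1,234,567",
    "patent ‘567",
    "our new widget patent",
]

# The over-specific keyword: one exact phrase.
EXACT_PHRASE = "united states patent 1,234,567"

# Hypothetical broader pattern covering the written variants of the number.
PATENT_VARIANTS = re.compile(
    r"""
    \b(?:united\s+states\s+|u\.?s\.?\s+)?   # optional country prefix
    patent\s+(?:no\.?\s*)?                  # the word patent, optional No.
    (?:1,?234,?567|['‘’]567)                # full number or short form
    |
    ['‘’]567\s+patent                       # short form before the word patent
    """,
    re.IGNORECASE | re.VERBOSE,
)


def exact_hits(texts):
    """Return the texts that contain the exact phrase (case-insensitive)."""
    return [t for t in texts if EXACT_PHRASE in t.lower()]


def variant_hits(texts):
    """Return the texts matched by the broader variant pattern."""
    return [t for t in texts if PATENT_VARIANTS.search(t)]


if __name__ == "__main__":
    print("exact phrase:", exact_hits(REFERENCES))
    print("variant pattern:", variant_hits(REFERENCES))
```

Even the broader pattern does not reach a purely descriptive reference such as “our new widget patent”; that kind of reference needs separate, contextually anchored terms rather than more number variants.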
The word “keyword” itself suggests, well, a single word. But rarely is a single, context-free word the ideal method by which to target specific subject matter. The most effective keyword sets will make ample use of Boolean and proximity operators to contextualize and focus. Yet the classic keyword list seen in practice consists primarily of individual words and short exact phrases. Brevity in general is not a virtue here; a good keyword list can easily have 500 or more components.
That is not to say that an effective list will comprise 500 or more concepts. The proliferation of individual terms will result at least in part from crucial concepts being placed in proximity to various contextualizing “anchors,” as well as from including synonyms that all target the same key idea. For example, a keyword list intended to find evidence of document shredding in the Enron population should include different potential ways of expressing the concept shred, such as destroy, get rid of, trash, ditch, bury, etc. Those, in turn, being generic and common concepts, should be anchored by a required proximity (say, w/10) to contextualizing references such as papers, evidence, docs, documents, files, memos, etc., or even more generic references like stuff. When concepts are contextually anchored you can more safely explore and exploit related ideas that would be far too broad on their own: in this case, concepts like hide, disappear, (not) keep, preserve, etc. Variations in syntax (we need to preserve these emails vs. these emails must be preserved), parts of speech (preserve vs. preservation), and inflection (preserve vs. preserving) should be considered and accounted for with the use of bidirectional proximity operators and wildcards. Irregular word forms, especially those that would not be captured through standard wildcard use (e.g., kept will not be captured by keep*), must also be addressed.
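As a rough illustration of anchored, bidirectional proximity searching, here is a minimal sketch in plain Python. It is not the syntax or the matching logic of any particular search platform, and tools differ in how they count distance for an operator like w/10. The term lists and the names `ACTIONS`, `ANCHORS`, `term_matches`, `tokenize` and `within` are assumptions made for the example, and multi-word expressions such as “get rid of” are left out for brevity.

```python
import re

# Illustrative term lists based on the paragraph above; a real list would be longer.
# A trailing * is treated as a wildcard; irregular forms are listed explicitly.
ACTIONS = ["shred*", "destroy*", "trash*", "ditch*", "bury*", "buried",
           "preserv*", "keep*", "kept"]
ANCHORS = ["paper*", "evidence", "docs", "document*", "file*", "memo*",
           "email*", "stuff"]


def term_matches(term, token):
    """A trailing * means prefix match; otherwise the token must match exactly."""
    if term.endswith("*"):
        return token.startswith(term[:-1])
    return token == term


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def within(text, left_terms, right_terms, distance=10):
    """True if a left term occurs within `distance` tokens of a right term,
    in either order (a bidirectional proximity check)."""
    tokens = tokenize(text)
    left = [i for i, tok in enumerate(tokens)
            if any(term_matches(t, tok) for t in left_terms)]
    right = [i for i, tok in enumerate(tokens)
             if any(term_matches(t, tok) for t in right_terms)]
    return any(i != j and abs(i - j) <= distance for i in left for j in right)


if __name__ == "__main__":
    for sentence in [
        "We need to preserve these emails",
        "These emails must be preserved",
        "We kept the files",
        "Let's trash the old schedule",
    ]:
        print(sentence, "->", within(sentence, ACTIONS, ANCHORS))
```

Because the check is bidirectional, both word orders of the preserve/emails example match. Calling `within("We kept the files", ["keep*"], ANCHORS)` finds nothing, which is why the irregular form kept is listed as its own term.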
Keyword lists are here to stay, and of course are used and are useful in many situations beyond the one discussed here – for example, to improve a technology-assisted review process and results. The development of effective keyword sets is not simple, and keyword choices will often have nontrivial cascading effects throughout the rest of a matter. Want a better keyword list? There are linguists and other experienced search consultants out there waiting on your call.
About the Author