It’s the Math, Stupid! – Metrics, Measurements and Methodology in TAR
No doubt, the TAR movement has taken root. On his Complex Discovery blog, Rob Robinson has meticulously documented this turn of events from early 2012 to date with an exhaustive list of over 600 articles on the topic. In just one year, we’ve come a long way from the time that only eggheads and stats wonks dared “to go where angels fear to tread.”
TAR talk may have gone mainstream, but according to the 2013 ILTA/InsideLegal Technology Purchasing Survey, only 16% of respondents indicated that big data would lead to more strategic use of data via predictive data modeling, data mining and more accessible data analytics.
Additionally, the math behind TAR is still misunderstood by most in the legal field. Recent trends show an increase in articles attempting to explain the metrics, measurements and methodology. At ILTA this year, several sessions (Technology Assisted Review: A Hands-On Case Study; A Numbers Game: The Value of E-Discovery Metrics) were dedicated to the same challenge.
Our own Conduent experts, Gaby Baron and Amanda Jones, recently wrote about TAR standards and best practices in “Handle Technology-Assisted Review With Care,” published in Law Technology News. Here’s a brief summary of the key takeaways:
Context is King – Adjust Accordingly
In setting the evaluation methodology, it’s important to understand the context in which TAR is being applied. Is it for definitive first-level review coding, document prioritization or quality control? The success criteria must be adapted to each application. Where there may be higher risk, you may want to set a higher bar in measurement standards to mitigate that risk.
For example, you may want to hit higher metrics when using TAR for definitive first-level review coding decisions since you’ll have fewer “eyes” on every document. In the case of document prioritization, however, it may not be necessary to achieve the highest metrics since TAR would be used mainly to speed full manual review rather than replace it.
Know and Use “Precision” and “Recall” Correctly
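Although a number of legitimate performance metrics can be used to evaluate the effectiveness of TAR, the two most commonly used are:
- “precision” – how well the process identifies only the pertinent data, measured as the share of documents it flags as pertinent that actually are pertinent, and
- “recall” – how well the process identifies all of the pertinent data, measured as the share of all pertinent documents that it actually finds.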
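To make the two definitions concrete, here is a minimal Python sketch that computes both metrics from a reviewed sample. The function name and the counts are hypothetical illustrations, not figures from the Baron and Jones article.

```python
def precision_recall(true_positives, false_positives, false_negatives):
    """Return (precision, recall) from counts in a human-coded sample.

    true_positives:  pertinent documents the process flagged as pertinent
    false_positives: non-pertinent documents the process flagged as pertinent
    false_negatives: pertinent documents the process failed to flag
    """
    flagged = true_positives + false_positives    # everything the process flagged
    pertinent = true_positives + false_negatives  # everything actually pertinent
    precision = true_positives / flagged if flagged else 0.0
    recall = true_positives / pertinent if pertinent else 0.0
    return precision, recall


# Hypothetical sample: 80 correct hits, 20 false alarms, 40 misses.
precision, recall = precision_recall(80, 20, 40)
print(f"precision = {precision:.2f}, recall = {recall:.2f}")
```

In this made-up sample, 80 of the 100 flagged documents are pertinent (precision of 0.80), but only 80 of the 120 pertinent documents were found (recall of about 0.67), which shows why neither number tells the whole story alone.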
In addition, it’s important to assess the:
- “confidence level” – degree of certainty that the actual value falls within the margin of error of the measured value; and
- “margin of error” – degree to which the actual value can be expected to vary from the estimated value based on the sample size and confidence level.
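The relationship among sample size, confidence level and margin of error can be sketched in a few lines of Python. This is an illustration only: it assumes a simple random sample and the standard normal approximation for a proportion, which is one common textbook method and not necessarily the one a given TAR tool uses. The function names are hypothetical.

```python
import math
from statistics import NormalDist


def z_score(confidence_level):
    """Two-sided critical value of the standard normal distribution."""
    return NormalDist().inv_cdf(0.5 + confidence_level / 2)


def margin_of_error(estimate, sample_size, confidence_level=0.95):
    """Approximate margin of error for a proportion estimated from a sample."""
    z = z_score(confidence_level)
    return z * math.sqrt(estimate * (1 - estimate) / sample_size)


def required_sample_size(margin, confidence_level=0.95, estimate=0.5):
    """Smallest sample size that achieves the given margin of error.

    estimate=0.5 is the most conservative assumption (largest sample).
    """
    z = z_score(confidence_level)
    return math.ceil(z ** 2 * estimate * (1 - estimate) / margin ** 2)


print(f"{margin_of_error(0.5, 385):.3f}")  # margin for 385 documents at 95%
print(required_sample_size(0.05))          # documents needed for +/- 5% at 95%
```

Under these assumptions, a higher confidence level or a tighter margin of error both call for a larger sample. When the metric is recall, the relevant sample size is the number of pertinent documents in the sample, not the sample as a whole, so the margin on recall is typically wider than the margin on overall prevalence.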
Know and Use “Random Sampling” for Testing Correctly
Random sampling can be used for training a TAR tool, but it is essential for testing. Baron and Jones state that TAR models should always be judged against samples “that are fully representative of the entire document population that will be classified by the TAR system.” Random samples, as opposed to biased samples, will best reflect the characteristics of the overall population.
Two key guidelines for testing:
- A “fresh random sample” should be drawn and coded for testing, and it should not include any of the documents that were used to train the TAR model. Otherwise, as Baron and Jones aptly state, it is “akin to testing a student who holds the answer key.”
- Test results apply only to the population from which the sample was drawn. If your data population changes because you collected new data for the matter, don’t assume you can apply the same TAR model to the new set and generate the same results.
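The first guideline can be illustrated with a short Python sketch that draws a test sample at random while excluding every document already used for training. The function name, the document IDs and the fixed seed are hypothetical; a real workflow would draw from the review platform's own document identifiers.

```python
import random


def draw_test_sample(population_ids, training_ids, sample_size, seed=None):
    """Draw a simple random test sample that shares no documents with training."""
    excluded = set(training_ids)
    candidates = [doc_id for doc_id in population_ids if doc_id not in excluded]
    if sample_size > len(candidates):
        raise ValueError("not enough untrained documents for the requested sample")
    rng = random.Random(seed)  # a seed makes the draw reproducible for audit
    return rng.sample(candidates, sample_size)  # sampling without replacement


# Hypothetical population of 10,000 documents, 500 of which trained the model.
population = range(10_000)
training = range(500)
test_sample = draw_test_sample(population, training, sample_size=385, seed=2013)
print(len(test_sample), "documents drawn; none appear in the training set")
```

If new data is later collected for the matter, the second guideline implies repeating the draw against the new population before relying on earlier test results.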
Let us know what you think in the Comments below. What TAR standards and best practices do you incorporate in your process?
Kris Vann is a consultant at Conduent. She can be reached at firstname.lastname@example.org.