Your client wants to start using technology-assisted review (TAR) on a new case… fantastic! So now what?
There’s been a lot of discussion of TAR since the seminal 2012 Da Silva Moore v. Publicis Groupe SA case, but many law firms have been slow to fully embrace machine learning approaches for large-volume eDiscovery matters.
TAR isn’t going away; in fact, it’s becoming an even stronger and more useful tool that is being employed by innovative law firms to save clients time and money.
So sooner rather than later your client is going to ask you (or the lawyer sitting in the office next door to yours), “What is TAR and how can it be used to help my company save legal costs?”
When that day comes, don’t panic: you can quickly learn how TAR works, without becoming a technical expert or data scientist, and leverage third-party experts who can assist you with the technology and the process.
Knowing what to expect from applying TAR puts leading law firms ahead of the game. Here are some key questions to consider:
- What is the legal precedent for using TAR?
Since the landmark Da Silva Moore case, courts have discussed the application of TAR in great detail and upheld its use.1 Recently the UK joined the U.S. in supporting the use of TAR in Pyrrho Investments Ltd. v. MWB Property Ltd.
These cases have provided invaluable guidance on designing defensible TAR workflows. Many consider the most useful to be Rio Tinto PLC v. Vale S.A., subtitled “Da Silva Moore Revisited.” In Rio Tinto, Judge Peck wrote an opinion to accompany the parties’ stipulated TAR protocol, which presents clear guidelines to parties employing TAR on a new case.
- What TAR strategies can you offer your client?
TAR, also known as computer-assisted review or predictive coding, is an analytical tool that can be used alongside other tools such as keyword searching, email threading, relationship analysis, etc. to eliminate the need for a full manual review of a document population. There are a number of different strategies that become available by applying TAR to your review:
- Predictive coding: Using the tool to apply coding on a larger population of unreviewed data based on the manual coding of a smaller sample set.
- Prioritization: Using the tool to prioritize a review population based on likelihood of responsiveness. With this approach, called “skimming the cream,” review teams look at the docs most likely to be responsive at the beginning of the review. Once the team reaches a point where most of the remaining documents are not responsive, it can decide how much of the likely non-responsive remainder still needs review.
- Quality control: Using TAR to review documents already coded by manual review to identify documents that are likely to have been coded inconsistently. This approach can supplement or replace a “random QC” with a targeted QC.
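To make the prioritization and quality control strategies concrete, here is a minimal sketch in Python. It assumes the TAR tool has already given each document a responsiveness score between 0 and 1; the document IDs, scores, function names, and the 0.5 cutoff are all hypothetical and are not taken from any particular product.

```python
def prioritize(scores):
    """Order document IDs from most to least likely responsive."""
    return sorted(scores, key=scores.get, reverse=True)


def qc_candidates(scores, manual_coding, threshold=0.5):
    """Flag documents whose manual coding disagrees with the model's call.

    A document counts as "predicted responsive" when its score is at or
    above the (hypothetical) threshold.
    """
    return sorted(
        doc_id
        for doc_id, coded_responsive in manual_coding.items()
        if (scores[doc_id] >= threshold) != coded_responsive
    )


# Hypothetical scores from a TAR tool and hypothetical reviewer decisions.
scores = {"DOC-001": 0.12, "DOC-002": 0.91, "DOC-003": 0.47, "DOC-004": 0.78}
manual_coding = {"DOC-001": True, "DOC-002": True, "DOC-003": False, "DOC-004": False}

review_queue = prioritize(scores)
targeted_qc = qc_candidates(scores, manual_coding)
print(review_queue)
print(targeted_qc)
```

In this toy example the review queue starts with DOC-002, and the targeted QC list contains the two documents where the reviewer and the model disagree. A real workflow would rely on the scores and reports produced by the TAR platform itself.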
- What is the source population?
The source population is the universe of documents that are the subject of the matter and need to be reviewed and produced in some format. Key points to consider when determining the document universe are:
- Custodians and data sources: What is the volume of potentially relevant data subject to review, including custodians and data from shared drives, SharePoint and other sources? Volume will be a key factor in determining whether TAR makes sense for your case. (The larger the volume of documents, the more time-consuming and costly the process will be for your client – especially if there is a court-ordered production deadline.)
- Culling (i.e., keyword, date range, concept searching): TAR is a powerful tool that can be used alongside other tools such as keyword searching and concept analysis.
- Data types: email, shared drives, hard copy, and other electronic data. Certain data types such as images with no text may not be available for TAR analysis.
The seed set is perhaps the most important part of the process for the review team. The seed set is a subset of the document universe which will be reviewed by the subject matter experts and/or senior lawyers. The decisions made in the seed set will ultimately be applied to the document universe through application of the TAR software. Here are some things to consider regarding the seed set:
- Purpose: The documents in the seed set will be used to initially train the TAR software.
- Review team: The results of TAR will only be as good as the coding of the starting population, so choosing a small team of subject matter experts who are well-versed in the substance of the matter will be essential.
- Identifying the seed set: You must use a ‘reasonable method’ and the size and methodology should be documented. In Rio Tinto, the stipulated protocol provided a baseline formula to generate a “statistically valid sample.” See below.
- Statistically valid sample: a random sample of sufficient size and composition to permit statistical extrapolation with a margin of error of +/- 2% at the 95% confidence level. This confidence level of 95% and margin of error has become the standard in eDiscovery.2 Based on this formula, a document universe of 1,000,000 docs would require a sample set of 2,395 docs.
- Sharing seed sets: There have been some questions regarding whether seed sets ought to be shared with opposing counsel. Although the TAR protocol in Rio Tinto requested the sharing of seed sets, Judge Peck also noted that sharing a seed set was not required and there were alternatives available that could still maintain a transparent process for both parties.
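The sample-size figure quoted above can be checked with a few lines of Python. This is a sketch under stated assumptions: it reads the formula in footnote 2 as the usual finite-population form, with both terms in the denominator, and it assumes a chi-square value of 3.8416 (1.96 squared, for 95% confidence) and a proportion of 0.5 (the most conservative choice). Neither value is spelled out in the text, but together they reproduce the 2,395 figure. The function and parameter names are illustrative only.

```python
def sample_size(population, margin_of_error=0.02, chi_square=3.8416, proportion=0.5):
    """Finite-population sample size.

    n = X^2 * N * P * (1 - P) / (ME^2 * (N - 1) + X^2 * P * (1 - P))

    chi_square=3.8416 and proportion=0.5 are assumed values, not quoted
    from the protocol.
    """
    variance_term = chi_square * proportion * (1 - proportion)
    numerator = variance_term * population
    denominator = margin_of_error ** 2 * (population - 1) + variance_term
    return numerator / denominator


n = sample_size(1_000_000)
print(round(n))  # 2395, matching the figure quoted for 1,000,000 docs
```

Note how weakly the result depends on the size of the universe: under the same assumptions, a universe ten times larger needs only a handful of additional sample documents.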
The training process of the TAR workflow is the machine learning part of TAR. Since this idea is relatively new, it is often viewed as the most mysterious part of the workflow. However, all it requires is working with results, understanding basic statistics, and using an iterative process of adding new docs to the training model as needed. The iterative process can require going through several rounds of training and retraining the software. Although many firms have engaged technical experts to drive this process, understanding the basic methodology will help take TAR out of the “black box” mentality and into an understandable and repeatable process for you, your client and the judge.
In Da Silva Moore, Judge Peck stated: “I may be less interested in the science behind the ‘black box’ of the vendor’s software than in whether it produced responsive documents with reasonably high recall and high precision.”3 Thus, understanding precision and recall is the key to grasping the TAR training process.
The standard definition for both, as defined in The Grossman-Cormack Glossary of Technology-Assisted Review, and adopted in Rio Tinto, is as follows:
- “Precision” refers to the fraction of documents identified as likely responsive by the predictive coding process that are in fact responsive.
- “Recall” means the fraction of responsive documents that are identified as likely responsive by the Predictive Coding Process.
In layman’s terms, precision measures how accurate the predictive model is: of the docs the model flags as likely responsive, what share would a subject matter expert actually code as responsive? Recall measures completeness: of all the responsive documents in the population, what percentage did the model identify?
Interpreting recall and precision will be factors in determining when to retrain the model to obtain the recall and precision percentages that meet your client’s goals.
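A short worked example may help fix the two definitions. The sketch below uses a made-up control set of ten documents: `predicted` holds the model's calls and `actual` holds the subject matter expert's coding. The data and the function name are hypothetical.

```python
def precision_recall(predicted, actual):
    """Compute precision and recall from paired responsive/not-responsive calls."""
    true_pos = sum(1 for p, a in zip(predicted, actual) if p and a)
    false_pos = sum(1 for p, a in zip(predicted, actual) if p and not a)
    false_neg = sum(1 for p, a in zip(predicted, actual) if not p and a)
    # Guard against empty denominators (no predicted or no actual responsive docs).
    precision = true_pos / (true_pos + false_pos) if (true_pos + false_pos) else 0.0
    recall = true_pos / (true_pos + false_neg) if (true_pos + false_neg) else 0.0
    return precision, recall


# Hypothetical control set: True means "responsive".
predicted = [True, True, True, True, True, False, False, False, False, False]
actual = [True, True, True, False, False, True, False, False, False, False]

precision, recall = precision_recall(predicted, actual)
print(precision, recall)  # 0.6 0.75
```

Here the model flagged five documents, of which three were truly responsive (precision of 60%), and it found three of the four responsive documents in the set (recall of 75%).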
The last part of the TAR process is knowing when to stop. There’s no magic threshold, but most users of TAR ensure completeness by performing a validation on the final sets. Validation requires sampling the final document sets for accuracy before they go out the door. Proper sampling should be based on a “statistically valid sample” (see the seed set discussion above). This sampling should be performed on both the positive and negative results to ensure accuracy.
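As a rough illustration of validation sampling, the sketch below draws a random sample from a final document set and estimates a rate with its margin of error. The margin uses the common normal approximation with z = 1.96 for 95% confidence; that method, the function names, and all of the numbers are assumptions made for illustration, not a validation protocol drawn from the case law.

```python
import math
import random


def draw_sample(doc_ids, sample_size, seed=0):
    """Draw a reproducible simple random sample of document IDs."""
    return random.Random(seed).sample(doc_ids, sample_size)


def estimate_rate(hits, sample_size, z=1.96):
    """Estimate a proportion and its margin of error (normal approximation)."""
    rate = hits / sample_size
    margin = z * math.sqrt(rate * (1 - rate) / sample_size)
    return rate, margin


# Hypothetical "not responsive" set of 200,000 documents.
negative_set = [f"DOC-{i:06d}" for i in range(200_000)]
sample = draw_sample(negative_set, 2395)

# Suppose reviewers find 48 responsive documents in the sample (made-up number).
rate, margin = estimate_rate(hits=48, sample_size=len(sample))
print(f"Estimated responsive rate in the negative set: {rate:.2%} +/- {margin:.2%}")
```

The same two steps can be run on the positive set to check how many documents marked responsive really are responsive.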
As leading law firms seek to develop innovative approaches for saving clients money, lawyers are increasingly seeking to understand the basics of what TAR can and can’t do. Even then, they needn’t go it alone… working in conjunction with TAR experts, they are formulating and implementing defensible TAR strategies.
1 In Rio Tinto, looking back on the case law since Da Silva Moore, Judge Peck even said “it is now black letter law that where the producing party wants to utilize TAR for document review, courts will permit it.”
2 Rio Tinto used the following formula, which provides the calculation and explanation for an eDiscovery defensible sample size:
$$ n = \frac{X^2 \cdot N \cdot P \cdot (1 - P)}{ME^2 \cdot (N - 1) + X^2 \cdot P \cdot (1 - P)} $$
3 Da Silva Moore v. Publicis Groupe, 287 F.R.D. 182, 183-84 (S.D.N.Y. 2012). Furthermore, in Rio Tinto the stipulated protocol required disclosure in writing of “the estimated rates of Recall and Precision with their associated error margins.”
About the Author