Sampling Made Simple: Sampling Fundamentals in Technology-Assisted Review

March 17, 2013 kbaumer

Regardless of the general method being employed in a large-scale document review—keyword cull with linear manual review, technology-assisted review (TAR), or some other combination of human and machine analysis—it has become best practice to employ sampling as part of the overall process. Samples provide estimates for the rate of occurrence of variables of interest within a population; in an e-discovery context, such variables might be responsive documents, privileged documents, documents discussing a particular subtopic of interest, etc.
With sampling, a relatively small investment of resources returns a wealth of information that can inform the larger review going forward, provide clarity and confidence about the nature of the document population, and enhance defensibility. Recently, some questions arose from our readership on how best to apply sampling in e-discovery projects in which technology-assisted review (TAR) is used.

What can sampling achieve in TAR?
Sampling is useful for basic knowledge acquisition, allowing a team to learn about the general nature and composition of entire data sets or specific subsets without requiring exhaustive review. It lets users gain quantifiable, actionable insights quickly and at relatively low cost. Sampling also assists in predicting larger outcomes, for example the overall rate of responsiveness, privilege, and/or confidentiality. This provides visibility for project planning and effective resource allocation, and all of these data points can inform decisions within a TAR process. Finally, sampling can help ensure the defensibility of TAR-based data volume reduction efforts and document productions by providing vital metrics for both precision (the share of the targeted data that is actually pertinent) and recall (the share of all pertinent data that the process managed to target).
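As a rough illustration of the two metrics just described, the sketch below computes precision and recall from counts of correctly and incorrectly targeted documents. The function name and the counts are hypothetical, chosen only for illustration; a real review platform would report these figures in its own way.

```python
def precision_recall(true_positives, false_positives, false_negatives):
    """Return (precision, recall) from document counts.

    true_positives:  pertinent documents the process targeted
    false_positives: non-pertinent documents the process targeted
    false_negatives: pertinent documents the process missed
    """
    targeted = true_positives + false_positives
    pertinent = true_positives + false_negatives
    precision = true_positives / targeted if targeted else 0.0
    recall = true_positives / pertinent if pertinent else 0.0
    return precision, recall


# Hypothetical counts: 800 pertinent documents targeted, 200 non-pertinent
# documents targeted by mistake, and 200 pertinent documents missed.
precision, recall = precision_recall(800, 200, 200)
print(f"precision={precision:.0%}, recall={recall:.0%}")
```

In practice these counts would themselves be estimated from coded samples rather than known exactly, so each metric carries its own margin of error.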

What do you need to know in order to sample successfully?
It is perhaps counterintuitive, but the sample size required to gain statistically valid feedback about even a large population is fairly small, and does not vary drastically with total population size. What does drive the required sample size is the desired confidence level and the corresponding confidence interval (margin of error).
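To make this concrete, here is a minimal sketch of one common textbook approach: the normal-approximation sample size formula for a proportion, with a finite population correction. The function name, the default margin of 2%, and the conservative assumed rate of 50% are illustrative assumptions, not settings recommended by this article.

```python
import math
from statistics import NormalDist


def required_sample_size(population, confidence=0.95, margin=0.02,
                         expected_rate=0.5):
    """Sample size for estimating a proportion (normal approximation)."""
    # Two-sided z-score for the confidence level (about 1.96 for 95%).
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    # Size needed if the population were effectively infinite.
    n0 = (z ** 2) * expected_rate * (1 - expected_rate) / margin ** 2
    # Finite population correction: smaller collections need slightly less.
    n = n0 / (1 + (n0 - 1) / population)
    return math.ceil(n)


# Compare collections that differ in size by a factor of 100.
for population in (10_000, 100_000, 1_000_000):
    print(population, required_sample_size(population))
```

Under these assumptions, a hundredfold increase in collection size raises the required sample by only a few hundred documents, while tightening the margin or raising the confidence level has a much larger effect.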

Confidence level is a measure indicating the overall reliability of sample-based estimates. A confidence level of 95%, for example, means that if the sampling were repeated 100 times, one could expect the resulting interval (the estimate plus or minus the margin of error; see below) to contain the true population rate in about 95 of them. Typical confidence levels are 95% or 99%. Generally speaking, and with all other factors being equal, a higher confidence level requires a larger sample.

Confidence interval refers to the range around the estimate within which the actual rate of occurrence for the variable of interest can be expected to fall; half the width of that range is the margin of error. For example, a sample may have a rate of responsiveness of 10%, but if the sample size is associated with a margin of error of +/-1%, the actual rate of responsiveness for the population as a whole can reasonably be expected to be anywhere from 9% to 11%. Generally speaking, and with all other factors being equal, the larger your sample, the narrower the confidence interval.
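The 10% example can be sketched in code using the normal approximation for a sample proportion. This is one simple method among several, and the function name and sample counts below are hypothetical.

```python
import math
from statistics import NormalDist


def proportion_interval(hits, sample_size, confidence=0.95):
    """Return (rate, margin, low, high) using the normal approximation."""
    rate = hits / sample_size
    # Two-sided z-score for the chosen confidence level.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z * math.sqrt(rate * (1 - rate) / sample_size)
    # Clamp the interval to the valid range for a proportion.
    return rate, margin, max(0.0, rate - margin), min(1.0, rate + margin)


# Hypothetical sample: 346 responsive documents out of 3,460 reviewed.
rate, margin, low, high = proportion_interval(346, 3460)
print(f"{rate:.1%} +/- {margin:.1%} (from {low:.1%} to {high:.1%})")
```

With these hypothetical counts, the estimate is 10% with a margin of roughly one percentage point at 95% confidence. The same sample evaluated at 99% confidence gives a wider interval, which is the trade-off between confidence level and interval width.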

No single confidence level/margin of error is appropriate for all situations. Confidence level and interval settings can and should vary according to the goals of sampling, with proportionality and defensibility both playing important roles when making sampling decisions. Culling for responsive review, one application of TAR, should have higher standards, as one needs to be more confident in precision and recall measures. There is less room for variation if there is a legal obligation for attorneys to attest to the adequacy of production. Conversely, if goals are informational-only or for workflow planning and review prioritization, less stringent criteria can safely be adopted.

How can and should sampling be used in combination with TAR?
Sampling can determine the rate of responsiveness in a document collection, helping calculate the likely rates of production for project planning, budgeting, and resource allocation. Responsiveness rates are also key inputs for the calculation of precision and recall metrics, which are crucial for defensibility of results.

The TAR engine should be trained with a small, fully coded representative sample, or “seed set,” pulled from the full review population. This allows the TAR engine to accurately recognize patterns of responsiveness in a population—what responsiveness “looks like” and how often it occurs.

Sampling on subsets of TAR-coded documents as part of a quality control process improves the defensibility of the data reduction and validates the coding results.

Do samples always have to be random?
There are two basic types of sampling methodology commonly used in a TAR context: random and judgmental. The first is a statistical sampling approach that gives each document in the overall collection an equal chance of being chosen for inclusion within a sample. Judgmental sampling, on the other hand, calls upon subjective factors when determining inclusion within a sample—e.g., a judgmental sample might be based on keyword hits or be pulled disproportionately from certain key custodians.
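The difference between the two methodologies can be sketched as follows. The document IDs, text, and keyword are invented for illustration, and keyword matching is just one of many possible judgmental criteria.

```python
import random


def random_sample(doc_ids, k, seed=None):
    """Simple random sample: every document has an equal chance of selection."""
    rng = random.Random(seed)  # a fixed seed makes the draw reproducible
    return rng.sample(list(doc_ids), k)


def judgmental_sample(documents, keywords):
    """Judgmental sample: documents that hit reviewer-selected keywords."""
    lowered = [kw.lower() for kw in keywords]
    return [doc_id for doc_id, text in documents.items()
            if any(kw in text.lower() for kw in lowered)]


# Hypothetical collection mapping document IDs to their text.
documents = {
    "DOC-001": "Quarterly budget forecast",
    "DOC-002": "Project Falcon term sheet draft",
    "DOC-003": "Lunch order for Friday",
    "DOC-004": "Re: Project Falcon closing checklist",
}

print(random_sample(documents, 2, seed=42))
print(judgmental_sample(documents, ["project falcon"]))
```

Recording the seed used for a random draw is one way to document the selection so that it can be reproduced later if the process is challenged.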

Random samples are essential for the generation of metrics (since one cannot accurately extrapolate out to a larger population based on an uncontrolled skewed input) and should also play a role in training the TAR engine, as noted above, as part of the seed set. A random coded sample should also be set aside (and not used for training) to function as a testing set for evaluation of the TAR engine’s performance.
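A minimal sketch of setting aside a testing set might look like the following. The function name and the 30% hold-out fraction are assumptions made for illustration; the appropriate split depends on the project and on the TAR tool in use.

```python
import random


def split_training_and_test(coded_sample, test_fraction=0.3, seed=None):
    """Shuffle a coded random sample, then split it into two disjoint sets."""
    items = list(coded_sample)
    random.Random(seed).shuffle(items)
    cut = round(len(items) * test_fraction)
    test_set = items[:cut]       # held out; never shown to the TAR engine
    training_set = items[cut:]   # random component of the seed set
    return training_set, test_set


# Hypothetical coded random sample of 100 document IDs.
coded_sample = [f"DOC-{i:04d}" for i in range(100)]
training_set, test_set = split_training_and_test(coded_sample, seed=7)
print(len(training_set), len(test_set))
```

Because the two sets do not overlap, performance measured on the testing set reflects how the engine handles documents it has not seen during training.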

The seed set, however, can and often should be supplemented with judgmental sampling in addition to its random component. Exemplar documents added to the training set help to ensure that important examples of responsiveness are included in training. However, the training set must still include sufficient nonresponsive documents for the eventual TAR model to be effective at recognizing the difference between responsive and nonresponsive content. The training set should never consist solely or primarily of “hot” or otherwise significant but nonrepresentative documents.

Judgmental sampling can also be used to supplement TAR results, for example when certain key concepts are necessarily correlated with responsiveness (such as a particular and unusual code name or unique deal number). In such cases it makes sense to “force” these documents into the highest tier of results (if they are not there already).

When should one involve an expert in statistics?
While attorneys should be expected to understand the basics of sampling and know what questions to ask of their systems and/or processes, statistics is not their core competency, nor should it be. Certain TAR projects may enjoy greater efficiencies and reduced costs by bringing in a statistician highly knowledgeable in sampling techniques who can work behind the scenes during the TAR training process, iteratively testing and refining the TAR model, and identifying documents for QC. Furthermore, a statistician can generate defensible precision and recall metrics, providing peace of mind that a review was conducted with documented inputs, decision points, and performance metrics.

Statistical sampling is an incredibly powerful and flexible tool, and, when applied appropriately in a TAR process, can significantly increase TAR’s efficiency and effectiveness.

Karen Baumer is a senior search consultant at Conduent. She can be reached at
