eDiscovery Daily Blog

Testing Your Search Using Sampling: eDiscovery Throwback Thursdays

Here is the third and final part in our Throwback Thursday series on sampling. Two weeks ago, we talked about how to determine an appropriate sample size to test your search results as well as the items NOT retrieved by the search, using a site that provides a sample size calculator.  Last week, we talked about how to make sure the sample size is randomly selected.  Today, we’ll walk through an example of how you can test and refine a search using sampling.

This post was originally published on April 5, 2011.  It was part of a three-post series that we have revisited over the past couple of weeks.  We have continued to touch on this topic over the years, including our webcast just last month.  One of our best!

The example below is a somewhat simplified version of a real-life search scenario I encountered several years ago, where I went through these steps to arrive at a search term that provided the right balance of recall and precision.

TEST #1: Let’s say we’re at an oil company looking for documents related to oil rights.  To be as inclusive as possible, we will search for “oil” AND “rights”.  Here is the result:

  • Files retrieved with “oil” AND “rights”: 200,000
  • Files NOT retrieved with “oil” AND “rights”: 1,000,000

Using the sample size calculator site that we identified two weeks ago, we determine a sample size of 662 for the retrieved files and 664 for the NOT retrieved files to achieve a 99% confidence level with a margin of error of 5% (a quick sketch of how those numbers can be reproduced appears just after the results below).  We then use the random number generator site from last week’s post to select which files to review and proceed to review each item in the retrieved and NOT retrieved sample sets to determine responsiveness to the case.  Here are the results:

  • Retrieved Items: 662 reviewed, 24 responsive, 3.6% responsive rate.
  • NOT Retrieved Items: 664 reviewed, 661 non-responsive, 99.5% non-responsive rate.
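For anyone who wants to reproduce the 662 and 664 sample sizes above without the calculator site, here is a minimal Python sketch.  It assumes the standard Cochran sample size formula with a finite population correction (which most online calculators appear to use), a 99% confidence level (z ≈ 2.576), a 5% margin of error, and the most conservative response proportion of 0.5.  The function and variable names are mine, purely for illustration.

```python
import math
import random

def sample_size(population, z=2.576, margin=0.05, p=0.5):
    """Cochran sample size with a finite population correction.

    z = 2.576 corresponds to a 99% confidence level; p = 0.5 is the
    most conservative (largest sample) assumption about the proportion.
    """
    n0 = (z ** 2) * p * (1 - p) / (margin ** 2)   # infinite-population sample size (about 664)
    n = n0 / (1 + (n0 - 1) / population)          # adjust for the finite population
    return math.ceil(n)

# Populations from TEST #1
print(sample_size(200_000))     # retrieved files:      662
print(sample_size(1_000_000))   # NOT retrieved files:  664

# Randomly select which of the 200,000 retrieved documents to review
review_ids = random.sample(range(1, 200_000 + 1), sample_size(200_000))
```

The same helper reproduces the sample sizes used in the later tests: 461 for a population of 1,500, 595 for 5,700, and 664 for the roughly 1.2 million NOT retrieved files.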

Nearly every item in the NOT retrieved category was non-responsive, which is good.  But, only 3.6% of the retrieved items were responsive, which means our search was WAY over-inclusive.  At that rate, 192,800 of the 200,000 files retrieved will be NOT responsive and will be a waste of time and resources to review.  Why?  Because, as we determined during the review, almost every published and copyrighted document in our oil company contains the phrase “All Rights Reserved” and is therefore retrieved by the search.
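To make the over-inclusiveness concrete, here is a rough projection from the sample rate to the full retrieved set, again just an illustrative sketch using the numbers above:

```python
retrieved = 200_000
responsive_rate = 0.036                                           # 3.6% responsive in the sample (24 of 662)

projected_responsive = round(retrieved * responsive_rate)         # about 7,200 files
projected_non_responsive = retrieved - projected_responsive       # about 192,800 files
print(projected_non_responsive)
```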

TEST #2: Let’s try again.  This time, we’ll conduct a phrase search for “oil rights” (which requires those words as an exact phrase).  Here is the result:

  • Files retrieved with “oil rights”: 1,500
  • Files NOT retrieved with “oil rights”: 1,198,500

This time, we determine a sample size of 461 for the retrieved files and (again) 664 for the NOT retrieved files to achieve a 99% confidence level with a margin of error of 5%.  Even though we still have a sample size of 664 for the NOT retrieved files, we generate a new list of random numbers to review those items, as well as for the 461 randomly selected retrieved items.  Here are the results:

  • Retrieved Items: 461 reviewed, 435 responsive, 94.4% responsive rate.
  • NOT Retrieved Items: 664 reviewed, 523 non-responsive, 78.8% non-responsive rate.

Over 94% of the items in the retrieved category were responsive, which is good.  But, only 78.8% of the NOT retrieved items were non-responsive, which means over 20% of the NOT retrieved items were actually responsive to the case (we also failed to retrieve 8 of the items identified as responsive in the first iteration).  So, now what?

TEST #3: This time, we’ll conduct a proximity search for “oil within 5 words of rights”.  Here is the result:

  • Files retrieved with “oil w/5 rights”: 5,700
  • Files NOT retrieved with “oil w/5 rights”: 1,194,300

This time, we determine a sample size of 595 for the retrieved files and (once again) 664 for the NOT retrieved files, generating a new list of random numbers for both sets of items.  Here are the results:

  • Retrieved Items: 595 reviewed, 542 responsive, 91.1% responsive rate.
  • NOT Retrieved Items: 664 reviewed, 655 non-responsive, 98.6% non-responsive rate.

Over 90% of the items in the retrieved category were responsive AND nearly every item in the NOT retrieved category was non-responsive, which is GREAT.  Also, all but one of the items previously identified as responsive was retrieved.  So, this is a search that appears to maximize recall and precision.

Had we proceeded with the original search, we would have reviewed 200,000 files, 192,800 of which would have been NOT responsive to the case.  By testing and refining, we only had to review 8,815 files: the 3,710 sample files reviewed across the three tests, plus the remaining retrieved items from the third search (5,700 minus the 595 already sampled, or 5,105), most of which ARE responsive to the case.  We saved tens of thousands of dollars in review costs while still retrieving most of the responsive files, using a defensible approach.
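For those keeping score, the review-burden arithmetic works out like this (a simple tally of the figures above, nothing more):

```python
sample_reviews = (662 + 664) + (461 + 664) + (595 + 664)    # three rounds of sampling: 3,710 files
remaining_third_search = 5_700 - 595                        # retrieved items not already sampled: 5,105
total_reviewed = sample_reviews + remaining_third_search    # 8,815 files, versus 200,000 with the original search
print(total_reviewed)
```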

So, what do you think?  Do you use sampling to test your search results?  Please share any comments you might have, or let us know if you’d like to know more about a particular topic.

Sponsor: This blog is sponsored by CloudNine, which is a data and legal discovery technology company with proven expertise in simplifying and automating the discovery of data for audits, investigations, and litigation. Used by legal and business customers worldwide including more than 50 of the top 250 Am Law firms and many of the world’s leading corporations, CloudNine’s eDiscovery automation software and services help customers gain insight and intelligence on electronic data.

Disclaimer: The views represented herein are exclusively the views of the author, and do not necessarily represent the views held by CloudNine. eDiscovery Daily is made available by CloudNine solely for educational purposes to provide general information about general eDiscovery principles and not to provide specific legal advice applicable to any particular circumstance. eDiscovery Daily should not be used as a substitute for competent legal advice from a lawyer you have retained and who has agreed to represent you.
