Analysis

A Fresh Comparison of TAR and Keyword Search: eDiscovery Best Practices

Bill Dimm of Hot Neuron (the company behind Clustify, which provides document clustering and predictive coding technologies, among others) is one of the smartest people I know when it comes to technology assisted review (TAR).  So, I’m always interested to hear what he has to say about TAR, how it can be used and how effective it is when compared to other methods (such as keyword searching).  His latest blog post on the Clustify site talks about an interesting exercise that did exactly that: compared TAR to keyword search in a real classroom scenario.

In TAR vs. Keyword Search Challenge on the Clustify blog, Bill challenged the audience during the NorCal eDiscovery & IG Retreat to create keyword searches that would work better than technology-assisted review (predictive coding) for two topics.  Half of the room was tasked with finding articles about biology (science-oriented articles, excluding medical treatment) and the other half searched for articles about current law (excluding proposed laws or politics).  Bill then ran one of the searches against TAR in Clustify live during the presentation (he couldn’t run the others during the session due to time constraints, but did so afterward and covered them on his blog, providing the specific searches to which he compared TAR).

To evaluate the results, Bill measured the recall from the top 3,000 and top 6,000 hits on the search query (3% and 6% of the population respectively) and also included the recall achieved by looking at all docs that matched the search query, just to see what recall the search queries could achieve if you didn’t worry about pulling in a ton of non-relevant docs.  For the TAR results he used TAR 3.0 (which is like Continuous Active Learning, but applied to cluster centers only) trained with (a whopping) two seed documents (one relevant from a keyword search and one random non-relevant document) followed by 20 iterations of 10 top-scoring cluster centers, for a total of 202 training documents.  To compare to the top 3,000 search query matches, the 202 training documents plus 2,798 top-scoring documents were used for TAR, so the total document review (including training) would be the same for TAR and the search query.
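The recall-at-a-review-budget measurement Bill describes is straightforward to express in code.  Here's a minimal sketch (the function name and the toy data are my own illustration, not Bill's actual test harness): recall at a budget of k documents is simply the fraction of all relevant documents that appear in the top k results a method returns.

```python
# Illustrative sketch of recall at a fixed review budget k: the fraction
# of all relevant documents found among the top-k ranked documents.
# Function name and data are hypothetical, for illustration only.

def recall_at_k(ranked_doc_ids, relevant_ids, k):
    """Fraction of relevant documents appearing in the top k results."""
    found = sum(1 for doc_id in ranked_doc_ids[:k] if doc_id in relevant_ids)
    return found / len(relevant_ids)

# Toy example: a collection with 10 relevant documents and one ranking.
relevant = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10}
ranking = [1, 42, 2, 3, 99, 4, 5, 77, 6, 11, 7, 8, 50, 9, 10]

print(recall_at_k(ranking, relevant, 6))   # recall after reviewing 6 docs
print(recall_at_k(ranking, relevant, 15))  # recall after reviewing all 15
```

Comparing `recall_at_k` for TAR's ranking against each keyword search's ranking at the same k (here, 3,000 and 6,000) is what makes the comparison apples-to-apples: the total review effort is held constant.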

The result: TAR beat keyword search across the board for both tasks.  The top 3,000 documents returned by TAR achieved higher recall than the top 6,000 documents for any keyword search.  Based on this exercise, TAR achieved better results (higher recall) with half as much document review compared to any of the keyword searches.  The top 6,000 documents returned by TAR achieved higher recall than all of the documents matching any individual keyword search, even when the keyword search returned 27,000 documents.

Bill acknowledges that the audience had limited time to construct queries, they weren’t familiar with the data set, and they couldn’t do sampling to tune their queries, so the keyword searching wasn’t optimal.  Then again, for many of the attorneys I’ve worked with, that sounds pretty normal.  :o)

One reader commented about email headers and footers cluttering up results and Bill pointed out that “Clustify has the ability to ignore email header data (even if embedded in the middle of the email due to replies) and footers” – which I’ve seen and is actually pretty cool.  Irrespective of the specifics of the technology, Bill’s exercise is a terrific fresh example of how TAR can outperform keyword search – as Bill notes in his response to the commenter, “humans could probably do better if they could test their queries, but they would probably still lose”.  Very interesting.  You’ll want to check out the details of his test via the link here.

So, what do you think?  Do you think this is a valid comparison of TAR and keyword searching?  Why or why not?  Please share any comments you might have or if you’d like to know more about a particular topic.

Sponsor: This blog is sponsored by CloudNine, which is a data and legal discovery technology company with proven expertise in simplifying and automating the discovery of data for audits, investigations, and litigation. Used by legal and business customers worldwide including more than 50 of the top 250 Am Law firms and many of the world’s leading corporations, CloudNine’s eDiscovery automation software and services help customers gain insight and intelligence on electronic data.

Disclaimer: The views represented herein are exclusively the views of the author, and do not necessarily represent the views held by CloudNine. eDiscovery Daily is made available by CloudNine solely for educational purposes to provide general information about general eDiscovery principles and not to provide specific legal advice applicable to any particular circumstance. eDiscovery Daily should not be used as a substitute for competent legal advice from a lawyer you have retained and who has agreed to represent you.

Don’t Miss Our Webcast Today on Technology Assisted Review!: eDiscovery Webcasts

What is Technology Assisted Review (TAR)? Why don’t more lawyers use it? Find out in our webcast today!

Today at noon CST (1:00pm EST, 10:00am PST), CloudNine will conduct the webcast Getting Off the Sidelines and into the Game using Technology Assisted Review. In this one-hour webcast that’s CLE-approved in selected states, we’ll discuss what TAR really is, when it may be appropriate to consider for your case, what challenges can impact the use of TAR and how to get started. Topics include:

  • Understanding the Goals for Retrieving Responsive ESI
  • Defining the Terminology of TAR
  • Different Forms of TAR and How They Are Used
  • Acceptance of Predictive Coding by the Courts
  • How Big Does Your Case Need to Be to Use Predictive Coding?
  • Considerations for Using Predictive Coding
  • Challenges to an Effective Predictive Coding Process
  • Confirming a Successful Result with Predictive Coding
  • How to Get Started with Your First Case using Predictive Coding
  • Resources for More Information

Once again, I’ll be presenting the webcast, along with Tom O’Connor, who recently wrote an article about TAR that we covered on this blog.  To register for it, click here.  Even if you can’t make it, go ahead and register to get a link to the slides and to the recording of the webcast (if you want to check it out later).  If you want to learn about TAR, what it is and how to get started, this is the webcast for you!

So, what do you think?  Do you use TAR to assist in review in your cases?  Please share any comments you might have or if you’d like to know more about a particular topic.


Why Is TAR Like a Bag of M&M’s?, Part Four: eDiscovery Best Practices

Editor’s Note: Tom O’Connor is a nationally known consultant, speaker, and writer in the field of computerized litigation support systems.  He has also been a great addition to our webinar program, participating with me on several recent webinars.  Tom has also written several terrific informational overview series for CloudNine, including eDiscovery and the GDPR: Ready or Not, Here it Comes (which we covered as a webcast), Understanding eDiscovery in Criminal Cases (which we also covered as a webcast) and ALSP – Not Just Your Daddy’s LPO.  Now, Tom has written another terrific overview regarding Technology Assisted Review titled Why Is TAR Like a Bag of M&M’s? that we’re happy to share on the eDiscovery Daily blog.  Enjoy! – Doug

Tom’s overview is split into four parts, so we’ll cover each part separately.  The first part was covered last Tuesday, the second part was covered last Thursday and the third part was covered this past Tuesday.  Here’s the final part, part four.

Justification for Using TAR

So where does this leave us? The idea behind TAR – that technology can help improve the eDiscovery process – is a valuable goal. But figuring out what pieces of technology to apply at what point in the workflow is not so easy, especially when the experts disagree as to the best methodology.

Is there a standard, either statutory or in case law, to help us with this determination?  Unfortunately, no. As Judge Peck noted on page 5 of the Hyles case mentioned above, “…the standard is not perfection, or using the ‘best’ tool, but whether the search results are reasonable and proportional.”

FRCP 1 is even more specific.

These rules govern the procedure in all civil actions and proceedings in the United States district courts, except as stated in Rule 81. They should be construed, administered, and employed by the court and the parties to secure the just, speedy, and inexpensive determination of every action and proceeding.  (emphasis added)

The Court in any given matter decides if the process being used is just.  And although we have seen ample evidence that computers are faster than humans, speed may not always equate to accuracy. I’ll leave the issue of accuracy aside for another day, other than to mention two of the most interesting case studies: the EDI/Oracle study and the most recent Lex Geek “study,” in which a human SME scored exactly the same number of accurate retrievals as the computer system.

I am most interested in pointing out that few, if any, studies or case law opinions address the issue of “inexpensive.”  To his credit, Judge Peck did note in footnote 2 on page 3 of the Hyles opinion that “…some vendor pricing models charge more for TAR than for keywords,” but went on to note that typically those costs are offset by review time savings.  With all due respect to Judge Peck, to whose opinion I give great credence, I am not sure that is necessarily the case.

Most case studies I have seen emphasize speed or accuracy and don’t even mention cost. Yet the increased emphasis on proportionality in eDiscovery matters makes this third requirement more important than ever. Maura Grossman does provide for this concern in her Broiler Chicken protocol, but only to the extent that a concerned party should bring any issues to the Special Master.

The proportionality issue is an important one. Principle 4 of the Sedona Conference Commentary on Proportionality in Electronic Discovery states that “The application of proportionality should be based on information rather than speculation.” Absent specific statistics regarding TAR costs, it seems we are all too often engaging in speculation about the true cost of a specific technology.

I am mindful of the decision in the case of In Re State Farm Lloyds in March of 2017 (covered by eDiscovery Daily here), in which the Texas Supreme Court, deciding a matter involving the form of production and noting its parity with the Federal Rules, remarked that one party made an assertion of an “… extraordinary and burdensome undertaking … without quantifying the time or expense involved.”  Meaningful case studies and their statistics about the actual costs of various technologies would go a long way towards resolving these sorts of disputes and fulfilling the requirement of FRCP 1.

Conclusions

Although the use of TAR has been accepted in the courts for several years, there is still a great deal of confusion as to what TAR actually is. As a result, many lawyers don’t use TAR at all.

In addition, the lack of definitions makes pricing problematic. This means that several of the Federal Rules of Civil Procedure, including FRCP 1 and FRCP 26(b)(1), are difficult, if not impossible, to implement.

It is essential for the proper use of technology to define what TAR means and to determine not only the different forms of TAR but also the costs of using each of them.  Court approval of technology such as predictive coding, clustering and even AI all depends on clear, concise information and cost analysis.  Only then will technology usage be effective as well as just, speedy and inexpensive.

So, what do you think?  How would you define TAR?  As always, please share any comments you might have or if you’d like to know more about a particular topic.

Image Copyright © Mars, Incorporated and its Affiliates.


Why Is TAR Like a Bag of M&M’s?, Part Three: eDiscovery Best Practices

Editor’s Note: Tom O’Connor is a nationally known consultant, speaker, and writer in the field of computerized litigation support systems.  He has also been a great addition to our webinar program, participating with me on several recent webinars.  Tom has also written several terrific informational overview series for CloudNine, including eDiscovery and the GDPR: Ready or Not, Here it Comes (which we covered as a webcast), Understanding eDiscovery in Criminal Cases (which we also covered as a webcast) and ALSP – Not Just Your Daddy’s LPO.  Now, Tom has written another terrific overview regarding Technology Assisted Review titled Why Is TAR Like a Bag of M&M’s? that we’re happy to share on the eDiscovery Daily blog.  Enjoy! – Doug

Tom’s overview is split into four parts, so we’ll cover each part separately.  The first part was covered last Tuesday and the second part was covered last Thursday.  Here’s part three.

Uses for TAR and When to Use or Not Use It

Before you think about using more advanced technology, start with the basic tools early on: dedupe, de-nist, cull by dates and sample by custodians. Perhaps even keyword searches if your case expert fully understands case issues and is consistent in his or her application of that understanding.
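Those basic culling steps can be sketched in a few lines of code.  This is purely an illustrative sketch (not any vendor's actual tool, and the document field names are assumptions): deduplicate by content hash, drop files whose hashes appear on the NIST known-file list, and cull by date range.

```python
# Illustrative sketch of basic culling: dedupe by content hash, de-NIST
# against a set of known-system-file hashes, and cull by date range.
# Field names ("text", "date") are assumptions for this example.
import hashlib
from datetime import date

def basic_cull(docs, nist_hashes, start, end):
    """Return the documents surviving dedupe, de-NIST, and date culling."""
    seen, survivors = set(), []
    for doc in docs:
        digest = hashlib.sha256(doc["text"].encode()).hexdigest()
        if digest in seen:            # dedupe: skip exact duplicates
            continue
        if digest in nist_hashes:     # de-NIST: skip known system files
            continue
        if not (start <= doc["date"] <= end):  # date culling
            continue
        seen.add(digest)
        survivors.append(doc)
    return survivors

docs = [
    {"text": "quarterly report", "date": date(2017, 3, 1)},
    {"text": "quarterly report", "date": date(2017, 3, 1)},  # duplicate
    {"text": "holiday party",    "date": date(2012, 12, 1)}, # out of range
]
print(len(basic_cull(docs, set(), date(2016, 1, 1), date(2018, 1, 1))))  # 1
```

Real tools hash the raw file bytes rather than extracted text and handle near-duplicates as well, but the principle is the same: cheap, defensible filters first, advanced analytics after.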

When you have all (or at least most) of your data at the outset, some examples of good uses for TAR are:

  • Review-for-production with very large data sets
  • First pass review for Responsive/Not Responsive
  • First pass review for Privileged/Not Privileged
  • Deposition preparation
  • Working with an expert witness

Then when you are ready to move on to more advanced analytics, get an expert to assist you who has legal experience and can explain the procedure to you, your opponent and the Court in simple English.

Advanced tools may also be helpful when all of the data is not yet collected, but you need to:

  • Identify and organize relevant data in large datasets
  • Accomplish more than just identifying relevance or responsiveness
  • Locate a range of issues
  • Meet a very short deadline for a motion or hearing

There are several operational cautions to keep in mind, however.

  1. TAR isn’t new: it’s actually the product of incremental improvements over the last 15 years
  2. TAR isn’t one tool: just as there is no one definition of the tools, there is likewise no single approach to how they’re employed
  3. TAR tools do not “understand” or “read” documents. They work off of numbers, not words
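The third caution is worth a concrete illustration.  A common approach (one of several; this is a hypothetical sketch, not how any particular TAR product works) is to turn each document into a vector of term counts and compare those vectors mathematically, e.g. with cosine similarity.  The tool never "reads" anything; it compares numbers.

```python
# Sketch of the "numbers, not words" point: documents become bag-of-words
# count vectors, and similarity is a calculation over those vectors.
# Purely illustrative; real tools use richer features and weighting.
import math
from collections import Counter

def cosine_similarity(text_a, text_b):
    """Cosine similarity between bag-of-words term-count vectors."""
    a, b = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values())) *
            math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

print(cosine_similarity("patent license agreement", "license agreement draft"))
print(cosine_similarity("patent license agreement", "holiday party menu"))
```

Note that the second pair scores zero similarity simply because the texts share no tokens; a human would reach the same conclusion, but for reasons the math knows nothing about.  That gap is exactly why the tools can surprise you.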

And when do you NOT want to use TAR? Here is a good example.

This is a slide that Craig Ball uses in his presentation on TAR and eDiscovery:

Image Copyright © Craig D. Ball, P.C.

The point is clear. With large data sets that require little or no human assessment, TAR (and here we are specifically talking about predictive coding) is your best choice. But for the close calls, you need a human expert.

How does this work with actual data? The graphic below from the Open Source Connections blog shows a search result using a TAR tool in a price fixing case involving wholesale grocery sales.  The query was to find and cluster all red fruits.

Image Copyright © Open Source Connections blog

What do we see from this graphic?  The immediate point is that the bell pepper is red, but it is a vegetable, not a fruit. What I pointed out to the client, however, was that there were no grapes in the results.  A multimodal approach with human intervention could have avoided both of these errors.

We’ll publish Part 4 – Justification for Using TAR and Conclusions – on Thursday.

So, what do you think?  How would you define TAR?  As always, please share any comments you might have or if you’d like to know more about a particular topic.

Image Copyright © Mars, Incorporated and its Affiliates.
