eDiscovery Daily Blog

A Fresh Comparison of TAR and Keyword Search: eDiscovery Best Practices

Bill Dimm of Hot Neuron (the company behind Clustify, which provides document clustering and predictive coding technologies, among others) is one of the smartest men I know when it comes to technology assisted review (TAR).  So, I’m always interested to hear what he has to say about TAR, how it can be used and how effective it is compared to other methods (such as keyword searching).  His latest blog post on the Clustify site talks about an interesting exercise that did exactly that: it compared TAR to keyword search in a real classroom scenario.

In TAR vs. Keyword Search Challenge on the Clustify blog, Bill describes how he challenged the audience during the NorCal eDiscovery & IG Retreat to create keyword searches that would work better than technology-assisted review (predictive coding) for two topics.  Half of the room was tasked with finding articles about biology (science-oriented articles, excluding medical treatment) and the other half searched for articles about current law (excluding proposed laws or politics).  Bill then ran one of the searches against TAR in Clustify live during the presentation.  Time constraints kept him from running the others during the session, but he ran them afterward and covered them on his blog, along with the specific searches he compared against TAR.

To evaluate the results, Bill measured the recall from the top 3,000 and top 6,000 hits on each search query (3% and 6% of the population respectively).  He also included the recall achieved by looking at all documents that matched the search query, just to see what recall the search queries could achieve if you didn’t worry about pulling in a ton of non-relevant documents.  For the TAR results he used TAR 3.0 (which is like Continuous Active Learning, but applied to cluster centers only) trained with (a whopping) two seed documents (one relevant from a keyword search and one random non-relevant document) followed by 20 iterations of 10 top-scoring cluster centers each, for a total of 202 training documents (2 + 20 x 10).  To compare to the top 3,000 search query matches, the 202 training documents plus the 2,798 top-scoring documents were used for TAR, so the total document review (including training) would be the same for TAR and the search query.
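To make the evaluation concrete, here is a minimal Python sketch of recall at a fixed review budget, which is the kind of measurement described above.  This is not Bill's code or anything from Clustify: the function names (recall_at_k, tar_review_order) and the tiny document IDs are made up purely for illustration, and the sketch assumes you already know which documents are relevant.

```python
from typing import Iterable, List, Sequence, Set


def recall_at_k(ranked_ids: Sequence[str], relevant_ids: Set[str], k: int) -> float:
    """Fraction of all relevant documents found within the first k reviewed."""
    if not relevant_ids:
        raise ValueError("relevant_ids must not be empty")
    reviewed = set(ranked_ids[:k])
    return len(reviewed & relevant_ids) / len(relevant_ids)


def tar_review_order(training_ids: Iterable[str], scored_ids: Iterable[str]) -> List[str]:
    """Training documents first, then top-scoring documents not yet reviewed.

    Counting training documents against the budget keeps the total review
    effort the same for TAR and for a keyword search.
    """
    order: List[str] = []
    seen: Set[str] = set()
    for doc_id in list(training_ids) + list(scored_ids):
        if doc_id not in seen:
            seen.add(doc_id)
            order.append(doc_id)
    return order


# Review-budget arithmetic taken from the post.
SEED_DOCS = 2
ITERATIONS = 20
DOCS_PER_ITERATION = 10
training_total = SEED_DOCS + ITERATIONS * DOCS_PER_ITERATION  # 202
remaining_budget = 3000 - training_total                      # 2798

# Toy data (hypothetical, NOT from the actual exercise).
relevant = {"d1", "d2", "d3", "d4"}
keyword_ranked = ["d1", "x1", "x2", "d2", "x3", "x4"]
tar_ranked = tar_review_order(
    training_ids=["d1", "x1"],
    scored_ids=["d2", "d1", "d3", "x2", "d4"],
)

keyword_recall = recall_at_k(keyword_ranked, relevant, k=4)
tar_recall = recall_at_k(tar_ranked, relevant, k=4)
print(f"keyword recall@4 = {keyword_recall:.2f}, TAR recall@4 = {tar_recall:.2f}")
```

The toy numbers only show how the comparison is set up (same number of documents reviewed on both sides); they say nothing about the actual results of the exercise.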

The result: TAR beat keyword search across the board for both tasks.  The top 3,000 documents returned by TAR achieved higher recall than the top 6,000 documents for any keyword search.  Based on this exercise, TAR achieved better results (higher recall) with half as much document review compared to any of the keyword searches.  The top 6,000 documents returned by TAR achieved higher recall than all of the documents matching any individual keyword search, even when the keyword search returned 27,000 documents.

Bill acknowledges that the audience had limited time to construct queries, they weren’t familiar with the data set, and they couldn’t do sampling to tune their queries, so the keyword searching wasn’t optimal.  Then again, for many of the attorneys I’ve worked with, that sounds pretty normal.  :o)

One reader commented about email headers and footers cluttering up results, and Bill pointed out that “Clustify has the ability to ignore email header data (even if embedded in the middle of the email due to replies) and footers”, a feature I’ve seen in action and which is actually pretty cool.  Irrespective of the specifics of the technology, Bill’s exercise is a terrific fresh example of how TAR can outperform keyword search.  As Bill notes in his response to the commenter, “humans could probably do better if they could test their queries, but they would probably still lose”.  Very interesting.  You’ll want to check out the details of his test in his post on the Clustify blog.

So, what do you think?  Do you think this is a valid comparison of TAR and keyword searching?  Why or why not?  Please share any comments you might have or if you’d like to know more about a particular topic.
