Automated Electronic Discovery
by Jana Sukkarieh, August 2016. Special thanks to Bill Dimm, Founder & CEO of Hot Neuron, LLC, for reviewing the first version of this page.
- Automated e-Discovery
- TAR 1.0, 2.0 and 3.0
- Beyond Civil Litigation
- Concluding Remarks
Broadly, electronic discovery (e-discovery) refers to a process in which, given a collection of e-documents and a topic of interest, the goal is to find the documents in the collection that are relevant (or non-relevant) to this topic, excluding some privileged classes of documents. Oard, Baron, Hedin, Lewis and Tomlinson (2009/2010) define e-discovery as "The requirement that documents and information in electronic form stored in corporate systems be produced as evidence in litigation".
Before 2005, e-discovery was done using manual review and keyword searching. According to Oard et al. (2009/2010), “how lawyers go about propounding and responding to discovery didn’t materially change between the 1930s and the 1990s”. Kershaw (2005) was probably the first to mention statistical techniques for document similarity classification given a specific criterion, ontologies, and context across documents rather than only within a single document. Since then, many organizations and practitioners have been instrumental in galvanizing automated e-discovery research and development (“automated”, where a human is involved, as opposed to “automatic”). In particular, the Text Retrieval Conference Legal Track (TREC Legal Track) started in 2006 and prompted the building of test suites and the design of evaluations for the track (Oard et al. 2009/2010). In 2007, the first workshop on machine learning (ML) and other advanced techniques for Discovery and Information Governance (DESI) was co-located with ICAIL (http://www.iaail.org/), and in 2009 the Sedona Conference recognized that “the legal profession is at a crossroads” in terms of its willingness to embrace automated, analytical and statistical approaches to how law is practiced in the area of e-discovery. Though some progress has been made, there are many limitations and criticisms (see, for example, https://www.youtube.com/watch?v=f-Hdif4vaes), and vendors make many false claims about what they can deliver. Different communities and players have to come together and continue to play a vital role in this live, ongoing progress.
A major question in law is whether the performance of the technology is comparable to that of humans or falls short, assuming that humans are the gold standard. Even though some studies, such as Grossman and Cormack (2011), have shown comparable results between a machine and humans for some topics, the issue remains controversial given the unpredictability and ambiguity of many “production requests” or “information needs” (readers unfamiliar with these terms can consult the Grossman-Cormack Glossary) and given the multi-modal, multi-media nature of today’s e-documents. Hence, for more than one reason, the adoption or acceptance of technology-enabled e-discovery is still open to debate.
That said, everyone agrees that the process needs to be made more efficient and less costly. Therefore, investing in automation and in reducing manual work, especially by senior domain experts, is a thriving endeavor in the legal industry, with many claims about what the technologies are capable of.
When e-discovery is done in an automated manner, it might seem like just another Information Retrieval (IR) problem, but it is not, especially in the legal domain. The legal discovery task could be seen as a special case of IR, or as having commonalities with IR, but it seems to me a much more challenging task. The challenges arise in at least the following respects. First, the definitions of a topic, of privileged documents and of relevancy (or responsiveness, in e-discovery terms) are tricky. For an excellent account of various aspects of "relevancy", please see Brassil, Hogan and Attfield (2009). Second, independent evaluations are difficult to perform: data are often unavailable, what constitutes success is controversial, it is unclear who in the internal team decides whether a result counts as a success, and high recall is required, in some cases near 100%. Third, the repercussions of being deemed unsuccessful or unresponsive to the production request by an external body could be huge, which raises the question of who is held accountable. Last but not least, as mentioned earlier, the adoption and acceptance of the technology by enterprises, courts and judges is a major challenge.
Many research questions and parameters are at play in this task beyond those of the text processing, IR, ML and Knowledge Representation communities, and a newcomer to e-discovery might miss them. For example, recently, in the patent infringement subdomain, there was clear evidence that, in addition to IR, Textual Entailment was one major subtask under the e-discovery hood (Sukman, 2016). Other examples include the need to deal with multiple modalities and media, using tools such as speech transcription or automated speech recognition (ASR) and video annotation (Fersini, Messina, Archetti, and Cislaghi, 2010), and the need for deeper semantic features (Graus et al., 2013). A big open question is performance efficiency and text representation; see e.g. Hyman, Sincich, Will, and Warren (2015/2016). Also, dealing with privileged documents is non-trivial and current techniques may fail miserably. Hence, “privilege” is another topic of endeavor; see e.g. Gabriel, Paskach and Sharpe (2013) and Vinjumur, Oard and Axelrod (2015).
In another example, as pointed out by Conrad (2010), and especially since e-discovery recently opened the door to contract recovery/discovery (see the “Beyond Civil Litigation” section below), it has become apparent that Information Extraction (IE) can be a major task. Another point made by Conrad in his paper is the need to eliminate non-responsive or irrelevant documents/material early on. This is one approach ClearstoneIP uses. Also, not forgetting the elephant in the room, I think scanned documents require additional technologies, at least in the form of optical character recognition and image recognition and processing, that are more accurate than what has been used to date. Furthermore, a point that seems to be missing in the e-discovery literature is that all of the above requires dealing with noisy data. Hence, one needs to allow for data cleaning tools and/or make sure that the components of the technology under the hood are robust to noise. Finally, collaboration among various communities in AI is essential to make further progress. See, for example, "Where Search Meets Machine Learning", presented by Diana Hu and Joaquin Delgado, which addresses current techniques of ML and IR.
The industry-specific terms used to describe the automated processes used for e-discovery are technology-assisted review (TAR), computer-assisted review (CAR) and predictive coding. In some instances the terms are used interchangeably; in others, predictive coding is used for supervised ML and TAR for a broader range of techniques. Many practitioners define the process as an iterative one using ML. However, nothing in the specification of the e-discovery task obliges one to use ML, or indeed requires the process to be iterative. It is, however, one methodology with which researchers and practitioners have tackled the problem with some success.
TAR 1.0, 2.0 and 3.0
It is important to remind a beginner that, contrary to what some uninformed news channels convey, ML is not synonymous with AI but only one of its sub-fields, and deep learning is not synonymous with ML but only one of its sub-camps. As for e-discovery protocols, one discovers that, except for a few papers and blogs, not much is described in terms of text processing, ML or feature engineering techniques.
According to researchers and practitioners, the three versions of TAR (1.0, 2.0 and 3.0) all use ML in one way or another. The main differences between TAR 1.0 and 2.0 are a) how the seed set is selected, b) whether the iterative approach takes advantage of the classification model and c) whether the documents to be reviewed and added to the training set are the top-ranking documents or the ones that the ML algorithm is least certain about (Cormack and Grossman, 2014). TAR 3.0 is the implementation in which clustering of documents is performed as a first step. For more details on TAR 3.0, the reader can refer to “TAR 3.0 Performance” (Dimm, 2016). For an excellent reference and a more detailed description of the techniques, the reader can refer to Bill Dimm’s in-progress book (http://predictivecodingbook.com).
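The choice in (c) between top-ranking and least-certain documents can be illustrated with a minimal sketch. It assumes that some classifier has already assigned each unreviewed document a relevance score between 0 and 1; the function names, document identifiers and scores below are hypothetical and do not represent any particular author's or vendor's implementation.

```python
def top_ranked(scores, k):
    """Pick the k documents the model scores as most likely relevant."""
    # Sort by descending score; ties are broken by document id for determinism.
    return sorted(scores, key=lambda doc: (-scores[doc], doc))[:k]


def least_certain(scores, k):
    """Pick the k documents whose scores are closest to 0.5 (most uncertain)."""
    return sorted(scores, key=lambda doc: (abs(scores[doc] - 0.5), doc))[:k]


# Hypothetical relevance scores from some classifier (illustrative values only).
scores = {"doc_a": 0.95, "doc_b": 0.52, "doc_c": 0.10, "doc_d": 0.45, "doc_e": 0.80}

print(top_ranked(scores, 2))     # ['doc_a', 'doc_e']
print(least_certain(scores, 2))  # ['doc_b', 'doc_d']
```

In an iterative protocol, the selected batch would be reviewed by a human, added to the training set, and the model retrained before the next selection; the two strategies differ only in which documents the reviewer sees next.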
In terms of ML techniques, there are many questions to be answered when it comes to e-discovery. For instance, do all researchers and practitioners use features such as bits, bytes, words and n-grams? Are deeper linguistic features in general, and semantic features in particular, being considered? How do text, video, speech and image features translate into ML features? Which features are proving to be more significant, and why? Which ML technique is more efficient, scalable or accurate given the size of the seed set, the topic at hand, the type of data and the size of the data (whether the training set or the total dataset)?
There are also many parameters in e-discovery beyond ML that could be, and need to be, explored further. They can be grouped into at least the following categories, which are not necessarily disjoint and by no means exhaustive. First, a “Relevancy or Responsiveness” category includes parameters such as the (non-)ambiguity of production requests, or temporal parameters, since what is relevant at one time might become irrelevant later, or vice versa. Then, an “Iteration, Review and Evaluation” category could include parameters related to data prevalence, human involvement (expertise, role, number, and the stage at which they get involved), evaluation methodology and metrics, what constitutes success, etc. Another category relates to “Cross-countries, Languages and Jurisdictions” and to dealing with major changes during periods of uncertainty, such as Brexit or election campaigns. An important category that might be overlooked is the “Technology’s ease of use”, which includes parameters such as deployment, integration, modularity, server requirements, data storage (and destruction), quality assurance, sleekness of user interfaces, if any, and user experience in general. Last but not least, there are additional ethical issues or parameters, for example the assumption that parties on opposite ends of the litigation (or even within a party) are playing by the rules.
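As a concrete illustration of the simplest feature type mentioned above, word n-grams, here is a minimal sketch. The function name, the lowercasing and the whitespace tokenization are assumptions made for illustration; real systems typically use more careful tokenization and weighting.

```python
from collections import Counter


def word_ngrams(text, n):
    """Count the word n-grams in text (lowercased, split on whitespace)."""
    tokens = text.lower().split()
    grams = (" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return Counter(grams)


# Illustrative sentence only; not taken from any real collection.
features = word_ngrams("the license agreement and the license fee", 2)
print(features["the license"])  # 2
print(len(features))            # 5 distinct bigrams
```

Each distinct n-gram becomes one dimension of a document's feature vector, which is why the choice of n and of tokenization directly affects the size and sparsity of the representation.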
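The evaluation parameters mentioned above (prevalence, and metrics such as recall) can be made concrete with a small sketch. It assumes, unrealistically, that the full set of relevant documents is known; in practice it is usually estimated by sampling. The function name and the document ids are hypothetical.

```python
def review_metrics(relevant, retrieved, collection_size):
    """Return (recall, precision, prevalence) from sets of document ids."""
    found = relevant & retrieved  # relevant documents that were actually retrieved
    recall = len(found) / len(relevant) if relevant else 0.0
    precision = len(found) / len(retrieved) if retrieved else 0.0
    prevalence = len(relevant) / collection_size
    return recall, precision, prevalence


# Illustrative example: 4 relevant documents in a collection of 100,
# and a review that retrieved 5 documents, 3 of them relevant.
relevant = {1, 2, 3, 4}
retrieved = {2, 3, 4, 5, 6}
recall, precision, prevalence = review_metrics(relevant, retrieved, 100)
print(recall, precision, prevalence)  # 0.75 0.6 0.04
```

Low prevalence is one reason high recall is costly: when few documents in the collection are relevant, finding nearly all of them tends to require reviewing many non-relevant ones.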
Beyond Civil Litigation
Though people might think of e-discovery as a term mostly associated with civil litigation, including patent infringement (e.g. Sukman, 2016) and landscaping (e.g. Abood and Feltenberger, 2016), it applies to many other legal and non-legal domains. “Other areas closely affiliated with EDD today include litigation and management, compliance regulation, Freedom of Information inquiries, and Homeland Security Initiatives” (Conrad, 2010). Cormack and Grossman (2015) refer to e-discovery for evidence-based medicine (Lefebvre, Manheimer & Glanville, 2008) and for building test suites for IR (Sanderson and Joho, 2004). The most obvious discovery task is of course the "literature review" that researchers undertake on a regular basis. "Success" in research also depends on many criteria, including subjective ones. One "trap" that researchers might fall into in literature discovery is that they stop prematurely, do not go back far enough in time (maybe for lack of time) or do not see connections when, for example, techniques or knowledge are presented under different names; as a result, many "re-invent the wheel". That said, literature discovery might never be enough, in the sense that it should be always on, in an iterative, adaptive mode (one has to start somewhere and actively learn, continuously).
Recently, e-discovery gave birth to contract recovery as one could see from the following very recent interesting articles:
The bottom line is that there are many more questions to be answered and a non-trivial amount of work to be done for everyone interested in learning about or working on e-discovery. Automated e-discovery is a growing, evolving, dynamic field with different trends and changing players. Hence, research and development avenues are still to be explored, beyond or concurrently with the major techniques and protocols to date.
Some additional resources with their hyperlinks are listed here (in no particular order):
1. The Grossman-Cormack Glossary of Technology Assisted Review by Grossman and Cormack (2013)
2. The Decade of Discovery (2014), directed by Joe Looby. With Jason R. Baron, Richard G. Braman, Stephen Breyer, John M. Facciola.
6. Book by Bill Dimm at http://predictivecodingbook.com
7. TAR for Smart People by John Tredennick with Mark Noel, Jeremy Pickens, Robert Ambrogi & Thomas C. Gricks III.
9. E-Discovery Team of Ralph Losey, https://e-discoveryteam.com/ and in particular https://e-discoveryteam.com/car/
15. Data Preservation, Alex Ponce de Leon of Google