Artificial intelligence is gaining acceptance in the legal profession, especially for analyzing large sets of electronically stored information, because it can ease budget and staffing constraints. AI lets legal departments do more with less and frees attorneys for work that draws on their legal knowledge. Many attorneys are familiar with basic e-discovery tools that run simple word searches to find key documents, eliminate duplicates, and connect conversations, but AI has moved beyond these basics.

One of the most powerful e-discovery technologies is predictive coding, which evaluates documents for context, concepts, and tone rather than for keywords alone. It can greatly increase accuracy and relevance in document review and complete tasks in minutes rather than days or months. “The demand for computer-assisted techniques to do large document reviews is so great that predictive coding has become the one mainstream application of machine learning techniques in the legal industry,” said Warren Agin, principal at Analytic Law and founding chair of the American Bar Association’s Legal Analytics Committee.

Statistical Sampling Validates Results

Attorneys may be troubled by the possibility that predictive coding could overlook relevant documents. One way to address this concern is through data sampling, a process that involves reviewing a set of sample documents and extrapolating the results to the entire population of a document production.
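The extrapolation step can be illustrated with a short sketch. It assumes a simple random sample and uses the textbook normal-approximation margin of error for a proportion; the function name, the 95% confidence level, and the example figures are hypothetical and not taken from any particular review platform.

```python
import math

def estimate_responsive(sample_labels, population_size, z=1.96):
    """Extrapolate a sample's responsive rate to the full population.

    sample_labels: list of booleans from manual review (True = responsive).
    z: z-score for the confidence level (1.96 is roughly 95%).
    Returns (point_estimate, low, high) as document counts.
    """
    n = len(sample_labels)
    if n == 0:
        raise ValueError("sample must not be empty")
    p = sum(sample_labels) / n
    # Normal-approximation margin of error for a sample proportion.
    margin = z * math.sqrt(p * (1 - p) / n)
    low = max(0.0, p - margin)
    high = min(1.0, p + margin)
    return (round(p * population_size),
            round(low * population_size),
            round(high * population_size))

# Hypothetical figures: 400 sampled documents, 60 coded responsive,
# drawn from a production of 100,000 documents.
labels = [True] * 60 + [False] * 340
estimate, low, high = estimate_responsive(labels, population_size=100_000)
print(f"Estimated responsive documents: {estimate} (range {low} to {high})")
```

A larger sample narrows the range, which is why the sample size is normally chosen before review begins.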

To check the reliability of a document production, reviewers select a statistically valid random sample of documents and manually review it for responsiveness. Human reviewers examine the selected sample to determine whether the software accurately identified the responsive documents. If the sample contains too many irrelevant documents or fewer relevant documents than expected, the algorithm is run again (after being fed additional examples of relevant documents), and another sample set is generated. The process continues until the human reviewers are satisfied with the accuracy of the results.
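The validation check described above might look like the following sketch. Precision and recall are standard measures for "too many irrelevant" and "too few relevant" documents, but the source does not name specific metrics, so their use here is an assumption. The function names and the acceptance thresholds are likewise hypothetical; a real matter would set its own targets.

```python
import random

def draw_sample(doc_ids, sample_size, seed=0):
    """Draw a simple random sample of document IDs without replacement."""
    rng = random.Random(seed)
    return rng.sample(list(doc_ids), sample_size)

def precision_recall(predicted, human):
    """Compare software calls with human calls on the sampled documents.

    predicted, human: dicts mapping doc ID -> True (responsive) or False.
    """
    tp = sum(1 for d in human if predicted[d] and human[d])
    fp = sum(1 for d in human if predicted[d] and not human[d])
    fn = sum(1 for d in human if not predicted[d] and human[d])
    # Precision: share of software "responsive" calls that reviewers confirmed.
    precision = tp / (tp + fp) if tp + fp else 0.0
    # Recall: share of truly responsive documents that the software found.
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

def needs_retraining(precision, recall, min_precision=0.8, min_recall=0.75):
    """Hypothetical acceptance rule: retrain if either measure falls short."""
    return precision < min_precision or recall < min_recall

# Tiny illustrative sample of five documents.
predicted = {1: True, 2: True, 3: False, 4: False, 5: True}
human = {1: True, 2: False, 3: True, 4: False, 5: True}
precision, recall = precision_recall(predicted, human)
retrain = needs_retraining(precision, recall)
```

If `retrain` is true, reviewers would add more coded examples, rerun the software, and draw a fresh sample.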

Statistical measurements are indispensable for ensuring that the technology produces an accurate, defensible result. While no means of screening for privilege is perfect, statistical testing of the screen can bring greater confidence in the accuracy of the outcome.

Quality Seed Sets Are Key To Quality Results

“Garbage in, garbage out” has long been an axiom in computer science. For machine learning and deep learning to fulfill their potential, the training set used to teach the computer must be carefully considered. To create the training materials, known as a “seed set,” expert reviewers select a representative cross section of documents from the full population that needs to be reviewed.

The reviewers then code, or label, each document in the seed set as responsive or unresponsive and input those results into the predictive coding software. The AI or machine learning software analyzes the seed set and creates an algorithm for predicting the responsiveness of future documents. Reviewers then test as described above to verify accuracy and refine the algorithm until the desired results are achieved.
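To make the seed-set idea concrete, here is a minimal sketch that trains on a handful of coded documents and scores a new one. It uses a naive Bayes word model purely as an illustration; the source does not say which algorithm commercial predictive coding tools use, and the function names and sample documents are hypothetical.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into words."""
    return re.findall(r"[a-z']+", text.lower())

def train(seed_set):
    """seed_set: list of (text, is_responsive) pairs coded by expert reviewers."""
    word_counts = {True: Counter(), False: Counter()}
    doc_counts = Counter()
    for text, label in seed_set:
        doc_counts[label] += 1
        word_counts[label].update(tokenize(text))
    vocab = set(word_counts[True]) | set(word_counts[False])
    return {"word_counts": word_counts, "doc_counts": doc_counts, "vocab": vocab}

def responsiveness_score(model, text):
    """Return an estimated probability (0 to 1) that the text is responsive."""
    total_docs = sum(model["doc_counts"].values())
    vocab_size = len(model["vocab"])
    log_scores = {}
    for label in (True, False):
        counts = model["word_counts"][label]
        total_words = sum(counts.values())
        # Smoothed log prior, so a class with no documents cannot yield log(0).
        score = math.log((model["doc_counts"][label] + 1) / (total_docs + 2))
        for word in tokenize(text):
            if word in model["vocab"]:  # words never seen in training are skipped
                score += math.log((counts[word] + 1) / (total_words + vocab_size))
        log_scores[label] = score
    # Turn the two log scores into a probability for the responsive class.
    top = max(log_scores.values())
    exp_true = math.exp(log_scores[True] - top)
    exp_false = math.exp(log_scores[False] - top)
    return exp_true / (exp_true + exp_false)

# Hypothetical seed set: two responsive and two unresponsive documents.
seed_set = [
    ("merger agreement draft attached for review", True),
    ("pricing terms for the merger deal", True),
    ("lunch menu for friday", False),
    ("office party parking reminder", False),
]
model = train(seed_set)
likely = responsiveness_score(model, "revised merger pricing")
unlikely = responsiveness_score(model, "friday lunch parking")
```

Even in this toy version, the quality of the coded examples determines what the model learns, which is the point of the "garbage in, garbage out" axiom.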

Bias in Data and Algorithms Affects Machine Learning

Artificial intelligence is as prone to bias as humans are. Bias can seep into algorithms and the data machine learning uses to train, influencing the results. Bias can arise when data used to calibrate machine-learning algorithms is insufficient or the algorithms themselves are poorly designed. The data picked by the trainers is subject to the trainers’ biases, so they must be vigilant to avoid bias in assembling seed sets.

Bias in assessing electronic data can be a significant problem if, for instance, only text-search software is used to select files and only text analytics is used to evaluate them. Some electronically stored information is stored as images and is therefore invisible to text-only searches unless the images are converted to text.
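One simple safeguard is to separate files that yielded no usable text before any text-based selection runs. The sketch below assumes that some upstream extractor has already produced a text string for each file; the function name, the `min_chars` threshold, and the file names are hypothetical.

```python
def split_by_text_layer(extracted_text, min_chars=20):
    """Separate files with usable text from files that likely need conversion.

    extracted_text: dict mapping file name -> text returned by a text extractor
    (typically empty or near-empty for scanned images).
    Returns (searchable, needs_conversion) as two lists of file names.
    """
    searchable, needs_conversion = [], []
    for name, text in extracted_text.items():
        if len(text.strip()) >= min_chars:
            searchable.append(name)
        else:
            needs_conversion.append(name)
    return searchable, needs_conversion

# Hypothetical extractor output for three files.
files = {
    "contract.docx": "This agreement is made between the parties listed below.",
    "scan_0001.tif": "",
    "fax_cover.pdf": "  \n ",
}
searchable, needs_conversion = split_by_text_layer(files)
```

Files in `needs_conversion` would be routed to image-to-text conversion rather than silently dropped from the review population.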

In addition, if the person training the AI lacks sufficient knowledge to accurately gauge the difference between responsive and unresponsive documents, the engine will learn incorrectly. The key to combating bias in machine learning is to assume it exists and work to remove it.

“Don’t ignore the human element when utilizing artificial intelligence.”

Conclusion, Recommendations and Cost Efficiencies

When machine learning analytics technology is applied to electronically stored information, a highly navigable framework can be created that enables lawyers to see connections they might not have considered at the beginning of the discovery process. Machine learning leverages sampling techniques and advanced algorithms to predict whether documents are responsive to criteria established by the e-discovery team. As artificial intelligence is refined to produce increasingly accurate and complex results, costs related to e-discovery will be reduced and confidence regarding responsiveness increased (provided the underlying algorithms and seed sets are sound).

The benefits of using predictive coding in e-discovery are that it:

  • Prioritizes or eliminates documents based on relevancy scores
  • Lowers costs by reducing the number of documents requiring manual review
  • Sorts documents so they can more easily be assigned to specific reviewers
  • Enables strategic decision-making earlier in a case
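The first three benefits can be sketched in a few lines: rank documents by relevancy score, set aside those below a cutoff, and distribute the remainder among reviewers. The cutoff value, the round-robin assignment, and all names here are illustrative assumptions rather than a description of any specific product.

```python
def prioritize(scores, cutoff=0.2):
    """Rank documents by relevancy score and split them at a cutoff.

    scores: dict mapping doc ID -> relevancy score between 0 and 1.
    Returns (review_queue, set_aside), both ordered from highest score to lowest.
    """
    ranked = sorted(scores, key=scores.get, reverse=True)
    review_queue = [d for d in ranked if scores[d] >= cutoff]
    set_aside = [d for d in ranked if scores[d] < cutoff]
    return review_queue, set_aside

def assign(review_queue, reviewers):
    """Distribute queued documents to reviewers in round-robin order."""
    batches = {r: [] for r in reviewers}
    for i, doc in enumerate(review_queue):
        batches[reviewers[i % len(reviewers)]].append(doc)
    return batches

# Hypothetical scores produced by the predictive coding software.
scores = {"A": 0.91, "B": 0.05, "C": 0.47, "D": 0.78, "E": 0.12}
review_queue, set_aside = prioritize(scores)
batches = assign(review_queue, ["reviewer_1", "reviewer_2"])
```

Documents that are set aside would still be subject to the statistical sampling described earlier before being excluded from manual review.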

Predictive coding increases accuracy while reducing the time and expense of document review by using machine learning technology combined with expert reviewers.

This whitepaper is not intended to provide any legal advice.
