Organizations of all sizes are asking their legal teams to go beyond pure legal risk management and compliance: to become more cost-efficient while managing more cases and more new sources of data.
Generative AI has arrived at a moment when it answers many legal teams' demands to do more with less, but all AI tools, and the models they run on, require training to be effective. High-quality automation depends on high-quality training, and both rely on having AI-ready data for the job at hand. At a minimum, this means the data must be appropriate and sufficient to train the algorithms properly, and it must be properly sourced and procured.
Many legal teams now employ workflow audits to identify and prioritize the most promising areas for deploying generative AI, but these audits often overlook data readiness assessments. Even areas with great potential for generative AI will fail without the AI-ready data and the expertise needed to train models effectively and compliantly.
If your data isn’t AI-ready, the results can be dire. Data pitfalls have already devastated big brands’ advertising campaigns, products, and entire businesses.
Heavy fines and the disgorgement of ill-gotten algorithmic gains
The FTC has a track record in recent years of cracking down on the misuse of AI to protect consumers, primarily based on the notion of fraudulently obtained data. While generative AI is much newer, early cases suggest that similar themes will prevail as artists, creators, and other content owners seek to protect themselves and their work from the new wave of generative AI tools.
Meta received its first multibillion-dollar fine for data misuse when Facebook was fined $5 billion for data privacy violations in 2019, followed by another $1.3 billion fine in the European Union in 2023. Also in 2023, the FTC fined Amazon $25 million for excessive data collection and insufficient disclosures related to its Ring doorbell and Alexa assistant, and fined Microsoft $20 million for violations involving children's Xbox accounts. Fines big and small are only a fraction of the negative impact.
In a 2021 Yale Journal of Law and Technology article, FTC Commissioner Rebecca Slaughter describes the concept of algorithmic disgorgement. “The premise is simple,” she writes. “If a company collects data illegally, it shouldn’t be able to profit from that data or the algorithms developed with it.”
Early examples of FTC orders to destroy algorithms include Weight Watchers and EverAlbum. In 2022, Weight Watchers was required to delete all improperly obtained personal information about children under the age of 13, pay a relatively small penalty, and destroy all algorithms derived from that data. EverAlbum misused facial recognition and, as part of its settlement with the FTC, was required to delete the photos and videos of deactivated users and destroy the algorithms developed with those users’ photos and videos.
Generative AI sparks similar lawsuits
Litigation surrounding generative AI is relatively new and ongoing. Among the early cases is the copyright lawsuit against GitHub, OpenAI, and Microsoft over Copilot. Multiple plaintiffs allege that Copilot, a generative AI code-suggestion tool, “reproduces publicly available code in violation of copyright law and software licensing requirements,” and that its creation relies on software piracy on an unprecedented scale. In another dispute over improperly sourced training data, Getty Images has sued Stability AI, alleging that Stability AI illegally used Getty’s copyrighted image library to train Stable Diffusion, a popular AI art tool. The case is set to go to court in the UK, but the image below seems to say a lot: in an example cited in the lawsuit, Stable Diffusion reproduced Getty Images’ watermark as part of a generated image.
Caption: An image generated by Stable Diffusion reproducing the Getty Images watermark, via The Verge.
Ensuring data provenance for long-term AI viability
Data provenance answers where data came from, how it was acquired, who owns it, and what it is used for. It has long been an indicator of validity for research data, including medical research. According to the National Library of Medicine, “The purpose of data provenance is to inform researchers of the origins, modifications, and details that support the reliability and validity of research data.”
Similarly, for training algorithms or language models, data provenance helps ensure the validity of the data and its usefulness for training by revealing its origins and other details. It also documents where and how the data was obtained, along with the specific permissions granted for its collection and use.
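To make this concrete, the sketch below shows one way a legal team might capture provenance metadata alongside a training data set. The `ProvenanceRecord` structure, its field names, and the helper function are illustrative assumptions, not a standard schema or any particular product's API.

```python
from dataclasses import dataclass, field
from datetime import date

# Illustrative only: the structure and field names are assumptions,
# not a standard provenance schema.
@dataclass
class ProvenanceRecord:
    source: str                  # where the data came from
    acquired_on: date            # when it was obtained
    acquisition_method: str      # how it was obtained (export, API, vendor, etc.)
    owner: str                   # who owns the underlying data
    permitted_uses: list[str] = field(default_factory=list)  # uses covered by disclosures/consent
    consent_reference: str = ""  # the disclosure or agreement being relied on

def is_cleared_for(record: ProvenanceRecord, intended_use: str) -> bool:
    """Flag data whose documented permissions do not cover the intended use."""
    return intended_use in record.permitted_uses

# Example: a collection cleared for document review but not for model training
record = ProvenanceRecord(
    source="Custodian email export, Matter 2023-104",
    acquired_on=date(2023, 5, 2),
    acquisition_method="eDiscovery collection",
    owner="Client Corp.",
    permitted_uses=["document review"],
    consent_reference="Collection notice v2",
)
print(is_cleared_for(record, "model training"))  # False -> re-disclosure needed
```

A record like this makes the re-disclosure question discussed next an explicit, checkable step rather than an afterthought.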
All data must be collected with appropriate disclosures that are specific enough to cover every intended use. In legal departments and across companies, data collection disclosures are often made before training an AI model or algorithm is ever contemplated. In those situations, the disclosures must be amended, re-disclosed, and re-agreed to, which can be a significant undertaking.
Garbage in, garbage out
Generative AI models learn what we teach them, and they can just as easily learn the wrong things. The old saying “garbage in, garbage out” holds true.
There are two potential issues to avoid:
Data quality: Before developing and running algorithms on a collection of data, it is important to understand what types of data are not suitable for training. For example, non-text data (images, audio, video) and poorly formatted or poorly OCRed documents may not be suitable for a particular GenAI model and should be excluded.
Inaccurate or biased training data: If the training data contains bias or errors (for example, documents miscoded in eDiscovery), the model’s output can be biased as well. This becomes a serious issue when running algorithms across very large data sets, where it amplifies the risk of poor performance and inaccuracies. A simple screening sketch covering both issues follows this list.
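As a rough illustration of both checks, the sketch below screens a document collection before training: it sets aside files a text-based GenAI model cannot use and flags label categories that are too thinly represented to train on reliably. The accepted file types, the `min_class_share` threshold, and the function itself are illustrative assumptions, not part of any specific tool.

```python
from collections import Counter
from pathlib import Path

# Illustrative only: accepted file types and the skew threshold are assumptions.
TEXT_EXTENSIONS = {".txt", ".eml", ".msg", ".docx", ".html"}

def screen_for_training(documents, min_class_share=0.05):
    """Split documents into usable/excluded and flag under-represented labels.

    `documents` is a list of (path, label) pairs, e.g. coding decisions
    exported from an eDiscovery review.
    """
    usable, excluded = [], []
    for path, label in documents:
        if Path(path).suffix.lower() in TEXT_EXTENSIONS:
            usable.append((path, label))
        else:
            excluded.append(path)  # images, audio, video, etc.

    counts = Counter(label for _, label in usable)
    total = sum(counts.values()) or 1
    skewed = [lab for lab, n in counts.items() if n / total < min_class_share]
    return usable, excluded, skewed

# Example usage
docs = [("0001.eml", "responsive"), ("0002.docx", "non-responsive"),
        ("0003.mp4", "responsive"), ("0004.txt", "privileged")]
usable, excluded, skewed = screen_for_training(docs)
print(excluded)  # ['0003.mp4'] -> not suitable for a text-based model
print(skewed)    # labels too rare to train on reliably (empty in this tiny example)
```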
The essential role of humans
It is important to highlight the synergy between technology and human expertise: legal experts play a key role both in shaping the algorithms and in validating AI-generated output, ensuring that the nuanced aspects of the task are accurately captured.
To avoid “garbage in, garbage out,” algorithms must be developed with human expertise and input, thoroughly tested through feedback loops, and only then applied across document sets, such as privilege logs drafted with GenAI. Even the most AI-ready data set cannot perfectly train an algorithm; a continuous feedback loop of human review and expertise, along with rigorous quality control, optimizes the model and therefore the results.
With a feedback loop in place, the performance and efficiency of your AI tool will steadily improve over time. As more feedback is incorporated, the need to make corrections and provide feedback will correspondingly decrease.
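As a minimal sketch of what such a feedback loop might look like in code, the function below runs repeated rounds of model suggestions, expert review, and retraining. The `model`, `human_review`, and `retrain` callables are placeholders for whatever GenAI tooling and review workflow a team actually uses; none refers to a specific product.

```python
# Minimal human-in-the-loop sketch; `model`, `human_review`, and `retrain`
# are placeholders, not real library calls.
def review_cycle(model, documents, human_review, retrain, rounds=3):
    """Run repeated rounds of prediction, expert review, and retraining."""
    corrections_log = []
    for _ in range(rounds):
        predictions = [(doc, model.predict(doc)) for doc in documents]
        # Legal experts confirm or correct each suggestion.
        corrections = [human_review(doc, pred) for doc, pred in predictions]
        corrections_log.extend(corrections)
        # Feed confirmed and corrected examples back into training.
        model = retrain(model, corrections)
    return model, corrections_log
```

The design point is simply that corrections are captured and fed back on a schedule, so the volume of human intervention can shrink as the model improves, as described above.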
Additional considerations for AI readiness
While having effective and well-sourced data is important, identifying the right technologies and partners to use that data in generative AI is also crucial. In our upcoming AI readiness paper, we will explore how to determine the AI readiness of law firms, legal service providers, promising software, and existing technology stacks.