Perhaps there’s only one industry hotter than artificial intelligence right now: AI litigation.
To name just a few examples: author Michael Chabon is suing Meta. Getty Images is suing Stability AI. And both the New York Times and the Authors Guild have filed separate lawsuits against OpenAI and Microsoft.
At the center of these lawsuits are claims that technology companies illegally used copyrighted works as part of their AI training data.
For a text-focused generative AI, there’s a good chance that some of its training data will come from one large archive: Common Crawl.
“Common Crawl is a copy of the Internet. It’s an archive of 17 years of the Internet, and we’re making it freely available to researchers, academics, and businesses,” said Rich Skrenta, president of the nonprofit Common Crawl Foundation.
Since 2007, Common Crawl has stored 250 billion web pages as downloadable data files. Until recently, some of its biggest users were academics, researching topics like online hate speech and government censorship.
But now there’s a new power user.
“We’re hearing from researchers that without Common Crawl, LLMs wouldn’t exist,” Skrenta said.
LLM stands for large language model, which is essentially the algorithm behind AI products like ChatGPT.
LLMs need to ingest vast amounts of text to learn the rhythm and structure of language, so they can write a convincing term paper or compelling, human-sounding wedding vows.
OpenAI, Google, and Meta all used versions of Common Crawl in their early AI research.
Unless your 2009 “Glee” fan fiction blog was behind a paywall or had code telling Common Crawl to look away, it is very likely included in Common Crawl, though there’s no easy way to find out.
After ChatGPT came out, Skrenta says, the number of websites blocking Common Crawl from archiving their content doubled. And requests for deletion from the existing archives have increased significantly.
Skrenta argues that by publishing something on the internet without explicitly telling robots to avoid it, you are consenting to its use by AI.
“You intentionally posted your information on the Internet for people to come see. And robots are humans too,” Skrenta said.
Common Crawl isn’t the only text used to train AI. Luca Soldaini, a researcher at the nonprofit Allen Institute for AI, says we used to know a lot more about what data technology companies train on.
But that was before OpenAI was valued at $100 billion.
“It’s not in their interest, from a competitive advantage standpoint or from a legal standpoint, to tell us what’s in there,” Soldaini said.
Most major AI companies allow web publishers to opt out of future AI training data. But Soldaini says it would be incredibly costly and time-consuming if companies were forced to retrain their current AI models without any of the material that users want removed.
And without learning from copyrighted works, AI may just stink.
Tech companies claim that using copyrighted material to train AI is legally “fair use.” They argue that AI systems should be able to read and learn from the internet, just like humans.
But beyond the legal arguments, there are also ethical ones.
“The creators among us have all grown up fully understanding and accepting that when we create and put it out into the world, people learn from it,” said Ed Newton-Rex, founder of the nonprofit startup Fairly Trained. “We didn’t get into this game expecting big companies to collect it, train on it, and create these scalable systems. None of this is part of the social contract.”
Fairly Trained certifies AI systems that use only training data licensed or approved by its human creators.
Newton-Rex hopes this certification will let consumers decide which AI systems reflect their values, similar to a Fair Trade sticker, but for robots.
“I don’t think people realize that when they use something like ChatGPT, they’re using a model trained in this way. In some cases, it was trained on people’s work without their knowledge and without any compensation,” Newton-Rex said.