Reddit will reportedly allow an unnamed artificial intelligence company to train its models using user-generated content from the site’s message boards.
Ahead of its initial public offering, the California-based company told prospective investors it had signed a deal worth around $60m (£48m) a year with an “unknown large AI company”, according to Bloomberg, an agreement that “could serve as a model for future agreements of a similar nature”.
Apple, meanwhile, has already “started discussions” with several major news and publishing organizations about using their material to develop its generative AI systems, according to The New York Times. Anonymous sources told the newspaper that the company had “floated a multi-year deal worth at least $50 million” (£40 million) to license the publishers’ news article archives.
But the rapidly accelerating world of AI training is fraught with controversy, from debates over copyright to concerns about ethical violations and reproducing human bias.
How does AI learn?
According to Business Insider, tech companies are said to train their AI models (most famously ChatGPT) on “large amounts of data and text collected from the internet”, including copyrighted content.
OpenAI, the creator of ChatGPT, designed it to search for patterns in databases of text; in all, it analyzed 300 billion words and 570 gigabytes of data, according to BBC Science Focus. Image generators such as DALL-E are fed billions of image-text pairs in a similar way; the open LAION-5B dataset, used to train comparable models, contains approximately 6 billion of them.
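To make “searching for patterns” concrete, here is a deliberately tiny Python sketch. It is not how ChatGPT works internally (that involves neural networks trained on vastly more data), but it shows the same basic idea: count which words tend to follow which, then generate new text by sampling those patterns. The corpus and seed are invented for illustration.

```python
import random
from collections import defaultdict

# A miniature stand-in for "searching for patterns in text": record which
# word follows which in the training data, then generate by sampling those
# observed continuations. Real models learn far richer patterns, but the
# principle -- predict the next token from preceding context -- is the same.
corpus = "the cat sat on the mat and the cat slept".split()

follows = defaultdict(list)
for prev, nxt in zip(corpus, corpus[1:]):
    follows[prev].append(nxt)

random.seed(1)
word = "the"
output = [word]
for _ in range(6):
    # Fall back to a random corpus word if this word was never followed by anything.
    word = random.choice(follows.get(word) or corpus)
    output.append(word)
print(" ".join(output))
```

Scaled up by many orders of magnitude, and with statistical counts replaced by learned neural representations, this is why training data matters so much: the model can only reproduce patterns it has seen.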
What are the issues with training AI models?
Both OpenAI and Google have been accused of training AI models on content without paying for licensing agreements or obtaining permission from its creators. The New York Times has even sued OpenAI and Microsoft for copyright infringement over the use of its articles.
OpenAI released its video generator Sora last week, further fueling the copyright controversy. Sora can create incredibly lifelike videos from simple text prompts, Mashable said, but OpenAI “shares almost nothing about training data.” Speculation soon began that Sora had been trained using copyrighted material.
Despite the vast amounts of data involved, these models still require human intervention. In a process known as reinforcement learning from human feedback, human operators rate the accuracy and appropriateness of the model’s output. “With each click, a largely unregulated army of humans transforms raw data into raw material for AI,” said The Washington Post.
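As a rough illustration of how those human clicks become training signal, the Python sketch below (the names and the 1-to-5 scale are invented, not any company’s actual pipeline) turns per-response ratings into “this answer beats that one” pairs, the format typically used to train a reward model in reinforcement learning from human feedback.

```python
from dataclasses import dataclass

@dataclass
class Rating:
    prompt: str
    response: str
    score: int  # 1 (bad) .. 5 (good), assigned by a human rater

def to_preference_pairs(ratings):
    # Group ratings by prompt, then emit (better, worse) response pairs:
    # the usual machine-readable form of "the human preferred this one".
    by_prompt = {}
    for r in ratings:
        by_prompt.setdefault(r.prompt, []).append(r)
    pairs = []
    for group in by_prompt.values():
        for a in group:
            for b in group:
                if a.score > b.score:
                    pairs.append((a.prompt, a.response, b.response))
    return pairs

ratings = [
    Rating("capital of France?", "Paris.", 5),
    Rating("capital of France?", "London, probably.", 1),
]
print(to_preference_pairs(ratings))
```

Every pair in the output exists only because a person sat and scored the responses, which is why the process is so labor-intensive.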
Not only is hiring people to babysit an AI model expensive and time-consuming, but the process is also subjective: it depends on human judgements of what counts as accurate and appropriate.
Reinforcement learning can also lead to the exploitation of workers, the paper said. In the Philippines, former employees have accused the San Francisco startup Scale AI of paying workers “very low rates”, or withholding payment altogether, through its outsourced digital work platform Remotasks. Human rights groups say the company is “one of many U.S. AI companies that fail to adhere to basic labor standards for their overseas employees”, the Post reported.
AI companies have also inadvertently hired children and teenagers to fill these roles, Wired reported, because the tasks are “often outsourced to gig workers through online crowdsourcing platforms”.
Ethical concerns aside, AI model creators are also worried about future supply. The internet may contain vast amounts of data, but it is not unlimited.
The Atlantic said cutting-edge AI programs “consume most of the available text and images” online, using up the technology’s “most valuable resource”: training data. That shortage has “hindered technology growth and led to iterative updates rather than large-scale paradigm shifts”.
What’s in the pipeline?
OpenAI, Google DeepMind, Microsoft and other big tech companies have recently “published research that uses AI models to improve other AI models, and even themselves”, The Atlantic reported. Technology executives are hailing this approach, known as synthetic data, as the “future of technology”.
However, training an AI model on data generated by another AI model has its flaws: it can reinforce the conclusions the model drew from the original data, and those conclusions may be inaccurate or biased.
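That feedback loop is easy to demonstrate on a toy scale. In the Python sketch below (invented numbers, nothing like a real training pipeline), a simple statistical “model” is repeatedly refitted to data sampled from the previous fitted model. With each synthetic generation the sampling noise compounds, so the estimated distribution tends to drift away from the original and its spread typically narrows, a small-scale analogue of what researchers call model collapse.

```python
import random
import statistics

random.seed(0)

def fit(samples):
    # "Train" a model: estimate the mean and standard deviation of the data.
    return statistics.fmean(samples), statistics.stdev(samples)

def generate(mean, std, n):
    # "Generate" synthetic data by sampling from the fitted model.
    return [random.gauss(mean, std) for _ in range(n)]

# Generation 0: "real" data from the true distribution (mean 0, spread 1).
data = generate(0.0, 1.0, 50)

for gen in range(1, 101):
    mean, std = fit(data)           # train on the previous generation's data
    data = generate(mean, std, 50)  # replace real data with synthetic data
    if gen % 20 == 0:
        print(f"generation {gen:3d}: mean={mean:+.3f} std={std:.3f}")
```

Printing the fitted parameters every 20 generations shows how far they wander from the true mean of 0 and spread of 1, even though each individual generation looks plausible on its own.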
As the AI industry continues to grow exponentially, what happens next is uncertain but likely to be “both exciting and frightening,” The New York Times said.