In the frantic pursuit of AI training data, tech giants OpenAI, Google, and Meta have reportedly bypassed corporate policies, altered their rules, and discussed circumventing copyright law. 

A New York Times investigation reveals the lengths these corporations have gone to harvest online information to feed their data-hungry AI systems.

In late 2021, facing a shortage of reputable English-language text data, OpenAI researchers developed a speech recognition tool called Whisper to transcribe YouTube videos.

Despite internal discussions about potentially violating YouTube’s rules, which prohibit using its videos for “independent” applications, the NYT found that OpenAI ultimately transcribed over a million hours of YouTube content. Greg Brockman, OpenAI’s president, personally assisted in collecting the videos. The transcribed text was then fed into GPT-4.

Google also allegedly transcribed YouTube videos to harvest text for its AI models, potentially infringing on video creators’ copyrights.

This comes days after YouTube’s CEO said such activity would violate the company’s terms of service and undermine creators. 

In June 2023, Google’s legal department requested changes to the company’s privacy policy, allowing it to use publicly available content from Google Docs and other Google apps for a wider range of AI products.

Meta, facing its own data shortage, has considered various options to amass more training data. 

Executives discussed paying for book licensing rights, buying the publishing house Simon & Schuster, and even harvesting copyrighted material from the web without permission, risking potential lawsuits. 

Meta’s lawyers argued that using data to train AI systems should fall under “fair use,” citing a 2015 court decision involving Google’s book scanning project.

Ethical concerns and the future of AI training data

The collective actions of these tech corporations highlight the critical importance of online data in the booming AI industry.

These practices have raised concerns about copyright infringement and the fair compensation of creators. 

Filmmaker and writer Justine Bateman told the Copyright Office that AI models were taking content – including her writing and movies – without permission or payment.

“This is the largest theft in the United States, period,” she said in an interview.

In the visual arts, MidJourney and other image models have been shown to generate copyrighted content, such as scenes from Marvel movies.

With some experts predicting that high-quality online data may well be exhausted by 2026, corporations are exploring alternative methods, such as generating synthetic data using AI models themselves. However, synthetic training data comes with its own risks and challenges and can degrade model quality.

OpenAI CEO Sam Altman himself acknowledged the finite nature of online data in a speech at a tech conference in May 2023: “That will run out,” he said.

Sy Damle, a lawyer representing Andreessen Horowitz, a Silicon Valley venture capital firm, also discussed the challenge: “The only practical way for these tools to exist is if they can be trained on massive amounts of data without having to license that data. The data needed is so massive that even collective licensing really can’t work.”

The NYT and OpenAI are locked in a bitter copyright lawsuit, with the Times seeking damages that could run into the millions.

OpenAI hit back, accusing the Times of ‘hacking’ its models to retrieve examples of copyright infringement.

Undoubtedly, this investigation further paints Big Tech’s data harvesting as ethically and legally questionable.

With lawsuits mounting, the legal landscape surrounding the use of online data for AI training is extremely precarious.

The post Inside Big Tech’s tussle over AI training data appeared first on DailyAI.