Apple, NVIDIA, and Anthropic are among the companies that used the dataset, which was created by the nonprofit EleutherAI and contains transcripts of YouTube videos from more than 48,000 channels. The investigation's findings highlight an unsettling reality of artificial intelligence: much of the technology is built on data taken from creators without their knowledge or permission.
The collection contains video transcripts from news organizations such as The New York Times, the BBC, and ABC News, as well as from prominent YouTube creators such as Marques Brownlee and MrBeast. It does not, however, contain any images or video from YouTube. Subtitles from Engadget videos are also part of the collection.
"Apple has sourced data for their AI from numerous organizations," Brownlee said on X. "One of them stole a ton of information and transcripts from YouTube videos, including mine. This is going to be a long-term, growing problem."
A Google representative confirmed to Engadget that Neal Mohan, the CEO of YouTube, has previously stated that using YouTube's data to train AI models would violate the platform's terms of service. Apple, NVIDIA, Anthropic, and EleutherAI did not respond to Engadget's requests for comment.
So far, AI companies have not been transparent about the data they use to train their models. Earlier this month, artists and photographers criticized Apple for not disclosing the source of the training data for Apple Intelligence, the company's take on generative AI, which will be available on millions of Apple devices this year.
YouTube in particular, as the largest video repository on the planet, is a treasure trove of audio, video, and images in addition to transcripts, which makes it an attractive source of data for AI model training. When The Wall Street Journal asked OpenAI's chief technology officer, Mira Murati, earlier this year whether the company had used YouTube videos to train Sora, its upcoming AI video generation tool, she avoided answering. "I am not going to go into the details of the data that was used, but it was publicly available or licensed data," Murati said at the time. Sundar Pichai, the CEO of Alphabet, has also stated that using YouTube's data to train AI models would violate the platform's terms of service.
To find out whether subtitles from your favorite YouTube channels or videos are included in the dataset, use the lookup tool on Proof News.