Reddit says it has earned $203 million thus far from licensing its data

Reddit’s IPO prospects have a lot more to do with its relationships with AI vendors like OpenAI than one might expect.

In its IPO prospectus filed today with the U.S. Securities and Exchange Commission, Reddit repeatedly emphasized how much it expects to make, and has already made, from data licensing agreements with companies that train AI models on its more than 1 billion posts and more than 16 billion comments.

“In January 2024, we entered into certain data licensing arrangements with an aggregate contract value of $203.0 million and terms ranging from two to three years,” the prospectus states. “We expect a minimum of $66.4 million of revenue to be recognized during the year ending December 31, 2024 and the remaining thereafter.”

For now, it’s a mystery which AI vendors have licensed data from Reddit so far. Earlier this week, Bloomberg and Reuters reported that a “large, unnamed AI company” (possibly Google) had entered into a licensing agreement worth roughly $60 million on an annualized basis. But OpenAI would not be a surprising customer either, especially considering that OpenAI CEO Sam Altman holds an 8.7% stake in Reddit (making him the third-largest shareholder) and was once a member of the company’s board of directors.

Why is Reddit data valuable? As Reddit explains, AI models “learn” from examples to create essays, code, emails, articles, and more, and vendors like OpenAI scour the web for millions to billions of those examples to add to their training sets. Some examples are in the public domain. Others aren’t or, in the case of Reddit content, are subject to restrictive licenses that require attribution or some form of compensation.

Historically, Reddit did not restrict access to its data for AI training purposes. But last year the company changed course, arguing that its data shouldn’t be, in the words of CEO Steve Huffman, “provided free of charge to some of the largest companies in the world.”

“[Our] data APIs are capable of providing real-time access to evolving and dynamic topics such as sports, movies, news, fashion, and the latest trends,” the prospectus continues. “We believe that Reddit’s massive corpus of conversational data and knowledge will continue to play a role in training and improving large language models. As our content refreshes and grows daily, we expect models will want to reflect these new ideas and update their training using Reddit data.”

Content producers, from media libraries to news publishers, are increasingly turning to data licensing agreements with AI vendors as chatbots like OpenAI’s ChatGPT and Google’s Gemini threaten to disrupt search traffic. A recent model from The Atlantic found that if a search engine like Google integrated AI into search, it would answer a user’s query 75% of the time without requiring a click through to the publisher’s website.

Vendors, in turn, have been encouraged to enter into licensing agreements as they face a barrage of lawsuits alleging that they have no legal justification for training their models on data without permission or payment. Most recently, The New York Times accused OpenAI of using its works to effectively create competitors to news publishers, harming its business in the process.

OpenAI, for one, has agreements with the stock media library Shutterstock and with publishers such as Axel Springer, the owner of Politico and Business Insider. The licenses are reportedly likely to be quite small, however, with a maximum value of $5 million per year.

This article was originally published at techcrunch.com