Last week, I wrote about Mark Zuckerberg’s comments about Meta’s AI strategy, which includes one special advantage: a massive, ever-growing internal dataset training its Llama models.
Zuckerberg boasted that on Facebook and Instagram there are “hundreds of billions of publicly shared images and tens of billions of public videos, which we estimate is greater than the Common Crawl dataset and people share large numbers of public text posts in comments across our services as well.”
But it turns out that the training data required for Meta, OpenAI or Anthropic AI models — a topic I have returned to many times over the past year — is just the beginning of understanding how data functions as the diet that sustains today’s large language models.
When it comes to AI’s growing appetite for data, it is the ongoing inference required by every large company using LLM APIs — that is, actually deploying LLMs for various use cases — that is turning AI models into the insatiable equivalent of the classic Hasbro Hungry Hungry Hippos game, frantically gobbling up data marbles in order to keep going.
Highly specific datasets are often needed for AI inference
“[Inference is] the bigger market, I don’t think people realize that,” said Brad Schneider, founder and CEO of Nomad Data, which he describes as a ‘search engine for data.’
The New York City company, founded in 2020, has built its own LLMs to help match more than 2,500 data vendors with data buyers — including an ‘exploding’ number of companies that need often obscure, highly specific datasets for their own LLM inference use cases.
Rather than serving as a data broker, Nomad offers data discovery — so companies can, in natural language, search for specific types of data. For example, “I need a data feed of every roof undergoing construction in the US every month.”
A data seeker might have no idea what such a data set would be called, Schneider explained in a recent interview. “Our LLMs and NLP compare it against a whole database of vendors and then we ask the vendor, do you do this? And the vendor might say yes, we have roofing permits. We have roofing providers and materials sales by month.”
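Nomad has not published how its matching works, but the step Schneider describes — comparing a buyer's natural-language request against a database of vendor descriptions — can be sketched as semantic similarity search: embed both sides, then rank vendors by similarity. Below is a minimal illustration using a toy bag-of-words embedding; the vendor catalog and query are invented, and a production system would use a learned sentence-embedding model rather than word counts.

```python
from collections import Counter
import math

def embed(text):
    # Toy embedding: lowercase bag-of-words term counts.
    # A real system would use a learned sentence-embedding model.
    return Counter(text.lower().split())

def cosine(a, b):
    # Cosine similarity between two sparse count vectors.
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Hypothetical vendor catalog entries, invented for illustration.
vendors = {
    "roofing permits": "monthly roofing permits and roof construction filings in the US",
    "insurance claims": "car accident claims with damage types and damage volumes",
    "retail receipts": "scanned consumer receipts from retail stores in Japan",
}

def match(query, catalog):
    # Rank vendor descriptions by similarity to the buyer's request.
    q = embed(query)
    return sorted(catalog, key=lambda name: cosine(q, embed(catalog[name])), reverse=True)

ranked = match("every roof undergoing construction in the US every month", vendors)
print(ranked[0])  # the roofing-permits vendor surfaces first
```

The point of the design is that the buyer never needs to know the dataset's industry name ("roofing permits"); overlap in meaning between the request and the description does the work.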
As more data comes to market, Nomad can match it to that demand. Take an insurance company that started selling its data on the Nomad platform: The same day it listed, Schneider recalled, “somebody did a search for very specific information on car accidents, and types of damage and volumes of damage — and they didn’t know it was even called insurance data.”
The demand and the supply got matched instantaneously, he explained. “That’s sort of the magic.”
Finding the right AI data ‘food’
Certainly, training data is important, but Schneider pointed out that even if you have the perfect data to train the model, it is trained once — or if there is new data over time, perhaps it is re-trained occasionally. Inference, however — that is, every time you run live data through a trained AI model to make a prediction or solve a task — can happen thousands of times every minute. And for the large companies looking to take advantage of generative AI, that constant data feeding is just as important, depending on the use case.
“You need to feed something to it for it to do something interesting,” he explained.
The problem, however, has always been to find just the right data “food.” For the typical large enterprise company, starting with internal data will be a key use case, Schneider said. But in the past, adding in the most “nutritious” external text data was close to impossible.
“You either couldn’t do anything with it or you had to hire armies of people to do stuff with it,” he explained. Data might have been sitting in millions or even trillions of PDFs, for example, with no cost-effective way to pull it out and make it useful. But now, LLMs can infer things based on millions of consumer records, company records, or government filings in seconds.
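The pattern Schneider describes — unstructured document text in, structured records out — is straightforward to sketch. In the illustration below, the `llm()` function is a stand-in for any hosted LLM API and is stubbed with a canned response so the example runs on its own; the field names and filing text are invented.

```python
import json

def llm(prompt: str) -> str:
    # Stand-in for a real model call (e.g. a hosted chat-completion API).
    # Stubbed here with a canned JSON response so the sketch is self-contained.
    return json.dumps({"permit_type": "roofing", "month": "2024-01", "state": "TX"})

def extract_record(filing_text: str) -> dict:
    # Ask the model to turn free-form filing text into a structured record.
    prompt = (
        "Return JSON with keys permit_type, month, state "
        f"for this filing:\n{filing_text}"
    )
    return json.loads(llm(prompt))

record = extract_record("Re-roofing permit issued January 2024, Travis County, Texas")
print(record["permit_type"])
```

Run across millions of filings, a loop like this is what turns "buried treasure" documents into a sellable dataset — work that previously required, in Schneider's phrase, armies of people.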
“That creates a hunger for all this textual data, think of it as sort of buried treasure,” he said. “All of that data existed before, that was deemed worthless, is now actually very useful” — and valuable.
Another important use case for data, he added, is customized training of LLMs. “For example, if I’m building my model to recognize Japanese receipts, I need to buy a data set of Japanese receipts,” Schneider explained. “If I’m trying to create a model that recognizes advertisements on a picture of a football field, I need videos of a football field — so we’re seeing a lot of that happening.”
We’ve all read about large media companies negotiating to license their data to OpenAI and other LLM companies. In December, OpenAI announced a partnership with Axel Springer — which owns Politico and Business Insider — and famously failed to reach a deal with the New York Times, which followed up by filing a lawsuit right before New Year’s.
But Schneider says that Nomad Data is also signing up media companies and other corporations as data vendors. “We’ve got two media outlets that are licensing the total corpus of their articles for people to train LLMs,” he said. “We’re basically calling every single large media company, figuring out who the right person is, making sure that we know about the data they have.”
And it’s not just the media industry, he added: “In the last couple of weeks, we have five corporations that have put data on the platform, including automotive manufacturers selling everything about the way people use cars — braking, speed, location, temperature, usage patterns — and we’ve got insurers selling very interesting claims data.”
The hunger games of LLM data
The bottom line is that the LLM hunger supply chain is basically a never-ending circle. Schneider explained that Nomad Data uses LLMs to find new data vendors. Once those vendors are on board, the company uses LLMs to help people find the data that they are looking for — and they, in turn, buy data to use with their own LLM APIs for training and inference.
“I can’t tell you how important LLMs are to make our business work,” said Schneider. “We have all this textual data, and every day people are giving us more and more. So we need to learn about these different data sets — and how to use them at all is being driven by all of us.”
AI training data, he reiterated, is an “immeasurably small piece of this market.” The most exciting part, he emphasized, is LLM inference, as well as customized training.
“Now I am going to buy data that I had no value for before, that’s going to be instrumental in building my business,” he said, “because this new technology allows me to use it.”
VentureBeat’s mission is to be a digital town square for technical decision-makers to gain knowledge about transformative enterprise technology and transact.