Elon Musk said artificial intelligence companies have run out of data to train their models and have “exhausted” the totality of human knowledge.
The world’s richest person suggested that technology companies will have to turn to “synthetic” data — material created by AI models themselves — to build and fine-tune new systems, a shift that is already under way in the fast-developing field.
“The human knowledge pool is being used up in AI training. That’s basically what happened last year,” said Musk, who launched his own AI business, xAI, in 2023.
AI models such as GPT-4o, which powers the ChatGPT chatbot, are “trained” on vast amounts of data taken from the internet, where they in effect learn to spot patterns in that information — allowing them to predict, for example, the next word in a sentence.
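The next-word prediction described above can be illustrated with a toy bigram model — a deliberate simplification, since systems like GPT-4o use neural networks trained on vastly more data, but the underlying idea of predicting from observed patterns is the same:

```python
from collections import Counter, defaultdict

# Toy next-word predictor: count which word most often follows each word
# in a tiny corpus, then predict by frequency. Real large language models
# use neural networks over internet-scale data; this only sketches the idea.
corpus = "the cat sat on the mat the cat ate the fish".split()

following = defaultdict(Counter)
for current, nxt in zip(corpus, corpus[1:]):
    following[current][nxt] += 1

def predict_next(word):
    """Return the word most frequently seen after `word` in the corpus."""
    return following[word].most_common(1)[0][0]

predict_next("the")  # "cat" follows "the" twice; "mat" and "fish" once each
```

Running out of fresh text to count, in this analogy, means the frequency tables stop improving — which is the exhaustion Musk describes.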
In an interview livestreamed on his social media platform X, Musk said the “only way” to counter the lack of source material for training new models was to move to synthetic data created by AI.
Regarding the exhaustion of data, he said: “The only way to compensate for that is by using synthetic data … writing an essay, coming up with a paper, and then self-grading … going through this process of self-learning.”
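The generate-then-self-grade loop Musk describes can be sketched as follows. This is a hypothetical toy, not any company’s actual pipeline: `generate_essay` and `self_grade` are stand-ins for sampling from a real model and having it score its own output, and “fine-tuning” is mocked as a small skill bump.

```python
import random

random.seed(0)  # deterministic toy run

def generate_essay(model):
    """Hypothetical stand-in: a real system would sample text from the model."""
    return {"text": "essay", "quality": random.random() + model["skill"]}

def self_grade(essay):
    """Hypothetical stand-in: the model scores its own output."""
    return essay["quality"]

def self_learning_round(model, n_samples=100, threshold=0.8):
    """Generate synthetic essays, self-grade them, keep the best, 'fine-tune'."""
    keep = [e for e in (generate_essay(model) for _ in range(n_samples))
            if self_grade(e) >= threshold]
    # Fine-tuning on the accepted samples is mocked as a small skill increase.
    model["skill"] += 0.01 * len(keep) / n_samples
    return model, len(keep)

model = {"skill": 0.2}
model, accepted = self_learning_round(model)
```

The catch, as the article goes on to note, is that the grader and the generator are the same system — which is where hallucinations become a problem.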
Meta, the owner of Facebook and Instagram, used synthetic data to fine-tune its largest Llama AI model, while Microsoft also used AI-generated content in its Phi-4 model. Google and OpenAI, the company behind ChatGPT, also use synthetic data in their AI work.
But Musk also warned that AI models’ habit of producing “hallucinations” (a term for inaccurate or nonsensical output) poses a danger to the synthetic-data process.
In the interview, livestreamed with Mark Penn, chairman of the advertising group Stagwell, he said hallucinations made the process of using artificial material “difficult”, because “how do we know if it … is a hallucination or a real answer?”
Andrew Duncan, director of fundamental AI at Britain’s Alan Turing Institute, said Musk’s comments chimed with recent academic papers suggesting that publicly available data for AI models could run out as early as 2026. He added that over-reliance on synthetic data risked “model collapse”, a term for a model whose output declines in quality.
“When you start feeding synthetics to your models, you start to see diminishing returns,” he said, with the risk of output that is biased and lacking in creativity.
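The collapse Duncan warns about can be illustrated with a toy statistical simulation — an assumption-laden sketch, not a claim about how production models behave. Here a Gaussian “model” is repeatedly refitted to samples drawn from its own previous fit; estimation noise compounds across generations and the fitted spread tends to shrink toward zero, a loss of diversity analogous to the diminishing returns described above.

```python
import random
import statistics

random.seed(1)  # deterministic toy run

def refit(mean, stdev, n=50):
    """Sample n points from the current 'model' and fit a new one to them."""
    samples = [random.gauss(mean, stdev) for _ in range(n)]
    return statistics.fmean(samples), statistics.pstdev(samples)

mean, stdev = 0.0, 1.0
spread_history = [stdev]
for _ in range(500):  # each iteration is one "generation" trained on the last
    mean, stdev = refit(mean, stdev)
    spread_history.append(stdev)

# After many generations the fitted spread has typically collapsed far below
# its starting value: the "model" has lost most of the original diversity.
```

The shrinking spread plays the role of the reduced variety and creativity in model output that the quote describes.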
Duncan added that the growing volume of AI-generated content online also raises the prospect of that material being absorbed into AI training datasets.
High-quality data, and control over it, is one of the legal battlegrounds of the AI boom. OpenAI admitted last year that it would be impossible to create tools such as ChatGPT without access to copyrighted material, while creative industries and publishers are demanding compensation for the use of their works in the model-training process.