An artificial intelligence (AI)-powered idea generator has come up with more original research ideas than 49 scientists working independently, according to a preprint posted on arXiv this month.
The human- and AI-generated ideas were evaluated by reviewers who were not told who or what created each idea. Reviewers rated the AI-generated concepts as more appealing than their human-written counterparts, but rated the AI proposals slightly lower in terms of feasibility.
But scientists point out that the study, which has not been peer-reviewed, has limitations: It focused on a single research area and required human participants to come up with ideas on demand, perhaps hindering their ability to produce their best concepts.
AI in Science
There has been a surge in efforts to use large language models (LLMs) to automate research tasks such as paper writing, code generation, and literature searches. But assessing whether these AI tools can generate novel research ideas at the level of humans has been difficult, because evaluating ideas is highly subjective and requires gathering researchers with the expertise to assess them carefully, says study co-author Chenglei Si. “The best way to contextualize these capabilities is through direct comparisons,” says Si, a computer scientist at Stanford University in California.
Tom Hope, a computer scientist at the Allen Institute for Artificial Intelligence who is based in Jerusalem, says the year-long project is one of the largest efforts to evaluate whether LLMs, the technology that underpins tools such as ChatGPT, can generate innovative research ideas. “We need to do more of this kind of work,” he says.
The research team recruited more than 100 researchers in natural language processing, a branch of computer science focused on communication between AI and humans. Of these, 49 participants were tasked with coming up with and writing down an idea on one of seven topics within 10 days. As an incentive, the researchers paid $300 per idea and offered a $1,000 bonus for each of the five top-scoring ideas.
Meanwhile, the researchers built an idea generator using Claude 3.5, an LLM developed by Anthropic of San Francisco, California. They instructed the tool to find papers related to the seven research topics using Semantic Scholar, an AI-powered literature search engine. Then, on the basis of these papers, they instructed it to generate 4,000 ideas on each research topic and to rank the most original ones.
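In outline, that pipeline retrieves related papers and then prompts the model for ideas at scale. The sketch below illustrates the approach under stated assumptions: the helper functions, prompt wording and model identifier are illustrative, not the preprint’s actual code.

```python
# A minimal sketch of a retrieve-then-generate idea pipeline.
# Assumptions: prompt wording, helper names and the model ID are illustrative;
# the preprint's real prompts, ranking step and scale (~4,000 ideas per topic) differ.
import requests
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment


def fetch_related_papers(topic: str, limit: int = 20) -> list[str]:
    """Query Semantic Scholar's public search API for paper titles on a topic."""
    resp = requests.get(
        "https://api.semanticscholar.org/graph/v1/paper/search",
        params={"query": topic, "limit": limit, "fields": "title"},
        timeout=30,
    )
    resp.raise_for_status()
    return [paper["title"] for paper in resp.json().get("data", [])]


def generate_idea(topic: str, papers: list[str]) -> str:
    """Ask the LLM for one research idea grounded in the retrieved papers."""
    prompt = (
        f"Here are recent papers on {topic}:\n"
        + "\n".join(f"- {title}" for title in papers)
        + "\n\nPropose one novel research idea on this topic."
    )
    message = client.messages.create(
        model="claude-3-5-sonnet-20240620",
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}],
    )
    return message.content[0].text


topic = "factuality of language models"  # illustrative topic, not one of the study's seven
papers = fetch_related_papers(topic)
ideas = [generate_idea(topic, papers) for _ in range(5)]  # the study scaled this far higher
```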
Human reviewers
The researchers then randomly assigned the human- and AI-generated ideas to 79 reviewers, who scored each idea on its novelty, interestingness, feasibility, and expected impact. To ensure that the reviewers were blind to each idea’s creator, the researchers used a separate LLM to edit both sets of text, standardizing style and tone without changing the ideas themselves.
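That blinding step can be pictured as one extra LLM pass over every submission. A minimal sketch follows, assuming a prompt of my own wording rather than the authors’ actual instruction:

```python
# Sketch of style normalization for blind review: rewrite each idea into one
# neutral style. The prompt and model ID are assumptions for illustration.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment


def normalize_style(idea_text: str) -> str:
    """Rewrite an idea in a uniform style so reviewers cannot tell who wrote it."""
    message = client.messages.create(
        model="claude-3-5-sonnet-20240620",
        max_tokens=2048,
        messages=[{
            "role": "user",
            "content": (
                "Rewrite the following research idea in a neutral academic style. "
                "Preserve every claim and detail; change only the wording, tone, "
                "and formatting:\n\n" + idea_text
            ),
        }],
    )
    return message.content[0].text
```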
On average, reviewers rated the AI-generated ideas as more original and inspiring than those written by the human participants. But when the team pored over the 4,000 ideas generated by the LLM, they found that only around 200 were truly unique, and that originality waned as the AI churned out ideas.
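Counting how many of a pool of generated ideas are truly unique typically means embedding them and discarding near-duplicates. The sketch below uses the open-source sentence-transformers library; the embedding model and the 0.8 similarity threshold are assumptions, not necessarily the preprint’s settings.

```python
# Sketch of near-duplicate filtering with sentence embeddings.
# Assumptions: the model choice and the 0.8 cosine-similarity cutoff are illustrative.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")


def deduplicate(ideas: list[str], threshold: float = 0.8) -> list[str]:
    """Keep an idea only if it is not too similar to any already-kept idea."""
    embeddings = model.encode(ideas, convert_to_tensor=True)
    kept, kept_embeddings = [], []
    for idea, embedding in zip(ideas, embeddings):
        if all(util.cos_sim(embedding, e).item() < threshold for e in kept_embeddings):
            kept.append(idea)
            kept_embeddings.append(embedding)
    return kept
```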
When Si surveyed the participants, most admitted that the ideas they had submitted were only about average compared with ideas they had come up with in the past.
The results suggest that LLMs may be able to generate ideas that are somewhat more original than those in the existing literature, says Cong Lu, a machine-learning researcher at the University of British Columbia in Vancouver, Canada. But whether LLMs can beat the most innovative ideas from humans remains an open question.
Another limitation is that the study compared the ideas only after an LLM had edited them, changing the language and length of the submissions, says computational social scientist Jevin West of the University of Washington in Seattle. Such changes could have subtly influenced how reviewers perceived novelty, he says. And pitting academics against LLMs, which can generate thousands of ideas in a few hours, may not be a completely fair comparison, West adds. “You have to compare apples to apples,” he says.
Si and his colleagues plan to compare the AI-generated ideas with leading conference papers to better understand how LLMs stack up against human creativity. “We’re trying to encourage the community to think more seriously about what the future looks like when AI is able to play a more active role in the research process,” he says.