Despite their impressive capabilities, large language models are far from perfect. These artificial intelligence models sometimes "hallucinate" by generating incorrect or unsupported information in response to a query.
Because of this hallucination problem, an LLM's responses are often verified by human fact-checkers, especially when the model is deployed in high-stakes settings such as health care or finance. However, the validation process typically requires reading through the lengthy documents the model cites, a task so tedious and error-prone that it may prevent some users from deploying generative AI models in the first place.
To assist human verifiers, MIT researchers created a user-friendly system that enables people to verify an LLM's responses much more quickly. With this tool, called SymGen, an LLM generates responses that include citations pointing directly to a location in a source document, such as a specific cell in a database.
When a user hovers over a highlighted portion of the text response, they can see the data the model used to generate that particular word or phrase. At the same time, the unhighlighted portions show the user which phrases still need to be checked and verified.
"We're enabling people to selectively focus on the parts of the text they need to be more concerned about. Ultimately, SymGen gives people more confidence in a model's responses, because they can easily dig deeper to confirm that the information has been verified," says Shannon Shen, a graduate student in electrical engineering and computer science and co-lead author of a paper on SymGen.
Through a user study, Shen and his collaborators found that SymGen sped up verification time by about 20 percent compared with manual procedures. By making it faster and easier for humans to validate model outputs, SymGen could help people catch errors in LLMs deployed in a variety of real-world settings, from generating clinical notes to summarizing financial market reports.
Shen is joined on the paper by co-lead author and fellow EECS graduate student Lucas Torroba Hennigen; EECS graduate student Aniruddha "Ani" Nrusimha; Bernhard Gapp, president of the Good Data Initiative; and senior authors David Sontag, a professor of EECS, a member of the MIT Jameel Clinic, and leader of the Clinical Machine Learning Group at the Computer Science and Artificial Intelligence Laboratory (CSAIL), and Yoon Kim, an assistant professor of EECS and a member of CSAIL. The research was recently presented at the Conference on Language Modeling.
Symbolic references
To aid validation, many LLMs are designed to generate citations that point to external documents along with their language-based responses, so users can check them. But these verification systems are typically designed as an afterthought, without considering the effort it takes people to sift through large numbers of citations, Shen says.
"Generative AI is intended to reduce the time it takes users to complete a task. If you need to spend hours and hours reading through all these documents to verify that the model is saying something reasonable, then having the generations is less helpful in practice," says Shen.
The researchers approached the validation problem from the perspective of the humans doing the work.
To use SymGen, a user first provides the LLM with data it can refer to in its response, such as a table containing statistics from a basketball game. Then, rather than immediately asking the model to complete a task, like generating a game summary from those data, the researchers perform an intermediate step: they prompt the model to generate its response in a symbolic form.
With this prompt, every time the model wants to cite words in its response, it must write the specific cell from the data table that contains the information it is referencing. For instance, if the model wants to cite the phrase "Portland Trailblazers" in its response, it replaces that text with the name of the cell in the data table that contains those words.
"Because we have this intermediate step where we have the text in symbolic form, we can have very fine-grained references. For every span of text in the output, we can say that this is the exact corresponding location in the data," says Torroba Hennigen.
SymGen then resolves each reference using a rule-based tool that copies the corresponding text from the data table into the model's response.
“This way you know it’s a verbatim copy and there are no errors in the parts of the text that correspond to the actual data variables,” Shen adds.
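As a rough illustration, the following Python sketch mimics that two-step idea on a toy basketball table; the table contents, the cell-naming scheme, and the function names are hypothetical stand-ins, not SymGen's actual code.

```python
import re

# Hypothetical example, not the actual SymGen implementation.
# Source data: a tiny box-score table, keyed by row ID and column name.
table = {
    "r1": {"team": "Portland Trailblazers", "points": 118, "rebounds": 49},
    "r2": {"team": "Utah Jazz", "points": 103, "rebounds": 42},
}

# A symbolic response of the kind the LLM is prompted to produce:
# instead of writing values directly, it cites the cells that contain them.
symbolic_response = (
    "The {r1.team} beat the {r2.team} {r1.points}-{r2.points}, "
    "pulling down {r1.rebounds} rebounds."
)

def resolve(symbolic_text: str, data: dict) -> str:
    """Rule-based resolution: copy each cited cell verbatim into the text."""
    def substitute(match: re.Match) -> str:
        row, col = match.group(1), match.group(2)
        return str(data[row][col])  # verbatim copy from the source table
    return re.sub(r"\{(\w+)\.(\w+)\}", substitute, symbolic_text)

print(resolve(symbolic_response, table))
# The Portland Trailblazers beat the Utah Jazz 118-103, pulling down 49 rebounds.
```

Because every resolved span maps back to a named cell, an interface can highlight that span and show the underlying data when the user hovers over it.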
Streamlining validation
The model is able to produce symbolic responses because of how it is trained. Large language models are fed reams of data from the internet, and some of that data is recorded in a "placeholder format" where codes stand in for actual values.
A similar structure is used when SymGen asks a model to generate a symbolic response.
"We design the prompt in a special way to take advantage of the LLM's capabilities," adds Shen.
In a user study, the majority of participants said SymGen made it easier to verify LLM-generated text. They could validate the model's responses about 20 percent faster than with standard methods.
However, SymGen is limited by the quality of the source data. The LLM could cite an incorrect variable, and a human verifier would be none the wiser.
In addition, the user must have source data in a structured format, such as a table, to feed into SymGen. Right now, the system works only with tabular data.
In the future, the researchers plan to enhance SymGen so it can handle arbitrary text and other data formats. With that capability, it could help verify portions of AI-generated legal document summaries, for instance. They also plan to test SymGen with physicians to study how it could identify errors in AI-generated clinical summaries.
Funding for this research was provided in part by Liberty Mutual and the MIT Quest for Intelligence Initiative.