Researchers behind some of the most advanced artificial intelligence (AI) on the planet have warned that the systems they helped to create could pose a risk to humanity.
The researchers, who work at companies including Google DeepMind, OpenAI, Meta, Anthropic and others, argue that a lack of oversight on AI’s reasoning and decision-making processes could mean we miss signs of malign behavior.
In the new study, published July 15 to the arXiv preprint server (which hasn’t been peer-reviewed), the researchers highlight chains of thought (CoT) — the steps large language models (LLMs) take while working out complex problems. AI models use CoTs to break down advanced queries into intermediate, logical steps that are expressed in natural language.
The study’s authors argue that monitoring each step in the process could be a crucial layer for establishing and maintaining AI safety.
Monitoring this CoT process can help researchers to understand how LLMs make decisions and, more importantly, why they become misaligned with humanity’s interests. It also helps determine why they give outputs based on data that’s false or doesn’t exist, or why they mislead us.
However, there are several limitations when monitoring this reasoning process, meaning such behavior could potentially pass through the cracks.
Related: AI can now replicate itself — a milestone that has experts terrified
“AI systems that ‘think’ in human language offer a unique opportunity for AI safety,” the scientists wrote in the study. “We can monitor their chains of thought for the intent to misbehave. Like all other known AI oversight methods, CoT monitoring is imperfect and allows some misbehavior to go unnoticed.”
The scientists warned that reasoning doesn’t always occur, so it cannot always be monitored, and some reasoning occurs without human operators even knowing about it. There might also be reasoning that human operators don’t understand.
Keeping a watchful eye on AI systems
One of the problems is that conventional non-reasoning models like K-Means or DBSCAN — use sophisticated pattern-matching generated from massive datasets, so they don’t rely on CoTs at all. Newer reasoning models like Google’s Gemini or ChatGPT, meanwhile, are capable of breaking down problems into intermediate steps to generate solutions — but don’t always need to do this to get an answer. There’s also no guarantee that the models will make CoTs visible to human users even if they take these steps, the researchers noted.
“The externalized reasoning property does not guarantee monitorability — it states only that some reasoning appears in the chain of thought, but there may be other relevant reasoning that does not,” the scientists said. “It is thus possible that even for hard tasks, the chain of thought only contains benign-looking reasoning while the incriminating reasoning is hidden.”A further issue is that CoTs may not even be comprehensible by humans, the scientists said. “
New, more powerful LLMs may evolve to the point where CoTs aren’t as necessary. Future models may also be able to detect that their CoT is being supervised, and conceal bad behavior.
To avoid this, the authors suggested various measures to implement and strengthen CoT monitoring and improve AI transparency. These include using other models to evaluate an LLMs’s CoT processes and even act in an adversarial role against a model trying to conceal misaligned behavior. What the authors don’t specify in the paper is how they would ensure the monitoring models would avoid also becoming misaligned.
They also suggested that AI developers continue to refine and standardize CoT monitoring methods, include monitoring results and initiatives in LLMs system cards (essentially a model’s manual) and consider the effect of new training methods on monitorability.
“CoT monitoring presents a valuable addition to safety measures for frontier AI, offering a rare glimpse into how AI agents make decisions,” the scientists said in the study. “Yet, there is no guarantee that the current degree of visibility will persist. We encourage the research community and frontier AI developers to make best use of CoT monitorability and study how it can be preserved.”