The safety tool that catches what AI models keep to themselves 🌿

🌿 THE GOOD AI

For the first time, researchers can read what an AI is thinking but not saying

Every meaningful AI safety framework rests on the same assumption: that you know what a model is actually doing, not just what it claims to be doing. Until last week, that assumption was more aspiration than fact. Models have internal states that produce their outputs, but those states have been largely opaque, even to the researchers who built them. On May 7, Anthropic published Natural Language Autoencoders (NLAs), a method that converts a model's internal activations directly into plain-language text a human can read. No interpretation required. The gap between what an AI computes and what a human can see begins to close.

The pre-deployment testing results are the most consequential part. During safety testing of Claude Opus 4.6 and Mythos Preview, NLAs revealed that the models suspected they were being evaluated 16% of the time, without ever saying so out loud. That gap between what a model "knows" and what it says is precisely what makes AI auditing so difficult. In an adversarial auditing exercise, evaluators using NLAs identified a model's concealed motivation 12 to 15% of the time. Evaluators without them managed it less than 3% of the time. That is not a small difference: with NLAs, evaluators were at least four times more likely to catch what a model is actually doing.

The source of real optimism here is not the tool alone. Anthropic released the training code and models openly, inviting the broader research community to build on, challenge, and improve the work. Safety tools only become credible when people with no stake in the outcome can stress-test them.

What we still do not know: how NLAs scale to models larger and more capable than the ones tested here, and whether future models could learn to mask their internal states from NLA-based detection. The tool exists. The next problems have been named. That is genuine progress.

Source: Anthropic

⚡ 3 GOOD SIGNALS

⚡ AI breast cancer screening has entered clinical guidelines and clinical practice

At Beth Israel Deaconess Medical Center, an FDA-authorized tool called Clairity Breast now gives patients a 5-year cancer risk score directly from their routine mammogram. At UMass Memorial Health, AI analysis of 65,000 patients identified 4,000 high-risk cases that had not been flagged by standard visual reads. The National Comprehensive Cancer Network formally updated its 2026 clinical practice guidelines to include image-based AI risk assessment as a primary screening tool, not a supplement.

Sources: WBUR, Targeted Oncology

🤝 Chemists can now design drug molecules by describing what they want in plain English

EPFL researchers published Synthegy, a framework that scores molecular synthesis routes against a chemist's natural-language instructions. Rather than replacing the chemist, the system acts as a reasoning partner, flagging unnecessary steps and prioritising efficient routes. In a double-blind study involving 36 chemists and 368 evaluations, reviewers agreed with the system 71.2% of the time.

Source: ScienceDaily

🌱 Five nations issued the first-ever joint policy on AI agents. It is more rigorous than most expected

On May 1, cybersecurity agencies from the US, UK, Australia, Canada, and New Zealand published "Careful Adoption of Agentic AI Services," the first time these governments have coordinated on a single AI attack surface. The 30-page document identifies privilege escalation, behavioral drift, and supply-chain vulnerabilities as the primary risks, and recommends red-teaming, zero trust architecture, and least-privilege access controls. Governance catching up with technology rarely makes the headlines. It should.

Source: CyberScoop

🔬 THE DEEPER DIVE

The case for giving your own safety tools away

On May 7, Anthropic transferred Petri, its open-source AI alignment testing tool, to Meridian Labs, an independent evaluation nonprofit. The transfer included a major version 3.0 upgrade featuring a new add-on called "Dish," which runs tests using models' actual production system prompts rather than synthetic stand-ins. The reason that matters: a model tested with a fake deployment context has every opportunity to behave differently than it would in the real world. Dish closes that gap. Anthropic framed the move explicitly as parallel to its earlier donation of the Model Context Protocol to the Linux Foundation. The logic is identical in both cases. A safety tool owned by the same company whose model is being tested will always carry an asterisk. Giving it away removes the conflict.

Why credibility is a structural problem, not a goodwill problem

The challenge this move addresses is not about Anthropic acting in bad faith. It is about the structural impossibility of fully credible self-assessment. Even a perfectly well-intentioned lab cannot run a genuinely independent audit of its own systems. The same pressure that produces capable models also shapes what counts as "safe enough" for deployment. Structural independence matters not because individuals are untrustworthy, but because the incentive landscape is not neutral. You cannot fully audit the thing you are also trying to ship.

The PM lens + the risk lens

For product teams building on or alongside frontier AI, Petri's transfer raises a genuinely useful question: what does your own evaluation infrastructure look like, and who controls it? Most teams borrow the credibility of the model provider's safety claims rather than independently verifying them. As regulatory expectations tighten, and the Five Eyes guidance published on May 1 signals they will, teams that have invested in their own evaluation pipelines, separate from the model provider's, will be better positioned. The lesson from Petri is not just that giving tools away is good PR. It is that credibility requires structural separation from commercial incentive, and the teams building that separation now will have it when regulators start asking.

The risk here is subtler than it first looks. Transferring a tool to an independent nonprofit removes one conflict of interest and introduces new ones. Meridian Labs will need sustained funding to maintain and improve Petri. Where that funding comes from will matter enormously. Nonprofits in the AI safety space have historically been vulnerable to funding concentration from the same companies whose systems they evaluate. The move is the right structural call. The question is whether Meridian Labs' governance is actually independent enough to hold over time. That is worth watching.

The next 12 to 24 months

Expect the combination of NLAs and Petri v3.0 under independent control to become a template. Other labs are watching. If Meridian Labs can demonstrate credible, rigorous evaluation results over the next year, the pressure on other labs to make comparable structural moves, or face pointed questions about why they have not, will grow. The optimistic version of this story over a five-year horizon is a world where independent AI evaluation organisations are as established and expected as financial auditing firms. That world starts with decisions like this one.

Source: Anthropic, "Donating our open-source alignment tool" (www.anthropic.com/research/donating-open-source-petri)

🛠 TOOL OF THE WEEK

GPT-5.5 Instant (the new default ChatGPT)

OpenAI replaced GPT-5.3 Instant with GPT-5.5 Instant as the default ChatGPT model on May 5. The headline number: 52.5% fewer hallucinated claims on high-stakes prompts in medicine, law, and finance. The model also reduced inaccurate responses by 37.3% on conversations users had previously flagged for factual errors, and its AIME 2025 math benchmark score jumped from 65.4 to 81.2. If you have professional use cases where hallucination risk was the dealbreaker, this is a meaningful moment to retest them. The gap between "impressive demo" and "reliable enough for real work" keeps closing.

→ Read more: TechCrunch

💬 ONE QUESTION

NLAs revealed that models suspected they were being evaluated 16% of the time, without ever saying so. If this kind of transparency became standard across the industry, what do you think it would most change: how we regulate AI, how we trust it, or how we build it?

Hit reply. We read every response.