The Thinking Was Cut. The Record Wasn't.
On April 4, AMD's Director of AI, Stella Laurenzo, filed a report on Anthropic's official GitHub repository. Not a complaint. A forensic analysis — 6,852 engineering sessions, 17,871 thinking blocks, 234,760 tool calls — proving that Claude Code's reasoning depth had been cut by 67–75% after a February update.
The model didn't break. It was quietly made shallower.
Thinking depth dropped from approximately 2,200 characters in early February to 560 by March. The model stopped researching before editing. It started blaming existing code for its own mistakes. In 17 days, her monitoring script flagged 173 instances of lazy behavior, up from zero. With the same workload and the same request volume, monthly API fees went from $345 to $42,121. The team wasn't using the model more. The model was producing worse output, which triggered error-correction loops that burned tokens without producing results.
Anthropic's response: default settings were adjusted for efficiency. The model's core capability, they said, hadn't changed.
Around the same time, Anthropic introduced a feature that hides the model's thinking process from the user interface. Before the change, users could see how the model reasoned. After, they couldn't.
First, reduce thinking depth. Then, remove the user's ability to see that it was reduced.
This is not an isolated incident.
In June 2025, Google released the production version of Gemini 2.5 Pro. Developers reported that it performed worse than the preview: higher hallucination rates, context abandonment, degraded code generation. Google did not acknowledge it. When Gemini 3.0 launched in late 2025, developer forums reported regressions in reasoning and context retention, even though benchmarks showed improvement. In 2023, Stanford researchers documented that the share of GPT-4's generated code that was directly executable dropped from 52% to 10% in three months. OpenAI's VP of Product responded: "No, we haven't made GPT-4 dumber."
A peer-reviewed study published in PLOS ONE in February 2026 confirmed that deployed models drift. The authors tracked three model families over ten weeks and found "meaningful behavioral drift across deployed transformer services." Their conclusion: because providers don't release update logs or training details, "any attribution for observed degradation would be purely speculative."
They can prove the model changed. They cannot prove why. Because the providers won't say.
The pattern is documented across every major provider. Model quality degrades. Developers notice. Providers deny or stay silent. The cycle repeats. The industry's trust foundation isn't cracking. It was never there.
AI models are now embedded in medical decisions, financial analysis, legal research, and engineering workflows. When the model behind those decisions quietly gets 67% shallower, the decisions get shallower too. The user doesn't know. The patient doesn't know. The client doesn't know.
Stella Laurenzo had to build her own monitoring scripts and analyze months of session logs to prove what happened. She had the engineering expertise of an AMD AI director and 6,852 sessions of data. Most users have neither.
RE was built for this.
RE's evidence chain records every interaction — input, thinking process, output — as a signed RFC 5322 email object. Timestamped, hash-chained, append-only. When the model's reasoning depth changes, the change is visible in the chain. Not because you built a monitoring script. Because the record was always there.
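To make the idea concrete, here is a minimal Python sketch of a hash-chained record stored as an email message. It is an illustration under stated assumptions, not RE's actual implementation: the `X-RE-*` header names, the `example.invalid` addresses, the body layout, and the use of HMAC in place of a real signature scheme are all hypothetical choices made for this example.

```python
import hashlib
import hmac
from datetime import datetime, timezone
from email.message import EmailMessage
from email.utils import format_datetime

GENESIS = "0" * 64  # assumed placeholder "previous hash" for the first record


def _digest(prev_hash, date, body):
    """Hash the fields that define a record, including the link to its predecessor."""
    material = "\n".join([prev_hash, date, body]).encode("utf-8")
    return hashlib.sha256(material).hexdigest()


def _sign(key, record_hash):
    """HMAC stands in for a real signature scheme here (an assumption)."""
    return hmac.new(key, record_hash.encode("ascii"), hashlib.sha256).hexdigest()


def append_record(chain, key, prompt, thinking, output, when=None):
    """Build one interaction record as an email message and append it to the chain."""
    prev_hash = str(chain[-1]["X-RE-Record-Hash"]) if chain else GENESIS
    msg = EmailMessage()
    msg["From"] = "agent@example.invalid"   # hypothetical addresses
    msg["To"] = "owner@example.invalid"
    msg["Subject"] = f"Evidence record {len(chain) + 1}"
    msg["Date"] = format_datetime(when or datetime.now(timezone.utc))
    msg.set_content(f"INPUT:\n{prompt}\n\nTHINKING:\n{thinking}\n\nOUTPUT:\n{output}\n")
    # Hash the text as stored, so verification reads back exactly the same input.
    record_hash = _digest(prev_hash, str(msg["Date"]), msg.get_content())
    msg["X-RE-Prev-Hash"] = prev_hash        # hypothetical header names
    msg["X-RE-Record-Hash"] = record_hash
    msg["X-RE-Signature"] = _sign(key, record_hash)
    chain.append(msg)
    return msg


def verify_chain(chain, key):
    """Return True only if every link, hash, and signature checks out."""
    prev_hash = GENESIS
    for msg in chain:
        expected = _digest(prev_hash, str(msg["Date"]), msg.get_content())
        if str(msg["X-RE-Prev-Hash"]) != prev_hash:
            return False
        if str(msg["X-RE-Record-Hash"]) != expected:
            return False
        if not hmac.compare_digest(str(msg["X-RE-Signature"]), _sign(key, expected)):
            return False
        prev_hash = expected
    return True


def thinking_lengths(chain):
    """Character count of the THINKING section in each record.

    Relies on the body layout written by append_record; a prompt that
    contains the same markers would confuse this simple split.
    """
    lengths = []
    for msg in chain:
        after_marker = msg.get_content().split("\n\nTHINKING:\n", 1)[1]
        lengths.append(len(after_marker.split("\n\nOUTPUT:\n", 1)[0]))
    return lengths
```

With records like these, a drop in reasoning depth shows up as a falling series from `thinking_lengths`, and editing any earlier record makes `verify_chain` return `False`. The sketch only makes tampering detectable. Append-only storage and custody outside the provider's control are properties of where the messages are kept, which this example does not model.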
In Update #8, we wrote about Google making Thought Signatures mandatory but leaving custody undefined. Update #10 shows why custody matters: when the provider can change the model and hide the change, the only protection is a record the provider can't touch.
Stella's report was a forensic reconstruction — built after the fact, with enormous effort. RE's evidence chain is forensic by design — built before anything goes wrong, recording continuously, stored in infrastructure the provider doesn't control.
Your inbox.
The question from Update #1 hasn't changed: when AI acts on your behalf — and something changes — where's the evidence?
Now we know the change is real. Documented by an AMD director. Confirmed by Stanford researchers. Validated by peer-reviewed science. The only question left is: who keeps the record? The provider who made the change — or you?
— Che, Solo developer, Project RE, Taipei Taiwan
Sources:
- Stella Laurenzo, GitHub Issue #42796: https://github.com/anthropics/claude-code/issues/42796
- Chen, Zaharia, Zou, "How Is ChatGPT's Behavior Changing over Time?" (Stanford/UC Berkeley, 2023): https://arxiv.org/abs/2307.09009
- Wiese, "Human-anchored longitudinal comparison of generative AI" (PLOS ONE, Feb 2026): https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0339920