Developers and power users are locked in a heated debate over Claude Code performance, with social media amplifying a complex mix of data, benchmarks and product changes.
Developers allege regression in Claude Opus 4.6 and Claude Code
A growing number of developers and AI power users say Anthropic’s Claude Opus 4.6 and Claude Code feel less capable than they did just weeks ago. They complain the flagship coding model seems less reliable, more prone to wasting tokens and more likely to abandon tasks midway through demanding workflows.
Those concerns have spread quickly across GitHub, X and Reddit over the past several weeks. Moreover, several high-reach posts claim Claude is now worse at sustained reasoning and more likely to produce hallucinations or contradictions in complex coding jobs.
Some users have labeled the situation an “AI shrinkflation” moment, arguing that customers are paying the same price for a weaker product. Others have gone further, suggesting Anthropic may be throttling or tuning the system downward during peak demand to manage limited compute capacity.
Those throttling claims remain unproven. Anthropic employees have publicly denied degrading models to manage capacity. However, the company has acknowledged real changes to usage limits and reasoning defaults in recent weeks, which has made the broader debate much more combustible among power users.
VentureBeat has contacted Anthropic seeking clarification on the accusations, including whether any shifts to reasoning defaults, context handling, throttling behavior, inference parameters or benchmark methodology might explain the spike in complaints. As of publication time, the outlet was still awaiting a response.
AMD executive’s GitHub analysis helps ignite the controversy
The most detailed public complaint emerged on April 2, 2026, when Stella Laurenzo filed a GitHub issue about Claude Code. Her LinkedIn profile identifies her as a Senior Director in AMD’s AI group, giving the post added weight in the developer community.
In that issue, Laurenzo argued that Claude Code had regressed so sharply it could no longer be trusted for complex engineering work. She backed the claim with an analysis of 6,852 Claude Code session files, 17,871 thinking blocks and 234,760 tool calls, positioning her critique as data-driven rather than anecdotal.
Her analysis claimed that starting in February, Claude’s estimated reasoning depth fell while performance issues rose. According to the post, users were seeing more premature stopping, more “simplest fix” behavior, more reasoning loops and a clear shift from research-first to edit-first behavior in demanding coding scenarios.
Laurenzo’s broader point was that for advanced engineering workflows, extended reasoning is not a luxury feature but essential to making the model usable at all. That argument resonated with many professional developers who rely on deep, stepwise reasoning to maintain large codebases.
The GitHub thread soon jumped into the wider social media conversation. On April 11, X user @Hesamation posted screenshots of Laurenzo’s analysis, helping turn the issue into a more viral talking point among AI practitioners. That said, the technical details of her claims also drew scrutiny.
The amplification mattered because it gave the emerging “Claude is getting worse” narrative something more concrete than scattered frustration. Power users now cited a long, data-heavy post from a senior AI leader at a major chip company, arguing that a regression was visible in logs, tool-use patterns and user corrections, not just in gut feeling.
Anthropic’s product team disputes claims of hidden downgrades
Anthropic’s public response has focused on separating perceived changes from what it says would constitute true model degradation. In a pinned follow-up posted a week ago on the same GitHub issue, Claude Code lead Boris Cherny thanked Laurenzo for the depth of her work but rejected her main conclusion.
Cherny said the “redact-thinking-2026-02-12” header cited in the complaint reflects a UI-only change that hides intermediate thinking from the interface to reduce latency. He stressed that it “does not impact thinking itself,” “thinking budgets” or how extended reasoning operates internally.
He also pointed to two other product changes that he believes shape what users are seeing. On February 9, Opus 4.6 shifted to adaptive thinking by default. Then on March 3, Anthropic set a medium effort level, or effort level 85, as the default for Opus 4.6, which Cherny described as the best balance of intelligence, latency and cost for most users.
Cherny added that users who want more extended reasoning can manually increase effort by typing /effort high inside Claude Code terminal sessions. However, critics counter that such manual overrides are not obvious to most users, and that the new default can feel like a downgrade in practice compared with previous behavior.
This exchange highlights the core of the controversy. Critics like Laurenzo argue that in demanding coding workflows, the tool’s behavior has plainly worsened, citing detailed logs and usage patterns as evidence. Anthropic, by contrast, concedes changes but frames them as visible product and interface choices, not a secret downgrade of underlying model weights.
That framing may be technically important for documentation and benchmarking. Yet for power users who feel the tool delivers worse outcomes, the distinction can seem academic. From their perspective, if the default experience deteriorates, the practical effect mimics a silent model reduction regardless of internal labels.
Coverage from TechRadar and PC Gamer further amplified Laurenzo’s post and the wider wave of agreement from some heavy users. Moreover, another viral X thread from developer Om Patel on April 7 pushed the controversy even further into mainstream AI discussions.
Patel argued that someone had “actually measured” how much “dumber” Claude had become, summarizing the result as a 67% drop in performance. His thread helped popularize the “AI shrinkflation” label and carried the dispute well beyond core Claude Code users into the general AI developer community on X.
These claims have resonated because they mirror what many frustrated users say they are seeing in daily work: more unfinished tasks, more backtracking, heavier token burn and a stronger sense that Claude is less willing to reason deeply through complicated coding jobs than it was earlier in the year.
Benchmark volatility turns frustration into public controversy
The loudest benchmark-based accusation has centered on the BridgeBench hallucination benchmark, run by BridgeMind. On April 12, the account posted that Claude Opus 4.6 had fallen from 83.3% accuracy and a No. 2 ranking in an earlier result to 68.3% accuracy and No. 10 in a new retest, calling this proof that “Claude Opus 4.6 is nerfed.”
That post spread quickly, becoming one of the main data points cited by those arguing that Anthropic had degraded the model. Other users circulated benchmark-related or task-based comparisons suggesting Opus 4.6 was underperforming relative to Opus 4.5 on practical coding tasks. Still others pointed to TerminalBench-related results as supposed evidence that the model’s behavior had changed in specific harnesses or product contexts.
The cumulative effect was powerful. Benchmark screenshots, side-by-side tests and anecdotal frustration began reinforcing one another, turning individual complaints into what looked like a broader pattern. However, the strength of these claims has come under renewed scrutiny from outside researchers.
This pattern matters because benchmark graphics usually travel farther than subjective reports. A developer saying a model “feels worse” is one thing. A screenshot showing a fall from No. 2 to No. 10, alongside a double-digit percentage drop in apparent accuracy, can look like hard proof even when the underlying comparisons are more complex.
Outside researchers challenge the benchmark narrative
The most important rebuttal to the BridgeBench claim came not from Anthropic but from Paul Calcraft, a software and AI researcher active on X. He argued that the viral comparison was misleading because the initial Opus 4.6 result used only six tasks, while the later run used 30, making it a fundamentally different benchmark configuration.
In Calcraft’s words, it was a “DIFFERENT BENCHMARK.” He further noted that on the six tasks both runs shared, Claude’s score moved only modestly, from 87.6% to 85.4%. He suggested that the larger apparent swing came mostly from a single fabrication result without repeats, which could easily fall within ordinary statistical noise.
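The sensitivity Calcraft describes is easy to see with back-of-the-envelope arithmetic. The sketch below is a simplified illustration, not BridgeBench’s actual scoring methodology: it assumes each task contributes equally to an accuracy percentage and shows how much a single flipped result moves the aggregate on a six-task run versus a 30-task run.

```python
# Illustrative only: assumes a simple mean over equally weighted tasks,
# which is NOT necessarily how BridgeBench aggregates its scores.

def accuracy(results: list[float]) -> float:
    """Mean task score expressed as a percentage."""
    return 100.0 * sum(results) / len(results)

# Six-task run: flipping one task from pass (1.0) to fail (0.0)
small_before = [1.0] * 6
small_after = [1.0] * 5 + [0.0]
print(accuracy(small_before) - accuracy(small_after))   # ~16.7-point swing

# Thirty-task run: the same single flip barely moves the aggregate
large_before = [1.0] * 30
large_after = [1.0] * 29 + [0.0]
print(accuracy(large_before) - accuracy(large_after))   # ~3.3-point swing
```

Under that simplified view, a single unrepeated result carries roughly five times as much weight in a six-task configuration as in a 30-task one, which is why the two runs are hard to compare directly.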
That outside rebuttal matters because it undercuts one of the cleanest, most shareable claims in circulation. It does not prove users are wrong to think something has changed. However, it does imply that at least some benchmark evidence now driving the discussion may be overstated, poorly normalized or not directly comparable.
Even the original BridgeBench post drew a community note making a similar point. The note emphasized that the two runs covered different scopes — six tasks versus 30 — and that the subset of common tasks showed only minor change. That does not render the later result meaningless, but it weakens the strongest form of the “BridgeBench proved it” argument.
This has become a defining feature of the controversy. Some claims are grounded in first-hand experience and detailed logs. Others rest on real product changes. Still others rely on benchmark comparisons that may not be apples-to-apples. And some depend on inferences about hidden system behavior that anyone outside Anthropic cannot directly verify.
Capacity pressure and shifting session limits fuel suspicion
The backlash is also unfolding in the shadow of a confirmed Anthropic policy change from late March. On March 26, Anthropic technical staffer Thariq Shihipar posted that, “To manage growing demand for Claude,” the company was adjusting how 5-hour session limits work for Free, Pro and Max subscribers during peak hours, while keeping weekly limits unchanged.
He explained that during weekdays from 5 a.m. to 11 a.m. Pacific time, users would now move through those 5-hour limits faster than before. In follow-up posts, he said Anthropic had achieved some efficiency gains to offset the impact, but that roughly 7% of users, particularly on Pro tiers, would hit session limits they would not have previously reached.
In a March 27, 2026 email to VentureBeat, Anthropic clarified that Team and Enterprise customers were not affected and that the shift was not dynamically optimized per user, but rather applied consistently across the publicly described peak window. The company also reiterated that it continued to invest in scaling capacity.
Those comments focused solely on session limits, not model downgrades. However, they established two facts that many users now connect publicly: Anthropic faces surging demand, and it has already changed how usage is rationed during busy periods. For some, that context makes it easier to believe that other, less visible tradeoffs might also be in play.
Prompt caching, TTL and quota burn concerns
A separate GitHub issue has broadened the debate from quality into pricing and quota behavior. In issue #46829, user seanGSISG argued that Claude Code’s prompt-cache time-to-live, or TTL, appeared to shift from one hour back to five minutes in early March. The claim was based on nearly 120,000 API calls drawn from Claude Code session logs across two machines.
According to the complaint, this change drove meaningful increases in cache-creation costs and quota burn, especially in long-running coding sessions where cached context expired quickly and had to be rebuilt. The author suggested this shift helped explain why some subscribers began hitting usage limits they had never previously encountered.
Unlike in some other threads, Anthropic did not deny that a change had occurred. In a reply, Jarred Sumner confirmed a March 6 change was real and intentional but rejected the idea that it was a regression. He said Claude Code uses different cache durations for different request types and noted that a one-hour cache is not always cheaper because those writes cost more up front and only pay off when contexts are reused often enough.
In his explanation, the adjustment was part of ongoing cache optimization, not a silent cost increase, and the pre–March 6 behavior described in the issue “wasn’t the intended steady state.” That said, early adopters looking mainly at their own bills and usage patterns saw the timing as one more reason to worry.
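The economics Sumner describes can be sketched with simple arithmetic. The sketch below is an illustration only: the multipliers are assumptions based on the prompt-caching pricing Anthropic has published for its API (roughly 1.25x base input price for five-minute cache writes, 2x for one-hour writes and 0.1x for cache reads), the dollar figure is hypothetical, and the model ignores Claude Code’s own caching heuristics and request mix.

```python
# Illustrative cost comparison for caching one fixed block of context.
# Multipliers are assumptions drawn from Anthropic's published caching
# pricing (verify against current docs); BASE is a hypothetical price.

def session_cost(cached_tokens: int, base_price: float,
                 write_mult: float, writes: int, hits: int) -> float:
    """Cost of one cached context block over a session: each cache write
    is billed at the write multiplier, each cache hit at ~0.1x."""
    return cached_tokens * base_price * (write_mult * writes + 0.1 * hits)

TOKENS = 100_000          # size of the cached context
BASE = 5 / 1_000_000      # hypothetical $ per base input token

# Long session, six reuses spaced ~20 minutes apart over two hours:
# with a 5-minute TTL every reuse misses and rewrites the cache,
# while a 1-hour TTL is written twice and hit the rest of the time.
print(session_cost(TOKENS, BASE, 1.25, writes=7, hits=0))  # five-minute TTL
print(session_cost(TOKENS, BASE, 2.0, writes=2, hits=5))   # one-hour TTL (cheaper here)

# Short burst, three reuses within five minutes: both TTLs hit on every
# reuse, so the cheaper five-minute write wins and the 1-hour cache never pays off.
print(session_cost(TOKENS, BASE, 1.25, writes=1, hits=3))  # five-minute TTL (cheaper here)
print(session_cost(TOKENS, BASE, 2.0, writes=1, hits=3))   # one-hour TTL
```

In other words, whether the longer TTL saves money depends on how often cached context is reused within the window, which is the tradeoff Sumner pointed to.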
The thread later drew a more detailed response from Anthropic’s Cherny, who described one-hour caching as “nuanced” and said the company has been testing heuristics to improve cache hit rates, token usage and latency for subscribers. He wrote that Anthropic keeps five-minute caches for many queries, including subagents rarely resumed, and that turning off telemetry also disables experiment gates, causing Claude Code to fall back to a five-minute default in some cases.
Cherny added that Anthropic plans to expose environment variables allowing users to force one-hour or five-minute cache behavior directly. Together, these replies did not validate the claim that Anthropic secretly made Claude Code more expensive overall. However, they did confirm that Anthropic has been actively experimenting with cache behavior in exactly the period when users began complaining more about quota burn and changing product behavior.
Anthropic insists product settings, not hidden nerfs, drive the changes
Anthropic-affiliated employees have pushed back publicly on the broadest accusations. In one widely circulated X reply, Cherny responded to allegations that Anthropic had secretly nerfed Claude Code by stating, “[This is false].” He reiterated that the tool had been defaulted to medium effort largely in response to feedback that Claude was consuming too many tokens by default.
He also said the change to medium effort had been disclosed both in the product changelog and in a dialog shown to users when they opened Claude Code. That response is notable because it concedes a meaningful change in default behavior while rejecting the more conspiratorial interpretation that Anthropic secretly sacrificed quality to expand capacity.
Public documentation backs the claim that effort defaults have been in flux. Claude Code’s changelog states that on April 7, Anthropic switched the default effort level from medium to high for API-key users, as well as for Bedrock, Vertex, Foundry, Team and Enterprise customers. That record shows the company actively tuning defaults across segments, in ways that could easily affect how users perceive output quality and latency.
Shihipar has also directly denied the broader demand-management theory. In an X reply posted April 11, he said Anthropic does not “degrade” its models to better serve demand. He further argued that changes to thinking summaries altered how some users were measuring Claude’s “thinking,” and that the company had not found internal evidence to support the strongest qualitative claims circulating online.
At the same time, some developers now point to reports of a Claude Code performance drop as a signal that at least parts of the user base feel burned by recent changes. Whether those changes are best understood as interface tweaks, default-effort shifts, quota rules or something deeper, the perception problem is real.
Trust gap widens as rival tools gain attention
What now appears clearest is that a trust gap has opened between Anthropic and some of its most demanding users. For developers who live inside Claude Code all day, subtle shifts in visible thinking output, effort defaults, token burn, latency tradeoffs or usage caps can feel indistinguishable from a weaker model.
That is true whether the root cause is a product setting, a UI change, an inference-policy adjustment, capacity pressure or a genuine quality regression. That range of possibilities also means both sides may be talking past each other, emphasizing different parts of the same evolving story.
Users are largely describing their lived experience: more friction, more failures and less confidence under heavy workloads. Anthropic is speaking in product terms: effort levels, hidden thinking summaries, changelog entries and repeated denials that demand pressure has led to secret model degradation.
Those narratives are not necessarily incompatible. A model can feel worse to paying customers even if the company believes it has not “nerfed” its weights, at least not in the way critics allege. But the timing complicates matters, because OpenAI, Anthropic’s chief rival, has recently shifted more resources into its own enterprise- and coding-focused products, including a more mid-range ChatGPT subscription that it hopes will drive additional adoption.
That competitive context makes the current storm of Claude Code performance regression claims particularly awkward for Anthropic. It is not the kind of publicity likely to help with customer retention, especially among developers who feel their workflows have already been disrupted by shifting defaults and evolving quotas.
At the same time, public evidence remains mixed. Some of the most widely shared complaints come from developers with detailed logs and strong opinions shaped by repeated use. Some benchmark evidence has been challenged by external researchers over methodology. And Anthropic’s own changes to limits, cache settings and effort defaults ensure that this debate unfolds against a backdrop of real adjustments, not just rumor.
For now, the controversy around Claude Code performance degradation looks less like a settled indictment and more like an evolving stress test of how AI vendors communicate product changes to their most advanced users. How Anthropic responds, in data, documentation and day-to-day behavior, will likely determine whether trust is rebuilt or erodes further.

