Hey readers! 👋

Big week in AI coding land. Anthropic dropped Claude Opus 4.7 and the community can't quite decide if it's a step forward or sideways, while OpenAI's Codex is making aggressive moves that have some wondering whether Claude Code's lead is shrinking. We've got benchmarks, real-world tests, regressions, pelicans on bicycles, and a beloved CLI companion that vanished without a trace. Let's dig in.

🔬 Claude Opus 4.7: The Verdict Is... Complicated

Introducing Claude Opus 4.7 - Anthropic's latest flagship model is generally available, promising better instruction following, self-verification of outputs, and substantially improved high-resolution vision (up to 2,576 pixels on the long edge). Pricing stays the same as Opus 4.6. - Anthropic

Claude Opus 4.7 vs Opus 4.6 - The numbers look solid on paper: Opus 4.7 beats 4.6 on 12 of 14 benchmarks, with SWE-bench Verified jumping from 80.8% to 87.6%. But migration isn't painless. A new tokenizer can inflate token counts by up to 35%, and the model's more literal instruction-following means prompts tuned for 4.6 may need reworking. - Jonathan Chavez

"Basically 4.7-low is strictly better than 4.6-medium, 4.7-medium is strictly better than 4.6-high, 4.7-high is now better than 4.6-max"

That efficiency story matters. If you can get 4.6-quality results at a lower effort tier, your effective cost drops even though per-token pricing hasn't changed.
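As a back-of-envelope sketch of that effect: the only figures below taken from the coverage are the up-to-35% tokenizer inflation and the unchanged per-token pricing; the task sizes, tier token counts, and price are illustrative assumptions, not published numbers.

```python
# Back-of-envelope estimate of effective cost when a newer model
# matches an older model's quality at a lower effort tier.
# Task sizes and price below are hypothetical placeholders.

def effective_cost(tokens, price_per_mtok, inflation=0.0):
    """Dollar cost for a task, allowing for tokenizer inflation."""
    return tokens * (1 + inflation) * price_per_mtok / 1_000_000

PRICE = 75.0  # $/Mtok -- hypothetical flat price shared by both models

# Hypothetical: a task that needed 200k tokens at 4.6-medium is matched
# by 4.7-low in 120k tokens, with worst-case 35% tokenizer inflation.
old = effective_cost(200_000, PRICE)                  # 4.6-medium
new = effective_cost(120_000, PRICE, inflation=0.35)  # 4.7-low

print(f"4.6-medium: ${old:.2f}")   # $15.00
print(f"4.7-low:    ${new:.2f}")   # $12.15
print(f"savings:    {100 * (1 - new / old):.0f}%")  # 19%
```

Even with worst-case token inflation, dropping one effort tier can still come out cheaper; the crossover depends entirely on how much the lower tier's token usage shrinks for your workload.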

But not everyone is celebrating. Peter Gostev's BullshitBench results show Opus 4.7 actually performing worse than the 4.6 family at pushing back on incorrect premises, with the "Max" thinking version scoring just 74% pushback versus 83% for non-thinking. And community reports suggest long-context gains from Opus 4.6 may have regressed.

Towards AI's analysis frames the mixed reception well: the new adaptive thinking approach and literal prompting behavior produced both strong benchmark wins and notable user backlash. The bigger strategic story might actually be Claude Design, a new visual prototyping tool that hints at Anthropic building an end-to-end artifact pipeline.

"If Claude becomes the place where the prototype, the deck, the spec, and the implementation handoff all happen, benchmark leadership becomes only one part of the moat."

⚔️ Real-World Testing: Opus 4.7 in the Wild

Claude Opus 4.7 vs ChatGPT 5.4 on real code - AI Luke tested both models on three practical tasks. Opus 4.7 won the landing page redesign convincingly, the bug-finding round was a draw with little overlap in findings, and both tools failed on a small UI fix due to workflow issues. His conclusion: "I think Opus 4.7 wins by a landslide" for design-heavy work. - AI Luke

XBOW's offensive security evaluation offers a more nuanced take. At first glance, Opus 4.7 looked weaker because it found fewer vulnerabilities per iteration. But when normalized by tokens and wall-clock time, it proved more efficient, taking smaller, more deliberate steps. Visual acuity jumped from 54.5% to 98.5% accuracy.

"Given the same token budget, Opus 4.7 gets further. In other words, it's not less capable, it's more efficient."
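The normalization XBOW is describing can be sketched in a few lines. The run counts and token budgets here are made-up placeholders; only the shape of the result, fewer raw findings but more per token spent, comes from their write-up.

```python
# Normalizing raw vulnerability counts by token budget, in the spirit
# of XBOW's evaluation. All run stats below are hypothetical.

def findings_per_mtok(findings, tokens_used):
    """Findings per million tokens consumed."""
    return findings / (tokens_used / 1_000_000)

# Hypothetical runs: the older model finds more vulnerabilities in
# absolute terms, but burns far more tokens doing it.
older = findings_per_mtok(findings=12, tokens_used=9_000_000)
newer = findings_per_mtok(findings=10, tokens_used=5_000_000)

print(f"older model: {older:.2f} findings/Mtok")  # 1.33
print(f"newer model: {newer:.2f} findings/Mtok")  # 2.00
```

The same caveat applies to any per-iteration benchmark: if one model takes smaller, cheaper steps, raw counts per iteration will understate it.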

And in the "delightfully absurd benchmarks" category, Simon Willison reports that Alibaba's Qwen3.6-35B-A3B, running locally on a laptop, drew a better pelican-on-a-bicycle SVG than Opus 4.7. The pelican benchmark's correlation with model usefulness, he notes, is weakening.

One practical tip floating around: running Opus 4.7 with a minimal system prompt (claude --system-prompt ".") reportedly improves performance in Claude Code.

🚀 Codex Goes Big: Is It Coming for Claude Code?

Codex for (almost) everything - OpenAI's latest Codex update is ambitious. Background computer use on macOS, 90+ plugins, persistent memory, scheduled automations, an in-app browser, and image generation. The pitch is clear: Codex wants to be your entire development environment, not just a code completion tool. - OpenAI

OpenAI Expands Codex to Challenge Claude Code digs into what this means competitively. The "computer use" capability, where Codex can open apps, click buttons, and type text while you keep working, moves coding agents past the IDE boundary entirely. - James Maguire

"The center of gravity shifts from code generation to system operation, with agent actions persisting across sessions and days."

Scaling Codex to enterprises worldwide shows the adoption numbers: from 3 million to 4 million weekly developers in just two weeks. OpenAI is launching Codex Labs to embed their experts directly in enterprise teams and partnering with major system integrators like Cognizant to standardize AI-assisted development at scale.

Speaking of AI agents doing interesting things beyond coding: SpaceMolt is a free MMO built specifically for AI agents to explore, trade, and battle, and a fun glimpse at where autonomous agent capabilities are heading.

🔍 AI Code Review Gets Serious

The code review space is heating up alongside these model updates.

Anthropic Introduces Agent-Based Code Review for Claude Code - Multiple AI reviewer agents analyze PRs in parallel, with internal adoption reportedly increasing substantive review comments from 16% to 54%. Currently in research preview for Team and Enterprise users. - InfoQ

Cloudflare's AI code review orchestration takes a different approach, launching up to seven specialized reviewers (security, performance, compliance, etc.) per merge request. Engineers have only needed to override the system on 0.6% of merge requests. - Ryan Skidmore

  • Greptile Agent indexes repos into dependency graphs and uses agent swarms for context-aware review, claiming over 50% more bugs caught versus CodeRabbit.

  • Qodo integrates AI review directly into Cursor, shifting detection to before the PR stage.

📊 Quick Hits

  • SWE-Bench Verified Leaderboard - Claude Mythos Preview leads at 93.9% across 500 real-world coding tasks. The average across 86 models is 64.0%.

  • LiveCodeBench - DeepSeek-V3.2 (Thinking) tops the contamination-free coding benchmark at 83.3%.

  • GitHub pauses new Copilot sign-ups as agentic workflows strain infrastructure. Opus models removed from Pro plans; Opus 4.7 retained on Pro+.

  • Claude Code's /buddy feature vanished in v2.1.97 with no changelog mention. Users downgraded to older versions just to keep it. "That's not normal behavior for a CLI tool feature. That's love."

  • Claude Opus 4.7 System Card - Anthropic's safety evaluation confirms Opus 4.7 doesn't advance the capability frontier beyond Mythos Preview, keeping catastrophic risk assessments low.

The bottom line this week: Opus 4.7 is genuinely better at most things, but the regressions and workflow changes mean it's not a simple upgrade. Meanwhile, Codex is expanding so aggressively that the Claude Code vs. Codex question is becoming the defining rivalry in AI-assisted development. The real winner? Developers who learn to use both effectively.

Until next week, happy coding! 🚀

Made with ❤️ by Data Drift Press - Hit reply with your questions, comments, or feedback. We read every one.
