#54 - 🔒 Mozilla just squashed 423 bugs in one month

Hey readers! 👋

This week's AI coding landscape is buzzing with security stories, new benchmarks, and a growing consensus that the real bottleneck isn't writing code anymore; it's everything that happens after. Mozilla fixed 423 security bugs in a single month, OpenAI wants you to trust Codex with your codebase (but safely, please), and Simon Willison is having an existential moment about whether he's vibe coding or engineering. Let's dig in.

🔒 Security Gets the AI Treatment

Behind the Scenes Hardening Firefox with Claude Mythos Preview - Mozilla built an agentic security pipeline on top of its existing fuzzing infrastructure, using Claude Mythos Preview to find and verify hundreds of latent vulnerabilities in Firefox. – Mozilla Hacks

The numbers tell the story: Mozilla went from fixing roughly 20-30 security bugs per month through 2025 to 423 in April alone. The pipeline generates reproducible test cases, filters out false positives, and feeds findings into the full bug lifecycle. Among the discoveries were a 20-year-old XSLT bug and a 15-year-old issue in the <legend> element. Reassuringly, Firefox's existing defense-in-depth, like freezing prototypes by default, blocked many of the AI's attempted sandbox escapes.

❝

"Rather than fixing these problems one-by-one, we made an architectural change to freeze these prototypes by default."

Running Codex Safely at OpenAI - OpenAI details how it governs its own coding agent through sandboxing, approval policies, and agent-native telemetry. Codex limits filesystem writes and network access, requires human review for higher-risk actions, and logs not just what happened but the intent and approval decisions behind each action. – OpenAI

Vercel's deepsec Brings AI-Powered Security Scanning Into the Development Workflow - deepsec is a new open-source tool that combines 110 regex matchers with Claude/Codex-powered investigation in a five-stage pipeline. False-positive rates sit around 10-20%, and costs can be significant for large codebases, but the goal is clear: security review at the speed of development. – DevOps.com
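To make "regex matchers plus model-powered investigation" a bit more concrete, here's a minimal sketch of the general pattern: a cheap regex pass flags candidates, and a second step decides which ones deserve attention. This is not deepsec's code. The two patterns, the function names, and the stubbed `investigate` step (a trivial heuristic standing in for a model call) are all assumptions for illustration, and the tool's actual five stages aren't reproduced here.

```python
import re
from dataclasses import dataclass

# Hypothetical matchers for illustration; deepsec's real patterns are not
# described in the article.
MATCHERS = {
    "eval-call": re.compile(r"\beval\s*\("),
    "hardcoded-secret": re.compile(
        r"(?i)(api_key|secret)\s*=\s*['\"][^'\"]+['\"]"
    ),
}


@dataclass
class Candidate:
    rule: str
    line_no: int
    text: str


def scan(source: str) -> list[Candidate]:
    """Cheap first pass: flag every line that matches any regex."""
    hits = []
    for line_no, line in enumerate(source.splitlines(), start=1):
        for rule, pattern in MATCHERS.items():
            if pattern.search(line):
                hits.append(Candidate(rule, line_no, line.strip()))
    return hits


def investigate(candidate: Candidate) -> bool:
    """Stand-in for the model-powered step (an assumption, not a real API).

    A real tool would ask a model to judge the finding in context; this
    placeholder only discards matches that sit on comment lines.
    """
    return not candidate.text.startswith("#")


def review(source: str) -> list[Candidate]:
    """Run the regex pass, then keep only candidates that survive investigation."""
    return [c for c in scan(source) if investigate(c)]
```

The split matters for cost: the regex pass is nearly free, so the expensive investigation only runs on the lines it flags.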

🤖 Agents Evolve Beyond the Editor

Introducing the Codex App - OpenAI launched a macOS app serving as a "command center for agents," letting developers manage multiple coding agents in parallel across projects with diff review, worktrees, and scheduled automations. – OpenAI

The app introduces "skills" that package instructions and scripts so agents can complete broader tasks beyond code generation. It's a clear signal that OpenAI sees Codex evolving from a code writer into something that uses code to get work done on your computer.

"The Terminal Still Matters": Amp Rebuilds Its CLI for an Agentic Future - Amp's Neo CLI lets you start an agent locally but manage the same session remotely via a web interface, streaming live terminal updates with follow-up prompts, queueing, and cancellations. – The New Stack

❝

"If the agent can create a good PR from a single prompt, the interaction model changes completely, you let go of the IDE and focus on driving things end-to-end."

Mistral Moves Coding Agents to the Cloud - Mistral's Vibe platform now runs agents asynchronously in cloud sandboxes, with a "teleporting" feature that preserves session state when you step away. Their Medium 3.5 model scores 77.6% on SWE-Bench Verified. – DevOps.com

Speaking of agents operating in the wild: if you're curious what it looks like when AI agents interact in a shared persistent world, SpaceMolt is a free MMO built specifically for AI agents to explore, trade, and build empires. It's an interesting sandbox for thinking about multi-agent coordination.

📊 Benchmarks and Productivity Data

Artificial Analysis Launches Coding Agent Index - This new benchmark evaluates model-plus-harness combinations across three coding benchmarks, and the results are revealing. Opus 4.7 in Cursor CLI leads at 61, with cost per task varying over 30x across combinations. Composer 2 in Cursor CLI costs just $0.07/task while GPT-5.5 in Codex runs $2.21/task. – Artificial Analysis

❝

"When developers use AI to code they're choosing a model, but also pairing it with a specific harness. It makes sense to benchmark that combination."
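A quick sanity check on that cost spread, using only the two per-task prices quoted above. The full index may contain wider extremes than this pair, and the 100-task workload is a hypothetical figure chosen to make the gap tangible.

```python
# Per-task costs as quoted in the item above (USD).
COST_PER_TASK = {
    "Composer 2 in Cursor CLI": 0.07,
    "GPT-5.5 in Codex": 2.21,
}


def cost_ratio(costs: dict[str, float]) -> float:
    """Ratio of the most expensive to the cheapest per-task cost."""
    return max(costs.values()) / min(costs.values())


def workload_cost(costs: dict[str, float], tasks: int) -> dict[str, str]:
    """Total cost per combination for a given number of tasks, formatted in USD."""
    return {name: f"${price * tasks:.2f}" for name, price in costs.items()}


print(round(cost_ratio(COST_PER_TASK), 1))  # 31.6
print(workload_cost(COST_PER_TASK, 100))    # hypothetical 100-task workload
```

So these two data points alone account for the "over 30x" figure: at 100 tasks, that's $7.00 versus $221.00.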

Survey Sees AI Driving DevOps Productivity Gains Despite Challenges - A survey of 636 developers found 64% report at least 25% productivity gains from AI, but only 53% say AI improves code quality. Senior-engineer reluctance (36%) and tool proliferation (31%) remain real obstacles. – DevOps.com

Gemma 4 shifts the Pareto Frontier on Code Arena - Gemma-4-31b ranks #13 among open models, notable for being runnable on a MacBook Pro. – @_philschmid

🔍 The Review Bottleneck

The theme of the week might be this: as AI writes more code, reviewing it becomes the constraint.

More Code, Faster Reviews: How Augment Rebuilt Code Review Using Cosmos - When 100% of Augment's code was AI-generated, PRs piled up past 1,400 open with a 20-hour median time-to-first-human-comment. They rebuilt review with coordinated agents that auto-approve low-risk PRs and route complex ones to specific review dimensions. The result: 3x code output with reduced merge times and stable bug rates. – Akshay Utture

❝

"The main bottleneck was confidence: a human reviewer needed to read and reason about every single line of code."
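Here's a rough sketch of what "auto-approve low-risk PRs and route complex ones to specific review dimensions" could look like as a routing rule. It is not Augment's implementation: the `PullRequest` fields, the dimension names, and the 50-line threshold are all hypothetical, picked only to show the shape of the idea.

```python
from dataclasses import dataclass


@dataclass
class PullRequest:
    number: int
    lines_changed: int
    touches_auth: bool
    touches_schema: bool


def route(pr: PullRequest, max_auto_lines: int = 50) -> list[str]:
    """Return the review dimensions a PR needs, or auto-approve if none apply.

    All signals and thresholds here are illustrative assumptions.
    """
    dimensions = []
    if pr.touches_auth:
        dimensions.append("security")
    if pr.touches_schema:
        dimensions.append("data-migration")
    if pr.lines_changed > max_auto_lines:
        dimensions.append("design")
    # No risk signal fired: the PR is small and touches nothing sensitive.
    return dimensions or ["auto-approve"]


print(route(PullRequest(1, 10, False, False)))  # ['auto-approve']
print(route(PullRequest(2, 400, True, False)))  # ['security', 'design']
```

The point of routing by dimension is that a reviewer no longer has to reason about every line, only about the risks that were actually flagged.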

Agent Pull Requests Are Everywhere. Here's How to Review Them. - Practical guidance from GitHub on validating intent and traceability when "correct" isn't deterministic. – GitHub Blog

Vibe Coding and Agentic Engineering Are Getting Closer Than I'd Like - Simon Willison admits the line between casual AI coding and professional AI-assisted engineering is blurring in his own workflow. His key insight: what matters most isn't tests or documentation; it's whether someone has actually used the thing. – Simon Willison

🛠️ Tools and Platforms Roundup

JetBrains Outlines 2026 AI Direction - Classic typing-first development and AI-assisted workflows should coexist. JetBrains is adopting Agent Client Protocol (ACP) to avoid vendor lock-in. – JetBrains AI Blog

Claude Code vs. Cursor: Which Is Best? - Zapier's comparison frames Cursor as the "close to the code" option and Claude Code as delegation-first. No single winner; they recommend using both if budget allows. – Zapier

I Tested the New OpenAI Codex Features on a Real Python Codebase - Jessica Wachtel tested Codex on HTTPie and found it fixed a bug in 3 minutes, though computer-use features hit sandbox limitations for terminal-heavy workflows. – The New Stack

Gemini Code Assist now offers a free individual tier with Gemini 2.5 (Gemini 3 coming), 1M-token context, and automated GitHub PR reviews via /gemini. – Google Developers
Best AI PR Automation Tools for Engineering Teams 2026 compares five tools across the PR lifecycle, with the honest advice to plan 2-4 weeks of tuning before measuring ROI. – Ani Galstian
Greptile offers paste-a-PR automated code review using full-repository graph context, no signup required. – Greptile
Atlassian's DX adds AI Code Insights including agent effectiveness scores and AI dollar impact estimates. – ComputerWeekly.com
OpenCode x Ring 2.6 1T is free for a limited time with 256K context and reasoning capabilities. – @opencode

That's a wrap for this week. The tools keep getting more capable, but the human side (reviewing, trusting, and governing AI output) is where the real work is happening now.

Made with ❤️ by Data Drift Press. Hit reply with your questions, comments, or feedback. We read every one!
