#57 - 🤖 Agents are multiplying themselves now

Hey readers! 👋

Big week in AI coding land. The major players are all making moves: Anthropic shipped dynamic workflows and security tooling, GitHub launched a full desktop app for Copilot, OpenAI expanded Codex beyond developers, and JetBrains dropped an open-source model aimed squarely at the infrastructure layer. Meanwhile, a new benchmark called DeepSWE is shaking up how we evaluate these tools, and the supply-chain security conversation is getting louder. Let's dig in.

🤖 Claude Code Gets Multi-Agent Workflows and Security Smarts

The biggest Anthropic news this week is the introduction of dynamic workflows in Claude Code. Instead of relying on a single agent pass, Claude can now dynamically plan subtasks and spin up tens to hundreds of parallel subagents that divide work, cross-check findings, and converge on a coordinated result.

Introducing dynamic workflows - Anthropic's official announcement details how Claude writes orchestration scripts on the fly, with progress saved so interrupted runs can resume. The headline example: a port of Bun from Zig to Rust touching ~750k lines, with 99.8% of tests passing. – Anthropic

❝

"Parallel agent orchestration moves the hard problem from writing code to confirming it is correct."

Claude Code's Dynamic Workflows Take on the Tasks That Were Too Big to Automate offers a practical breakdown of use cases like codebase-wide audits, large migrations, and adversarial verification. The caveat: token usage is substantially higher, and Anthropic recommends starting small. – Tom Smith, DevOps.com
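
For readers who want a feel for the shape of this pattern, here is a minimal Python sketch of fan-out, cross-check, and resume. It is not Anthropic's implementation: `run_subagent`, `verify`, and the JSON checkpoint file are hypothetical stand-ins, and a real workflow would call a model and a real verifier where the stubs are.

```python
import json
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def run_subagent(task: str) -> str:
    """Hypothetical stand-in for one subagent; a real one would call a model."""
    return f"result for {task}"


def verify(task: str, result: str) -> bool:
    """Hypothetical cross-check; a real one might be a second agent or a test run."""
    return result.endswith(task)


def run_workflow(tasks: list[str], checkpoint: Path, max_workers: int = 8) -> dict[str, str]:
    # Load finished work so an interrupted run resumes instead of restarting.
    done: dict[str, str] = json.loads(checkpoint.read_text()) if checkpoint.exists() else {}
    pending = [t for t in tasks if t not in done]

    # Fan out: run the remaining subtasks in parallel.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for task, result in zip(pending, pool.map(run_subagent, pending)):
            # Converge: keep only results that pass the independent check.
            if verify(task, result):
                done[task] = result
                checkpoint.write_text(json.dumps(done))  # save progress as we go

    return done


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        results = run_workflow(["audit auth", "audit billing"], Path(tmp) / "progress.json")
        print(sorted(results))
```

Even in a toy like this, the interesting code is in `verify` and the checkpoint, not in the part that produces results, which matches the quote above.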

On the security front, Claude Code Security is a new research preview that reasons about code behavior rather than just matching patterns. Anthropic's Frontier Red Team reportedly found 500+ vulnerabilities in production open-source codebases, including some that had gone undetected for decades. – DevOps.com

Complementing that, Anthropic's security-guidance plugin runs three review stages, from lightweight pattern checks during edits to deep diff-based analysis at commit time. Early results show a 30-40% decrease in security-related PR comments. Free on all plans. – Help Net Security

And looking ahead, Anthropic announced plans to roll out Claude Mythos-class models to all customers in the coming weeks, though it has shared few details on the exact timeline or capabilities.

🖥️ GitHub Copilot Goes Desktop

GitHub Copilot app: The agent-native desktop experience - Announced at Build 2026, the new Copilot desktop app acts as a central "My Work" workspace consolidating multiple AI agent sessions across repos. Each session runs in its own isolated git worktree, and developers can start work in VS Code or the CLI and continue from a phone. – GitHub Blog

Key additions include "canvases" for inspecting and adjusting agent output beyond chat, "Agent Merge" for CI-driven automated merging, and a medium-tier analysis model for deeper semantic code review. Also notable: Copilot code review for Azure Repos is now in technical preview, bringing inline PR review directly into Azure DevOps without requiring a Copilot license. – GitHub Changelog

🔧 OpenAI Codex Expands Beyond Developers

Codex for every role, tool, and workflow - OpenAI is positioning Codex as an enterprise workspace, not just a coding tool. Non-technical teams now make up about 20% of weekly Codex users and are growing faster than developers. New features include six role-specific plugins covering analytics, sales, design, and more, plus "Sites" for generating shareable internal web apps and "Annotations" for targeted in-place edits. – OpenAI

Braintrust Cedes Coding to Codex provides a real-world case study: the AI observability platform uses Codex to turn customer feature requests into functional preview branches in minutes. Half the team adopted it within a month. CEO Ankur Goyal says the primary benefit isn't faster coding but a compressed customer feedback loop. – StartupHub.ai

📊 Benchmarks and Measurement

GPT-5.5 Beats Claude and Gemini in New Coding Benchmark DeepSWE - DeepSWE, a new 113-task benchmark across 91 repos in five languages, is getting attention as a cleaner alternative to SWE-Bench. GPT-5.5 leads at 70%, followed by GPT-5.4 (56%), Claude Opus 4.7 (54%), and Gemini 3.1 Pro (10%). The benchmark uses task-specific behavioral verifiers rather than historical PR tests, and GPT models showed notably stronger instruction fidelity. – AI & Data Insider

Can LLMs Generate Enterprise-Quality Code? asks the right question: standard benchmarks like HumanEval mainly test functional correctness but miss security, reliability, and maintainability. High scores can still hide problematic code. – StartupHub.ai

AI coding tools' impact: Metrics, ROI, and Review Signals in 2026 provides a practical measurement framework, arguing teams need to track delivery outcomes, not just usage.

❝

"The question is not 'How much faster did developers generate code?' It's 'How much usable capacity did the team recover after review, rework, and delivery risk?'"
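
To make that question concrete, here is a rough back-of-the-envelope sketch in Python. The formula and every number in it are illustrative assumptions, not figures from the article: it simply subtracts the downstream costs the quote names from the hours saved while writing code.

```python
from dataclasses import dataclass


@dataclass
class WeeklyHours:
    """Illustrative team-wide figures for one week (all values are assumptions)."""
    saved_writing: float   # hours saved by generating code with AI tools
    extra_review: float    # additional hours spent reviewing AI-written code
    rework: float          # hours spent fixing or rewriting accepted output
    delivery_risk: float   # hours lost to incidents or rollbacks tied to that output


def net_recovered(hours: WeeklyHours) -> float:
    """Usable capacity left over after review, rework, and delivery risk."""
    return hours.saved_writing - hours.extra_review - hours.rework - hours.delivery_risk


week = WeeklyHours(saved_writing=40.0, extra_review=12.0, rework=9.0, delivery_risk=4.0)
print(net_recovered(week))  # 15.0
```

In this made-up week, 40 hours of faster coding shrinks to 15 hours of recovered capacity, which is why usage metrics alone can overstate the return.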

🔓 Open Source and Security

JetBrains open-sources Mellum2 - A 12B-parameter coding model (Apache 2.0) designed for the infrastructure layer of agentic AI, handling routing, retrieval pipelines, and sub-agent coordination. Using Mixture-of-Experts with only 2.5B active parameters per token, it's optimized for speed and on-premises deployment where tools like Claude Code can't go. – The New Stack

❝

"Frontier models will continue to push the limits, but practical AI products also require focal models: fast, specialized components that handle high-frequency tasks efficiently."
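
For anyone new to Mixture-of-Experts, the 2.5B-of-12B figure means roughly 21% of the weights are active for any given token. The sketch below shows generic top-k gating in plain Python to illustrate why: a router scores every expert, but only the top few actually run. This is a textbook illustration, not Mellum2's confirmed architecture, and the toy experts and scores are invented for the example.

```python
import math
from typing import Callable


def top_k_route(scores: list[float], k: int) -> dict[int, float]:
    """Pick the k highest-scoring experts and softmax-normalize their weights."""
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    peak = max(scores[i] for i in top)  # subtract the max for numerical stability
    exps = {i: math.exp(scores[i] - peak) for i in top}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}


def moe_forward(x: float, scores: list[float],
                experts: list[Callable[[float], float]], k: int) -> float:
    """Run only the selected experts and blend their outputs by router weight."""
    weights = top_k_route(scores, k)
    return sum(w * experts[i](x) for i, w in weights.items())


# Four toy experts; with k=2 only two of them execute for this token.
experts = [lambda x: x, lambda x: 2 * x, lambda x: 3 * x, lambda x: 4 * x]
router_scores = [0.1, 2.0, -1.0, 2.0]  # made-up router output for one token

print(top_k_route(router_scores, k=2))               # {1: 0.5, 3: 0.5}
print(moe_forward(1.0, router_scores, experts, k=2))  # 3.0
```

The unselected experts cost nothing at inference time, which is where the speed advantage comes from. In a real model the active share also includes layers every token passes through, so it does not reduce to a simple k-out-of-n ratio.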

"There is no accountability": AI coding agents are installing packages no one owns highlights a growing concern: as agents autonomously install dependencies, most enterprises have undefined ownership and limited visibility. AI has lowered the barrier to supply-chain attacks, and defenses need to catch up. – The New Stack
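
One simple defense is to gate agent-initiated installs on an ownership registry. The Python sketch below is a hypothetical illustration, not a tool from the article: the `OWNERS` mapping, team names, and package names are all made up, and a real setup would pull ownership from an internal catalog and enforce the check in CI or a package proxy.

```python
# Hypothetical registry mapping each approved package to the team that owns it.
OWNERS: dict[str, str] = {
    "requests": "platform-team",
    "numpy": "data-team",
}


def review_install(packages: list[str], owners: dict[str, str]) -> tuple[list[str], list[str]]:
    """Split an agent's install request into approved and blocked packages."""
    approved = [p for p in packages if p.lower() in owners]
    blocked = [p for p in packages if p.lower() not in owners]
    return approved, blocked


approved, blocked = review_install(["requests", "left-padd"], OWNERS)
print(approved)  # ['requests']
print(blocked)   # ['left-padd']
```

Anything in `blocked` would go to a human for a decision, so every dependency an agent adds ends up with a named owner.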

Speaking of agents and autonomy, if you're curious what happens when AI agents get their own persistent world to explore, SpaceMolt is a free MMO built specifically for AI agents to trade, battle, and build empires. An interesting sandbox for thinking about agent behavior at scale.

⚡ Quick Hits

AI Code Reviews: The Ultimate Guide - Greptile claims to reduce median merge time from ~20 hours to 1.8 hours with context-aware, beyond-the-diff feedback. – Greptile
Cloudflare Builds AI Code Review with OpenCode Orchestrator - Cloudflare is building its own AI code review system at scale. – Devdigest
Your Coding Agent Should Do AI System Engineering - Hugging Face's Ben Burtenshaw argues the real unlock is agents that understand architecture, not just syntax. – @aiDotEngineer

That's a wrap for this week. The tooling landscape is moving fast, with multi-agent orchestration, security integration, and better benchmarks all maturing in parallel. The common thread: writing code is increasingly the easy part. Verifying, securing, and measuring the output is where the real work lives now.

Made with ❤️ by Data Drift Press. Have thoughts on any of these stories? Hit reply - we read every message.