New SWE-CI Benchmark Reveals 75% of AI Coding Agents Break Working Code Over Time

A new benchmark from Alibaba researchers is challenging the narrative that AI coding agents are ready to replace human developers. Published on March 4, 2026, SWE-CI is the first evaluation framework that tests AI agents on what software engineers actually spend most of their time doing: maintaining and evolving existing codebases through continuous integration workflows.
The results are sobering: 75% of tested models broke previously working code during long-term maintenance tasks, even when their initial patches passed all tests.
What Makes SWE-CI Different
Most existing benchmarks like SWE-bench evaluate AI agents on isolated tasks: fix this bug, implement this feature, pass these tests. SWE-CI takes a fundamentally different approach by simulating the full lifecycle of real software projects.
Each of the benchmark's 100 tasks corresponds to an actual Python repository's evolution history, spanning an average of 233 days and 71 consecutive commits. Agents must work through dozens of iterative rounds of analysis and coding — just like a real developer maintaining a production codebase.
The benchmark was curated from 4,923 candidate repositories, filtering for projects with:
- More than 3 years of active maintenance
- Over 500 GitHub stars
- Permissive open-source licenses
- At least 500 modified lines of code
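The four curation filters above can be expressed as a simple predicate. This is a minimal sketch using hypothetical metadata field names (`years_maintained`, `stars`, `license`, `modified_loc` are illustrative, not the authors' actual schema):

```python
from dataclasses import dataclass

# Hypothetical repository metadata record; field names are
# illustrative, not the benchmark authors' actual schema.
@dataclass
class RepoMeta:
    name: str
    years_maintained: float
    stars: int
    license: str
    modified_loc: int  # total modified lines of code

# Example set of permissive licenses; the paper's exact list may differ.
PERMISSIVE = {"MIT", "Apache-2.0", "BSD-3-Clause"}

def qualifies(repo: RepoMeta) -> bool:
    """Apply the four SWE-CI curation filters described above."""
    return (
        repo.years_maintained > 3
        and repo.stars > 500
        and repo.license in PERMISSIVE
        and repo.modified_loc >= 500
    )
```

Applied to the 4,923 candidate repositories, a filter of this shape would yield the final pool from which the 100 tasks were drawn.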
The Results: A Reality Check
The researchers tested 18 models from 8 providers, spanning the Claude, GPT, DeepSeek, Qwen, MiniMax, Kimi, GLM-5, and Doubao families. The findings reveal a stark divide:
- Claude Opus models were the only ones exceeding a 50% zero-regression rate — meaning they managed to avoid breaking existing functionality more than half the time
- GLM-5 emerged as a strong secondary performer
- All other models scored below 25% on zero-regression rates
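As a rough illustration, a zero-regression rate can be computed as the fraction of tasks on which an agent never broke a previously passing test across any round. This sketch assumes per-round regression flags; the paper's exact definition may differ:

```python
def zero_regression_rate(task_results: list[list[bool]]) -> float:
    """Fraction of tasks completed without any regression.

    task_results[i][j] is True if the agent introduced a regression
    (broke a previously passing test) in round j of task i.
    Illustrative only; not the paper's exact formulation.
    """
    if not task_results:
        return 0.0
    clean_tasks = sum(1 for rounds in task_results if not any(rounds))
    return clean_tasks / len(task_results)
```

For example, an agent that stays clean on two of four tasks scores 0.5; by the article's numbers, only the Claude Opus models cleared that bar.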
Within the same provider family, newer models consistently achieved higher scores, with models released after January 2026 showing the largest gains over their predecessors.
EvoScore: A New Way to Measure Code Quality
One of SWE-CI's key contributions is EvoScore, a new evaluation metric that penalizes short-term optimization. Unlike traditional pass/fail test metrics, EvoScore weights later iterations more heavily than earlier ones.
This design choice exposes a common failure pattern: agents that produce quick fixes early on but create mounting technical debt that causes cascading failures in subsequent commits. An agent might score well on initial patches while leaving the codebase in a worse state overall.
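The article doesn't give the EvoScore formula, but the principle of weighting later iterations more heavily can be sketched with, say, linearly increasing weights. This is purely illustrative, not the paper's actual metric:

```python
def late_weighted_score(round_scores: list[float]) -> float:
    """Weighted average of per-round scores where round i gets weight i+1,
    so later rounds count more. An illustrative stand-in for EvoScore's
    later-iterations-matter-more principle, not the actual definition.
    """
    if not round_scores:
        return 0.0
    weights = range(1, len(round_scores) + 1)
    weighted_sum = sum(w * s for w, s in zip(weights, round_scores))
    return weighted_sum / sum(weights)
```

Under this scheme an agent scoring `[1.0, 1.0, 0.0]` (quick wins, then cascading failure) averages 0.5, while `[0.0, 1.0, 1.0]` (slow start, stable finish) averages about 0.83, which captures why quick-fix agents fare poorly on this kind of metric.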
Why This Matters
The gap between benchmark performance and real-world utility has been a growing concern in the AI coding space. Developers using tools like Cursor, Claude Code, and Devin have reported strong results for greenfield development but frustration with maintenance tasks — the work that typically consumes 60-80% of a software engineer's time.
As one researcher summarized: "Passing tests once is table stakes. Not breaking everything over time is the actual job."
The SWE-CI findings suggest that the AI coding industry has been optimizing for the wrong metric. Writing new code is the easy part. The hard part — and the part where AI agents still fall short — is maintaining, evolving, and not regressing a living codebase across months of continuous development.
What Comes Next
The benchmark is openly available under a CC BY 4.0 license, and the researchers have called on the community to adopt long-term maintenance evaluation as a standard practice for AI coding tools.
For development teams evaluating AI coding assistants, SWE-CI offers a more realistic lens: not whether an agent can write code, but whether it can be trusted to keep code working over time.
Source: SWE-CI: Evaluating Agent Capabilities in Maintaining Codebases via Continuous Integration