Noqta
News · Mar 16, 2026 · 6 min read

New SWE-CI Benchmark Reveals 75% of AI Coding Agents Break Working Code Over Time

Alibaba researchers introduce SWE-CI, a benchmark that tests AI coding agents on real-world codebase maintenance across months of commits — exposing a critical gap between writing new code and keeping it working.

By Noqta Team

A new benchmark from Alibaba researchers is challenging the narrative that AI coding agents are ready to replace human developers. Published on March 4, 2026, SWE-CI is the first evaluation framework that tests AI agents on what software engineers actually spend most of their time doing: maintaining and evolving existing codebases through continuous integration workflows.

The results are sobering. Most tested models break previously working code in more than 75% of long-term maintenance tasks, even when their initial patches pass all tests.

What Makes SWE-CI Different

Most existing benchmarks like SWE-bench evaluate AI agents on isolated tasks: fix this bug, implement this feature, pass these tests. SWE-CI takes a fundamentally different approach by simulating the full lifecycle of real software projects.

Each of the benchmark's 100 tasks corresponds to an actual Python repository's evolution history, spanning an average of 233 days and 71 consecutive commits. Agents must work through dozens of iterative rounds of analysis and coding — just like a real developer maintaining a production codebase.

The benchmark was curated from 4,923 candidate repositories, filtering for projects with:

  • More than 3 years of active maintenance
  • Over 500 GitHub stars
  • Permissive open-source licenses
  • At least 500 modified lines of code
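To make the selection criteria concrete, here is a minimal sketch of how such a filter could be expressed. The `RepoStats` fields, the license list, and the sample repositories are all hypothetical; the article does not describe the researchers' actual pipeline or which licenses they counted as permissive.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RepoStats:
    """Hypothetical summary of one candidate repository."""
    name: str
    years_maintained: float
    stars: int
    license_id: str
    modified_lines: int


# Assumption: a small set of SPDX identifiers treated as permissive.
PERMISSIVE_LICENSES = {"MIT", "Apache-2.0", "BSD-2-Clause", "BSD-3-Clause"}


def passes_filters(repo: RepoStats) -> bool:
    """Apply the four criteria listed in the article."""
    return (
        repo.years_maintained > 3
        and repo.stars > 500
        and repo.license_id in PERMISSIVE_LICENSES
        and repo.modified_lines >= 500
    )


# Made-up candidates, for illustration only.
candidates = [
    RepoStats("repo-a", 5.0, 1200, "MIT", 900),
    RepoStats("repo-b", 2.0, 8000, "MIT", 4000),         # too young
    RepoStats("repo-c", 6.0, 300, "Apache-2.0", 700),    # too few stars
    RepoStats("repo-d", 4.0, 900, "GPL-3.0-only", 800),  # not permissive
]

selected = [repo.name for repo in candidates if passes_filters(repo)]
print(selected)
```

Only the first made-up repository satisfies all four conditions at once.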

The Results: A Reality Check

The researchers tested 18 models from 8 providers, spanning the Claude, GPT, DeepSeek, Qwen, MiniMax, Kimi, GLM, and Doubao families. The findings reveal a stark divide:

  • Claude Opus models were the only ones exceeding a 50% zero-regression rate — meaning they managed to avoid breaking existing functionality more than half the time
  • GLM-5 emerged as a strong secondary performer
  • All other models scored below 25% on zero-regression rates
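A zero-regression rate can be read as the share of tasks an agent completes without ever breaking a test that was passing earlier. The sketch below assumes that reading; the paper's exact definition may differ, and the function names and sample data are hypothetical.

```python
def has_regression(passing_per_round: list[set[str]]) -> bool:
    """True if any test that passed in one round fails in the next."""
    for previous, current in zip(passing_per_round, passing_per_round[1:]):
        if previous - current:  # tests that passed before but not now
            return True
    return False


def zero_regression_rate(tasks: dict[str, list[set[str]]]) -> float:
    """Fraction of tasks finished without a single regression."""
    if not tasks:
        raise ValueError("at least one task is required")
    clean = sum(1 for rounds in tasks.values() if not has_regression(rounds))
    return clean / len(tasks)


# Made-up data: each task maps to the set of passing tests after each round.
tasks = {
    "task-1": [{"t1"}, {"t1", "t2"}, {"t1", "t2", "t3"}],  # no regression
    "task-2": [{"t1", "t2"}, {"t2"}],                      # t1 broke
    "task-3": [{"t1"}, {"t1", "t2"}, {"t2"}],              # t1 broke later
    "task-4": [{"t1"}, set()],                             # t1 broke
}

rate = zero_regression_rate(tasks)
print(f"zero-regression rate: {rate:.2f}")
```

In this made-up sample one of four tasks stays clean, giving a rate of 0.25, the threshold most models in the study fell below.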

Within the same provider family, newer models consistently achieved higher scores, with models released after January 2026 showing the largest gains over their predecessors.

EvoScore: A New Way to Measure Code Quality

One of SWE-CI's key contributions is EvoScore, a new evaluation metric that penalizes short-term optimization. Unlike traditional pass/fail test metrics, EvoScore weights later iterations more heavily than earlier ones.

This design choice exposes a common failure pattern: agents that produce quick fixes early on but create mounting technical debt that causes cascading failures in subsequent commits. An agent might score well on initial patches while leaving the codebase in a worse state overall.
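The effect of weighting later iterations more heavily can be illustrated with a simple weighted average. This is not the paper's EvoScore formula, which the article does not give; the linear weights (1, 2, 3, ...) and the pass rates below are assumptions chosen only to show the idea.

```python
import math


def plain_mean(pass_rates: list[float]) -> float:
    """Unweighted average of per-iteration test pass rates."""
    return sum(pass_rates) / len(pass_rates)


def late_weighted_score(pass_rates: list[float]) -> float:
    """Weighted average where iteration i (1-based) gets weight i.

    Assumption: linear weights stand in for whatever scheme EvoScore uses.
    """
    weights = range(1, len(pass_rates) + 1)
    total = sum(w * r for w, r in zip(weights, pass_rates))
    return total / sum(weights)


# Made-up pass rates across three maintenance iterations.
quick_fixer = [0.9, 0.6, 0.3]     # strong start, then mounting failures
steady_builder = [0.3, 0.6, 0.9]  # slow start, healthier codebase later

for name, rates in [("quick_fixer", quick_fixer),
                    ("steady_builder", steady_builder)]:
    print(name, round(plain_mean(rates), 3),
          round(late_weighted_score(rates), 3))
```

Both hypothetical agents have the same plain average, but the late-weighted score separates them: the agent whose results degrade over time is ranked lower than the one whose codebase improves.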

Why This Matters

The gap between benchmark performance and real-world utility has been a growing concern in the AI coding space. Developers using tools like Cursor, Claude Code, and Devin have reported strong results for greenfield development but frustration with maintenance tasks — the work that typically consumes 60-80% of a software engineer's time.

As one researcher summarized: "Passing tests once is table stakes. Not breaking everything over time is the actual job."

The SWE-CI findings suggest that the AI coding industry has been optimizing for the wrong metric. Writing new code is the easy part. The hard part — and the part where AI agents still fall short — is maintaining, evolving, and not regressing a living codebase across months of continuous development.

What Comes Next

The benchmark is openly available under a CC BY 4.0 license, and the researchers have called on the community to adopt long-term maintenance evaluation as a standard practice for AI coding tools.

For development teams evaluating AI coding assistants, SWE-CI offers a more realistic lens: not whether an agent can write code, but whether it can be trusted to keep code working over time.

Source: SWE-CI: Evaluating Agent Capabilities in Maintaining Codebases via Continuous Integration
