AI agents can write code and trigger CI. But when tests fail, they get back a wall of unstructured text. They can’t tell a flaky test from a regression, a single root cause from fifteen independent failures, or whether results have even finished processing.
The Loop
The agentic CI pattern is straightforward:
1. Agent writes code
2. Agent pushes to CI
3. Agent waits for results
4. Agent acts on them
The first three steps are solved. The last one — acting on results — is where agents stall. CI output is designed for humans scanning logs, not machines making decisions.
The Problem
When an agent gets back raw CI output, it has to:
- Parse framework-specific output formats (Playwright vs. Jest vs. Pytest — all different)
- Distinguish real failures from flaky tests
- Figure out if 15 failing tests are 15 bugs or 1 bug manifesting 15 ways
- Decide whether to fix, retry, or escalate
Agents handle step 1 poorly and steps 2-4 not at all. They end up applying surface-level fixes to individual test failures without understanding root causes.
Structured Test Data
Gaffer’s MCP server gives agents structured access to test history and analytics. Here’s what that looks like in practice.
Grouping Failures by Root Cause
Instead of an agent seeing 12 separate failures and trying to fix each one:
```json
// get_failure_clusters response
{
  "clusters": [
    {
      "representativeError": "Connection refused: localhost:5432",
      "count": 9,
      "tests": [
        { "name": "should create user", "fullName": "Auth > should create user" },
        { "name": "should update profile", "fullName": "Profile > should update profile" }
      ]
    },
    {
      "representativeError": "Expected 200, received 401",
      "count": 3,
      "tests": [
        { "name": "should access dashboard", "fullName": "Dashboard > should access dashboard" }
      ]
    }
  ],
  "totalFailures": 12
}
```
12 failures, 2 root causes. The agent fixes the database connection issue and the auth bug, not 12 individual tests.
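To make that concrete, here's a minimal sketch of agent-side logic that turns a clusters response into a prioritized fix plan. The response shape mirrors the example above; `plan_fixes` is an illustrative helper, not part of the Gaffer API.

```python
# Sketch: collapse a get_failure_clusters response into one work item per
# root cause, so the agent fixes 2 causes instead of patching 12 tests.

def plan_fixes(clusters_response: dict) -> list[dict]:
    """Return one work item per failure cluster, largest cluster first."""
    items = []
    for cluster in clusters_response["clusters"]:
        items.append({
            "error": cluster["representativeError"],
            "affected_tests": [t["fullName"] for t in cluster["tests"]],
            "failure_count": cluster["count"],
        })
    # Tackle the cluster that explains the most failures first.
    return sorted(items, key=lambda item: item["failure_count"], reverse=True)

# Same shape as the example response above (test lists truncated).
response = {
    "clusters": [
        {"representativeError": "Connection refused: localhost:5432",
         "count": 9,
         "tests": [{"name": "should create user",
                    "fullName": "Auth > should create user"}]},
        {"representativeError": "Expected 200, received 401",
         "count": 3,
         "tests": [{"name": "should access dashboard",
                    "fullName": "Dashboard > should access dashboard"}]},
    ],
    "totalFailures": 12,
}

plan = plan_fixes(response)  # two work items for twelve failures
```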
Distinguishing Regressions from Flaky Tests
```json
// get_test_history response
{
  "history": [
    { "testRunId": "run_abc", "test": { "status": "failed" }, "commitSha": "a1b2c3" },
    { "testRunId": "run_def", "test": { "status": "passed" }, "commitSha": "d4e5f6" },
    { "testRunId": "run_ghi", "test": { "status": "passed" }, "commitSha": "g7h8i9" }
  ],
  "summary": { "totalRuns": 3, "passedRuns": 2, "failedRuns": 1, "passRate": 66.67 }
}
```
Failed on the latest commit after passing on the two before it: likely a regression introduced by the newest change, not a flaky test.
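One way an agent could act on this history is a simple heuristic like the sketch below. This is not Gaffer's classification logic, just the reasoning from the text encoded in Python; a real classifier would weigh more signal, such as the test's long-run flake rate.

```python
# Sketch: a naive regression-vs-flaky heuristic over a get_test_history
# response. History is assumed newest-first, as in the example above.

def classify_failure(history_response: dict) -> str:
    runs = history_response["history"]
    if not runs or runs[0]["test"]["status"] != "failed":
        return "passing"
    prior = [r["test"]["status"] for r in runs[1:]]
    if prior and all(status == "passed" for status in prior):
        # Consistently green until the newest commit: likely a regression.
        return "likely-regression"
    # Intermittent failures across earlier commits look flaky instead.
    return "possibly-flaky"

# Same shape as the example response above.
response = {
    "history": [
        {"testRunId": "run_abc", "test": {"status": "failed"}, "commitSha": "a1b2c3"},
        {"testRunId": "run_def", "test": {"status": "passed"}, "commitSha": "d4e5f6"},
        {"testRunId": "run_ghi", "test": {"status": "passed"}, "commitSha": "g7h8i9"},
    ],
    "summary": {"totalRuns": 3, "passedRuns": 2, "failedRuns": 1, "passRate": 66.67},
}

verdict = classify_failure(response)  # "likely-regression"
```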
Checking If Results Are Ready
```json
// get_upload_status response
{
  "sessions": [
    {
      "id": "upl_xyz",
      "processingStatus": "completed",
      "commitSha": "a1b2c3d4",
      "pendingFileCount": 0
    }
  ]
}
```
No polling CI logs. `processingStatus: "completed"` means results are available for analysis.
Setup
```sh
claude mcp add gaffer -e GAFFER_API_KEY=gaf_your_api_key -- npx -y @gaffer-sh/mcp
```
The MCP server exposes 16 tools covering test results, failure analysis, coverage data, and upload status. The agent picks the right tool based on context; there's no configuration for which tool to call when.
No Code Access Required
Gaffer never sees your source code. It works entirely with test results — the structured output your CI pipeline already produces. Your CI uploads JUnit XML, coverage reports, or Playwright results. Gaffer stores and analyzes those artifacts. That’s it.
The MCP server is read-only. It can’t modify test results, trigger builds, or access your codebase. It answers questions about what happened in your test runs — nothing else.
Try It
The tools are in the Gaffer MCP server. You’ll need a Gaffer account with test results uploaded from your CI pipeline.