Your AI agent just “fixed” a test that’s been flipping between pass and fail for two weeks. It burned tokens debugging a flaky test with a 40% flip rate — one no human would touch. The agent doesn’t know any of this, because it has no memory of what happened before this session.
The Problem
AI coding agents operate without test history. Each session starts from zero. This leads to three failure modes that waste time and CI minutes:
Wasted cycles on flaky tests. An agent sees a test failure and tries to fix it. But the test has a 40% flip rate — it was going to pass on the next run anyway. The agent doesn’t know because it can’t see historical pass/fail patterns.
Duplicate work on shared root causes. CI reports 14 failures. The agent attempts 14 separate fixes. In reality, 11 of those failures share one root cause — a database connection timeout. Fixing one line would have resolved them all.
No before/after comparison. After making changes, the agent can’t verify whether a specific test actually improved. Did the fix work, or did a flaky test just happen to pass this time? Without comparing metrics across commits, the agent is guessing.
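To make the flakiness idea concrete, here is a minimal sketch of how a flip rate could be computed from a test's recent run history. The adjacent-pair definition (status changes divided by total runs) is an assumption chosen to match the numbers in this post; a real tool may weight recency or window the data differently.

```typescript
// Sketch: flip rate = number of adjacent status changes / total runs.
// This is an assumed definition, not Gaffer's documented formula.

type RunStatus = "passed" | "failed";

function computeFlipRate(history: RunStatus[]): number {
  if (history.length < 2) return 0;
  let flips = 0;
  for (let i = 1; i < history.length; i++) {
    // Count each consecutive pair whose status differs as one "flip".
    if (history[i] !== history[i - 1]) flips++;
  }
  return flips / history.length;
}

// A test that keeps changing status scores high:
const runs: RunStatus[] = ["passed", "failed", "passed", "passed", "failed"];
console.log(computeFlipRate(runs)); // 0.6 (3 flips across 5 runs)
```

Under this definition, 8 flips in 20 runs gives the 0.4 flip rate described above.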
This isn’t theoretical. Google’s 2025 DORA report found that AI coding tools correlate with a 7.2% decrease in delivery stability. The agents are fast at writing code, but they make poor decisions when they can’t see test history.
What Teams Try (and Why It Falls Short)
Piping CI logs into context. Raw CI output is unstructured text designed for humans. An agent can’t compute a flip rate from log lines. It can’t cluster failures by error similarity. And it definitely can’t compare results across commits when it’s parsing different log formats from Playwright, Jest, and Pytest.
General-purpose memory tools. Tools like mem0 or Letta give agents fuzzy vector recall — useful for remembering conversations, not for computing that test_checkout_flow has flipped 8 times in 20 runs. You can’t calculate a flakiness score from semantic similarity search.
Checking CI directly. GitHub Actions artifacts expire. Navigating workflow runs programmatically is brittle. And even if the agent finds the right artifact, it gets back HTML reports or raw XML — not structured data it can reason about.
Structured Test Memory
“Test memory” means giving agents deterministic, queryable access to test history — not fuzzy recall, but exact data: pass rates, flip counts, failure clusters, duration changes.
Gaffer’s MCP server exposes this as structured tool calls that any MCP-compatible agent (Claude, Cursor, Windsurf, Copilot) can use automatically.
Identify Flaky Tests Before Wasting Time
```json
// get_flaky_tests response
{
  "flakyTests": [
    {
      "name": "should complete checkout flow",
      "flipRate": 0.4,
      "flipCount": 8,
      "totalRuns": 20,
      "flakinessScore": 0.72
    }
  ],
  "summary": { "totalFlaky": 3, "threshold": 0.1, "period": 30 }
}
```

The agent sees a flakinessScore of 0.72 and a 40% flip rate. It skips the flaky test and focuses on real regressions.
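As an illustration of how an agent might act on a response like this, the sketch below filters a failure list against known flaky tests. The `realFailures` helper and the 0.5 cutoff are hypothetical; only the field names come from the example above.

```typescript
// Sketch: drop failures that are known to be flaky before attempting fixes.
// realFailures and the cutoff are illustrative, not part of Gaffer's API.

interface FlakyTest {
  name: string;
  flipRate: number;
  flakinessScore: number;
}

function realFailures(failed: string[], flaky: FlakyTest[], cutoff = 0.5): string[] {
  const known = new Set(
    flaky.filter((t) => t.flakinessScore >= cutoff).map((t) => t.name)
  );
  // Only failures NOT explained by flakiness deserve a fix attempt.
  return failed.filter((name) => !known.has(name));
}

const failed = ["should complete checkout flow", "should create user"];
const flaky: FlakyTest[] = [
  { name: "should complete checkout flow", flipRate: 0.4, flakinessScore: 0.72 },
];
console.log(realFailures(failed, flaky)); // ["should create user"]
```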
Group Failures by Root Cause
```json
// get_failure_clusters response
{
  "clusters": [
    {
      "representativeError": "Connection refused: localhost:5432",
      "count": 11,
      "tests": [
        { "name": "should create user", "fullName": "Auth > should create user" },
        { "name": "should update profile", "fullName": "Profile > should update profile" }
      ]
    },
    {
      "representativeError": "Expected 200, received 401",
      "count": 3,
      "tests": [
        { "name": "should access dashboard", "fullName": "Dashboard > should access dashboard" }
      ]
    }
  ],
  "totalFailures": 14
}
```

14 failures, 2 root causes. The agent fixes the database connection issue and the auth bug — not 14 individual tests.
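Gaffer's actual clustering algorithm isn't described here; as an illustration of the idea, a crude version can bucket failures by a normalized error signature, stripping digits so that ports, IDs, and timings don't split clusters. The `clusterFailures` helper below is hypothetical.

```typescript
// Sketch: group failures whose error messages differ only in volatile
// details (numbers). Real clustering is likely similarity-based.

interface Failure {
  test: string;
  error: string;
}

function clusterFailures(failures: Failure[]): Map<string, string[]> {
  const clusters = new Map<string, string[]>();
  for (const f of failures) {
    // "Connection refused: localhost:5432" -> "Connection refused: localhost:N"
    const signature = f.error.replace(/\d+/g, "N");
    const bucket = clusters.get(signature) ?? [];
    bucket.push(f.test);
    clusters.set(signature, bucket);
  }
  return clusters;
}

const failures: Failure[] = [
  { test: "should create user", error: "Connection refused: localhost:5432" },
  { test: "should update profile", error: "Connection refused: localhost:5432" },
  { test: "should access dashboard", error: "Expected 200, received 401" },
];
console.log(clusterFailures(failures).size); // 2 clusters for 3 failures
```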
Track Pass/Fail Across Commits
```json
// get_test_history response
{
  "history": [
    { "status": "failed", "commitSha": "a1b2c3", "durationMs": 4200 },
    { "status": "passed", "commitSha": "d4e5f6", "durationMs": 3800 },
    { "status": "passed", "commitSha": "g7h8i9", "durationMs": 3900 }
  ],
  "summary": { "totalRuns": 3, "passedRuns": 2, "failedRuns": 1, "passRate": 66.67 }
}
```

The most recent run failed after two consecutive passes — likely a regression introduced by the latest commit, not a flaky test.
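One way an agent might read such a history (newest run first, as in the example) is a simple regression heuristic. The `looksLikeRegression` helper is hypothetical, not a Gaffer API; real agents would also weigh the flakiness data from earlier tools.

```typescript
// Sketch: a fresh failure after a clean streak suggests a real break,
// not flakiness. Assumes history is ordered newest-first.

interface HistoryEntry {
  status: "passed" | "failed";
  commitSha: string;
}

function looksLikeRegression(history: HistoryEntry[]): boolean {
  return (
    history.length >= 2 &&
    history[0].status === "failed" && // latest run broke
    history.slice(1).every((h) => h.status === "passed") // all prior runs clean
  );
}

const history: HistoryEntry[] = [
  { status: "failed", commitSha: "a1b2c3" },
  { status: "passed", commitSha: "d4e5f6" },
  { status: "passed", commitSha: "g7h8i9" },
];
console.log(looksLikeRegression(history)); // true
```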
Compare Before and After
```json
// compare_test_metrics response
{
  "before": { "status": "failed", "durationMs": 12400, "commit": "a1b2c3" },
  "after": { "status": "passed", "durationMs": 3200, "commit": "d4e5f6" },
  "change": { "statusChanged": true, "percentChange": -74.2 }
}
```

The fix worked: status changed from failed to passed, duration dropped 74%. No guessing.
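The change fields fall out of simple arithmetic on the before/after records. This sketch reproduces them; `compareRuns` is a hypothetical helper, and the rounding to one decimal place is an assumption made to match the example.

```typescript
// Sketch: derive statusChanged and percentChange from two run records.

interface RunMetrics {
  status: string;
  durationMs: number;
}

function compareRuns(before: RunMetrics, after: RunMetrics) {
  return {
    statusChanged: before.status !== after.status,
    // Duration delta as a percentage of the "before" run, one decimal place.
    percentChange:
      Math.round(((after.durationMs - before.durationMs) / before.durationMs) * 1000) / 10,
  };
}

const change = compareRuns(
  { status: "failed", durationMs: 12400 },
  { status: "passed", durationMs: 3200 }
);
console.log(change); // { statusChanged: true, percentChange: -74.2 }
```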
Comparison
| Capability | CI Logs | Vector Memory | Gaffer MCP |
|---|---|---|---|
| Flip rate / flakiness score | No | No | Yes |
| Failure clustering by root cause | No | No | Yes |
| Cross-commit comparison | No | Approximate | Exact |
| Structured, queryable data | No | Fuzzy | Yes |
| Works across test frameworks | Varies | N/A | Yes |
| Agent can use without parsing | No | Yes | Yes |
Setup
Three steps. No code changes to your test suite.
1. Upload test results from CI
Add the Gaffer uploader to your CI pipeline. Supports GitHub Actions, GitLab CI, CircleCI, and others.
```yaml
- name: Upload to Gaffer
  if: always()
  uses: gaffer-sh/gaffer-uploader@v2
  with:
    api-key: ${{ secrets.GAFFER_UPLOAD_TOKEN }}
```

2. Add the MCP server
```shell
claude mcp add gaffer -e GAFFER_API_KEY=gaf_your_key -- npx -y @gaffer-sh/mcp@latest
```

Works with Claude Code, Cursor, Windsurf, and any MCP-compatible tool.
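For tools that load MCP servers from a JSON config file rather than a CLI (Cursor and Claude Desktop both read an `mcpServers` map), an equivalent entry would look roughly like the following; check your tool's documentation for the exact file location, as this sketch only mirrors the command above.

```json
{
  "mcpServers": {
    "gaffer": {
      "command": "npx",
      "args": ["-y", "@gaffer-sh/mcp@latest"],
      "env": { "GAFFER_API_KEY": "gaf_your_key" }
    }
  }
}
```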
3. Agent queries automatically
Once connected, the agent calls the right tool based on context. No configuration for when to use which tool — it sees a test failure and checks history, checks flakiness, clusters failures, and compares results across commits.
Get Started
Gaffer’s free tier includes test history and flaky test detection. The MCP server is open source.