Test Intelligence: The Missing Context for AI Coding Tools

By Alex Gandy · February 14, 2026

AI coding tools know your code. They can read every file, trace every import, understand every type. But ask them about your tests — which ones are flaky, which failures are regressions, which files have zero coverage — and they have nothing. This is the test intelligence gap.

The code is local. The test intelligence isn’t.

The Context Gap

When Claude Code or Cursor suggests a change to your auth module, it can read the source, understand the types, and even write new tests. What it can’t do is answer basic questions about your test suite’s state:

  • Has “should validate session token” been failing for three days, or did it just break?
  • Are those 14 test failures 14 bugs, or one database connection issue?
  • Does the file you’re editing have 90% coverage or 0%?

Without this context, AI tools make predictable mistakes. They “fix” flaky tests that aren’t actually broken. They write new tests for well-covered code while ignoring untested critical paths. They treat every failure as novel, even when the same error has been clustering across runs for a week.

These aren’t capability problems. The tools are capable. They’re information problems.

Context Enrichment Over Agent Orchestration

A simpler pattern works better than autonomous agent pipelines: give existing tools better context and let them reason with it.

An AI tool with the right information makes better decisions than an elaborate agent pipeline with poor inputs. You don’t need a dedicated “test-fixing agent” if your coding assistant already understands test history. The reasoning capability is already there. The data isn’t.

But context enrichment for testing requires more than piping CI logs into a prompt. CI output is unstructured text designed for human eyes scanning a terminal. It tells you what failed. It doesn’t tell you whether this failure matters.

What Test Intelligence Actually Requires

There’s a meaningful difference between “this test failed” and “this test has been flipping between pass and fail for 30 runs with a 40% flip rate.” The first is a CI log. The second is test intelligence.

Temporal context — Is this failure new? A test that failed once after 50 consecutive passes is a regression. A test that alternates pass/fail every few runs is flaky. The correct response to each is completely different, and you can’t distinguish them from a single CI run.

Causal context — Are these failures related? Fifteen failing tests look like fifteen problems. But if they all share Connection refused: localhost:5432 in their error messages, it’s one problem. Without failure clustering, an AI tool will attempt fifteen fixes.

Coverage context — What’s actually tested? When an AI tool modifies server/services/billing.ts, it should know that file has 12% line coverage before deciding whether to add tests. And it should know that file is also associated with recent failures — making it a high-risk area.

Comparative context — Did my change make things worse? Comparing test metrics between two commits gives a clear signal: did this specific test get slower, and did it start failing?

No amount of CI log parsing provides this. It requires structured data accumulated across many test runs over time.
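The temporal distinction above can be sketched in a few lines. This is an illustrative classifier, not Gaffer's implementation: the `RunStatus` shape, the flip-rate formula (flips divided by transitions, which matches the 8-flips-in-20-runs ≈ 0.42 figure later in this post), and the 0.3 threshold are all assumptions.

```typescript
// Sketch: telling a regression apart from a flaky test using run history.
// Hypothetical types and threshold — the real signal comes from Gaffer's
// accumulated data, not a hard-coded cutoff.
type RunStatus = "passed" | "failed";

function flipCount(history: RunStatus[]): number {
  // A "flip" is any pass->fail or fail->pass transition between adjacent runs.
  let flips = 0;
  for (let i = 1; i < history.length; i++) {
    if (history[i] !== history[i - 1]) flips++;
  }
  return flips;
}

function classifyLatestFailure(history: RunStatus[]): "regression" | "flaky" {
  const transitions = Math.max(history.length - 1, 1);
  const flipRate = flipCount(history) / transitions;
  // Frequent flipping suggests flakiness; a long stable streak that just
  // broke suggests a genuine regression.
  return flipRate >= 0.3 ? "flaky" : "regression";
}
```

A single CI run gives this function a history of length one, which is exactly why it can't work without accumulated data.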

Gaffer’s MCP Approach

Gaffer’s MCP server exposes test intelligence as structured tools that any MCP-compatible editor can call. Rather than dumping raw data into context windows, each tool answers a specific question.

“Is this test flaky or did I break it?”

An AI tool sees a test failure after your change. Before attempting a fix, it calls get_test_history:

// get_test_history for "should validate session token"
{
  "history": [
    { "status": "failed", "commitSha": "a1b2c3", "durationMs": 1230 },
    { "status": "passed", "commitSha": "d4e5f6", "durationMs": 980 },
    { "status": "failed", "commitSha": "g7h8i9", "durationMs": 1150 },
    { "status": "passed", "commitSha": "j0k1l2", "durationMs": 1020 }
  ],
  "summary": { "totalRuns": 4, "passedRuns": 2, "failedRuns": 2, "passRate": 50.0 }
}

50% pass rate, alternating pass/fail. This is a flaky test — not a regression introduced by the current change. The correct action is to flag it, not to “fix” the code under test.

Cross-referencing with get_flaky_tests confirms it:

// get_flaky_tests (excerpt)
{
  "flakyTests": [
    {
      "name": "should validate session token",
      "flipRate": 0.42,
      "flipCount": 8,
      "totalRuns": 20,
      "flakinessScore": 0.71
    }
  ]
}

A flip rate of 0.42 across 20 runs. This test was unreliable long before the current commit.

“What actually broke?”

A CI run reports 18 failures. The AI tool calls get_failure_clusters:

// get_failure_clusters
{
  "clusters": [
    {
      "representativeError": "ECONNREFUSED 127.0.0.1:5432",
      "count": 14,
      "similarity": 0.92
    },
    {
      "representativeError": "Expected status 200, received 403",
      "count": 3,
      "similarity": 0.87
    },
    {
      "representativeError": "Timeout waiting for selector [data-testid='modal']",
      "count": 1,
      "similarity": 1.0
    }
  ],
  "totalFailures": 18
}

Three root causes, not eighteen. Fourteen tests failed because the database wasn’t running — an infrastructure issue, not a code bug. The AI tool can ignore those and focus on the permissions error and the missing UI element.
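The idea behind clustering can be sketched with a toy version. Gaffer's `get_failure_clusters` reports a similarity score, suggesting fuzzy matching; this sketch instead groups on an exact normalized signature, which is a much cruder assumption, but it shows why 14 connection errors collapse into one cluster.

```typescript
// Sketch: grouping failures by a normalized error signature. Volatile
// tokens (ports, counts, hex ids) are masked so near-identical errors
// land in the same bucket. Types and normalization rules are illustrative.
interface Failure {
  test: string;
  error: string;
}

function signature(error: string): string {
  return error
    .replace(/0x[0-9a-f]+/gi, "<hex>") // pointer-style ids
    .replace(/\d+/g, "<n>")            // ports, counts, line numbers
    .trim();
}

function clusterFailures(failures: Failure[]): Map<string, Failure[]> {
  const clusters = new Map<string, Failure[]>();
  for (const f of failures) {
    const key = signature(f.error);
    const bucket = clusters.get(key) ?? [];
    bucket.push(f);
    clusters.set(key, bucket);
  }
  return clusters;
}
```

Eighteen failures in, a handful of signatures out — the AI tool then reasons about clusters, not individual stack traces.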

“What’s not tested in the code I’m changing?”

Before modifying server/services/billing.ts, the AI tool calls get_coverage_for_file:

// get_coverage_for_file for "server/services/billing.ts"
{
  "hasCoverage": true,
  "files": [
    {
      "path": "server/services/billing.ts",
      "lines": { "covered": 14, "total": 118, "percentage": 11.86 },
      "branches": { "covered": 2, "total": 18, "percentage": 11.11 },
      "functions": { "covered": 3, "total": 15, "percentage": 20.0 }
    }
  ]
}

12% line coverage. If the AI tool writes changes to this file without adding tests, it’s modifying largely untested code. It can then call find_uncovered_failure_areas to see if this file also has associated failures — a sign that it’s both poorly tested and actively problematic.
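Combining those two signals is a simple decision, which a sketch makes concrete. The input shape, the 30% coverage threshold, and the risk labels are all illustrative assumptions, not part of Gaffer's API.

```typescript
// Sketch: flagging a file as high-risk before editing it, by combining
// coverage (from get_coverage_for_file) with recent failure activity
// (from find_uncovered_failure_areas). Thresholds are illustrative.
interface FileSignal {
  lineCoveragePct: number; // e.g. 11.86 for billing.ts above
  recentFailures: number;  // failures associated with this file
}

function riskLevel(s: FileSignal): "high" | "medium" | "low" {
  const poorlyTested = s.lineCoveragePct < 30;
  if (poorlyTested && s.recentFailures > 0) return "high"; // untested AND breaking
  if (poorlyTested || s.recentFailures > 0) return "medium";
  return "low";
}
```

A "high" result is a prompt to write tests before touching the code, not after.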

The Compound Effect

The value of test intelligence compounds across sessions. A single CI run tells you what happened right now. Accumulated test history tells you what’s normal.

When an AI tool has access to weeks or months of test data, it can distinguish signal from noise:

  • A test that fails once after passing for weeks is worth investigating. A test with a 40% flip rate is not — at least, not as an immediate blocker.
  • A test that passed consistently then started failing after a specific commit (visible via compare_test_metrics) points directly at the regression source.
  • The slowest tests in the suite (via get_slowest_tests) can inform whether to run the full suite or a targeted subset during development.

None of this works with a single snapshot. It requires historical data and the structured tools to query it.
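The commit-to-commit comparison in the second bullet reduces to a per-test diff. This sketch assumes a minimal metric shape; `compare_test_metrics` presumably returns something richer, but the core signal is the same two fields.

```typescript
// Sketch: diffing one test's metrics between a base and head commit,
// in the spirit of compare_test_metrics. Field names are assumptions.
interface TestMetric {
  status: "passed" | "failed";
  durationMs: number;
}

interface TestDelta {
  regressed: boolean;      // passed at base, failing at head
  durationDeltaMs: number; // positive means the test got slower
}

function compareTest(base: TestMetric, head: TestMetric): TestDelta {
  return {
    // A pass->fail transition between exactly these two commits points
    // directly at the head commit as the regression source.
    regressed: base.status === "passed" && head.status === "failed",
    durationDeltaMs: head.durationMs - base.durationMs,
  };
}
```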

Getting Started

Three steps to connect your AI coding tools to your test history.

1. Upload test results from CI

Add one step to your CI pipeline. Here’s GitHub Actions with Playwright:

- name: Upload to Gaffer
  if: always()
  uses: gaffer-sh/gaffer-uploader@v2
  with:
    gaffer_upload_token: ${{ secrets.GAFFER_UPLOAD_TOKEN }}
    report_path: playwright-report/

2. Add the MCP server to your editor

For Claude Code:

claude mcp add gaffer -e GAFFER_API_KEY=gaf_your_api_key -- npx -y @gaffer-sh/mcp

For Cursor, Windsurf, or any MCP-compatible client:

{
  "mcpServers": {
    "gaffer": {
      "command": "npx",
      "args": ["-y", "@gaffer-sh/mcp"],
      "env": {
        "GAFFER_API_KEY": "gaf_your_api_key"
      }
    }
  }
}

3. Use it

No special prompting required. The tools are available to the AI automatically. Ask “is this test flaky?” or “what’s the coverage of the file I’m editing?” and the tool calls happen behind the scenes.

The MCP server is open source — @gaffer-sh/mcp on npm. Gaffer’s free tier includes test history and analytics. Full docs at /docs/mcp/.
