Why `gaffer affected-tests` Returned Empty for Every E2E-Touching Edit I Made

By Alex Gandy May 13, 2026

A few months ago I used the Gaffer MCP to find and fix slow Playwright tests. The MCP held up. The actual win turned out to be a Better Auth SSR fix in the app layer, but the discovery process worked: ask Claude what’s slow, look at the patterns, follow the data.

This time I wanted to try the full agentic loop. Not just find something. Run the loop a coding agent should run: query flaky tests, propose a fix, ask Gaffer which tests are affected by that fix, run only those, verify, repeat. The agent never has to think “do I have time to run the whole E2E suite?” because the tool tells it which specs the change actually touches.

I never got past step two. gaffer affected-tests kept returning empty.

The probe

Five representative edits against this repo, run through the locally-built CLI:

$ gaffer affected-tests --files <path> --graph --format json
| Scenario | Tests found (heuristic) | Tests found (--graph) |
| --- | --- | --- |
| apps/dashboard/server/api/v1/projects/[id]/coverage-summary.get.ts | 0 | 0 |
| apps/dashboard/app/composables/useProjects.ts | 0 | 0 |
| apps/dashboard/server/utils/auth.ts | 0 | 0 |
| apps/dashboard/e2e/fixtures/test-data.ts | 0 | 7 specs |
| apps/dashboard/app/pages/dashboard.vue | 0 | 0 |

The fixture file is found because Playwright specs import it. No spec imports the other four, so --graph can’t help.

The reason is structural. Today’s affected-tests runs three strategies: naming convention (auth.ts → auth.test.ts), directory proximity (sibling __tests__/), and import-graph reverse-reachability (BFS over the project’s import graph). Playwright E2E specs interact with the application through page.goto(), not import. There is no edge from e2e/projects.spec.ts to the server route it exercises. The graph algorithm is doing exactly what it’s supposed to do; the algorithm just can’t see what we need it to see.
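To make the missing edge concrete, here is a minimal Python sketch of reverse reachability over an import graph. It is not Gaffer's Rust implementation: the edge list, `is_test`, and `affected_tests` are hypothetical names, and the edges are illustrative rather than this repo's real graph.

```python
from collections import deque

# Illustrative edges only (importer -> imported files). Note that no spec
# imports a server route: the spec reaches it over HTTP via page.goto().
IMPORTS = {
    "apps/dashboard/e2e/projects.spec.ts": [
        "apps/dashboard/e2e/fixtures/test-data.ts",
    ],
    "apps/dashboard/server/api/v1/projects/[id]/coverage-summary.get.ts": [
        "apps/dashboard/server/utils/auth.ts",
    ],
}


def is_test(path):
    """Assumed naming rule for test files in this sketch."""
    return path.endswith((".spec.ts", ".test.ts"))


def affected_tests(changed, imports):
    """Walk importer edges backwards from `changed` and collect test files."""
    # Invert the graph: imported file -> files that import it.
    importers = {}
    for source, targets in imports.items():
        for target in targets:
            importers.setdefault(target, []).append(source)

    seen = {changed}
    queue = deque([changed])
    found = []
    while queue:
        node = queue.popleft()
        for parent in importers.get(node, []):
            if parent in seen:
                continue
            seen.add(parent)
            if is_test(parent):
                found.append(parent)
            queue.append(parent)
    return sorted(found)
```

Under these assumed edges, the fixture resolves to the spec that imports it, while `apps/dashboard/server/utils/auth.ts` resolves to nothing: the walk reaches the server route and stops, because no spec sits upstream of it in the graph.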

Why “empty” was the dangerous answer

Pre-fix, gaffer affected-tests returned this for every one of those edits:

{ "affected": [], "run_command": null, "framework": null }

A coding agent reading that can reach two reasonable conclusions:

  1. “No tests touch this file. Safe to skip running E2E.”
  2. “Tool said nothing. Run everything to be safe.”

Neither one is what you want. Option (1) is actively misleading for a server-route change that obviously breaks E2E. Option (2) defeats the whole point of affected-tests: the agent now runs a 60-second suite on every iteration. The output could not distinguish between “we ran every signal and found nothing real” and “we ran in degraded mode because Gaffer history wasn’t reachable.” That ambiguity is the bug.

What shipped: signal-aware output

The fix lands in packages/gaffer-core/src/affected.rs. The CLI now tells the caller which signal sources it actually tried:

{
  "affected": [],
  "run_command": null,
  "framework": null,
  "signals": {
    "attempted": ["naming_convention", "directory_proximity", "import_graph"],
    "unavailable": ["coverage_history", "failure_history"]
  }
}

An agent reading signals.unavailable now knows the answer is incomplete. The decision becomes legible: “run the full suite because the tool only had three of five signals” instead of “trust an empty list because the tool gave me an empty list.”
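One way an agent might act on that output is sketched below in Python. The field names come from the JSON above. The function name `next_step`, the three return labels, and the policy itself are assumptions for illustration, not something Gaffer prescribes.

```python
def next_step(result: dict) -> str:
    """Pick an action from parsed `gaffer affected-tests` JSON output.

    Assumed policy for this sketch:
    - any affected tests        -> run just those
    - empty list, signals gone  -> the answer is incomplete, run everything
    - empty list, all signals   -> nothing found by any source, skip
    """
    if result.get("affected"):
        return "run_affected"

    signals = result.get("signals")
    # Output with no "signals" object (the pre-fix shape) gives no way to
    # tell a real empty result from a degraded one, so treat it as incomplete.
    if signals is None or signals.get("unavailable"):
        return "run_full_suite"

    return "skip"
```

Whether an empty list with every signal attempted is enough to skip E2E entirely is a judgment call; a more cautious agent could still fall back to the full suite for server-side changes.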

The same release adds a HistorySignalProvider trait that two strategies plug into:

  • Coverage history. For a changed source file, return tests that have historically executed coverage for that file. The Gaffer dashboard already stores coverage_files.test_run_ids linking covered files to the test runs that covered them. The data is already there. Wiring it up is mostly an HTTP round-trip the standalone CLI hasn’t gotten yet.
  • Failure history. For a changed source file, return tests that have flaked or failed in recent commits where that file was in the diff. A learned correlation rather than a static one.

Both are gated behind the trait so the CLI works fine without a Gaffer connection. It just reports coverage_history and failure_history as unavailable, and the agent treats the result accordingly. No new flags. The agent never has to know which mode it’s in.
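The gating can be pictured with a small Python sketch. The real HistorySignalProvider is a Rust trait whose methods are not shown here, so the shape below is a guess: `tests_for`, `OfflineProvider`, and `collect_signals` are hypothetical names, and returning `None` to mean "unavailable" is an assumed convention.

```python
from typing import List, Optional, Protocol


class HistorySignalProvider(Protocol):
    """Hypothetical Python stand-in for the Rust trait."""

    name: str

    def tests_for(self, changed_file: str) -> Optional[List[str]]:
        """Return matching tests, or None when the source is unreachable."""
        ...


class OfflineProvider:
    """Provider used when there is no Gaffer connection."""

    def __init__(self, name: str) -> None:
        self.name = name

    def tests_for(self, changed_file: str) -> Optional[List[str]]:
        return None  # no dashboard to ask


def collect_signals(changed_file, static_attempted, providers):
    """Merge provider results and record which sources could not answer."""
    attempted = list(static_attempted)
    unavailable = []
    affected = set()
    for provider in providers:
        tests = provider.tests_for(changed_file)
        if tests is None:
            unavailable.append(provider.name)
        else:
            attempted.append(provider.name)
            affected.update(tests)
    return {
        "affected": sorted(affected),
        "signals": {"attempted": attempted, "unavailable": unavailable},
    }
```

With two offline providers named `coverage_history` and `failure_history`, the result carries the same `signals` object as the JSON shown earlier, without the caller passing any flag to select the mode.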

A separate detail worth surfacing: each test in the affected list now carries the full list of signals that selected it.

{
  "test_file": "packages/gaffer-core/src/affected.rs",
  "confidence": 0.97,
  "strategy": "naming_convention",
  "signals": [
    { "strategy": "naming_convention", "confidence": 0.9 },
    { "strategy": "import_graph", "confidence": 0.7 }
  ]
}

The single-attribution strategy field is preserved for any existing JSON consumer; the signals array is additive. An agent that wants to reason about confidence intelligently (“two independent signals both selected this test, so I trust it more than the one with only directory proximity”) can do that now.
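The numbers in that example happen to be consistent with a noisy-OR combination of independent signals: 1 − (1 − 0.9)(1 − 0.7) = 0.97. How Gaffer actually combines per-signal confidences is not stated here, so treat the Python sketch below as one plausible reading; `combine_noisy_or` is a hypothetical name.

```python
def combine_noisy_or(confidences):
    """Combine independent confidences: the chance that at least one is right.

    Assumption: signals are treated as independent. This matches the example
    values above but is not confirmed to be Gaffer's formula.
    """
    miss = 1.0
    for confidence in confidences:
        miss *= 1.0 - confidence  # probability that every signal is wrong
    return 1.0 - miss


combined = combine_noisy_or([0.9, 0.7])
print(round(combined, 2))
```

An agent could apply the same rule to the signals array to rank tests, so a test backed by two sources outranks one selected by directory proximity alone.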

The Playwright fix that actually landed

The point of the signal-aware refactor is to make future flake hunts cheaper. The point of this branch was also to fix one. Even without coverage-history wired to a live dashboard yet, a static scan of apps/dashboard/e2e/** surfaced one concrete bug: setup-checklist.spec.ts used page.reload({ waitUntil: 'networkidle' }) four times.

networkidle waits for 500 ms of zero network activity. Our dashboard does not reliably reach that: analytics pings, session refresh, background fetches, and websocket heartbeats keep the network busy. Playwright’s own documentation marks networkidle as discouraged and recommends relying on web assertions instead. Every one of those four reloads was followed immediately by an expect(...).toBeVisible() assertion that auto-retries until the element appears. The networkidle argument added flake risk and no signal.

Removed. The default load event plus the existing assertions is the deterministic wait.

A second gap, found while validating the first

A funny thing happened on the way to merging this branch. Running setup-checklist.spec.ts six times locally to confirm the networkidle removal hadn’t introduced a regression, I got a clean result: two tests passed every time, one test failed every time. The failing one was the same Progress Updates spec across six rounds.

Querying Gaffer:

$ gaffer query flaky
[]

Empty. The deterministic failure was not surfaced as flaky, because Gaffer’s flaky algorithm looked for one signal: status flips. If a test always fails, flip_rate = 0, and the composite-score gate filtered it out.

But the test was failing differently each round. Some rounds hit a “Target page, context or browser has been closed” cascade. Others timed out on expect(getByText('3/6 complete')).toBeVisible(). Same test, same code, same database. Two distinct error patterns across six runs. That is a flakiness signal: the test’s failure mode is non-deterministic even though its pass/fail outcome is deterministic. The algorithm missed it.

The fix

The composite score in packages/gaffer-core/src/intel/flaky.rs gains a third term: failure-pattern variance. The full formula is now:

composite = (flip_rate × 0.4) + (failure_rate × 0.4) + (pattern_variance × 0.2)

pattern_variance is computed from the count of distinct normalized error messages across the test’s failures. One pattern: zero variance. Two patterns: 0.25. Saturates at five distinct patterns. The same normalization function that powers Gaffer’s failure clustering supplies the fingerprint, so this isn’t a separate ad-hoc heuristic. It’s the same notion of “different errors” that the clustering surface already uses.

The gate that decides whether a test enters the flaky list expands too. A test is considered flaky if flip_rate >= 0.1 OR distinct_failure_patterns >= 2. Both checks still require a minimum sample size of five runs, so a test that fails three times with three errors is not flagged on a thin sample.
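The scoring and the gate can be sketched in a few lines of Python. The weights, thresholds, and minimum sample size are the ones stated above. The linear ramp between two and five patterns is an assumption that fits the stated points (one pattern gives 0, two give 0.25, saturation at five), and the function names are hypothetical rather than taken from flaky.rs.

```python
MIN_RUNS = 5  # minimum sample size before a test can be flagged


def pattern_variance(distinct_patterns: int) -> float:
    """Map distinct failure patterns to [0, 1]: 1 -> 0.0, 2 -> 0.25, 5+ -> 1.0.

    Assumed linear between those points.
    """
    return min(max(distinct_patterns - 1, 0), 4) / 4


def composite_score(flip_rate: float, failure_rate: float, distinct_patterns: int) -> float:
    """Weighted sum of the three terms from the formula above."""
    return (
        flip_rate * 0.4
        + failure_rate * 0.4
        + pattern_variance(distinct_patterns) * 0.2
    )


def is_flaky(flip_rate: float, distinct_patterns: int, total_runs: int) -> bool:
    """Gate: enough runs, and either status flips or varying failure modes."""
    if total_runs < MIN_RUNS:
        return False
    return flip_rate >= 0.1 or distinct_patterns >= 2


# The Progress Updates test: six runs, all failures, two error patterns.
score = composite_score(flip_rate=0.0, failure_rate=1.0, distinct_patterns=2)
print(round(score, 2), is_flaky(0.0, 2, 6))
```

For the Progress Updates test this works out to 0.0 × 0.4 + 1.0 × 0.4 + 0.25 × 0.2 = 0.45, and the gate passes on the pattern count alone. A test with three failures and three distinct errors stays unflagged because it falls below the five-run minimum.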

Before and after

Same database, same query. Pre-fix:

$ gaffer query flaky
[]

Post-fix:

[
  {
    "test_name": "setup-checklist.spec.ts > Setup Checklist > Progress Updates > shows progress as steps are completed [chromium]",
    "flip_rate": 0.0,
    "flip_count": 0,
    "total_runs": 6,
    "composite_score": 0.45
  }
]

The 0.45 composite score is (0.0 × 0.4) + (1.0 × 0.4) + (0.25 × 0.2): zero flips, 100% failure rate, two distinct error patterns out of six failures contributing the 0.25 variance term. The test is now visible to any agent reading the flaky-test surface.

A third change, prompted by reading the skill

While writing up the agentic-loop pattern in .claude/skills/gaffer-cli/SKILL.md, the canonical command looked like this:

gaffer test -- $(gaffer affected-tests --files <changed> --graph --format json | jq -r .run_command)

That is wild. Every flag in that command is the default an agent wants. --graph should be on. --format json is the only useful output mode. | jq -r .run_command is boilerplate extracting one field from a known shape. The whole command is a wrapper around a wrapper around the answer.

Three changes, all in this PR:

  1. --graph defaults to on. Pass --no-graph to opt out (faster on huge codebases, but loses indirect dependencies).
  2. --print-cmd shortcut. gaffer affected-tests --files X --print-cmd prints just the bare run_command string. Exit 1 if no command is available so gaffer test -- $(gaffer affected-tests --files X --print-cmd) naturally fails fast on the empty case.
  3. gaffer test --affected --files X. Integrated subcommand that collapses the whole loop into one invocation. Runs the affected-tests strategy stack internally, scopes the runner to just the relevant test files, and parses results as usual. --on-empty=skip|fail controls the empty-list behavior.

The agentic-loop sequence in the skill goes from three commands plus shell substitution to one:

gaffer test --affected --files path/to/changed-file.ext

Worth surfacing because the dogfood loop only revealed it once I tried to write the skill. Reading my own draft made the surface area obvious: an agent forced through that command was an agent paying a tax for tool ceremony that should never have existed.

What’s deferred

The honest list:

  • The dashboard API endpoint that backs coverage_history and failure_history lookups. Writing the code is straightforward; validating it requires the dashboard, a Neon database with seeded coverage data, and integration tests. Better as a follow-up PR against staging than rushed into this branch.
  • An MCP get_affected_tests tool so an agent can stay in one channel. The MCP architecture rule says tools call the dashboard API rather than shelling out to the CLI; that puts this work after the dashboard endpoint lands.
  • BasePage.waitForLoading() migration. Eleven E2E call sites use a @deprecated helper with a quiet race condition (count spinners, wait only if visible, swallow errors). Most call sites have an explicit expect(...).toBeVisible() right after, making the helper a no-op. Removing it would be a real simplification, but the change touches eleven specs and BasePage. Without a way to run the suite ten times in a row against real Gaffer history, this is filed as follow-up rather than risked in this branch.

Takeaway

The predecessor post landed on a clean answer: my flaky tests were slow because of an SSR auth bug in the app, not because of anything Gaffer was missing. Tools were fine, the app was wrong, fix the app.

This one landed in a different place: the tool itself had structural gaps. Two of them, found in the same loop. The first: the agentic-loop framing the predecessor post promised (find a flaky test, narrow with affected-tests, fix, verify) quietly assumed affected-tests could see E2E specs that hit URLs. It can’t, and the empty result it returned looked exactly like a healthy result. The second only surfaced because the fix for the first was being validated: a test that fails on every run, but with a different error each time, is just as broken as one that flips, and Gaffer was reporting it as healthy.

The fix to the first gap isn’t to make the import graph smarter. It’s to admit when the answer is incomplete, scaffold the signals that can see E2E (coverage history, failure history), and let the data the SaaS already collects do the work the static algorithm structurally can’t. The fix to the second one is shorter: don’t only look at the binary pass/fail dimension when the failure mode itself can vary.

Coding agents reading flaky-test or affected-tests output don’t need new flags. They need the result to tell them what it actually means. That’s what this branch ships.
