Monday morning. Coffee in hand, you open the latest CI run and the page is a wall of red. Forty test failures. Eighty. Sometimes more. The first instinct is to click into the first failure and start reading the stack trace. The second is to wonder, by failure number fifteen, whether you are looking at fifteen bugs or one bug echoed across fifteen test files.
That second instinct is the right one. Most red CI runs at scale are a handful of root causes wearing many costumes. The work is not investigating every failure. The work is figuring out which failures collapse into the same underlying problem, then fixing the cause once.
What is a test failure?
A test failure is an automated test whose actual result did not match its expected result. The test ran, the assertion fired, and the runner recorded a non-passing outcome. The word for it in most test runners is simply “fail” or “failed”; depending on the framework you may also see “error” (the test threw an exception before the assertion ran) or “broken” (the setup itself blew up). The distinction matters for triage but the high-level meaning is the same: something the suite expected to be true was not true.
A useful framing: a test failure is a signal, not a verdict. It tells you the suite caught a discrepancy. Whether that discrepancy is a real defect, a flake, an outdated assertion, or an environment problem is the next question.
Types of test failures
Before getting to the ten reasons individual tests fail, it helps to sort failures into a smaller number of buckets. Triage works differently for each bucket, and conflating them is the most common reason debugging stalls.
Consistent failures (real bugs)
A test fails every time on the same code, with the same error. This is the bucket you want most failures to live in. The signal is high, the cause is in the diff, and the fix is local. The two questions are which commit introduced the failure and what behavior changed.
Flaky test failures (non-deterministic)
The test fails sometimes and passes other times against the same code. Causes are usually timing, shared state, or environmental jitter (network, clock, ordering). Flaky failures are dangerous because they erode trust in the suite: developers learn to hit rerun, real regressions hide in the noise, and the actual fix is rarely in the test that “failed.” See our deeper write-up on flaky test detection for the diagnostic patterns.
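The definition above ("fails sometimes and passes other times against the same code") can be checked mechanically if you keep run history. The sketch below is a minimal illustration, assuming a hypothetical `TestRun` record with a test name, a commit SHA, and a pass/fail flag; it is not how any particular tool detects flakiness.

```typescript
// Hypothetical record shape: one entry per test execution in CI.
interface TestRun {
  testName: string;
  commitSha: string; // the code revision the test ran against
  passed: boolean;
}

type Stability = "passing" | "failing" | "flaky";

// A test is "flaky" if any single commit produced both a pass and a fail.
// Otherwise it is "failing" if it failed on at least one commit, else "passing".
function classifyStability(runs: TestRun[]): Map<string, Stability> {
  // testName -> commitSha -> set of outcomes seen for that commit
  const seen = new Map<string, Map<string, Set<boolean>>>();
  for (const run of runs) {
    const byCommit = seen.get(run.testName) ?? new Map<string, Set<boolean>>();
    const outcomes = byCommit.get(run.commitSha) ?? new Set<boolean>();
    outcomes.add(run.passed);
    byCommit.set(run.commitSha, outcomes);
    seen.set(run.testName, byCommit);
  }

  const result = new Map<string, Stability>();
  seen.forEach((byCommit, testName) => {
    let mixedOnSameCommit = false;
    let anyFailure = false;
    byCommit.forEach((outcomes) => {
      if (outcomes.size > 1) mixedOnSameCommit = true;
      if (outcomes.has(false)) anyFailure = true;
    });
    result.set(
      testName,
      mixedOnSameCommit ? "flaky" : anyFailure ? "failing" : "passing",
    );
  });
  return result;
}
```

Keying on the commit is the important part: a test that passes on one commit and fails on the next is a candidate regression, not a flake.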
New failures (regressions)
A test that passed yesterday fails today, and no one changed the test. By definition the cause is in the diff. The triage move is to read the changeset, not the assertion. New failures are the highest-value bucket for blocking a merge: if a test was healthy and now isn’t, something shipped that broke it.
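Spotting this bucket is a set intersection: tests that passed in the previous run and fail in the current one. A minimal sketch, assuming you can extract both lists of test names from your reporter output (the function name is illustrative, and it does not check whether the test file itself changed):

```typescript
// Tests that were healthy in the previous run and fail in the current one.
// Tests that are new in the current run are deliberately excluded, since
// they have no passing history to regress from.
function findRegressions(
  previousPassed: string[],
  currentFailed: string[],
): string[] {
  const healthy = new Set(previousPassed);
  return currentFailed.filter((name) => healthy.has(name));
}
```

For example, `findRegressions(["login", "checkout"], ["checkout", "brand-new-test"])` keeps only `"checkout"`.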
Environment failures (infrastructure, not code)
The test failed because the runner couldn’t talk to the database, the Docker image didn’t pull, the test container ran out of disk, the secret rotated, or the CI provider had a region outage. The application code is fine. These failures cluster heavily because every test that depends on the broken dependency will fail with the same error.
Performance-related failures
Tests that pass functionally but exceed a timeout, a memory budget, or a duration threshold. Often a mix of real regression (a function got slower) and environment drift (the runner was loaded). Treat these as their own bucket because the fix is usually budget-tuning or batching, not assertion logic.
10 reasons software tests fail
These are the proximate causes you will see in the error messages. The taxonomy above tells you which bucket a given failure belongs to; this list tells you what produced the specific exception.
Incorrect assertions
The test asserts the wrong thing. The product is fine, the expectation is stale. Common after a refactor that legitimately changes a response shape or a UI label without anyone updating the corresponding test.
Real product bugs
The change under test introduced a defect. The assertion is correct, the code under it is wrong. The fix is in production code, not test code.
Flaky/unstable test scripts
The test is racy: it waits on a timer instead of a condition, reuses global state, or depends on test ordering. The test code itself is the bug.
Unstable test data
Tests share a database, a seed, or a fixture, and one test mutates state another test reads. The failure appears in test B but the cause is in test A’s teardown.
Environment configuration mismatches
A required environment variable is missing in CI, the Node version differs from local, or a feature flag is enabled in one environment and not another. Every test that exercises the affected path fails identically.
Dependency/third-party changes
A package updated. An external API changed a response. A staging service was redeployed mid-run. The suite is unchanged but the world around it shifted.
Timing and synchronization issues
waitForLoadState('networkidle') on an app that pings analytics every two seconds. setTimeout(500) on an async operation that usually finishes in 300ms but sometimes takes 700. Implicit waits where explicit waits belong.
Inadequate test coverage
A bug ships because no test guarded the path. Strictly this is the absence of a failure, but it surfaces later as a regression in a test that did exist, often a poorly-targeted integration test that catches the breakage downstream of where it was introduced.
Parallel execution conflicts
Two test files write to the same temp directory, claim the same port, or hit the same database row. They pass in isolation and fail when the runner schedules them concurrently against the same machine or database.
Poor test maintenance
Old tests asserting deprecated behavior, commented-out skips that quietly stopped covering anything, fixtures referencing dead endpoints. The suite drifts away from the application until enough small mismatches accumulate to make every run noisy.
When 200 failed tests means 3 real bugs
This is the section that the rest of the post exists to set up.
Why failure counts are misleading
Failure counts measure how many tests caught a problem, not how many problems there were. One downed dependency can fail every test that touches it. One bad mock setup can fail every test that imports the affected module. One environment variable left out of the CI config can fail every test in a suite. The 200-failure run and the 3-failure run can describe the same underlying breakage; the spread is just how widely the code under test is exercised.
Triaging by count gives you the wrong priority. The infrastructure outage with 80 failures gets the team’s full attention while the genuine regression hiding in 2 of those failures sits invisible behind the noise.
How to cluster failures by root cause
The technique is to group failures by what their error messages have in common, then investigate one representative per cluster. The Gaffer dashboard does this automatically (see failure clustering for the full mechanics), but you can do it by hand: sort the failure list by error message, look for duplicates, look for near-duplicates that differ only in a timestamp or an ID, and treat each group as a single ticket.
A real example from this codebase. A static scan of our dashboard E2E suite surfaced four uses of page.reload({ waitUntil: 'networkidle' }) inside a single spec file. That is exactly the anti-pattern Playwright’s own docs flag: on an app that pings analytics every couple of seconds, networkidle never settles and every reload races a timeout. Four anti-pattern call sites, one spec, one root cause, one fix (waitUntil: 'domcontentloaded'). We wrote up the algorithm-gap story behind that finding in the affected-tests dogfood post when it shipped.
The same pattern scales linearly with the suite. The math at 200 is just the math at four with a larger denominator: a database outage that cascades to 200 specs is still one fix, and the failure surface tells you that in seconds if you cluster first. This is test failure analysis the way it actually pays back.
The “group by error message” heuristic
If you only do one thing, do this: before reading any individual stack trace, run the failure list through a simple grouper. Strip the variable parts of each error string (timestamps, UUIDs, line numbers in stack traces, port numbers) and count how many distinct normalized errors you have. That number is your real bug count, give or take.
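A by-hand version of that grouper fits in a few lines. The sketch below is an assumption-laden illustration, not Gaffer's clustering algorithm: it groups by exact match after normalization, and the regular expressions cover only UUIDs, ISO-style timestamps, and bare numbers (which takes care of ports and line numbers).

```typescript
// Replace the variable parts of an error message with fixed placeholders.
// Order matters: UUIDs and timestamps are replaced before bare numbers,
// otherwise their digits would be rewritten piecemeal.
function normalizeError(message: string): string {
  return message
    .replace(
      /[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}/gi,
      "<uuid>",
    )
    .replace(
      /\d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2}:\d{2}(\.\d+)?(Z|[+-]\d{2}:?\d{2})?/g,
      "<timestamp>",
    )
    .replace(/\b\d+\b/g, "<n>") // ports, line/column numbers, numeric IDs
    .trim();
}

interface ErrorGroup {
  normalized: string;
  count: number;
  examples: string[]; // original messages, useful for picking a representative
}

// Group messages by their normalized form, largest group first.
function groupByError(messages: string[]): ErrorGroup[] {
  const groups = new Map<string, ErrorGroup>();
  for (const message of messages) {
    const key = normalizeError(message);
    const group =
      groups.get(key) ?? { normalized: key, count: 0, examples: [] };
    group.count += 1;
    group.examples.push(message);
    groups.set(key, group);
  }
  return Array.from(groups.values()).sort((a, b) => b.count - a.count);
}
```

The length of the array returned by `groupByError` is the "distinct normalized errors" number. Exact matching will under-merge near-duplicates that differ in wording, so treat the result as an upper bound on the number of root causes.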
In the Gaffer MCP this is the get_failure_clusters tool (defined in packages/mcp-server/src/tools/get-failure-clusters.ts), which returns:
    {
      "clusters": [
        {
          "representativeError": "Target page, context or browser has been closed",
          "count": 9,
          "similarity": 0.94,
          "tests": [
            {
              "name": "should create user",
              "fullName": "Auth > should create user",
              "errorMessage": "Target page, context or browser has been closed",
              "filePath": "e2e/auth.spec.ts"
            }
          ]
        },
        {
          "representativeError": "expect(received).toBe(expected) // emerald-600",
          "count": 3,
          "similarity": 0.88,
          "tests": [
            {
              "name": "renders primary CTA",
              "fullName": "Dashboard > renders primary CTA",
              "errorMessage": "expect(received).toBe(expected) // emerald-600",
              "filePath": "e2e/dashboard.spec.ts"
            }
          ]
        }
      ],
      "totalFailures": 12
    }

12 failures, 2 clusters, 2 fixes. An agent (or a human) reading that response has 2 tickets to file, not 12. The full tool surface for this kind of analysis is documented in the MCP server reference.
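To make the "2 tickets, not 12" step concrete, here is a small sketch that turns a response of that shape into one summary line per cluster. The interfaces are inferred from the sample response only (the real tool may return additional or optional fields), and `ticketsFromClusters` is a hypothetical helper, not part of the MCP server.

```typescript
// Types inferred from the sample response; treat them as an assumption.
interface ClusteredTest {
  name: string;
  fullName: string;
  errorMessage: string;
  filePath: string;
}

interface FailureCluster {
  representativeError: string;
  count: number;
  similarity: number;
  tests: ClusteredTest[];
}

interface FailureClustersResponse {
  clusters: FailureCluster[];
  totalFailures: number;
}

// One ticket title per cluster: failure count, representative error,
// and the distinct spec files involved.
function ticketsFromClusters(response: FailureClustersResponse): string[] {
  return response.clusters.map((cluster) => {
    const files = Array.from(
      new Set(cluster.tests.map((test) => test.filePath)),
    ).sort();
    return `${cluster.count} failing test(s): ${cluster.representativeError} [${files.join(", ")}]`;
  });
}
```

Run against the sample response, this yields two strings, the first being `9 failing test(s): Target page, context or browser has been closed [e2e/auth.spec.ts]`.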
How to investigate a test failure (step by step)
Once you have a cluster you actually want to investigate, the order of operations matters. Skipping steps wastes time; doing them in the wrong order wastes more.
Review the test failure output
Read the full error message, beyond the first line. The first line is usually the assertion that fired; the lines after it tell you what value was actually produced. A TimeoutError followed by waiting for selector '[data-testid="user-name"]' to be visible is a very different bug from a TimeoutError followed by waiting for navigation.
Reproduce the failure locally
If the test fails consistently in CI but passes locally, the gap between local and CI is itself the bug. Note what the local environment has that CI doesn’t: faster network, different timezone, a populated database, browser DevTools open. Adjust until you can reproduce, because a bug you can’t reproduce is a bug you can’t fix.
Check recent changes (commit / branch context)
For a new failure, the cause is almost always in the diff. Use git log on the test file, the source files it exercises, and the shared fixtures. Bisect if the change set is large. For a flaky failure, the diff usually isn’t the cause and you skip this step.
Verify the environment
For any cluster whose error message reads like infrastructure (ECONNREFUSED, ENOSPC, 429 Too Many Requests, secret-decryption errors), check the runner health and the dependent services before touching code. A 30-second status check beats a 30-minute investigation that lands on “the database was down.”
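That first-pass check can be automated with a short pattern list. This is a rough sketch: the patterns cover only the error strings named above, and secret-decryption errors are left out because their wording depends on the secrets tool in use. Extend `INFRA_PATTERNS` with whatever your own runners emit.

```typescript
// Error fragments that usually point at infrastructure rather than code.
// No "g" flag, so RegExp.test stays stateless across calls.
const INFRA_PATTERNS: RegExp[] = [
  /ECONNREFUSED/, // dependent service not accepting connections
  /ENOSPC/, // runner out of disk space
  /429 Too Many Requests/i, // rate limited by an upstream service
];

function looksLikeInfrastructure(errorMessage: string): boolean {
  return INFRA_PATTERNS.some((pattern) => pattern.test(errorMessage));
}
```

A match is a hint to check runner and service health first, not proof that the code is fine.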
How to fix test failures: quarantine vs. fix
Not every failure should block a merge. Not every failure should be ignored either. The decision is binary: quarantine the test (skip it, with a tracking issue) or fix it now.
Quarantine when:
- The failure is flaky (passes on rerun) and the underlying defect is hard to reproduce
- The failure is in an environment you don’t control and a fix requires upstream coordination
- The blast radius of the failure is contained: one test, one cluster, no signal loss for the rest of the suite
Fix now when:
- The failure clusters with several others (the fix unblocks a chunk of the suite at once)
- The failure is a clean regression with the diff in your branch
- The test guards behavior that ships to users and the failure indicates real broken behavior
The clustering view is what makes this decision tractable at scale. With clusters, “fix the top three clusters” is a one-day project; with raw failure counts, “fix the top fifty failures” is a week of context-switching. We documented the same triage workflow on the failure clustering solution page.
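The two checklists above can be written down as a small decision function, which is handy if you want a bot to suggest a label on each cluster. Everything here is an assumption layered on the criteria in the lists: the `FailureFacts` field names are hypothetical, "several others" is read as a cluster of three or more, the fix-now criteria are given precedence when both lists apply, and anything matching neither list defaults to fix-now.

```typescript
// Hypothetical inputs: facts a human or a bot has gathered about one failure.
interface FailureFacts {
  passesOnRerun: boolean; // flaky
  hardToReproduce: boolean;
  needsUpstreamFix: boolean; // lives in an environment you don't control
  clusterSize: number; // failures sharing this root cause, including this one
  regressionInCurrentDiff: boolean;
  userFacingBehaviorBroken: boolean;
}

type Decision = "fix-now" | "quarantine";

function decide(facts: FailureFacts, clusterThreshold = 3): Decision {
  // Fix-now criteria are checked first (assumed precedence).
  if (
    facts.clusterSize >= clusterThreshold ||
    facts.regressionInCurrentDiff ||
    facts.userFacingBehaviorBroken
  ) {
    return "fix-now";
  }
  // Quarantine criteria; the blast radius is already small at this point.
  if (
    (facts.passesOnRerun && facts.hardToReproduce) ||
    facts.needsUpstreamFix
  ) {
    return "quarantine";
  }
  // Assumed default: do not hide a failure without a reason to.
  return "fix-now";
}
```

A quarantine result still implies a tracking issue; the function only chooses the label.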
How to track whether failures are improving over time
Single CI runs are noisy. A run with zero failures today does not mean the suite is healthy, and a run with eighty failures does not always mean things got worse. The signal is in the trend.
Why single-run analysis isn’t enough
Any one run is a point estimate of the suite’s health. Infrastructure can flake. A long-quarantined test can suddenly start passing. A new contributor can add ten tests that catch real bugs that were already there. The right unit of measurement is a window of runs, not a single one.
Tracking failure trends across CI runs
What to watch over a 7- to 30-day window:
- Total failure count per run. Is it trending up or down?
- Distinct failure clusters per run. Are new patterns appearing or are the old ones recurring?
- The ratio of flaky to consistent failures. More flaky over time means the suite is decaying.
- Time-to-fix per cluster. How long does a regression stay in the suite once introduced?
Flaky test detection covers the flaky-vs-consistent split in more depth; the same trend approach applies to all four metrics.
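A rough sketch of those four metrics over a rolling window follows. The `RunSummary` shape is hypothetical (one record per CI run, with cluster keys such as normalized error strings). Two simplifications are worth naming: the flaky metric is computed as a share of all failures rather than a flaky-to-consistent ratio, which avoids dividing by zero, and time-to-fix is approximated by the span between a cluster's first and last sighting inside the window.

```typescript
// Hypothetical per-run record.
interface RunSummary {
  finishedAt: Date;
  failureCount: number;
  flakyFailures: number; // failures that passed on rerun
  clusterKeys: string[]; // one key per distinct cluster in this run
}

interface WindowSummary {
  runCount: number;
  meanFailuresPerRun: number;
  meanClustersPerRun: number;
  flakyShare: number; // flaky failures / all failures
  clusterSpanDays: Map<string, number>; // first to last sighting, in days
}

const DAY_MS = 24 * 60 * 60 * 1000;

function summarizeWindow(
  runs: RunSummary[],
  now: Date,
  windowDays: number,
): WindowSummary {
  const cutoff = now.getTime() - windowDays * DAY_MS;
  const inWindow = runs.filter((run) => run.finishedAt.getTime() >= cutoff);

  let failures = 0;
  let flaky = 0;
  let clusters = 0;
  const firstSeen = new Map<string, number>();
  const lastSeen = new Map<string, number>();
  for (const run of inWindow) {
    failures += run.failureCount;
    flaky += run.flakyFailures;
    clusters += run.clusterKeys.length;
    const t = run.finishedAt.getTime();
    for (const key of run.clusterKeys) {
      firstSeen.set(key, Math.min(firstSeen.get(key) ?? t, t));
      lastSeen.set(key, Math.max(lastSeen.get(key) ?? t, t));
    }
  }

  const clusterSpanDays = new Map<string, number>();
  firstSeen.forEach((first, key) => {
    clusterSpanDays.set(key, ((lastSeen.get(key) ?? first) - first) / DAY_MS);
  });

  const runCount = inWindow.length;
  return {
    runCount,
    meanFailuresPerRun: runCount === 0 ? 0 : failures / runCount,
    meanClustersPerRun: runCount === 0 ? 0 : clusters / runCount,
    flakyShare: failures === 0 ? 0 : flaky / failures,
    clusterSpanDays,
  };
}
```

Comparing the summary for the last 7 days against the one for the last 30 gives a crude trend direction without any charting.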
If your failures live in a specific framework’s reporter output, the format-specific guides are worth a read too. We have one on parsing and analyzing Cypress reports that walks through the same trend metrics applied to Cypress’s HTML report format.
What are the three types of failure?
A common framing in testing literature splits failures into three categories by where the defect actually lives:
- Failures caused by defects in the code under test. The product is wrong. Fix the application.
- Failures caused by defects in the test itself. The test is wrong (bad assertion, brittle selector, race condition). Fix the test.
- Failures caused by defects in the environment. Neither the product nor the test is wrong; the world they run in is. Fix the infrastructure.
The clustering technique above maps to this split cleanly: a single root cause that affects many tests is usually a defect in the test code or the environment (the second or third category); a single failure that doesn’t cluster is more often a defect in the code under test (the first).
What are the 4 levels of testing?
Most teams structure their test suites in four layers. Different layers fail for different reasons, which is why triage benefits from knowing which layer the red test came from:
- Unit tests. Test a single function or module in isolation. Fail mostly for incorrect-assertion or real-bug reasons. Cheap to run, cheap to fix.
- Integration tests. Test how multiple modules work together. Fail for assertion errors, dependency mismatches, and shared-state issues. Still cheap to run, moderate to fix.
- System tests (end-to-end). Test the application as a whole, often through a browser or HTTP client. Fail for timing issues, environment problems, and flakiness as often as for real bugs. Expensive to run, expensive to fix.
- Acceptance tests. Test against user-facing requirements, often manually or with a user-acceptance framework. Fail because the product doesn’t meet the spec, regardless of whether the underlying code is “working.”
The clustering and trend-tracking approach above is most valuable at the integration and system layers, where one root cause routinely cascades across many tests. Unit-test failures usually don’t cluster (each test exercises a small, distinct piece of code); E2E failures almost always do (each test touches the same browser, the same network, the same database).
Wrapping up
Two takeaways are worth carrying forward. First, the count of failed tests is not the count of bugs. If you treat them as the same number you will burn hours debugging cascades. Second, the work of triage is mostly grouping, not investigating. Cluster first, investigate one representative per cluster, fix the cause once.
The tooling for the grouping step exists. Gaffer does it automatically across every uploaded test run, the MCP server exposes it to AI agents that read CI output, and the failure clustering solution page walks through the algorithm. Whatever tool you use, the underlying habit is the same: when the wall of red shows up Monday morning, count distinct error patterns before you count failures.