How to Manage Flaky E2E Tests at Scale

A flaky test fails on CI. Someone checks the logs, recognizes the test name, and clicks “Re-run.” It passes. The PR merges. Nobody files a ticket. This happens three more times that week, across different PRs, with different people clicking the same button.

This is the re-run tax: the default response to flaky tests in most organizations. It’s not malicious or lazy. It’s rational. Investigating a flaky test takes time you don’t have when a feature needs to ship. So you retry, and the cost stays invisible.

We covered the financial side of this in How Much Are Flaky Tests Costing You?. This post is the practical counterpart: a systematic approach to managing flaky tests once you’ve decided to stop paying the tax.

Why “just fix them” doesn’t work

The advice is obvious. The execution isn’t. At any meaningful scale, you’re dealing with dozens to hundreds of flaky tests, and “fix them all” is not a strategy.

Google has published extensively on this. Their data shows that approximately 1.5% of all test runs exhibit flaky behavior, and at Google’s scale, that means millions of flaky test runs per day. They invested in dedicated infrastructure because human judgment alone can’t triage flakiness across thousands of tests.

The core issue is prioritization. Not all flaky tests are equally damaging. A test that flakes once a month on a non-critical path is different from one that flakes daily and blocks the deploy pipeline. Fixing them in the wrong order means you spend effort on low-impact tests while the expensive ones keep draining CI time.

You need a system.

Step 1: Detect — flip rate, not retry count

Most teams discover flaky tests reactively: a test fails, someone retries, it passes, and the test earns a reputation. This approach has two problems. It depends on institutional memory (“oh yeah, that one’s flaky”), and it misses tests that flake infrequently.

The better signal is flip rate: how often a test transitions between pass and fail across consecutive runs.

Flip Rate = (number of status transitions) / (total runs - 1)

A test with results [pass, fail, pass, fail, pass] has 4 transitions over 5 runs — a 100% flip rate. A test with results [pass, pass, pass, fail, fail] has 1 transition — a 25% flip rate. Both have a 40% failure rate, but the first is clearly flaky while the second might be a legitimate regression.
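The formula is easy to sketch in code. This is a minimal illustrative helper (the function name `flipRate` is ours, not from any particular library):

```typescript
type Result = "pass" | "fail";

// Flip rate = status transitions / (total runs - 1)
function flipRate(results: Result[]): number {
  if (results.length < 2) return 0; // not enough data for a rate
  let transitions = 0;
  for (let i = 1; i < results.length; i++) {
    if (results[i] !== results[i - 1]) transitions++;
  }
  return transitions / (results.length - 1);
}

flipRate(["pass", "fail", "pass", "fail", "pass"]); // → 1.0 (clearly flaky)
flipRate(["pass", "pass", "pass", "fail", "fail"]); // → 0.25 (possible regression)
```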

Why flip rate is better than retry-based detection:

  • Retry-based detection only catches tests that fail during a build where someone bothered to retry. If a test flakes but the build had other real failures, the retry won’t isolate it. It’s also biased by team behavior — some developers retry everything, others don’t.
  • Flip rate analysis works on historical data across all runs. It catches tests that flake infrequently and distinguishes true flakiness from tests that legitimately started failing after a code change.

You need at least 5-10 runs to get a reliable flip rate. Below that, you’re working with noise.

Step 2: Prioritize — which flaky tests hurt most

Once you have flip rates, resist the urge to sort by highest flip rate and start at the top. A test with an 80% flip rate that runs in a rarely-triggered integration suite matters less than a test with a 25% flip rate that runs on every PR and takes 3 minutes each time.

Rank flaky tests by cost, not just frequency. The factors that matter:

Flip frequency. How many times per week does this test actually flip? A test with a 50% flip rate that runs twice a week flips once. A test with a 15% flip rate that runs 100 times a week flips 15 times.

CI time per flip. When a flaky test causes a rerun, how much CI time does that rerun consume? A flaky test in a 2-minute unit test suite costs much less than one in a 20-minute E2E suite — because the entire suite reruns, not just the failing test.

Blast radius. Does this test block merges? Is it in the critical path for deploys? A flaky test in an optional nightly suite is annoying. A flaky test that gates your main branch merge queue is an emergency.

A rough prioritization formula:

Weekly cost = flips_per_week × rerun_duration_minutes × (1 + merge_blocking_weight)

Where merge_blocking_weight is 0 for non-blocking suites and something like 2-5 for tests in the merge queue (to account for the developer wait time, not just CI minutes).

The tests at the top of this ranking are where you start. Everything else goes in the backlog.
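The ranking itself is a one-liner once the inputs are gathered. A minimal sketch of the formula above — the field names (`flipsPerWeek`, `rerunMinutes`, `mergeBlockingWeight`) are illustrative, not from any tool:

```typescript
interface FlakyTest {
  name: string;
  flipsPerWeek: number;
  rerunMinutes: number;
  mergeBlockingWeight: number; // 0 for non-blocking, ~2-5 for the merge queue
}

// Weekly cost = flips_per_week × rerun_duration_minutes × (1 + merge_blocking_weight)
function weeklyCost(t: FlakyTest): number {
  return t.flipsPerWeek * t.rerunMinutes * (1 + t.mergeBlockingWeight);
}

// Most expensive first — this is the fix queue
function rankByCost(tests: FlakyTest[]): FlakyTest[] {
  return [...tests].sort((a, b) => weeklyCost(b) - weeklyCost(a));
}
```

Note how the ranking inverts naive intuition: a rarely-run nightly test with an 80% flip rate (2 flips × 20 min × 1 = 40) lands below a merge-queue test with a modest flip rate (15 flips × 3 min × 4 = 180).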

Step 3: Act — fix, quarantine, or delete

For each high-priority flaky test, you have three options. The right choice depends on the test.

Fix

Fix the test if:

  • The root cause is identifiable (timing issue, shared state, missing wait condition)
  • The test covers important behavior that isn’t tested elsewhere
  • The fix is bounded — you can estimate the effort

Common E2E flakiness patterns and their fixes:

```ts
// Problem: Race condition — element exists but isn't interactive yet
await page.click('#submit');

// Fix: Wait for the element to be actionable
await page.getByRole('button', { name: 'Submit' }).click();
// Playwright's actionability checks handle this, but only
// if you're using the right locator methods
```

```ts
// Problem: Test depends on animation/transition timing
await page.waitForTimeout(2000);
await expect(page.locator('.modal')).toBeVisible();

// Fix: Wait for the specific state, not an arbitrary duration
await expect(page.locator('.modal')).toBeVisible({ timeout: 5000 });
```

```ts
// Problem: Shared database state between tests
test('creates a user', async () => {
  await api.post('/users', { email: '[email protected]' });
  // ...
});

test('lists users', async () => {
  // Fails if 'creates a user' didn't run first,
  // or if another test deleted the user
  const users = await api.get('/users');
  expect(users).toHaveLength(1);
});

// Fix: Each test manages its own state
test('lists users', async () => {
  await api.post('/users', { email: '[email protected]' });
  const users = await api.get('/users');
  expect(users.length).toBeGreaterThanOrEqual(1);
});
```

Quarantine

Quarantine the test if:

  • The root cause is unclear and investigation would take significant time
  • The test is blocking merges but fixing it isn’t a priority this sprint
  • You need to stabilize the pipeline immediately

Quarantining means the test still runs but doesn’t fail the build. The mechanics depend on your framework:

playwright.config.ts

```ts
import { defineConfig } from '@playwright/test';

// Playwright: move quarantined tests to a separate project
// that doesn't block CI
export default defineConfig({
  projects: [
    {
      name: 'stable',
      testDir: './tests',
      testIgnore: /.*\.quarantine\.spec\.ts/,
    },
    {
      name: 'quarantine',
      testDir: './tests',
      testMatch: /.*\.quarantine\.spec\.ts/,
    },
  ],
});
```

Then in CI, only gate on the stable project. The quarantine project runs for data collection but doesn’t block the pipeline.
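In GitHub Actions, that split might look like the following sketch (step names are illustrative; `continue-on-error` is what keeps the quarantine run from failing the job):

```yaml
# Gate the pipeline on the stable project only
- name: Run stable tests
  run: npx playwright test --project=stable

# Quarantined tests still run for data collection,
# but a failure here doesn't fail the build
- name: Run quarantined tests (non-blocking)
  run: npx playwright test --project=quarantine
  continue-on-error: true
```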

The risk with quarantine is that tests stay quarantined forever. Set a review date. If a quarantined test hasn’t been fixed within 30 days, it’s a candidate for deletion.

Delete

Delete the test if:

  • It tests behavior that’s covered by other, stable tests
  • The feature it tests has changed significantly and the test wasn’t updated
  • The cost of maintaining it exceeds the risk of removing it
  • Nobody on the team can explain what it’s actually verifying

This is the hardest option to accept. Deleting tests feels like reducing coverage. But a flaky test that nobody trusts provides zero actual coverage. It’s a line in your test count, not a safety net.

Before deleting, verify that the behavior is covered elsewhere. Check your coverage reports. If there’s a gap, write a new, simpler test — one designed for stability from the start.

Step 4: Verify — confirming fixes with trend data

You fixed a flaky test. Now you need to confirm the fix holds in CI, not just locally.

“I ran it 5 times locally and it passed” isn’t verification. Flaky tests are often environment-dependent — they flake on CI runners with different CPU, memory, or network characteristics. A test that’s stable on your machine can still flake in CI.

Verification requires watching the test’s flip rate over time in the environment where it runs. After applying a fix:

  1. Merge the fix and monitor. Watch the test’s pass/fail pattern over the next 10-20 CI runs.
  2. Compare flip rates. If the flip rate was 40% before the fix and it’s 0% after 20 runs, the fix worked.
  3. Set a threshold for “fixed.” A test with a 0% flip rate over its last 20 runs can be considered stable. Anything above your configured flaky threshold (e.g., 10%) means the fix didn’t fully resolve it.

This is where historical trend data matters. Without it, you’re guessing. With it, you have a clear before-and-after comparison.

Building the workflow with Gaffer

The four steps above require data that most CI systems don’t retain. Build logs expire. Test results from last month are gone. You can’t calculate flip rates without historical pass/fail data across runs.

Gaffer stores your test results and computes this automatically. Here’s what the data flow looks like:

  1. CI uploads test results after each run using the GitHub Action:

```yaml
- name: Upload to Gaffer
  if: always()
  uses: gaffer-sh/gaffer-uploader@v2
  with:
    gaffer_upload_token: ${{ secrets.GAFFER_UPLOAD_TOKEN }}
```

  2. Flaky test detection runs automatically. Gaffer analyzes flip rates across runs and flags tests that exceed a configurable threshold (default: 10% flip rate). If you’re in triage mode with a noisy suite, raise it to 20-30% to focus on the worst offenders first, then lower it as your suite stabilizes.

  3. The analytics dashboard surfaces your flakiest tests ranked by a composite score that factors in failure rate, flip rate, and duration variability — the data you need for the prioritization step.

  4. After a fix, you can watch the test’s flip rate trend downward across subsequent runs, confirming the fix worked in CI — not just locally.

Prevention: catching flakiness before merge

Managing existing flaky tests is necessary. Preventing new ones from entering your suite is better.

Run new tests multiple times in PR. Before merging a new E2E test, run it 5-10 times in CI. If it passes consistently, it’s more likely stable. Some teams do this with a dedicated CI job that uses the framework’s repeat option:

```sh
# Playwright: repeat tests to check for flakiness
npx playwright test --repeat-each=5 tests/new-feature.spec.ts
```

Monitor flip rates on feature branches. If a test starts flaking on a branch before it merges to main, you know which change introduced the instability.

Set alerts for new flaky tests. Gaffer’s health score alerts notify your team when new tests cross the flaky threshold, so you can catch regressions in stability before they become normalized.

Write stable tests from the start. Most E2E flakiness comes from a few recurring patterns: hardcoded waits, shared state, missing retry logic for network calls, and overly specific assertions. A short checklist during code review catches most of these.

The difference between tribal knowledge — “that test is flaky, just rerun it” — and a managed process is historical data: flip rates per test, cost rankings, and trend lines that confirm whether a fix held across 20 CI runs.
