Flaky Test Detection: How to Find and Fix Unreliable Tests

A test passes. You push to main. The same test fails. You re-run the pipeline - it passes again. Sound familiar? Flaky tests are one of the most frustrating problems in software development, and they’re more damaging than most teams realize.

What Makes a Test Flaky?

A flaky test is one that produces different results (pass/fail) without any code changes. The test itself is non-deterministic - it depends on factors outside the code being tested.

Common causes include:

  • Race conditions - Tests that depend on timing or async operations
  • Shared state - Tests that pollute or depend on global state
  • External dependencies - Network calls, databases, or third-party APIs
  • Order dependence - Tests that only pass when run in a specific sequence
  • Resource constraints - CI runners with different CPU/memory than local machines
  • Time-based logic - Tests that depend on current date/time

The Real Cost of Flaky Tests

Flaky tests seem like minor annoyances, but they compound into serious problems:

1. Wasted Developer Time

Every flaky failure triggers investigation. Developers check the diff, read the error, realize it’s “that test again,” and re-run the pipeline. This cycle repeats multiple times per day across the team.

A study by Google found that flaky tests cost them 2% of all engineering time - and that’s with dedicated infrastructure to manage them.

2. Eroded Trust in CI

When developers expect failures to be flaky, they stop paying attention. Real failures get dismissed as “probably flaky.” Bugs slip through. The test suite loses its value as a safety net.

3. Slower Releases

Teams add manual verification steps because they don’t trust automated tests. Release cycles slow down. The promise of CI/CD - fast, confident deployments - breaks down.

4. Hidden in Plain Sight

The worst part? Most teams don’t know which tests are flaky. They experience the symptoms (random failures, re-runs) but can’t identify the specific culprits without historical data.

Identifying Flaky Tests with Flip Rate Analysis

The key metric for flaky test detection is flip rate: how often a test transitions between pass and fail states. A test that constantly flips (pass → fail → pass → fail) is flaky, while a test that stays in one state is stable.

Flip Rate = (number of status transitions) / (total runs - 1)

For example, a test with results [pass, fail, pass, fail, pass] has 4 transitions across 5 runs: 4 / (5 - 1) = 100% flip rate. That's as flaky as a test can get.

Flip Rate    Interpretation
0-10%        Stable test (consistent results)
10-30%       Moderately flaky, worth investigating
30-50%       Severely flaky, fix immediately or quarantine
50%+         Essentially random - might as well flip a coin

Gaffer requires a minimum of 5 test runs before flagging a test as flaky to avoid false positives from small sample sizes.
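
To make the arithmetic concrete, here's a minimal sketch of the calculation in TypeScript (the names and structure are illustrative, not Gaffer's internals):

// Flip rate = status transitions / (total runs - 1), per the formula above
type TestStatus = 'pass' | 'fail';

const MIN_RUNS = 5; // below this, flag nothing - small samples cause false positives

function flipRate(results: TestStatus[]): number | null {
  if (results.length < MIN_RUNS) return null; // not enough history yet
  let flips = 0;
  for (let i = 1; i < results.length; i++) {
    if (results[i] !== results[i - 1]) flips++;
  }
  return flips / (results.length - 1);
}

// [pass, fail, pass, fail, pass] -> 4 flips / 4 possible transitions = 1.0 (100%)
console.log(flipRate(['pass', 'fail', 'pass', 'fail', 'pass']));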

Why You Need Historical Data

You can’t calculate a flip rate from a single test run. You need data across many runs to identify patterns:

  • Which tests have inconsistent pass/fail ratios?
  • Is flakiness consistent or does it come and go?
  • Did a recent change introduce new flakiness?
  • Are certain tests only flaky on specific branches?

This is where most teams struggle. CI artifacts expire, logs get deleted, and there’s no persistent history to analyze.

How Gaffer Detects Flaky Tests

Gaffer stores your test results and automatically tracks flip rates across runs.

Automatic Flip Rate Tracking

Every test run uploads to Gaffer, building a chronological history of pass/fail states per test. The analytics engine calculates flip rates by counting how many times each test transitions between states. Tests need at least 5 runs before they’re evaluated to avoid false positives.

Flaky Test Dashboard

The analytics page surfaces your most problematic tests:

  • Flip rate - How often the test switches between pass and fail
  • Flip count - Total number of status transitions observed
  • Total runs - How many times the test has been executed
  • Last seen - When the flaky behavior last occurred
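
Given a stored run history, each of these columns is a straightforward derivation. Here's a sketch of how they might be computed (the record shape below is illustrative, not Gaffer's actual schema):

type TestStatus = 'pass' | 'fail';

// Hypothetical stored history for one test, in chronological order
interface RunRecord {
  status: TestStatus;
  recordedAt: string; // ISO timestamp
}

function dashboardRow(name: string, runs: RunRecord[]) {
  let flipCount = 0;
  let lastSeen: string | null = null; // when the test last changed status
  for (let i = 1; i < runs.length; i++) {
    if (runs[i].status !== runs[i - 1].status) {
      flipCount++;
      lastSeen = runs[i].recordedAt;
    }
  }
  return {
    name,
    flipRate: runs.length > 1 ? flipCount / (runs.length - 1) : 0,
    flipCount, // total status transitions observed
    totalRuns: runs.length,
    lastSeen,
  };
}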

Configurable Thresholds

Different teams have different tolerances. Gaffer lets you configure the flip rate threshold at which a test is flagged as flaky (default: 10%). If your test suite is particularly noisy, you might raise this to 20% to focus on the worst offenders first.

Strategies for Fixing Flaky Tests

Once you’ve identified flaky tests, here’s how to address them:

1. Quarantine Immediately

Don’t let flaky tests block your pipeline. Move them to a separate test suite that runs but doesn’t fail the build. This preserves signal while you investigate.

// Jest example: skip flaky test temporarily
test.skip('flaky test - investigating', () => {
  // ...
});

2. Add Retry Logic (Carefully)

Retries can mask flakiness, but they’re sometimes necessary for external dependencies:

// Playwright example: retry flaky tests in CI only
import { defineConfig } from '@playwright/test';

export default defineConfig({
  retries: process.env.CI ? 2 : 0,
});

Use retries sparingly - they increase build time and hide root causes.

3. Isolate Shared State

Ensure each test starts with a clean state:

beforeEach(() => {
  // Reset database, clear caches, etc.
  jest.clearAllMocks();
});

4. Make Async Tests Deterministic

Replace arbitrary waits with explicit conditions:

// Bad: arbitrary timeout
await page.waitForTimeout(2000);

// Good: wait for specific condition
await page.waitForSelector('[data-testid="loaded"]');

5. Mock External Dependencies

Network calls are a major source of flakiness. Mock them:

// Playwright example: mock API responses
const mockData = { items: [] }; // representative payload; shape depends on your API

await page.route('**/api/data', route => {
  route.fulfill({ json: mockData });
});

6. Run Tests in Isolation

If order-dependence is suspected, run tests in random order to surface the issue:

# Jest: randomize test order
jest --randomize

Prevention: Catching Flakiness Early

The best flaky test is one you never merge:

  • Run tests multiple times in PR - If it passes 3x in a row, it’s more likely stable (see the example after this list)
  • Monitor failure rates over time - Catch regressions early
  • Set alerts for test failures - Gaffer can notify your team in Slack when tests fail, including consecutive failure triggers to catch recurring problems
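
One simple way to run a suite multiple times in the PR pipeline, assuming Playwright (the --repeat-each flag is Playwright’s; for Jest you’d use a shell loop instead):

# Playwright: run every test 3 times to shake out flakiness before merge
npx playwright test --repeat-each=3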

Get Started with Flaky Test Detection

Stop guessing which tests are flaky. Gaffer’s analytics dashboard shows you exactly which tests are unreliable, how often they flip, and whether they’re getting better or worse.

Start Free