Flaky Test Detection: How to Find and Fix Unreliable Tests

A test passes. You push to main. The same test fails. You re-run the pipeline - it passes again. Sound familiar? Flaky tests are one of the most frustrating problems in software development, and they’re more damaging than most teams realize.

What Makes a Test Flaky?

A flaky test is one that produces different results (pass/fail) without any code changes. The test itself is non-deterministic - it depends on factors outside the code being tested.

Common causes include:

  • Race conditions - Tests that depend on timing or async operations
  • Shared state - Tests that pollute or depend on global state
  • External dependencies - Network calls, databases, or third-party APIs
  • Order dependence - Tests that only pass when run in a specific sequence
  • Resource constraints - CI runners with different CPU/memory than local machines
  • Time-based logic - Tests that depend on current date/time

The Real Cost of Flaky Tests

Flaky tests seem like minor annoyances, but they compound into serious problems:

1. Wasted Developer Time

Every flaky failure triggers investigation. Developers check the diff, read the error, realize it’s “that test again,” and re-run the pipeline. This cycle repeats multiple times per day across the team.

A study by Google found that flaky tests cost them 2% of all engineering time - and that’s with dedicated infrastructure to manage them.

2. Eroded Trust in CI

When developers expect failures to be flaky, they stop paying attention. Real failures get dismissed as “probably flaky.” Bugs slip through. The test suite loses its value as a safety net.

3. Slower Releases

Teams add manual verification steps because they don’t trust automated tests. Release cycles slow down. The promise of CI/CD - fast, confident deployments - breaks down.

4. Hidden in Plain Sight

The worst part? Most teams don’t know which tests are flaky. They experience the symptoms (random failures, re-runs) but can’t identify the specific culprits without historical data.

Identifying Flaky Tests with Flip Rate Analysis

The key metric for flaky test detection is flip rate: how often a test changes state (pass→fail or fail→pass) between consecutive runs on the same branch.

Flip Rate    Interpretation
0%           Stable test (always passes or always fails)
1-5%         Minor flakiness, worth monitoring
5-15%        Problematic, should be investigated
15%+         Severely flaky, fix immediately or quarantine

A test that passes 95% of the time might seem fine, but if those failures are scattered rather than clustered, it flips state on close to 10% of consecutive runs - firmly in the “should be investigated” range - and causes significant disruption.
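To make the metric concrete, here’s a rough sketch of a flip rate calculation in TypeScript. This is not Gaffer’s implementation - the minimum-run guard and the 10% threshold below are illustrative values, not its actual defaults:

// Flip rate sketch: count state changes between consecutive runs
// MIN_RUNS and FLAKY_THRESHOLD are illustrative values, not Gaffer's defaults
type RunResult = 'pass' | 'fail';

const MIN_RUNS = 5;          // don't judge a test on too small a sample
const FLAKY_THRESHOLD = 0.1; // flag tests that flip on 10%+ of consecutive runs

function flipRate(results: RunResult[]): number | null {
  if (results.length < MIN_RUNS) return null; // not enough history yet
  let flips = 0;
  for (let i = 1; i < results.length; i++) {
    if (results[i] !== results[i - 1]) flips++;
  }
  return flips / (results.length - 1);
}

// Two isolated failures in 8 runs produce 4 flips across 7 pairs (~57%)
const history: RunResult[] = ['pass', 'fail', 'pass', 'pass', 'fail', 'pass', 'pass', 'pass'];
const rate = flipRate(history);
if (rate !== null && rate >= FLAKY_THRESHOLD) {
  console.log(`Flaky: flip rate ${(rate * 100).toFixed(1)}%`);
}

Counting state changes between consecutive runs, rather than raw failures, is what separates a flip rate from a simple failure rate.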

Why You Need Historical Data

You can’t calculate flip rate from a single test run. You need data across many runs to identify patterns:

  • Which tests flip most often?
  • Is flakiness consistent or does it come and go?
  • Did a recent change introduce new flakiness?
  • Are certain tests only flaky on specific branches?

This is where most teams struggle. CI artifacts expire, logs get deleted, and there’s no persistent history to analyze.

How Gaffer Detects Flaky Tests

Gaffer stores all your test results permanently and automatically calculates flip rates across runs.

Automatic Flip Rate Tracking

Every test run uploads to Gaffer, building a history of pass/fail states per test. The analytics engine calculates flip rates using a configurable sample window (minimum 5 runs to avoid false positives).

Flaky Test Dashboard

The analytics page surfaces your most problematic tests:

  • Flip rate - Percentage of runs where the test changed state
  • Run count - How many times the test has been observed
  • Last seen - When the flaky behavior last occurred
  • Trend - Is flakiness getting better or worse?

Configurable Thresholds

Different teams have different tolerances. Gaffer lets you set the flip rate threshold at which a test is flagged as flaky (default: 10%). Adjust it based on your suite’s baseline health.

Strategies for Fixing Flaky Tests

Once you’ve identified flaky tests, here’s how to address them:

1. Quarantine Immediately

Don’t let flaky tests block your pipeline. Move them to a separate test suite that runs but doesn’t fail the build. This preserves signal while you investigate.

// Jest example: skip flaky test temporarily
test.skip('flaky test - investigating', () => {
  // ...
});
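Skipping is a stopgap. To keep quarantined tests running without blocking the build, one option is to exclude a quarantine folder from the main run and execute it in a separate, non-blocking CI step. A minimal sketch - the __quarantine__ folder name is an illustrative convention, not a Jest default:

// Jest example: exclude quarantined specs from the build-blocking run
// jest.config.js - the __quarantine__ folder name is an illustrative convention
module.exports = {
  testPathIgnorePatterns: ['/node_modules/', '/__quarantine__/'],
};

A separate CI job can then run only the quarantined folder and report its results without failing the build.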

2. Add Retry Logic (Carefully)

Retries can mask flakiness, but they’re sometimes necessary for external dependencies:

// Playwright example: retry flaky tests on CI only
import { defineConfig } from '@playwright/test';

export default defineConfig({
  retries: process.env.CI ? 2 : 0,
});

Use retries sparingly - they increase build time and hide root causes.

3. Isolate Shared State

Ensure each test starts with a clean state:

beforeEach(() => {
  // Reset database, clear caches, etc.
  jest.clearAllMocks();
});

4. Make Async Tests Deterministic

Replace arbitrary waits with explicit conditions:

// Bad: arbitrary timeout
await page.waitForTimeout(2000);

// Good: wait for specific condition
await page.waitForSelector('[data-testid="loaded"]');
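Playwright’s web-first assertions are another way to avoid fixed timeouts - expect retries until the condition holds or the assertion times out. The data-testid below reuses the value from the example above:

// Playwright example: auto-waiting (web-first) assertion
await expect(page.getByTestId('loaded')).toBeVisible();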

5. Mock External Dependencies

Network calls are a major source of flakiness. Mock them:

// Playwright example: mock API responses
const mockData = { items: [] }; // illustrative payload shape
await page.route('**/api/data', async route => {
  await route.fulfill({ json: mockData });
});

6. Run Tests in Isolation

If order-dependence is suspected, run tests in random order to surface the issue:

# Jest: randomize the order of tests within each file
jest --randomize

Prevention: Catching Flakiness Early

The best flaky test is one you never merge:

  • Run tests multiple times in PR - If a test passes 3x in a row, it’s more likely stable (see the sketch after this list)
  • Monitor flip rates over time - Catch regressions early
  • Set alerts for new flaky tests - Gaffer can notify you when a previously stable test becomes flaky
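One way to run tests multiple times in a PR is Playwright’s repeatEach option. A minimal sketch - the PR_CHECK environment variable is an assumption here, not something your CI sets by default:

// Playwright example: repeat each test on PR builds to surface flakiness
// PR_CHECK is an illustrative env var your CI would need to set
import { defineConfig } from '@playwright/test';

export default defineConfig({
  repeatEach: process.env.PR_CHECK ? 3 : 1,
});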

Get Started with Flaky Test Detection

Stop guessing which tests are flaky. Gaffer’s analytics dashboard shows you exactly which tests are unreliable, how often they flip, and whether they’re getting better or worse.