Flaky Test Detection: Find Which Tests Are Unreliable

Stop guessing which tests are flaky. Gaffer tracks pass/fail history, calculates flip rates, and surfaces the unreliable tests wasting your team's time.


A test fails on main, you re-run the pipeline, and it passes. You know flaky tests are eating your team’s time, but you can’t name the specific tests doing the damage. CI artifacts expire before patterns emerge, retries hide which tests are actually unstable, and nobody has time to keep a spreadsheet. The missing piece is persistent pass/fail history per test.

The Flaky Test Detection Problem

Most teams experience flakiness without being able to fix it because the data needed to identify the specific bad tests is gone by the time anyone goes looking.

  • Re-runs mask the signal. A test that passes on attempt 2 looks identical to a test that passed on attempt 1.
  • CI artifacts expire. GitHub Actions defaults to 90 days, GitLab to 30. By the time a pattern is obvious, the underlying runs are gone.
  • No history per test. Build pages show “this run.” Nothing shows “this test, last 50 runs.”
  • Retry configs lie. retries: 2 keeps the build green and tells you nothing about which tests needed the retry.
  • Quarantine has no exit criteria. Tests get .skip’d, then forgotten. Coverage rots.

Common (Bad) Solutions

1. Manual re-run tracking

  • Slack channel for “this failed but passed on rerun”
  • Falls apart at >2 engineers; nobody catches the trends

2. Retry everything

  • retries: 3 in CI keeps builds green
  • Hides root cause, doubles or triples build time, masks regressions

3. Spreadsheets

  • Someone updates a sheet with flaky tests after each red build
  • Dies in week three when that engineer goes on vacation

4. Reading CI artifacts manually

  • actions/upload-artifact keeps reports for 90 days
  • Finding the right artifact across hundreds of runs is its own job

Better Solution: Flip-Rate Analysis Across Runs

The metric that actually identifies flaky tests is flip rate: how often a test transitions between pass and fail across consecutive runs. A test that stays passing is stable. A test that flips constantly is the one wasting your time.

Flip Rate = (number of pass/fail transitions) / (total runs - 1)

A test with results [pass, fail, pass, fail, pass] flips on every transition: 4 flips across 4 possible transitions (5 runs), a flip rate of 100%.
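The formula is simple enough to sketch directly. This is an illustrative implementation, not Gaffer's internal code:

```javascript
// Flip rate: pass/fail transitions divided by the number of
// consecutive-run pairs. results is oldest-first, e.g. ['pass', 'fail', ...].
function flipRate(results) {
  if (results.length < 2) return 0; // a single run carries no flip signal
  let flips = 0;
  for (let i = 1; i < results.length; i++) {
    if (results[i] !== results[i - 1]) flips++;
  }
  return flips / (results.length - 1);
}

console.log(flipRate(['pass', 'fail', 'pass', 'fail', 'pass'])); // 1 (flips every run)
console.log(flipRate(['pass', 'pass', 'fail', 'pass', 'pass'])); // 0.5 (2 flips / 4)
```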

| Flip Rate | Interpretation |
| --- | --- |
| 0–10% | Stable test |
| 10–30% | Moderately flaky, worth investigating |
| 30–50% | Severely flaky, fix immediately or quarantine |
| 50%+ | Random; remove from the build until rewritten |

Flip rate only works with persistent history. A single run cannot tell you anything. A handful of runs cannot either: Gaffer requires at least 5 runs before evaluating a test, to keep a single bad run from poisoning the score.

Detecting Flaky Tests with Gaffer

Gaffer stores every test result and computes flip rate per test, per branch, automatically.

  • Flip rate: pass/fail transitions across consecutive runs, last 30 days
  • Flip count: total pass/fail transitions observed
  • Total runs: sample size, so you know whether to trust the score
  • Last seen flaky: when the most recent flip happened
  • Branch filter: separate main flakiness from feature-branch noise

No new test framework to adopt, no annotation to add. If your CI already produces JUnit XML, CTRF, Playwright HTML, Jest JSON, or pytest output, Gaffer ingests it and starts the count.

Threshold defaults to 10% flip rate before a test is flagged. Tune it higher if your suite is noisy and you want to focus on the worst offenders first.
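The threshold and the minimum-sample rule combine straightforwardly. A minimal sketch of that flagging logic — the data shape and function names here are illustrative, not Gaffer's API:

```javascript
// Flag tests whose flip rate crosses the threshold, but only once
// there is enough history to trust the score.
const FLIP_THRESHOLD = 0.10; // default: flag at a 10% flip rate
const MIN_RUNS = 5;          // skip tests with too little history

function flagFlaky(tests) {
  return tests
    .filter(t => t.totalRuns >= MIN_RUNS && t.flipRate >= FLIP_THRESHOLD)
    .sort((a, b) => b.flipRate - a.flipRate); // worst offenders first
}

const flagged = flagFlaky([
  { name: 'checkout flow', totalRuns: 40, flipRate: 0.35 },
  { name: 'login smoke',   totalRuns: 3,  flipRate: 0.50 }, // too few runs yet
  { name: 'stable test',   totalRuns: 40, flipRate: 0.02 },
]);
console.log(flagged.map(t => t.name)); // ['checkout flow']
```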

Setting Up Flaky Detection in CI

One step in your CI pipeline: test results upload, Gaffer parses them, and flaky detection happens server-side.

.github/workflows/test.yml
- name: Run tests
  run: npx playwright test --reporter=junit

- name: Upload to Gaffer
  if: always()
  uses: gaffer-sh/gaffer-uploader@v2
  with:
    api-key: ${{ secrets.GAFFER_UPLOAD_TOKEN }}
    report-path: ./test-results/junit.xml

Jest, Vitest, pytest, and Mocha all work the same way: emit a JUnit XML or CTRF JSON report, point the uploader at it, done. Full setup per CI provider lives in the CI guides.
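For Jest, one common way to emit JUnit XML is the jest-junit reporter. A minimal config sketch, assuming jest-junit is installed as a dev dependency (`npm i -D jest-junit`):

```javascript
// jest.config.js — keep the default console reporter and also
// write JUnit XML for the uploader to pick up.
module.exports = {
  reporters: [
    'default',
    ['jest-junit', {
      outputDirectory: './test-results', // matches report-path above
      outputName: 'junit.xml',
    }],
  ],
};
```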

Fixing Flaky Tests Once You’ve Found Them

Detection is the hard part. Once you have a list of specific tests with high flip rates, the fixes are mostly mechanical.

Quarantine first, fix second

Don’t let a flaky test block the build while you investigate. Move it to a separate suite that runs but doesn’t gate merges. Set a calendar reminder; quarantine without an exit date is just deletion.

// Jest example: skip with a tracking issue
test.skip('checkout flow: flaky, see #4127', () => {
  // ...
});

Replace arbitrary waits with explicit conditions

// Bad: fixed timeout, hopes the page is ready
await page.waitForTimeout(2000);

// Good: wait for the specific thing you're testing
await page.waitForSelector('[data-testid="loaded"]');

Mock external dependencies

Network calls are the single largest source of flakiness in E2E suites. Mock the API at the test boundary; let an integration test exercise the real call.

await page.route('**/api/data', route => {
  route.fulfill({ json: mockData });
});

Reset shared state per test

beforeEach(() => {
  jest.clearAllMocks();
  db.reset();
});

Run in random order to surface order dependence

jest --randomize

Comparing Flaky Test Detection Approaches

| Method | History per test | Identifies which tests | Branch-aware | Setup |
| --- | --- | --- | --- | --- |
| Re-run, hope, repeat | None | ✗ | ✗ | None |
| Manual spreadsheet | Manual | ~ (whoever’s tracking) | Manual | Hours/week |
| CI artifact archaeology | 30–90 days | ~ (if you click through) | ✗ | None |
| Retry configs | None | ✗ | ✗ | One line |
| Gaffer | Up to 90 days | ✓ flip rate per test | ✓ | One CI step |

What Comes With Detection

Once test history is hosted, the analytics that depend on it come along: pass-rate trends, duration regressions per test, failure clustering, and a health score that combines them. Gaffer can also post a Slack message the moment a new flaky test crosses your threshold.

Who Uses Flaky Test Detection

Small teams and OSS projects that can’t justify dedicated test-infrastructure engineers but still want to know which 10 tests in a 2,000-test suite are responsible for half the red builds.

Free tier covers 500 MB storage with 7-day retention. Paid plans extend to 90 days, which is usually long enough for the worst flakiness patterns to surface.