Analytics & Metrics

Understanding Gaffer's test analytics and how they help your team.

Gaffer provides analytics to help you understand your test suite's health and identify problems before they slow your team down. Our metrics are designed around one principle: helping developers stay productive.

The Problem with Test Suites

As test suites grow, they often become a source of friction rather than confidence. Tests that randomly pass or fail waste developer time. Slow feedback loops block PRs. And without visibility into trends, problems compound silently until the whole suite becomes a burden.

Gaffer's analytics give you the visibility to catch these issues early and keep your test suite working for you, not against you.

Health Score

The Health Score is a single number (0-100) that summarizes your test suite's overall condition. It's designed to answer the question: "How much can I trust my tests right now?"

The score is calculated from three factors:

  • Pass Rate (40%) - Higher pass rates contribute positively
  • Flaky Test Percentage (40%) - Fewer flaky tests mean a higher score
  • Trend Direction (20%) - Improving trends boost the score; declining trends reduce it

We weight flaky tests heavily because they have an outsized impact on developer productivity. A test suite with a 95% pass rate but 20% flaky tests is often more frustrating than one with a 90% pass rate and no flaky tests.
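To make the weighting concrete, here is a minimal sketch of how the three documented factors might combine. The exact formula is internal to Gaffer; the input scales (fractions for pass rate and flaky percentage, a -1/0/+1 trend value) are assumptions for illustration.

```python
def health_score(pass_rate, flaky_pct, trend):
    """Illustrative Health Score using the documented 40/40/20 weights.

    pass_rate: fraction of tests passing (0.0-1.0)
    flaky_pct: fraction of tests flagged flaky (0.0-1.0)
    trend:     -1 (declining), 0 (stable), or +1 (improving)
    """
    pass_component = pass_rate * 40          # higher pass rate -> more points
    flaky_component = (1 - flaky_pct) * 40   # fewer flaky tests -> more points
    trend_component = (trend + 1) / 2 * 20   # improving trend boosts the score
    return round(pass_component + flaky_component + trend_component)

# The example from the paragraph above: under this sketch, the suite
# with 20% flaky tests scores lower despite its higher pass rate.
print(health_score(0.95, 0.20, 0))  # → 80
print(health_score(0.90, 0.00, 0))  # → 86
```

Note how the heavy flaky-test weight drives the result: the 5-point pass-rate advantage is outweighed by the 20% flaky penalty.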

Score Ranges

  • 90-100 - Excellent. Your test suite is reliable and trustworthy.
  • 75-89 - Healthy. Minor issues to address, but tests are useful.
  • 50-74 - Needs Attention. Flaky tests or failures are impacting productivity.
  • 25-49 - At Risk. Significant issues undermining test value.
  • 0-24 - Critical. Tests may be causing more harm than good.

Flaky Test Detection

A test is marked as "flaky" when it flip-flops between passing and failing without any code changes. These tests are particularly harmful because they:

  • Force developers to re-run CI pipelines, wasting time and compute
  • Erode trust in the test suite ("it's probably just flaky, merge anyway")
  • Hide real failures in noise

How We Calculate It

We use a flip rate algorithm. For each test, we track the sequence of pass/fail results across runs. A "flip" occurs when a test changes from pass to fail (or vice versa) between consecutive runs.

The flip rate is calculated as:

flip_rate = number_of_flips / (total_runs - 1)

For example, if a test has results [pass, fail, pass, pass, fail] over 5 runs, that's 3 flips in 4 transitions = 75% flip rate.

By default, tests with a flip rate of 10% or higher are flagged as flaky. You can adjust this threshold in Settings > Analytics based on your team's tolerance.

Minimum Sample Size

To avoid false positives, we require at least 5 runs before flagging a test as flaky. A single failure in 2 runs might be a real bug; a pattern across 5+ runs is a signal.
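The detection rules above (flip rate, default 10% threshold, 5-run minimum) can be sketched in a few lines. This is an illustration of the documented behavior, not Gaffer's actual implementation:

```python
def flip_rate(results):
    """Fraction of consecutive-run transitions where the outcome changed.

    results: list of run outcomes, e.g. ["pass", "fail", "pass"]
    """
    if len(results) < 2:
        return 0.0
    flips = sum(1 for a, b in zip(results, results[1:]) if a != b)
    return flips / (len(results) - 1)

def is_flaky(results, threshold=0.10, min_runs=5):
    """Apply the documented defaults: >= 10% flip rate, at least 5 runs."""
    if len(results) < min_runs:
        return False  # too few runs to distinguish flakiness from a real bug
    return flip_rate(results) >= threshold

# The worked example from above: 3 flips in 4 transitions.
history = ["pass", "fail", "pass", "pass", "fail"]
print(flip_rate(history))  # → 0.75
print(is_flaky(history))   # → True

# Only 2 runs: never flagged, even though it flipped.
print(is_flaky(["pass", "fail"]))  # → False
```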

A Note on "Flaky" vs "Unreliable"

There's ongoing debate in the testing community about what "flaky" really means. Is it a race condition? A test environment issue? A genuine intermittent bug in production code?

Our perspective: it doesn't matter for developer productivity. Whether a test fails randomly due to timing issues, external dependencies, or cosmic rays, the impact is the same - developers waste time investigating, re-running, and eventually ignoring it.

Gaffer flags these tests so you can decide what to do: fix them, quarantine them, or delete them. The goal is to keep your test suite providing reliable signal, not noise.

Pass Rate

Pass rate is the percentage of tests that passed, calculated as:

pass_rate = passed_tests / (passed_tests + failed_tests)

Skipped tests are excluded from this calculation since they don't represent actual test execution.

We track pass rate over 30 days and show the trend direction (improving, stable, or declining) to help you spot regressions early.
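The pass-rate formula and trend classification can be sketched as follows. The skipped-test exclusion mirrors the rule above; the +/-1-point "stable" band in the trend helper is an assumption for illustration, not Gaffer's documented cutoff:

```python
def pass_rate(passed, failed):
    """Pass rate over executed tests only; skipped tests are excluded."""
    executed = passed + failed
    if executed == 0:
        return None  # nothing executed, so the rate is undefined
    return passed / executed

def trend_direction(old_rate, new_rate, tolerance=0.01):
    """Classify the change between two pass rates.

    tolerance is a hypothetical band: changes within +/-1 point
    count as "stable" rather than noise-driven movement.
    """
    delta = new_rate - old_rate
    if delta > tolerance:
        return "improving"
    if delta < -tolerance:
        return "declining"
    return "stable"

# 190 passed, 10 failed, any number skipped: skips never enter the ratio.
print(pass_rate(190, 10))           # → 0.95
print(trend_direction(0.90, 0.95))  # → improving
```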

Analytics Settings

You can configure analytics behavior in Settings > Analytics:

  • Flaky Threshold - Adjust the flip rate percentage that triggers flaky detection (default: 10%)
  • Manual Recompute - Trigger an immediate analytics refresh instead of waiting for the next scheduled computation

Data Freshness

Analytics are pre-computed every 4 hours to ensure fast dashboard loading. When you upload new test reports, the data will be reflected in the next computation cycle.

If you need immediate results (e.g., after uploading a batch of historical reports), use the "Compute now" button in Settings > Analytics.

Next Steps

Analytics work best when you have consistent test data flowing in. If you haven't already: