Playwright Test Agents: Planner, Generator & Healer

Playwright v1.56 (October 2025) shipped three built-in AI agents: a planner that reads your app and writes a test plan, a generator that turns that plan into runnable specs, and a healer that repairs tests after a selector or flow changes. The setup is a few commands. The part every other guide skips is what happens after the healer says “fixed”: a test that passes once in the agent’s session can still flake on the next fifty builds, and you only find that out if you keep the results.

This guide walks all three agents end to end, runs them in CI, then covers the question the official docs stop short of: are the tests an agent wrote or healed actually stable over time, or did they just pass today?

What are Playwright Test Agents?

Playwright Test Agents are three AI-driven workflows built into Playwright 1.56 and later. They don’t run your suite; they help you author and maintain it. Each one owns a single job:

Planner explores the running app and produces a Markdown test plan: a list of scenarios in plain language, grouped by feature.
Generator takes that plan and emits actual *.spec.ts files, driving a real browser to verify each step works before it writes the assertion.
Healer runs a failing test, watches where it breaks, and proposes a patch (usually an updated selector or a corrected wait) so a test that broke from a UI change starts passing again.

The agents are exposed as Model Context Protocol definitions that you load into an AI coding tool (Claude Code, VS Code’s agent mode, OpenCode). The tool drives the agent; Playwright provides the browser, the plan format, and the heal loop. If you’ve already wired up Playwright MCP in Claude Code, this is the same transport carrying a different set of definitions.

What are Playwright agents?

They’re AI assistants for writing and fixing Playwright tests, not a runtime your tests depend on. A useful mental model: the planner is a QA analyst sketching test cases, the generator is the engineer who types them up, and the healer is the person who fixes the one spec that broke after a redesign. You still run the resulting suite with plain npx playwright test.

Is Playwright agent free?

The agents themselves ship with Playwright, which is open source and free. There’s no Playwright-side license fee. What costs money is the AI model behind whichever coding tool you point at them. The planner and generator both make a lot of tool calls (navigate, snapshot, click, assert), so a full generation pass against a non-trivial app consumes real tokens. Budget for model usage, not for Playwright.

Does a Playwright test need an agent?

No. This is the most common point of confusion, because “agent” can be read two ways: as a runtime component that tests depend on, or as an optional AI assistant for writing them. Playwright has always run perfectly well without AI: you write specs by hand, run npx playwright test, and read the report. The Test Agents are the second kind, an optional authoring aid introduced in 1.56. Your existing suite doesn’t need them, and the agents produce ordinary spec files that run the same way handwritten ones do.

Getting started: setup and seed tests

Installing Playwright v1.56+

The agents require Playwright 1.56 or newer. Upgrade and pull the browser binaries:

npm install -D @playwright/test@latest
npx playwright install

Confirm the version, since anything below 1.56 won’t expose the agent definitions:

npx playwright --version
# Version 1.56.0 (or higher)

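You also need an AI coding tool that speaks MCP. Playwright ships agent definitions for Claude Code, VS Code, and OpenCode, and adds them to your project when you run npx playwright init-agents with the matching --loop value (claude, vscode, or opencode). The rest of this guide uses Claude Code’s syntax, but the prompts translate directly.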

Seeding your initial test files

The generator works far better when it has an example to imitate. Before running any agent, write one small “seed” spec by hand that shows the conventions you want: your fixtures, your base URL, your auth helper, your assertion style.

import { test, expect } from "@playwright/test";

test("home page loads and shows the sign-in link", async ({ page }) => {
  await page.goto("/");
  await expect(page.getByRole("link", { name: "Sign in" })).toBeVisible();
});

This file is the generator’s style guide. Without it, the agent invents its own structure and you spend the review pass undoing decisions you didn’t ask for. With it, generated specs match your existing ones.
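The seed spec calls page.goto("/"), which only resolves if a base URL is configured. A minimal config sketch follows, assuming the app runs at http://localhost:5173 (the address used in this guide's prompts) and that specs live in ./tests; adjust both to your project.

```typescript
// playwright.config.ts
// Minimal sketch: the URL and directory below are assumptions, not requirements.
import { defineConfig } from "@playwright/test";

export default defineConfig({
  // Where handwritten and generated specs are expected to live.
  testDir: "./tests",
  use: {
    // Relative navigations such as page.goto("/") resolve against this.
    baseURL: "http://localhost:5173",
  },
});
```

With the base URL in config, both the seed and any generated specs can keep relative paths, which makes them portable between local runs and CI.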

The Planner Agent: creating a test plan

The planner opens your app in a real browser, navigates around, and writes a Markdown plan describing what should be tested. It produces prose, not code. That separation is deliberate: a plan is fast to read and cheap to correct before any specs exist.

Example prompt for the planner

Point your coding tool at the Playwright planner agent and describe the surface you want covered:

Use the Playwright planner agent. The app is running at http://localhost:5173.
Explore the authentication flow (sign up, sign in, sign out, password reset)
and write a test plan covering the happy paths and the obvious failure cases.

The planner navigates the flows, then writes something like specs/auth.md:

# Authentication test plan

## Sign in
- Valid credentials land on the dashboard
- Invalid password shows an inline error, stays on /login
- Empty email disables the submit button

## Password reset
- Requesting a reset shows a confirmation message
- Reset link with an expired token shows "link expired"

How the planner structures specs vs. tests

The planner writes .md plan files, not .spec.ts test files. Keep the two straight: the plan is the human-reviewable artifact, the generated tests are the machine-runnable artifact. Review and edit the Markdown first. A wrong scenario costs one line to fix in the plan and a full regeneration to fix in code. Treat the plan as the cheap place to be wrong.
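Because the plan is plain Markdown, a reviewer can sanity-check its coverage with a few lines of code before any specs are generated. The helper below is a hypothetical sketch, not part of Playwright: it assumes the plan uses "## " headings for features and "- " bullets for scenarios, as in the example plan above.

```typescript
// Hypothetical review helper (not a Playwright API). Assumes "## " headings
// mark features and "- " bullets mark scenarios, as in the sample plan.
interface PlanSection {
  title: string;
  scenarios: string[];
}

function parsePlan(markdown: string): PlanSection[] {
  const sections: PlanSection[] = [];
  let current: PlanSection | undefined;
  for (const rawLine of markdown.split(/\r?\n/)) {
    const line = rawLine.trim();
    const heading = /^##\s+(.+)$/.exec(line);
    if (heading) {
      current = { title: heading[1], scenarios: [] };
      sections.push(current);
      continue;
    }
    // Bullets that appear before any "## " heading are ignored.
    const bullet = /^-\s+(.+)$/.exec(line);
    if (bullet && current) {
      current.scenarios.push(bullet[1]);
    }
  }
  return sections;
}

const plan = `# Authentication test plan

## Sign in
- Valid credentials land on the dashboard
- Invalid password shows an inline error, stays on /login
- Empty email disables the submit button

## Password reset
- Requesting a reset shows a confirmation message
- Reset link with an expired token shows "link expired"
`;

for (const section of parsePlan(plan)) {
  console.log(`${section.title}: ${section.scenarios.length} scenario(s)`);
}
```

A per-feature count like this makes thin sections easy to spot, which is a cue to add scenarios to the plan before regenerating.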

The Generator Agent: writing the tests

The generator reads a plan file and turns each scenario into a real test. The thing that makes it more reliable than asking a raw LLM to “write Playwright tests” is that it drives the browser as it goes: it navigates to the page, snapshots the accessibility tree, finds the actual role and label of the element, and writes a selector it has confirmed resolves. It isn’t guessing selectors from memory.

Example prompt for the generator

Use the Playwright generator agent. Read specs/auth.md and generate tests
into tests/auth.spec.ts. Match the conventions in tests/seed.spec.ts.

The generator works scenario by scenario, verifying each step against the live app before committing the assertion. The output is a normal spec file:

// tests/auth.spec.ts (generated)
import { test, expect } from "@playwright/test";

test("invalid password shows an inline error", async ({ page }) => {
  await page.goto("/login");
  await page.getByLabel("Email").fill("user@example.com");
  await page.getByLabel("Password").fill("wrong-password");
  await page.getByRole("button", { name: "Sign in" }).click();
  await expect(page.getByText("Incorrect email or password")).toBeVisible();
  await expect(page).toHaveURL(/\/login/);
});

Reviewing and editing generated tests

Read every generated test before committing it. The generator confirms selectors resolve, but it can’t tell whether the assertion captures the behavior you actually care about. Two failure modes to look for:

Over-fitted assertions. The agent may assert on the exact error string. If your copy changes, the test breaks for the wrong reason. Loosen it to a stable substring or a role.
Missing negative cases. Generators are good at happy paths and weaker at the “this should be blocked” cases. The plan-review step is where you add those back.

Generated tests are a draft you edit, not output you trust blind. The verification step buys you correct selectors; it doesn’t buy you good test design.

The Healer Agent: fixing broken tests

The healer is the agent most teams reach for first, because broken-selector maintenance is the tax on every E2E suite. When a test fails, the healer re-runs it, observes where the failure happens, inspects the live page at that point, and proposes a patch.

How to use the healer agent

Run it against a failing test:

Use the Playwright healer agent on tests/auth.spec.ts. The "sign in" test
is failing after yesterday's login redesign. Diagnose and fix it.

The healer reproduces the failure, snapshots the page at the break point, and figures out that (for example) the “Sign in” button is now labeled “Log in” and the email field moved behind a “Continue with email” step. It rewrites the selectors and the navigation, re-runs, and hands you a green test plus a diff.

Strategies for getting reliable heals

Heal one test at a time. Batch heals make it hard to see whether each patch is sound. One test, one diff, one review.
Give it the “why.” Telling the healer “the login page was redesigned” narrows its search. Without that, it can spend its budget exploring unrelated theories.
Read the diff like a code review. A heal that swaps a precise getByRole for a brittle nth(3) or a hard waitForTimeout is technically green and a future flake. Reject those.
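Part of that diff review can be automated. The sketch below is a hypothetical helper, not a Playwright feature: it scans the lines a heal added to a unified diff and flags the two patterns named above. The pattern list is an assumption and is deliberately small.

```typescript
// Hypothetical review helper (not a Playwright API): flags brittle patterns
// in the lines a heal added, based on a unified diff.
interface Finding {
  line: string;
  reason: string;
}

const BRITTLE_PATTERNS: Array<{ pattern: RegExp; reason: string }> = [
  { pattern: /\.waitForTimeout\s*\(/, reason: "hard-coded wait" },
  { pattern: /\.nth\s*\(\s*\d+\s*\)/, reason: "positional selector" },
];

function reviewHealDiff(diff: string): Finding[] {
  const findings: Finding[] = [];
  for (const line of diff.split(/\r?\n/)) {
    // Only added lines matter; skip the "+++ b/file" header line.
    if (!line.startsWith("+") || line.startsWith("+++")) continue;
    for (const { pattern, reason } of BRITTLE_PATTERNS) {
      if (pattern.test(line)) {
        findings.push({ line: line.slice(1).trim(), reason });
      }
    }
  }
  return findings;
}

// Illustrative diff, invented for this example.
const exampleDiff = [
  "--- a/tests/auth.spec.ts",
  "+++ b/tests/auth.spec.ts",
  '-  await page.getByRole("button", { name: "Sign in" }).click();',
  '+  await page.getByRole("button").nth(3).click();',
  "+  await page.waitForTimeout(2000);",
  "+  await expect(page).toHaveURL(/\\/dashboard/);",
].join("\n");

for (const finding of reviewHealDiff(exampleDiff)) {
  console.log(`${finding.reason}: ${finding.line}`);
}
```

A check like this catches only the obvious cases. It complements the human review rather than replacing it.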

What the Playwright healer agent cannot fix

The healer fixes tests, not bugs. If the test fails because the application is genuinely broken, the “correct” heal is no heal: the test is doing its job. A healer that patches around a real regression has made your suite lie. It also can’t fix a test whose intent it can’t infer (a missing assertion, an unclear scenario name) and it can’t repair flakiness rooted in timing or test-data races, because those don’t reproduce reliably enough for it to observe the break. The healer is strongest on structural drift (renamed labels, moved fields, changed routes) and weakest on anything nondeterministic.

That last gap is the whole reason the next two sections exist. A healer can make a flaky test pass once. Whether it stays passing is a question the heal session can’t answer.

Running Playwright agents in CI (GitHub Actions example)

The planner and generator are authoring tools you run locally while writing tests. The healer is the one worth wiring into CI, because the failures you most want patched are the ones CI surfaces. Either way, the suite the agents produce runs in CI as ordinary Playwright tests, and the results need to go somewhere durable so you can answer the stability question later.

A workflow that runs the suite, then uploads the results to Gaffer regardless of pass or fail:

name: Playwright

on:
  push:
    branches: [main]
  pull_request:

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
          cache: npm

      - name: Install dependencies
        run: npm ci

      - name: Install Playwright browsers
        run: npx playwright install --with-deps

      - name: Run Playwright tests
        run: npx playwright test --reporter=junit,html
        env:
          PLAYWRIGHT_JUNIT_OUTPUT_NAME: results/junit.xml

      - name: Upload results to Gaffer
        if: always()
        uses: gaffer-sh/gaffer-uploader@v2
        with:
          gaffer_upload_token: ${{ secrets.GAFFER_UPLOAD_TOKEN }}
          report_path: ./results
          commit_sha: ${{ github.sha }}
          branch: ${{ github.ref_name }}
          test_framework: playwright

The if: always() matters. The runs you most want to keep are the failing ones, so gating the upload on success throws away exactly the data you need to judge whether a healed test holds. The same upload works from the CLI if you’d rather not use the Action:

gaffer upload ./results \
  --token $GAFFER_UPLOAD_TOKEN \
  --commit-sha $GITHUB_SHA \
  --branch $GITHUB_REF_NAME \
  --test-framework playwright

What to do with agent-generated test results

Every guide to these agents ends at the green checkmark. That’s the wrong place to stop. A test that an AI generated or healed is a test whose stability you have less reason to trust, not more, until it has survived a few dozen real runs. The agent optimized for “make this pass now.” It did not optimize for “make this pass every Tuesday for the next three months.” Those are different goals, and the gap between them is exactly where flaky tests live.

So keep the results. One Playwright run tells you the suite passed today. Fifty runs tell you whether the test the healer touched last week has been quietly flaking ever since.

Are agent-written tests actually stable?

You can’t answer this from a single run, and you can’t answer it from CI logs that expire. You answer it by treating each run as a row in a time series and asking how a specific test has behaved across all of them. The signal you’re looking for is the flip rate: how often a test switches between pass and fail without the code under test changing.
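To make the flip rate concrete, here is a minimal sketch that computes it from an ordered list of outcomes. It assumes the simplest definition, flips divided by the number of consecutive run pairs; the formula Gaffer actually uses may differ.

```typescript
type Outcome = "pass" | "fail";

// Assumed definition: (number of pass/fail switches) / (runs - 1).
// This is an illustration, not necessarily the formula any tool uses.
function flipRate(outcomes: Outcome[]): number {
  if (outcomes.length < 2) return 0;
  let flips = 0;
  for (let i = 1; i < outcomes.length; i++) {
    if (outcomes[i] !== outcomes[i - 1]) flips++;
  }
  return flips / (outcomes.length - 1);
}

const steady: Outcome[] = ["pass", "pass", "pass", "pass", "pass"];
const broken: Outcome[] = ["fail", "fail", "fail", "fail", "fail"];
const sawtooth: Outcome[] = ["pass", "pass", "fail", "pass", "fail"];

console.log("steady:", flipRate(steady));
console.log("broken:", flipRate(broken));
console.log("sawtooth:", flipRate(sawtooth));
```

Note that a consistently failing test scores zero, the same as a consistently passing one. That is the point of the metric: it separates flaky tests from broken ones, which a plain pass rate cannot do.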

This is where an AI coding agent stops driving the browser and starts reading history. Gaffer exposes its analytics through an MCP server, so the same agent that healed the test can ask whether the heal held. The tools that answer the stability question directly:

get_flaky_tests returns every test above a flip-rate threshold, each with its flipRate, flipCount, totalRuns, and a composite flakinessScore, over a window you choose (default 30 days). Run it after a healing session: if a test the healer “fixed” shows up here with a high flip count, it didn’t get fixed; it got patched into passing once.
get_test_history returns the pass/fail record for one named test across recent runs, with branch, commit, and duration per entry, plus a summary pass rate. The single most useful question to ask before trusting a healed test: did it actually stay green on main after the heal landed, or did it flip back two builds later?
get_project_health gives the top-level pulse: a health score, pass rate, run count, flaky-test count, and trend for the window. It’s the “is the suite on fire?” check an agent runs before it starts work.

The MCP reference, including every tool’s input and output schema, is at /docs/mcp/. For the classification logic behind “is this test flaky or actually broken,” the flaky test detection page walks through how the flip rate and flakiness score are computed.

Tracking healed-test quality over time with Gaffer

Here’s the loop that closes the gap the official docs leave open. The agent heals a failing test and pushes. CI runs the suite and uploads the result. On a later session, before declaring the heal a success, the agent calls get_test_history on that exact test and looks at the record since the heal landed. A heal that holds shows an unbroken run of passes. A heal that didn’t hold shows a sawtooth: pass, pass, fail, pass, fail.
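The check at the end of that loop can be sketched as a small function over the history records. The entry shape below is hypothetical: the text only says each entry carries branch, commit, and duration, so the field names, the status values, and the minimum-run threshold are all assumptions for illustration.

```typescript
// Hypothetical entry shape. The real get_test_history response may name
// these fields differently; check the MCP schema before relying on it.
interface HistoryEntry {
  commit: string;
  branch: string;
  status: "passed" | "failed";
}

interface HealVerdict {
  runsSinceHeal: number;
  failuresSinceHeal: number;
  held: boolean;
}

// Assumes `history` is ordered oldest to newest.
function checkHeal(
  history: HistoryEntry[],
  healCommit: string,
  minRuns: number,
  branch = "main",
): HealVerdict {
  const healIndex = history.findIndex((entry) => entry.commit === healCommit);
  const sinceHeal =
    healIndex === -1
      ? []
      : history.slice(healIndex).filter((entry) => entry.branch === branch);
  const failuresSinceHeal = sinceHeal.filter((entry) => entry.status === "failed").length;
  return {
    runsSinceHeal: sinceHeal.length,
    failuresSinceHeal,
    // Too few runs means "not proven yet", so held stays false.
    held: sinceHeal.length >= minRuns && failuresSinceHeal === 0,
  };
}

// Invented history: the heal lands at commit "b2", then the test sawtooths.
const sawtoothHistory: HistoryEntry[] = [
  { commit: "a1", branch: "main", status: "failed" },
  { commit: "b2", branch: "main", status: "passed" },
  { commit: "c3", branch: "main", status: "passed" },
  { commit: "d4", branch: "main", status: "failed" },
  { commit: "e5", branch: "main", status: "passed" },
  { commit: "f6", branch: "main", status: "failed" },
];

console.log(checkHeal(sawtoothHistory, "b2", 5));
```

Requiring a minimum number of runs is a design choice: it keeps an agent from declaring success on the strength of the one green run it produced itself.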

To make this concrete, here’s the shape of what get_project_health returns against a healthy suite (these figures illustrate the response format, not a fixed promise):

{
  "projectName": "dashboard",
  "healthScore": 95,
  "passRate": 99.9,
  "testRunCount": 170,
  "flakyTestCount": 4,
  "trend": "stable"
}

A health score in the 90s with a low single-digit flaky count is the state you want a healed test to leave you in. If healing a test pushes flakyTestCount up or drags passRate down over the following weeks, the heal traded a visible failure for an invisible flake. That trade is worse than the original break, because a hard failure stops the line and a flake just erodes trust in the suite until people start re-running red jobs and ignoring the failures.
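That before-and-after comparison is easy to sketch. The snippet below reuses the field names from the sample response above. The "after" figures and the pass-rate tolerance are invented for illustration, not real results.

```typescript
// Field names follow the sample get_project_health response shown above.
interface ProjectHealth {
  projectName: string;
  healthScore: number;
  passRate: number;
  testRunCount: number;
  flakyTestCount: number;
  trend: string;
}

// Returns human-readable warnings; an empty array means no regression seen.
// passRateTolerance (percentage points) is an assumed threshold.
function compareHealth(
  before: ProjectHealth,
  after: ProjectHealth,
  passRateTolerance = 0.5,
): string[] {
  const warnings: string[] = [];
  if (after.flakyTestCount > before.flakyTestCount) {
    warnings.push(
      `flaky tests rose from ${before.flakyTestCount} to ${after.flakyTestCount}`,
    );
  }
  if (before.passRate - after.passRate > passRateTolerance) {
    warnings.push(
      `pass rate fell from ${before.passRate.toFixed(1)} to ${after.passRate.toFixed(1)}`,
    );
  }
  return warnings;
}

const beforeHeal: ProjectHealth = {
  projectName: "dashboard",
  healthScore: 95,
  passRate: 99.9,
  testRunCount: 170,
  flakyTestCount: 4,
  trend: "stable",
};

// Hypothetical snapshot a few weeks after a heal that did not hold.
const afterHeal: ProjectHealth = {
  ...beforeHeal,
  healthScore: 88,
  passRate: 98.7,
  testRunCount: 240,
  flakyTestCount: 6,
};

for (const warning of compareHealth(beforeHeal, afterHeal)) {
  console.log(warning);
}
```

Project-level numbers move for many reasons, so a warning here is a prompt to look at the healed test's own history, not proof that the heal caused the change.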

That over-time view is the part you can’t get from the heal session or the Playwright HTML report. Both show you one moment. Stability is a property of many moments, and you only see it if you keep the results and ask the right question of them later.

Artifacts and file conventions

A few conventions keep an agent-driven Playwright setup legible:

Plans live in specs/*.md. The planner writes Markdown plans here. They’re human-reviewable and belong in version control so the next planner run can diff against them.
Tests live in tests/*.spec.ts. The generator writes runnable specs here, matching your seed file’s conventions. These are the files CI runs.
Traces and reports are CI artifacts. Playwright’s HTML report and trace zips capture a single run in detail. Upload them alongside the JUnit XML so a failing run links to the exact trace, while the cross-run history lives in your analytics layer.

Keep the moment and the trend in separate places. The Playwright report and trace are the best tool for debugging one run. Knowing whether a healed test stayed healed across the next fifty is a different question, and it’s the one that tells you whether the agents are actually saving you maintenance or just deferring it.
