2022-04-04

Fixing flaky end-to-end tests with Playwright and Reflow

Five strategies for de-flaking browser tests — stability events, intelligent waiting, selector design, wait-until checkpoints, and zero-dependency data seeding — applicable whether or not you use reflow.

End-to-end testing exercises an application’s workflow from start to finish, the way a real user would. It is the highest-fidelity automated check a product team has — and the hardest to keep healthy:

Tests generally assume the system starts in a consistent state, which means seeding or wiping data around every run.
Application changes break test sequences that were true when they were written.
Even when nothing changes, some tests fail anyway. These are flaky.

Building reflow has meant accumulating a toolkit for fixing flaky tests and healing sequences when the application changes. Reflow records browser flows and replays them with self-repair, built on Playwright — so every technique below applies to a plain Playwright suite too.

Why this matters

Flaky tests cost time.

Every flake triggers an investigation: true failure or false positive? Multiplied across a team, this cost can outweigh the value the test provides.
A flaked test can block downstream jobs until someone re-runs or fixes it.
A failed run can leave the system in an unknown state that takes manual effort to clean up.

Flaky tests cost morale. A test exists because someone cared enough to automate away manual effort. When that effort returns as recurring flake triage, the team knows it is stuck maintaining the flake in perpetuity.

Flaky tests kill QA programs. Flaky tests aren’t trusted; untrusted tests get deprecated and deleted; deleted tests take coverage with them; and a less-trusted codebase slows everyone down. Call it the QA death cycle.

Strategy 1: generic pre-action stability

The most common flake we see is interacting with the page too quickly. Elements render before they can be safely interacted with, so checking existence is not enough.

Playwright exposes generic stability events through waitForLoadState:

await page.waitForLoadState('domcontentloaded', { timeout: 15000 });
await page.waitForLoadState('load', { timeout: 30000 });
await page.waitForLoadState('networkidle', { timeout: 5000 });

domcontentloaded: the initial HTML document has been loaded and parsed, without waiting for images or subframes to finish loading. For SPAs the application has typically not rendered its content yet, so this is usually too early.
load: all markup, stylesheets, scripts, and static assets are loaded. Still too early for SPAs that fetch data after first render.
networkidle: no network connections for at least 500ms. Useful for data-fetching SPAs, though it can fire too early or too late, and Playwright’s documentation discourages relying on it in tests.
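One way to combine the three events is to treat the first two as hard requirements and networkidle as best effort, since it may never fire on pages that poll or stream. This is a minimal sketch, not a prescribed pattern: `settle` and `LoadStateWaiter` are hypothetical names, and the interface is a structural subset of Playwright's Page declared locally so the snippet stands alone.

```typescript
// Structural subset of Playwright's Page; a real Page object satisfies it.
type LoadState = 'domcontentloaded' | 'load' | 'networkidle';
interface LoadStateWaiter {
  waitForLoadState(state?: LoadState, options?: { timeout?: number }): Promise<void>;
}

// Hypothetical helper: the first two states are required, networkidle is optional.
// Returns the states that were actually reached, in order.
async function settle(page: LoadStateWaiter): Promise<LoadState[]> {
  const reached: LoadState[] = [];
  await page.waitForLoadState('domcontentloaded', { timeout: 15000 });
  reached.push('domcontentloaded');
  await page.waitForLoadState('load', { timeout: 30000 });
  reached.push('load');
  try {
    await page.waitForLoadState('networkidle', { timeout: 5000 });
    reached.push('networkidle');
  } catch {
    // The page never went idle within 5s. Carry on and let the
    // element-level waits decide whether the page is usable.
  }
  return reached;
}
```

Called as `await settle(page)` before the first action on a freshly loaded page, it bounds the total wait while still tolerating a page that keeps a connection open.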

Reflow adds a fourth event: screenshotstable — the page has stopped changing visually and looks like it did in the most recent successful run. Most applications either show a loading animation or re-render continuously while loading, so “the page looks settled and familiar” is a stronger signal than any network event. Each run stores a screenshot of the page before every action; the next run compares against it for the same device, browser, and operating system. Timeouts are tuned automatically from how the page behaved in the recording and the last successful run, so a changed application fails fast instead of hanging.
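The "stopped changing visually" half of that idea can be approximated in a plain Playwright suite. The sketch below is an assumption-laden illustration, not Reflow's implementation: `waitForVisualStability` and `Screenshotter` are hypothetical names, it compares consecutive screenshots byte for byte, and it does not compare against a stored baseline from an earlier run.

```typescript
// Structural subset of Playwright's Page. page.screenshot() resolves to a
// Buffer, which is a Uint8Array, so a real Page object satisfies it.
interface Screenshotter {
  screenshot(): Promise<Uint8Array>;
}

const sameBytes = (a: Uint8Array, b: Uint8Array): boolean =>
  a.length === b.length && a.every((byte, i) => byte === b[i]);

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

// Hypothetical helper: resolves true once `stableFrames` consecutive
// screenshots are identical, or false if the timeout elapses first.
async function waitForVisualStability(
  page: Screenshotter,
  { timeout = 10000, interval = 250, stableFrames = 2 } = {},
): Promise<boolean> {
  const deadline = Date.now() + timeout;
  let previous = await page.screenshot();
  let streak = 1; // number of identical frames seen in a row
  while (Date.now() < deadline) {
    await sleep(interval);
    const current = await page.screenshot();
    streak = sameBytes(previous, current) ? streak + 1 : 1;
    if (streak >= stableFrames) return true;
    previous = current;
  }
  return false;
}
```

Exact byte equality is deliberately strict; a blinking cursor or a clock would keep it from ever settling, so a practical version would likely mask such regions or allow a small pixel-difference threshold.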

Strategy 2: intelligent waiting

If an action expects to be on a given page, wait for that navigation explicitly (page.waitForURL, which current Playwright documentation recommends over the older page.waitForNavigation). Multiple load events can fire during one navigation sequence, so a load-state wait alone is not always enough.

If an action targets a specific element, wait on element-level conditions:

attached to the DOM
visible
stable — not animating, or animation completed
able to receive events
enabled, for clickable elements
editable, for text-entry elements

Playwright applies these automatically per interaction type — its actionability checks.
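Put together, "confirm the page, then act" might look like the sketch below. `clickWhenOn` and `NavigablePage` are hypothetical names and the timeouts are arbitrary; `waitForURL` and `click` are existing Playwright Page methods, declared here as a structural subset so the snippet stands alone.

```typescript
// Structural subset of Playwright's Page; a real Page object satisfies it.
interface NavigablePage {
  waitForURL(url: string | RegExp, options?: { timeout?: number }): Promise<void>;
  click(selector: string, options?: { timeout?: number }): Promise<void>;
}

// Hypothetical helper: wait for the expected URL first, then interact.
// click() itself waits for the actionability checks listed above.
async function clickWhenOn(
  page: NavigablePage,
  expectedUrl: string | RegExp,
  selector: string,
): Promise<void> {
  await page.waitForURL(expectedUrl, { timeout: 30000 });
  await page.click(selector, { timeout: 10000 });
}
```

A call such as `await clickWhenOn(page, /\/dashboard$/, '[data-test-id="new-test"]')` (both the URL pattern and the selector are made up for illustration) fails with a navigation timeout rather than a confusing missing-element error when the app lands on the wrong page.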

Reflow extends element stability with the same visual baseline it uses for pages: wait until the element looks as it did in the last successful run, with the timeout derived from how long the element historically took to settle. If the button took 7 seconds to turn green via a class change last time, the replay waits at least that long before giving up.
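How such a timeout is derived is not spelled out here, so the following is only one plausible policy, with hypothetical names and constants: take the slowest settle time on record, multiply it by a safety factor, and clamp the result.

```typescript
// Hypothetical policy, not Reflow's actual formula: scale the slowest
// recorded settle time by a safety factor, clamped to [floorMs, ceilingMs].
function timeoutFromHistory(
  settleTimesMs: number[],
  { factor = 2, floorMs = 1000, ceilingMs = 60000 } = {},
): number {
  if (settleTimesMs.length === 0) return ceilingMs; // no history yet: be patient
  const slowest = Math.max(...settleTimesMs);
  return Math.min(ceilingMs, Math.max(floorMs, Math.round(slowest * factor)));
}
```

With the defaults, a button that took 7 seconds last time gets a 14-second budget, while an element that used to appear instantly still fails within one second if the application has changed.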

Strategy 3: pick good selectors

When an application changes, the locators identifying elements change with it. The strongest defense is a deliberate, stable attribute:

<button data-test-id={`test-actions-${testId}`} />

await page.click(`[data-test-id="test-actions-${testId}"]`);

Where adding test attributes is undesirable, prefer selectors that encode meaning rather than structure:

Selector	Why it tends to be stable
`[placeholder="..."]`	Placeholder text is often unique to the element
`[aria-label="..."]`	Assistive-technology label; changes only when meaning changes
`img[alt="..."]`	Alternate text changes only when the image’s meaning changes
`[role="..."]`	Semantic role for assistive technologies
`input[type="..."]`	Input types rarely change and are often unique within a short form
`nodeName`	If a node type appears only once on the page (`a`, `input`, `button`), the tag name alone is enough
`#id`	Unique ids added for scripting tend to persist

Reflow collects candidate selectors automatically at recording time and scores them by uniqueness and type. It also:

Combines parent and child selectors to remove ambiguity — [data-test-id="foo"] >> [data-test-id="bar"].
Ranks every viable selector set at replay time and picks the page element closest to the one used in the last successful run.
Falls back to comparing partial-match candidates against a screenshot of the previous element — a visual selector — to heal the locator when CSS alone cannot find or disambiguate it.
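Scoring by uniqueness and type can be illustrated with a few lines. The weights, type names, and `rankSelectors` function below are hypothetical and are not Reflow's scoring. They only show the shape of the idea: deliberate or semantic attributes start with a higher weight, and ambiguity divides it.

```typescript
type SelectorKind = 'test-id' | 'aria-label' | 'role' | 'id' | 'node-name' | 'css-path';

interface Candidate {
  selector: string;
  kind: SelectorKind;
  matchCount: number; // how many elements the selector matched when recorded
}

// Hypothetical weights: meaning-bearing attributes outrank structural paths.
const KIND_WEIGHT: Record<SelectorKind, number> = {
  'test-id': 100,
  'aria-label': 80,
  role: 60,
  id: 50,
  'node-name': 30,
  'css-path': 10,
};

// Drops selectors that match nothing, then orders the rest best-first.
// A selector matching four elements keeps a quarter of its type weight.
function rankSelectors(candidates: Candidate[]): Candidate[] {
  return candidates
    .filter((c) => c.matchCount >= 1)
    .map((c) => ({ c, score: KIND_WEIGHT[c.kind] / c.matchCount }))
    .sort((a, b) => b.score - a.score)
    .map(({ c }) => c);
}
```

Under these weights a unique aria-label beats a test id shared by four rows, which is the case where combining parent and child selectors becomes worthwhile.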

Strategy 4: “wait until” checkpoints

Sometimes the right move is an application-specific assertion that the system has reached a known state. If a page element represents a calculation, wait for the calculated value:

await page.waitForSelector('[aria-label="calculation"] >> text=29.76');
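When the known state cannot be expressed as a selector, the same checkpoint idea generalizes to polling any probe until a predicate holds. The helper below is a sketch with hypothetical names and defaults; Playwright also ships its own polling primitives, which may be a better fit in a real suite.

```typescript
// Hypothetical helper: repeatedly evaluate `probe` until `isReady` accepts
// its value, or throw once `timeout` milliseconds have elapsed.
async function waitUntil<T>(
  probe: () => Promise<T> | T,
  isReady: (value: T) => boolean,
  { timeout = 30000, interval = 500 } = {},
): Promise<T> {
  const deadline = Date.now() + timeout;
  for (;;) {
    const value = await probe();
    if (isReady(value)) return value;
    if (Date.now() >= deadline) {
      throw new Error(`waitUntil: not ready after ${timeout}ms (last value: ${String(value)})`);
    }
    await new Promise<void>((resolve) => setTimeout(resolve, interval));
  }
}
```

For the calculation example this could read `await waitUntil(() => page.textContent('[aria-label="calculation"]'), (text) => text?.trim() === '29.76')`, and the error message reports the last value seen, which helps separate "still loading" from "wrong result".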

A visual variant of this is the strongest de-flaking primitive reflow has. Most of reflow’s own test suite (reflow tests itself) starts by creating a fresh test, navigating to its recording UI, and waiting — up to five minutes — until the page matches a recorded screenshot. That one checkpoint absorbs server cold-starts and DNS propagation without a line of code. When the starting page legitimately changes, the run fails and offers to heal the baseline to the new snapshot.

Strategy 5: zero-dependency data seeding

The “consistent starting state” problem deserves its own test stage: a small initial suite whose only job is resetting application data.

Reflow’s own suite does this in-product: add a user to a team, accept the invite, then remove them. The user’s data stays with the team they left, so the user starts empty. For most applications the pragmatic version is an endpoint invoked at the start of the run:

await page.request.post(`${event.variables.url}/reset/scenario/empty?token=${event.variables.secret}`);
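A slightly more defensive version of that call encodes the parameters and stops the run when the reset itself fails, so later tests do not run against stale data. `resetUrl`, `resetData`, and `RequestLike` are hypothetical names; the endpoint shape is taken from the line above, and the interface is a structural subset of the object Playwright exposes as `page.request`.

```typescript
// Structural subset of Playwright's APIRequestContext (page.request).
interface RequestLike {
  post(url: string): Promise<{ ok(): boolean; status(): number }>;
}

// Builds the reset URL, tolerating a trailing slash on the base URL
// and escaping characters that would otherwise break the query string.
function resetUrl(baseUrl: string, scenario: string, token: string): string {
  const base = baseUrl.replace(/\/+$/, '');
  return `${base}/reset/scenario/${encodeURIComponent(scenario)}?token=${encodeURIComponent(token)}`;
}

// Hypothetical helper: reset application data, or fail loudly.
async function resetData(
  request: RequestLike,
  baseUrl: string,
  scenario: string,
  token: string,
): Promise<void> {
  const response = await request.post(resetUrl(baseUrl, scenario, token));
  if (!response.ok()) {
    throw new Error(`Reset to "${scenario}" failed with HTTP ${response.status()}`);
  }
}
```

In the setting above this would be called as `await resetData(page.request, event.variables.url, 'empty', event.variables.secret)`.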

Is there a way to eliminate all flaky tests forever?

No.

These strategies drastically reduce the time a team spends on end-to-end maintenance, but the cost never reaches zero while the application is actively developed. The realistic goal is to keep QA effort at the boundary of new feature development, instead of endlessly re-covering existing features on every change.

That is the value proposition reflow is built around: the expensive part of end-to-end testing is not writing tests, it is keeping them true as the product changes — so the tool should carry that burden, using everything it learned about your application when the flow was recorded.