The flow format
A flow is one Markdown file in `.reflow/flows/`. Steps and assertions are
plain language: no selectors, no test framework syntax. Reflow compiles them
to Playwright at run time and caches the compilation, so execution is fast
when nothing has changed and adaptive when something has.
Anatomy
    ---
    slug: checkout-happy-path
    name: Checkout — happy path
    url: /checkout
    devices: [desktop, iphone-15, ipad]
    budgets:
      minutes: 5
      tokens: 50000
    provider_key: anthropic-prod # optional override
    tags: [smoke, billing]
    ---

    A signed-in user buys a single item.

    ## Steps
    1. Add the first product on the page to the cart
    2. Check out with the standard test card
    3. The confirmation page shows an order number and the correct total

    ## Always true
    - the cart badge shows the running item count
    - prices never render as NaN or $0.00
    - no error banners appear at any point

Three parts:
- Prose intent (anything before `## Steps`): what this flow is for, in a sentence or two. The agent uses it to judge whether the end state is right.
- `## Steps`: ordered, cucumber-esque natural language. Each line is one user action or one observable outcome.
- `## Always true`: assertions about important state, checked throughout the run. These are not selectors but statements about state that matters, in your words.
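To make the three parts concrete, here is a minimal sketch of how a flow file could be split into them. It is an illustration only, not Reflow's parser: the `Flow` class and `split_flow` function are hypothetical names, and the frontmatter is kept as raw text rather than parsed as YAML so the example needs only the standard library.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Flow:
    frontmatter: str                      # raw YAML text, left unparsed in this sketch
    intent: str                           # prose before "## Steps"
    steps: list[str] = field(default_factory=list)
    always_true: list[str] = field(default_factory=list)


def split_flow(text: str) -> Flow:
    """Split a flow file into frontmatter, intent, steps, and assertions."""
    match = re.match(r"\A---\n(.*?)\n---\n(.*)\Z", text, re.DOTALL)
    if match is None:
        raise ValueError("flow file must start with a '---' frontmatter block")
    frontmatter, body = match.group(1), match.group(2)

    # Splitting on a pattern with a capture group keeps the heading names:
    # [intent, "Steps", steps_text, "Always true", assertions_text]
    parts = re.split(r"^## (Steps|Always true)[ \t]*$", body, flags=re.MULTILINE)
    sections = dict(zip(parts[1::2], parts[2::2]))

    def items(block: str, marker: str) -> list[str]:
        # Keep only lines that start with the list marker, minus the marker.
        found = (re.match(marker + r"(.+)", line.strip()) for line in block.splitlines())
        return [m.group(1).strip() for m in found if m]

    return Flow(
        frontmatter=frontmatter,
        intent=parts[0].strip(),
        steps=items(sections.get("Steps", ""), r"\d+\.\s+"),
        always_true=items(sections.get("Always true", ""), r"-\s+"),
    )
```

Calling `split_flow` on the checkout example would give three steps and three always-true assertions, with the one-sentence intent available separately for the final check.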
How execution works
- Compile. The agent translates each step into RFL (the Reflow Language), a deterministic step language embedded in fenced blocks under your steps and committed with the flow. RFL maps 1:1 onto Playwright, so from there execution is a pure function.
- Run fast while reality matches. RFL executes like any scripted test: real browser, no model in the loop, seconds per step. A step line that doesn’t parse as RFL still works: the agent evaluates it as free text, which is slower, and proposes RFL for next time.
- Adapt when reality changed. If an RFL anchor no longer matches the page (the button became a menu item, the form gained a step), the agent re-reads your plain-language step, regenerates the RFL against the live page, and carries on. The update arrives in your PR as a small diff.
- Judge state, not pixels alone. `Always true` assertions and the final intent check are semantic: the model verifies them against the live page and screenshots.
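The compile, run, and adapt phases can be pictured as a small control loop. The sketch below is a simplification under stated assumptions: `Step`, `AnchorMismatch`, `run_step`, and the two callables are hypothetical names, not Reflow APIs, and the free-text evaluation path is left out.

```python
from dataclasses import dataclass
from typing import Callable, Optional


class AnchorMismatch(Exception):
    """Hypothetical signal that cached RFL no longer matches the page."""


@dataclass
class Step:
    text: str                    # the plain-language step from the flow file
    rfl: Optional[str] = None    # cached compilation, if one exists


def run_step(
    step: Step,
    execute_rfl: Callable[[str], None],   # stands in for the deterministic executor
    compile_step: Callable[[str], str],   # stands in for the model-backed compiler
) -> str:
    """Run one step and report the path taken: 'compiled', 'cached', or 'healed'."""
    if step.rfl is None:
        # Nothing cached yet: compile once, then execute.
        step.rfl = compile_step(step.text)
        execute_rfl(step.rfl)
        return "compiled"
    try:
        execute_rfl(step.rfl)             # fast path, no model in the loop
        return "cached"
    except AnchorMismatch:
        # The page changed: regenerate from the plain-language step and retry once.
        step.rfl = compile_step(step.text)
        execute_rfl(step.rfl)
        return "healed"
```

In this sketch a second mismatch after recompiling propagates as an error. Whether Reflow retries further is not stated in the text.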
Because the file is natural language, a redesign doesn’t rot it. What changes on heal is rarely the flow — usually it’s the expectations, and those come back to your PR as a plain-language diff:
     ## Always true
     - the cart badge shows the running item count
    -- the page shows no banners above the product grid
    +- a single promo banner may appear above the product grid

Outcome semantics
- Pass: steps completed, expectations hold, end state satisfies intent.
- Visual differences detected: expectations hold but the page diverges from the visual baseline. This is not a failure; the check shows a comparison and proposes a new baseline (and any expectation edits) for one-click approval.
- Fail: an expectation about important state cannot be met (the checkout 500s, the confirmation never renders, data is wrong). There is nothing to heal; that’s a product regression, and the run report says exactly what broke.
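The three outcomes amount to a small decision rule. A minimal sketch, assuming four boolean signals per run (the `Outcome` enum and `classify` function are illustrative names, and treating an uncompleted step as a failure is an assumption rather than something the text states):

```python
from enum import Enum


class Outcome(Enum):
    PASS = "Pass"
    VISUAL_DIFF = "Visual differences detected"
    FAIL = "Fail"


def classify(
    steps_completed: bool,
    expectations_hold: bool,
    intent_satisfied: bool,
    matches_baseline: bool,
) -> Outcome:
    # A broken expectation about state is a failure, whatever the pixels say.
    if not (steps_completed and expectations_hold and intent_satisfied):
        return Outcome.FAIL
    # State is fine but the page looks different: surfaced for approval, not failed.
    if not matches_baseline:
        return Outcome.VISUAL_DIFF
    return Outcome.PASS
```

The ordering matters: state checks come first, so a visual difference is only ever reported for a run whose expectations already hold.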
Devices
`devices:` lists the viewports the flow runs on. Every entry executes the
same plain-language steps with its own visual baseline and its own evidence
in the PR check. Named profiles cover phones, tablets, and desktop
(`iphone-15`, `pixel-9`, `ipad`, `desktop`, …); omitting `devices` runs
desktop only. A step that only applies to one form factor reads naturally:
    ## Steps
    1. Open the navigation menu (on phones this is behind the hamburger button)
    2. Go to "Account settings"

The agent resolves form-factor differences from the prose; RFL fallback chains keep both compilations cached.
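The per-device fan-out and the desktop default can be sketched as follows. This assumes the frontmatter has already been parsed into a dict; `plan_runs` and the `baseline_key` naming scheme are hypothetical, chosen only to show that each device gets its own baseline.

```python
DEFAULT_DEVICES = ("desktop",)  # per the text: omitting devices runs desktop only


def plan_runs(slug: str, frontmatter: dict) -> list[dict]:
    """Expand one flow into one planned run per device profile."""
    # Assumption: an empty list is treated the same as an omitted key.
    devices = frontmatter.get("devices") or DEFAULT_DEVICES
    return [
        # baseline_key is a made-up naming scheme: one visual baseline per device.
        {"slug": slug, "device": device, "baseline_key": f"{slug}/{device}"}
        for device in devices
    ]
```

For the checkout example this would plan three runs (desktop, iphone-15, ipad), each with its own baseline key.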
Paths, not URLs
`url:` is a path. The runner composes it with the `target-url` supplied at
run time, so the same flow runs against `localhost:3000`, a PR preview, or
staging without edits.
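As an illustration of the composition, the sketch below joins a run-time target URL with a flow's path. How the runner actually composes the two is not specified here, so the plain string join, the `compose_url` name, and the rejection of absolute URLs are all assumptions.

```python
def compose_url(target_url: str, path: str) -> str:
    """Join a run-time target URL with a flow's path (assumed: plain concatenation)."""
    if "://" in path:
        # The flow's url: field is meant to be a path, not a full URL.
        raise ValueError(f"expected a path, got an absolute URL: {path!r}")
    # Normalize slashes so exactly one separates the two parts.
    return target_url.rstrip("/") + "/" + path.lstrip("/")


# The same flow path against two different (example) targets.
local = compose_url("http://localhost:3000", "/checkout")
staging = compose_url("https://staging.example.com/", "/checkout")
```

Here `local` is `http://localhost:3000/checkout` and `staging` is `https://staging.example.com/checkout`, with the flow file unchanged between the two.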
Budgets
`budgets.minutes` caps wall-clock time; `budgets.tokens` caps model usage per run.
A fully cached green run typically costs one model call (the final intent
check). Recompilation and semantic assertions spend more, bounded by the
budget.
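One way to picture the two caps is a small tracker that is charged after every model call. This is a sketch under assumptions: the `Budget` class and `BudgetExceeded` exception are hypothetical, and stopping the run with an exception is an illustrative choice, since the text only says spending is bounded.

```python
import time
from dataclasses import dataclass, field


class BudgetExceeded(Exception):
    """Hypothetical signal that a run went over one of its caps."""


@dataclass
class Budget:
    minutes: float                        # wall-clock cap, as in budgets.minutes
    tokens: int                           # model usage cap, as in budgets.tokens
    tokens_used: int = 0
    started_at: float = field(default_factory=time.monotonic)

    def spend(self, tokens: int) -> None:
        """Record model usage, then enforce both caps."""
        self.tokens_used += tokens
        self.check()

    def check(self) -> None:
        elapsed_minutes = (time.monotonic() - self.started_at) / 60
        if elapsed_minutes > self.minutes:
            raise BudgetExceeded(f"wall-clock cap of {self.minutes} min exceeded")
        if self.tokens_used > self.tokens:
            raise BudgetExceeded(f"token cap of {self.tokens} exceeded")
```

With the example values (`minutes: 5`, `tokens: 50000`), a cached run that makes a single small model call stays well inside both caps, while repeated recompilation would eventually hit the token cap.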