Compare & gate CI¶
Two (or more) saved runs in, one comparison out: a table (benchmem compare) or an
interactive view (benchmem plot). By default the table shows time and peak across every
stat (pick metrics with --columns, a stat with --stat); the plot takes a single metric
(--columns). Both group by the dims your tests carry.
A regression to catch¶
A benchmark builds a table of (i, i²) rows. On main, each row is a lightweight tuple:
import pytest

@pytest.mark.parametrize("n", [10_000, 50_000, 200_000, 500_000])
def test_build_rows(benchmark, n):
    benchmark(lambda: [(i, i * i) for i in range(n)])
A branch switches the rows to dicts for readability — a classic memory regression, since a dict is several times heavier than a 2-tuple:
@pytest.mark.parametrize("n", [10_000, 50_000, 200_000, 500_000])
def test_build_rows(benchmark, n):
    benchmark(lambda: [{"x": i, "sq": i * i} for i in range(n)])
Run each on its branch and save the --benchmark-json — here, baseline.json (main) and
candidate.json (the branch).
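The "several times heavier" claim can be sanity-checked outside the benchmark harness. The sketch below uses the standard library's tracemalloc as a stand-in. That is an assumption for illustration, not how pytest-benchmem measures, so absolute numbers will not match the tables on this page. The helper names (tuple_rows, dict_rows, peak_bytes) are hypothetical.

```python
import sys
import tracemalloc


def tuple_rows(n):
    """Baseline shape: one 2-tuple per row."""
    return [(i, i * i) for i in range(n)]


def dict_rows(n):
    """Candidate shape: one small dict per row."""
    return [{"x": i, "sq": i * i} for i in range(n)]


def peak_bytes(build, n):
    """Peak bytes tracemalloc observes while build(n) runs (illustrative helper)."""
    tracemalloc.start()
    try:
        rows = build(n)  # keep the result alive until the peak has been read
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return peak


if __name__ == "__main__":
    # Container overhead alone, ignoring the ints stored inside.
    print("tuple:", sys.getsizeof((1, 1)), "B  dict:", sys.getsizeof({"x": 1, "sq": 1}), "B")
    for n in (10_000, 50_000):
        t = peak_bytes(tuple_rows, n)
        d = peak_bytes(dict_rows, n)
        print(f"n={n}: tuples {t} B, dicts {d} B, ratio {d / t:.2f}")
```

The exact ratio depends on the Python version and on how the measuring tool accounts for allocations, so treat the output as a direction check rather than a prediction of the benchmem numbers.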
benchmem compare — the comparison table¶
Modelled on pytest-benchmark's own table: one row per (benchmark × run), columns are
metric × stat, and each cell carries a relative (N.NN) multiplier vs the column's best
run (best green, worst red). Rows are grouped into sub-tables by --group-by (default
fullname, so each benchmark's runs sit together and the multiplier reads as the cross-run
ratio). By default it shows time and peak, each across the full stat spread
(min/max/mean/median/stddev) — so no single statistic is privileged:
!benchmem compare {baseline} {candidate}
test_rows.py::test_build_rows[10000]
             time (s)          time (s)          time (s)          time (s)          time (s)            peak (KiB)        peak (KiB)        peak (KiB)        peak (KiB)        peak (B)
name         min               max               mean              median            stddev            │ min               max               mean              median            stddev
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
(baseline)   0.001154 (1.0)    0.001522 (1.0)    0.001205 (1.0)    0.001192 (1.0)    4.308e-05 (1.0)   │ 83.12 (1.0)       83.12 (1.0)       83.12 (1.0)       83.12 (1.0)       0
(candidate)  0.0018 (1.56)     0.002482 (1.63)   0.002054 (1.70)   0.002042 (1.71)   6.124e-05 (1.42)  │ 2,131.12 (25.64)  2,131.12 (25.64)  2,131.12 (25.64)  2,131.12 (25.64)  0

test_rows.py::test_build_rows[200000]
             time (s)          time (s)          time (s)          time (s)          time (s)            peak (MiB)        peak (MiB)        peak (MiB)        peak (MiB)        peak (B)
name         min               max               mean              median            stddev            │ min               max               mean              median            stddev
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
(baseline)   0.03566 (1.0)     0.03712 (1.0)     0.03619 (1.0)     0.03616 (1.0)     0.0003695 (1.0)   │ 23.55 (1.0)       23.55 (1.0)       23.55 (1.0)       23.55 (1.0)       0
(candidate)  0.05522 (1.55)    0.05827 (1.57)    0.05634 (1.56)    0.05605 (1.55)    0.0008266 (2.24)  │ 46.55 (1.98)      46.55 (1.98)      46.55 (1.98)      46.55 (1.98)      0

test_rows.py::test_build_rows[500000]
             time (s)          time (s)          time (s)          time (s)          time (s)            peak (MiB)        peak (MiB)        peak (MiB)        peak (MiB)        peak (B)
name         min               max               mean              median            stddev            │ min               max               mean              median            stddev
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
(baseline)   0.09289 (1.0)     0.09604 (1.0)     0.09431 (1.0)     0.09432 (1.0)     0.001012 (1.08)   │ 61.97 (1.0)       61.97 (1.0)       61.97 (1.0)       61.97 (1.0)       0
(candidate)  0.144 (1.55)      0.1469 (1.53)     0.1457 (1.55)     0.1461 (1.55)     0.0009414 (1.0)   │ 120.97 (1.95)     120.97 (1.95)     120.97 (1.95)     120.97 (1.95)     0

test_rows.py::test_build_rows[50000]
             time (s)          time (s)          time (s)          time (s)          time (s)            peak (MiB)        peak (MiB)        peak (MiB)        peak (MiB)        peak (B)
name         min               max               mean              median            stddev            │ min               max               mean              median            stddev
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
(baseline)   0.007536 (1.0)    0.01271 (1.0)     0.007865 (1.0)    0.007737 (1.0)    0.0005557 (1.97)  │ 4.42 (1.0)        4.42 (1.0)        4.42 (1.0)        4.42 (1.0)        0
(candidate)  0.01174 (1.56)    0.01307 (1.03)    0.01214 (1.54)    0.01206 (1.56)    0.0002814 (1.0)   │ 10.42 (2.36)      10.42 (2.36)      10.42 (2.36)      10.42 (2.36)      0
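The parenthesised multipliers can be reproduced by hand. The sketch below assumes "best" means the smallest value in a column (lower is better for time and memory), which agrees with every cell shown above. relative_multipliers is a hypothetical helper, not part of benchmem.

```python
def relative_multipliers(column):
    """Map run name -> value / best, where best is the smallest value in the column.

    Assumption: lower is better. If the best value is zero (as in the peak
    stddev column above) no ratio is defined, so None is returned for each run.
    """
    best = min(column.values())
    if best == 0:
        return {run: None for run in column}
    return {run: round(value / best, 2) for run, value in column.items()}


# Values copied from the test_build_rows[10000] and [50000] sub-tables.
peak_min_kib = {"baseline": 83.12, "candidate": 2131.12}
time_stddev_s = {"baseline": 0.0005557, "candidate": 0.0002814}

print(relative_multipliers(peak_min_kib))   # {'baseline': 1.0, 'candidate': 25.64}
print(relative_multipliers(time_stddev_s))  # {'baseline': 1.97, 'candidate': 1.0}
```

The second call shows why the (1.0) marker can land on the candidate row: the best run is chosen per column, not per file.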
Pick the metrics with --columns (a comma list of time / peak / allocated /
allocations; a metric absent from every run is dropped) and the stat with --stat
(min / max / mean / median / stddev, or all):
!benchmem compare {baseline} {candidate} --columns peak --stat min
test_rows.py::test_build_rows[10000]
             peak (KiB)
name         min
──────────────────────────────
(baseline)   83.12 (1.0)
(candidate)  2,131.12 (25.64)

test_rows.py::test_build_rows[200000]
             peak (MiB)
name         min
──────────────────────────────
(baseline)   23.55 (1.0)
(candidate)  46.55 (1.98)

test_rows.py::test_build_rows[500000]
             peak (MiB)
name         min
──────────────────────────────
(baseline)   61.97 (1.0)
(candidate)  120.97 (1.95)

test_rows.py::test_build_rows[50000]
             peak (MiB)
name         min
──────────────────────────────
(baseline)   4.42 (1.0)
(candidate)  10.42 (2.36)
--group-by follows pytest-benchmark's grammar (fullname | name | func | group |
module | class | param:NAME, comma-composable); pass it to cluster param-variants
(--group-by func) or collapse everything into one table.
Gate CI on a regression¶
--fail-on exits non-zero past a threshold — drop it into CI after the run. Thresholds are
percent (peak:10%) or absolute (peak:5MiB), on peak, allocated, or allocations
(repeatable):
# on the PR branch, against a baseline saved from main:
pytest --benchmark-only --benchmark-memory --benchmark-json=pr.json
benchmem compare main.json pr.json --fail-on peak:10% --fail-on allocations:5%
The dict rows blow past the threshold on every size, so the offending ids print and it exits
1 — failing the CI job:
!benchmem compare {baseline} {candidate} --columns peak --fail-on peak:10% --fail-on allocations:5%; echo "exit: $?"
test_rows.py::test_build_rows[10000]
             peak (KiB)        peak (KiB)        peak (KiB)        peak (KiB)        peak (B)
name         min               max               mean              median            stddev
───────────────────────────────────────────────────────────────────────────────────────────
(baseline)   83.12 (1.0)       83.12 (1.0)       83.12 (1.0)       83.12 (1.0)       0
(candidate)  2,131.12 (25.64)  2,131.12 (25.64)  2,131.12 (25.64)  2,131.12 (25.64)  0

test_rows.py::test_build_rows[200000]
             peak (MiB)        peak (MiB)        peak (MiB)        peak (MiB)        peak (B)
name         min               max               mean              median            stddev
───────────────────────────────────────────────────────────────────────────────────────────
(baseline)   23.55 (1.0)       23.55 (1.0)       23.55 (1.0)       23.55 (1.0)       0
(candidate)  46.55 (1.98)      46.55 (1.98)      46.55 (1.98)      46.55 (1.98)      0

test_rows.py::test_build_rows[500000]
             peak (MiB)        peak (MiB)        peak (MiB)        peak (MiB)        peak (B)
name         min               max               mean              median            stddev
───────────────────────────────────────────────────────────────────────────────────────────
(baseline)   61.97 (1.0)       61.97 (1.0)       61.97 (1.0)       61.97 (1.0)       0
(candidate)  120.97 (1.95)     120.97 (1.95)     120.97 (1.95)     120.97 (1.95)     0

test_rows.py::test_build_rows[50000]
             peak (MiB)        peak (MiB)        peak (MiB)        peak (MiB)        peak (B)
name         min               max               mean              median            stddev
───────────────────────────────────────────────────────────────────────────────────────────
(baseline)   4.42 (1.0)        4.42 (1.0)        4.42 (1.0)        4.42 (1.0)        0
(candidate)  10.42 (2.36)      10.42 (2.36)      10.42 (2.36)      10.42 (2.36)      0

8 regression(s) over threshold:
  peak test_build_rows[10000]: 83.1 KiB → 2.08 MiB (+2463.8%)
  peak test_build_rows[200000]: 23.5 MiB → 46.5 MiB (+97.7%)
  peak test_build_rows[500000]: 62 MiB → 121 MiB (+95.2%)
  peak test_build_rows[50000]: 4.42 MiB → 10.4 MiB (+135.6%)
  allocations test_build_rows[10000]: 39 → 41 (+5.1%)
  allocations test_build_rows[200000]: 86 → 109 (+26.7%)
  allocations test_build_rows[500000]: 130 → 189 (+45.4%)
  allocations test_build_rows[50000]: 57 → 63 (+10.5%)
exit: 1
allocations is usually the steadiest tripwire — see Choosing a metric.
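To make the threshold forms concrete, here is a minimal sketch of how a METRIC:LIMIT spec could be parsed and checked. It is not benchmem's implementation. It assumes binary units (1 MiB = 1024² bytes) and that an absolute limit bounds growth over the baseline rather than the candidate's total. parse_threshold and exceeds are hypothetical names.

```python
import re

# Assumption: binary units, matching the KiB/MiB labels used in the tables.
UNITS = {"": 1, "B": 1, "KiB": 1024, "MiB": 1024**2, "GiB": 1024**3}


def parse_threshold(spec):
    """Split 'metric:limit' into (metric, kind, limit); kind is 'percent' or 'absolute'."""
    metric, sep, limit = spec.partition(":")
    if not sep or not metric or not limit:
        raise ValueError(f"expected METRIC:LIMIT, got {spec!r}")
    if limit.endswith("%"):
        return metric, "percent", float(limit[:-1])
    match = re.fullmatch(r"(\d+(?:\.\d+)?)([A-Za-z]*)", limit)
    if match is None or match.group(2) not in UNITS:
        raise ValueError(f"unrecognised limit {limit!r}")
    return metric, "absolute", float(match.group(1)) * UNITS[match.group(2)]


def exceeds(kind, limit, baseline, candidate):
    """True when growth from baseline to candidate is past the limit.

    Assumption: a zero baseline never trips a percent threshold in this sketch.
    """
    growth = candidate - baseline
    if kind == "percent":
        return baseline > 0 and growth / baseline * 100 > limit
    return growth > limit


metric, kind, limit = parse_threshold("allocations:5%")
print(metric, exceeds(kind, limit, 39, 41))  # allocations True (39 → 41 is about +5.1%)
```

The allocations row for test_build_rows[10000] shows how tight a 5% gate is: two extra allocations out of 39 are enough to trip it.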
benchmem plot — interactive views¶
benchmem plot writes an interactive plotly view to standalone HTML, picking the view by run
count. Scaling (one run) draws cost vs. input size — the baseline's peak-memory curve:
benchmem plot baseline.json --columns peak -o scaling.html
Scatter (two runs) puts baseline cost on x (log) and the candidate/baseline ratio on y, colour = absolute Δ. The top-right corner is "big and got bigger" — where a regression actually costs you:
benchmem plot baseline.json candidate.json --columns peak -o scatter.html
Where to go next¶
- Which metric to gate on → Choosing a metric
- Compare across versions of a package → Cross-version sweeps
- Every CLI flag and option → Reference
Going further¶
For timing comparisons you can also use pytest-benchmark's own tooling directly:
pytest-benchmark compare and --benchmark-histogram. pytest-benchmem doesn't reimplement those; it adds the memory-aware, dims-aware views. Add time to --columns (or use --columns time on the plot) to put both on the same footing.
More from compare¶
Order rows with --sort (name | value | change; value puts the largest last-run value
first), and write raw numbers for another tool with --csv out.csv:
benchmem compare baseline.json candidate.json --columns peak --sort value --csv peak.csv
Gating without separate files¶
The approach above keeps two JSON files. Alternatively, gate inline against
pytest-benchmark's own storage — save a baseline once, then fail the next run against it.
--benchmark-memory-compare-fail implies --benchmark-memory-compare:
# on main — record the baseline into .benchmarks/ storage:
pytest --benchmark-only --benchmark-memory --benchmark-save=main
# on the PR branch — fail if peak grows >10% vs that baseline:
pytest --benchmark-only --benchmark-memory --benchmark-memory-compare-fail=peak:10%
Without a prior saved run, the inline gate is a no-op — it prints "no prior run with memory to compare against" and passes. Save a baseline first.
Profile the offenders¶
A peak +20% number says that a benchmark regressed, not where. Add
--benchmark-memory-profile DIR to keep the memray profile (.bin) for each regressing id,
so you can render the allocating call paths after the fact:
pytest --benchmark-only --benchmark-memory \
    --benchmark-memory-compare-fail=peak:10% \
    --benchmark-memory-profile profiles/
# -> profiles/<id>.bin for every id over threshold; clean ids get nothing
The run prints a ready-to-paste command per saved profile:
benchmem: saved 1 memory profile(s) to profiles/ — render with:
memray flamegraph profiles/test_solve.bin
The .bin is memray's raw capture, so the same file also feeds memray tree / summary /
stats — pick the lens you want. Off by default (retaining .bins costs disk), and in CI
it's the natural artifact to upload and render locally on the PR.
One-step render — benchmem flamegraph¶
Instead of finding the right .bin and remembering the memray subcommand, point benchmem flamegraph at the profile dir and name the test (an exact id or a unique substring):
benchmem flamegraph profiles/ test_solve # → profiles/test_solve.flamegraph.html
benchmem flamegraph profiles/ --worst peak --open # auto-pick the heaviest, open it
benchmem flamegraph profiles/ test_solve --report tree # terminal lens instead of HTML
--worst peak|allocated|allocations reads the metric straight from each .bin and renders the
heaviest, so you don't have to look up the id. --report passes through to any memray reporter
(flamegraph default, plus table / tree / summary / stats); HTML reports land next to
the .bin (override with -o, overwrite with -f). --native asserts the profile actually
carries native traces (see below) and errors with the fix if it doesn't.
Native-backed workloads: attribute the C/Rust memory¶
By default the capture records Python frames only. For a native-backed workload
(polars/Rust, numpy/C, solver bindings) the bulk of peak memory is allocated inside the
extension, so the flamegraph collapses it into one unresolved ??? at ??? bucket — exactly
the part you wanted to localize. Add --benchmark-memory-profile-native to also capture native
stacks:
pytest --benchmark-only --benchmark-memory \
    --benchmark-memory-profile profiles/ \
    --benchmark-memory-profile-native
Now the flamegraph attributes memory to the real frames (e.g. jemalloc _rjem_je_* under
rayon workers, reached via the polars write path) instead of an opaque native bucket. It's
opt-in — native traces cost runtime and produce bigger .bins — and only applies on the
--benchmark-memory-profile path (without a kept profile there's nothing to enrich, so the
flag errors rather than silently doing nothing). Scope it to one test with
@pytest.mark.benchmem(profile_native=True) instead of the suite-wide flag.
!!! note "Symbols sharpen native frames"
    Native traces resolve against interpreter/library symbols. On a stripped build memray warns
    "No symbol information was found for the Python interpreter"; frames then show as mangled
    Rust/<unknown> but stay attributable by symbol name (_rjem_je_*, rayon_*). A
    debug-symbol interpreter sharpens the picture.
Which benchmarks get a profile follows the gate:
- with --benchmark-memory-compare-fail → only the regressing ids (keep the failing run cheap and the output small);
- without a fail-gate → every measured benchmark; drop the gate and keep --benchmark-memory-profile DIR alone to archive them all, regressing or not.
A minimal GitHub Actions job using the two-file approach, caching the baseline across runs:
- uses: actions/cache@v4
  with:
    path: main.json
    key: benchmem-baseline-${{ github.base_ref }}
- run: pytest --benchmark-only --benchmark-memory --benchmark-json=pr.json
- run: benchmem compare main.json pr.json --fail-on peak:10% --fail-on allocations:5%
The other two views¶
plot auto-selects by run count; override with --view:
| Runs | Default view | Answers |
|---|---|---|
| 1 | scaling | how does cost grow with input size? |
| 2 | scatter | which ids moved, and were they already big? |
| 2 | compare (--view compare) | ranked: what moved most, in native units? |
| 3+ | sweep | fold-change across versions, one cell per (id, run) |
--facet splits any view into small multiples by a dim (including node.*),
--where keeps only rows matching a dim=value filter (repeatable, AND-combined),
--free-axes releases a faceted axis from the shared default so each panel gets its own
range, and --label/-l names the series per run (defaulting to the file stems):
benchmem plot run.json --columns peak --facet node.func # one panel per operation
benchmem plot run.json --facet node.func --free-axes y # ...each on its own cost scale
benchmem plot run.json --where axis=n # one sweep at a time
benchmem plot v1.json v2.json v3.json -l 0.6 -l 0.7 -l 0.8 # name series, not file stems
Faceted panels share axes by default — right when they're commensurable (the same n
grid across functions). Two cases want them unmatched, and they free different axes:
- Different cost scales: --facet node.func where one function is far costlier flattens the cheap panels on a shared y. --free-axes y gives each its own cost range.
- Incommensurable sweeps: a run mixing sizes n (10–10⁴) and a 0–100 severity under one numeric dim squishes both onto one x. Use --free-axes x (or filter to one with --where axis=n).
--free-axes both frees each panel entirely. --where is the cleaner reach when you only
care about one slice — it drops the rest rather than just rescaling.
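The "repeatable, AND-combined" behaviour of --where can be pictured as a plain row filter. The sketch below is illustrative only: the row layout and the names parse_where and keep_row are assumptions, not benchmem's internals.

```python
def parse_where(specs):
    """Turn ['axis=n', 'func=solve'] into [('axis', 'n'), ('func', 'solve')]."""
    filters = []
    for spec in specs:
        dim, sep, value = spec.partition("=")
        if not sep or not dim:
            raise ValueError(f"expected DIM=VALUE, got {spec!r}")
        filters.append((dim, value))
    return filters


def keep_row(dims, filters):
    """AND-combine: a row survives only if every filter matches one of its dims."""
    return all(dim in dims and str(dims[dim]) == value for dim, value in filters)


# Hypothetical rows: two sweeps sharing one numeric dim, as in the second bullet above.
rows = [
    {"axis": "n", "func": "solve", "value": 10},
    {"axis": "n", "func": "build", "value": 10_000},
    {"axis": "severity", "func": "solve", "value": 50},
]

filters = parse_where(["axis=n", "func=solve"])
print([row for row in rows if keep_row(row, filters)])
# [{'axis': 'n', 'func': 'solve', 'value': 10}]
```

Dropping rows this way removes the incommensurable sweep entirely, whereas --free-axes only rescales it. This is why filtering is the cleaner choice when one slice is all that matters.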