Compare & gate CI

Two (or more) saved runs in, one comparison out: a table (benchmem compare) or an interactive view (benchmem plot). By default the table shows time and peak across every stat (pick metrics with --columns and a stat with --stat); the plot takes a single metric (--columns). Both group by the dims your tests carry.

A regression to catch

A benchmark builds a table of (i, i²) rows. On main, each row is a lightweight tuple:

import pytest


@pytest.mark.parametrize("n", [10_000, 50_000, 200_000, 500_000])
def test_build_rows(benchmark, n):
    benchmark(lambda: [(i, i * i) for i in range(n)])

A branch switches the rows to dicts for readability — a classic memory regression, since a dict is several times heavier than a 2-tuple:

import pytest


@pytest.mark.parametrize("n", [10_000, 50_000, 200_000, 500_000])
def test_build_rows(benchmark, n):
    benchmark(lambda: [{"x": i, "sq": i * i} for i in range(n)])
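
To see the per-row gap for yourself, here is a minimal sketch using the standard library's sys.getsizeof. It reports only each container's own (shallow) size on CPython, not the integers it references, and exact byte counts vary by Python version, so treat the numbers as indicative rather than as what the benchmark measures.

```python
import sys

# One row in each representation, as built by the two benchmark variants.
tuple_row = (3, 9)
dict_row = {"x": 3, "sq": 9}

# getsizeof is shallow: it counts the container, not the ints it points to.
tuple_bytes = sys.getsizeof(tuple_row)
dict_bytes = sys.getsizeof(dict_row)

print(f"tuple: {tuple_bytes} B, dict: {dict_bytes} B, "
      f"ratio: {dict_bytes / tuple_bytes:.1f}x")
```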

Run each on its branch and save the --benchmark-json — here, baseline.json (main) and candidate.json (the branch).
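
The saved files are ordinary pytest-benchmark JSON, so the timing side can be inspected by hand. The sketch below assumes the usual pytest-benchmark layout (a top-level "benchmarks" list whose entries carry "fullname" and a "stats" mapping) and uses trimmed in-memory stand-ins for the two files, with mean times taken from the comparison table on this page. Where the memory figures live inside the file is not covered here, so the sketch leaves them alone.

```python
import json

# Trimmed stand-ins for baseline.json and candidate.json; real files carry
# many more fields per benchmark.
baseline_text = """
{"benchmarks": [{"fullname": "test_rows.py::test_build_rows[10000]",
                 "stats": {"mean": 0.001205}}]}
"""
candidate_text = """
{"benchmarks": [{"fullname": "test_rows.py::test_build_rows[10000]",
                 "stats": {"mean": 0.002054}}]}
"""


def mean_times(report):
    """Map each benchmark's fullname to its mean time in seconds."""
    return {b["fullname"]: b["stats"]["mean"] for b in report["benchmarks"]}


baseline = mean_times(json.loads(baseline_text))
candidate = mean_times(json.loads(candidate_text))

# Only ids present in both runs can be compared.
for name in baseline.keys() & candidate.keys():
    print(f"{name}: {candidate[name] / baseline[name]:.2f}x mean time")
```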

`benchmem compare`: the comparison table

Modelled on pytest-benchmark's own table: one row per (benchmark × run), columns are metric × stat, and each cell carries a relative (N.NN) multiplier vs the column's best run (best green, worst red). Rows are grouped into sub-tables by --group-by (default fullname, so each benchmark's runs sit together and the multiplier reads as the cross-run ratio). By default it shows time and peak, each across the full stat spread (min/max/mean/median/stddev) — so no single statistic is privileged:

!benchmem compare {baseline} {candidate}

test_rows.py::test_build_rows[10000]

                     time (s)          time (s)          time (s)          time (s)           time (s)             peak (KiB)         peak (KiB)         peak (KiB)         peak (KiB)   peak (B) 
 name                     min               max              mean            median             stddev   │                min                max               mean             median     stddev 
──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
 (baseline)    0.001154 (1.0)    0.001522 (1.0)    0.001205 (1.0)    0.001192 (1.0)    4.308e-05 (1.0)   │        83.12 (1.0)        83.12 (1.0)        83.12 (1.0)        83.12 (1.0)          0 
 (candidate)    0.0018 (1.56)   0.002482 (1.63)   0.002054 (1.70)   0.002042 (1.71)   6.124e-05 (1.42)   │   2,131.12 (25.64)   2,131.12 (25.64)   2,131.12 (25.64)   2,131.12 (25.64)          0 

test_rows.py::test_build_rows[200000]
                     time (s)         time (s)         time (s)         time (s)           time (s)         peak (MiB)     peak (MiB)     peak (MiB)     peak (MiB)   peak (B) 
 name                     min              max             mean           median             stddev   │            min            max           mean         median     stddev 
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
 (baseline)     0.03566 (1.0)    0.03712 (1.0)    0.03619 (1.0)    0.03616 (1.0)    0.0003695 (1.0)   │    23.55 (1.0)    23.55 (1.0)    23.55 (1.0)    23.55 (1.0)          0 
 (candidate)   0.05522 (1.55)   0.05827 (1.57)   0.05634 (1.56)   0.05605 (1.55)   0.0008266 (2.24)   │   46.55 (1.98)   46.55 (1.98)   46.55 (1.98)   46.55 (1.98)          0 

test_rows.py::test_build_rows[500000]
                    time (s)        time (s)        time (s)        time (s)          time (s)          peak (MiB)      peak (MiB)      peak (MiB)      peak (MiB)   peak (B) 
 name                    min             max            mean          median            stddev   │             min             max            mean          median     stddev 
──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
 (baseline)    0.09289 (1.0)   0.09604 (1.0)   0.09431 (1.0)   0.09432 (1.0)   0.001012 (1.08)   │     61.97 (1.0)     61.97 (1.0)     61.97 (1.0)     61.97 (1.0)          0 
 (candidate)    0.144 (1.55)   0.1469 (1.53)   0.1457 (1.55)   0.1461 (1.55)   0.0009414 (1.0)   │   120.97 (1.95)   120.97 (1.95)   120.97 (1.95)   120.97 (1.95)          0 

test_rows.py::test_build_rows[50000]
                     time (s)         time (s)         time (s)         time (s)           time (s)         peak (MiB)     peak (MiB)     peak (MiB)     peak (MiB)   peak (B) 
 name                     min              max             mean           median             stddev   │            min            max           mean         median     stddev 
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
 (baseline)    0.007536 (1.0)    0.01271 (1.0)   0.007865 (1.0)   0.007737 (1.0)   0.0005557 (1.97)   │     4.42 (1.0)     4.42 (1.0)     4.42 (1.0)     4.42 (1.0)          0 
 (candidate)   0.01174 (1.56)   0.01307 (1.03)   0.01214 (1.54)   0.01206 (1.56)    0.0002814 (1.0)   │   10.42 (2.36)   10.42 (2.36)   10.42 (2.36)   10.42 (2.36)          0
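
The multiplier in parentheses can be reproduced by hand. The sketch below assumes, as the tables above suggest, that "best" means the smallest value in a column and that each cell is divided by it. The helper name relative_multipliers is hypothetical and this is not benchmem's own code; an all-zero column (such as the peak stddev here) would need special-casing.

```python
def relative_multipliers(column):
    """Map each run label to value / best, where best is the column minimum."""
    best = min(column.values())
    return {run: value / best for run, value in column.items()}


# Peak (KiB) for test_build_rows[10000], copied from the table above.
peak_kib = {"baseline": 83.12, "candidate": 2131.12}

for run, ratio in relative_multipliers(peak_kib).items():
    print(f"{run}: {ratio:.2f}")
```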

Pick the metrics with --columns (a comma list of time / peak / allocated / allocations; a metric absent from every run is dropped) and the stat with --stat (min / max / mean / median / stddev, or all):

!benchmem compare {baseline} {candidate} --columns peak --stat min

test_rows.py::test_build_rows[10000]
                     peak (KiB) 
 name                       min 
────────────────────────────────
 (baseline)         83.12 (1.0) 
 (candidate)   2,131.12 (25.64) 

test_rows.py::test_build_rows[200000]
                 peak (MiB) 
 name                   min 
────────────────────────────
 (baseline)     23.55 (1.0) 
 (candidate)   46.55 (1.98) 

test_rows.py::test_build_rows[500000]
                  peak (MiB) 
 name                    min 
─────────────────────────────
 (baseline)      61.97 (1.0) 
 (candidate)   120.97 (1.95) 

test_rows.py::test_build_rows[50000]
                 peak (MiB) 
 name                   min 
────────────────────────────
 (baseline)      4.42 (1.0) 
 (candidate)   10.42 (2.36)

Gate CI on a regression

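--fail-on makes compare exit non-zero when a metric regresses past a threshold, so it drops into CI right after the benchmark run. A threshold is either a percentage (peak:10%) or an absolute amount (peak:5MiB) on peak, allocated, or allocations; repeat the flag to gate on several metrics: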

# on the PR branch, against a baseline saved from main:
pytest --benchmark-only --benchmark-memory --benchmark-json=pr.json
benchmem compare main.json pr.json --fail-on peak:10% --fail-on allocations:5%

The dict rows blow past the threshold on every size, so the offending ids print and it exits 1 — failing the CI job:

!benchmem compare {baseline} {candidate} --columns peak --fail-on peak:10% --fail-on allocations:5%; echo "exit: $?"

test_rows.py::test_build_rows[10000]
                     peak (KiB)         peak (KiB)         peak (KiB)         peak (KiB)   peak (B) 
 name                       min                max               mean             median     stddev 
────────────────────────────────────────────────────────────────────────────────────────────────────
 (baseline)         83.12 (1.0)        83.12 (1.0)        83.12 (1.0)        83.12 (1.0)          0 
 (candidate)   2,131.12 (25.64)   2,131.12 (25.64)   2,131.12 (25.64)   2,131.12 (25.64)          0 

test_rows.py::test_build_rows[200000]
                 peak (MiB)     peak (MiB)     peak (MiB)     peak (MiB)   peak (B) 
 name                   min            max           mean         median     stddev 
────────────────────────────────────────────────────────────────────────────────────
 (baseline)     23.55 (1.0)    23.55 (1.0)    23.55 (1.0)    23.55 (1.0)          0 
 (candidate)   46.55 (1.98)   46.55 (1.98)   46.55 (1.98)   46.55 (1.98)          0 

test_rows.py::test_build_rows[500000]
                  peak (MiB)      peak (MiB)      peak (MiB)      peak (MiB)   peak (B) 
 name                    min             max            mean          median     stddev 
────────────────────────────────────────────────────────────────────────────────────────
 (baseline)      61.97 (1.0)     61.97 (1.0)     61.97 (1.0)     61.97 (1.0)          0 
 (candidate)   120.97 (1.95)   120.97 (1.95)   120.97 (1.95)   120.97 (1.95)          0 

test_rows.py::test_build_rows[50000]
                 peak (MiB)     peak (MiB)     peak (MiB)     peak (MiB)   peak (B) 
 name                   min            max           mean         median     stddev 
────────────────────────────────────────────────────────────────────────────────────
 (baseline)      4.42 (1.0)     4.42 (1.0)     4.42 (1.0)     4.42 (1.0)          0 
 (candidate)   10.42 (2.36)   10.42 (2.36)   10.42 (2.36)   10.42 (2.36)          0 

8 regression(s) over threshold:
  peak         test_build_rows[10000]: 83.1 KiB → 2.08 MiB (+2463.8%)
  peak         test_build_rows[200000]: 23.5 MiB → 46.5 MiB (+97.7%)
  peak         test_build_rows[500000]: 62 MiB → 121 MiB (+95.2%)
  peak         test_build_rows[50000]: 4.42 MiB → 10.4 MiB (+135.6%)
  allocations  test_build_rows[10000]: 39 → 41 (+5.1%)
  allocations  test_build_rows[200000]: 86 → 109 (+26.7%)
  allocations  test_build_rows[500000]: 130 → 189 (+45.4%)
  allocations  test_build_rows[50000]: 57 → 63 (+10.5%)

exit: 1

allocations is usually the steadiest tripwire — see Choosing a metric.
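
As a rough illustration of how a percent threshold such as allocations:5% separates the rows in the report above, here is a minimal sketch. It is not benchmem's implementation: the function names are hypothetical, the strict "greater than" comparison is an assumption, and only the percent form is handled (absolute thresholds like peak:5MiB are left aside).

```python
def percent_change(baseline, candidate):
    """Relative growth of candidate over baseline, in percent."""
    return (candidate - baseline) / baseline * 100.0


def over_threshold(pairs, limit_percent):
    """Return {id: change} for ids whose growth exceeds limit_percent."""
    changes = {
        bench_id: percent_change(base, cand)
        for bench_id, (base, cand) in pairs.items()
    }
    return {k: v for k, v in changes.items() if v > limit_percent}


# Allocation counts (baseline, candidate) from the regression report above.
allocations = {
    "test_build_rows[10000]": (39, 41),
    "test_build_rows[50000]": (57, 63),
    "test_build_rows[200000]": (86, 109),
    "test_build_rows[500000]": (130, 189),
}

offenders = over_threshold(allocations, limit_percent=5.0)
for bench_id, change in offenders.items():
    print(f"{bench_id}: +{change:.1f}%")

# A gate would exit non-zero when anything is over the limit.
exit_code = 1 if offenders else 0
```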

`benchmem plot`: interactive views

benchmem plot writes an interactive plotly view to standalone HTML, picking the view by run count. Scaling (one run) draws cost vs. input size — the baseline's peak-memory curve:

benchmem plot baseline.json --columns peak -o scaling.html

Scatter (two runs) puts baseline cost on x (log) and the candidate/baseline ratio on y, colour = absolute Δ. The top-right corner is "big and got bigger" — where a regression actually costs you:

benchmem plot baseline.json candidate.json --columns peak -o scatter.html
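
The three quantities the scatter encodes are simple to derive from a pair of measurements. The sketch below computes them for the peak values shown in the compare tables, with the 10,000-row case converted from KiB to MiB. The function name scatter_point is hypothetical and no plotting is done; it only illustrates the arithmetic.

```python
import math


def scatter_point(baseline, candidate):
    """Return (log10 of baseline cost, candidate/baseline ratio, absolute delta)."""
    return math.log10(baseline), candidate / baseline, candidate - baseline


# Peak memory in MiB as (baseline, candidate), taken from the compare tables.
peaks_mib = {
    "n=10000": (83.12 / 1024, 2131.12 / 1024),
    "n=50000": (4.42, 10.42),
    "n=200000": (23.55, 46.55),
    "n=500000": (61.97, 120.97),
}

for label, (base, cand) in peaks_mib.items():
    x, ratio, delta = scatter_point(base, cand)
    print(f"{label}: log10(x)={x:.2f} ratio={ratio:.2f} delta={delta:.2f} MiB")
```

Here n=10000 has the largest ratio but the smallest absolute delta, while n=500000 has a modest ratio and the largest delta, which is the distinction the colour channel is meant to surface.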

Where to go next

Which metric to gate on → Choosing a metric
Compare across versions of a package → Cross-version sweeps
Every CLI flag and option → Reference

Going further

For timing comparisons you can also use pytest-benchmark's own tooling directly — pytest-benchmark compare, --benchmark-histogram. pytest-benchmem doesn't reimplement those; it adds the memory-aware, dims-aware views. Add time to --columns (or use --columns time on the plot) to put both on the same footing.

More from `compare`

Order rows with --sort, which takes name, value (largest last-run value first), or change, and write the raw numbers out for another tool with --csv out.csv:

benchmem compare baseline.json candidate.json --columns peak --sort value --csv peak.csv

Gating without separate files

The approach above keeps two JSON files. Alternatively, gate inline against pytest-benchmark's own storage — save a baseline once, then fail the next run against it. --benchmark-memory-compare-fail implies --benchmark-memory-compare:

# on main — record the baseline into .benchmarks/ storage:
pytest --benchmark-only --benchmark-memory --benchmark-save=main

# on the PR branch — fail if peak grows >10% vs that baseline:
pytest --benchmark-only --benchmark-memory --benchmark-memory-compare-fail=peak:10%

Without a prior saved run, the inline gate is a no-op — it prints "no prior run with memory to compare against" and passes. Save a baseline first.

Profile the offenders

A peak +20% number says that a benchmark regressed, not where. Add --benchmark-memory-profile DIR to keep the memray profile (.bin) for each regressing id, so you can render the allocating call paths after the fact:

pytest --benchmark-only --benchmark-memory \
       --benchmark-memory-compare-fail=peak:10% \
       --benchmark-memory-profile profiles/
# -> profiles/<id>.bin for every id over threshold; clean ids get nothing

The run prints a ready-to-paste command per saved profile:

benchmem: saved 1 memory profile(s) to profiles/ — render with:
    memray flamegraph profiles/test_solve.bin

The .bin is memray's raw capture, so the same file also feeds memray tree / summary / stats; pick the lens you want. Keeping profiles is off by default because retaining .bin files costs disk. In CI, the profile directory is the natural artifact to upload from the PR run and render locally.

One-step render: `benchmem flamegraph`

Instead of finding the right .bin and remembering the memray subcommand, point benchmem flamegraph at the profile dir and name the test (an exact id or a unique substring):

benchmem flamegraph profiles/ test_solve          # → profiles/test_solve.flamegraph.html
benchmem flamegraph profiles/ --worst peak --open # auto-pick the heaviest, open it
benchmem flamegraph profiles/ test_solve --report tree   # terminal lens instead of HTML

--worst peak|allocated|allocations reads the metric straight from each .bin and renders the heaviest, so you don't have to look up the id. --report passes through to any memray reporter (flamegraph default, plus table / tree / summary / stats); HTML reports land next to the .bin (override with -o, overwrite with -f). --native asserts the profile actually carries native traces (see below) and errors with the fix if it doesn't.

Native-backed workloads: attribute the C/Rust memory

By default the capture records Python frames only. For a native-backed workload (polars/Rust, numpy/C, solver bindings) the bulk of peak memory is allocated inside the extension, so the flamegraph collapses it into one unresolved ??? at ??? bucket — exactly the part you wanted to localize. Add --benchmark-memory-profile-native to also capture native stacks:

pytest --benchmark-only --benchmark-memory \
       --benchmark-memory-profile profiles/ \
       --benchmark-memory-profile-native

Now the flamegraph attributes memory to the real frames (e.g. jemalloc _rjem_je_* under rayon workers, reached via the polars write path) instead of an opaque native bucket. It's opt-in — native traces cost runtime and produce bigger .bins — and only applies on the --benchmark-memory-profile path (without a kept profile there's nothing to enrich, so the flag errors rather than silently doing nothing). Scope it to one test with @pytest.mark.benchmem(profile_native=True) instead of the suite-wide flag.

!!! note "Symbols sharpen native frames"
    Native traces resolve against interpreter/library symbols. On a stripped build memray warns "No symbol information was found for the Python interpreter"; frames then show as mangled Rust/<unknown> but stay attributable by symbol name (_rjem_je_*, rayon_*). A debug-symbol interpreter sharpens the picture.

Which benchmarks get a profile follows the gate:

with --benchmark-memory-compare-fail → only the regressing ids (keep the failing run cheap and the output small);
without a fail-gate → every measured benchmark — drop the gate and keep --benchmark-memory-profile DIR alone to archive them all, regressing or not.

A minimal GitHub Actions job using the two-file approach, restoring a cached baseline (this assumes an earlier run on the base branch has already saved main.json to the cache):

- uses: actions/cache@v4
  with:
    path: main.json
    key: benchmem-baseline-${{ github.base_ref }}
- run: pytest --benchmark-only --benchmark-memory --benchmark-json=pr.json
- run: benchmem compare main.json pr.json --fail-on peak:10% --fail-on allocations:5%

The other two views

plot auto-selects by run count; override with --view:

| Runs | Default view | Answers |
|------|--------------|---------|
| 1 | `scaling` | how does cost grow with input size? |
| 2 | `scatter` | which ids moved, and were they already big? |
| 2 | `compare` (`--view compare`) | ranked: what moved most, in native units? |
| 3+ | `sweep` | fold-change across versions, one cell per (id, run) |

--facet splits any view into small multiples by a dim (including node.*); --where keeps only rows matching a dim=value filter (repeatable, AND-combined); --free-axes lets a faceted axis scale per panel instead of sharing one range; and --label/-l names the series per run (defaulting to the file stems):

benchmem plot run.json --columns peak --facet node.func             # one panel per operation
benchmem plot run.json --facet node.func --free-axes y             # ...each on its own cost scale
benchmem plot run.json --where axis=n                              # one sweep at a time
benchmem plot v1.json v2.json v3.json -l 0.6 -l 0.7 -l 0.8         # name series, not file stems
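
The AND-combined behaviour of repeated --where filters can be pictured with a small sketch. It is illustrative only: the row dicts, the dim values (build, solve, severity), and the helper names parse_where and apply_where are hypothetical, and benchmem's own parsing may differ.

```python
def parse_where(filters):
    """Turn ["dim=value", ...] into a list of (dim, value) pairs."""
    return [tuple(f.split("=", 1)) for f in filters]


def apply_where(rows, filters):
    """Keep rows whose dims match every filter (AND-combined)."""
    wanted = parse_where(filters)
    return [
        row for row in rows
        if all(str(row.get(dim)) == value for dim, value in wanted)
    ]


# Hypothetical rows: each carries the dims of one measured benchmark.
rows = [
    {"node.func": "build", "axis": "n", "peak": 4.4},
    {"node.func": "build", "axis": "severity", "peak": 1.2},
    {"node.func": "solve", "axis": "n", "peak": 9.8},
]

# Two filters narrow the selection further than either one alone.
kept = apply_where(rows, ["axis=n", "node.func=build"])
print(kept)
```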

Faceted panels share axes by default — right when they're commensurable (the same n grid across functions). Two cases want them unmatched, and they free different axes:

Different cost scales: with --facet node.func, one function that is far costlier flattens the cheap panels on a shared y. --free-axes y gives each panel its own cost range.
Incommensurable sweeps: a run mixing sizes n (10–10⁴) and a 0–100 severity under one numeric dim squishes both onto one x. Use --free-axes x, or filter to one sweep with --where axis=n.

--free-axes both frees each panel entirely. --where is the cleaner choice when you only care about one slice, since it drops the rest rather than just rescaling.