Benchmark#

The surfaces.benchmark module provides tools for systematic optimizer comparison. See the Benchmarking guide for usage examples and concepts.

Core#

Benchmark#

class Benchmark(budget_cu: float | None = None, budget_iter: int | None = None, n_seeds: int = 1, seed: int = 0, catch: str = 'raise')[source]#

Bases: object

Configurable, incremental benchmark runner.

Collects test functions and optimizer specs via add methods, then runs only the missing combinations on each run() call. Results accumulate in the instance across runs.

Parameters:

budget_cu (float, optional) – Maximum compute budget per run in Compute Units.
budget_iter (int, optional) – Maximum number of function evaluations per run.
n_seeds (int) – Number of independent runs per (function, optimizer) pair.
seed (int) – Base random seed. Run i uses seed + i.
catch (str) – Error handling for individual trials. "raise" (default) propagates exceptions immediately. "warn" logs a warning and continues. "skip" silently skips the failed trial. Failed trials are recorded in bench.errors regardless of mode.

Examples

>>> bench = Benchmark(budget_cu=50_000, n_seeds=5)
>>> bench.add_functions(collection.filter(category="bbob"))
>>> bench.add_optimizers([HillClimbing, RandomSearch])
>>> bench.run()
>>> bench.results.summary()

add_functions(functions: Any) → Benchmark[source]#

Add test functions to the benchmark.

Accepts a single class, a list of classes, or a Collection. Duplicates are silently ignored.

add_optimizers(optimizers: Any) → Benchmark[source]#

Add optimizer specs to the benchmark.

Accepts a single spec or a list of specs. Each spec can be a class (auto-detected by module path) or a (class, params_dict) tuple. Duplicates are silently ignored.

remove_functions(functions: Any) → Benchmark[source]#

Remove test functions and their associated traces.

Accepts a single class, a list of classes, or a Collection. Classes not currently registered are silently ignored. All traces recorded for removed functions are deleted.

remove_optimizers(optimizers: Any) → Benchmark[source]#

Remove optimizer specs and their associated traces.

Accepts a single spec or a list of specs. Each spec can be a bare class or a (class, params_dict) tuple.

When a bare class is passed, all entries using that class are removed regardless of their params. When a tuple is passed, only the entry with exactly matching params is removed. This lets you selectively drop one configuration while keeping others:

bench.remove_optimizers(TPESampler)                # all TPESampler entries
bench.remove_optimizers((TPESampler, {"n": 10}))   # only this config

Specs not currently registered are silently ignored. All traces recorded for removed optimizers are deleted.
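The matching rule above can be sketched in plain Python. This is only an illustration of the documented semantics, not the library's implementation: remove_specs and the two placeholder classes are hypothetical names introduced for the example.

```python
class TPESampler:
    """Placeholder optimizer class, for illustration only."""


class RandomSearch:
    """Placeholder optimizer class, for illustration only."""


def remove_specs(registered, spec):
    """Return the (class, params) entries that survive removal of `spec`."""
    if isinstance(spec, tuple):
        cls, params = spec
        # Tuple spec: drop only the entry whose class and params both match.
        return [(c, p) for c, p in registered if not (c is cls and p == params)]
    # Bare class: drop every entry using that class, whatever its params.
    return [(c, p) for c, p in registered if c is not spec]


registered = [
    (TPESampler, {"n": 10}),
    (TPESampler, {"n": 20}),
    (RandomSearch, {}),
]

only_one_dropped = remove_specs(registered, (TPESampler, {"n": 10}))
all_tpe_dropped = remove_specs(registered, TPESampler)
```

An unregistered spec leaves the list unchanged, which mirrors the "silently ignored" behaviour described above.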

run(*, verbose: bool = True, callback: Callable[[TrialInfo], None] | None = None, backend: ParallelBackend | None = None) → Benchmark[source]#

Run all missing (function, optimizer, seed) combinations.

Only executes combinations that have no trace yet. New traces are added to the existing results, so previous data is preserved.

Parameters:

verbose (bool) – Print progress to stdout for each trial plus a summary line.
callback (callable, optional) – Called with a TrialInfo after each trial (including skipped ones). Useful for custom progress bars or logging.
backend (ParallelBackend, optional) –
Parallel execution backend. When provided, pending trials are dispatched to workers via backend.map(). When None (the default), trials run sequentially in the current process.

Note: with a parallel backend and catch != "raise", all trials complete before errors are collected. In sequential mode, catch="raise" stops on the first failure.
Returns:

self, to allow method chaining.
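The incremental behaviour of run() amounts to a set difference between all requested combinations and the traces already stored. The sketch below illustrates that idea under stated assumptions: pending_trials is a hypothetical helper, and traces are assumed to be keyed by (function, optimizer, seed) as in bench.errors and results.traces().

```python
from itertools import product


def pending_trials(functions, optimizers, n_seeds, base_seed, existing):
    """List the (function, optimizer, seed) keys that have no trace yet."""
    seeds = [base_seed + i for i in range(n_seeds)]  # run i uses seed + i
    return [
        key
        for key in product(functions, optimizers, seeds)
        if key not in existing
    ]


# Two traces already exist from an earlier run with HillClimbing only.
existing = {
    ("AckleyFunction", "HillClimbing", 0),
    ("AckleyFunction", "HillClimbing", 1),
}

todo = pending_trials(
    ["AckleyFunction"], ["HillClimbing", "RandomSearch"], 2, 0, existing
)
```

Here only the two RandomSearch trials remain, so adding an optimizer later does not recompute earlier results.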

property results: ResultAccessor[source]#: Access benchmark results: summary, traces, dataframe export.

property io: IOAccessor[source]#: Access save/load functionality.

property plot: PlotAccessor[source]#: Access benchmark visualizations.

property errors: dict[tuple[str, str, int], Exception][source]#: Failed trials from the last run, keyed by (function, optimizer, seed).

classmethod load(path: str | Path) → Benchmark[source]#

Load a benchmark (config + results) from a JSON file.

Functions must originate from the surfaces package. A warning is emitted if the Surfaces version differs from the one used when saving.

Parameters: path (str or Path) – Path to a JSON file previously created by io.save().

classmethod from_suite(suite: Suite, **overrides: Any) → Benchmark[source]#

Create a Benchmark pre-configured from a Suite definition.

The suite provides function filters and default budget/seed settings. Add optimizers and call run() to execute.

Parameters:

suite (Suite) – A predefined suite from surfaces.benchmark.suites.
**overrides – Override suite defaults (budget_cu, budget_iter, n_seeds, seed).

Suite#

class Suite(name: str, description: str, function_filter: dict[str, Any], budget_cu: float | None = None, budget_iter: int | None = None, n_seeds: int = 5)[source]#

Bases: object

A pre-configured benchmark scenario.

Use with run_suite() or unpack manually into run().

Result Analysis#

ResultAccessor#

Accessed via bench.results.

class ResultAccessor(benchmark: Benchmark)[source]#

Bases: object

Query and analyze benchmark results.

Accessed via bench.results. All methods operate on the accumulated traces inside the parent Benchmark instance.

property function_names: list[str][source]#: Sorted unique function names across all traces.

property optimizer_names: list[str][source]#: Sorted unique optimizer names across all traces.

property seeds: list[int][source]#: Sorted unique seeds across all traces.

property n_traces: int[source]#: Total number of traces.

traces(function: str | None = None, optimizer: str | None = None, seed: int | None = None) → dict[tuple[str, str, int], Trace][source]#: Filter traces by function name, optimizer name, and/or seed.

summary(at_cu: float | None = None, at_iter: int | None = None, show_ci: bool = False) → str[source]#

Generate a formatted summary table of benchmark results.

Parameters:

at_cu (float, optional) – Report best score at this CU budget.
at_iter (int, optional) – Report best score at this iteration count.
show_ci (bool, default=False) – Show standard deviation and 95% confidence interval of the best score across seeds. Useful with >= 3 seeds.

to_dataframe() → Any[source]#

Export all evaluation records as a pandas DataFrame.

Each row is a single evaluation with function/optimizer/seed metadata and full cost breakdown. Parameter values are expanded into individual columns.

ert(precision: float = 1.0, targets: dict[str, float] | None = None) → Any[source]#

Compute Expected Running Time for all (function, optimizer) pairs.

A problem counts as “solved” when best_so_far <= f_global + precision. ERT follows the COCO convention: total budget across all seeds divided by the number of successful seeds.

Parameters:

precision (float, default=1.0) – Absolute distance from the known optimum (f_global).
targets (dict, optional) – Per-function target scores. Overrides precision for functions present in this dict.

Returns:

Printable, subscriptable, and exportable to DataFrame.

Return type:

ERTTable

ranking(at_cu: float | None = None, alpha: float = 0.05, correction: str | None = 'holm') → Any[source]#

Rank optimizers by normalized performance with pairwise tests.

Scores are normalized per function (0 = worst observed, 1 = best observed) and averaged over seeds. Ranks use tied-rank averaging within each function. Pairwise Wilcoxon signed-rank tests assess statistical significance.

Parameters:

at_cu (float, optional) – Evaluate scores at this CU budget instead of using the final best score.
alpha (float, default=0.05) – Significance level for the Wilcoxon tests.
correction (str or None, default="holm") – Multiple comparison correction. "holm" applies the Holm step-down procedure (controls family-wise error rate). None returns raw uncorrected p-values.

Returns:

Printable, subscriptable, and exportable to DataFrame.

Return type:

RankingTable
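For readers unfamiliar with the Holm step-down procedure named by the correction parameter, here is a minimal standalone sketch. It shows the standard textbook procedure, not the library's code; holm_adjust is a hypothetical helper name.

```python
def holm_adjust(pvalues):
    """Return Holm step-down adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for position, idx in enumerate(order):
        # The smallest p-value is multiplied by m, the next by m - 1, and so on.
        scaled = min(1.0, (m - position) * pvalues[idx])
        # Adjusted values must not decrease along the sorted order.
        running_max = max(running_max, scaled)
        adjusted[idx] = running_max
    return adjusted


raw = [0.01, 0.04, 0.03]
adjusted = holm_adjust(raw)
```

A pair is then declared significantly different when its adjusted p-value is below alpha.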

friedman(at_cu: float | None = None, alpha: float = 0.05) → Any[source]#

Friedman omnibus test for comparing multiple optimizers.

Tests whether at least one optimizer’s performance differs significantly. This is the recommended first step before pairwise comparisons: if the Friedman test does not reject, pairwise differences are not statistically supported.

Requires at least 3 optimizers and 3 functions where all optimizers produced results.

Parameters:

at_cu (float, optional) – Evaluate scores at this CU budget.
alpha (float, default=0.05) – Significance level.

Returns:

Printable result with chi-squared and Iman-Davenport statistics, average ranks, and significance verdict.

Return type:

FriedmanResult

FriedmanResult#

Returned by bench.results.friedman().

class FriedmanResult(chi2_statistic: float, chi2_p_value: float, f_statistic: float, f_p_value: float, n_functions: int, n_optimizers: int, alpha: float, avg_ranks: dict[str, float])[source]#

Bases: object

Result of the Friedman omnibus test for comparing multiple optimizers.

The Friedman test checks whether at least one optimizer differs significantly. If significant is True, proceed with post-hoc pairwise tests. If False, observed differences are not statistically supported.

The Iman-Davenport variant uses an F-distribution and is less conservative than the chi-squared approximation.

property significant: bool[source]#: Whether the Iman-Davenport test rejects the null hypothesis.

ERTTable#

Returned by bench.results.ert(). Subscriptable by function name, then optimizer name.

class ERTTable(data: dict[str, dict[str, ERTEntry]], precision: float, function_names: list[str], optimizer_names: list[str])[source]#

Bases: object

Expected Running Time results across functions and optimizers.

Subscriptable by function name, then optimizer name:

ert = bench.results.ert()
entry = ert["AckleyFunction"]["HillClimbing"]
print(entry.ert_cu, entry.solved, entry.total)

to_dataframe() → Any[source]#: Export ERT results as a pandas DataFrame.

ERTEntry#

class ERTEntry(ert_cu: float, solved: int, total: int, median_cu: float, individual_cu: tuple[float, ...])[source]#

Bases: object

ERT result for a single (function, optimizer) pair.

property success_rate: float[source]#: Fraction of seeds that reached the target.

RankingTable#

Returned by bench.results.ranking(). Subscriptable by optimizer name.

class RankingTable(entries: list[RankingEntry], pvalues: dict[tuple[str, str], float], alpha: float, normalized_scores: dict[str, dict[str, float]], correction: str | None = None)[source]#

Bases: object

Optimizer ranking with pairwise statistical tests.

Subscriptable by optimizer name:

ranking = bench.results.ranking()
entry = ranking["HillClimbing"]
print(entry.rank, entry.mean_normalized)
print(ranking.pvalues)

to_dataframe() → Any[source]#: Export ranking results as a pandas DataFrame.

pvalues_dataframe() → Any[source]#: Pairwise Wilcoxon p-values as a square DataFrame.

RankingEntry#

class RankingEntry(optimizer: str, rank: float, mean_normalized: float)[source]#

Bases: object

Ranking result for a single optimizer.

Persistence#

IOAccessor#

Accessed via bench.io.

class IOAccessor(benchmark: Benchmark)[source]#

Bases: object

Save benchmark state (config + results) to disk.

Accessed via bench.io. Loading is done via the Benchmark.load() classmethod.

save(path: str | Path) → None[source]#

Save the full benchmark state to a JSON file.

Stores configuration (budget, seeds), registered functions and optimizers, the Surfaces version, and all accumulated traces. The file can be loaded back with Benchmark.load().

Parameters: path (str or Path) – Output file path. Will be overwritten if it exists.

Visualization#

PlotAccessor#

Accessed via bench.plot.

class PlotAccessor(benchmark: Benchmark)[source]#

Bases: object

Benchmark visualization via Plotly.

Accessed via bench.plot. Methods return a plotly.graph_objects.Figure that can be displayed with fig.show() or rendered automatically in Jupyter notebooks, except cd_diagram(), which returns a matplotlib.figure.Figure.

Requires the viz extra (pip install surfaces[viz]).

ecdf(precision: float | list[float] = 1.0, log_x: bool = True) → Any[source]#

Empirical Cumulative Distribution Function of running times.

Shows for each optimizer what fraction of (function, seed) problems it solved within a given CU budget. A problem counts as “solved” when best_so_far <= target, where target is f_global + precision (or best-known + precision as fallback).

Parameters:

precision (float or list[float]) – Target precision(s). A list produces stacked subplots, one per precision level, useful for comparing difficulty grades like [1.0, 0.1, 0.01].
log_x (bool) – Logarithmic x-axis. Standard in benchmark literature.

Return type:

plotly.graph_objects.Figure
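The quantity on the y-axis can be sketched without any plotting. The example below assumes each (function, seed) problem has already been reduced to the CU at which it first reached the target, with None for problems never solved; ecdf_fraction is a hypothetical helper, not part of the library.

```python
def ecdf_fraction(hit_cu, budget):
    """Fraction of (function, seed) problems solved within `budget` CU."""
    solved = sum(1 for h in hit_cu if h is not None and h <= budget)
    return solved / len(hit_cu)


# CU at which the target was first reached; None means never reached.
hit_cu = [50.0, 200.0, None, 1000.0]

curve = [(b, ecdf_fraction(hit_cu, b)) for b in (10.0, 100.0, 1000.0)]
```

Unsolved problems stay in the denominator, so a curve that plateaus below 1.0 indicates targets the optimizer never reached.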

convergence(function: str, band: str = 'iqr', center: str = 'median', log_y: bool = False) → Any[source]#

Convergence plot for a single function across all optimizers.

Shows how quickly each optimizer converges by plotting the center line (median or mean of best_so_far across seeds) with an optional uncertainty band.

Parameters:

function (str) – Function name to plot.
band (str or None) – Uncertainty band style: "iqr" (25th-75th percentile), "minmax", "std" (center +/- 1 standard deviation), or None to hide the band.
center (str) – Center line statistic: "median" or "mean".
log_y (bool) – Logarithmic y-axis.

Return type:

plotly.graph_objects.Figure
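As a rough illustration of the "median" center with an "iqr" band, the sketch below aggregates best_so_far across seeds at each evaluation index. It assumes all seeds have the same number of evaluations and are aligned by index; the library may align traces differently (for example by CU), and center_and_iqr is a hypothetical helper.

```python
from statistics import median, quantiles


def center_and_iqr(best_so_far_per_seed):
    """Median and 25th/75th percentiles of best_so_far per evaluation index."""
    rows = []
    for values in zip(*best_so_far_per_seed):  # one tuple per evaluation index
        q1, _, q3 = quantiles(values, n=4, method="inclusive")
        rows.append((median(values), q1, q3))
    return rows


# Five seeds, two evaluations each (best_so_far never increases within a seed).
per_seed = [[5, 5], [3, 2], [4, 4], [6, 1], [2, 2]]
band = center_and_iqr(per_seed)
```

Each row holds (center, lower, upper) for one evaluation index, which is the information a convergence plot draws as a line with a shaded band.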

cd_diagram(at_cu: float | None = None, alpha: float = 0.05, correction: str | None = 'holm', title: str | None = None, width: float = 8.0) → Any[source]#

Critical Difference diagram comparing optimizer ranks.

Visualizes average ranks on a horizontal axis with thick bars connecting groups of optimizers that are not statistically distinguishable (Demsar, 2006).

Average ranks are computed with proper tied-rank handling using only functions where all optimizers produced results (complete blocks), matching the Friedman test methodology.

Requires matplotlib (pip install surfaces[viz]).

Parameters:

at_cu (float, optional) – Evaluate scores at this CU budget.
alpha (float, default=0.05) – Significance level for clique detection.
correction (str or None, default="holm") – P-value correction for pairwise Wilcoxon tests.
title (str, optional) – Figure title. Defaults to include the alpha value.
width (float, default=8.0) – Figure width in inches.

Return type:

matplotlib.figure.Figure

Traces#

Trace#

One complete optimization trajectory (single optimizer, single function, single seed).

class Trace[source]#

Bases: object

Ordered sequence of evaluation records from a single benchmark run.

Represents one complete optimization trajectory: a single optimizer on a single function with a single seed. The cumulative CU values form a monotonically increasing sequence that enables CU-indexed lookups via binary search.

append(record: EvalRecord) → None[source]#: Append an evaluation record to the trace.

property records: list[EvalRecord][source]#: List copy of all evaluation records.

property best_score: float | None[source]#: Best (lowest) score across all evaluations, or None if empty.

property total_cu: float[source]#: Total cumulative compute units consumed.

property n_evaluations: int[source]#: Number of evaluations recorded.

property total_overhead_cu: float[source]#: Sum of optimizer overhead CU across all evaluations.

property total_eval_cu: float[source]#: Sum of function evaluation CU across all evaluations.

property overhead_fraction: float[source]#: Fraction of total CU spent on optimizer overhead.

score_at_cu(cu_budget: float) → float | None[source]#

Best score achieved within the given CU budget.

Uses binary search on cumulative CU for efficient lookup.
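A CU-indexed lookup of this kind can be sketched with the standard bisect module. This is an illustration of the technique, not the method's actual code: it assumes the budget is inclusive and that None is returned when no evaluation fits, and the free function score_at_cu below is a standalone stand-in.

```python
from bisect import bisect_right


def score_at_cu(cumulative_cu, best_so_far, cu_budget):
    """Best score among evaluations whose cumulative CU is <= cu_budget."""
    # bisect_right counts how many evaluations fit within the budget.
    n_within = bisect_right(cumulative_cu, cu_budget)
    if n_within == 0:
        return None  # not even the first evaluation fits
    return best_so_far[n_within - 1]


cumulative_cu = [10.0, 25.0, 40.0, 70.0]  # monotonically increasing
best_so_far = [5.0, 3.2, 3.2, 1.1]

at_40 = score_at_cu(cumulative_cu, best_so_far, 40.0)
at_5 = score_at_cu(cumulative_cu, best_so_far, 5.0)
```

Because cumulative CU is sorted, each lookup costs O(log n) instead of a linear scan over the records.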

score_at_iter(n_iter: int) → float | None[source]#: Best score after exactly n_iter evaluations.

EvalRecord#

class EvalRecord(params: dict[str, Any], score: float, eval_cu: float, overhead_cu: float, cumulative_cu: float, best_so_far: float, wall_seconds: float)[source]#

Bases: object

Single evaluation record within a benchmark trace.

Captures both the evaluation result and the computational cost breakdown between function evaluation and optimizer overhead.

Execution#

TrialInfo#

Passed to callbacks after each trial completes.

class TrialInfo(function: str, optimizer: str, seed: int, index: int, total: int, skipped: bool, wall_seconds: float | None, error: Exception | None = None)[source]#

Bases: object

Information about a single benchmark trial, passed to callbacks.

Parameters:

function (str) – Name of the test function class.
optimizer (str) – Display name of the optimizer adapter.
seed (int) – Random seed used for this trial.
index (int) – 1-based index of this trial within the run.
total (int) – Total number of trials (including skipped) in the run.
skipped (bool) – True if the trial was skipped because a trace already existed.
wall_seconds (float or None) – Wall-clock time for the trial. None when skipped.
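A progress callback only needs the fields listed above. The sketch below is duck-typed so it can be tried without running a benchmark; the SimpleNamespace object is a stand-in for a real TrialInfo, and format_progress and print_progress are hypothetical names.

```python
from types import SimpleNamespace


def format_progress(info):
    """Build a one-line progress message from a TrialInfo-like object."""
    if info.skipped:
        status = "skipped"
    elif info.wall_seconds is None:
        status = "n/a"
    else:
        status = f"{info.wall_seconds:.2f}s"
    return (
        f"[{info.index}/{info.total}] "
        f"{info.function} / {info.optimizer} (seed {info.seed}): {status}"
    )


def print_progress(info):
    """Callback with the documented shape: takes one TrialInfo, returns None."""
    print(format_progress(info))


# Stand-in carrying the documented TrialInfo fields, for demonstration only.
demo = SimpleNamespace(
    function="AckleyFunction", optimizer="HillClimbing", seed=0,
    index=1, total=4, skipped=False, wall_seconds=0.1234, error=None,
)
line = format_progress(demo)
```

Given the run() signature, such a function would presumably be passed as bench.run(callback=print_progress).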

ParallelBackend#

class ParallelBackend(n_jobs: int = -1)[source]#

Bases: ABC

Base class for parallel benchmark execution.

Subclasses must implement map to distribute work across workers. The n_jobs attribute controls the number of workers.

Parameters: n_jobs (int) – Number of parallel workers. Use -1 for all available CPU cores.

property n_jobs: int[source]#: Number of workers requested (-1 means all cores).

property effective_n_jobs: int[source]#: Resolved worker count (always a positive integer).

abstractmethod map(fn: Callable, tasks: list) → list[source]#

Execute fn(task) for each task and return results.

Results must be returned in the same order as tasks.

Parameters:

fn (callable) – A picklable callable accepting a single positional argument.
tasks (list) – Task arguments to distribute across workers.

Returns:

One result per task, in submission order.

Return type:

list
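The ordering requirement is the main thing a custom backend has to get right. The sketch below shows one way to satisfy it with concurrent.futures. To stay self-contained it subclasses a local stand-in (BackendInterface) rather than the real ParallelBackend, and OrderedThreadBackend is a hypothetical name; a real subclass would also inherit n_jobs handling from the base class.

```python
from abc import ABC, abstractmethod
from concurrent.futures import ThreadPoolExecutor


class BackendInterface(ABC):
    """Local stand-in mirroring the documented map() contract."""

    @abstractmethod
    def map(self, fn, tasks):
        """Execute fn(task) for each task; return results in task order."""


class OrderedThreadBackend(BackendInterface):
    """Thread pool backend that preserves submission order."""

    def __init__(self, n_workers=2):
        self.n_workers = n_workers

    def map(self, fn, tasks):
        with ThreadPoolExecutor(max_workers=self.n_workers) as pool:
            # Executor.map yields results in submission order, even when
            # workers finish out of order.
            return list(pool.map(fn, tasks))


def square(x):
    return x * x


results = OrderedThreadBackend(n_workers=2).map(square, [3, 1, 2])
```

Backends that collect results as they complete would need to re-sort them by task index before returning.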

ProcessBackend#

class ProcessBackend(n_jobs: int = -1)[source]#

Bases: ParallelBackend

Process-based parallelism via ProcessPoolExecutor.

Each worker runs in a separate OS process, bypassing the GIL. Task arguments and return values must be picklable.

Parameters: n_jobs (int) – Number of worker processes. Defaults to -1 (all CPU cores).

map(fn: Callable, tasks: list) → list[source]#: Execute tasks in separate processes via multiprocessing.

property effective_n_jobs: int[source]#: Resolved worker count (always a positive integer).

property n_jobs: int[source]#: Number of workers requested (-1 means all cores).

ThreadBackend#

class ThreadBackend(n_jobs: int = -1)[source]#

Bases: ParallelBackend

Thread-based parallelism via ThreadPoolExecutor.

Useful when the objective function or optimizer releases the GIL (e.g. calls into C extensions). Avoids the pickling overhead of process-based backends.

Parameters: n_jobs (int) – Number of worker threads. Defaults to -1 (all CPU cores).

map(fn: Callable, tasks: list) → list[source]#: Execute tasks in threads. Useful when tasks release the GIL.

property effective_n_jobs: int[source]#: Resolved worker count (always a positive integer).

property n_jobs: int[source]#: Number of workers requested (-1 means all cores).

Statistical Functions#

These are called internally by the accessor methods but can also be used standalone.

compute_friedman(traces: dict[tuple[str, str, int], Trace], alpha: float = 0.05, at_cu: float | None = None) → FriedmanResult[source]#

Friedman omnibus test on benchmark traces.

Requires at least 3 optimizers and 3 functions where all optimizers produced results. Uses normalized scores.

The Iman-Davenport correction replaces the chi-squared approximation with an F-distribution, giving a less conservative (more powerful) test.
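The two statistics can be written out from their textbook definitions (as given in Demsar, 2006). The sketch below is a standalone illustration, not the library's code: it ranks raw scores with lower = better, applies no tie correction to the chi-squared statistic, and omits p-values, which need chi-squared and F distribution functions. average_ranks and friedman_statistics are hypothetical names.

```python
def average_ranks(scores):
    """Tied-average ranks for one function; the lowest score gets rank 1."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def friedman_statistics(score_rows):
    """Chi-squared and Iman-Davenport F statistics from per-function rows."""
    n = len(score_rows)     # number of functions (complete blocks)
    k = len(score_rows[0])  # number of optimizers
    rank_rows = [average_ranks(row) for row in score_rows]
    mean_ranks = [sum(r[j] for r in rank_rows) / n for j in range(k)]
    chi2 = 12 * n / (k * (k + 1)) * (
        sum(r * r for r in mean_ranks) - k * (k + 1) ** 2 / 4
    )
    # Undefined when chi2 == n * (k - 1), i.e. identical rankings everywhere.
    f_stat = (n - 1) * chi2 / (n * (k - 1) - chi2)
    return chi2, f_stat, mean_ranks


# Four functions (rows), three optimizers (columns).
rows = [[0.1, 0.5, 0.9], [0.2, 0.4, 0.8], [0.3, 0.1, 0.7], [0.1, 0.6, 0.5]]
chi2, f_stat, mean_ranks = friedman_statistics(rows)
```

The F statistic is compared against an F-distribution with k - 1 and (k - 1)(n - 1) degrees of freedom.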

compute_ert(traces: dict[tuple[str, str, int], Trace], optimal_scores: dict[str, float | None], precision: float, targets: dict[str, float] | None = None) → ERTTable[source]#

Compute Expected Running Time for all (function, optimizer) pairs.

For each trace, finds the first cumulative_cu where best_so_far <= target. ERT follows the COCO convention: the sum of running times over all runs, including unsuccessful ones, divided by the number of successful runs.
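A worked example may help make the ERT ratio concrete. The sketch below assumes that an unsuccessful seed is charged the total CU it consumed, which is the usual COCO reading, and that ERT is infinite when no seed succeeds. Both points are assumptions about the implementation, and expected_running_time is a hypothetical helper.

```python
import math


def expected_running_time(hit_cu, total_cu):
    """ERT for one (function, optimizer) pair.

    hit_cu[i]   : cumulative CU at which seed i first reached the target,
                  or None if it never did.
    total_cu[i] : total CU consumed by seed i (charged when the seed failed).
    """
    spent = sum(h if h is not None else t for h, t in zip(hit_cu, total_cu))
    solved = sum(h is not None for h in hit_cu)
    if solved == 0:
        return math.inf, solved
    return spent / solved, solved


# Three seeds with a 500 CU budget each; the second never reaches the target.
ert_cu, solved = expected_running_time([120.0, None, 80.0], [500.0, 500.0, 500.0])
success_rate = solved / 3
```

Here the total cost is 120 + 500 + 80 = 700 CU over two successes, so the ERT is 350 CU.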

compute_ranking(traces: dict[tuple[str, str, int], Trace], alpha: float = 0.05, at_cu: float | None = None, correction: str | None = 'holm') → RankingTable[source]#

Rank optimizers by normalized scores with pairwise statistical tests.

Normalization is per-function: 0 = worst observed, 1 = best observed. Ranks use tied-rank averaging within each function, then are averaged across functions. Pairwise Wilcoxon signed-rank tests assess significance, with optional Holm correction for multiple comparisons.

Parameters: correction (str or None) – "holm" applies Holm step-down correction (recommended). None returns raw uncorrected p-values.
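The per-function normalization can be illustrated with a small min-max sketch. It assumes minimization (lower raw score is better, consistent with Trace.best_score) and assigns 1.0 to every optimizer when all scores tie; the tie rule and the name normalize_scores are assumptions made for the example.

```python
def normalize_scores(scores):
    """Map raw scores for one function to [0, 1]: 0 = worst, 1 = best observed."""
    best = min(scores.values())   # lower raw score is better
    worst = max(scores.values())
    if best == worst:
        # Assumption: with no spread, every optimizer counts as best.
        return {name: 1.0 for name in scores}
    return {name: (worst - s) / (worst - best) for name, s in scores.items()}


raw = {"HillClimbing": 2.0, "RandomSearch": 6.0, "TPESampler": 4.0}
normalized = normalize_scores(raw)
```

Normalizing per function keeps functions with large raw score ranges from dominating the average across functions.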