Benchmark#
The surfaces.benchmark module for systematic optimizer comparison.
See Benchmarking for usage examples and concepts.
Core#
Benchmark#
- class Benchmark(budget_cu: float | None = None, budget_iter: int | None = None, n_seeds: int = 1, seed: int = 0, catch: str = 'raise')[source]#
Bases: object
Configurable, incremental benchmark runner.
Collects test functions and optimizer specs via add methods, then runs only the missing combinations on each run() call. Results accumulate in the instance across runs.
- Parameters:
budget_cu (float, optional) – Maximum compute budget per run in Compute Units.
budget_iter (int, optional) – Maximum number of function evaluations per run.
n_seeds (int) – Number of independent runs per (function, optimizer) pair.
seed (int) – Base random seed. Run i uses seed + i.
catch (str) – Error handling for individual trials.
"raise" (default) propagates exceptions immediately.
"warn" logs a warning and continues.
"skip" silently skips the failed trial.
Failed trials are recorded in bench.errors regardless of mode.
Examples
>>> bench = Benchmark(budget_cu=50_000, n_seeds=5)
>>> bench.add_functions(collection.filter(category="bbob"))
>>> bench.add_optimizers([HillClimbing, RandomSearch])
>>> bench.run()
>>> bench.results.summary()
- add_functions(functions: Any) Benchmark[source]#
Add test functions to the benchmark.
Accepts a single class, a list of classes, or a Collection. Duplicates are silently ignored.
- add_optimizers(optimizers: Any) Benchmark[source]#
Add optimizer specs to the benchmark.
Accepts a single spec or a list of specs. Each spec can be a class (auto-detected by module path) or a (class, params_dict) tuple. Duplicates are silently ignored.
- remove_functions(functions: Any) Benchmark[source]#
Remove test functions and their associated traces.
Accepts a single class, a list of classes, or a Collection. Classes not currently registered are silently ignored. All traces recorded for removed functions are deleted.
- remove_optimizers(optimizers: Any) Benchmark[source]#
Remove optimizer specs and their associated traces.
Accepts a single spec or a list of specs. Each spec can be a bare class or a (class, params_dict) tuple.
When a bare class is passed, all entries using that class are removed regardless of their params. When a tuple is passed, only the entry with exactly matching params is removed. This lets you selectively drop one configuration while keeping others:
bench.remove_optimizers(TPESampler)  # all TPESampler entries
bench.remove_optimizers((TPESampler, {"n": 10}))  # only this config
Specs not currently registered are silently ignored. All traces recorded for removed optimizers are deleted.
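The matching rule for bare classes versus tuples can be illustrated with a small stand-alone sketch. It does not call the surfaces API: remove_specs, the registry list, and the two placeholder classes are hypothetical names, and the library's internal bookkeeping may differ.

```python
def remove_specs(registered, spec):
    """Return the registry with entries matching `spec` removed.

    `registered` is a list of (cls, params) tuples; `spec` is either a bare
    class or a (cls, params) tuple. Hypothetical helper for illustration.
    """
    if isinstance(spec, tuple):
        cls, params = spec
        # Tuple: drop only the entry whose class and params both match.
        return [(c, p) for c, p in registered if not (c is cls and p == params)]
    # Bare class: drop every entry using that class, whatever its params.
    return [(c, p) for c, p in registered if c is not spec]


class TPESampler:  # placeholder class, not the real optimizer
    pass


class RandomSearch:  # placeholder class
    pass


registry = [(TPESampler, {"n": 10}), (TPESampler, {"n": 20}), (RandomSearch, {})]
only_one = remove_specs(registry, (TPESampler, {"n": 10}))
all_tpe = remove_specs(registry, TPESampler)
```

A spec that is not in the registry leaves the list unchanged, which mirrors the "silently ignored" behaviour described above.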
- run(*, verbose: bool = True, callback: Callable[[TrialInfo], None] | None = None, backend: ParallelBackend | None = None) Benchmark[source]#
Run all missing (function, optimizer, seed) combinations.
Only executes combinations that have no trace yet. New traces are added to the existing results, so previous data is preserved.
- Parameters:
verbose (bool) – Print progress to stdout for each trial plus a summary line.
callback (callable, optional) – Called with a TrialInfo after each trial (including skipped ones). Useful for custom progress bars or logging.
backend (ParallelBackend, optional) – Parallel execution backend. When provided, pending trials are dispatched to workers via backend.map(). When None (the default), trials run sequentially in the current process.
Note: with a parallel backend and catch != "raise", all trials complete before errors are collected. In sequential mode, catch="raise" stops on the first failure.
- Returns:
self, for chaining.
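The "only missing combinations" behaviour of run() can be pictured as a set difference between all requested trial keys and the keys that already have a trace. The sketch below is an illustration under that assumption; pending_trials and the plain-string keys are hypothetical, and only the seed rule (run i uses seed + i) is taken from the constructor documentation.

```python
from itertools import product


def pending_trials(functions, optimizers, n_seeds, base_seed, existing):
    """List (function, optimizer, seed) keys that have no trace yet."""
    seeds = [base_seed + i for i in range(n_seeds)]  # run i uses seed + i
    return [
        key
        for key in product(functions, optimizers, seeds)
        if key not in existing
    ]


# One trace already exists, so only three of the four combinations remain.
existing = {("Sphere", "HillClimbing", 0): "trace"}
todo = pending_trials(["Sphere"], ["HillClimbing", "RandomSearch"], 2, 0, existing)
```

Calling such a routine again after all traces exist yields an empty list, which is why repeated run() calls add no duplicate work.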
- property results: ResultAccessor[source]#
Access benchmark results: summary, traces, dataframe export.
- property io: IOAccessor[source]#
Access save/load functionality.
- property plot: PlotAccessor[source]#
Access benchmark visualizations.
- property errors: dict[tuple[str, str, int], Exception][source]#
Failed trials from the last run, keyed by (function, optimizer, seed).
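How the constructor's catch mode interacts with this mapping can be sketched without the library. run_trial_safely, the trial callable, and the example key are hypothetical; the sketch only encodes the documented rule that failures are recorded regardless of mode.

```python
import warnings


def run_trial_safely(trial, key, errors, catch="raise"):
    """Run one trial callable and apply the documented catch policy.

    `trial`, `key`, and `errors` are illustrative stand-ins, not surfaces objects.
    """
    if catch not in ("raise", "warn", "skip"):
        raise ValueError(f"unknown catch mode: {catch!r}")
    try:
        return trial()
    except Exception as exc:
        errors[key] = exc  # recorded regardless of mode
        if catch == "raise":
            raise
        if catch == "warn":
            warnings.warn(f"trial {key} failed: {exc!r}")
        return None  # "warn" and "skip" both continue with the next trial


def failing_trial():
    raise RuntimeError("boom")


errors = {}
result = run_trial_safely(
    failing_trial, ("Sphere", "RandomSearch", 0), errors, catch="skip"
)
```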
- classmethod load(path: str | Path) Benchmark[source]#
Load a benchmark (config + results) from a JSON file.
Functions must originate from the surfaces package. A warning is emitted if the Surfaces version differs from the one used when saving.
- Parameters:
path (str or Path) – Path to a JSON file previously created by io.save().
- classmethod from_suite(suite: Suite, **overrides: Any) Benchmark[source]#
Create a Benchmark pre-configured from a Suite definition.
The suite provides function filters and default budget/seed settings. Add optimizers and call run() to execute.
- Parameters:
suite (Suite) – A predefined suite from surfaces.benchmark.suites.
**overrides – Override suite defaults (budget_cu, budget_iter, n_seeds, seed).
Suite#
Result Analysis#
ResultAccessor#
Accessed via bench.results.
- class ResultAccessor(benchmark: Benchmark)[source]#
Bases: object
Query and analyze benchmark results.
Accessed via bench.results. All methods operate on the accumulated traces inside the parent Benchmark instance.
- traces(function: str | None = None, optimizer: str | None = None, seed: int | None = None) dict[tuple[str, str, int], Trace][source]#
Filter traces by function name, optimizer name, and/or seed.
- summary(at_cu: float | None = None, at_iter: int | None = None, show_ci: bool = False) str[source]#
Generate a formatted summary table of benchmark results.
- to_dataframe() Any[source]#
Export all evaluation records as a pandas DataFrame.
Each row is a single evaluation with function/optimizer/seed metadata and full cost breakdown. Parameter values are expanded into individual columns.
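The row layout described here (one row per evaluation, metadata columns, parameters expanded into their own columns) can be sketched with plain dictionaries. flatten_records and the "param_" column prefix are assumptions made for illustration; the actual column names produced by to_dataframe() are not documented in this section.

```python
def flatten_records(traces):
    """Turn {(function, optimizer, seed): [record, ...]} into flat row dicts.

    Each record is a plain dict here; the real EvalRecord is a class.
    """
    rows = []
    for (function, optimizer, seed), records in traces.items():
        for record in records:
            row = {"function": function, "optimizer": optimizer, "seed": seed}
            # Copy score and cost fields, leaving params for expansion below.
            row.update({k: v for k, v in record.items() if k != "params"})
            # Expand each parameter into its own column (hypothetical prefix).
            row.update({f"param_{name}": value for name, value in record["params"].items()})
            rows.append(row)
    return rows


traces = {
    ("Sphere", "RandomSearch", 0): [
        {"params": {"x0": 1.0, "x1": -2.0}, "score": 5.0, "eval_cu": 1.0,
         "overhead_cu": 0.1, "cumulative_cu": 1.1, "best_so_far": 5.0},
    ]
}
rows = flatten_records(traces)
```

A list of such row dictionaries is the shape that pandas.DataFrame accepts directly.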
- ert(precision: float = 1.0, targets: dict[str, float] | None = None) Any[source]#
Compute Expected Running Time for all (function, optimizer) pairs.
A problem counts as “solved” when best_so_far <= f_global + precision. ERT follows the COCO convention: total budget across all seeds divided by the number of successful seeds.
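As a worked illustration of that convention, the sketch below computes ERT from per-seed trajectories given as (cumulative_cu, best_so_far) pairs. It assumes that a successful seed contributes the CU spent up to its first hit and an unsuccessful seed contributes its full spent budget; expected_running_time is a hypothetical helper, not the library's compute_ert.

```python
import math


def expected_running_time(runs, target):
    """ERT over non-empty runs given as lists of (cumulative_cu, best_so_far)."""
    total_cu = 0.0
    successes = 0
    for run in runs:
        # First cumulative CU at which the target is reached, if any.
        hit = next((cu for cu, best in run if best <= target), None)
        if hit is None:
            total_cu += run[-1][0]  # unsuccessful: full budget spent
        else:
            total_cu += hit  # successful: cost up to the first hit
            successes += 1
    return total_cu / successes if successes else math.inf


runs = [
    [(10.0, 5.0), (20.0, 0.5), (30.0, 0.1)],  # reaches the target at 20 CU
    [(10.0, 4.0), (20.0, 3.0), (30.0, 2.0)],  # never reaches the target
]
ert = expected_running_time(runs, target=1.0)
```

Here one of two seeds succeeds, so the 20 CU of the successful seed and the 30 CU of the failed seed are both charged to a single success.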
- ranking(at_cu: float | None = None, alpha: float = 0.05, correction: str | None = 'holm') Any[source]#
Rank optimizers by normalized performance with pairwise tests.
Scores are normalized per function (0 = worst observed, 1 = best observed) and averaged over seeds. Ranks use tied-rank averaging within each function. Pairwise Wilcoxon signed-rank tests assess statistical significance.
- Parameters:
at_cu (float, optional) – Evaluate scores at this CU budget instead of using the final best score.
alpha (float, default=0.05) – Significance level for the Wilcoxon tests.
correction (str or None, default="holm") – Multiple comparison correction. "holm" applies the Holm step-down procedure (controls family-wise error rate). None returns raw uncorrected p-values.
- Returns:
Printable, subscriptable, and exportable to DataFrame.
- Return type:
RankingTable
- friedman(at_cu: float | None = None, alpha: float = 0.05) Any[source]#
Friedman omnibus test for comparing multiple optimizers.
Tests whether at least one optimizer’s performance differs significantly. This is the recommended first step before pairwise comparisons: if the Friedman test does not reject, pairwise differences are not statistically supported.
Requires at least 3 optimizers and 3 functions where all optimizers produced results.
- Parameters:
- Returns:
Printable result with chi-squared and Iman-Davenport statistics, average ranks, and significance verdict.
- Return type:
FriedmanResult
FriedmanResult#
Returned by bench.results.friedman().
- class FriedmanResult(chi2_statistic: float, chi2_p_value: float, f_statistic: float, f_p_value: float, n_functions: int, n_optimizers: int, alpha: float, avg_ranks: dict[str, float])[source]#
Bases: object
Result of the Friedman omnibus test for comparing multiple optimizers.
The Friedman test checks whether at least one optimizer differs significantly. If significant is True, proceed with post-hoc pairwise tests. If False, observed differences are not statistically supported.
The Iman-Davenport variant uses an F-distribution and is less conservative than the chi-squared approximation.
ERTTable#
Returned by bench.results.ert(). Subscriptable by function name, then optimizer name.
- class ERTTable(data: dict[str, dict[str, ERTEntry]], precision: float, function_names: list[str], optimizer_names: list[str])[source]#
Bases: object
Expected Running Time results across functions and optimizers.
Subscriptable by function name, then optimizer name:
ert = bench.results.ert()
entry = ert["AckleyFunction"]["HillClimbing"]
print(entry.ert_cu, entry.solved, entry.total)
ERTEntry#
RankingTable#
Returned by bench.results.ranking(). Subscriptable by optimizer name.
- class RankingTable(entries: list[RankingEntry], pvalues: dict[tuple[str, str], float], alpha: float, normalized_scores: dict[str, dict[str, float]], correction: str | None = None)[source]#
Bases: object
Optimizer ranking with pairwise statistical tests.
Subscriptable by optimizer name:
ranking = bench.results.ranking()
entry = ranking["HillClimbing"]
print(entry.rank, entry.mean_normalized)
print(ranking.pvalues)
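The normalized scores and tied ranks behind these entries can be sketched for a single function as follows. The sketch assumes minimization (lower raw score is better) and treats an all-equal function as fully tied at 1.0; both helpers are hypothetical and only mirror the documented rule of 0 = worst observed, 1 = best observed, with tied-rank averaging.

```python
def normalize(scores):
    """Map raw scores (lower is better) to [0, 1]: 0 = worst, 1 = best."""
    best, worst = min(scores.values()), max(scores.values())
    if best == worst:
        return {name: 1.0 for name in scores}  # assumption: all tied counts as best
    return {name: (worst - s) / (worst - best) for name, s in scores.items()}


def tied_ranks(normalized):
    """Rank 1 = highest normalized score; ties share the average rank."""
    ordered = sorted(normalized, key=normalized.get, reverse=True)
    ranks = {}
    i = 0
    while i < len(ordered):
        j = i
        # Extend j over the run of entries tied with position i.
        while j + 1 < len(ordered) and normalized[ordered[j + 1]] == normalized[ordered[i]]:
            j += 1
        avg = (i + 1 + j + 1) / 2  # average of 1-based positions i+1 .. j+1
        for name in ordered[i:j + 1]:
            ranks[name] = avg
        i = j + 1
    return ranks


scores = {"HillClimbing": 2.0, "RandomSearch": 6.0, "Annealing": 2.0}
norm = normalize(scores)
ranks = tied_ranks(norm)
```

The two optimizers tied for best share rank 1.5 instead of taking ranks 1 and 2, which keeps the rank sum per function constant.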
RankingEntry#
Persistence#
IOAccessor#
Accessed via bench.io.
- class IOAccessor(benchmark: Benchmark)[source]#
Bases: object
Save benchmark state (config + results) to disk.
Accessed via bench.io. Loading is done via the Benchmark.load() classmethod.
- save(path: str | Path) None[source]#
Save the full benchmark state to a JSON file.
Stores configuration (budget, seeds), registered functions and optimizers, the Surfaces version, and all accumulated traces. The file can be loaded back with Benchmark.load().
- Parameters:
path (str or Path) – Output file path. Will be overwritten if it exists.
Visualization#
PlotAccessor#
Accessed via bench.plot.
- class PlotAccessor(benchmark: Benchmark)[source]#
Bases: object
Benchmark visualization via Plotly.
Accessed via bench.plot. All methods return a plotly.graph_objects.Figure that can be displayed with fig.show() or rendered automatically in Jupyter notebooks.
Requires the viz extra (pip install surfaces[viz]).
- ecdf(precision: float | list[float] = 1.0, log_x: bool = True) Any[source]#
Empirical Cumulative Distribution Function of running times.
Shows for each optimizer what fraction of (function, seed) problems it solved within a given CU budget. A problem counts as “solved” when best_so_far <= target, where target is f_global + precision (or best-known + precision as fallback).
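The quantity plotted on the vertical axis can be sketched as a simple counting step. The example assumes the first-hit CU of each (function, seed) problem has already been extracted, with None marking problems that never reached the target; ecdf_fraction is a hypothetical helper and does no plotting.

```python
def ecdf_fraction(hit_times, budgets):
    """Fraction of problems solved within each budget.

    `hit_times` holds, per (function, seed) problem, the CU at which the
    target was first reached, or None if it never was.
    """
    n = len(hit_times)
    return [
        sum(1 for t in hit_times if t is not None and t <= b) / n
        for b in budgets
    ]


hit_times = [20.0, None, 150.0, 900.0]
fractions = ecdf_fraction(hit_times, budgets=[10.0, 100.0, 1000.0])
```

Because unsolved problems stay in the denominator, a curve that plateaus below 1.0 indicates problems the optimizer never solved within its budget.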
- convergence(function: str, band: str = 'iqr', center: str = 'median', log_y: bool = False) Any[source]#
Convergence plot for a single function across all optimizers.
Shows how quickly each optimizer converges by plotting the center line (median or mean of best_so_far across seeds) with an optional uncertainty band.
- Parameters:
- Return type:
plotly.graph_objects.Figure
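The default center line and band (median with interquartile range) can be sketched per evaluation step using the standard library. This assumes every seed's best_so_far curve has been resampled to the same length; center_and_band is a hypothetical helper, and the quantile interpolation used by the library itself may differ from the "inclusive" method chosen here.

```python
import statistics


def center_and_band(curves):
    """Per-step median and interquartile range across seeds.

    `curves` is a list of equal-length best_so_far sequences, one per seed.
    """
    medians, lower, upper = [], [], []
    for values in zip(*curves):  # one tuple of seed values per step
        q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
        lower.append(q1)
        medians.append(q2)
        upper.append(q3)
    return medians, lower, upper


curves = [
    [9.0, 4.0, 1.0],
    [8.0, 6.0, 2.0],
    [7.0, 5.0, 3.0],
]
medians, lower, upper = center_and_band(curves)
```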
- cd_diagram(at_cu: float | None = None, alpha: float = 0.05, correction: str | None = 'holm', title: str | None = None, width: float = 8.0) Any[source]#
Critical Difference diagram comparing optimizer ranks.
Visualizes average ranks on a horizontal axis with thick bars connecting groups of optimizers that are not statistically distinguishable (Demšar, 2006).
Average ranks are computed with proper tied-rank handling using only functions where all optimizers produced results (complete blocks), matching the Friedman test methodology.
Requires matplotlib (pip install surfaces[viz]).
- Parameters:
at_cu (float, optional) – Evaluate scores at this CU budget.
alpha (float, default=0.05) – Significance level for clique detection.
correction (str or None, default="holm") – P-value correction for pairwise Wilcoxon tests.
title (str, optional) – Figure title. Defaults to include the alpha value.
width (float, default=8.0) – Figure width in inches.
- Return type:
Traces#
Trace#
One complete optimization trajectory (single optimizer, single function, single seed).
- class Trace[source]#
Bases: object
Ordered sequence of evaluation records from a single benchmark run.
Represents one complete optimization trajectory: a single optimizer on a single function with a single seed. The cumulative CU values form a monotonically increasing sequence that enables CU-indexed lookups via binary search.
- append(record: EvalRecord) None[source]#
Append an evaluation record to the trace.
- property records: list[EvalRecord][source]#
List copy of all evaluation records.
- property best_score: float | None[source]#
Best (lowest) score across all evaluations, or None if empty.
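A CU-indexed lookup of the kind a monotonically increasing cumulative CU sequence makes possible can be sketched with the standard bisect module. The parallel lists and best_at_cu are hypothetical stand-ins for a trace's records; the sketch does not reflect the actual method names on Trace.

```python
from bisect import bisect_right


def best_at_cu(cumulative_cu, best_so_far, budget):
    """Best score reached without exceeding `budget` CU, or None if none fit.

    Both lists are parallel; `cumulative_cu` must be increasing.
    """
    idx = bisect_right(cumulative_cu, budget)  # number of evaluations within budget
    return best_so_far[idx - 1] if idx else None


cumulative_cu = [1.0, 2.5, 4.0, 7.0]
best_so_far = [9.0, 9.0, 3.0, 1.0]
at_four = best_at_cu(cumulative_cu, best_so_far, 4.0)
```

Binary search keeps each lookup logarithmic in the trace length, which matters when scores are evaluated at many budgets across many traces.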
EvalRecord#
- class EvalRecord(params: dict[str, Any], score: float, eval_cu: float, overhead_cu: float, cumulative_cu: float, best_so_far: float, wall_seconds: float)[source]#
Bases: object
Single evaluation record within a benchmark trace.
Captures both the evaluation result and the computational cost breakdown between function evaluation and optimizer overhead.
Execution#
TrialInfo#
Passed to callbacks after each trial completes.
- class TrialInfo(function: str, optimizer: str, seed: int, index: int, total: int, skipped: bool, wall_seconds: float | None, error: Exception | None = None)[source]#
Bases: object
Information about a single benchmark trial, passed to callbacks.
- Parameters:
function (str) – Name of the test function class.
optimizer (str) – Display name of the optimizer adapter.
seed (int) – Random seed used for this trial.
index (int) – 1-based index of this trial within the run.
total (int) – Total number of trials (including skipped) in the run.
skipped (bool) – True if the trial was skipped because a trace already existed.
wall_seconds (float or None) – Wall-clock time for the trial. None when skipped.
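A callback consuming these fields might look like the sketch below. FakeTrialInfo is a stand-in dataclass that only mirrors the documented field names so the example runs without the library; format_progress is a hypothetical helper.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FakeTrialInfo:
    """Stand-in with the documented TrialInfo fields, for illustration only."""
    function: str
    optimizer: str
    seed: int
    index: int
    total: int
    skipped: bool
    wall_seconds: Optional[float]
    error: Optional[Exception] = None


def format_progress(info):
    """Build a one-line progress message from a TrialInfo-like object."""
    if info.skipped:
        status = "skipped"  # wall_seconds is None for skipped trials
    elif info.error is not None:
        status = f"failed ({type(info.error).__name__})"
    else:
        status = f"{info.wall_seconds:.2f}s"
    return (
        f"[{info.index}/{info.total}] {info.function} / "
        f"{info.optimizer} / seed {info.seed}: {status}"
    )


line = format_progress(FakeTrialInfo("Sphere", "RandomSearch", 3, 2, 10, False, 1.5))
```

A wrapper such as lambda info: print(format_progress(info)) has the Callable[[TrialInfo], None] shape that run() expects for its callback argument.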
ParallelBackend#
- class ParallelBackend(n_jobs: int = -1)[source]#
Bases: ABC
Base class for parallel benchmark execution.
Subclasses must implement map to distribute work across workers. The n_jobs attribute controls the number of workers.
- Parameters:
n_jobs (int) – Number of parallel workers. Use -1 for all available CPU cores.
ProcessBackend#
- class ProcessBackend(n_jobs: int = -1)[source]#
Bases: ParallelBackend
Process-based parallelism via ProcessPoolExecutor.
Each worker runs in a separate OS process, bypassing the GIL. Task arguments and return values must be picklable.
- Parameters:
n_jobs (int) – Number of worker processes. Defaults to -1 (all CPU cores).
ThreadBackend#
- class ThreadBackend(n_jobs: int = -1)[source]#
Bases: ParallelBackend
Thread-based parallelism via ThreadPoolExecutor.
Useful when the objective function or optimizer releases the GIL (e.g. calls into C extensions). Avoids the pickling overhead of process-based backends.
- Parameters:
n_jobs (int) – Number of worker threads. Defaults to -1 (all CPU cores).
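The general pattern behind a thread-based backend can be sketched with concurrent.futures alone. The exact signature of ParallelBackend.map is not documented in this section, so the sketch deliberately does not subclass it: resolve_workers and thread_map are hypothetical functions that only illustrate the n_jobs = -1 convention and an order-preserving map.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def resolve_workers(n_jobs):
    """Translate n_jobs into a worker count; -1 means all CPU cores."""
    if n_jobs == -1:
        return os.cpu_count() or 1  # cpu_count() can return None
    if n_jobs < 1:
        raise ValueError("n_jobs must be -1 or a positive integer")
    return n_jobs


def thread_map(fn, tasks, n_jobs=-1):
    """Apply fn to every task on a thread pool, preserving input order."""
    with ThreadPoolExecutor(max_workers=resolve_workers(n_jobs)) as pool:
        return list(pool.map(fn, tasks))


squares = thread_map(lambda x: x * x, [1, 2, 3, 4], n_jobs=2)
```

Swapping ThreadPoolExecutor for ProcessPoolExecutor gives the process-based variant, at which point the mapped function and its arguments must be picklable, so a lambda like the one above would no longer work.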
Statistical Functions#
These are called internally by the accessor methods but can also be used standalone.
- compute_friedman(traces: dict[tuple[str, str, int], Trace], alpha: float = 0.05, at_cu: float | None = None) FriedmanResult[source]#
Friedman omnibus test on benchmark traces.
Requires at least 3 optimizers and 3 functions where all optimizers produced results. Uses normalized scores.
The Iman-Davenport correction replaces the chi-squared approximation with an F-distribution, giving a less conservative (more powerful) test.
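The relationship between the two statistics can be made concrete with the textbook formulas. With k optimizers, N functions, and average ranks R_j, the Friedman statistic is 12N / (k(k+1)) * (sum of R_j^2 - k(k+1)^2 / 4), and the Iman-Davenport statistic is (N-1) * chi2 / (N(k-1) - chi2), which is compared against an F-distribution with k-1 and (k-1)(N-1) degrees of freedom. The sketch below omits tie corrections and p-values; friedman_statistics is a hypothetical helper, not the library's compute_friedman.

```python
def friedman_statistics(avg_ranks, n_functions):
    """Friedman chi-squared and Iman-Davenport F from average ranks.

    Textbook formulas without tie correction; p-values are omitted.
    """
    k = len(avg_ranks)
    n = n_functions
    sum_sq = sum(r * r for r in avg_ranks)
    chi2 = 12.0 * n / (k * (k + 1)) * (sum_sq - k * (k + 1) ** 2 / 4.0)
    denom = n * (k - 1) - chi2
    # denom reaches 0 when every function ranks the optimizers identically.
    f_stat = (n - 1) * chi2 / denom if denom > 0 else float("inf")
    return chi2, f_stat


chi2, f_stat = friedman_statistics([1.5, 2.0, 2.5], n_functions=4)
```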
- compute_ert(traces: dict[tuple[str, str, int], Trace], optimal_scores: dict[str, float | None], precision: float, targets: dict[str, float] | None = None) ERTTable[source]#
Compute Expected Running Time for all (function, optimizer) pairs.
For each trace, finds the first cumulative_cu where best_so_far <= target. ERT is computed as the COCO standard: sum(all running times including inf) / number_of_successful_runs.
- compute_ranking(traces: dict[tuple[str, str, int], Trace], alpha: float = 0.05, at_cu: float | None = None, correction: str | None = 'holm') RankingTable[source]#
Rank optimizers by normalized scores with pairwise statistical tests.
Normalization is per-function: 0 = worst observed, 1 = best observed. Ranks use tied-rank averaging within each function, then are averaged across functions. Pairwise Wilcoxon signed-rank tests assess significance, with optional Holm correction for multiple comparisons.
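The Holm step-down correction mentioned here can be sketched as follows: sort the m raw p-values in ascending order, multiply the i-th smallest (counting from zero) by m - i, cap at 1, and enforce that adjusted values never decrease along the sorted order. holm_adjust is a hypothetical helper operating on a dictionary keyed by optimizer pairs; the p-values are made up for illustration.

```python
def holm_adjust(pvalues):
    """Holm step-down adjusted p-values, keyed like the input dict."""
    m = len(pvalues)
    ordered = sorted(pvalues.items(), key=lambda item: item[1])
    adjusted = {}
    running_max = 0.0
    for i, (pair, p) in enumerate(ordered):
        candidate = min(1.0, (m - i) * p)
        running_max = max(running_max, candidate)  # enforce monotonicity
        adjusted[pair] = running_max
    return adjusted


raw = {("A", "B"): 0.01, ("A", "C"): 0.04, ("B", "C"): 0.03}
adjusted = holm_adjust(raw)
```

Comparing each adjusted value against alpha then gives the same decisions as the sequential Holm procedure, while still reporting one p-value per pair.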
- Parameters:
correction (str or None) – "holm" applies Holm step-down correction (recommended). None returns raw uncorrected p-values.