SSBC Documentation

Small-Sample Beta Correction provides PAC guarantees for conformal prediction with small calibration sets.

API Reference

Core Algorithm

ssbc.core_pkg

Core API facade.

ssbc.bounds

Unified bounds computation module for SSBC.

Calibration & Conformal Prediction

ssbc.calibration

Calibration-related APIs.

Metrics & Operational Bounds

ssbc.metrics

Operational metrics and uncertainty APIs.

Reporting & Visualization

ssbc.reporting

Reporting and visualization APIs.

Uncertainty Analysis & Validation

ssbc.validation_pkg

Validation API facade.

Utilities & Tools

ssbc.utils

Utility functions for conformal prediction.

ssbc.simulation

Simulation utilities for testing conformal prediction.

ssbc.hyperparameter

Hyperparameter sweep and optimization for Mondrian conformal prediction.

Complete Package

Top-level package for SSBC (Small-Sample Beta Correction).

class ssbc.SSBCResult(alpha_target, alpha_corrected, u_star, n, satisfied_mass, mode, details)[source]

Bases: object

Result of SSBC correction.

Parameters:
alpha_target

Target miscoverage rate

Type:

float

alpha_corrected

Corrected miscoverage rate (u_star / (n+1))

Type:

float

u_star

Optimal u value found by the algorithm

Type:

int

n

Calibration set size

Type:

int

satisfied_mass

Probability that coverage >= target

Type:

float

mode

“beta” for infinite test window, “beta-binomial” for finite

Type:

Literal['beta', 'beta-binomial']

details

Additional diagnostic information

Type:

dict[str, Any]

alpha_target: float
alpha_corrected: float
u_star: int
n: int
satisfied_mass: float
mode: Literal['beta', 'beta-binomial']
details: dict[str, Any]
__init__(alpha_target, alpha_corrected, u_star, n, satisfied_mass, mode, details)
Return type:

None

ssbc.ssbc_correct(alpha_target, n, delta, *, mode='beta', m=None, bracket_width=None)[source]

Small-Sample Beta Correction (SSBC), corrected acceptance rule.

Find the largest α’ = u/(n+1) ≤ α_target such that: P(Coverage(α’) ≥ 1 - α_target) ≥ 1 - δ

where Coverage(α’) ~ Beta(n+1-u, u) for infinite test window.

Trivial regime: if α_target < 1/(n+1), return α_corrected=0.

Parameters:
  • alpha_target (float) – Target miscoverage rate (must be in (0,1))

  • n (int) – Calibration set size (must be >= 1)

  • delta (float) – PAC risk tolerance (must be in (0,1)). This is the probability that the coverage guarantee fails. For example, delta=0.10 means we want a 90% PAC confidence (1-delta) that coverage ≥ target.

  • mode ({"beta", "beta-binomial"}, default="beta") – “beta” for an infinite test window; “beta-binomial” for a finite test window of size m (m defaults to n)

  • m (int, optional) – Test window size for beta-binomial mode (defaults to n)

  • bracket_width (int, optional) – Search radius around initial guess (default: adaptive based on n)

Returns:

Dataclass containing correction results and diagnostic details

Return type:

SSBCResult

Raises:

ValueError – If parameters are out of valid ranges

Examples

>>> result = ssbc_correct(alpha_target=0.10, n=50, delta=0.10)
>>> print(f"Corrected alpha: {result.alpha_corrected:.4f}")

Notes

The algorithm uses a bracketed search with an initial guess based on normal approximation to the Beta distribution. If the initial bracket fails to find a solution, it performs adaptive outward expansion (downward then upward) with O(n) worst-case complexity.
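The acceptance rule above can be sketched without any dependencies. This is a minimal illustration for "beta" mode only, not the library's implementation: it replaces the bracketed search with a plain downward scan, and the names binom_sf and ssbc_correct_sketch are hypothetical. It relies on the identity that, for integer u, Coverage ~ Beta(n+1-u, u) implies P(Coverage ≥ 1 - α_target) = P(Binomial(n, α_target) ≥ u).

```python
from math import comb, floor


def binom_sf(u: int, n: int, p: float) -> float:
    """P(X >= u) for X ~ Binomial(n, p), summed exactly."""
    return sum(comb(n, j) * p**j * (1.0 - p) ** (n - j) for j in range(u, n + 1))


def ssbc_correct_sketch(alpha_target: float, n: int, delta: float):
    """Linear-scan sketch of the acceptance rule ("beta" mode only).

    Returns (u_star, alpha_corrected, satisfied_mass). Hypothetical helper,
    not the ssbc API.
    """
    # Largest u that still satisfies u / (n + 1) <= alpha_target.
    u_max = min(n, floor(alpha_target * (n + 1)))
    for u in range(u_max, 0, -1):  # try the largest candidate first
        mass = binom_sf(u, n, alpha_target)  # P(Coverage >= 1 - alpha_target)
        if mass >= 1.0 - delta:
            return u, u / (n + 1), mass
    return 0, 0.0, 1.0  # trivial regime: no positive u is accepted


u_star, alpha_corrected, mass = ssbc_correct_sketch(0.10, 50, 0.10)
print(u_star, round(alpha_corrected, 4), round(mass, 4))
```

The scan costs O(n) evaluations of an O(n) sum, which is fine for the small calibration sets this correction targets.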

ssbc.alpha_scan(labels, probs, fixed_threshold=None)[source]

Scan through all possible alpha thresholds and report prediction set statistics.

For each unique threshold value derived from the calibration data’s non-conformity scores, this function computes the number of abstentions, singletons, and doublets for both classes using Mondrian conformal prediction.

Optionally, a fixed threshold can be evaluated separately and returned as a dict.

Parameters:
  • labels (np.ndarray, shape (n,)) – True binary labels (0 or 1)

  • probs (np.ndarray, shape (n, 2)) – Classification probabilities [P(class=0), P(class=1)]

  • fixed_threshold (float, optional) – Fixed non-conformity score threshold for special case analysis. If None (default), no fixed threshold is evaluated.

Returns:

If fixed_threshold is None:

DataFrame with scan results

If fixed_threshold is provided:

Tuple of (DataFrame with scan results, dict with fixed threshold results)

DataFrame columns:

  • alpha: miscoverage rate (alpha)
  • qhat_0: threshold for class 0
  • qhat_1: threshold for class 1
  • n_abstentions: number of empty prediction sets
  • n_singletons: number of singleton prediction sets
  • n_doublets: number of doublet prediction sets
  • n_singletons_correct: number of correct singletons (marginal)
  • singleton_coverage: fraction of singletons that are correct (marginal)
  • n_singletons_0: singletons when true label is 0
  • n_singletons_correct_0: correct singletons when true label is 0
  • singleton_coverage_0: coverage for class 0 singletons
  • n_singletons_1: singletons when true label is 1
  • n_singletons_correct_1: correct singletons when true label is 1
  • singleton_coverage_1: coverage for class 1 singletons

Fixed threshold dict (when provided) has same keys as DataFrame columns

Return type:

pd.DataFrame or tuple[pd.DataFrame, dict]

Examples

>>> labels = np.array([0, 1, 0, 1])
>>> probs = np.array([[0.8, 0.2], [0.3, 0.7], [0.9, 0.1], [0.2, 0.8]])
>>> df = alpha_scan(labels, probs)
>>> print(df.head())
ssbc.compute_pac_operational_metrics(y_cal, probs_cal, alpha, delta, ci_level=0.95, class_label=1)[source]

Compute PAC-controlled confidence intervals for operational metrics.

Extends SSBC to provide rigorous bounds on operational metrics (singleton rates, escalation rates) without accepting risk by fiat. Uses a two-step approach:

  1. SSBC for coverage: Compute α_adj that achieves Pr(coverage ≥ 1-α) ≥ 1-δ

  2. PAC bounds on operational rates: For each possible α’ in discrete grid, run LOO-CV to estimate operational metrics, weight by Beta distribution probability, and aggregate to get PAC-controlled bounds.

Parameters:
  • y_cal (np.ndarray, shape (n,)) – Binary labels (0 or 1) for calibration set

  • probs_cal (np.ndarray, shape (n,) or (n, 2)) – Predicted probabilities. If 1D, interpreted as P(class=1). If 2D, uses column corresponding to class_label.

  • alpha (float) – Target miscoverage rate (must be in (0, 1))

  • delta (float) – PAC risk tolerance (must be in (0, 1))

  • ci_level (float, default=0.95) – Confidence level for operational metric CIs (e.g., 0.95 for 95% CI)

  • class_label (int, default=1) – Which class to calibrate for (0 or 1). Uses class_label column if probs_cal is 2D.

Returns:

Dictionary with keys:

  • 'alpha_adj': Adjusted miscoverage from SSBC
  • 'singleton_rate_ci': [lower, upper] PAC-controlled bounds
  • 'doublet_rate_ci': [lower, upper]
  • 'abstention_rate_ci': [lower, upper]
  • 'expected_singleton_rate': Probability-weighted mean singleton rate
  • 'expected_doublet_rate': Probability-weighted mean doublet rate
  • 'expected_abstention_rate': Probability-weighted mean abstention rate
  • 'alpha_grid': Discrete grid of possible alphas
  • 'singleton_fractions': Singleton rate for each alpha in grid
  • 'doublet_fractions': Doublet rate for each alpha in grid
  • 'abstention_fractions': Abstention rate for each alpha in grid
  • 'beta_weights': Probability weights from Beta distribution
  • 'n_calibration': Number of calibration points

Return type:

dict

Examples

>>> y_cal = np.array([0, 1, 0, 1, 1])
>>> probs_cal = np.array([0.2, 0.8, 0.3, 0.9, 0.7])
>>> result = compute_pac_operational_metrics(
...     y_cal, probs_cal, alpha=0.1, delta=0.1
... )
>>> print(f"Singleton rate: [{result['singleton_rate_ci'][0]:.3f}, "
...       f"{result['singleton_rate_ci'][1]:.3f}]")

Notes

Mathematical Framework:

Coverage decomposes as:

coverage = p_s(1 - α_singleton) + p_d·1 + p_a·0

where p_s, p_d, p_a are fractions of singletons, doublets, abstentions.
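As a worked instance of the decomposition, the following sketch evaluates it for one set of rates and inverts it to recover the singleton error rate. It assumes the binary setting described here, where a doublet always contains the true label and an abstention never does. The function names are hypothetical.

```python
def coverage_from_rates(p_s: float, p_d: float, p_a: float, alpha_singleton: float) -> float:
    """coverage = p_s * (1 - alpha_singleton) + p_d * 1 + p_a * 0."""
    if abs(p_s + p_d + p_a - 1.0) > 1e-9:
        raise ValueError("singleton, doublet and abstention rates must sum to 1")
    return p_s * (1.0 - alpha_singleton) + p_d


def singleton_error_from_coverage(coverage: float, p_s: float, p_d: float) -> float:
    """Invert the decomposition to recover the singleton error rate."""
    return 1.0 - (coverage - p_d) / p_s


# 80% singletons with a 5% error rate, 15% doublets, 5% abstentions.
cov = coverage_from_rates(0.80, 0.15, 0.05, alpha_singleton=0.05)
print(round(cov, 4))  # 0.8 * 0.95 + 0.15
print(round(singleton_error_from_coverage(cov, 0.80, 0.15), 4))
```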

For each α’ in the discrete grid {k/(n+1)}, k=1,…,n:

  1. Run LOO-CV to determine prediction sets for each point
  2. Calculate operational rates: p_s(α’), p_d(α’), p_a(α’)
  3. Compute Clopper-Pearson CIs for each rate
  4. Weight by Beta(k, n+1-k) probability

Aggregate across α’ with probability weighting to get PAC-controlled bounds.

Edge Cases:

  • Small n: Discretization is coarse, bounds may be conservative
  • Extreme α or δ: May result in very wide bounds
  • Class imbalance: Focus on class_label, ensure sufficient samples

ssbc.mondrian_conformal_calibrate(class_data, alpha_target, delta, mode='beta', m=None)[source]

Perform Mondrian (per-class) conformal calibration with SSBC correction.

For each class, compute:

  1. Nonconformity scores: s(x, y) = 1 - P(y|x)
  2. SSBC-corrected alpha for PAC guarantee
  3. Conformal quantile threshold
  4. Singleton error rate bounds via PAC guarantee

Then evaluate prediction set sizes on calibration data PER CLASS and MARGINALLY.
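Steps 1 and 3 for a single class can be sketched as follows. The quantile rule shown is the usual split-conformal choice (the k-th smallest score with k = n + 1 - u for a corrected miscoverage of u/(n+1)); whether the library uses exactly this indexing is an assumption, and both function names are hypothetical.

```python
import math


def nonconformity_scores(probs_true_class):
    """s(x, y) = 1 - P(y | x), one score per calibration sample of the class."""
    return [1.0 - p for p in probs_true_class]


def conformal_threshold(scores, u: int) -> float:
    """Score threshold for corrected miscoverage alpha' = u / (n + 1).

    Assumption: standard split-conformal rule, i.e. the k-th smallest score
    with k = n + 1 - u.
    """
    n = len(scores)
    if not 0 <= u <= n:
        raise ValueError("u must lie in [0, n]")
    k = n + 1 - u
    if k > n:  # u == 0: no finite threshold achieves the target
        return math.inf
    return sorted(scores)[k - 1]


# Nine class-0 calibration samples; P(class=0 | x) for each.
p_true = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1]
scores = nonconformity_scores(p_true)
print(conformal_threshold(scores, u=2))  # 8th smallest of 9 scores
```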

Parameters:
  • class_data (dict) – Output from split_by_class()

  • alpha_target (float or dict) – Target miscoverage rate for each class. If float: same for both classes. If dict: {0: α0, 1: α1} for per-class control.

  • delta (float or dict) – PAC risk tolerance for each class. If float: same for both classes. If dict: {0: δ0, 1: δ1} for per-class control.

  • mode (str, default="beta") – “beta” (infinite test) or “beta-binomial” (finite test)

  • m (int, optional) – Test window size for beta-binomial mode

Returns:

  • calibration_result (dict) – Dictionary with keys 0 and 1, each containing calibration info

  • prediction_stats (dict) – Dictionary with keys:
      - 0, 1: per-class statistics (conditioned on true label)
      - 'marginal': overall statistics (ignoring true labels)

Return type:

tuple[dict[int, dict[str, Any]], dict[Any, Any]]

Examples

>>> labels = np.array([0, 1, 0, 1])
>>> probs = np.array([[0.8, 0.2], [0.3, 0.7], [0.9, 0.1], [0.2, 0.8]])
>>> class_data = split_by_class(labels, probs)
>>> cal_result, pred_stats = mondrian_conformal_calibrate(
...     class_data, alpha_target=0.1, delta=0.1
... )
ssbc.split_by_class(labels, probs)[source]

Split calibration data by true class for Mondrian conformal prediction.

Parameters:
  • labels (np.ndarray, shape (n,)) – True binary labels (0 or 1)

  • probs (np.ndarray, shape (n, 2)) – Classification probabilities [P(class=0), P(class=1)]

Returns:

Dictionary with keys 0 and 1, each containing:

  • 'labels': labels for this class (all same value)
  • 'probs': probabilities for samples in this class
  • 'indices': original indices (for tracking)
  • 'n': number of samples in this class

Return type:

dict

Examples

>>> labels = np.array([0, 1, 0, 1])
>>> probs = np.array([[0.8, 0.2], [0.3, 0.7], [0.9, 0.1], [0.2, 0.8]])
>>> class_data = split_by_class(labels, probs)
>>> print(class_data[0]['n'])  # Number of class 0 samples
2
ssbc.clopper_pearson_intervals(labels, confidence=0.95)[source]

Compute Clopper-Pearson (exact binomial) confidence intervals for class prevalences.

Parameters:
  • labels (np.ndarray) – Binary labels (0 or 1)

  • confidence (float, default=0.95) – Confidence level (e.g., 0.95 for 95% CI)

Returns:

Dictionary with keys 0 and 1, each containing:

  • 'count': number of samples in this class
  • 'proportion': observed proportion
  • 'lower': lower bound of CI
  • 'upper': upper bound of CI

Return type:

dict

Examples

>>> labels = np.array([0, 0, 1, 1, 0])
>>> intervals = clopper_pearson_intervals(labels, confidence=0.95)
>>> print(intervals[0]['proportion'])
0.6

Notes

The Clopper-Pearson interval is an exact binomial confidence interval based on Beta distribution quantiles. It provides conservative coverage guarantees.

ssbc.clopper_pearson_lower(k, n, confidence=0.95)[source]

Compute lower Clopper-Pearson (one-sided) confidence bound.

Parameters:
  • k (int) – Number of successes

  • n (int) – Total number of trials

  • confidence (float, default=0.95) – Confidence level (e.g., 0.95 for 95% confidence)

Returns:

Lower confidence bound for the true proportion

Return type:

float

Examples

>>> lower = clopper_pearson_lower(k=5, n=10, confidence=0.95)
>>> print(f"Lower bound: {lower:.3f}")

Notes

Uses Beta distribution quantiles for exact binomial confidence bounds. For PAC-style guarantees, you may want to use delta = 1 - confidence.

ssbc.clopper_pearson_upper(k, n, confidence=0.95)[source]

Compute upper Clopper-Pearson (one-sided) confidence bound.

Parameters:
  • k (int) – Number of successes

  • n (int) – Total number of trials

  • confidence (float, default=0.95) – Confidence level (e.g., 0.95 for 95% confidence)

Returns:

Upper confidence bound for the true proportion

Return type:

float

Examples

>>> upper = clopper_pearson_upper(k=5, n=10, confidence=0.95)
>>> print(f"Upper bound: {upper:.3f}")

Notes

Uses Beta distribution quantiles for exact binomial confidence bounds. For PAC-style guarantees, you may want to use delta = 1 - confidence.
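The one-sided bounds can be reproduced with the standard library by inverting binomial tail probabilities, which is equivalent to taking Beta quantiles. This is a sketch of the textbook Clopper-Pearson definition; whether the library uses exactly this tail convention is an assumption, and the helper names are hypothetical.

```python
from math import comb


def binom_cdf(k: int, n: int, p: float) -> float:
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, j) * p**j * (1.0 - p) ** (n - j) for j in range(k + 1))


def cp_lower_sketch(k: int, n: int, confidence: float = 0.95) -> float:
    """Largest p_L with P(X >= k | p_L) <= 1 - confidence, found by bisection."""
    if k == 0:
        return 0.0
    alpha = 1.0 - confidence
    lo, hi = 0.0, 1.0
    for _ in range(100):
        mid = (lo + hi) / 2.0
        if 1.0 - binom_cdf(k - 1, n, mid) < alpha:  # upper tail too small: raise p
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


def cp_upper_sketch(k: int, n: int, confidence: float = 0.95) -> float:
    """Smallest p_U with P(X <= k | p_U) <= 1 - confidence, found by bisection."""
    if k == n:
        return 1.0
    alpha = 1.0 - confidence
    lo, hi = 0.0, 1.0
    for _ in range(100):
        mid = (lo + hi) / 2.0
        if binom_cdf(k, n, mid) > alpha:  # lower tail too large: raise p
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


print(f"{cp_lower_sketch(5, 10):.3f}, {cp_upper_sketch(5, 10):.3f}")
```

For k = 0 the upper bound has the closed form 1 - (1 - confidence)^(1/n), which gives a quick sanity check on the bisection.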

ssbc.prediction_bounds(k_cal, n_cal, n_test, confidence=0.95, method='simple')[source]

Compute prediction bounds accounting for both calibration and test set sampling uncertainty.

This function provides two methods for computing prediction bounds:

  1. “simple”: Uses standard error formula (faster, good for large samples)
  2. “beta_binomial”: Uses Beta-Binomial distribution (more accurate for small samples)

Parameters:
  • k_cal (int) – Number of successes in calibration data for a single well-defined Bernoulli event. Must be the count of a binary indicator (e.g., Z_i = 1{event}) across all n_cal trials.

  • n_cal (int) – Total number of trials in calibration data for the same Bernoulli event. This is the fixed denominator (total sample size or conditional subpopulation size).

  • n_test (int) – Expected number of future trials for the same Bernoulli event. For joint rates, this is the planned test size (fixed). For conditional rates, this is an estimated future conditional subpopulation size.

  • confidence (float, default=0.95) – Confidence level (e.g., 0.95 for 95% prediction bounds)

  • method (str, default="simple") – Method to use: “simple” or “beta_binomial”

Returns:

(lower_bound, upper_bound) for operational rates on future test sets

Return type:

tuple[float, float]

Examples

>>> # Simple method (default)
>>> lower, upper = prediction_bounds(k_cal=50, n_cal=100, n_test=1000, confidence=0.95)
>>> print(f"Simple bounds: [{lower:.3f}, {upper:.3f}]")
>>> # Beta-Binomial method (more accurate for small samples)
>>> lower, upper = prediction_bounds(k_cal=50, n_cal=100, n_test=1000, confidence=0.95, method="beta_binomial")
>>> print(f"Beta-Binomial bounds: [{lower:.3f}, {upper:.3f}]")

Notes

The prediction bounds account for both:

  1. Calibration uncertainty: uncertainty in the true rate p from calibration data
  2. Test set sampling uncertainty: variability when sampling n_test points from the true distribution

Simple method (default):

  • Mathematical formula: SE = sqrt(p̂(1-p̂) * (1/n_cal + 1/n_test))
  • Good for large sample sizes
  • Faster computation
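The standard-error formula of the simple method can be turned into bounds as in the sketch below. The formula for SE comes from the notes above; using a two-sided normal quantile and clipping to [0, 1] are assumptions about how the interval is formed, and simple_prediction_bounds is a hypothetical name.

```python
from math import sqrt
from statistics import NormalDist


def simple_prediction_bounds(k_cal: int, n_cal: int, n_test: int, confidence: float = 0.95):
    """Normal-approximation prediction bounds for a future rate.

    SE = sqrt(p_hat * (1 - p_hat) * (1/n_cal + 1/n_test)); the interval is
    p_hat +/- z * SE with a two-sided normal quantile z (assumption).
    """
    p_hat = k_cal / n_cal
    se = sqrt(p_hat * (1.0 - p_hat) * (1.0 / n_cal + 1.0 / n_test))
    z = NormalDist().inv_cdf(1.0 - (1.0 - confidence) / 2.0)
    return max(0.0, p_hat - z * se), min(1.0, p_hat + z * se)


lower, upper = simple_prediction_bounds(k_cal=50, n_cal=100, n_test=1000)
print(f"[{lower:.3f}, {upper:.3f}]")
```

The 1/n_test term is what separates a prediction bound from a plain confidence interval: as n_test grows it vanishes and only the calibration uncertainty remains.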

Beta-Binomial method:

  • Uses Beta-Binomial distribution for exact finite-sample modeling
  • More accurate for small sample sizes
  • Slower computation
  • Uses uniform prior Beta(1,1) and equal-tailed intervals by default
  • For advanced use (Jeffreys prior or HPD intervals), call prediction_bounds_beta_binomial() directly
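A dependency-free sketch of the Beta-Binomial idea follows. It assumes the posterior after a uniform Beta(1,1) prior is Beta(k_cal + 1, n_cal - k_cal + 1) and that the equal-tailed interval is read off the predictive CDF of the future count; the exact quantile convention in the library may differ, and the function names are hypothetical.

```python
from math import exp, lgamma


def _log_beta(a: float, b: float) -> float:
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def beta_binomial_pmf(j: int, m: int, a: float, b: float) -> float:
    """P(K = j) for K ~ BetaBinomial(m, a, b), computed in log space."""
    log_comb = lgamma(m + 1) - lgamma(j + 1) - lgamma(m - j + 1)
    return exp(log_comb + _log_beta(j + a, m - j + b) - _log_beta(a, b))


def beta_binomial_bounds(k_cal: int, n_cal: int, n_test: int, confidence: float = 0.95):
    """Equal-tailed predictive interval for the rate K / n_test."""
    a = k_cal + 1.0            # uniform prior Beta(1, 1) plus observed successes
    b = n_cal - k_cal + 1.0    # ... plus observed failures
    tail = (1.0 - confidence) / 2.0
    lower, upper = None, n_test  # upper default guards against rounding in the CDF
    cdf = 0.0
    for j in range(n_test + 1):
        cdf += beta_binomial_pmf(j, n_test, a, b)
        if lower is None and cdf >= tail:
            lower = j
        if cdf >= 1.0 - tail:
            upper = j
            break
    return (lower if lower is not None else 0) / n_test, upper / n_test


lower, upper = beta_binomial_bounds(k_cal=50, n_cal=100, n_test=1000)
print(f"[{lower:.3f}, {upper:.3f}]")
```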

For large n_test, bounds approach calibration-only bounds. For small n_test, bounds are wider due to additional test set sampling uncertainty.

This is the recommended function for computing operational rate bounds when applying fixed thresholds to future test sets.

ssbc.compute_robust_prediction_bounds(loo_predictions, n_test, alpha=0.05, method='auto', inflation_factor=None, verbose=True)[source]

Main function: Compute robust prediction bounds for small-sample LOO-CV.

This is the primary entry point. It intelligently selects methods based on sample size and provides comprehensive diagnostics.

Parameters:

  • loo_predictions (np.ndarray, shape (n_cal,)) – Binary LOO predictions (1=singleton/success, 0=not/failure)

  • n_test (int) – Expected size of future test sets

  • alpha (float, default=0.05) – Significance level (e.g., 0.05 for 95% confidence)

  • method (str, default='auto') – Method to use:
      - 'auto': Automatically select best method (recommended)
      - 'analytical': Method 1, analytical with LOO correction
      - 'exact': Method 2, exact binomial with effective n
      - 'hoeffding': Method 3, distribution-free bound
      - 'all': Compute all three and report

  • inflation_factor (float, optional) – Manual override for LOO variance inflation factor. If None, automatically estimated. Typical values: 1.0 (no inflation), 2.0 (standard LOO), 1.5-2.5 (empirical range)

  • verbose (bool, default=True) – If True, print diagnostic information about method selection and inflation factors.

Returns:

  • L_prime (float) – Lower prediction bound

  • U_prime (float) – Upper prediction bound

  • report (dict) – Comprehensive diagnostics and method comparison

Examples

>>> # Basic usage (auto-selects best method)
>>> L, U, report = compute_robust_prediction_bounds(loo_preds, n_test=50)
>>> # Force conservative method
>>> L, U, report = compute_robust_prediction_bounds(
...     loo_preds, n_test=50, method='exact'
... )
>>> # Compare all methods
>>> L, U, report = compute_robust_prediction_bounds(
...     loo_preds, n_test=50, method='all'
... )
>>> print(report['comparison_table'])

Return type:

tuple[float, float, dict]

ssbc.format_prediction_bounds_report(rate_name, loo_predictions, n_test, alpha=0.05, include_all_methods=True)[source]

Generate a formatted text report of prediction bounds.

This produces human-readable output suitable for inclusion in rigorous analysis reports.

Parameters:

  • rate_name (str) – Name of the rate (e.g., 'Singleton Rate', 'Doublet Rate')

  • loo_predictions (np.ndarray) – Binary LOO predictions

  • n_test (int) – Test set size

  • alpha (float, default=0.05) – Significance level

  • include_all_methods (bool, default=True) – If True, compare all three methods in report

Returns:

Formatted text report

Return type:

str

ssbc.cp_interval(count, total, confidence=0.95)[source]

Compute Clopper-Pearson exact confidence interval.

Helper function for computing a single CI from count and total.

Parameters:
  • count (int) – Number of successes

  • total (int) – Total number of trials

  • confidence (float, default=0.95) – Confidence level

Returns:

Dictionary with keys:

  • 'count': original count
  • 'proportion': count/total
  • 'lower': lower CI bound
  • 'upper': upper CI bound

Return type:

dict

ssbc.compute_operational_rate(prediction_sets, true_labels, rate_type)[source]

Compute operational rate indicators for prediction sets.

For each prediction set, compute a binary indicator showing whether a specific operational event occurred (singleton, doublet, abstention, error in singleton, or correct in singleton).

Parameters:
  • prediction_sets (list[set | list]) – Prediction sets for each sample. Each set contains predicted labels.

  • true_labels (np.ndarray) – True labels for each sample

  • rate_type ({"singleton", "doublet", "abstention", "error_in_singleton", "correct_in_singleton"}) – Type of operational rate to compute:
      - “singleton”: prediction set contains exactly one label
      - “doublet”: prediction set contains exactly two labels
      - “abstention”: prediction set is empty
      - “error_in_singleton”: singleton prediction that doesn’t contain true label
      - “correct_in_singleton”: singleton prediction that contains true label

Returns:

Binary indicators (0 or 1) for whether the event holds for each sample

Return type:

np.ndarray

Examples

>>> pred_sets = [{0}, {0, 1}, set(), {1}]
>>> true_labels = np.array([0, 0, 1, 0])
>>> indicators = compute_operational_rate(pred_sets, true_labels, "singleton")
>>> print(indicators)  # [1, 0, 0, 1]
>>> indicators = compute_operational_rate(pred_sets, true_labels, "correct_in_singleton")
>>> print(indicators)  # [1, 0, 0, 0] - first and last are singletons, first is correct

Notes

This function is useful for computing operational statistics on conformal prediction sets, such as singleton rates, escalation rates, and error rates.
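The indicator logic described above is simple enough to sketch in pure Python. The function below is a hypothetical stand-in that returns a plain list rather than an np.ndarray; it reuses the inputs from the example in this entry.

```python
def operational_indicators(prediction_sets, true_labels, rate_type):
    """Return a 0/1 indicator per sample for the requested operational event."""
    indicators = []
    for pred, y in zip(prediction_sets, true_labels):
        pred = set(pred)  # accept lists as well as sets
        size = len(pred)
        if rate_type == "singleton":
            hit = size == 1
        elif rate_type == "doublet":
            hit = size == 2
        elif rate_type == "abstention":
            hit = size == 0
        elif rate_type == "error_in_singleton":
            hit = size == 1 and y not in pred
        elif rate_type == "correct_in_singleton":
            hit = size == 1 and y in pred
        else:
            raise ValueError(f"unknown rate_type: {rate_type!r}")
        indicators.append(int(hit))
    return indicators


pred_sets = [{0}, {0, 1}, set(), {1}]
true_labels = [0, 0, 1, 0]
print(operational_indicators(pred_sets, true_labels, "singleton"))             # [1, 0, 0, 1]
print(operational_indicators(pred_sets, true_labels, "correct_in_singleton"))  # [1, 0, 0, 0]
```

Averaging the indicators gives the corresponding empirical rate, and their sum and length are the k and n that the Clopper-Pearson helpers expect.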

ssbc.evaluate_test_dataset(test_labels, test_probs, threshold_0, threshold_1)[source]

Evaluate a test dataset and compute empirical operational rates.

This function takes a test dataset with true labels and probability predictions, applies Mondrian conformal prediction thresholds, and returns comprehensive empirical rates for both marginal and per-class statistics.

Parameters:
  • test_labels (np.ndarray) – True labels for test samples (0 or 1)

  • test_probs (np.ndarray) – Probability predictions for test samples, shape (n_samples, 2) test_probs[i, 0] = P(class=0), test_probs[i, 1] = P(class=1)

  • threshold_0 (float) – Conformal prediction threshold for class 0

  • threshold_1 (float) – Conformal prediction threshold for class 1

Returns:

Dictionary containing empirical rates with structure:

  • 'marginal': Marginal rates across all samples
  • 'class_0': Rates for class 0 samples only
  • 'class_1': Rates for class 1 samples only

Each containing:

  • 'singleton_rate': Fraction of samples with singleton predictions
  • 'doublet_rate': Fraction of samples with doublet predictions
  • 'abstention_rate': Fraction of samples with abstention (empty set)
  • 'singleton_error_rate': Fraction of singleton predictions that are incorrect
  • 'n_samples': Number of samples in this group
  • 'n_singletons': Number of singleton predictions
  • 'n_doublets': Number of doublet predictions
  • 'n_abstentions': Number of abstentions

Return type:

dict

Examples

>>> import numpy as np
>>> from ssbc import evaluate_test_dataset
>>>
>>> # Generate test data
>>> test_labels = np.array([0, 0, 1, 1, 0])
>>> test_probs = np.array([
...     [0.8, 0.2],  # High confidence class 0
...     [0.6, 0.4],  # Medium confidence class 0
...     [0.3, 0.7],  # High confidence class 1
...     [0.4, 0.6],  # Medium confidence class 1
...     [0.5, 0.5],  # Uncertain
... ])
>>>
>>> # Evaluate with thresholds
>>> results = evaluate_test_dataset(test_labels, test_probs, 0.3, 0.3)
>>> print(f"Marginal singleton rate: {results['marginal']['singleton_rate']:.3f}")
>>> print(f"Class 0 singleton rate: {results['class_0']['singleton_rate']:.3f}")

Notes

This function is useful for:

  • Evaluating conformal prediction performance on test data
  • Comparing empirical rates to theoretical bounds
  • Computing operational statistics for reporting
  • Validating that thresholds work as expected

The function builds prediction sets using the Mondrian approach:

  • For each sample, include class 0 if score_0 <= threshold_0
  • For each sample, include class 1 if score_1 <= threshold_1
  • Where score_k = 1 - P(class=k)
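The set-building rule can be sketched directly. The two helpers below are hypothetical and cover only the marginal rates; thresholds of 0.5 are chosen for the illustration so that the uncertain sample (0.5, 0.5) yields a doublet.

```python
def build_prediction_sets(test_probs, threshold_0, threshold_1):
    """Include class k whenever score_k = 1 - P(class=k) <= threshold_k."""
    thresholds = (threshold_0, threshold_1)
    return [{k for k in (0, 1) if 1.0 - row[k] <= thresholds[k]} for row in test_probs]


def marginal_rates(pred_sets, labels):
    """Empirical marginal rates over all samples."""
    n = len(pred_sets)
    singletons = [(s, y) for s, y in zip(pred_sets, labels) if len(s) == 1]
    errors = sum(1 for s, y in singletons if y not in s)
    return {
        "singleton_rate": len(singletons) / n,
        "doublet_rate": sum(1 for s in pred_sets if len(s) == 2) / n,
        "abstention_rate": sum(1 for s in pred_sets if len(s) == 0) / n,
        # Undefined when there are no singletons at all.
        "singleton_error_rate": errors / len(singletons) if singletons else float("nan"),
    }


test_labels = [0, 0, 1, 1, 0]
test_probs = [[0.8, 0.2], [0.6, 0.4], [0.3, 0.7], [0.4, 0.6], [0.5, 0.5]]
sets = build_prediction_sets(test_probs, 0.5, 0.5)
print(sets)
print(marginal_rates(sets, test_labels))
```

Because scores are computed as 1 - P in floating point, a threshold that sits exactly on a score (for example 0.3 against 1 - 0.7) can fall on either side of the comparison; thresholds taken from calibration scores computed the same way avoid that mismatch.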

class ssbc.BinaryClassifierSimulator(p_class1, beta_params_class0, beta_params_class1, seed=None)[source]

Bases: object

Simulate binary classification data with probabilities from Beta distributions.

This simulator generates realistic classification scenarios where the predicted probabilities for each class follow Beta distributions. Useful for testing and benchmarking conformal prediction methods.

Parameters:
  • p_class1 (float) – Probability of drawing class 1 (class imbalance parameter). Must be in [0, 1].

  • beta_params_class0 (tuple of (a, b)) – Beta distribution parameters for p(class=1) when true label is 0. Typically use parameters that give low probabilities (e.g., (2, 8)).

  • beta_params_class1 (tuple of (a, b)) – Beta distribution parameters for p(class=1) when true label is 1. Typically use parameters that give high probabilities (e.g., (8, 2)).

  • seed (int, optional) – Random seed for reproducibility

p_class1

Probability of class 1

Type:

float

p_class0

Probability of class 0 (= 1 - p_class1)

Type:

float

a0, b0

Beta parameters for class 0

Type:

float

a1, b1

Beta parameters for class 1

Type:

float

rng

Random number generator

Type:

numpy.random.Generator

Examples

>>> # Simulate imbalanced data: 10% positive class
>>> # Class 0: Beta(2, 8) → mean p(class=1) = 0.2 (low scores, correct)
>>> # Class 1: Beta(8, 2) → mean p(class=1) = 0.8 (high scores, correct)
>>> sim = BinaryClassifierSimulator(
...     p_class1=0.10,
...     beta_params_class0=(2, 8),
...     beta_params_class1=(8, 2),
...     seed=42
... )
>>> labels, probs = sim.generate(n_samples=100)
>>> print(labels.shape)
(100,)
>>> print(probs.shape)
(100, 2)

Notes

The Beta distribution parameters (a, b) control the shape:

  • Mean = a / (a + b)
  • For a classifier that works well:
      - Class 0 should have low p(class=1): use (a, b) with a < b
      - Class 1 should have high p(class=1): use (a, b) with a > b
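The generative process described above can be sketched with the standard library's random module. This is an illustration, not the class itself: the real simulator uses a numpy.random.Generator, so the same seed will not produce the same draws, and simulate_binary_classifier is a hypothetical name.

```python
import random


def simulate_binary_classifier(n_samples, p_class1, beta_params_class0,
                               beta_params_class1, seed=None):
    """Draw labels, then p(class=1) from the Beta distribution of the true class."""
    rng = random.Random(seed)
    labels, probs = [], []
    for _ in range(n_samples):
        y = 1 if rng.random() < p_class1 else 0
        a, b = beta_params_class1 if y == 1 else beta_params_class0
        p1 = rng.betavariate(a, b)      # p(class=1) for this sample
        labels.append(y)
        probs.append((1.0 - p1, p1))    # [p(class=0), p(class=1)]
    return labels, probs


labels, probs = simulate_binary_classifier(20000, 0.10, (2, 8), (8, 2), seed=42)
mean_p1_class0 = sum(p[1] for y, p in zip(labels, probs) if y == 0) / labels.count(0)
print(round(mean_p1_class0, 2))  # close to the Beta(2, 8) mean a / (a + b) = 0.2
```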

__init__(p_class1, beta_params_class0, beta_params_class1, seed=None)[source]

Initialize the binary classifier simulator.

generate(n_samples)[source]

Generate n_samples of (label, p(class=0), p(class=1)).

Parameters:

n_samples (int) – Number of samples to generate

Returns:

  • labels (np.ndarray, shape (n_samples,)) – True binary labels (0 or 1)

  • probs (np.ndarray, shape (n_samples, 2)) – Classification probabilities [p(class=0), p(class=1)] Each row sums to 1.0

Return type:

tuple[ndarray, ndarray]

Examples

>>> sim = BinaryClassifierSimulator(
...     p_class1=0.5,
...     beta_params_class0=(2, 8),
...     beta_params_class1=(8, 2),
...     seed=42
... )
>>> labels, probs = sim.generate(n_samples=5)
>>> print(f"Generated {len(labels)} samples")
Generated 5 samples
>>> print(f"Class balance: {np.bincount(labels)}")
Class balance: [2 3]
__repr__()[source]

String representation of the simulator.

Return type:

str

ssbc.report_prediction_stats(prediction_stats, calibration_result, operational_bounds_per_class=None, marginal_operational_bounds=None, verbose=True)[source]

Report rigorous statistics for Mondrian conformal prediction with valid CIs.

Only displays statistics with valid confidence intervals:

  • Per-class statistics from calibration data (valid within class)
  • Per-class operational bounds from cross-validation (rigorous PAC bounds)
  • Marginal operational bounds from cross-validated Mondrian (rigorous PAC bounds)

Does NOT display marginal statistics from calibration data (invalid CIs for Mondrian).

Parameters:
  • prediction_stats (dict) – Output from mondrian_conformal_calibrate (second return value)

  • calibration_result (dict) – Output from mondrian_conformal_calibrate (first return value)

  • operational_bounds_per_class (dict[int, OperationalRateBoundsResult], optional) – Per-class operational bounds (from generate_rigorous_pac_report)

  • marginal_operational_bounds (OperationalRateBoundsResult, optional) – Marginal operational bounds (from generate_rigorous_pac_report)

  • verbose (bool, default=True) – If True, print detailed statistics to stdout

Returns:

Structured summary with valid CIs:

  • Keys 0, 1 for per-class statistics
  • Key 'marginal_bounds' if marginal_operational_bounds provided

Return type:

dict

Examples

>>> # Get operational bounds from rigorous PAC report
>>> from ssbc import generate_rigorous_pac_report
>>> report = generate_rigorous_pac_report(labels, probs, alpha_target=0.10, delta=0.10)
>>> cal_result = report['calibration_result']
>>> pred_stats = report['prediction_stats']
>>> op_bounds = report['pac_bounds_class_0']  # Per-class bounds
>>> marginal = report['pac_bounds_marginal']  # Marginal bounds
>>> summary = report_prediction_stats(pred_stats, cal_result, op_bounds, marginal)
ssbc.plot_parallel_coordinates_plotly(df, columns=None, color='err_all', color_continuous_scale=None, title='Mondrian sweep interactive parallel coordinates', height=600, base_opacity=0.9, unselected_opacity=0.06)[source]

Create interactive parallel coordinates plot for hyperparameter sweep results.

Parameters:
  • df (pd.DataFrame) – DataFrame with hyperparameter sweep results

  • columns (list of str, optional) – Columns to display in parallel coordinates. Default: ['a0', 'd0', 'a1', 'd1', 'cov', 'sing_rate', 'err_all', 'err_pred0', 'err_pred1', 'err_y0', 'err_y1', 'esc_rate']

  • color (str, default='err_all') – Column to use for coloring lines

  • color_continuous_scale (plotly colorscale, optional) – Color scale for the lines

  • title (str, default="Mondrian sweep – interactive parallel coordinates") – Plot title

  • height (int, default=600) – Plot height in pixels

  • base_opacity (float, default=0.9) – Opacity of selected lines

  • unselected_opacity (float, default=0.06) – Opacity of unselected lines (creates contrast)

Returns:

Interactive plotly figure

Return type:

plotly.graph_objects.Figure

Examples

>>> import pandas as pd
>>> df = sweep_hyperparams_and_collect(...)
>>> fig = plot_parallel_coordinates_plotly(df, color='err_all')
>>> fig.show()  # In notebook
>>> # Or save: fig.write_html("sweep_results.html")
ssbc.bootstrap_calibration_uncertainty(labels, probs, simulator, alpha_target=0.1, delta=0.1, test_size=1000, n_bootstrap=1000, n_jobs=-1, seed=None)[source]

Bootstrap analysis of calibration uncertainty.

For each bootstrap iteration:

  1. Resample calibration data with replacement
  2. Calibrate (compute SSBC thresholds)
  3. Evaluate on fresh independent test set
  4. Record operational rates

This models: “If I recalibrate on similar datasets, how do rates vary?”
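The four-step loop can be sketched generically. In the sketch below the calibrate, draw_test_set, and evaluate callables are hypothetical placeholders for the SSBC calibration, the simulator, and the rate computation; the quantile helper is a crude order-statistic lookup, and the loop runs serially rather than in parallel.

```python
import random
from statistics import mean, stdev


def bootstrap_rate_distribution(cal_data, calibrate, draw_test_set, evaluate,
                                n_bootstrap=200, seed=None):
    """Bootstrap one operational rate; all three callables are user-supplied."""
    rng = random.Random(seed)
    n = len(cal_data)
    samples = []
    for _ in range(n_bootstrap):
        resample = [cal_data[rng.randrange(n)] for _ in range(n)]  # 1. resample
        thresholds = calibrate(resample)                            # 2. calibrate
        test_set = draw_test_set(rng)                               # 3. fresh test set
        samples.append(evaluate(thresholds, test_set))              # 4. record rate
    ordered = sorted(samples)

    def quantile(f):
        return ordered[min(len(ordered) - 1, int(f * len(ordered)))]

    return {
        "samples": samples,
        "mean": mean(samples),
        "std": stdev(samples),
        "quantiles": {q: quantile(f) for q, f in
                      [("q05", 0.05), ("q25", 0.25), ("q50", 0.50),
                       ("q75", 0.75), ("q95", 0.95)]},
    }


# Toy stand-ins: scores on a grid, threshold = empirical 90th percentile,
# test scores uniform on [0, 1], rate = fraction of test scores under threshold.
result = bootstrap_rate_distribution(
    cal_data=[i / 100 for i in range(100)],
    calibrate=lambda d: sorted(d)[int(0.9 * len(d))],
    draw_test_set=lambda rng: [rng.random() for _ in range(200)],
    evaluate=lambda thr, test: sum(s <= thr for s in test) / len(test),
    seed=0,
)
print(round(result["mean"], 3), round(result["std"], 3))
```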

Parameters:
  • labels (np.ndarray) – Calibration labels

  • probs (np.ndarray) – Calibration probabilities

  • simulator (DataGenerator) – Simulator to generate independent test sets

  • alpha_target (float, default=0.10) – Target miscoverage

  • delta (float, default=0.10) – PAC risk

  • test_size (int, default=1000) – Size of test sets for evaluation

  • n_bootstrap (int, default=1000) – Number of bootstrap iterations

  • n_jobs (int, default=-1) – Parallel jobs (-1 for all cores)

  • seed (int, optional) – Random seed

Returns:

Bootstrap distributions with keys:

  • 'marginal': dict with 'singleton', 'doublet', 'abstention', 'singleton_error'
  • 'class_0': dict with same metrics
  • 'class_1': dict with same metrics

Each metric contains:

  • 'samples': array of rates across bootstrap trials
  • 'mean': mean rate
  • 'std': standard deviation
  • 'quantiles': dict with q05, q25, q50, q75, q95

Return type:

dict

Examples

>>> from ssbc import BinaryClassifierSimulator, bootstrap_calibration_uncertainty
>>> sim = BinaryClassifierSimulator(p_class1=0.2, beta_params_class0=(1,7), beta_params_class1=(5,2))
>>> labels, probs = sim.generate(100)
>>> results = bootstrap_calibration_uncertainty(labels, probs, sim, n_bootstrap=100)
>>> print(results['marginal']['singleton']['mean'])
ssbc.plot_bootstrap_distributions(bootstrap_results, figsize=(16, 12), save_path=None)[source]

Plot bootstrap distributions.

Parameters:
  • bootstrap_results (dict) – Results from bootstrap_calibration_uncertainty()

  • figsize (tuple, default=(16, 12)) – Figure size

  • save_path (str, optional) – Path to save figure. If None, displays interactively.

Raises:

ImportError – If matplotlib is not installed

Return type:

None

Examples

>>> from ssbc import bootstrap_calibration_uncertainty, plot_bootstrap_distributions
>>> results = bootstrap_calibration_uncertainty(...)
>>> plot_bootstrap_distributions(results, save_path='bootstrap_results.png')
ssbc.cross_conformal_validation(labels, probs, alpha_target=0.1, delta=0.1, n_folds=5, stratify=True, seed=None)[source]

K-fold cross-conformal validation for Mondrian conformal prediction.

Estimates the variability of operational rates (abstentions, singletons, doublets) due to finite calibration sample effects by splitting data into K folds.

For each fold:

  1. Train: Compute SSBC-corrected thresholds on K-1 folds
  2. Test: Evaluate operational rates on held-out fold
  3. Record: Store rates for this fold

Aggregate rates across folds to quantify finite-sample variability.
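The fold loop can be sketched as below. The stratified split deals each class's shuffled indices round-robin across folds, which is one simple way to stratify and not necessarily the library's; calibrate and evaluate are hypothetical callables that receive index lists.

```python
import random
from statistics import mean, stdev


def stratified_folds(labels, n_folds, seed=None):
    """Assign indices to folds so each fold keeps roughly the class proportions."""
    rng = random.Random(seed)
    folds = [[] for _ in range(n_folds)]
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        for pos, i in enumerate(idx):
            folds[pos % n_folds].append(i)  # round-robin within the class
    return folds


def cross_conformal_rates(labels, n_folds, calibrate, evaluate, seed=None):
    """Calibrate on K-1 folds, evaluate on the held-out fold, then aggregate."""
    rates = []
    for held_out in stratified_folds(labels, n_folds, seed):
        held = set(held_out)
        train = [i for i in range(len(labels)) if i not in held]
        thresholds = calibrate(train)                  # 1. train
        rates.append(evaluate(thresholds, held_out))   # 2. test, 3. record
    return {"fold_rates": rates, "mean": mean(rates), "std": stdev(rates)}


labels = [0] * 20 + [1] * 10
folds = stratified_folds(labels, n_folds=5, seed=0)
print([sum(labels[i] for i in fold) for fold in folds])  # class-1 count per fold
```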

Parameters:
  • labels (np.ndarray, shape (n,)) – Calibration labels (0 or 1)

  • probs (np.ndarray, shape (n, 2)) – Calibration probabilities [P(class=0), P(class=1)]

  • alpha_target (float, default=0.10) – Target miscoverage rate

  • delta (float, default=0.10) – PAC risk for SSBC correction

  • n_folds (int, default=5) – Number of folds (K)

  • stratify (bool, default=True) – Stratify folds by class labels

  • seed (int, optional) – Random seed for reproducibility

Returns:

Cross-conformal results with keys:

  • ‘fold_rates’: List of rate dicts for each fold

  • ‘marginal’: Statistics for marginal rates

  • ‘class_0’: Statistics for class 0 rates

  • ‘class_1’: Statistics for class 1 rates

Each statistics dict contains:

  • ‘samples’: Array of rates across folds

  • ‘mean’: Mean rate

  • ‘std’: Standard deviation

  • ‘quantiles’: Dict with q05, q25, q50, q75, q95

  • ‘ci_95’: 95% Clopper-Pearson CI (if applicable)

Return type:

dict

Examples

>>> from ssbc import cross_conformal_validation
>>> results = cross_conformal_validation(labels, probs, n_folds=10)
>>> m = results['marginal']['singleton']
>>> print(f"Singleton rate: {m['mean']:.3f} ± {m['std']:.3f}")
>>> print(f"95% range: [{m['quantiles']['q05']:.3f}, {m['quantiles']['q95']:.3f}]")

Notes

Different from other methods:

  • LOO-CV: Leave-one-out, aggregates counts (not fold-level rates)

  • Bootstrap: Resamples with replacement, tests on fresh data

  • Cross-conformal: K-fold split, estimates rate distribution from calibration

This method directly estimates the variability of rates due to finite calibration samples, without requiring a data simulator.
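The K-fold procedure can be sketched in plain Python as follows. This is only an illustration of stratified fold assignment and fold-level aggregation: `stratified_folds` and `aggregate` are hypothetical helpers, the round-robin assignment is an assumption, and the calibration and scoring steps are left as a comment rather than reproduced.

```python
import random
import statistics


def stratified_folds(labels, n_folds, seed=None):
    """Assign each index to a fold so that every class is spread evenly."""
    rng = random.Random(seed)
    folds = [[] for _ in range(n_folds)]
    for cls in sorted(set(labels)):
        indices = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(indices)
        # Deal this class's indices round-robin across the folds
        for position, index in enumerate(indices):
            folds[position % n_folds].append(index)
    return folds


def aggregate(fold_rates):
    """Mean and sample standard deviation of one rate across folds."""
    return {"mean": statistics.fmean(fold_rates), "std": statistics.stdev(fold_rates)}


labels = [0] * 80 + [1] * 20
folds = stratified_folds(labels, n_folds=5, seed=0)
for held_out in folds:
    train = [i for fold in folds if fold is not held_out for i in fold]
    # A real run would calibrate thresholds on `train` and score `held_out` here.
    assert len(train) + len(held_out) == len(labels)

# Hypothetical singleton rates measured on the five held-out folds
fold_summary = aggregate([0.70, 0.75, 0.65, 0.80, 0.70])
print(fold_summary)
```

Because each fold contributes one rate, the spread in `fold_summary` reflects fold-to-fold variability rather than pooled counts.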

ssbc.print_cross_conformal_results(results)[source]

Pretty print cross-conformal validation results.

Parameters:

results (dict) – Results from cross_conformal_validation()

Return type:

None

ssbc.validate_pac_bounds(report, simulator, test_size, n_trials=1000, seed=None, verbose=True, n_jobs=-1)[source]

Empirically validate prediction interval operational bounds.

Takes a PAC report from generate_rigorous_pac_report() and validates that the theoretical bounds actually hold in practice by:

  1. Extracting the FIXED thresholds from calibration

  2. Running n_trials simulations with fresh test sets

  3. Measuring empirical coverage of all reported bounds (analytical, exact, hoeffding)

When the report includes method comparison (prediction_method=”all”), validates all three methods separately. Otherwise, validates only the selected method.

Parameters:
  • report (dict) – Output from generate_rigorous_pac_report()

  • simulator (DataGenerator) – Simulator to generate independent test data (e.g., BinaryClassifierSimulator)

  • test_size (int) – Size of each test set

  • n_trials (int, default=1000) – Number of independent trials

  • seed (int, optional) – Random seed for reproducibility

  • verbose (bool, default=True) – Print validation progress

  • n_jobs (int, default=-1) – Number of parallel jobs for trial execution. -1 = use all cores (default), 1 = single-threaded, N = use N cores.

Returns:

Validation results with:

  • ‘marginal’: Marginal operational rates and coverage

  • ‘class_0’: Class 0 operational rates and coverage

  • ‘class_1’: Class 1 operational rates and coverage

Each containing ‘singleton’, ‘doublet’, ‘abstention’, ‘singleton_error’ dicts with:

  • ‘rates’: Array of rates across trials

  • ‘mean’: Mean rate

  • ‘quantiles’: Quantiles (5%, 25%, 50%, 75%, 95%)

  • ‘bounds’: Selected/default bounds from report

  • ‘expected’: Expected rate from report

  • ‘empirical_coverage’: Fraction of trials within selected bounds

  • ‘method_validations’: Dict of method-specific validations (when available):

    • ‘analytical’: {bounds, empirical_coverage}

    • ‘exact’: {bounds, empirical_coverage}

    • ‘hoeffding’: {bounds, empirical_coverage}

Return type:

dict

Examples

>>> from ssbc import BinaryClassifierSimulator, generate_rigorous_pac_report, validate_pac_bounds
>>> sim = BinaryClassifierSimulator(p_class1=0.2, seed=42)
>>> labels, probs = sim.generate(100)
>>> report = generate_rigorous_pac_report(labels, probs, delta=0.10)
>>> validation = validate_pac_bounds(report, sim, test_size=1000, n_trials=1000)
>>> print(f"Singleton coverage: {validation['marginal']['singleton']['empirical_coverage']:.1%}")

Notes

This function is useful for:

  • Verifying theoretical PAC guarantees empirically

  • Understanding the tightness of bounds

  • Debugging issues with bounds calculation

  • Generating validation plots for papers/reports

The empirical coverage should be ≥ PAC level (1 - δ) for rigorous bounds.
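The coverage check can be illustrated with a small standalone sketch. It assumes that ‘empirical_coverage’ is the fraction of trial rates falling inside the closed interval [lower, upper]; the helper name `empirical_coverage`, the endpoint convention, and all numbers are hypothetical.

```python
def empirical_coverage(rates, lower, upper):
    """Fraction of trial rates that fall inside [lower, upper]."""
    inside = sum(1 for rate in rates if lower <= rate <= upper)
    return inside / len(rates)


# Hypothetical singleton rates from ten validation trials
trial_rates = [0.71, 0.74, 0.69, 0.77, 0.73, 0.80, 0.72, 0.75, 0.70, 0.76]
bounds = (0.70, 0.78)  # hypothetical bounds taken from a report
delta = 0.10           # PAC risk used when the report was generated

coverage = empirical_coverage(trial_rates, *bounds)
meets_pac_level = coverage >= 1 - delta
print(f"coverage={coverage:.0%}, meets 1 - delta: {meets_pac_level}")
```

In this toy data two of the ten trials fall outside the bounds, so the check against 1 - δ fails; with real validation output the same comparison applies to each scope and metric.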

ssbc.print_validation_results(validation)[source]

Pretty print validation results.

Parameters:

validation (dict) – Output from validate_pac_bounds()

Return type:

None

Examples

>>> validation = validate_pac_bounds(report, sim, test_size=1000, n_trials=1000)
>>> print_validation_results(validation)
ssbc.plot_validation_bounds(validation, metric='singleton', show_detail=True, main_figsize=(18, 5), detail_figsize=(18, 12), bins=50, method_colors=None, return_figs=False)[source]

Plot empirical distributions with prediction interval bounds for all methods.

Creates visualization comparing empirical rates against bounds from analytical, exact, and hoeffding methods when available.

Parameters:
  • validation (dict) – Output from validate_pac_bounds() containing validation results

  • metric (str, default="singleton") – Which metric to plot. Options: “singleton”, “doublet”, “abstention”, “singleton_error”

  • show_detail (bool, default=True) – If True, also create detailed 3x3 grid showing each method separately

  • main_figsize (tuple[int, int], default=(18, 5)) – Figure size for main comparison plot (width, height in inches)

  • detail_figsize (tuple[int, int], default=(18, 12)) – Figure size for detailed method comparison grid (width, height in inches)

  • bins (int, default=50) – Number of bins for histograms

  • method_colors (dict or None, default=None) –

    Custom colors and linestyles for methods. Dict mapping method names to (color, linestyle) tuples. If None, uses default colors:

    • “analytical”: (“#2E86AB”, “solid”) (blue)

    • “exact”: (“#A23B72”, “dashed”) (purple)

    • “hoeffding”: (“#F18F01”, “dashdot”) (orange)

  • return_figs (bool, default=False) – If True, returns matplotlib Figure objects for further customization. Returns (fig_main, fig_detail) or (fig_main, None) if show_detail=False. If False, calls plt.show() and returns None.

Returns:

If return_figs=True:
  • (fig_main, fig_detail) if show_detail=True

  • (fig_main, None) if show_detail=False

If return_figs=False: None (displays plots directly)

Return type:

tuple or None

Examples

>>> from ssbc import validate_pac_bounds, plot_validation_bounds
>>> validation = validate_pac_bounds(report, sim, test_size=1000, n_trials=1000)
>>> plot_validation_bounds(validation, metric="singleton")
>>> # Or get figure objects for customization
>>> fig_main, fig_detail = plot_validation_bounds(
...     validation, metric="singleton", return_figs=True
... )
>>> fig_main.savefig("validation_main.png")

Notes

The main plot shows all three methods overlaid on the same histogram for easy comparison. The detailed plot shows each method separately in a 3x3 grid. Both plots include:

  • Empirical distribution histogram

  • Method-specific bounds (when method comparison available)

  • Expected value from LOO-CV

  • Empirical mean from validation trials

  • Coverage percentages for each method

ssbc.validate_prediction_interval_calibration(simulator, n_calibration, BigN, alpha_target=0.1, delta=0.1, test_size=1000, n_trials=1000, ci_level=0.95, use_loo_correction=True, prediction_method='all', loo_inflation_factor=None, seed=None, n_jobs=-1, verbose=False)[source]

Validate that prediction interval confidence level holds across calibration datasets.

This meta-validation checks if the nominal confidence level (e.g., 95%) actually holds when repeating the entire calibration+validation process many times with different calibration datasets.

For each of BigN calibration datasets:

  1. Generate random calibration data

  2. Compute prediction interval bounds

  3. Validate bounds with many test sets

  4. Record empirical coverage

The results are then aggregated to check whether roughly 95% of calibrations achieve ≥95% coverage.

Parameters:
  • simulator (DataGenerator) – Simulator for generating calibration and test data (e.g., BinaryClassifierSimulator)

  • n_calibration (int) – Size of each calibration dataset

  • BigN (int) – Number of different calibration datasets to test

  • alpha_target (float or dict[int, float], default=0.10) – Target miscoverage rate per class

  • delta (float or dict[int, float], default=0.10) – PAC risk tolerance for threshold calibration

  • test_size (int, default=1000) – Size of each test set in validation

  • n_trials (int, default=1000) – Number of test sets per calibration dataset (for validation)

  • ci_level (float, default=0.95) – Nominal confidence level for prediction intervals (target to validate)

  • use_loo_correction (bool, default=True) – Use LOO-corrected bounds

  • prediction_method (str, default="all") – Method for bounds computation (“all” to compare all methods)

  • loo_inflation_factor (float, optional) – Manual override for LOO inflation factor

  • seed (int, optional) – Random seed for reproducibility

  • n_jobs (int, default=-1) – Number of parallel jobs (-1 = all cores)

  • verbose (bool, default=False) – If True, print progress for each calibration dataset

Returns:

Meta-validation results with keys:

  • ‘n_calibrations’: BigN

  • ‘n_calibration’: Calibration dataset size

  • ‘n_trials_per_calibration’: n_trials

  • ‘ci_level’: Target confidence level

  • ‘marginal’: Dict with coverage statistics per method

  • ‘class_0’: Dict with coverage statistics per method

  • ‘class_1’: Dict with coverage statistics per method

Each scope contains ‘singleton’, ‘doublet’, ‘abstention’, ‘singleton_error’ dicts with:

  • ‘selected’: Coverage stats for selected bounds

  • ‘analytical’: Coverage stats if available

  • ‘exact’: Coverage stats if available

  • ‘hoeffding’: Coverage stats if available

Each method has:

  • ‘coverages’: Array of empirical coverages across BigN calibrations

  • ‘mean’: Mean coverage

  • ‘median’: Median coverage

  • ‘quantiles’: {q05, q25, q50, q75, q95}

  • ‘fraction_above_target’: Fraction achieving ≥ci_level

  • ‘fraction_above_95pct’: Fraction achieving ≥95% (for comparison)

Return type:

dict
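To make the per-method statistics concrete, the following sketch aggregates one empirical coverage per calibration dataset into a mean, a median, and the fraction reaching the target level. The helper `coverage_stats` and the numbers are hypothetical; only the meaning of ‘fraction_above_target’ is taken from the description above.

```python
import statistics


def coverage_stats(coverages, ci_level=0.95):
    """Summarize per-calibration empirical coverages against a target level."""
    n = len(coverages)
    return {
        "mean": statistics.fmean(coverages),
        "median": statistics.median(coverages),
        # Share of calibration datasets whose coverage reached ci_level
        "fraction_above_target": sum(c >= ci_level for c in coverages) / n,
    }


# Hypothetical empirical coverages, one per calibration dataset (BigN = 8)
coverages = [0.97, 0.93, 0.99, 0.96, 0.91, 0.98, 0.95, 0.94]
stats = coverage_stats(coverages, ci_level=0.95)
print(stats)
```

A ‘fraction_above_target’ well below the nominal level suggests the bounds are too narrow for the chosen calibration size.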

Examples

>>> from ssbc import BinaryClassifierSimulator, validate_prediction_interval_calibration
>>> sim = BinaryClassifierSimulator(p_class1=0.2, seed=42)
>>> results = validate_prediction_interval_calibration(
...     simulator=sim,
...     n_calibration=100,
...     BigN=50,
...     n_trials=500,
...     verbose=False
... )
>>> print(f"Fraction achieving ≥95%: {results['marginal']['singleton']['selected']['fraction_above_target']:.1%}")
ssbc.print_calibration_validation_results(results)[source]

Pretty print meta-validation results.

Parameters:

results (dict) – Output from validate_prediction_interval_calibration()

Return type:

None

Examples

>>> results = validate_prediction_interval_calibration(...)
>>> print_calibration_validation_results(results)
ssbc.get_calibration_bounds_dataframe(results, scope=None, metric=None)[source]

Extract calibration bounds and observed quantiles as DataFrame.

Converts the raw calibration data from validate_prediction_interval_calibration() into a pandas DataFrame format for easy plotting and analysis.

Parameters:
  • results (dict) – Output from validate_prediction_interval_calibration()

  • scope (str, optional) – Filter to specific scope: “marginal”, “class_0”, or “class_1”. If None, includes all scopes.

  • metric (str, optional) – Filter to specific metric: “singleton”, “doublet”, “abstention”, “singleton_error”. If None, includes all metrics.

Returns:

Pandas DataFrame with columns:

  • calibration_idx: Index of calibration dataset (0 to BigN-1)

  • scope: marginal, class_0, or class_1

  • metric: singleton, doublet, abstention, singleton_error

  • observed_q05: 5th percentile of test set rates

  • observed_q95: 95th percentile of test set rates

  • selected_lower: Lower bound from selected method

  • selected_upper: Upper bound from selected method

  • analytical_lower: Lower bound from analytical method (NaN if not available)

  • analytical_upper: Upper bound from analytical method (NaN if not available)

  • exact_lower: Lower bound from exact method (NaN if not available)

  • exact_upper: Upper bound from exact method (NaN if not available)

  • hoeffding_lower: Lower bound from hoeffding method (NaN if not available)

  • hoeffding_upper: Upper bound from hoeffding method (NaN if not available)

Return type:

DataFrame

Examples

>>> import pandas as pd
>>> from ssbc import get_calibration_bounds_dataframe
>>> results = validate_prediction_interval_calibration(...)
>>> df = get_calibration_bounds_dataframe(results)
>>> # Filter to singleton marginal
>>> df_single = df[(df['scope'] == 'marginal') & (df['metric'] == 'singleton')]
>>> # Plot lower bounds
>>> import matplotlib.pyplot as plt
>>> plt.scatter(df_single['analytical_lower'], df_single['observed_q05'])
ssbc.plot_calibration_excess(df, scope=None, metric=None, methods=None, figsize=(14, 6), bins=30, return_fig=False)[source]

Plot excess (difference between observed and predicted quantiles).

Creates histograms showing:

  • Lower excess: observed_q05 - predicted_lower (positive = predicted lower bound lies below the observed 5th percentile)

  • Upper excess: predicted_upper - observed_q95 (positive = predicted upper bound lies above the observed 95th percentile)

Parameters:
  • df (DataFrame) – Output from get_calibration_bounds_dataframe()

  • scope (str, optional) – Filter to specific scope: “marginal”, “class_0”, or “class_1”. If None, uses all scopes (creates separate subplots).

  • metric (str, optional) – Filter to specific metric: “singleton”, “doublet”, “abstention”, “singleton_error”. If None, uses all metrics (creates separate subplots).

  • methods (list[str], optional) – Methods to plot: [“analytical”, “exact”, “hoeffding”]. If None, plots all available methods.

  • figsize (tuple[int, int], default=(14, 6)) – Figure size (width, height in inches)

  • bins (int, default=30) – Number of histogram bins

  • return_fig (bool, default=False) – If True, returns matplotlib Figure object. If False, calls plt.show()

Returns:

If return_fig=True, returns Figure object. Otherwise None.

Return type:

Figure or None
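The two excess quantities named in the function description are simple differences, sketched below for a single row. The helper `excess` and the row values are hypothetical; the column names follow get_calibration_bounds_dataframe().

```python
def excess(observed_q05, observed_q95, predicted_lower, predicted_upper):
    """Return (lower_excess, upper_excess) as defined in the description."""
    lower_excess = observed_q05 - predicted_lower
    upper_excess = predicted_upper - observed_q95
    return lower_excess, upper_excess


# Hypothetical row for one calibration dataset, analytical method
row = {
    "observed_q05": 0.62,
    "observed_q95": 0.81,
    "analytical_lower": 0.58,
    "analytical_upper": 0.79,
}
lower_excess, upper_excess = excess(
    row["observed_q05"],
    row["observed_q95"],
    row["analytical_lower"],
    row["analytical_upper"],
)
print(f"lower excess: {lower_excess:+.2f}, upper excess: {upper_excess:+.2f}")
```

Here the lower excess is positive because the predicted lower bound sits below the observed 5th percentile, while the upper excess is negative because the observed 95th percentile exceeds the predicted upper bound.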

Examples

>>> from ssbc import get_calibration_bounds_dataframe, plot_calibration_excess
>>> results = validate_prediction_interval_calibration(...)
>>> df = get_calibration_bounds_dataframe(results)
>>> # Plot for singleton marginal
>>> df_single = df[(df['scope'] == 'marginal') & (df['metric'] == 'singleton')]
>>> plot_calibration_excess(df_single, scope='marginal', metric='singleton')
ssbc.generate_rigorous_pac_report(labels, probs, alpha_target=0.1, delta=0.1, test_size=None, ci_level=0.95, use_union_bound=False, n_jobs=-1, verbose=True, prediction_method='exact', use_loo_correction=True, loo_inflation_factor=None)[source]

Generate complete rigorous PAC report with coverage volatility.

This is the unified entry point. A single call provides:

  • SSBC-corrected thresholds

  • Coverage guarantees

  • PAC-controlled operational bounds (marginal + per-class)

  • Singleton error rates with PAC guarantees

All bounds account for coverage volatility via BetaBinomial.

Parameters:
  • labels (np.ndarray, shape (n,)) – True labels (0 or 1)

  • probs (np.ndarray, shape (n, 2)) – Predicted probabilities [P(class=0), P(class=1)]

  • alpha_target (float or dict[int, float], default=0.10) – Target miscoverage per class

  • delta (float or dict[int, float], default=0.10) –

    PAC risk tolerance. Used for both:

    • Coverage guarantee (via SSBC)

    • Operational bounds (pac_level = 1 - delta)

  • test_size (int, optional) – Expected test set size. If None, uses calibration size

  • ci_level (float, default=0.95) – Confidence level for prediction bounds

  • prediction_method (str, default="exact") –

    Method for LOO uncertainty quantification (when use_loo_correction=True):

    • “auto”: Automatically select best method

    • “analytical”: Method 1 (recommended for n>=40)

    • “exact”: Method 2 (recommended for n=20-40, default)

    • “hoeffding”: Method 3 (ultra-conservative)

    • “all”: Compare all methods

    When use_loo_correction=False, this parameter is ignored.

  • use_loo_correction (bool, default=True) –

    If True, uses LOO-CV uncertainty correction for small samples (n=20-40). This accounts for all four sources of uncertainty:

    1. LOO-CV correlation structure (variance inflation ≈2×)

    2. Threshold calibration uncertainty

    3. Parameter estimation uncertainty

    4. Test sampling uncertainty

    Recommended for small calibration sets where standard bounds may be too narrow.

    LOO-CV Correlation Issue: The critical challenge with LOO-CV is that the N LOO predictions are not independent. The training sets for different folds overlap substantially—folds i and j using training sets D_{-i} and D_{-j} differ by only two examples out of N−1. Because each fold’s threshold is computed from nearly identical data, the resulting predictions exhibit strong positive correlation. This correlation structure is handled through specialized LOO-corrected methods that account for the dependency between folds when computing diagnostic bounds.

  • loo_inflation_factor (float, optional) –

    Manual override for LOO variance inflation factor. If None (default), automatically estimated from the data using empirical variance.

    Empirical Correction Factor Estimation: The inflation factor is estimated by comparing the empirical variance of LOO predictions to the theoretical IID variance. Specifically, inflation = (Var_empirical / Var_IID) × (n / (n-1)), where Var_empirical is the sample variance of the binary LOO predictions (with Bessel’s correction), Var_IID = p̂(1-p̂) is the expected variance under independence, and the n/(n-1) factor accounts for the finite-sample bias correction. For large n, this approaches the theoretical value of 2.0, but for small samples (n=20-40), the actual inflation can vary. The estimated factor is clipped to [1.0, 6.0] to prevent extreme values from outliers or numerical instability.

    Typical values:

    • 1.0: No inflation (assumes independent samples, usually wrong for LOO)

    • 2.0: Standard LOO inflation (theoretical value for n→∞)

    • 1.5-2.5: Empirical range for small samples

    • >2.5: High correlation scenarios

    • Up to 6.0: Extended range for very high correlation scenarios

    Note: This parameter can be used as a phenomenological control knob to correct for issues not modeled properly in the statistical framework. For example, if validation suggests the default estimation is too optimistic or too conservative, manually adjusting this factor can help achieve desired coverage behavior. Use with caution and validate empirically.

  • use_union_bound (bool, default=False) – Apply Bonferroni for simultaneous guarantees

  • n_jobs (int, default=-1) – Number of parallel jobs for LOO-CV computation. -1 = use all cores (default), 1 = single-threaded, N = use N cores.

  • verbose (bool, default=True) – Print comprehensive report
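The overlap between leave-one-out training sets, which the use_loo_correction description identifies as the source of correlation, can be checked directly. The sketch below is a standalone illustration with a hypothetical helper `loo_training_sets`; it does not reproduce the correction itself.

```python
def loo_training_sets(n):
    """Training index sets D_{-i} for leave-one-out over n examples."""
    everything = set(range(n))
    return [everything - {i} for i in range(n)]


n = 30
train_sets = loo_training_sets(n)
shared = train_sets[0] & train_sets[1]     # examples used by both folds
differing = train_sets[0] ^ train_sets[1]  # examples used by exactly one fold
print(len(train_sets[0]), len(shared), len(differing))
```

Any two folds share n - 2 of their n - 1 training examples, so thresholds computed on them are nearly identical and the resulting LOO predictions cannot be treated as independent.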

Returns:

Complete report with keys:

  • ‘ssbc_class_0’: SSBCResult for class 0

  • ‘ssbc_class_1’: SSBCResult for class 1

  • ‘pac_bounds_marginal’: PAC operational bounds (marginal)

  • ‘pac_bounds_class_0’: PAC operational bounds (class 0)

  • ‘pac_bounds_class_1’: PAC operational bounds (class 1)

  • ‘calibration_result’: From mondrian_conformal_calibrate

  • ‘prediction_stats’: From mondrian_conformal_calibrate

Return type:

dict

Examples

>>> from ssbc import BinaryClassifierSimulator, generate_rigorous_pac_report
>>>
>>> sim = BinaryClassifierSimulator(p_class1=0.5, seed=42)
>>> labels, probs = sim.generate(n_samples=1000)
>>>
>>> report = generate_rigorous_pac_report(
...     labels, probs,
...     alpha_target=0.10,
...     delta=0.10,
...     verbose=True
... )

Notes

This replaces the old workflow (removed in v1.1.0):

OLD (removed - these functions no longer exist):

```python
# These functions were removed in v1.1.0:
# op_bounds = compute_mondrian_operational_bounds(...)  # Removed
# marginal_bounds = compute_marginal_operational_bounds(...)  # Removed
# report_prediction_stats(...)  # Removed
```

NEW (rigorous):

```python
report = generate_rigorous_pac_report(labels, probs, alpha_target, delta)
# Done! All bounds account for coverage volatility.
```

ssbc.sweep_hyperparams_and_collect(class_data, alpha_0, delta_0, alpha_1, delta_1, mode='beta', extra_metrics=None, quiet=True)[source]

Sweep over (a0, d0, a1, d1), run mondrian_conformal_calibrate and report_prediction_stats for each combination, and return a tidy DataFrame of hyperparameters and selected metrics.

This function performs a grid search over hyperparameter combinations and evaluates the resulting conformal prediction performance.

Parameters:
  • class_data (dict) – Output from split_by_class()

  • alpha_0 (array-like) – Grid of alpha values for class 0

  • delta_0 (array-like) – Grid of delta values for class 0

  • alpha_1 (array-like) – Grid of alpha values for class 1

  • delta_1 (array-like) – Grid of delta values for class 1

  • mode (str, default="beta") – “beta” or “beta-binomial” mode for SSBC

  • extra_metrics (dict of {name: function}, optional) – Additional metrics to compute. Each function takes the summary dict and returns a scalar value.

  • quiet (bool, default=True) – If True, suppress progress output

Returns:

Tidy dataframe with one row per hyperparameter combination. Columns include:

  • a0, d0, a1, d1: hyperparameters

  • cov: overall coverage rate

  • sing_rate: singleton prediction rate

  • err_all: overall singleton error rate

  • err_pred0, err_pred1: errors by predicted class

  • err_y0, err_y1: errors by true class

  • esc_rate: escalation rate (doublets + abstentions)

  • n_total, sing_count, m_abst, m_doublets: counts

  • Any additional metrics from extra_metrics

Return type:

pd.DataFrame
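For orientation, the sketch below shows how rates of this kind can be derived from binary prediction sets. It is an assumption-laden illustration rather than the library's computation: `operational_rates` is a hypothetical helper, prediction sets are plain Python sets over {0, 1}, and the singleton error rate is taken relative to the number of singletons.

```python
def operational_rates(prediction_sets, labels):
    """Rates for binary prediction sets, each a set drawn from {0, 1}."""
    n_total = len(labels)
    singletons = [(s, y) for s, y in zip(prediction_sets, labels) if len(s) == 1]
    m_doublets = sum(1 for s in prediction_sets if len(s) == 2)
    m_abst = sum(1 for s in prediction_sets if len(s) == 0)
    covered = sum(1 for s, y in zip(prediction_sets, labels) if y in s)
    wrong_singletons = sum(1 for s, y in singletons if y not in s)
    return {
        "cov": covered / n_total,
        "sing_rate": len(singletons) / n_total,
        # Assumption: errors are counted among singletons only
        "err_all": wrong_singletons / len(singletons) if singletons else 0.0,
        "esc_rate": (m_doublets + m_abst) / n_total,
    }


# Hypothetical prediction sets and true labels for eight test points
sets = [{0}, {1}, {0, 1}, set(), {0}, {1}, {0}, {0, 1}]
labels = [0, 1, 1, 0, 1, 1, 0, 0]
rates = operational_rates(sets, labels)
print(rates)
```

Every prediction set is either a singleton or an escalation (doublet or abstention), so sing_rate and esc_rate sum to one.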

Examples

>>> import numpy as np
>>> from ssbc import BinaryClassifierSimulator, split_by_class
>>>
>>> # Generate data
>>> sim = BinaryClassifierSimulator(0.1, (2, 8), (8, 2), seed=42)
>>> labels, probs = sim.generate(1000)
>>> class_data = split_by_class(labels, probs)
>>>
>>> # Define grid
>>> alpha_grid = np.arange(0.05, 0.20, 0.05)
>>> delta_grid = np.arange(0.05, 0.20, 0.05)
>>>
>>> # Run sweep
>>> df = sweep_hyperparams_and_collect(
...     class_data,
...     alpha_0=alpha_grid, delta_0=delta_grid,
...     alpha_1=alpha_grid, delta_1=delta_grid,
... )
>>>
>>> # Analyze results
>>> print(df[['a0', 'a1', 'cov', 'sing_rate', 'err_all']].head())

Notes

The function performs a complete grid search, so the total number of evaluations is len(alpha_0) × len(delta_0) × len(alpha_1) × len(delta_1). For large grids, this can be computationally expensive.
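The size of the grid can be checked before launching a sweep. The sketch below enumerates the Cartesian product of four small, hypothetical grids with the standard library; it only counts combinations and does not call the sweep itself.

```python
import itertools
import math

# Hypothetical grids; explicit lists avoid floating-point surprises in range construction
alpha_0 = [0.05, 0.10, 0.15]
delta_0 = [0.05, 0.10]
alpha_1 = [0.05, 0.10, 0.15]
delta_1 = [0.05, 0.10]

# One tuple (a0, d0, a1, d1) per evaluation
grid = list(itertools.product(alpha_0, delta_0, alpha_1, delta_1))
expected = math.prod(len(g) for g in (alpha_0, delta_0, alpha_1, delta_1))
print(len(grid), expected)
```

Doubling the resolution of every grid multiplies the number of evaluations by 16, which is why coarse grids are a sensible starting point.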

ssbc.sweep_and_plot_parallel_plotly(class_data, delta_0, delta_1, alpha_0, alpha_1, mode='beta', extra_metrics=None, color='err_all', color_continuous_scale=None, title=None, height=600)[source]

Convenience wrapper: run sweep + show plotly parallel coordinates figure.

This function combines hyperparameter sweep and visualization in one call.

Parameters:
  • class_data (dict) – Output from split_by_class()

  • delta_0 (array-like) – Grid of delta values for class 0

  • delta_1 (array-like) – Grid of delta values for class 1

  • alpha_0 (array-like) – Grid of alpha values for class 0

  • alpha_1 (array-like) – Grid of alpha values for class 1

  • mode (str, default="beta") – “beta” or “beta-binomial” mode for SSBC

  • extra_metrics (dict of {name: function}, optional) – Additional metrics to compute

  • color (str, default='err_all') – Column to use for coloring the parallel coordinates

  • color_continuous_scale (plotly colorscale, optional) – Color scale for the plot

  • title (str, optional) – Plot title (defaults to auto-generated title)

  • height (int, default=600) – Plot height in pixels

Returns:

  • df (pd.DataFrame) – Results dataframe

  • fig (plotly.graph_objects.Figure) – Interactive parallel coordinates plot

Examples

>>> import numpy as np
>>> from ssbc import BinaryClassifierSimulator, split_by_class
>>>
>>> # Generate data
>>> sim = BinaryClassifierSimulator(0.1, (2, 8), (8, 2), seed=42)
>>> labels, probs = sim.generate(1000)
>>> class_data = split_by_class(labels, probs)
>>>
>>> # Run sweep and plot
>>> df, fig = sweep_and_plot_parallel_plotly(
...     class_data,
...     delta_0=np.arange(0.05, 0.20, 0.05),
...     delta_1=np.arange(0.05, 0.20, 0.05),
...     alpha_0=np.arange(0.05, 0.20, 0.05),
...     alpha_1=np.arange(0.05, 0.20, 0.05),
...     color='err_all'
... )
>>> fig.show()  # Display in notebook
>>> # Or save: fig.write_html("sweep_results.html")

Notes

The parallel coordinates plot allows interactive exploration of the hyperparameter space. You can brush (select) ranges on any axis to filter configurations and see their impact on other metrics.
