ssbc.validation_pkg

Validation API facade.

Provides a stable package path for validation utilities.

ssbc.validation_pkg.get_calibration_bounds_dataframe(results, scope=None, metric=None)

Extract calibration bounds and observed quantiles as a DataFrame.

Converts the raw calibration data from validate_prediction_interval_calibration() into a pandas DataFrame format for easy plotting and analysis.

Parameters:

results (dict) – Output from validate_prediction_interval_calibration()
scope (str, optional) – Filter to specific scope: “marginal”, “class_0”, or “class_1”. If None, includes all scopes.
metric (str, optional) – Filter to specific metric: “singleton”, “doublet”, “abstention”, “singleton_error”. If None, includes all metrics.

Returns:

pandas DataFrame with columns:

- calibration_idx: Index of calibration dataset (0 to BigN-1)
- scope: marginal, class_0, or class_1
- metric: singleton, doublet, abstention, singleton_error
- observed_q05: 5th percentile of test set rates
- observed_q95: 95th percentile of test set rates
- selected_lower: Lower bound from selected method
- selected_upper: Upper bound from selected method
- analytical_lower: Lower bound from analytical method (NaN if not available)
- analytical_upper: Upper bound from analytical method (NaN if not available)
- exact_lower: Lower bound from exact method (NaN if not available)
- exact_upper: Upper bound from exact method (NaN if not available)
- hoeffding_lower: Lower bound from hoeffding method (NaN if not available)
- hoeffding_upper: Upper bound from hoeffding method (NaN if not available)

Return type:

DataFrame

Examples

>>> import pandas as pd
>>> from ssbc import get_calibration_bounds_dataframe, validate_prediction_interval_calibration
>>> results = validate_prediction_interval_calibration(...)
>>> df = get_calibration_bounds_dataframe(results)
>>> # Filter to singleton marginal
>>> df_single = df[(df['scope'] == 'marginal') & (df['metric'] == 'singleton')]
>>> # Plot lower bounds
>>> import matplotlib.pyplot as plt
>>> plt.scatter(df_single['analytical_lower'], df_single['observed_q05'])

ssbc.validation_pkg.plot_calibration_excess(df, scope=None, metric=None, methods=None, figsize=(14, 6), bins=30, return_fig=False, filename=None)

Plot excess (difference between observed and predicted quantiles).

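Creates histograms showing:

- Lower excess: observed_q05 - predicted_lower (positive = predicted lower bound lies below the observed 5th percentile, i.e. conservative)
- Upper excess: predicted_upper - observed_q95 (positive = predicted upper bound lies above the observed 95th percentile, i.e. conservative)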
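The excess arithmetic above can be sketched with plain Python dictionaries standing in for DataFrame rows. The row values are invented for illustration, and the helper name `excess_for_row` is hypothetical; it is not part of ssbc.

```python
# Hypothetical rows mimicking columns of get_calibration_bounds_dataframe();
# the numbers are invented for illustration only.
rows = [
    {"observed_q05": 0.70, "observed_q95": 0.80,
     "analytical_lower": 0.68, "analytical_upper": 0.83},
    {"observed_q05": 0.66, "observed_q95": 0.79,
     "analytical_lower": 0.69, "analytical_upper": 0.78},
]


def excess_for_row(row, method="analytical"):
    """Return (lower_excess, upper_excess) for one row (illustrative helper)."""
    lower_excess = row["observed_q05"] - row[f"{method}_lower"]
    upper_excess = row[f"{method}_upper"] - row["observed_q95"]
    return lower_excess, upper_excess


for row in rows:
    lower, upper = excess_for_row(row)
    # Positive excess: the bound sits outside the observed 5%-95% band.
    # Negative excess: the observed quantile crosses the bound.
    print(f"lower excess = {lower:+.2f}, upper excess = {upper:+.2f}")
```

In the first row both bounds enclose the observed band, so both excesses are positive; in the second row both bounds are crossed, so both are negative.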

Parameters:

df (DataFrame) – Output from get_calibration_bounds_dataframe()
scope (str, optional) – Filter to specific scope: “marginal”, “class_0”, or “class_1”. If None, uses all scopes (creates separate subplots).
metric (str, optional) – Filter to specific metric: “singleton”, “doublet”, “abstention”, “singleton_error”. If None, uses all metrics (creates separate subplots).
methods (list[str], optional) – Methods to plot: [“analytical”, “exact”, “hoeffding”]. If None, plots all available methods.
figsize (tuple[int, int], default=(14, 6)) – Figure size (width, height in inches)
bins (int, default=30) – Number of histogram bins
return_fig (bool, default=False) – If True, returns matplotlib Figure object. If False, calls plt.show()
filename (str, optional) – If provided, saves the plot to this filename (e.g., “excess_plot.png”). Supports common formats: .png, .pdf, .svg, .jpg, etc.

Returns:

If return_fig=True, returns Figure object. Otherwise None.

Return type:

Figure or None

Examples

>>> from ssbc import (
...     get_calibration_bounds_dataframe,
...     plot_calibration_excess,
...     validate_prediction_interval_calibration,
... )
>>> results = validate_prediction_interval_calibration(...)
>>> df = get_calibration_bounds_dataframe(results)
>>> # Plot for singleton marginal
>>> df_single = df[(df['scope'] == 'marginal') & (df['metric'] == 'singleton')]
>>> plot_calibration_excess(df_single, scope='marginal', metric='singleton')

ssbc.validation_pkg.plot_validation_bounds(validation, metric='singleton', show_detail=True, main_figsize=(18, 5), detail_figsize=(18, 12), bins=50, method_colors=None, return_figs=False)

Plot empirical distributions with prediction interval bounds for all methods.

Creates visualization comparing empirical rates against bounds from analytical, exact, and hoeffding methods when available.

Parameters:

validation (dict) – Output from validate_pac_bounds() containing validation results
metric (str, default="singleton") – Which metric to plot. Options: “singleton”, “doublet”, “abstention”, “singleton_error”
show_detail (bool, default=True) – If True, also create detailed 3x3 grid showing each method separately
main_figsize (tuple[int, int], default=(18, 5)) – Figure size for main comparison plot (width, height in inches)
detail_figsize (tuple[int, int], default=(18, 12)) – Figure size for detailed method comparison grid (width, height in inches)
bins (int, default=50) – Number of bins for histograms
method_colors (dict or None, default=None) – Custom colors and linestyles for methods. Dict mapping method names to (color, linestyle) tuples. If None, uses default colors:
- “analytical”: (“#2E86AB”, “solid”) # Blue
- “exact”: (“#A23B72”, “dashed”) # Purple
- “hoeffding”: (“#F18F01”, “dashdot”) # Orange
return_figs (bool, default=False) – If True, returns matplotlib Figure objects for further customization. Returns (fig_main, fig_detail) or (fig_main, None) if show_detail=False. If False, calls plt.show() and returns None.

Returns:

If return_figs=True:

(fig_main, fig_detail) if show_detail=True
(fig_main, None) if show_detail=False

If return_figs=False: None (displays plots directly)

Return type:

tuple or None

Examples

>>> from ssbc import validate_pac_bounds, plot_validation_bounds
>>> validation = validate_pac_bounds(report, sim, test_size=1000, n_trials=1000)
>>> plot_validation_bounds(validation, metric="singleton")
>>> # Or get figure objects for customization
>>> fig_main, fig_detail = plot_validation_bounds(
...     validation, metric="singleton", return_figs=True
... )
>>> fig_main.savefig("validation_main.png")
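The method_colors parameter takes a dict of (color, linestyle) tuples keyed by method name. The following is a small sketch of building such a dict; the hex colors are arbitrary illustrative choices rather than ssbc defaults, and the sanity check is an optional extra, not something the library is known to require.

```python
# Custom styles for the three methods; each value is a (color, linestyle) tuple.
# The hex colors below are arbitrary illustrative choices, not ssbc defaults.
custom_colors = {
    "analytical": ("#1B9E77", "solid"),
    "exact": ("#D95F02", "dashed"),
    "hoeffding": ("#7570B3", "dotted"),
}

# Sanity-check the shape before handing the dict to the plotting function.
for method, style in custom_colors.items():
    color, linestyle = style
    assert color.startswith("#") and len(color) == 7, method
    assert linestyle in {"solid", "dashed", "dashdot", "dotted"}, method

# With a `validation` dict from validate_pac_bounds(), the call would then be:
# plot_validation_bounds(validation, metric="singleton", method_colors=custom_colors)
```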

Notes

The main plot shows all three methods overlaid on the same histogram for easy comparison. The detailed plot shows each method separately in a 3x3 grid. Both plots include:

- Empirical distribution histogram
- Method-specific bounds (when method comparison available)
- Expected value from LOO-CV
- Empirical mean from validation trials
- Coverage percentages for each method

ssbc.validation_pkg.print_calibration_validation_results(results)

Pretty print meta-validation results.

Parameters:

results (dict) – Output from validate_prediction_interval_calibration()

Return type:

None

Examples

>>> results = validate_prediction_interval_calibration(...)
>>> print_calibration_validation_results(results)

ssbc.validation_pkg.print_validation_results(validation)

Pretty print validation results.

Parameters:

validation (dict) – Output from validate_pac_bounds()

Return type:

None

Examples

>>> validation = validate_pac_bounds(report, sim, test_size=1000, n_trials=1000)
>>> print_validation_results(validation)

ssbc.validation_pkg.tabulate_calibration_excess(df, scope=None, metric=None, methods=None)

Tabulate excess values for all methods as a DataFrame.

Computes excess values (difference between observed and predicted quantiles) for each method and returns a structured DataFrame with statistics.

Parameters:

df (DataFrame) – Output from get_calibration_bounds_dataframe()
scope (str, optional) – Filter to specific scope: “marginal”, “class_0”, or “class_1”. If None, includes all scopes.
metric (str, optional) – Filter to specific metric: “singleton”, “doublet”, “abstention”, “singleton_error”. If None, includes all metrics.
methods (list[str], optional) – Methods to include: [“analytical”, “exact”, “hoeffding”, “selected”]. If None, includes all available methods.

Returns:

DataFrame with columns:

- method: Method name (analytical, exact, hoeffding, selected)
- bound_type: “lower” or “upper”
- excess_mean: Mean excess value
- excess_std: Standard deviation of excess
- excess_min: Minimum excess value
- excess_max: Maximum excess value
- excess_q05: 5th percentile of excess
- excess_q25: 25th percentile of excess
- excess_q50: 50th percentile (median) of excess
- excess_q75: 75th percentile of excess
- excess_q95: 95th percentile of excess
- n_negative: Number of negative excess values (risky)
- n_positive: Number of positive excess values (conservative)
- pct_negative: Percentage of negative excess values
- n_valid: Number of valid (non-NaN) excess values
- scope: Scope name (if filtered)
- metric: Metric name (if filtered)

Return type:

DataFrame

Examples

>>> from ssbc import (
...     get_calibration_bounds_dataframe,
...     tabulate_calibration_excess,
...     validate_prediction_interval_calibration,
... )
>>> results = validate_prediction_interval_calibration(...)
>>> df = get_calibration_bounds_dataframe(results)
>>> # Tabulate excess for singleton marginal
>>> df_single = df[(df['scope'] == 'marginal') & (df['metric'] == 'singleton')]
>>> excess_table = tabulate_calibration_excess(df_single, scope='marginal', metric='singleton')
>>> print(excess_table)
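To make the summary columns concrete, here is a rough standard-library sketch of how such statistics could be derived from one list of excess values. The values are invented, and two details are assumptions rather than documented ssbc behavior: the use of the sample standard deviation, and treating an excess of exactly zero as neither negative nor positive.

```python
import math
import statistics

# Invented excess values for one method and bound type; NaN marks a
# calibration run where the method's bound was not available.
excess = [0.02, -0.01, 0.03, float("nan"), 0.00, -0.02, 0.04]

# Drop NaN entries before computing any statistic.
valid = [x for x in excess if not math.isnan(x)]
n_valid = len(valid)
n_negative = sum(x < 0 for x in valid)  # risky cases
n_positive = sum(x > 0 for x in valid)  # conservative cases

summary = {
    "excess_mean": statistics.fmean(valid),
    "excess_std": statistics.stdev(valid),  # sample std (assumption)
    "excess_min": min(valid),
    "excess_max": max(valid),
    "excess_q50": statistics.median(valid),
    "n_negative": n_negative,
    "n_positive": n_positive,
    "pct_negative": 100.0 * n_negative / n_valid,
    "n_valid": n_valid,
}
print(summary)
```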

ssbc.validation_pkg.validate_pac_bounds(report, simulator, test_size, n_trials=1000, seed=None, verbose=True, n_jobs=-1)

Empirically validate prediction interval operational bounds.

Takes a PAC report from generate_rigorous_pac_report() and validates that the theoretical bounds actually hold in practice by:

1. Extracting the FIXED thresholds from calibration
2. Running n_trials simulations with fresh test sets
3. Measuring empirical coverage of all reported bounds (analytical, exact, hoeffding)

When the report includes method comparison (prediction_method="all"), validates all three methods separately. Otherwise, validates only the selected method.

Parameters:

report (dict) – Output from generate_rigorous_pac_report()
simulator (DataGenerator) – Simulator to generate independent test data (e.g., BinaryClassifierSimulator)
test_size (int) – Size of each test set
n_trials (int, default=1000) – Number of independent trials
seed (int, optional) – Random seed for reproducibility
verbose (bool, default=True) – Print validation progress
n_jobs (int, default=-1) – Number of parallel jobs for trial execution. -1 = use all cores (default), 1 = single-threaded, N = use N cores.

Returns:

Validation results with:

- ‘marginal’: Marginal operational rates and coverage
- ‘class_0’: Class 0 operational rates and coverage
- ‘class_1’: Class 1 operational rates and coverage

Each containing ‘singleton’, ‘doublet’, ‘abstention’, ‘singleton_error’ dicts with:

- ‘rates’: Array of rates across trials
- ‘mean’: Mean rate
- ‘quantiles’: Quantiles (5%, 25%, 50%, 75%, 95%)
- ‘bounds’: Selected/default bounds from report
- ‘expected’: Expected rate from report
- ‘empirical_coverage’: Fraction of trials within selected bounds
- ‘method_validations’: Dict of method-specific validations (when available):
    - ‘analytical’: {bounds, empirical_coverage}
    - ‘exact’: {bounds, empirical_coverage}
    - ‘hoeffding’: {bounds, empirical_coverage}

Return type:

dict

Examples

>>> from ssbc import BinaryClassifierSimulator, generate_rigorous_pac_report, validate_pac_bounds
>>> sim = BinaryClassifierSimulator(p_class1=0.2, seed=42)
>>> labels, probs = sim.generate(100)
>>> report = generate_rigorous_pac_report(labels, probs, delta=0.10)
>>> validation = validate_pac_bounds(report, sim, test_size=1000, n_trials=1000)
>>> print(f"Singleton coverage: {validation['marginal']['singleton']['empirical_coverage']:.1%}")

Notes

This function is useful for:

- Verifying theoretical PAC guarantees empirically
- Understanding the tightness of bounds
- Debugging issues with bounds calculation
- Generating validation plots for papers/reports

The empirical coverage should be ≥ PAC level (1 - δ) for rigorous bounds.
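The coverage check can be illustrated with a self-contained simulation. This sketch uses the textbook two-sided Hoeffding interval centered on a known true rate, which is an assumption made for illustration: ssbc derives its bounds from calibration data, and its hoeffding method may differ in detail. The helper names are hypothetical.

```python
import math
import random


def hoeffding_halfwidth(n, delta):
    """Two-sided Hoeffding half-width for the mean of n values in [0, 1]."""
    return math.sqrt(math.log(2.0 / delta) / (2.0 * n))


def empirical_coverage(rates, lower, upper):
    """Fraction of trial rates that fall inside [lower, upper]."""
    return sum(lower <= r <= upper for r in rates) / len(rates)


rng = random.Random(0)
true_rate, test_size, n_trials, delta = 0.3, 500, 200, 0.10

# One simulated rate per trial: the share of "events" in a fresh test set.
rates = [
    sum(rng.random() < true_rate for _ in range(test_size)) / test_size
    for _ in range(n_trials)
]

t = hoeffding_halfwidth(test_size, delta)
coverage = empirical_coverage(rates, true_rate - t, true_rate + t)
print(f"coverage = {coverage:.1%}, target = {1 - delta:.1%}")
```

Because Hoeffding's inequality is distribution-free, the interval is typically wider than needed, so the observed coverage should sit well above 1 - δ here.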

ssbc.validation_pkg.validate_prediction_interval_calibration(simulator, n_calibration, BigN, alpha_target=0.1, delta=0.1, test_size=1000, n_trials=1000, ci_level=0.95, use_loo_correction=True, prediction_method='all', loo_inflation_factor=None, seed=None, n_jobs=-1, verbose=False)

Validate that prediction interval confidence level holds across calibration datasets.

This meta-validation checks if the nominal confidence level (e.g., 95%) actually holds when repeating the entire calibration+validation process many times with different calibration datasets.

For each of BigN calibration datasets:

1. Generate random calibration data
2. Compute prediction interval bounds
3. Validate bounds with many test sets
4. Record empirical coverage

Then aggregates results to see if ~95% of calibrations achieve ≥95% coverage.
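The aggregation step can be sketched as follows, assuming one empirical coverage value has already been recorded per calibration dataset. The coverage numbers are invented, and the dictionary keys simply mirror the names documented for the return value.

```python
import statistics

# Invented empirical coverages, one per calibration dataset (BigN = 5 here).
coverages = [0.97, 0.93, 0.96, 0.99, 0.95]
ci_level = 0.95

stats = {
    "mean": statistics.fmean(coverages),
    "median": statistics.median(coverages),
    # Share of calibration datasets whose coverage reaches the nominal level.
    "fraction_above_target": sum(c >= ci_level for c in coverages) / len(coverages),
}
print(stats["fraction_above_target"])  # 0.8
```

Four of the five coverages are at or above 0.95, so the fraction is 0.8; a well-calibrated procedure would be expected to push this fraction toward the nominal level as BigN grows.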

Parameters:

simulator (DataGenerator) – Simulator for generating calibration and test data (e.g., BinaryClassifierSimulator)
n_calibration (int) – Size of each calibration dataset
BigN (int) – Number of different calibration datasets to test
alpha_target (float or dict[int, float], default=0.10) – Target miscoverage rate per class
delta (float or dict[int, float], default=0.10) – PAC risk tolerance for threshold calibration
test_size (int, default=1000) – Size of each test set in validation
n_trials (int, default=1000) – Number of test sets per calibration dataset (for validation)
ci_level (float, default=0.95) – Nominal confidence level for prediction intervals (target to validate)
use_loo_correction (bool, default=True) – Use LOO-corrected bounds
prediction_method (str, default="all") – Method for bounds computation (“all” to compare all methods)
loo_inflation_factor (float, optional) – Manual override for LOO inflation factor
seed (int, optional) – Random seed for reproducibility
n_jobs (int, default=-1) – Number of parallel jobs (-1 = all cores)
verbose (bool, default=False) – If True, print progress for each calibration dataset

Returns:

Meta-validation results with keys:

- ‘n_calibrations’: BigN
- ‘n_calibration’: Calibration dataset size
- ‘n_trials_per_calibration’: n_trials
- ‘ci_level’: Target confidence level
- ‘marginal’: Dict with coverage statistics per method
- ‘class_0’: Dict with coverage statistics per method
- ‘class_1’: Dict with coverage statistics per method

Each scope contains ‘singleton’, ‘doublet’, ‘abstention’, ‘singleton_error’ dicts with:

- ‘selected’: Coverage stats for selected bounds
- ‘analytical’: Coverage stats if available
- ‘exact’: Coverage stats if available
- ‘hoeffding’: Coverage stats if available

Each method has:

- ‘coverages’: Array of empirical coverages across BigN calibrations
- ‘mean’: Mean coverage
- ‘median’: Median coverage
- ‘quantiles’: {q05, q25, q50, q75, q95}
- ‘fraction_above_target’: Fraction achieving ≥ci_level
- ‘fraction_above_95pct’: Fraction achieving ≥95% (for comparison)

Return type:

dict

Examples

>>> from ssbc import BinaryClassifierSimulator, validate_prediction_interval_calibration
>>> sim = BinaryClassifierSimulator(p_class1=0.2, seed=42)
>>> results = validate_prediction_interval_calibration(
...     simulator=sim,
...     n_calibration=100,
...     BigN=50,
...     n_trials=500,
...     verbose=False
... )
>>> print(f"Fraction achieving ≥95%: {results['marginal']['singleton']['selected']['fraction_above_target']:.1%}")