ssbc.validation_pkg

Validation API facade.

Provides a stable package path for validation utilities.

ssbc.validation_pkg.get_calibration_bounds_dataframe(results, scope=None, metric=None)

Extract calibration bounds and observed quantiles as DataFrame.

Converts the raw calibration data from validate_prediction_interval_calibration() into a pandas DataFrame format for easy plotting and analysis.

Parameters:
  • results (dict) – Output from validate_prediction_interval_calibration()

  • scope (str, optional) – Filter to specific scope: “marginal”, “class_0”, or “class_1”. If None, includes all scopes.

  • metric (str, optional) – Filter to specific metric: “singleton”, “doublet”, “abstention”, “singleton_error”. If None, includes all metrics.

Returns:

Pandas DataFrame with columns:
  • calibration_idx: Index of calibration dataset (0 to BigN-1)

  • scope: marginal, class_0, or class_1

  • metric: singleton, doublet, abstention, singleton_error

  • observed_q05: 5th percentile of test set rates

  • observed_q95: 95th percentile of test set rates

  • selected_lower: Lower bound from selected method

  • selected_upper: Upper bound from selected method

  • analytical_lower: Lower bound from analytical method (NaN if not available)

  • analytical_upper: Upper bound from analytical method (NaN if not available)

  • exact_lower: Lower bound from exact method (NaN if not available)

  • exact_upper: Upper bound from exact method (NaN if not available)

  • hoeffding_lower: Lower bound from hoeffding method (NaN if not available)

  • hoeffding_upper: Upper bound from hoeffding method (NaN if not available)

Return type:

DataFrame

Examples

>>> import pandas as pd
>>> from ssbc import get_calibration_bounds_dataframe, validate_prediction_interval_calibration
>>> results = validate_prediction_interval_calibration(...)
>>> df = get_calibration_bounds_dataframe(results)
>>> # Filter to singleton marginal
>>> df_single = df[(df['scope'] == 'marginal') & (df['metric'] == 'singleton')]
>>> # Plot lower bounds
>>> import matplotlib.pyplot as plt
>>> plt.scatter(df_single['analytical_lower'], df_single['observed_q05'])
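The scope and metric filters can be pictured as plain row selection in which None means "keep everything". The sketch below is not the library's implementation: it uses a hypothetical list of dict rows (made-up values) and a hypothetical helper, filter_rows, to illustrate the documented semantics with the standard library only.

```python
from typing import Optional

# Hypothetical rows mimicking a few of the documented columns (values are made up).
rows = [
    {"calibration_idx": 0, "scope": "marginal", "metric": "singleton", "observed_q05": 0.61},
    {"calibration_idx": 0, "scope": "class_0", "metric": "singleton", "observed_q05": 0.58},
    {"calibration_idx": 0, "scope": "marginal", "metric": "doublet", "observed_q05": 0.20},
    {"calibration_idx": 1, "scope": "marginal", "metric": "singleton", "observed_q05": 0.63},
]


def filter_rows(rows: list, scope: Optional[str] = None, metric: Optional[str] = None) -> list:
    """Keep rows matching scope and metric; None leaves that dimension unfiltered."""
    return [
        r for r in rows
        if (scope is None or r["scope"] == scope)
        and (metric is None or r["metric"] == metric)
    ]


marginal_singleton = filter_rows(rows, scope="marginal", metric="singleton")
all_singleton = filter_rows(rows, metric="singleton")
print(len(marginal_singleton), len(all_singleton), len(filter_rows(rows)))
```

On a real DataFrame the same selection is the boolean mask shown in the example above.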
ssbc.validation_pkg.plot_calibration_excess(df, scope=None, metric=None, methods=None, figsize=(14, 6), bins=30, return_fig=False, filename=None)

Plot excess (difference between observed and predicted quantiles).

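Creates histograms showing:
  • Lower excess: observed_q05 - predicted_lower (positive = predicted lower bound lies below the observed quantile, i.e. conservative)

  • Upper excess: predicted_upper - observed_q95 (positive = predicted upper bound lies above the observed quantile, i.e. conservative)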

Parameters:
  • df (DataFrame) – Output from get_calibration_bounds_dataframe()

  • scope (str, optional) – Filter to specific scope: “marginal”, “class_0”, or “class_1”. If None, uses all scopes (creates separate subplots).

  • metric (str, optional) – Filter to specific metric: “singleton”, “doublet”, “abstention”, “singleton_error”. If None, uses all metrics (creates separate subplots).

  • methods (list[str], optional) – Methods to plot: [“analytical”, “exact”, “hoeffding”]. If None, plots all available methods.

  • figsize (tuple[int, int], default=(14, 6)) – Figure size (width, height in inches)

  • bins (int, default=30) – Number of histogram bins

  • return_fig (bool, default=False) – If True, returns matplotlib Figure object. If False, calls plt.show()

  • filename (str, optional) – If provided, saves the plot to this filename (e.g., “excess_plot.png”). Supports common formats: .png, .pdf, .svg, .jpg, etc.

Returns:

If return_fig=True, returns Figure object. Otherwise None.

Return type:

Figure or None

Examples

>>> from ssbc import get_calibration_bounds_dataframe, plot_calibration_excess, validate_prediction_interval_calibration
>>> results = validate_prediction_interval_calibration(...)
>>> df = get_calibration_bounds_dataframe(results)
>>> # Plot for singleton marginal
>>> df_single = df[(df['scope'] == 'marginal') & (df['metric'] == 'singleton')]
>>> plot_calibration_excess(df_single, scope='marginal', metric='singleton')
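The two excess definitions for this function reduce to simple differences per row and per method. The following is a minimal sketch of that arithmetic, assuming a hypothetical row with made-up values and a hypothetical helper, excess_for. It does not call ssbc and is not the library's code.

```python
import math

# Hypothetical single row with a few of the documented columns (values are made up).
row = {
    "observed_q05": 0.60, "observed_q95": 0.80,
    "analytical_lower": 0.55, "analytical_upper": 0.83,
    "hoeffding_lower": 0.62, "hoeffding_upper": 0.90,
    "exact_lower": math.nan, "exact_upper": math.nan,  # method not available
}


def excess_for(row: dict, method: str) -> tuple:
    """Return (lower_excess, upper_excess) for one method."""
    lower = row["observed_q05"] - row[f"{method}_lower"]
    upper = row[f"{method}_upper"] - row["observed_q95"]
    return lower, upper


for method in ("analytical", "exact", "hoeffding"):
    lower, upper = excess_for(row, method)
    if math.isnan(lower) or math.isnan(upper):
        print(f"{method}: not available")
        continue
    # Positive excess: the bound sits outside the observed quantile (conservative).
    # Negative excess: the bound cuts inside the observed quantile (risky).
    print(f"{method}: lower excess {lower:+.3f}, upper excess {upper:+.3f}")
```

In this made-up row the hoeffding lower bound exceeds observed_q05, so its lower excess is negative, which is the case the histograms are meant to expose.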
ssbc.validation_pkg.plot_validation_bounds(validation, metric='singleton', show_detail=True, main_figsize=(18, 5), detail_figsize=(18, 12), bins=50, method_colors=None, return_figs=False)

Plot empirical distributions with prediction interval bounds for all methods.

Creates visualization comparing empirical rates against bounds from analytical, exact, and hoeffding methods when available.

Parameters:
  • validation (dict) – Output from validate_pac_bounds() containing validation results

  • metric (str, default="singleton") – Which metric to plot. Options: “singleton”, “doublet”, “abstention”, “singleton_error”

  • show_detail (bool, default=True) – If True, also create detailed 3x3 grid showing each method separately

  • main_figsize (tuple[int, int], default=(18, 5)) – Figure size for main comparison plot (width, height in inches)

  • detail_figsize (tuple[int, int], default=(18, 12)) – Figure size for detailed method comparison grid (width, height in inches)

  • bins (int, default=50) – Number of bins for histograms

  • method_colors (dict or None, default=None) – Custom colors and linestyles for methods. Dict mapping method names to (color, linestyle) tuples. If None, uses the default colors: “analytical”: (“#2E86AB”, “solid”) (blue); “exact”: (“#A23B72”, “dashed”) (purple); “hoeffding”: (“#F18F01”, “dashdot”) (orange).

  • return_figs (bool, default=False) – If True, returns matplotlib Figure objects for further customization. Returns (fig_main, fig_detail) or (fig_main, None) if show_detail=False. If False, calls plt.show() and returns None.

Returns:

If return_figs=True:
  • (fig_main, fig_detail) if show_detail=True

  • (fig_main, None) if show_detail=False

If return_figs=False: None (displays plots directly)

Return type:

tuple or None

Examples

>>> from ssbc import validate_pac_bounds, plot_validation_bounds
>>> validation = validate_pac_bounds(report, sim, test_size=1000, n_trials=1000)
>>> plot_validation_bounds(validation, metric="singleton")
>>> # Or get figure objects for customization
>>> fig_main, fig_detail = plot_validation_bounds(
...     validation, metric="singleton", return_figs=True
... )
>>> fig_main.savefig("validation_main.png")

Notes

The main plot shows all three methods overlaid on the same histogram for easy comparison. The detailed plot shows each method separately in a 3x3 grid. Both plots include:
  • Empirical distribution histogram

  • Method-specific bounds (when method comparison available)

  • Expected value from LOO-CV

  • Empirical mean from validation trials

  • Coverage percentages for each method
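The per-method coverage percentages can also be read from the validation dict without plotting. The sketch below assumes a hypothetical dict that follows the layout documented for validate_pac_bounds(). The numbers are made up, the tuple form of 'bounds' is an assumption, and coverage_by_method is a hypothetical helper, not part of ssbc.

```python
# Hypothetical dict shaped like the documented validate_pac_bounds() output.
# All numbers are made up; storing 'bounds' as a (lower, upper) tuple is an assumption.
validation = {
    "marginal": {
        "singleton": {
            "mean": 0.71,
            "expected": 0.70,
            "bounds": (0.62, 0.79),
            "empirical_coverage": 0.96,
            "method_validations": {
                "analytical": {"bounds": (0.62, 0.79), "empirical_coverage": 0.96},
                "exact": {"bounds": (0.60, 0.81), "empirical_coverage": 0.98},
                "hoeffding": {"bounds": (0.55, 0.86), "empirical_coverage": 1.00},
            },
        }
    }
}


def coverage_by_method(validation: dict, scope: str = "marginal", metric: str = "singleton") -> dict:
    """Map method name to empirical coverage; empty when no method comparison exists."""
    entry = validation[scope][metric]
    methods = entry.get("method_validations") or {}
    return {name: info["empirical_coverage"] for name, info in methods.items()}


for name, cov in coverage_by_method(validation).items():
    print(f"{name}: {cov:.1%}")
```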

ssbc.validation_pkg.print_calibration_validation_results(results)

Pretty print meta-validation results.

Parameters:

results (dict) – Output from validate_prediction_interval_calibration()

Return type:

None

Examples

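>>> from ssbc.validation_pkg import print_calibration_validation_results, validate_prediction_interval_calibration
>>> results = validate_prediction_interval_calibration(...)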
>>> print_calibration_validation_results(results)
ssbc.validation_pkg.print_validation_results(validation)

Pretty print validation results.

Parameters:

validation (dict) – Output from validate_pac_bounds()

Return type:

None

Examples

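>>> from ssbc.validation_pkg import print_validation_results, validate_pac_bounds
>>> # report and sim as constructed in the validate_pac_bounds() example
>>> validation = validate_pac_bounds(report, sim, test_size=1000, n_trials=1000)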
>>> print_validation_results(validation)
ssbc.validation_pkg.tabulate_calibration_excess(df, scope=None, metric=None, methods=None)

Tabulate excess values for all methods as a DataFrame.

Computes excess values (difference between observed and predicted quantiles) for each method and returns a structured DataFrame with statistics.

Parameters:
  • df (DataFrame) – Output from get_calibration_bounds_dataframe()

  • scope (str, optional) – Filter to specific scope: “marginal”, “class_0”, or “class_1”. If None, includes all scopes.

  • metric (str, optional) – Filter to specific metric: “singleton”, “doublet”, “abstention”, “singleton_error”. If None, includes all metrics.

  • methods (list[str], optional) – Methods to include: [“analytical”, “exact”, “hoeffding”, “selected”]. If None, includes all available methods.

Returns:

DataFrame with columns:
  • method: Method name (analytical, exact, hoeffding, selected)

  • bound_type: “lower” or “upper”

  • excess_mean: Mean excess value

  • excess_std: Standard deviation of excess

  • excess_min: Minimum excess value

  • excess_max: Maximum excess value

  • excess_q05: 5th percentile of excess

  • excess_q25: 25th percentile of excess

  • excess_q50: 50th percentile (median) of excess

  • excess_q75: 75th percentile of excess

  • excess_q95: 95th percentile of excess

  • n_negative: Number of negative excess values (risky)

  • n_positive: Number of positive excess values (conservative)

  • pct_negative: Percentage of negative excess values

  • n_valid: Number of valid (non-NaN) excess values

  • scope: Scope name (if filtered)

  • metric: Metric name (if filtered)

Return type:

DataFrame

Examples

>>> from ssbc import get_calibration_bounds_dataframe, tabulate_calibration_excess, validate_prediction_interval_calibration
>>> results = validate_prediction_interval_calibration(...)
>>> df = get_calibration_bounds_dataframe(results)
>>> # Tabulate excess for singleton marginal
>>> df_single = df[(df['scope'] == 'marginal') & (df['metric'] == 'singleton')]
>>> excess_table = tabulate_calibration_excess(df_single, scope='marginal', metric='singleton')
>>> print(excess_table)
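The summary columns can be illustrated on a single list of excess values. This is a minimal standard-library sketch, not the library's implementation: summarize_excess is a hypothetical helper, the input values are made up, and three choices are assumptions (sample standard deviation, linear-interpolation percentiles, and pct_negative on a 0 to 100 scale).

```python
import math
import statistics


def summarize_excess(values: list) -> dict:
    """Summarize one method/bound_type series, ignoring NaN entries."""
    valid = [v for v in values if not math.isnan(v)]
    n_valid = len(valid)
    n_negative = sum(v < 0 for v in valid)
    n_positive = sum(v > 0 for v in valid)
    # n=20 yields cut points at 5%, 10%, ..., 95%; n=4 yields the quartiles.
    ventiles = statistics.quantiles(valid, n=20, method="inclusive")
    quartiles = statistics.quantiles(valid, n=4, method="inclusive")
    return {
        "excess_mean": statistics.fmean(valid),
        "excess_std": statistics.stdev(valid),  # assumption: sample std (ddof=1)
        "excess_min": min(valid),
        "excess_max": max(valid),
        "excess_q05": ventiles[0],
        "excess_q25": quartiles[0],
        "excess_q50": quartiles[1],
        "excess_q75": quartiles[2],
        "excess_q95": ventiles[-1],
        "n_negative": n_negative,
        "n_positive": n_positive,
        "pct_negative": 100.0 * n_negative / n_valid,  # assumption: 0-100 scale
        "n_valid": n_valid,
    }


# Made-up lower-excess values; the NaN stands for an unavailable bound.
lower_excess = [0.04, 0.01, -0.02, 0.03, math.nan, 0.05]
summary = summarize_excess(lower_excess)
print(summary["n_valid"], summary["n_negative"], summary["pct_negative"])
```

A non-zero n_negative flags calibrations where the predicted bound fell inside the observed quantile, which is the risky case.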
ssbc.validation_pkg.validate_pac_bounds(report, simulator, test_size, n_trials=1000, seed=None, verbose=True, n_jobs=-1)

Empirically validate prediction interval operational bounds.

Takes a PAC report from generate_rigorous_pac_report() and validates that the theoretical bounds actually hold in practice by:
  1. Extracting the FIXED thresholds from calibration

  2. Running n_trials simulations with fresh test sets

  3. Measuring empirical coverage of all reported bounds (analytical, exact, hoeffding)

When the report includes method comparison (prediction_method=“all”), validates all three methods separately. Otherwise, validates only the selected method.

Parameters:
  • report (dict) – Output from generate_rigorous_pac_report()

  • simulator (DataGenerator) – Simulator to generate independent test data (e.g., BinaryClassifierSimulator)

  • test_size (int) – Size of each test set

  • n_trials (int, default=1000) – Number of independent trials

  • seed (int, optional) – Random seed for reproducibility

  • verbose (bool, default=True) – Print validation progress

  • n_jobs (int, default=-1) – Number of parallel jobs for trial execution. -1 = use all cores (default), 1 = single-threaded, N = use N cores.

Returns:

Validation results with:
  • ‘marginal’: Marginal operational rates and coverage

  • ‘class_0’: Class 0 operational rates and coverage

  • ‘class_1’: Class 1 operational rates and coverage

Each contains ‘singleton’, ‘doublet’, ‘abstention’, and ‘singleton_error’ dicts with:
  • ‘rates’: Array of rates across trials

  • ‘mean’: Mean rate

  • ‘quantiles’: Quantiles (5%, 25%, 50%, 75%, 95%)

  • ‘bounds’: Selected/default bounds from report

  • ‘expected’: Expected rate from report

  • ‘empirical_coverage’: Fraction of trials within selected bounds

  • ‘method_validations’: Dict of method-specific validations (when available), keyed by ‘analytical’, ‘exact’, and ‘hoeffding’, each of the form {bounds, empirical_coverage}

Return type:

dict

Examples

>>> from ssbc import BinaryClassifierSimulator, generate_rigorous_pac_report, validate_pac_bounds
>>> sim = BinaryClassifierSimulator(p_class1=0.2, seed=42)
>>> labels, probs = sim.generate(100)
>>> report = generate_rigorous_pac_report(labels, probs, delta=0.10)
>>> validation = validate_pac_bounds(report, sim, test_size=1000, n_trials=1000)
>>> print(f"Singleton coverage: {validation['marginal']['singleton']['empirical_coverage']:.1%}")

Notes

This function is useful for:
  • Verifying theoretical PAC guarantees empirically

  • Understanding the tightness of bounds

  • Debugging issues with bounds calculation

  • Generating validation plots for papers/reports

The empirical coverage should be ≥ PAC level (1 - δ) for rigorous bounds.
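The coverage check can be sketched as a small Monte Carlo loop. The code below is an illustration under stated assumptions, not the library's procedure: a fixed Bernoulli rate stands in for the simulator plus fixed thresholds, the bounds are made-up numbers rather than output of generate_rigorous_pac_report(), and simulate_rates and empirical_coverage are hypothetical helpers.

```python
import random


def simulate_rates(true_rate: float, test_size: int, n_trials: int, seed=None) -> list:
    """Draw n_trials fresh test sets and return the observed rate in each."""
    rng = random.Random(seed)
    rates = []
    for _ in range(n_trials):
        hits = sum(rng.random() < true_rate for _ in range(test_size))
        rates.append(hits / test_size)
    return rates


def empirical_coverage(rates: list, lower: float, upper: float) -> float:
    """Fraction of trials whose rate falls inside [lower, upper]."""
    inside = sum(lower <= r <= upper for r in rates)
    return inside / len(rates)


delta = 0.10  # PAC risk tolerance, as in the example above
rates = simulate_rates(true_rate=0.70, test_size=500, n_trials=400, seed=0)
coverage = empirical_coverage(rates, lower=0.64, upper=0.76)  # made-up bounds
print(f"coverage={coverage:.1%}, target={1 - delta:.0%}, ok={coverage >= 1 - delta}")
```

Bounds that are too tight would show up here as a coverage below 1 - δ.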

ssbc.validation_pkg.validate_prediction_interval_calibration(simulator, n_calibration, BigN, alpha_target=0.1, delta=0.1, test_size=1000, n_trials=1000, ci_level=0.95, use_loo_correction=True, prediction_method='all', loo_inflation_factor=None, seed=None, n_jobs=-1, verbose=False)

Validate that prediction interval confidence level holds across calibration datasets.

This meta-validation checks if the nominal confidence level (e.g., 95%) actually holds when repeating the entire calibration+validation process many times with different calibration datasets.

For each of BigN calibration datasets:
  1. Generate random calibration data

  2. Compute prediction interval bounds

  3. Validate bounds with many test sets

  4. Record empirical coverage

Then aggregates results to see if ~95% of calibrations achieve ≥95% coverage.

Parameters:
  • simulator (DataGenerator) – Simulator for generating calibration and test data (e.g., BinaryClassifierSimulator)

  • n_calibration (int) – Size of each calibration dataset

  • BigN (int) – Number of different calibration datasets to test

  • alpha_target (float or dict[int, float], default=0.10) – Target miscoverage rate per class

  • delta (float or dict[int, float], default=0.10) – PAC risk tolerance for threshold calibration

  • test_size (int, default=1000) – Size of each test set in validation

  • n_trials (int, default=1000) – Number of test sets per calibration dataset (for validation)

  • ci_level (float, default=0.95) – Nominal confidence level for prediction intervals (target to validate)

  • use_loo_correction (bool, default=True) – Use LOO-corrected bounds

  • prediction_method (str, default="all") – Method for bounds computation (“all” to compare all methods)

  • loo_inflation_factor (float, optional) – Manual override for LOO inflation factor

  • seed (int, optional) – Random seed for reproducibility

  • n_jobs (int, default=-1) – Number of parallel jobs (-1 = all cores)

  • verbose (bool, default=False) – If True, print progress for each calibration dataset

Returns:

Meta-validation results with keys:
  • ‘n_calibrations’: BigN

  • ‘n_calibration’: Calibration dataset size

  • ‘n_trials_per_calibration’: n_trials

  • ‘ci_level’: Target confidence level

  • ‘marginal’: Dict with coverage statistics per method

  • ‘class_0’: Dict with coverage statistics per method

  • ‘class_1’: Dict with coverage statistics per method

Each scope contains ‘singleton’, ‘doublet’, ‘abstention’, and ‘singleton_error’ dicts with:
  • ‘selected’: Coverage stats for selected bounds

  • ‘analytical’: Coverage stats if available

  • ‘exact’: Coverage stats if available

  • ‘hoeffding’: Coverage stats if available

Each method has:
  • ‘coverages’: Array of empirical coverages across BigN calibrations

  • ‘mean’: Mean coverage

  • ‘median’: Median coverage

  • ‘quantiles’: {q05, q25, q50, q75, q95}

  • ‘fraction_above_target’: Fraction achieving ≥ci_level

  • ‘fraction_above_95pct’: Fraction achieving ≥95% (for comparison)

Return type:

dict

Examples

>>> from ssbc import BinaryClassifierSimulator, validate_prediction_interval_calibration
>>> sim = BinaryClassifierSimulator(p_class1=0.2, seed=42)
>>> results = validate_prediction_interval_calibration(
...     simulator=sim,
...     n_calibration=100,
...     BigN=50,
...     n_trials=500,
...     verbose=False
... )
>>> print(f"Fraction achieving ≥95%: {results['marginal']['singleton']['selected']['fraction_above_target']:.1%}")
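The aggregation step, which turns one empirical coverage per calibration dataset into the per-method statistics listed under Returns, can be sketched as follows. This is a standard-library illustration with made-up coverages. summarize_coverages is a hypothetical helper and the linear-interpolation percentile rule is an assumption, so the numbers need not match the library's output exactly.

```python
import statistics


def summarize_coverages(coverages: list, ci_level: float = 0.95) -> dict:
    """Aggregate per-calibration coverages into the documented summary keys."""
    ventiles = statistics.quantiles(coverages, n=20, method="inclusive")
    quartiles = statistics.quantiles(coverages, n=4, method="inclusive")
    n = len(coverages)
    return {
        "coverages": list(coverages),
        "mean": statistics.fmean(coverages),
        "median": statistics.median(coverages),
        "quantiles": {
            "q05": ventiles[0], "q25": quartiles[0], "q50": quartiles[1],
            "q75": quartiles[2], "q95": ventiles[-1],
        },
        # Share of calibration datasets whose coverage reached the nominal level.
        "fraction_above_target": sum(c >= ci_level for c in coverages) / n,
        "fraction_above_95pct": sum(c >= 0.95 for c in coverages) / n,
    }


# Made-up coverages, one per calibration dataset (BigN = 8 here).
coverages = [0.97, 0.94, 0.99, 0.96, 0.92, 0.98, 0.95, 0.97]
stats = summarize_coverages(coverages, ci_level=0.95)
print(f"Fraction achieving >= 95%: {stats['fraction_above_target']:.1%}")
```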