ssbc.validation_pkg
Validation API facade.
Provides a stable package path for validation utilities.
- ssbc.validation_pkg.get_calibration_bounds_dataframe(results, scope=None, metric=None)
Extract calibration bounds and observed quantiles as DataFrame.
Converts the raw calibration data from validate_prediction_interval_calibration() into a pandas DataFrame format for easy plotting and analysis.
- Parameters:
results (dict) – Output from validate_prediction_interval_calibration()
scope (str, optional) – Filter to specific scope: “marginal”, “class_0”, or “class_1”. If None, includes all scopes.
metric (str, optional) – Filter to specific metric: “singleton”, “doublet”, “abstention”, “singleton_error”. If None, includes all metrics.
- Returns:
Pandas DataFrame with columns:
- calibration_idx: Index of calibration dataset (0 to BigN-1)
- scope: marginal, class_0, or class_1
- metric: singleton, doublet, abstention, singleton_error
- observed_q05: 5th percentile of test set rates
- observed_q95: 95th percentile of test set rates
- selected_lower: Lower bound from selected method
- selected_upper: Upper bound from selected method
- analytical_lower: Lower bound from analytical method (NaN if not available)
- analytical_upper: Upper bound from analytical method (NaN if not available)
- exact_lower: Lower bound from exact method (NaN if not available)
- exact_upper: Upper bound from exact method (NaN if not available)
- hoeffding_lower: Lower bound from hoeffding method (NaN if not available)
- hoeffding_upper: Upper bound from hoeffding method (NaN if not available)
- Return type:
DataFrame
Examples
>>> import pandas as pd
>>> from ssbc import get_calibration_bounds_dataframe, validate_prediction_interval_calibration
>>> results = validate_prediction_interval_calibration(...)
>>> df = get_calibration_bounds_dataframe(results)
>>> # Filter to singleton marginal
>>> df_single = df[(df['scope'] == 'marginal') & (df['metric'] == 'singleton')]
>>> # Plot lower bounds
>>> import matplotlib.pyplot as plt
>>> plt.scatter(df_single['analytical_lower'], df_single['observed_q05'])
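The scope and metric filters described above can be pictured with plain Python rows standing in for the DataFrame. This is a minimal sketch, not the ssbc implementation: the helper `filter_rows` and the sample values are hypothetical, and only the column names come from the documentation above.

```python
# Hypothetical rows using the documented column names; the numbers are made up.
rows = [
    {"calibration_idx": 0, "scope": "marginal", "metric": "singleton",
     "observed_q05": 0.70, "observed_q95": 0.80},
    {"calibration_idx": 0, "scope": "class_0", "metric": "singleton",
     "observed_q05": 0.65, "observed_q95": 0.78},
    {"calibration_idx": 0, "scope": "marginal", "metric": "doublet",
     "observed_q05": 0.10, "observed_q95": 0.20},
]


def filter_rows(rows, scope=None, metric=None):
    """Keep rows matching scope and metric; None means 'do not filter'."""
    return [
        r for r in rows
        if (scope is None or r["scope"] == scope)
        and (metric is None or r["metric"] == metric)
    ]


marginal_singleton = filter_rows(rows, scope="marginal", metric="singleton")
all_marginal = filter_rows(rows, scope="marginal")
everything = filter_rows(rows)
```

Passing `None` for either argument leaves that dimension unfiltered, which mirrors the "If None, includes all" wording in the parameter descriptions.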
- ssbc.validation_pkg.plot_calibration_excess(df, scope=None, metric=None, methods=None, figsize=(14, 6), bins=30, return_fig=False, filename=None)
Plot excess (difference between observed and predicted quantiles).
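Creates histograms showing:
- Lower excess: observed_q05 - predicted_lower (positive = predicted lower bound lies below the observed quantile, i.e. conservative)
- Upper excess: predicted_upper - observed_q95 (positive = predicted upper bound lies above the observed quantile, i.e. conservative)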
- Parameters:
df (DataFrame) – Output from get_calibration_bounds_dataframe()
scope (str, optional) – Filter to specific scope: “marginal”, “class_0”, or “class_1”. If None, uses all scopes (creates separate subplots).
metric (str, optional) – Filter to specific metric: “singleton”, “doublet”, “abstention”, “singleton_error”. If None, uses all metrics (creates separate subplots).
methods (list[str], optional) – Methods to plot: [“analytical”, “exact”, “hoeffding”]. If None, plots all available methods.
figsize (tuple[int, int], default=(14, 6)) – Figure size (width, height in inches)
bins (int, default=30) – Number of histogram bins
return_fig (bool, default=False) – If True, returns matplotlib Figure object. If False, calls plt.show()
filename (str, optional) – If provided, saves the plot to this filename (e.g., “excess_plot.png”). Supports common formats: .png, .pdf, .svg, .jpg, etc.
- Returns:
If return_fig=True, returns Figure object. Otherwise None.
- Return type:
Figure or None
Examples
>>> from ssbc import get_calibration_bounds_dataframe, plot_calibration_excess
>>> from ssbc import validate_prediction_interval_calibration
>>> results = validate_prediction_interval_calibration(...)
>>> df = get_calibration_bounds_dataframe(results)
>>> # Plot for singleton marginal
>>> df_single = df[(df['scope'] == 'marginal') & (df['metric'] == 'singleton')]
>>> plot_calibration_excess(df_single, scope='marginal', metric='singleton')
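The two excess quantities defined above can be computed by hand from the documented columns. The sketch below is illustrative only: the helper `excess` and the sample rows are hypothetical, and skipping NaN bounds is an assumption about how unavailable methods might be handled rather than documented ssbc behavior.

```python
import math

# Hypothetical rows, one per calibration dataset, using documented column names.
rows = [
    {"observed_q05": 0.70, "observed_q95": 0.80,
     "analytical_lower": 0.65, "analytical_upper": 0.85},
    {"observed_q05": 0.68, "observed_q95": 0.82,
     "analytical_lower": 0.70, "analytical_upper": 0.81},
    {"observed_q05": 0.71, "observed_q95": 0.79,
     "analytical_lower": math.nan, "analytical_upper": math.nan},
]


def excess(rows, method):
    """Return (lower_excess, upper_excess) lists, skipping rows with NaN bounds."""
    lower, upper = [], []
    for r in rows:
        lo, hi = r[f"{method}_lower"], r[f"{method}_upper"]
        if math.isnan(lo) or math.isnan(hi):
            continue  # assumption: method not available for this calibration
        lower.append(r["observed_q05"] - lo)  # > 0: bound sits below observed q05
        upper.append(hi - r["observed_q95"])  # > 0: bound sits above observed q95
    return lower, upper


lower_excess, upper_excess = excess(rows, "analytical")
```

In the second row both excess values are negative, meaning the observed 5th-to-95th percentile range extends past the predicted interval on both sides.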
- ssbc.validation_pkg.plot_validation_bounds(validation, metric='singleton', show_detail=True, main_figsize=(18, 5), detail_figsize=(18, 12), bins=50, method_colors=None, return_figs=False)
Plot empirical distributions with prediction interval bounds for all methods.
Creates visualization comparing empirical rates against bounds from analytical, exact, and hoeffding methods when available.
- Parameters:
validation (dict) – Output from validate_pac_bounds() containing validation results
metric (str, default="singleton") – Which metric to plot. Options: “singleton”, “doublet”, “abstention”, “singleton_error”
show_detail (bool, default=True) – If True, also create detailed 3x3 grid showing each method separately
main_figsize (tuple[int, int], default=(18, 5)) – Figure size for main comparison plot (width, height in inches)
detail_figsize (tuple[int, int], default=(18, 12)) – Figure size for detailed method comparison grid (width, height in inches)
bins (int, default=50) – Number of bins for histograms
method_colors (dict or None, default=None) – Custom colors and linestyles for methods. Dict mapping method names to (color, linestyle) tuples. If None, uses default colors:
- "analytical": ("#2E86AB", "solid") # Blue
- "exact": ("#A23B72", "dashed") # Purple
- "hoeffding": ("#F18F01", "dashdot") # Orange
return_figs (bool, default=False) – If True, returns matplotlib Figure objects for further customization. Returns (fig_main, fig_detail) or (fig_main, None) if show_detail=False. If False, calls plt.show() and returns None.
- Returns:
- If return_figs=True: (fig_main, fig_detail) if show_detail=True, or (fig_main, None) if show_detail=False
- If return_figs=False: None (displays plots directly)
- Return type:
tuple or None
Examples
>>> from ssbc import validate_pac_bounds, plot_validation_bounds
>>> validation = validate_pac_bounds(report, sim, test_size=1000, n_trials=1000)
>>> plot_validation_bounds(validation, metric="singleton")
>>> # Or get figure objects for customization
>>> fig_main, fig_detail = plot_validation_bounds(
...     validation, metric="singleton", return_figs=True
... )
>>> fig_main.savefig("validation_main.png")
Notes
The main plot shows all three methods overlaid on the same histogram for easy comparison. The detailed plot shows each method separately in a 3x3 grid. Both plots include:
- Empirical distribution histogram
- Method-specific bounds (when method comparison available)
- Expected value from LOO-CV
- Empirical mean from validation trials
- Coverage percentages for each method
- ssbc.validation_pkg.print_calibration_validation_results(results)
Pretty print meta-validation results.
- Parameters:
results (dict) – Output from validate_prediction_interval_calibration()
- Return type:
None
Examples
>>> results = validate_prediction_interval_calibration(...)
>>> print_calibration_validation_results(results)
- ssbc.validation_pkg.print_validation_results(validation)
Pretty print validation results.
- Parameters:
validation (dict) – Output from validate_pac_bounds()
- Return type:
None
Examples
>>> validation = validate_pac_bounds(report, sim, test_size=1000, n_trials=1000)
>>> print_validation_results(validation)
- ssbc.validation_pkg.tabulate_calibration_excess(df, scope=None, metric=None, methods=None)
Tabulate excess values for all methods as a DataFrame.
Computes excess values (difference between observed and predicted quantiles) for each method and returns a structured DataFrame with statistics.
- Parameters:
df (DataFrame) – Output from get_calibration_bounds_dataframe()
scope (str, optional) – Filter to specific scope: “marginal”, “class_0”, or “class_1”. If None, includes all scopes.
metric (str, optional) – Filter to specific metric: “singleton”, “doublet”, “abstention”, “singleton_error”. If None, includes all metrics.
methods (list[str], optional) – Methods to include: [“analytical”, “exact”, “hoeffding”, “selected”]. If None, includes all available methods.
- Returns:
DataFrame with columns:
- method: Method name (analytical, exact, hoeffding, selected)
- bound_type: "lower" or "upper"
- excess_mean: Mean excess value
- excess_std: Standard deviation of excess
- excess_min: Minimum excess value
- excess_max: Maximum excess value
- excess_q05: 5th percentile of excess
- excess_q25: 25th percentile of excess
- excess_q50: 50th percentile (median) of excess
- excess_q75: 75th percentile of excess
- excess_q95: 95th percentile of excess
- n_negative: Number of negative excess values (risky)
- n_positive: Number of positive excess values (conservative)
- pct_negative: Percentage of negative excess values
- n_valid: Number of valid (non-NaN) excess values
- scope: Scope name (if filtered)
- metric: Metric name (if filtered)
- Return type:
DataFrame
Examples
>>> from ssbc import get_calibration_bounds_dataframe, tabulate_calibration_excess
>>> from ssbc import validate_prediction_interval_calibration
>>> results = validate_prediction_interval_calibration(...)
>>> df = get_calibration_bounds_dataframe(results)
>>> # Tabulate excess for singleton marginal
>>> df_single = df[(df['scope'] == 'marginal') & (df['metric'] == 'singleton')]
>>> excess_table = tabulate_calibration_excess(df_single, scope='marginal', metric='singleton')
>>> print(excess_table)
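To make the summary columns concrete, the sketch below computes a subset of them for a single method and bound type using only the standard library. The helper `summarize_excess` and the input values are hypothetical. The use of the sample standard deviation is an assumption, since the documentation above does not say which estimator ssbc uses.

```python
import statistics

# Hypothetical lower-excess values for one method (made-up numbers).
excess_values = [0.05, -0.02, 0.03, 0.01, -0.01]


def summarize_excess(values):
    """Compute a subset of the documented summary columns for one method."""
    n_valid = len(values)
    n_negative = sum(v < 0 for v in values)  # "risky" cases
    n_positive = sum(v > 0 for v in values)  # "conservative" cases
    return {
        "excess_mean": statistics.fmean(values),
        "excess_std": statistics.stdev(values),  # assumption: sample std
        "excess_min": min(values),
        "excess_max": max(values),
        "excess_q50": statistics.median(values),
        "n_negative": n_negative,
        "n_positive": n_positive,
        "pct_negative": 100.0 * n_negative / n_valid,
        "n_valid": n_valid,
    }


summary = summarize_excess(excess_values)
```

Here two of the five values are negative, so `pct_negative` is 40%, which would flag this hypothetical method as frequently risky on the lower bound.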
- ssbc.validation_pkg.validate_pac_bounds(report, simulator, test_size, n_trials=1000, seed=None, verbose=True, n_jobs=-1)
Empirically validate prediction interval operational bounds.
Takes a PAC report from generate_rigorous_pac_report() and validates that the theoretical bounds actually hold in practice by:
1. Extracting the FIXED thresholds from calibration
2. Running n_trials simulations with fresh test sets
3. Measuring empirical coverage of all reported bounds (analytical, exact, hoeffding)
When the report includes method comparison (prediction_method="all"), validates all three methods separately. Otherwise, validates only the selected method.
- Parameters:
report (dict) – Output from generate_rigorous_pac_report()
simulator (DataGenerator) – Simulator to generate independent test data (e.g., BinaryClassifierSimulator)
test_size (int) – Size of each test set
n_trials (int, default=1000) – Number of independent trials
seed (int, optional) – Random seed for reproducibility
verbose (bool, default=True) – Print validation progress
n_jobs (int, default=-1) – Number of parallel jobs for trial execution. -1 = use all cores (default), 1 = single-threaded, N = use N cores.
- Returns:
Validation results with:
- 'marginal': Marginal operational rates and coverage
- 'class_0': Class 0 operational rates and coverage
- 'class_1': Class 1 operational rates and coverage
Each containing 'singleton', 'doublet', 'abstention', 'singleton_error' dicts with:
- 'rates': Array of rates across trials
- 'mean': Mean rate
- 'quantiles': Quantiles (5%, 25%, 50%, 75%, 95%)
- 'bounds': Selected/default bounds from report
- 'expected': Expected rate from report
- 'empirical_coverage': Fraction of trials within selected bounds
- 'method_validations': Dict of method-specific validations (when available):
  - 'analytical': {bounds, empirical_coverage}
  - 'exact': {bounds, empirical_coverage}
  - 'hoeffding': {bounds, empirical_coverage}
- Return type:
dict
Examples
>>> from ssbc import BinaryClassifierSimulator, generate_rigorous_pac_report, validate_pac_bounds
>>> sim = BinaryClassifierSimulator(p_class1=0.2, seed=42)
>>> labels, probs = sim.generate(100)
>>> report = generate_rigorous_pac_report(labels, probs, delta=0.10)
>>> validation = validate_pac_bounds(report, sim, test_size=1000, n_trials=1000)
>>> print(f"Singleton coverage: {validation['marginal']['singleton']['empirical_coverage']:.1%}")
Notes
This function is useful for:
- Verifying theoretical PAC guarantees empirically
- Understanding the tightness of bounds
- Debugging issues with bounds calculation
- Generating validation plots for papers/reports
The empirical coverage should be ≥ PAC level (1 - δ) for rigorous bounds.
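The coverage check described in these notes can be illustrated without ssbc. The sketch below treats a pair of bounds as given, simulates fresh test sets from a Bernoulli model, and compares the fraction of trials inside the bounds with 1 - δ. The helper `simulate_rate`, the true rate, and the bounds are all hypothetical stand-ins, not values produced by `generate_rigorous_pac_report()`.

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is reproducible

# Hypothetical setup: a true singleton rate and bounds taken as given.
true_rate, test_size, n_trials = 0.75, 1000, 200
lower_bound, upper_bound = 0.70, 0.80  # stand-ins for bounds from a report
delta = 0.10


def simulate_rate(rng, p, n):
    """Fraction of n Bernoulli(p) draws that succeed (one fresh test set)."""
    return sum(rng.random() < p for _ in range(n)) / n


# One observed rate per trial, each from an independent test set.
rates = [simulate_rate(rng, true_rate, test_size) for _ in range(n_trials)]

# Empirical coverage: fraction of trials whose rate falls inside the bounds.
empirical_coverage = sum(lower_bound <= r <= upper_bound for r in rates) / n_trials
meets_pac_level = empirical_coverage >= 1 - delta
```

The bounds stay fixed across trials while only the test data is redrawn, which matches the "FIXED thresholds" and "fresh test sets" steps listed in the function description.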
- ssbc.validation_pkg.validate_prediction_interval_calibration(simulator, n_calibration, BigN, alpha_target=0.1, delta=0.1, test_size=1000, n_trials=1000, ci_level=0.95, use_loo_correction=True, prediction_method='all', loo_inflation_factor=None, seed=None, n_jobs=-1, verbose=False)
Validate that prediction interval confidence level holds across calibration datasets.
This meta-validation checks if the nominal confidence level (e.g., 95%) actually holds when repeating the entire calibration+validation process many times with different calibration datasets.
For each of BigN calibration datasets:
1. Generate random calibration data
2. Compute prediction interval bounds
3. Validate bounds with many test sets
4. Record empirical coverage
The function then aggregates the results to check whether roughly 95% of calibrations achieve at least 95% coverage.
- Parameters:
simulator (DataGenerator) – Simulator for generating calibration and test data (e.g., BinaryClassifierSimulator)
n_calibration (int) – Size of each calibration dataset
BigN (int) – Number of different calibration datasets to test
alpha_target (float or dict[int, float], default=0.10) – Target miscoverage rate per class
delta (float or dict[int, float], default=0.10) – PAC risk tolerance for threshold calibration
test_size (int, default=1000) – Size of each test set in validation
n_trials (int, default=1000) – Number of test sets per calibration dataset (for validation)
ci_level (float, default=0.95) – Nominal confidence level for prediction intervals (target to validate)
use_loo_correction (bool, default=True) – Use LOO-corrected bounds
prediction_method (str, default="all") – Method for bounds computation (“all” to compare all methods)
loo_inflation_factor (float, optional) – Manual override for LOO inflation factor
seed (int, optional) – Random seed for reproducibility
n_jobs (int, default=-1) – Number of parallel jobs (-1 = all cores)
verbose (bool, default=False) – If True, print progress for each calibration dataset
- Returns:
Meta-validation results with keys:
- 'n_calibrations': BigN
- 'n_calibration': Calibration dataset size
- 'n_trials_per_calibration': n_trials
- 'ci_level': Target confidence level
- 'marginal': Dict with coverage statistics per method
- 'class_0': Dict with coverage statistics per method
- 'class_1': Dict with coverage statistics per method
Each scope contains 'singleton', 'doublet', 'abstention', 'singleton_error' dicts with:
- 'selected': Coverage stats for selected bounds
- 'analytical': Coverage stats if available
- 'exact': Coverage stats if available
- 'hoeffding': Coverage stats if available
Each method has:
- 'coverages': Array of empirical coverages across BigN calibrations
- 'mean': Mean coverage
- 'median': Median coverage
- 'quantiles': {q05, q25, q50, q75, q95}
- 'fraction_above_target': Fraction achieving ≥ci_level
- 'fraction_above_95pct': Fraction achieving ≥95% (for comparison)
- Return type:
dict
Examples
>>> from ssbc import BinaryClassifierSimulator, validate_prediction_interval_calibration
>>> sim = BinaryClassifierSimulator(p_class1=0.2, seed=42)
>>> results = validate_prediction_interval_calibration(
...     simulator=sim,
...     n_calibration=100,
...     BigN=50,
...     n_trials=500,
...     verbose=False
... )
>>> print(f"Fraction achieving ≥95%: {results['marginal']['singleton']['selected']['fraction_above_target']:.1%}")
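The aggregation step of the meta-validation reduces a list of per-calibration coverages to a few summary numbers. The sketch below shows that reduction on made-up data. The helper `aggregate_coverages` is hypothetical and reproduces only three of the documented per-method keys.

```python
import statistics

# Hypothetical empirical coverages, one per calibration dataset (BigN = 8).
coverages = [0.97, 0.99, 0.93, 0.96, 0.95, 0.91, 0.98, 0.99]
ci_level = 0.95


def aggregate_coverages(coverages, ci_level):
    """Summarize coverages across calibrations, as in the per-method dicts."""
    n = len(coverages)
    return {
        "mean": statistics.fmean(coverages),
        "median": statistics.median(coverages),
        # Fraction of calibrations whose coverage reaches the nominal level.
        "fraction_above_target": sum(c >= ci_level for c in coverages) / n,
    }


stats = aggregate_coverages(coverages, ci_level)
```

In this made-up sample, six of the eight calibrations reach the 95% target, so `fraction_above_target` is 0.75 even though the mean coverage is 0.96. This gap is why the per-calibration fraction is reported alongside the mean.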