Summary

ORCA readiness scores correlate with downstream ML model performance (F1 macro) across multiple datasets and degradation types. When data quality degrades, the readiness score falls along with model performance, which supports using the readiness score as a proxy for downstream ML outcomes.

Methodology

The benchmark follows the CleanML framework (Budach et al., NeurIPS 2022), which systematically measures how data quality issues affect machine learning performance:
  1. Baseline — Load a clean dataset, train baseline models, and compute the baseline ORCA readiness score.
  2. Degrade — Apply controlled degradation (nulls, duplicates, noise) at varying rates.
  3. Re-evaluate — Train the same models on the degraded data and measure the performance drop (F1 macro). Compute the ORCA readiness score on the degraded data and measure the score drop.
  4. Correlate — Calculate the Pearson correlation between readiness score drop and F1 drop across all experiments.
Each degradation configuration is repeated across multiple random seeds, so the reported correlation reflects the spread across trials rather than any single run. Models are always evaluated on the clean held-out test split, so performance changes reflect training data quality, not test contamination.
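The correlation step can be sketched as follows. This is a minimal illustration using NumPy only, not the code in scripts/benchmark_readiness.py; the function name `pearson_r` and the sample numbers are assumptions made for the example. Each drop is taken as the baseline value minus the degraded value for one experiment.

```python
import numpy as np


def pearson_r(score_drops, f1_drops):
    """Pearson correlation between readiness-score drops and F1 drops.

    Each position in the two sequences is one experiment
    (dataset x degradation type x severity x seed).
    """
    x = np.asarray(score_drops, dtype=float)
    y = np.asarray(f1_drops, dtype=float)
    if x.shape != y.shape or x.size < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    # np.corrcoef returns the 2x2 correlation matrix; [0, 1] is r(x, y).
    return float(np.corrcoef(x, y)[0, 1])


# Hypothetical drops, for illustration only (not benchmark results).
score_drops = [0.0, 2.0, 4.0, 6.0]      # readiness points lost
f1_drops = [0.00, 0.01, 0.02, 0.03]     # F1 macro lost
r = pearson_r(score_drops, f1_drops)
```

Note that `np.corrcoef` yields NaN when either sequence is constant, so a real run would need to guard against configurations where nothing changes.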

Datasets

| Dataset | Rows | Source | Task |
|---|---|---|---|
| Synthetic churn | 5,000 | Generated (logistic model) | Binary classification |
| Adult Census | ~32,500 | UCI ML Repository | Income prediction |
| Wine Quality | ~1,600 | UCI ML Repository | Quality classification |
| Heart Disease | ~300 | UCI ML Repository (Cleveland) | Disease prediction |
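For readers unfamiliar with generating labels from a logistic model, the sketch below shows one way a 5,000-row synthetic churn table could be produced. The feature names, coefficients, and the function `make_synthetic_churn` are all hypothetical; the actual generator in the benchmark script may differ.

```python
import numpy as np
import pandas as pd


def make_synthetic_churn(n_rows=5000, seed=0):
    """Draw features, then sample a binary label from a logistic model."""
    rng = np.random.default_rng(seed)
    # Hypothetical features, chosen only for illustration.
    tenure_months = rng.uniform(0, 72, n_rows)
    monthly_charge = rng.normal(70, 20, n_rows)
    support_tickets = rng.poisson(1.5, n_rows)
    # Hypothetical coefficients: the linear predictor feeds a sigmoid.
    logit = -0.5 - 0.04 * tenure_months + 0.01 * monthly_charge + 0.4 * support_tickets
    p_churn = 1.0 / (1.0 + np.exp(-logit))
    # Bernoulli draw per row with probability p_churn.
    churn = (rng.random(n_rows) < p_churn).astype(int)
    return pd.DataFrame(
        {
            "tenure_months": tenure_months,
            "monthly_charge": monthly_charge,
            "support_tickets": support_tickets,
            "churn": churn,
        }
    )


churn_df = make_synthetic_churn()
```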

Models

Three standard classifiers are trained for every experiment:
  • Logistic Regression (max 1,000 iterations)
  • Random Forest (100 estimators)
  • Gradient Boosting (100 estimators)
Metrics collected: F1 macro, accuracy, AUC-ROC.
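A minimal scikit-learn sketch of this model set and the three metrics is shown below. The helper names `build_models` and `evaluate` are assumptions, the data comes from `make_classification` purely so the example runs on its own, and the AUC-ROC line assumes a binary task (a multiclass dataset would need `multi_class="ovr"` and the full probability matrix).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score
from sklearn.model_selection import train_test_split


def build_models(seed=0):
    """The three classifiers with the settings listed above."""
    return {
        "logistic_regression": LogisticRegression(max_iter=1000, random_state=seed),
        "random_forest": RandomForestClassifier(n_estimators=100, random_state=seed),
        "gradient_boosting": GradientBoostingClassifier(n_estimators=100, random_state=seed),
    }


def evaluate(models, X_train, y_train, X_test, y_test):
    """Fit each model and collect F1 macro, accuracy, and AUC-ROC."""
    results = {}
    for name, model in models.items():
        model.fit(X_train, y_train)
        pred = model.predict(X_test)
        proba = model.predict_proba(X_test)[:, 1]  # positive-class probability (binary)
        results[name] = {
            "f1_macro": f1_score(y_test, pred, average="macro"),
            "accuracy": accuracy_score(y_test, pred),
            "auc_roc": roc_auc_score(y_test, proba),
        }
    return results


# Stand-in data so the sketch is self-contained.
X, y = make_classification(n_samples=500, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)
results = evaluate(build_models(seed=0), X_train, y_train, X_test, y_test)
```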

Degradation types

Three families of degradation are exercised, each at multiple severities:
| Type | Method |
|---|---|
| MCAR nulls | Random cell masking (Missing Completely At Random), swept across a range of severities from light to heavy |
| Duplicates | Random row duplication, swept across a range of duplication rates |
| Gaussian noise | Additive noise scaled to each numeric column’s standard deviation, swept across a range of noise levels |
As described under Methodology, each severity level is run with multiple random seeds.
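The three degradation families could be implemented along these lines with pandas and NumPy. The function names, the `columns` parameter, and the sample rates are assumptions for illustration; the benchmark script may apply them differently (for example, in how it protects the label column).

```python
import numpy as np
import pandas as pd


def inject_mcar_nulls(df, rate, rng, columns=None):
    """Mask each cell independently with probability `rate` (MCAR)."""
    out = df.copy()
    cols = list(columns) if columns is not None else list(out.columns)
    mask = rng.random((len(out), len(cols))) < rate
    for j, col in enumerate(cols):
        out[col] = out[col].mask(mask[:, j])  # True positions become NaN
    return out


def inject_duplicates(df, rate, rng):
    """Append round(rate * n) rows sampled with replacement from df."""
    n_dup = int(round(rate * len(df)))
    idx = rng.integers(0, len(df), size=n_dup)
    return pd.concat([df, df.iloc[idx]], ignore_index=True)


def inject_gaussian_noise(df, level, rng, columns=None):
    """Add N(0, (level * column std)^2) noise to each numeric column."""
    out = df.copy()
    if columns is None:
        columns = out.select_dtypes(include="number").columns
    for col in columns:
        sigma = out[col].std()
        out[col] = out[col] + rng.normal(0.0, level * sigma, size=len(out))
    return out


# Toy frame; feature columns are degraded, the label is left alone.
rng = np.random.default_rng(0)
df = pd.DataFrame(
    {"a": np.arange(100.0), "b": np.arange(100.0) * 2, "label": [0, 1] * 50}
)
features = ["a", "b"]
nulled = inject_mcar_nulls(df, rate=0.2, rng=rng, columns=features)
duplicated = inject_duplicates(df, rate=0.1, rng=rng)
noisy = inject_gaussian_noise(df, level=0.5, rng=rng, columns=features)
```

Passing the `rng` in explicitly is a design choice that makes each seed's run repeatable, which matters when every severity is repeated across seeds.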

Readiness scoring

The benchmark uses ORCA’s production-aligned 7-dimension readiness score: completeness, consistency, referential integrity, compliance, uniqueness, schema quality, and stability. Each dimension is scored on a 0-100 scale, and the seven scores are combined into a single weighted composite. Per the methodology page, the exact dimension weights, the thresholds each dimension applies internally, and the per-issue penalty schedules are implementation details that are not published here. The benchmark is meant to test the correlation claim (that the published score moves in step with real model performance), not to provide a recipe for reproducing the score itself.
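To make "weighted composite" concrete, here is a sketch of a weighted average over the seven dimensions. Because the real weights are not published, the equal weights below are placeholders and do not reflect ORCA's actual values; the function name `composite_score` and the normalization by total weight are likewise assumptions.

```python
DIMENSIONS = (
    "completeness",
    "consistency",
    "referential_integrity",
    "compliance",
    "uniqueness",
    "schema_quality",
    "stability",
)


def composite_score(dimension_scores, weights=None):
    """Weighted average of per-dimension scores, each on a 0-100 scale."""
    if weights is None:
        # Placeholder: equal weights. The real weights are unpublished.
        weights = {d: 1.0 for d in DIMENSIONS}
    missing = [d for d in DIMENSIONS if d not in dimension_scores]
    if missing:
        raise ValueError(f"missing dimension scores: {missing}")
    for d in DIMENSIONS:
        if not 0.0 <= dimension_scores[d] <= 100.0:
            raise ValueError(f"{d} score must be within 0-100")
    total_weight = sum(weights[d] for d in DIMENSIONS)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    weighted_sum = sum(weights[d] * dimension_scores[d] for d in DIMENSIONS)
    return weighted_sum / total_weight


# Hypothetical dataset: heavy nulls pull completeness down to 30.
scores = {d: 100.0 for d in DIMENSIONS}
scores["completeness"] = 30.0
composite = composite_score(scores)  # (6 * 100 + 30) / 7 = 90.0
```

Dividing by the total weight keeps the composite on the same 0-100 scale regardless of how the weights are chosen.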

Interpreting results

| Pearson r | Interpretation |
|---|---|
| > 0.7 | Strong: readiness scores reliably predict ML performance |
| 0.4–0.7 | Moderate: scores are a useful proxy |
| 0.2–0.4 | Weak: some predictive value |
| < 0.2 | Not significant: the score would be flagged for methodology revision |
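The bands above can be expressed as a small lookup. The function name `interpret_pearson_r` is hypothetical, and since the table lists 0.4 and 0.7 in adjacent ranges, the handling of exact boundary values (lower bound inclusive, 0.7 counted as moderate) is an assumption.

```python
def interpret_pearson_r(r):
    """Map a Pearson r to the interpretation bands in the table above."""
    if not -1.0 <= r <= 1.0:  # also rejects NaN
        raise ValueError("Pearson r must lie in [-1, 1]")
    if r > 0.7:
        return "strong"
    if r >= 0.4:  # assumption: boundary values fall in the lower-named band
        return "moderate"
    if r >= 0.2:
        return "weak"
    return "not significant"


labels = [interpret_pearson_r(r) for r in (0.85, 0.55, 0.3, 0.1)]
```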

How to reproduce

# Single dataset (fast)
python scripts/benchmark_readiness.py --dataset synthetic

# All datasets (downloads UCI data on first run)
python scripts/benchmark_readiness.py --dataset all --output results.json
Requirements: scikit-learn, pandas, numpy (included in requirements.txt). UCI datasets are downloaded automatically and cached in scripts/datasets/.

Use-case weight profiles

Production deployments can apply use-case-specific weight profiles that rebalance dimension importance to match the intended ML workload. For example, a time-series forecasting profile and a tabular-classification profile weight the dimensions differently, because the dominant data-quality risks differ between the two model families. See AI Readiness and the methodology page for the philosophy behind the profiles; the per-profile weight values are implementation details that are not published.

Transparency note

Results may vary based on dataset characteristics. We publish our methodology and scripts for full transparency. The benchmark source code is available at scripts/benchmark_readiness.py.