Summary
ORCA readiness scores correlate with actual ML model performance (F1) across multiple datasets and degradation types. When data quality degrades, ORCA scores drop in proportion to the resulting loss in model accuracy, so the readiness score is a reliable proxy for downstream ML outcomes.
Methodology
The benchmark follows the CleanML framework (Budach et al., NeurIPS 2022), which systematically measures how data quality issues affect machine learning performance:
- Baseline — Load a clean dataset, train baseline models, and compute the baseline ORCA readiness score.
- Degrade — Apply controlled degradation (nulls, duplicates, noise) at varying rates.
- Re-evaluate — Train the same models on the degraded data and measure the performance drop (F1 macro). Compute the ORCA readiness score on the degraded data and measure the score drop.
- Correlate — Calculate the Pearson correlation between readiness score drop and F1 drop across all experiments.
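
The Correlate step can be sketched as below. This is a minimal illustration, not the benchmark's own code: the helper name `pearson_r` and the sample numbers are made up for the example, and NumPy's `corrcoef` stands in for whichever correlation routine the script actually calls.

```python
import numpy as np


def pearson_r(score_drops, f1_drops):
    """Pearson correlation between readiness-score drops and F1 drops."""
    x = np.asarray(score_drops, dtype=float)
    y = np.asarray(f1_drops, dtype=float)
    if x.shape != y.shape or x.size < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    if x.std() == 0 or y.std() == 0:
        raise ValueError("correlation is undefined when one series is constant")
    # corrcoef returns the 2x2 correlation matrix; the off-diagonal entry is r.
    return float(np.corrcoef(x, y)[0, 1])


# Made-up values for illustration only; these are NOT benchmark results.
baseline_score, baseline_f1 = 90.0, 0.80
degraded = [(85.0, 0.78), (70.0, 0.71), (55.0, 0.60)]  # (score, F1) per experiment

# Each experiment contributes one (score drop, F1 drop) pair.
score_drops = [baseline_score - score for score, _ in degraded]
f1_drops = [baseline_f1 - f1 for _, f1 in degraded]
r = pearson_r(score_drops, f1_drops)
```

Drops are measured against the clean baseline, so every experiment adds a single point to the correlation regardless of dataset or degradation type.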
Datasets
| Dataset | Rows | Source | Task |
|---|---|---|---|
| Synthetic churn | 5,000 | Generated (logistic model) | Binary classification |
| Adult Census | ~32,500 | UCI ML Repository | Income prediction |
| Wine Quality | ~1,600 | UCI ML Repository | Quality classification |
| Heart Disease | ~300 | UCI ML Repository (Cleveland) | Disease prediction |
Models
Three standard classifiers are trained for every experiment:
- Logistic Regression (max 1,000 iterations)
- Random Forest (100 estimators)
- Gradient Boosting (100 estimators)
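
As a rough sketch, the three classifiers could be set up with scikit-learn as follows. Only the iteration and estimator counts come from the list above; the function names, the 75/25 stratified split, the fixed seed, and the toy data from `make_classification` are assumptions made for this example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split


def build_models(seed=0):
    """Return fresh instances of the three classifiers (hypothetical helper)."""
    return {
        "logistic_regression": LogisticRegression(max_iter=1000),
        "random_forest": RandomForestClassifier(n_estimators=100, random_state=seed),
        "gradient_boosting": GradientBoostingClassifier(n_estimators=100, random_state=seed),
    }


def macro_f1_per_model(X, y, seed=0):
    """Train each model on a stratified split and report macro-averaged F1."""
    # The 75/25 split is an assumption; the source does not state the split used.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=seed, stratify=y
    )
    scores = {}
    for name, model in build_models(seed).items():
        model.fit(X_train, y_train)
        scores[name] = f1_score(y_test, model.predict(X_test), average="macro")
    return scores


# Toy stand-in data, not one of the benchmark datasets.
X, y = make_classification(n_samples=400, n_features=8, random_state=0)
scores = macro_f1_per_model(X, y)
```

These estimators reject missing values, so data degraded with nulls would need an imputation step before training. The imputation strategy is not described here, so the sketch leaves it out.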
Degradation types
Three families of degradation are exercised, each at multiple severities:
| Type | Method |
|---|---|
| MCAR nulls | Random cell masking (Missing Completely At Random), swept across a range of severities from light to heavy |
| Duplicates | Random row duplication, swept across a range of duplication rates |
| Gaussian noise | Additive noise scaled to each numeric column’s standard deviation, swept across a range of noise levels |
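
The three degradations in the table can be approximated with a few lines of pandas and NumPy. This is a sketch under assumptions: the function names, the per-cell masking, the rounding of the duplicate count, and the example rates are illustrative choices, not the values or code used by the benchmark script.

```python
import numpy as np
import pandas as pd


def inject_mcar_nulls(df, rate, rng):
    """Mask each cell independently with probability `rate` (MCAR)."""
    mask = rng.random(df.shape) < rate
    return df.mask(mask)  # cells where mask is True become NaN


def duplicate_rows(df, rate, rng):
    """Append round(rate * len(df)) randomly chosen copies of existing rows."""
    n_extra = int(round(rate * len(df)))
    picks = rng.integers(0, len(df), size=n_extra)
    return pd.concat([df, df.iloc[picks]], ignore_index=True)


def add_gaussian_noise(df, level, rng):
    """Add zero-mean noise with std = level * column std to numeric columns."""
    out = df.copy()
    for col in out.select_dtypes(include="number").columns:
        sigma = level * out[col].std()
        out[col] = out[col] + rng.normal(0.0, sigma, size=len(out))
    return out


# Small made-up frame to exercise the helpers; the rates are examples only.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "tenure": np.arange(1000, dtype=float),
    "spend": np.arange(1000, dtype=float) * 2.0,
    "plan": ["basic"] * 1000,
})
with_nulls = inject_mcar_nulls(demo, rate=0.2, rng=rng)
with_dupes = duplicate_rows(demo, rate=0.1, rng=rng)
with_noise = add_gaussian_noise(demo, level=0.5, rng=rng)
```

Passing an explicit `numpy.random.Generator` keeps each sweep reproducible, and every helper returns a new frame so the clean baseline is left untouched.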
Readiness scoring
The benchmark uses ORCA’s production-aligned 7-dimension readiness score: completeness, consistency, referential integrity, compliance, uniqueness, schema quality, and stability. Each dimension is scored on a 0-100 scale and combined into a single weighted composite. Per the methodology page, the exact dimension weights, the thresholds each dimension applies internally, and the per-issue penalty schedules are implementation details we do not publish here. The point of this benchmark is the correlation claim (that the published score moves in step with real model performance), not a recipe for reproducing the score itself.
Interpreting results
| Pearson r | Interpretation |
|---|---|
| > 0.7 | Strong — readiness scores reliably predict ML performance |
| 0.4 - 0.7 | Moderate — scores are a useful proxy |
| 0.2 - 0.4 | Weak — some predictive value |
| < 0.2 | Not significant — the score would be flagged for methodology revision |
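
The bands in the table can be applied mechanically with a small lookup, sketched below. The function name and short labels are hypothetical, and the table does not say which band the boundary values 0.4 and 0.2 belong to or how negative correlations are treated. This sketch assigns each boundary value to the higher band and treats any r below 0.2, including negative values, as not significant.

```python
import math


def interpret_pearson_r(r):
    """Map a Pearson r to the interpretation bands in the table above."""
    if math.isnan(r):
        raise ValueError("r is NaN; correlation was undefined")
    if r > 0.7:
        return "strong"
    if r >= 0.4:  # assumption: 0.4 itself counts as moderate
        return "moderate"
    if r >= 0.2:  # assumption: 0.2 itself counts as weak
        return "weak"
    # Assumption: negative correlations also land here.
    return "not significant"


labels = [interpret_pearson_r(r) for r in (0.85, 0.7, 0.55, 0.3, 0.1)]
```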
How to reproduce
Requires scikit-learn, pandas, and numpy (all listed in requirements.txt).
UCI datasets are downloaded automatically and cached in scripts/datasets/.
Use-case weight profiles
Production deployments can apply use-case-specific weight profiles that re-balance dimension importance to match the intended ML workload. For example, a time-series forecasting profile and a tabular-classification profile weigh the dimensions differently because the dominant risks differ between model families. See AI Readiness and the methodology page for the philosophy behind the profiles; the per-profile weight values are implementation details we do not publish.
Transparency note
Results may vary based on dataset characteristics. We publish our methodology and scripts for full transparency. The benchmark source code is available at scripts/benchmark_readiness.py.