Validation benchmark

Summary

ORCA readiness scores correlate with actual ML model performance (F1) across multiple datasets and degradation types. When data quality degrades, ORCA scores drop in proportion to the resulting loss in model accuracy — meaning the readiness score is a reliable proxy for downstream ML outcomes.

Methodology

The benchmark follows the CleanML framework (Budach et al., NeurIPS 2022), which systematically measures how data quality issues affect machine learning performance:

Baseline — Load a clean dataset, train baseline models, and compute the baseline ORCA readiness score.
Degrade — Apply controlled degradation (nulls, duplicates, noise) at varying rates.
Re-evaluate — Train the same models on the degraded data and measure the performance drop (F1 macro). Compute the ORCA readiness score on the degraded data and measure the score drop.
Correlate — Calculate the Pearson correlation between readiness score drop and F1 drop across all experiments.
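
The Correlate step comes down to one Pearson coefficient over paired drops. A minimal sketch in pure Python, assuming each experiment yields one (readiness score drop, F1 drop) pair. The function name and the sample numbers are illustrative placeholders, not output from scripts/benchmark_readiness.py.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when either input is constant")
    return cov / math.sqrt(var_x * var_y)

# Placeholder values, one pair per experiment (not real benchmark results).
score_drops = [1.0, 4.0, 9.0, 15.0]   # baseline score minus degraded score
f1_drops = [0.01, 0.03, 0.08, 0.12]   # baseline F1 minus degraded F1
r = pearson_r(score_drops, f1_drops)
```

Working with drops rather than raw values keeps datasets with different baselines comparable within one correlation.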

Each degradation configuration is repeated across multiple random seeds, so that no single run drives the reported correlation. The test set is always the clean held-out split, so changes in performance reflect training data quality, not test contamination.
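
The clean-test-set rule can be sketched as follows: split once before any degradation, degrade only the training rows, and score every model on the untouched test rows. This is an illustration under assumptions, not the benchmark script itself. The function name mean_f1_drop, the 80/20 split, the synthetic data from make_classification, and the use of Gaussian noise with a single logistic regression are all choices made for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

def mean_f1_drop(X, y, noise_level, seeds, test_size=0.2):
    """Average F1-macro drop over seeds when only the training split is degraded."""
    # Split once, before any degradation, so the test rows stay clean.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=0, stratify=y
    )
    baseline = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    f1_base = f1_score(y_test, baseline.predict(X_test), average="macro")

    drops = []
    for seed in seeds:
        rng = np.random.default_rng(seed)
        # Noise is scaled to each training column's standard deviation.
        noise = rng.normal(0.0, noise_level * X_train.std(axis=0), size=X_train.shape)
        degraded = LogisticRegression(max_iter=1000).fit(X_train + noise, y_train)
        f1_degr = f1_score(y_test, degraded.predict(X_test), average="macro")
        drops.append(f1_base - f1_degr)
    return float(np.mean(drops))

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
drop = mean_f1_drop(X, y, noise_level=1.0, seeds=[0, 1, 2])
```

Because the split happens outside the seed loop, every seed is scored against the same clean test rows and only the injected degradation varies.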

Datasets

Dataset	Rows	Source	Task
Synthetic churn	5,000	Generated (logistic model)	Binary classification
Adult Census	~32,500	UCI ML Repository	Income prediction
Wine Quality	~1,600	UCI ML Repository	Quality classification
Heart Disease	~300	UCI ML Repository (Cleveland)	Disease prediction

Models

Three standard classifiers are trained for every experiment:

Logistic Regression (max 1,000 iterations)
Random Forest (100 estimators)
Gradient Boosting (100 estimators)

Metrics collected: F1 macro, accuracy, AUC-ROC.
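
With scikit-learn, the three classifiers and three metrics could be set up roughly as below. This is a sketch under assumptions: the helper names build_models and evaluate are hypothetical, only the hyperparameters stated above are set, and the AUC-ROC call covers the binary case only (a multiclass target would need roc_auc_score with multi_class set, for example "ovr").

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score
from sklearn.model_selection import train_test_split

def build_models(seed=0):
    """Return fresh, unfitted instances of the three classifiers."""
    return {
        "logistic_regression": LogisticRegression(max_iter=1000),
        "random_forest": RandomForestClassifier(n_estimators=100, random_state=seed),
        "gradient_boosting": GradientBoostingClassifier(n_estimators=100, random_state=seed),
    }

def evaluate(model, X_train, y_train, X_test, y_test):
    """Fit on the training split and score on the held-out split."""
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    # Binary AUC-ROC uses the predicted probability of the positive class.
    positive_proba = model.predict_proba(X_test)[:, 1]
    return {
        "f1_macro": f1_score(y_test, predictions, average="macro"),
        "accuracy": accuracy_score(y_test, predictions),
        "auc_roc": roc_auc_score(y_test, positive_proba),
    }

# Synthetic binary data stands in for a benchmark dataset.
X, y = make_classification(n_samples=400, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)
results = {
    name: evaluate(model, X_train, y_train, X_test, y_test)
    for name, model in build_models().items()
}
```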

Degradation types

Three families of degradation are exercised, each at multiple severities:

Type	Method
MCAR nulls	Random cell masking (Missing Completely At Random), swept across a range of severities from light to heavy
Duplicates	Random row duplication, swept across a range of duplication rates
Gaussian noise	Additive noise scaled to each numeric column’s standard deviation, swept across a range of noise levels

Every severity level is run with several random seeds, so seed-to-seed variation does not drive the reported correlation.
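
The three degradation families can be approximated with pandas and NumPy as shown below. Treat this as a sketch: the function names, the example rates, and the choice to append duplicates at the end of the frame are assumptions, and the actual script may differ.

```python
import numpy as np
import pandas as pd

def inject_mcar_nulls(df, rate, rng):
    """Mask each cell independently with probability `rate` (MCAR)."""
    mask = rng.random(df.shape) < rate
    return df.mask(mask)

def inject_duplicates(df, rate, rng):
    """Append round(rate * len(df)) rows sampled at random with replacement."""
    n_extra = int(round(rate * len(df)))
    picks = rng.integers(0, len(df), size=n_extra)
    return pd.concat([df, df.iloc[picks]], ignore_index=True)

def inject_gaussian_noise(df, level, rng):
    """Add zero-mean noise with sigma = level * column std to numeric columns."""
    out = df.copy()
    for col in out.select_dtypes(include="number").columns:
        sigma = level * out[col].std()
        out[col] = out[col] + rng.normal(0.0, sigma, size=len(out))
    return out

# Small synthetic frame used only to exercise the helpers.
rng = np.random.default_rng(42)
clean = pd.DataFrame({
    "age": rng.normal(40, 10, 200),
    "income": rng.normal(50_000, 8_000, 200),
})
with_nulls = inject_mcar_nulls(clean, rate=0.10, rng=rng)
with_dupes = inject_duplicates(clean, rate=0.10, rng=rng)
with_noise = inject_gaussian_noise(clean, level=0.5, rng=rng)
```

Passing the generator in explicitly makes each degradation reproducible for a given seed, which is what the repeated-seed design relies on.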

Readiness scoring

The benchmark uses ORCA’s production-aligned 7-dimension readiness score: completeness, consistency, referential integrity, compliance, uniqueness, schema quality, and stability. Each dimension is scored on a 0-100 scale and combined into a single weighted composite. Per the methodology page, the exact dimension weights, the thresholds each dimension applies internally, and the per-issue penalty schedules are implementation details we do not publish here. The point of this benchmark is the correlation claim — that the published score moves in step with real model performance — not a recipe for reproducing the score itself.
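
To make the idea of a weighted composite concrete, here is a sketch that combines seven 0-100 dimension scores with a weighted mean. It does not reproduce ORCA's scoring: the equal weights are placeholders because the real weights are unpublished, the weighted-mean form is itself an assumption, and the names DIMENSIONS, PLACEHOLDER_WEIGHTS, and composite_score are hypothetical.

```python
DIMENSIONS = (
    "completeness",
    "consistency",
    "referential_integrity",
    "compliance",
    "uniqueness",
    "schema_quality",
    "stability",
)

# Placeholder only: equal weights. The real ORCA weights are not published.
PLACEHOLDER_WEIGHTS = {name: 1.0 for name in DIMENSIONS}

def composite_score(dimension_scores, weights=PLACEHOLDER_WEIGHTS):
    """Weighted mean of the seven dimension scores, each on a 0-100 scale."""
    missing = set(DIMENSIONS) - set(dimension_scores)
    if missing:
        raise ValueError(f"missing dimension scores: {sorted(missing)}")
    for name in DIMENSIONS:
        if not 0.0 <= dimension_scores[name] <= 100.0:
            raise ValueError(f"{name} must be between 0 and 100")
    total_weight = sum(weights[name] for name in DIMENSIONS)
    weighted_sum = sum(weights[name] * dimension_scores[name] for name in DIMENSIONS)
    return weighted_sum / total_weight

# Example: every dimension is perfect except completeness.
scores = {name: 100.0 for name in DIMENSIONS}
scores["completeness"] = 65.0
composite = composite_score(scores)
```

Dividing by the total weight keeps the composite on the same 0-100 scale regardless of how the weights are chosen.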

Interpreting results

Pearson r	Interpretation
> 0.7	Strong: readiness scores reliably predict ML performance
0.4 to 0.7	Moderate: scores are a useful proxy
0.2 to 0.4	Weak: some predictive value
< 0.2	Negligible: the score would be flagged for methodology revision
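
The bands above translate into a small lookup. The table does not say which band the exact boundary values belong to, so this sketch assumes that 0.7 and 0.4 count as moderate and 0.2 counts as weak. The function name and the returned labels are hypothetical.

```python
def interpret_pearson(r):
    """Map a Pearson r to the interpretation band it falls in."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("Pearson r must lie in [-1, 1]")
    if r > 0.7:
        return "strong"
    if r >= 0.4:      # assumption: the boundary 0.7 falls in this band
        return "moderate"
    if r >= 0.2:
        return "weak"
    # Includes negative correlations, which also fail the proxy claim.
    return "flag for methodology revision"

band = interpret_pearson(0.55)
```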

How to reproduce

# Single dataset (fast)
python scripts/benchmark_readiness.py --dataset synthetic

# All datasets (downloads UCI data on first run)
python scripts/benchmark_readiness.py --dataset all --output results.json

Requirements: scikit-learn, pandas, numpy (included in requirements.txt). UCI datasets are downloaded automatically and cached in scripts/datasets/.

Use-case weight profiles

Production deployments can apply use-case-specific weight profiles that rebalance dimension importance to match the intended ML workload. For example, a time-series forecasting profile and a tabular-classification profile weight the dimensions differently, because the dominant risks differ between the two model families. See AI Readiness and the methodology page for the rationale behind the profiles; the per-profile weight values are implementation details we do not publish.

Transparency note

Results may vary based on dataset characteristics. We publish our methodology and scripts for full transparency. The benchmark source code is available at scripts/benchmark_readiness.py.
