Auto-remediation

When to use which

ORCA gives you three different ways to fix data quality issues. Pick the right tool for the job:

Tool	Scope	Trigger	Best for
Auto-remediation (this page)	One file, one click	Manual, on-demand	Cleaning a single file end-to-end before downstream use
Fix Inbox	Continuous queue across all files	Automatic on every job	Reviewing and approving AI-suggested fixes over time
Correction pipelines	Reusable multi-step recipes	Manual or scheduled	Repeating the same cleanup sequence across many files

This page covers auto-remediation. Fix Inbox and Correction pipelines are documented on their own pages.

How it works

ORCA’s remediation engine analyzes your existing quality results and generates a fix plan tailored to each column’s data type and detected issues. The process is preview first, apply second — you always review changes before they’re made.

Preview

ORCA generates a remediation plan with strategy selection and before/after samples.

Select

Choose which fixes to apply using checkboxes. Some strategies require manual review and cannot be auto-applied.

Apply

ORCA creates a new remediated copy of your file. The original is never modified.

Review

See what changed with a detailed change log showing row counts and before/after values.
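The four steps above can be pictured with a small sketch. This is not ORCA's implementation or API: the `Fix` class, its fields, and the `preview` and `apply_selected` helpers are hypothetical names chosen for illustration. It only shows the shape of the flow: preview without changing anything, skip manual-only items, write to a copy, and return a change log.

```python
import copy
from dataclasses import dataclass
from typing import Callable


@dataclass
class Fix:
    """One entry in a hypothetical remediation plan."""
    column: str
    strategy: str
    transform: Callable[[object], object]  # per-value fix (illustrative)
    auto_applicable: bool = True           # False for "Manual" items
    selected: bool = False                 # the checkbox in the plan


def preview(rows, fix, n=3):
    """Return before/after samples for one fix without changing the data."""
    return [(row[fix.column], fix.transform(row[fix.column])) for row in rows[:n]]


def apply_selected(rows, plan):
    """Apply selected, auto-applicable fixes to a copy; return (copy, change_log)."""
    remediated = copy.deepcopy(rows)  # the original rows are never modified
    change_log = []
    for fix in plan:
        if not (fix.selected and fix.auto_applicable):
            continue  # unselected or manual-only fixes are skipped
        changed = 0
        for row in remediated:
            new_value = fix.transform(row[fix.column])
            if new_value != row[fix.column]:
                row[fix.column] = new_value
                changed += 1
        change_log.append(
            {"column": fix.column, "strategy": fix.strategy, "rows_changed": changed}
        )
    return remediated, change_log


rows = [
    {"email": " Alice@Acme.com", "country": "DE"},
    {"email": "bob@acme.com", "country": "DE"},
]
plan = [
    Fix("email", "lowercase_trim", lambda v: v.strip().lower(), selected=True),
    # Manual-only item: stays in the plan but is never auto-applied.
    Fix("country", "gdpr_review", lambda v: v, auto_applicable=False, selected=True),
]
remediated, change_log = apply_selected(rows, plan)
```

Here `rows` is left untouched, `remediated` holds the cleaned copy, and `change_log` records one entry for the `email` fix only.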

Strategy types

Null imputation

Fills missing values based on the column’s semantic type:

Column type	Strategy	Rationale
Numeric (revenue, age, price)	Median	Robust to skew and outliers
Categorical (status, country)	Mode	Most frequent value preserves distribution
Datetime (created_at, date)	Forward fill	Preserves temporal ordering
Text / Unclassified	Placeholder	“Not provided” avoids inventing data

Example: excerpt from an age column with 12% nulls overall, where the median across the full column is 33:

Before: [34, 27, null, 41, null, 29, 38, null, 33]
After:  [34, 27,   33, 41,   33, 29, 38,   33, 33]   ← fill value = column median (33)

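ORCA picks median over mean because revenue, age, and price columns are typically right-skewed: a few large values pull the mean upward, so mean-imputed values would overstate the typical value, while the median stays close to it.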
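A minimal sketch of the four strategies in the table, using only the Python standard library. The function names are illustrative assumptions, not ORCA functions, and the tie-breaking and leading-null behavior shown in the comments are choices made for this sketch rather than documented ORCA behavior.

```python
from collections import Counter
from statistics import median


def impute_numeric(values):
    """Fill None with the median of the non-null values."""
    fill = median(v for v in values if v is not None)
    return [fill if v is None else v for v in values]


def impute_categorical(values):
    """Fill None with the most frequent non-null value (ties: first seen)."""
    fill = Counter(v for v in values if v is not None).most_common(1)[0][0]
    return [fill if v is None else v for v in values]


def impute_datetime(values):
    """Forward fill: each None takes the last non-null value before it.

    A None at the very start has nothing to copy from and stays None.
    """
    filled, last = [], None
    for v in values:
        if v is not None:
            last = v
        filled.append(last)
    return filled


def impute_text(values, placeholder="Not provided"):
    """Fill None with a fixed placeholder instead of inventing content."""
    return [placeholder if v is None else v for v in values]


ages = impute_numeric([34, 27, None, 41, None, 29, 38])
statuses = impute_categorical(["active", None, "churned", "active"])
dates = impute_datetime(["2024-03-10", None, "2024-03-12", None])
notes = impute_text(["ok", None])
```

With five non-null ages (27, 29, 34, 38, 41) the median is 34, so both nulls in `ages` become 34.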

Deduplication

Removes exact duplicate rows, keeping the first occurrence. Applied at the file level (all columns compared).
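As an illustration of "exact duplicate, keep first", the sketch below treats each row as a tuple of hashable values and compares every column. The helper name `drop_exact_duplicates` is an assumption for this example, not an ORCA function.

```python
def drop_exact_duplicates(rows):
    """Keep the first occurrence of each row; all columns must match to count as a duplicate."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(row)  # the whole row is the comparison key
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


rows = [
    ("a@acme.com", "DE"),
    ("b@acme.com", "FR"),
    ("a@acme.com", "DE"),  # exact duplicate of the first row: removed
    ("a@acme.com", "FR"),  # differs in one column: kept
]
deduplicated = drop_exact_duplicates(rows)
```

Rows that differ in even one column are kept, so near-duplicates (for example, the same email with different casing) survive unless format standardization has made them identical first.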

Format standardization

Normalizes inconsistent formats:

Format issue	Fix applied
Mixed date formats	Standardize to ISO 8601 (YYYY-MM-DD)
Email casing	Lowercase and trim whitespace
Phone formats	Strip non-numeric characters
General whitespace	Trim leading/trailing spaces, normalize unicode

Example — signup_date column with mixed formats:

Before: ["2024-03-12", "12/03/24", "March 12, 2024", "2024-03-12T00:00:00Z"]
After:  ["2024-03-12", "2024-03-12", "2024-03-12",   "2024-03-12"]

Example — email column with inconsistent casing and whitespace:

Before: ["  Alice@Acme.com", "bob@ACME.COM ", "carol@acme.com"]
After:  ["alice@acme.com",   "bob@acme.com",  "carol@acme.com"]
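The fixes in the table can be approximated with the standard library as follows. This is a sketch under stated assumptions: the list of accepted date formats, the day-first reading of `12/03/24` (inferred from the example above), and the choice of NFC for unicode normalization are all assumptions, and the function names are illustrative rather than part of ORCA.

```python
import re
import unicodedata
from datetime import datetime

# Candidate input formats, tried in order. Day-first for "12/03/24" is an
# assumption taken from the signup_date example; "%B" expects English month
# names under the default C locale.
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%y", "%B %d, %Y", "%Y-%m-%dT%H:%M:%SZ")


def standardize_date(value):
    """Return the date as ISO 8601 (YYYY-MM-DD), or the input unchanged if unparseable."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(value.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return value


def standardize_email(value):
    """Lowercase and trim surrounding whitespace."""
    return value.strip().lower()


def standardize_phone(value):
    """Strip every non-digit character."""
    return re.sub(r"\D", "", value)


def standardize_whitespace(value):
    """Trim leading/trailing spaces and normalize unicode (NFC assumed here)."""
    return unicodedata.normalize("NFC", value).strip()


dates = [standardize_date(d) for d in
         ["2024-03-12", "12/03/24", "March 12, 2024", "2024-03-12T00:00:00Z"]]
emails = [standardize_email(e) for e in
          ["  Alice@Acme.com", "bob@ACME.COM ", "carol@acme.com"]]
phone = standardize_phone("+1 (555) 010-2000")
```

Returning unparseable dates unchanged is a deliberate choice in this sketch: a value that matches none of the known formats is safer left as-is than guessed at.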

Outlier treatment (winsorization)

For numeric columns with detected outliers, values below a low percentile or above a high percentile are capped at that percentile. This reduces the influence of a few extreme rows on downstream models without removing the rows themselves.
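To make the capping concrete, here is a minimal winsorization sketch. The 5th and 95th percentile cutoffs are assumptions chosen for illustration (this page does not state which percentiles ORCA uses), and `winsorize` is a hypothetical helper, not an ORCA function.

```python
from statistics import quantiles


def winsorize(values, lower_pct=5, upper_pct=95):
    """Cap values outside the [lower_pct, upper_pct] percentile range.

    The default cutoffs are illustrative assumptions, not ORCA's settings.
    """
    # quantiles(..., n=100) returns 99 cut points; cuts[i] is the (i+1)th percentile.
    cuts = quantiles(values, n=100, method="inclusive")
    low, high = cuts[lower_pct - 1], cuts[upper_pct - 1]
    return [min(max(v, low), high) for v in values]


# Values 1..20 plus one extreme row.
prices = list(range(1, 21)) + [1000]
capped = winsorize(prices)
```

In this sample the 5th percentile is 2 and the 95th is 20, so the value 1 is raised to 2 and the value 1000 is lowered to 20. The row count stays the same, and values already inside the range are untouched.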

Manual review

Some issues cannot be auto-fixed:

GDPR fields — require human decision on data handling
Orphaned references — require cross-file resolution
Non-numeric anomalies — flagged for review

These appear in the plan but cannot be selected for auto-fix. They are marked with a “Manual” label in the remediation panel.

Score impact

Each remediation action has an estimated impact on your AI readiness score, depending on the type of issue and which dimension it falls under:

Issue type	Typical impact	Dimension affected
Null values	High	Completeness
Duplicates	Medium	Uniqueness
Format violations	Medium	Consistency
Outliers	Low to medium	Referential integrity

The estimated score improvement is shown in the plan summary. Actual improvement varies based on the specific data distribution and which dimensions were dragging the score down. For the full list of dimensions and how they combine, see the AI Readiness methodology.

Best practices

Start with high-impact fixes: null imputation on critical columns usually has the biggest score impact
Review samples carefully: check the before/after previews to ensure the strategy is appropriate
Apply in batches: apply a few fixes, re-score, then decide on next steps
Keep originals: ORCA always preserves your original file, but downloading a backup is good practice
Re-score after remediation: re-run AI readiness to see updated scores and, if useful, generate an assessment report
