When to use which

ORCA gives you three different ways to fix data quality issues. Pick the right tool for the job:
| Tool | Scope | Trigger | Best for |
| --- | --- | --- | --- |
| Auto-remediation (this page) | One file, one click | Manual, on-demand | Cleaning a single file end-to-end before downstream use |
| Fix Inbox | Continuous queue across all files | Automatic on every job | Reviewing and approving AI-suggested fixes over time |
| Correction pipelines | Reusable multi-step recipes | Manual or scheduled | Repeating the same cleanup sequence across many files |
This page covers auto-remediation. Fix Inbox and correction pipelines are documented on their own pages.

How it works

ORCA’s remediation engine analyzes your existing quality results and generates a fix plan tailored to each column’s data type and detected issues. The process is preview first, apply second — you always review changes before they’re made.
  1. Preview: ORCA generates a remediation plan with strategy selection and before/after samples.
  2. Select: Choose which fixes to apply using checkboxes. Some strategies require manual review and cannot be auto-applied.
  3. Apply: ORCA creates a new remediated copy of your file. The original is never modified.
  4. Review: See what changed with a detailed change log showing row counts and before/after values.

Strategy types

Null imputation

Fills missing values based on the column’s semantic type:
| Column type | Strategy | Rationale |
| --- | --- | --- |
| Numeric (revenue, age, price) | Median | Robust to skew and outliers |
| Categorical (status, country) | Mode | Most frequent value preserves distribution |
| Datetime (created_at, date) | Forward fill | Preserves temporal ordering |
| Text / Unclassified | Placeholder | “Not provided” avoids inventing data |
Example: age column with 12% nulls (excerpt of rows; the median is computed over the full column):
Before: [34, 27, null, 41, null, 29, 38, null, 33]
After:  [34, 27,   33, 41,   33, 29, 38,   33, 33]   ← median = 33
ORCA picks the median over the mean because revenue, age, and price columns are often right-skewed, and a mean-based fill value would be pulled toward the outliers.
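The type-to-strategy mapping above can be sketched in a few lines of plain Python. This is an illustration only, not ORCA's implementation: the function name `impute_nulls`, the `column_type` labels, and the use of `None` for nulls are all assumptions made for the example.

```python
from collections import Counter
from statistics import median

PLACEHOLDER = "Not provided"  # placeholder text taken from the table above


def impute_nulls(values, column_type):
    """Return a copy of `values` with None filled according to `column_type`.

    `column_type` is a hypothetical label: "numeric", "categorical",
    "datetime", or anything else (treated as text / unclassified).
    """
    present = [v for v in values if v is not None]

    if column_type == "numeric" and present:
        fill = median(present)  # robust to skew and outliers
        return [fill if v is None else v for v in values]

    if column_type == "categorical" and present:
        fill = Counter(present).most_common(1)[0][0]  # mode
        return [fill if v is None else v for v in values]

    if column_type == "datetime":
        filled, last_seen = [], None
        for v in values:
            if v is not None:
                last_seen = v
            filled.append(last_seen)  # leading nulls have nothing to carry forward
        return filled

    if column_type in ("numeric", "categorical"):
        return list(values)  # all-null column: no fill value can be derived

    return [PLACEHOLDER if v is None else v for v in values]


prices = [10, None, 30, 50, None]
print(impute_nulls(prices, "numeric"))
```

Note that forward fill leaves a null in place when it appears before the first known value; how ORCA treats that case is not stated on this page.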

Deduplication

Removes exact duplicate rows, keeping the first occurrence. Applied at the file level (all columns compared).
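As a rough sketch of what "exact duplicates, first occurrence kept" means, the snippet below compares every column of each row and preserves the original row order. The function name `drop_exact_duplicates` and the representation of rows as lists of hashable values are assumptions for illustration, not ORCA's API.

```python
def drop_exact_duplicates(rows):
    """Keep the first occurrence of each row; all columns are compared."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(row)  # the whole row is the identity, not a key column
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


rows = [
    ["alice@acme.com", "US"],
    ["bob@acme.com", "DE"],
    ["alice@acme.com", "US"],  # exact duplicate of the first row
    ["alice@acme.com", "DE"],  # differs in one column, so it is kept
]
print(drop_exact_duplicates(rows))
```

Because the comparison is exact, rows that differ only in casing or whitespace are not treated as duplicates unless they have been standardized first.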

Format standardization

Normalizes inconsistent formats:
| Format issue | Fix applied |
| --- | --- |
| Mixed date formats | Standardize to ISO 8601 (YYYY-MM-DD) |
| Email casing | Lowercase and trim whitespace |
| Phone formats | Strip non-numeric characters |
| General whitespace | Trim leading/trailing spaces, normalize unicode |
Example — signup_date column with mixed formats:
Before: ["2024-03-12", "12/03/24", "March 12, 2024", "2024-03-12T00:00:00Z"]
After:  ["2024-03-12", "2024-03-12", "2024-03-12",   "2024-03-12"]
Example — email column with inconsistent casing and whitespace:
Before: ["  Alice@Acme.com", "bob@ACME.COM ", "carol@acme.com"]
After:  ["alice@acme.com",   "bob@acme.com",  "carol@acme.com"]
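The fixes in the table can be approximated with the Python standard library. The sketch below is a hedged illustration rather than ORCA's actual logic: the function names, the list of accepted date formats, the day-first reading of `12/03/24` (implied by the example above), and the choice of NFC as the unicode normal form are all assumptions.

```python
import re
import unicodedata
from datetime import datetime

# Hypothetical set of accepted input formats, tried in order.
# "%d/%m/%y" assumes day-first dates; "%B" relies on English month names.
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%y", "%B %d, %Y", "%Y-%m-%dT%H:%M:%SZ")


def to_iso_date(raw):
    """Return the date as YYYY-MM-DD, or None if no known format matches."""
    text = raw.strip()
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    return None  # leave unparseable values for manual review


def normalize_email(raw):
    """Lowercase and trim surrounding whitespace."""
    return raw.strip().lower()


def normalize_phone(raw):
    """Strip every non-digit character."""
    return re.sub(r"\D", "", raw)


def clean_text(raw):
    """Trim leading/trailing spaces and apply NFC normalization (assumed form)."""
    return unicodedata.normalize("NFC", raw).strip()


dates = ["2024-03-12", "12/03/24", "March 12, 2024", "2024-03-12T00:00:00Z"]
print([to_iso_date(d) for d in dates])
print([normalize_email(e) for e in ["  Alice@Acme.com", "bob@ACME.COM "]])
```

Returning `None` for an unrecognized date, instead of guessing, keeps ambiguous values visible so they can be reviewed by hand.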

Outlier treatment (winsorization)

For numeric columns with detected outliers, values beyond a lower and an upper percentile threshold are capped at those thresholds. This reduces the influence of a few extreme rows on downstream models without removing the rows themselves.
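A minimal winsorization sketch follows. The page does not state which percentiles ORCA uses, so the 5th and 95th percentile defaults here are placeholders, and the function names and the linear-interpolation percentile are likewise assumptions made for the example.

```python
def percentile(sorted_vals, q):
    """Linearly interpolated percentile of a non-empty sorted list, q in [0, 100]."""
    pos = (len(sorted_vals) - 1) * q / 100
    lower_idx = int(pos)
    upper_idx = min(lower_idx + 1, len(sorted_vals) - 1)
    frac = pos - lower_idx
    return sorted_vals[lower_idx] + (sorted_vals[upper_idx] - sorted_vals[lower_idx]) * frac


def winsorize(values, lower_pct=5, upper_pct=95):
    """Cap values outside [lower_pct, upper_pct] percentiles; no rows are dropped."""
    if not values:
        return []
    ordered = sorted(values)
    low = percentile(ordered, lower_pct)
    high = percentile(ordered, upper_pct)
    return [min(max(v, low), high) for v in values]


revenue = [1, 2, 3, 4, 5, 6, 7, 8, 9, 1000]
capped = winsorize(revenue, lower_pct=10, upper_pct=90)
print(capped)  # same length as the input; 1000 is pulled down to the upper cap
```

The row count is unchanged: extreme values are replaced by the cap, which is what distinguishes winsorization from trimming.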

Manual review

Some issues cannot be auto-fixed:
  • GDPR fields — require human decision on data handling
  • Orphaned references — require cross-file resolution
  • Non-numeric anomalies — flagged for review
These appear in the plan but cannot be selected for auto-fix. They are marked with a “Manual” label in the remediation panel.

Score impact

Each remediation action has an estimated impact on your AI readiness score, depending on the type of issue and which dimension it falls under:
| Issue type | Typical impact | Dimension affected |
| --- | --- | --- |
| Null values | High | Completeness |
| Duplicates | Medium | Uniqueness |
| Format violations | Medium | Consistency |
| Outliers | Low to medium | Referential integrity |
The estimated score improvement is shown in the plan summary. Actual improvement varies based on the specific data distribution and which dimensions were dragging the score down. For the full list of dimensions and how they combine, see the AI Readiness methodology.

Best practices

  1. Start with high-impact fixes: null imputation on critical columns has the biggest score impact.
  2. Review samples carefully: check the before/after previews to ensure the strategy is appropriate.
  3. Apply in batches: apply a few fixes, re-score, then decide on next steps.
  4. Keep originals: ORCA always preserves your original file, but downloading a backup is good practice.
  5. Re-run AI readiness after remediation to see updated scores and potentially generate an assessment report.