When to use which
ORCA gives you three different ways to fix data quality issues. Pick the right tool for the job:
| Tool | Scope | Trigger | Best for |
|---|---|---|---|
| Auto-remediation (this page) | One file, one click | Manual, on-demand | Cleaning a single file end-to-end before downstream use |
| Fix Inbox | Continuous queue across all files | Automatic on every job | Reviewing and approving AI-suggested fixes over time |
| Correction pipelines | Reusable multi-step recipes | Manual or scheduled | Repeating the same cleanup sequence across many files |
This page covers auto-remediation. Fix Inbox and Correction pipelines are documented on their own pages.
How it works
ORCA’s remediation engine analyzes your existing quality results and generates a fix plan tailored to each column’s data type and detected issues. The process is preview first, apply second: you always review changes before they’re made.
Preview
ORCA generates a remediation plan with strategy selection and before/after samples.
Select
Choose which fixes to apply using checkboxes. Some strategies require manual review and cannot be auto-applied.
Apply
ORCA creates a new remediated copy of your file. The original is never modified.
Review
See what changed with a detailed change log showing row counts and before/after values.
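The four steps above can be sketched in a few lines of Python. This is only an illustration of the preview, select, apply, review flow, not ORCA's actual API: the function names (`preview`, `apply_plan`), the plan structure, and the fill value of 33 are all hypothetical.

```python
from copy import deepcopy

def preview(rows, column, fill_value):
    """Hypothetical plan entry: which rows would change, with before/after samples."""
    changes = [
        {"row": i, "before": row[column], "after": fill_value}
        for i, row in enumerate(rows)
        if row[column] is None
    ]
    return {"column": column, "strategy": "null_imputation", "changes": changes}

def apply_plan(rows, selected_entries):
    """Apply the selected entries to a deep copy; the input rows are never modified."""
    remediated = deepcopy(rows)
    for entry in selected_entries:
        for change in entry["changes"]:
            remediated[change["row"]][entry["column"]] = change["after"]
    return remediated

original = [{"age": 34}, {"age": None}, {"age": 29}]

plan = [preview(original, "age", fill_value=33)]         # 1. Preview
selected = [entry for entry in plan if entry["changes"]]  # 2. Select (stands in for the checkboxes)
remediated = apply_plan(original, selected)               # 3. Apply to a copy
change_log = {e["column"]: len(e["changes"]) for e in selected}  # 4. Review: rows changed per column
```

Because `apply_plan` works on a copy, `original` still contains the null after the fix is applied, which mirrors the guarantee that the source file is never modified.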
Strategy types
Null imputation
Fills missing values based on the column’s semantic type:
| Column type | Strategy | Rationale |
|---|---|---|
| Numeric (revenue, age, price) | Median | Robust to skew and outliers |
| Categorical (status, country) | Mode | Most frequent value preserves distribution |
| Datetime (created_at, date) | Forward fill | Preserves temporal ordering |
| Text / Unclassified | Placeholder | "Not provided" avoids inventing data |
Example — age column with 12% nulls:
Before: [34, 27, null, 41, null, 29, 38, null, 33]
After: [34, 27, 33, 41, 33, 29, 38, 33, 33] ← nulls filled with the column median (33, computed over the full column, not only the rows in this excerpt)
ORCA picks the median over the mean because revenue, age, and price columns are typically right-skewed; a mean-based fill would be pulled toward the outliers.
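As a rough illustration of the strategy table, the sketch below picks a fill rule from a column type label. It uses only the Python standard library; the `impute` function and the type labels (`"numeric"`, `"categorical"`, `"datetime"`) are hypothetical and do not reflect how ORCA classifies columns internally.

```python
from collections import Counter
from statistics import median

def impute(values, column_type):
    """Fill None entries using a strategy chosen by column type (illustrative only)."""
    present = [v for v in values if v is not None]
    if not present:
        return list(values)  # nothing to learn a fill value from

    if column_type == "numeric":
        fill = median(present)  # robust to skew and outliers
        return [fill if v is None else v for v in values]

    if column_type == "categorical":
        fill = Counter(present).most_common(1)[0][0]  # mode
        return [fill if v is None else v for v in values]

    if column_type == "datetime":
        filled, last = [], None
        for v in values:
            if v is not None:
                last = v
            filled.append(last)  # forward fill; leading nulls stay None
        return filled

    # Text or unclassified: use a placeholder instead of inventing data
    return ["Not provided" if v is None else v for v in values]

revenue = impute([10, None, 30, 1000], "numeric")
status = impute(["active", None, "active", "closed"], "categorical")
created = impute(["2024-01-01", None, None, "2024-01-04"], "datetime")
```

In the `revenue` example the median of the non-null values is 30, while the mean would be about 347, so a mean-based fill would place the missing row far from the typical value.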
Deduplication
Removes exact duplicate rows, keeping the first occurrence. Applied at the file level (all columns compared).
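A minimal sketch of exact-row deduplication that keeps the first occurrence, assuming each row is represented as a tuple of all its column values. The `drop_exact_duplicates` name is hypothetical.

```python
def drop_exact_duplicates(rows):
    """Keep the first occurrence of each row; all columns must match to count as a duplicate."""
    seen = set()
    kept = []
    for row in rows:
        key = tuple(row)  # compare every column, in order
        if key not in seen:
            seen.add(key)
            kept.append(row)
    return kept

rows = [("alice", 1), ("bob", 2), ("alice", 1), ("alice", 2)]
deduped = drop_exact_duplicates(rows)
```

Here `("alice", 2)` is kept because it differs from `("alice", 1)` in one column; only fully identical rows are removed.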
Format standardization
Normalizes inconsistent formats:
| Format issue | Fix applied |
|---|---|
| Mixed date formats | Standardize to ISO 8601 (YYYY-MM-DD) |
| Email casing | Lowercase and trim whitespace |
| Phone formats | Strip non-numeric characters |
| General whitespace | Trim leading/trailing spaces, normalize unicode |
Example — signup_date column with mixed formats:
Before: ["2024-03-12", "12/03/24", "March 12, 2024", "2024-03-12T00:00:00Z"]
After: ["2024-03-12", "2024-03-12", "2024-03-12", "2024-03-12"]
Example — email column with inconsistent casing and whitespace:
Before: [" Alice@Acme.com", "bob@ACME.COM ", "carol@acme.com"]
After: ["alice@acme.com", "bob@acme.com", "carol@acme.com"]
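The fixes in the table can be approximated with the Python standard library, as in the sketch below. The helper names, the list of candidate date formats, the day-first reading of slash dates such as "12/03/24", and the choice of NFC for unicode normalization are all assumptions made for illustration; the source does not specify how ORCA resolves these.

```python
import re
import unicodedata
from datetime import datetime

# Candidate formats, tried in order. Reading slash dates as day-first is an
# assumption that matches the signup_date example; "12/03/24" is ambiguous otherwise.
DATE_FORMATS = ["%Y-%m-%d", "%Y-%m-%dT%H:%M:%SZ", "%d/%m/%y", "%B %d, %Y"]

def to_iso_date(text):
    """Return the date as YYYY-MM-DD, or None if no candidate format matches."""
    cleaned = text.strip()
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(cleaned, fmt).date().isoformat()
        except ValueError:
            continue
    return None  # unparseable values would need manual review

def normalize_email(text):
    """Lowercase and trim surrounding whitespace."""
    return text.strip().lower()

def normalize_phone(text):
    """Strip every non-digit character."""
    return re.sub(r"\D", "", text)

def normalize_text(text):
    """Trim surrounding whitespace and apply NFC unicode normalization (assumed form)."""
    return unicodedata.normalize("NFC", text.strip())

dates = [to_iso_date(s) for s in
         ["2024-03-12", "12/03/24", "March 12, 2024", "2024-03-12T00:00:00Z"]]
emails = [normalize_email(s) for s in
          [" Alice@Acme.com", "bob@ACME.COM ", "carol@acme.com"]]
```

Returning `None` for unparseable dates, rather than guessing, keeps the sketch in line with the preview-first approach: ambiguous values are surfaced instead of silently rewritten.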
Outlier treatment (winsorization)
For numeric columns with detected outliers, values beyond a low and a high percentile threshold are capped at those thresholds. This reduces the influence of a few extreme rows on downstream models without removing the rows themselves.
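A small sketch of winsorization in plain Python follows. The 5th and 95th percentile cutoffs and the linear-interpolation percentile are assumptions chosen for illustration; the source does not state which thresholds or percentile method ORCA uses.

```python
def percentile(sorted_values, p):
    """Linear-interpolated percentile (p in 0..100) of an already sorted list."""
    k = (len(sorted_values) - 1) * p / 100
    lower = int(k)
    upper = min(lower + 1, len(sorted_values) - 1)
    return sorted_values[lower] + (sorted_values[upper] - sorted_values[lower]) * (k - lower)

def winsorize(values, lower_pct=5, upper_pct=95):
    """Cap values below/above the given percentiles; row count and order are unchanged."""
    ordered = sorted(values)
    low = percentile(ordered, lower_pct)
    high = percentile(ordered, upper_pct)
    return [min(max(v, low), high) for v in values]

values = [1, 20, 21, 22, 23, 24, 25, 26, 27, 500]
capped = winsorize(values)
```

The two extreme entries (1 and 500) are pulled toward the rest of the data, while the eight values in between are returned unchanged and no rows are dropped.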
Manual review
Some issues cannot be auto-fixed:
- GDPR fields — require human decision on data handling
- Orphaned references — require cross-file resolution
- Non-numeric anomalies — flagged for review
These appear in the plan but cannot be selected for auto-fix. They are marked with a “Manual” label in the remediation panel.
Score impact
Each remediation action has an estimated impact on your AI readiness score, depending on the type of issue and which dimension it falls under:
| Issue type | Typical impact | Dimension affected |
|---|---|---|
| Null values | High | Completeness |
| Duplicates | Medium | Uniqueness |
| Format violations | Medium | Consistency |
| Outliers | Low to medium | Referential integrity |
The estimated score improvement is shown in the plan summary. Actual improvement varies based on the specific data distribution and which dimensions were dragging the score down. For the full list of dimensions and how they combine, see the AI Readiness methodology.
Best practices
- Start with high-impact fixes: null imputation on critical columns has the biggest score impact
- Review samples carefully: check the before/after previews to ensure the strategy is appropriate
- Apply in batches: apply a few fixes, re-score, then decide on next steps
- Keep originals: ORCA always preserves your original file, but downloading a backup is good practice
- Re-run AI readiness after remediation to see updated scores and, optionally, generate an assessment report