
Overview

Most tools tell you a column is a string or float64. ORCA goes further: it identifies that the column is an email address, a Swedish personnummer, a currency amount, or a medical diagnosis. This semantic understanding powers every downstream feature: quality rules, GDPR detection, AI readiness scoring, and remediation suggestions. ORCA classifies columns into 50+ semantic categories across 10 domains, and attaches a confidence score to each classification so you can see how certain the system is.

How classification works

Classification uses a multi-signal pipeline. Each stage adds confidence, and later stages can override earlier ones when they have stronger evidence.

Stage 1: Fast classifier (deterministic)

The fast classifier runs first, using pattern matching and statistical analysis — no AI calls required. It handles columns with unambiguous signals:
  • Name patterns — column named email with values containing @ signs
  • Regex matches — values matching known formats (UUIDs, IBANs, postal codes)
  • Statistical profile — numeric range, cardinality, entropy
  • PII patterns — regex-confirmed sensitive data (credit cards, national IDs)
The fast classifier only fires when multiple independent signals agree. Anything ambiguous is passed to Stage 2 rather than guessed.
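As a rough illustration of the "multiple independent signals must agree" rule, the sketch below classifies a column only when a name hint and a value pattern both point to the same category. The `RULES` table, the `fast_classify` function, and the 95% match threshold are assumptions made for this example; ORCA's actual patterns and thresholds are not documented here.

```python
import re
from typing import Optional

# Illustrative patterns only; the real rule set is not published.
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")
UUID_RE = re.compile(
    r"^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$",
    re.IGNORECASE,
)

# category -> (substrings expected in the column name, pattern expected in values)
RULES = {
    "email": (("email", "e_mail", "mail"), EMAIL_RE),
    "uuid": (("uuid", "guid"), UUID_RE),
}


def fast_classify(name: str, values: list[str], min_match: float = 0.95) -> Optional[str]:
    """Return a category only when the name signal and the value signal agree."""
    non_empty = [v.strip() for v in values if v and v.strip()]
    if not non_empty:
        return None
    lowered = name.lower()
    for category, (hints, pattern) in RULES.items():
        name_signal = any(hint in lowered for hint in hints)
        match_rate = sum(bool(pattern.match(v)) for v in non_empty) / len(non_empty)
        if name_signal and match_rate >= min_match:
            return category
    return None  # ambiguous: hand the column to the next stage instead of guessing
```

A column named `contact` that happens to contain email addresses returns `None` here, because only one signal fires. That is the behaviour described above: the column is deferred rather than guessed.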

Stage 2: Gemini AI (semantic)

Columns that the fast classifier cannot resolve are sent to Google Gemini with a rich fingerprint context:
  • Column name and data type
  • Sample values (PII-masked for GDPR compliance)
  • Value distribution statistics
  • All other columns in the dataset (inter-column context)
  • Detected domain (financial, HR, medical, etc.)
The AI reasons about columns as a dataset, not in isolation. A column named status next to employee_name and department is classified differently than status next to invoice_number and due_date.
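To make the fingerprint idea concrete, here is a minimal sketch of how such a context could be assembled before it is sent to the model. The field names, the `build_fingerprint` and `mask_value` helpers, and the masking strategy (replace emails with a placeholder, replace every digit with 9) are all hypothetical; the page does not specify ORCA's actual payload or masking rules, and no Gemini API call is shown.

```python
import re
from collections import Counter

EMAIL_RE = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")
DIGIT_RE = re.compile(r"\d")


def mask_value(value: str) -> str:
    """Hide literal content while keeping the overall shape of the value."""
    value = EMAIL_RE.sub("<email>", value)
    return DIGIT_RE.sub("9", value)


def build_fingerprint(name, dtype, values, other_columns, domain, n_samples=5):
    """Collect the per-column context listed above into one dictionary."""
    counts = Counter(values)
    return {
        "column_name": name,
        "data_type": dtype,
        "sample_values": [mask_value(v) for v in values[:n_samples]],
        "stats": {
            "row_count": len(values),
            "distinct_count": len(counts),
            "top_values": [mask_value(v) for v, _ in counts.most_common(3)],
        },
        "other_columns": other_columns,  # inter-column context
        "detected_domain": domain,
    }
```

Passing the full list of sibling columns is what lets a model distinguish `status` in an HR table from `status` in an invoicing table.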

Stage 3: Inter-column context

The AI considers neighbouring columns to disambiguate:
| Column | Neighbours | Classification |
| --- | --- | --- |
| amount | invoice_number, due_date | currency_amount |
| code | hospital_name, diagnosis | categorical_generic |
| rate | currency, exchange_date | float_measurement |
| id | first column, other data columns | integer_id |

Stage 4: Table-level classification

ORCA detects the overall domain of the dataset before finalising individual column classifications. Domains include: medical, financial, legal, scientific, hr, crm, ecommerce, logistics, it_operations, academic, real_estate, manufacturing, generic.

Knowing the dataset’s domain lets the engine resolve ambiguous columns toward categories that make sense in context: code near hospital_name is a different thing than code near invoice_number.
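One simple way to picture domain detection and contextual disambiguation is a keyword vote over column names followed by a lookup keyed on the column and the detected domain. The sketch below does that. `DOMAIN_HINTS`, `AMBIGUOUS`, `detect_domain`, and `resolve` are hypothetical names, the hint sets cover only three of the listed domains, and the mappings reuse the Stage 3 examples; the real engine's signals are not published.

```python
from collections import Counter
from typing import Optional

# Hypothetical column-name hints per domain (a small subset, for illustration).
DOMAIN_HINTS = {
    "medical": {"hospital_name", "diagnosis", "patient_id"},
    "financial": {"invoice_number", "due_date", "currency", "exchange_date"},
    "hr": {"employee_name", "department"},
}

# (ambiguous column name, detected domain) -> category, taken from the Stage 3 examples.
AMBIGUOUS = {
    ("amount", "financial"): "currency_amount",
    ("code", "medical"): "categorical_generic",
    ("rate", "financial"): "float_measurement",
}


def detect_domain(column_names: list[str]) -> str:
    """Pick the domain whose hints overlap most with the column names."""
    names = {n.lower() for n in column_names}
    votes = Counter({d: len(hints & names) for d, hints in DOMAIN_HINTS.items()})
    domain, hits = votes.most_common(1)[0]
    return domain if hits > 0 else "generic"


def resolve(column: str, column_names: list[str]) -> Optional[str]:
    """Resolve an ambiguous column using the table-level domain, if a rule exists."""
    return AMBIGUOUS.get((column.lower(), detect_domain(column_names)))
```

When no hint matches, the sketch falls back to `generic` and `resolve` returns `None`, leaving the column to whatever other evidence is available.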

Semantic categories

Categories are grouped by domain. Each category maps to specific quality rules and validation logic.

Identifiers

| Category | Description |
| --- | --- |
| integer_id | Numeric row or entity identifier |
| uuid | UUID v4 string identifier |
| product_sku | Product SKU or catalog code |
| case_number | Legal case or support ticket reference |
| invoice_number | Billing document reference |
| patient_id | Healthcare patient identifier (GDPR-relevant) |
| sample_id | Scientific or lab specimen identifier |
| serial_number | Device or product serial number |

Contact

| Category | Description |
| --- | --- |
| email | Email address |
| phone_international | International phone number (E.164) |
| phone_nordic | Nordic-format phone number |
| url | URL or web address |

Location

| Category | Description |
| --- | --- |
| postal_code_se | Swedish postal code |
| postal_code_us | US ZIP code |
| postal_code_uk | UK postcode |
| postal_code_generic | Other postal code format |
| country_code_iso2 | ISO 3166-1 alpha-2 country code |
| country_code_iso3 | ISO 3166-1 alpha-3 country code |
| city_name | City or town name |
| street_address | Street address |

Dates and times

| Category | Description |
| --- | --- |
| date_iso | ISO 8601 date (YYYY-MM-DD) |
| date_eu_format | European date format (DD/MM/YYYY) |
| date_us_format | US date format (MM/DD/YYYY) |
| datetime | Combined date and time |
| time_of_day | Time value (HH:MM:SS) |

Financial

| Category | Description |
| --- | --- |
| currency_amount | Monetary value |
| percentage | Percentage value |
| iban | International Bank Account Number |

| Category | Description |
| --- | --- |
| swedish_personnummer | Swedish personal identity number (GDPR) |
| swedish_org_number | Swedish organisation number |
| norwegian_personnummer | Norwegian personal identity number (GDPR) |
| finnish_personal_id | Finnish personal identity code (GDPR) |
| vat_number_se | Swedish VAT number |

Text and categoricals

| Category | Description |
| --- | --- |
| text_name | Person name (first, last, or full) |
| text_description | Free-form description or notes |
| text_notes | Comments, remarks, annotations |
| text_diagnosis | Medical diagnosis text (GDPR-relevant) |
| text_address | Full postal address as a single field |
| text_legal | Legal clauses or contract terms |
| text_code | Source code snippets or scripts |
| categorical_status | Status field (active, inactive, pending) |
| categorical_generic | Other low-cardinality text field |
| boolean_flag | True/false or yes/no field |

Network

| Category | Description |
| --- | --- |
| ip_address_v4 | IPv4 address |

Numeric

| Category | Description |
| --- | --- |
| integer_quantity | Whole number quantity |
| float_measurement | Decimal measurement value |
| integer_score_or_rating | Score or rating (integer) |
| numeric_age | Age in years |
| numeric_score | Test scores, exam grades |
| numeric_measurement | Physical measurements (temperature, weight) |
| numeric_percentage | Ratio expressed as 0-100 or 0-1 |
| numeric_count | Headcount, unit quantity, occurrence count |

Confidence scores

Every classification includes a confidence score from 0.0 to 1.0 (displayed as 0-100%). Confidence reflects how strongly the engine’s signals agree on the column’s category — a column with a confirmed regex pattern, a recognisable name, and a domain-appropriate context will land near the top of the range; a column with weaker or conflicting signals will land lower.
| Confidence band | Typical meaning | Action |
| --- | --- | --- |
| Very high | Pattern confirmed by multiple converging signals | No action needed |
| High | Strong name and value signals agree | No action needed |
| Moderate | Reasonable evidence, but not unanimous | Review recommended |
| Low | Signals are weak or conflicting | Sent to the clarification queue for user review |
The exact band cut-offs and the per-band action thresholds are tuned over time and are not published. Low-confidence columns surface in the job results view so a human can confirm or correct them — those corrections feed the org-memory loop described below.
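If you consume confidence scores in your own tooling, you may want to bucket them locally. The sketch below shows one way to do that. The cut-off values are invented purely for illustration, since the real thresholds are not published, and the function names are hypothetical; tune the numbers to your own review capacity rather than treating them as ORCA's.

```python
# HYPOTHETICAL cut-offs, highest first. These are NOT ORCA's real thresholds.
BANDS = [(0.90, "Very high"), (0.75, "High"), (0.50, "Moderate")]


def band_for(confidence: float) -> str:
    """Map a 0.0-1.0 confidence score to a band label."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be between 0.0 and 1.0")
    for cutoff, label in BANDS:
        if confidence >= cutoff:
            return label
    return "Low"


def needs_review(confidence: float) -> bool:
    """Moderate and Low bands are the ones the table flags for human attention."""
    return band_for(confidence) in {"Moderate", "Low"}


def as_percent(confidence: float) -> str:
    """Render the score the way it is displayed, as a whole percentage."""
    return f"{round(confidence * 100)}%"
```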

Org memory

ORCA learns from your corrections. When you correct a classification, the system stores the mapping in your organisation’s memory. On future scans:
  1. The engine checks if it has seen a similar column before (same name, similar values)
  2. If a match is found, the stored classification is used with boosted confidence
  3. This creates a feedback loop where accuracy improves over time for your specific data
Org memory is scoped to your organisation — it never leaks across tenants.
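The lookup-then-boost loop can be pictured with a small tenant-scoped store. This is a simplified sketch under stated assumptions: the `OrgMemory` class, the `classify_with_memory` function, and the fixed `boost` value are hypothetical, and matching here uses the column name only, whereas the engine is described as also comparing values.

```python
from dataclasses import dataclass, field


@dataclass
class OrgMemory:
    """Illustrative store of corrected classifications, keyed by organisation."""
    _store: dict = field(default_factory=dict)

    def remember(self, org_id: str, column_name: str, semantic_type: str) -> None:
        self._store[(org_id, column_name.lower())] = semantic_type

    def recall(self, org_id: str, column_name: str):
        # The organisation id is part of the key, so one tenant never sees another's entries.
        return self._store.get((org_id, column_name.lower()))


def classify_with_memory(memory, org_id, column_name, engine_guess, engine_confidence, boost=0.2):
    """Prefer a remembered correction and raise its confidence, capped at 1.0."""
    remembered = memory.recall(org_id, column_name)
    if remembered is None:
        return engine_guess, engine_confidence
    return remembered, min(1.0, engine_confidence + boost)
```

Keying every entry on the organisation id is the simplest way to get the isolation property: a correction stored for one organisation is invisible to lookups made for another.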

Correcting a classification

In the web app

Navigate to the job detail page, find the column, and select the correct category from the dropdown. The correction is saved immediately and added to org memory.

Via the API

PATCH /api/v1/columns/{column_id}
Content-Type: application/json
Authorization: Bearer <token>

{
  "semantic_type": "currency_amount",
  "confidence_override": 0.95
}
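
The same request can be built from Python using only the standard library. This is a minimal sketch: the base URL is a placeholder, the `build_correction_request` helper is hypothetical, and treating `confidence_override` as optional and bounded to 0.0-1.0 is an assumption based on the examples on this page rather than a documented guarantee.

```python
import json
import urllib.parse
import urllib.request


def build_correction_request(base_url, token, column_id, semantic_type, confidence_override=None):
    """Build (but do not send) the PATCH request for a single column correction."""
    body = {"semantic_type": semantic_type}
    if confidence_override is not None:
        # Assumption: the override uses the same 0.0-1.0 scale as confidence scores.
        if not 0.0 <= confidence_override <= 1.0:
            raise ValueError("confidence_override must be between 0.0 and 1.0")
        body["confidence_override"] = confidence_override
    column_path = urllib.parse.quote(str(column_id), safe="")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/api/v1/columns/{column_path}",
        data=json.dumps(body).encode("utf-8"),
        method="PATCH",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
    )
```

To send it, pass the returned object to `urllib.request.urlopen(...)` inside a `with` block and handle `urllib.error.HTTPError` for non-2xx responses.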

Batch corrections

To correct multiple columns at once:
PATCH /api/v1/columns/batch
Content-Type: application/json
Authorization: Bearer <token>

{
  "corrections": [
    {"column_id": "uuid-1", "semantic_type": "email"},
    {"column_id": "uuid-2", "semantic_type": "currency_amount"},
    {"column_id": "uuid-3", "semantic_type": "date_iso"}
  ]
}
Each correction is stored in org memory and influences future classifications of columns with similar names and value patterns.
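
When corrections come out of a script or a review spreadsheet, it can be convenient to build the batch body from a mapping. The helper below is a hypothetical sketch that produces the JSON shape shown above; it assumes one correction per column ID, which a dictionary enforces naturally.

```python
import json


def build_batch_payload(corrections: dict[str, str]) -> str:
    """Serialise {column_id: semantic_type} into the batch request body."""
    if not corrections:
        raise ValueError("at least one correction is required")
    items = [
        {"column_id": column_id, "semantic_type": semantic_type}
        for column_id, semantic_type in corrections.items()
    ]
    return json.dumps({"corrections": items}, indent=2)
```

The returned string can be encoded as UTF-8 and sent as the body of the `PATCH /api/v1/columns/batch` request with the same headers as a single correction.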

Next steps

  • AI readiness: see how classification feeds into readiness scoring
  • Data contracts: set quality rules based on classified column types