Classification

Overview

Most tools tell you a column is a string or float64. ORCA goes further: it identifies that the column is an email address, a Swedish personnummer, a currency amount, or a medical diagnosis. This semantic understanding powers every downstream feature — quality rules, GDPR detection, AI readiness scoring, and remediation suggestions. ORCA classifies columns into 50+ semantic categories across 10 domains, with confidence scores that tell you exactly how certain the system is about each classification.

How classification works

Classification uses a multi-signal pipeline. Each stage adds confidence, and later stages can override earlier ones when they have stronger evidence.

Stage 1: Fast classifier (deterministic)

The fast classifier runs first, using pattern matching and statistical analysis — no AI calls required. It handles columns with unambiguous signals:

Name patterns — column named email with values containing @ signs
Regex matches — values matching known formats (UUIDs, IBANs, postal codes)
Statistical profile — numeric range, cardinality, entropy
PII patterns — regex-confirmed sensitive data (credit cards, national IDs)

The fast classifier only fires when multiple independent signals agree. Anything ambiguous is passed to Stage 2 rather than guessed.
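
The "multiple independent signals must agree" rule can be illustrated with a minimal sketch. The pattern table, the name hints, the `fast_classify` function and the 0.95 match ratio are all assumptions made for this example; ORCA's actual patterns and thresholds are not documented here.

```python
import re

# Hypothetical value patterns; ORCA's real pattern set is not published.
VALUE_PATTERNS = {
    "email": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
    "uuid": re.compile(
        r"^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$", re.I
    ),
}

# Hypothetical substrings that count as a column-name signal.
NAME_HINTS = {
    "email": ("email", "e_mail", "mail"),
    "uuid": ("uuid", "guid"),
}


def fast_classify(column_name, values, min_match_ratio=0.95):
    """Return a category only when the name signal and the value signal agree."""
    name = column_name.lower()
    non_empty = [v for v in values if v]
    if not non_empty:
        return None
    for category, pattern in VALUE_PATTERNS.items():
        name_signal = any(hint in name for hint in NAME_HINTS[category])
        matches = sum(1 for v in non_empty if pattern.match(v))
        value_signal = matches / len(non_empty) >= min_match_ratio
        if name_signal and value_signal:
            return category
    return None  # ambiguous: leave the column for the next stage
```

A column named `notes` that happens to contain email addresses returns `None` here, because only one of the two signals fires.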

Stage 2: Gemini AI (semantic)

Columns that the fast classifier cannot resolve are sent to Google Gemini with a rich fingerprint context:

Column name and data type
Sample values (PII-masked for GDPR compliance)
Value distribution statistics
All other columns in the dataset (inter-column context)
Detected domain (financial, HR, medical, etc.)

The AI reasons about columns as a dataset, not in isolation. A column named status next to employee_name and department is classified differently than status next to invoice_number and due_date.
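
As a rough illustration of the fingerprint described above, the sketch below assembles the listed fields into a dictionary and masks sample values before they would leave the system. The field names, the `mask_value` helper and the masking rules are assumptions for this example, not ORCA's actual payload format or masking logic.

```python
import re
from collections import Counter

EMAIL_RE = re.compile(r"[^@\s]+@[^@\s]+")
DIGIT_RE = re.compile(r"\d")


def mask_value(value):
    """Hide identifying characters while keeping the value's overall shape."""
    masked = EMAIL_RE.sub("<email>", value)
    return DIGIT_RE.sub("9", masked)


def build_fingerprint(name, dtype, values, other_columns, domain, sample_size=5):
    """Collect the per-column context listed above into one dictionary."""
    counts = Counter(values)
    return {
        "column_name": name,
        "data_type": dtype,
        "sample_values": [mask_value(v) for v in values[:sample_size]],
        "distinct_count": len(counts),
        "top_values": [mask_value(v) for v, _ in counts.most_common(3)],
        "other_columns": other_columns,  # inter-column context
        "detected_domain": domain,
    }
```

Replacing every digit with `9` keeps the format of an identifier visible (length, separators) without exposing the real number.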

Stage 3: Inter-column context

The AI considers neighbouring columns to disambiguate:

Column	Neighbours	Classification
`amount`	`invoice_number`, `due_date`	`currency_amount`
`code`	`hospital_name`, `diagnosis`	`categorical_generic`
`rate`	`currency`, `exchange_date`	`float_measurement`
`id`	first column, other data columns	`integer_id`
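
The rows of this table can be read as "if these neighbours are present, prefer this category". The sketch below encodes them as a static lookup purely for illustration; in ORCA this reasoning is done by the AI model, and `CONTEXT_RULES` and `classify_with_neighbours` are hypothetical names.

```python
# Rules transcribed from the table above: (column, required neighbours, category).
CONTEXT_RULES = [
    ("amount", {"invoice_number", "due_date"}, "currency_amount"),
    ("code", {"hospital_name", "diagnosis"}, "categorical_generic"),
    ("rate", {"currency", "exchange_date"}, "float_measurement"),
]


def classify_with_neighbours(column, all_columns):
    """Return a category when the column's neighbours match a known rule."""
    # Positional rule from the last table row: `id` as the first of several columns.
    if column == "id" and all_columns and all_columns[0] == "id" and len(all_columns) > 1:
        return "integer_id"
    neighbours = set(all_columns) - {column}
    for name, required, category in CONTEXT_RULES:
        if column == name and required <= neighbours:
            return category
    return None  # no rule applies; context gives no extra evidence
```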

Stage 4: Table-level classification

ORCA detects the overall domain of the dataset before finalising individual column classifications. Domains include: medical, financial, legal, scientific, hr, crm, ecommerce, logistics, it_operations, academic, real_estate, manufacturing, and generic. Knowing the dataset’s domain lets the engine resolve ambiguous columns toward categories that make sense in context: code near hospital_name means something different from code near invoice_number.
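
One simple way to picture table-level detection is keyword voting over column names. This is only a sketch under stated assumptions: the keyword sets and the `detect_domain` function are invented for the example and cover three of the listed domains; the actual detection method is not described in this document.

```python
# Hypothetical keyword sets for a few of the domains listed above.
DOMAIN_KEYWORDS = {
    "medical": {"hospital_name", "diagnosis", "patient_id"},
    "financial": {"invoice_number", "due_date", "iban", "currency"},
    "hr": {"employee_name", "department", "salary"},
}


def detect_domain(column_names):
    """Pick the domain whose keywords overlap most with the column names."""
    names = {c.lower() for c in column_names}
    scores = {domain: len(names & kws) for domain, kws in DOMAIN_KEYWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "generic"
```

With this sketch, a table containing `code`, `hospital_name` and `diagnosis` resolves to `medical`, while `code` next to `invoice_number` resolves to `financial`.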

Semantic categories

Categories are grouped by domain. Each category maps to specific quality rules and validation logic.

Identifiers

Category	Description
`integer_id`	Numeric row or entity identifier
`uuid`	UUID v4 string identifier
`product_sku`	Product SKU or catalog code
`case_number`	Legal case or support ticket reference
`invoice_number`	Billing document reference
`patient_id`	Healthcare patient identifier (GDPR-relevant)
`sample_id`	Scientific or lab specimen identifier
`serial_number`	Device or product serial number

Contact

Category	Description
`email`	Email address
`phone_international`	International phone number (E.164)
`phone_nordic`	Nordic-format phone number
`url`	URL or web address

Location

Category	Description
`postal_code_se`	Swedish postal code
`postal_code_us`	US ZIP code
`postal_code_uk`	UK postcode
`postal_code_generic`	Other postal code format
`country_code_iso2`	ISO 3166-1 alpha-2 country code
`country_code_iso3`	ISO 3166-1 alpha-3 country code
`city_name`	City or town name
`street_address`	Street address

Dates and times

Category	Description
`date_iso`	ISO 8601 date (YYYY-MM-DD)
`date_eu_format`	European date format (DD/MM/YYYY)
`date_us_format`	US date format (MM/DD/YYYY)
`datetime`	Combined date and time
`time_of_day`	Time value (HH:MM:SS)

Financial

Category	Description
`currency_amount`	Monetary value
`percentage`	Percentage value
`iban`	International Bank Account Number

Nordic legal identifiers

Category	Description
`swedish_personnummer`	Swedish personal identity number (GDPR)
`swedish_org_number`	Swedish organisation number
`norwegian_personnummer`	Norwegian personal identity number (GDPR)
`finnish_personal_id`	Finnish personal identity code (GDPR)
`vat_number_se`	Swedish VAT number

Text and categoricals

Category	Description
`text_name`	Person name (first, last, or full)
`text_description`	Free-form description or notes
`text_notes`	Comments, remarks, annotations
`text_diagnosis`	Medical diagnosis text (GDPR-relevant)
`text_address`	Full postal address as a single field
`text_legal`	Legal clauses or contract terms
`text_code`	Source code snippets or scripts
`categorical_status`	Status field (active, inactive, pending)
`categorical_generic`	Other low-cardinality text field
`boolean_flag`	True/false or yes/no field

Network

Category	Description
`ip_address_v4`	IPv4 address

Numeric

Category	Description
`integer_quantity`	Whole number quantity
`float_measurement`	Decimal measurement value
`integer_score_or_rating`	Score or rating (integer)
`numeric_age`	Age in years
`numeric_score`	Test scores, exam grades
`numeric_measurement`	Physical measurements (temperature, weight)
`numeric_percentage`	Ratio expressed as 0-100 or 0-1
`numeric_count`	Headcount, unit quantity, occurrence count

Confidence scores

Every classification includes a confidence score from 0.0 to 1.0 (displayed as 0-100%). Confidence reflects how strongly the engine’s signals agree on the column’s category — a column with a confirmed regex pattern, a recognisable name, and a domain-appropriate context will land near the top of the range; a column with weaker or conflicting signals will land lower.

Confidence band	Typical meaning	Action
Very high	Pattern confirmed by multiple converging signals	No action needed
High	Strong name and value signals agree	No action needed
Moderate	Reasonable evidence, but not unanimous	Review recommended
Low	Signals are weak or conflicting	Sent to the clarification queue for user review

The exact band cut-offs and the per-band action thresholds are tuned over time and are not published. Low-confidence columns surface in the job results view so a human can confirm or correct them — those corrections feed the org-memory loop described below.
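
To make the band table concrete, the sketch below maps a 0.0 to 1.0 score to a band and its action. The cut-off values are placeholders chosen for illustration only, since the real ones are not published, and `confidence_band` is a hypothetical helper.

```python
# Placeholder cut-offs for illustration; the real thresholds are not published.
BANDS = [
    (0.95, "Very high", "No action needed"),
    (0.85, "High", "No action needed"),
    (0.60, "Moderate", "Review recommended"),
    (0.00, "Low", "Sent to the clarification queue for user review"),
]


def confidence_band(score):
    """Return (band, action) for a confidence score between 0.0 and 1.0."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("confidence must be between 0.0 and 1.0")
    for floor, band, action in BANDS:
        if score >= floor:
            return band, action
```

For display, a score can be rendered as a percentage with `f"{score:.0%}"`.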

Org memory

ORCA learns from your corrections. When you correct a classification, the system stores the mapping in your organisation’s memory. On future scans:

The engine checks if it has seen a similar column before (same name, similar values)
If a match is found, the stored classification is used with boosted confidence
This creates a feedback loop where accuracy improves over time for your specific data

Org memory is scoped to your organisation — it never leaks across tenants.
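
The lookup-and-boost loop can be sketched as a small in-memory store keyed by organisation. Everything here is an assumption made for illustration: the `OrgMemory` class, the name normalisation and the 0.2 boost. The sketch also matches on column name only, whereas the description above mentions value similarity as well.

```python
class OrgMemory:
    """Toy per-organisation store of corrected classifications."""

    def __init__(self):
        # (org_id, normalised column name) -> corrected category
        self._store = {}

    @staticmethod
    def _normalise(name):
        return name.strip().lower().replace(" ", "_")

    def record_correction(self, org_id, column_name, category):
        self._store[(org_id, self._normalise(column_name))] = category

    def lookup(self, org_id, column_name, base_confidence, boost=0.2):
        """Return (category, boosted confidence), or None if nothing is stored."""
        category = self._store.get((org_id, self._normalise(column_name)))
        if category is None:
            return None
        return category, min(1.0, base_confidence + boost)
```

Because the organisation ID is part of every key, a correction recorded for one tenant is never returned for another.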

Correcting a classification

In the web app

Navigate to the job detail page, find the column, and select the correct category from the dropdown. The correction is saved immediately and added to org memory.

Via the API

PATCH /api/v1/columns/{column_id}
Content-Type: application/json
Authorization: Bearer <token>

{
  "semantic_type": "currency_amount",
  "confidence_override": 0.95
}
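
A minimal Python sketch of building this request with the standard library follows. The host in `BASE_URL` is a placeholder and `build_correction_request` is a hypothetical helper; only the path, headers and body fields come from the request shown above.

```python
import json
import urllib.request

BASE_URL = "https://orca.example.com"  # placeholder host, replace with your own


def build_correction_request(column_id, semantic_type, token, confidence_override=None):
    """Build (but do not send) the PATCH request for a single column."""
    body = {"semantic_type": semantic_type}
    if confidence_override is not None:
        body["confidence_override"] = confidence_override
    return urllib.request.Request(
        url=f"{BASE_URL}/api/v1/columns/{column_id}",
        data=json.dumps(body).encode("utf-8"),
        method="PATCH",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
    )
```

Passing the returned object to `urllib.request.urlopen` sends the request.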

Batch corrections

To correct multiple columns at once:

PATCH /api/v1/columns/batch
Content-Type: application/json
Authorization: Bearer <token>

{
  "corrections": [
    {"column_id": "uuid-1", "semantic_type": "email"},
    {"column_id": "uuid-2", "semantic_type": "currency_amount"},
    {"column_id": "uuid-3", "semantic_type": "date_iso"}
  ]
}

Each correction is stored in org memory and influences future classifications of columns with similar names and value patterns.
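
When corrections are collected programmatically, the batch body can be generated from a mapping of column IDs to categories. This is a small sketch; `build_batch_payload` is a hypothetical helper, and the output mirrors the body shown above.

```python
import json


def build_batch_payload(corrections):
    """Turn {column_id: semantic_type} into the JSON body for the batch endpoint."""
    return json.dumps(
        {
            "corrections": [
                {"column_id": column_id, "semantic_type": semantic_type}
                for column_id, semantic_type in corrections.items()
            ]
        },
        indent=2,
    )
```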

Next steps

AI readiness

Data contracts