Overview
Most tools tell you a column is a string or float64. ORCA goes further: it identifies that the column is an email address, a Swedish personnummer, a currency amount, or a medical diagnosis. This semantic understanding powers every downstream feature: quality rules, GDPR detection, AI readiness scoring, and remediation suggestions.
ORCA classifies columns into 50+ semantic categories across 10 domains, with confidence scores that tell you exactly how certain the system is about each classification.
How classification works
Classification uses a multi-signal pipeline. Each stage adds confidence, and later stages can override earlier ones when they have stronger evidence.
Stage 1: Fast classifier (deterministic)
The fast classifier runs first, using pattern matching and statistical analysis, with no AI calls required. It handles columns with unambiguous signals:
- Name patterns: a column named email with values containing @ signs
- Regex matches: values matching known formats (UUIDs, IBANs, postal codes)
- Statistical profile: numeric range, cardinality, entropy
- PII patterns: regex-confirmed sensitive data (credit cards, national IDs)
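The signals listed above can be sketched in a few lines of Python. This is a minimal illustration, not ORCA's implementation: the pattern set, the `fast_classify` function, and the `min_match` threshold are all assumptions made for the example.

```python
import math
import re
from collections import Counter

# Hypothetical value patterns; ORCA's actual pattern set is not documented here.
VALUE_PATTERNS = {
    "email": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
    "uuid": re.compile(
        r"^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$", re.I
    ),
    "ip_address_v4": re.compile(r"^\d{1,3}(\.\d{1,3}){3}$"),
}


def shannon_entropy(values):
    """Entropy in bits of the value distribution."""
    counts = Counter(values)
    total = len(values)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def fast_classify(name, values, min_match=0.95):
    """Return (category, signals), or (None, signals) if the column is unresolved."""
    values = [v for v in values if v not in (None, "")]
    if not values:
        return None, {}
    # Statistical profile: cardinality and entropy of the non-empty values.
    signals = {
        "cardinality": len(set(values)),
        "entropy": shannon_entropy(values),
    }
    for category, pattern in VALUE_PATTERNS.items():
        # Regex match: share of values that fit the known format.
        match_rate = sum(bool(pattern.match(str(v))) for v in values) / len(values)
        if match_rate >= min_match:
            signals["match_rate"] = match_rate
            # Name pattern: does the column name mention the category?
            signals["name_agrees"] = category.split("_")[0] in name.lower()
            return category, signals
    return None, signals
```

A column that returns `None` here is the kind that would move on to the next stage.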
Stage 2: Gemini AI (semantic)
Columns that the fast classifier cannot resolve are sent to Google Gemini with a rich fingerprint context:
- Column name and data type
- Sample values (PII-masked for GDPR compliance)
- Value distribution statistics
- All other columns in the dataset (inter-column context)
- Detected domain (financial, HR, medical, etc.)
For example, a status column next to employee_name and department is classified differently from a status column next to invoice_number and due_date.
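As a rough sketch of what such a fingerprint might contain, the following builds a dictionary with the fields listed above and masks sample values before they leave the system. The field names, the `mask_value` strategy, and `build_fingerprint` itself are hypothetical; the actual payload sent to Gemini is not specified in this document, and no model call is made here.

```python
import re
from collections import Counter

EMAIL_RE = re.compile(r"[^@\s]+@[^@\s]+")
DIGIT_RE = re.compile(r"\d")


def mask_value(value):
    """Mask likely PII while preserving the overall shape of the value."""
    text = str(value)
    text = EMAIL_RE.sub("xxx@xxx", text)  # hide the email local part and domain
    return DIGIT_RE.sub("9", text)        # keep digit positions, hide the digits


def build_fingerprint(name, dtype, values, other_columns, domain, n_samples=5):
    """Assemble the context for one column (hypothetical field names)."""
    counts = Counter(values)
    return {
        "column_name": name,
        "data_type": dtype,
        "sample_values": [mask_value(v) for v in values[:n_samples]],
        "stats": {
            "row_count": len(values),
            "distinct_count": len(counts),
            "top_values": [mask_value(v) for v, _ in counts.most_common(3)],
        },
        "other_columns": other_columns,  # inter-column context
        "detected_domain": domain,
    }
```

Masking digits with a placeholder rather than dropping them keeps the format visible (length, separators), which is the part a classifier needs.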
Stage 3: Inter-column context
The AI considers neighbouring columns to disambiguate:
| Column | Neighbours | Classification |
|---|---|---|
| amount | invoice_number, due_date | currency_amount |
| code | hospital_name, diagnosis | categorical_generic |
| rate | currency, exchange_date | float_measurement |
| id | first column, other data columns | integer_id |
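To make the idea concrete, the rows of the table above can be expressed as a small rule lookup. ORCA delegates this reasoning to the model, so the rule list and the `classify_with_context` function below are only an illustrative stand-in, not how the engine is built.

```python
# Rules transcribed from the table above:
# (column name, neighbours that must all be present, resulting category).
CONTEXT_RULES = [
    ("amount", {"invoice_number", "due_date"}, "currency_amount"),
    ("code", {"hospital_name", "diagnosis"}, "categorical_generic"),
    ("rate", {"currency", "exchange_date"}, "float_measurement"),
]


def classify_with_context(column, all_columns):
    """Return the category of the first matching rule, or None if no rule applies."""
    neighbours = {c.lower() for c in all_columns if c != column}
    for name, required, category in CONTEXT_RULES:
        if column.lower() == name and required <= neighbours:
            return category
    # Positional rule from the table: an id in the first position,
    # followed by other data columns, is treated as a row identifier.
    if column.lower() == "id" and len(all_columns) > 1 and all_columns[0] == column:
        return "integer_id"
    return None
```

The same column name yields a category only when its neighbours support it; `amount` beside unrelated columns stays unresolved.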
Stage 4: Table-level classification
ORCA detects the overall domain of the dataset as a whole before finalising individual column classifications. Domains include: medical, financial, legal, scientific, hr, crm, ecommerce, logistics, it_operations, academic, real_estate, manufacturing, generic
Knowing the dataset’s domain lets the engine resolve ambiguous columns toward categories that make sense in context — code near hospital_name is a different thing than code near invoice_number.
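One simple way to picture table-level detection is keyword voting over column names. The sketch below is an assumption for illustration: the hint keywords, the three-domain subset, and `detect_domain` are invented here, and ORCA's actual domain detection may work quite differently.

```python
from collections import Counter

# Hypothetical keyword hints for a subset of the domains listed above.
DOMAIN_HINTS = {
    "medical": {"patient", "diagnosis", "hospital"},
    "financial": {"invoice", "iban", "currency", "amount"},
    "hr": {"employee", "department", "salary"},
}


def detect_domain(column_names):
    """Count one vote per column name that contains a domain's hint keyword."""
    votes = Counter()
    for name in column_names:
        lowered = name.lower()
        for domain, hints in DOMAIN_HINTS.items():
            if any(hint in lowered for hint in hints):
                votes[domain] += 1
    if not votes:
        return "generic"  # fall back when no hint matches
    return votes.most_common(1)[0][0]
```

With a detected domain in hand, an ambiguous column such as code can be resolved toward the categories that fit that domain.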
Semantic categories
Categories are grouped by domain. Each category maps to specific quality rules and validation logic.
Identifiers
| Category | Description |
|---|---|
| integer_id | Numeric row or entity identifier |
| uuid | UUID v4 string identifier |
| product_sku | Product SKU or catalog code |
| case_number | Legal case or support ticket reference |
| invoice_number | Billing document reference |
| patient_id | Healthcare patient identifier (GDPR-relevant) |
| sample_id | Scientific or lab specimen identifier |
| serial_number | Device or product serial number |
Contact
| Category | Description |
|---|---|
| email | Email address |
| phone_international | International phone number (E.164) |
| phone_nordic | Nordic-format phone number |
| url | URL or web address |
Location
| Category | Description |
|---|---|
| postal_code_se | Swedish postal code |
| postal_code_us | US ZIP code |
| postal_code_uk | UK postcode |
| postal_code_generic | Other postal code format |
| country_code_iso2 | ISO 3166-1 alpha-2 country code |
| country_code_iso3 | ISO 3166-1 alpha-3 country code |
| city_name | City or town name |
| street_address | Street address |
Dates and times
| Category | Description |
|---|---|
| date_iso | ISO 8601 date (YYYY-MM-DD) |
| date_eu_format | European date format (DD/MM/YYYY) |
| date_us_format | US date format (MM/DD/YYYY) |
| datetime | Combined date and time |
| time_of_day | Time value (HH:MM:SS) |
Financial
| Category | Description |
|---|---|
| currency_amount | Monetary value |
| percentage | Percentage value |
| iban | International Bank Account Number |
Nordic legal identifiers
| Category | Description |
|---|---|
| swedish_personnummer | Swedish personal identity number (GDPR) |
| swedish_org_number | Swedish organisation number |
| norwegian_personnummer | Norwegian personal identity number (GDPR) |
| finnish_personal_id | Finnish personal identity code (GDPR) |
| vat_number_se | Swedish VAT number |
Text and categoricals
| Category | Description |
|---|---|
| text_name | Person name (first, last, or full) |
| text_description | Free-form description or notes |
| text_notes | Comments, remarks, annotations |
| text_diagnosis | Medical diagnosis text (GDPR-relevant) |
| text_address | Full postal address as a single field |
| text_legal | Legal clauses or contract terms |
| text_code | Source code snippets or scripts |
| categorical_status | Status field (active, inactive, pending) |
| categorical_generic | Other low-cardinality text field |
| boolean_flag | True/false or yes/no field |
Network
| Category | Description |
|---|---|
| ip_address_v4 | IPv4 address |
Numeric
| Category | Description |
|---|---|
| integer_quantity | Whole number quantity |
| float_measurement | Decimal measurement value |
| integer_score_or_rating | Score or rating (integer) |
| numeric_age | Age in years |
| numeric_score | Test scores, exam grades |
| numeric_measurement | Physical measurements (temperature, weight) |
| numeric_percentage | Ratio expressed as 0-100 or 0-1 |
| numeric_count | Headcount, unit quantity, occurrence count |
Confidence scores
Every classification includes a confidence score from 0.0 to 1.0 (displayed as 0-100%). Confidence reflects how strongly the engine’s signals agree on the column’s category: a column with a confirmed regex pattern, a recognisable name, and a domain-appropriate context will land near the top of the range, while a column with weaker or conflicting signals will land lower.
| Confidence band | Typical meaning | Action |
|---|---|---|
| Very high | Pattern confirmed by multiple converging signals | No action needed |
| High | Strong name and value signals agree | No action needed |
| Moderate | Reasonable evidence, but not unanimous | Review recommended |
| Low | Signals are weak or conflicting | Sent to the clarification queue for user review |
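The mapping from score to band and action can be sketched as a threshold lookup. The numeric cut-offs below are assumptions chosen for illustration; this document names the bands but does not state where their boundaries lie.

```python
# Assumed cut-offs, for illustration only. Band names and actions come
# from the table above; the floor values are NOT documented thresholds.
BANDS = [
    (0.95, "Very high", "No action needed"),
    (0.80, "High", "No action needed"),
    (0.60, "Moderate", "Review recommended"),
    (0.00, "Low", "Sent to the clarification queue for user review"),
]


def band_for(confidence):
    """Return (band, action) for a confidence score in [0.0, 1.0]."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be between 0.0 and 1.0")
    for floor, band, action in BANDS:
        if confidence >= floor:
            return band, action


def as_percent(confidence):
    """Format a 0.0-1.0 score the way it is displayed (0-100%)."""
    return f"{round(confidence * 100)}%"
```

Keeping the thresholds in one ordered list makes the policy easy to audit and adjust without touching the lookup logic.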
Org memory
ORCA learns from your corrections. When you correct a classification, the system stores the mapping in your organisation’s memory. On future scans:
- The engine checks if it has seen a similar column before (same name, similar values)
- If a match is found, the stored classification is used with boosted confidence
- This creates a feedback loop where accuracy improves over time for your specific data
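The feedback loop described above might look like the following sketch. Everything here is hypothetical: the `OrgMemory` class, the Jaccard overlap used for "similar values", and the `boost` amount are assumptions, since the document does not describe how matching or boosting is implemented.

```python
from dataclasses import dataclass, field


def jaccard(a, b):
    """Overlap between two value collections, from 0.0 to 1.0."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0


@dataclass
class OrgMemory:
    # Maps a lower-cased column name to (corrected category, values seen).
    entries: dict = field(default_factory=dict)

    def record_correction(self, name, values, category):
        self.entries[name.lower()] = (category, set(values))

    def lookup(self, name, values, min_similarity=0.5):
        """Return the stored category if the name matches and values overlap enough."""
        entry = self.entries.get(name.lower())
        if entry is None:
            return None
        category, seen = entry
        return category if jaccard(seen, values) >= min_similarity else None


def classify(name, values, engine_result, memory, boost=0.2):
    """Prefer a remembered correction, with boosted confidence capped at 1.0."""
    category, confidence = engine_result
    remembered = memory.lookup(name, values)
    if remembered is None:
        return category, confidence
    return remembered, min(1.0, confidence + boost)
```

Requiring value overlap as well as a name match guards against reusing a correction for an unrelated column that merely shares a name.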
Correcting a classification
In the web app
Navigate to the job detail page, find the column, and select the correct category from the dropdown. The correction is saved immediately and added to org memory.
Via the API
Batch corrections
To correct multiple columns at once:
Next steps
AI readiness
See how classification feeds into readiness scoring
Data contracts
Set quality rules based on classified column types