What is ORCA?
ORCA is a data intelligence platform built for teams that need to understand, improve, and assess the quality of their data. Upload any dataset and ORCA automatically:
- Classifies every column using AI-powered semantic analysis (200+ categories)
- Detects data quality issues: nulls, duplicates, format violations, outliers, GDPR-sensitive fields
- Scores AI readiness across 7 dimensions with actionable recommendations
- Remediates issues with auto-fix strategies
- Assesses datasets with verifiable quality reports
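To make the quality checks above more concrete, here is a minimal sketch of two of them (null counts per column and duplicate rows) using only the Python standard library. The function name `profile_columns`, the sample rows, and the rule that an empty string counts as a null are assumptions made for illustration; they do not describe ORCA's actual detection logic.

```python
from collections import Counter


def profile_columns(rows: list[dict]) -> dict:
    """Count null values per column and fully duplicated rows (illustrative only)."""
    null_counts: Counter = Counter()
    columns: set = set()
    for row in rows:
        for col, value in row.items():
            columns.add(col)
            # Assumption: None and the empty string both count as null.
            if value is None or value == "":
                null_counts[col] += 1

    # A row is a duplicate if an identical row already appeared earlier.
    seen = Counter(tuple(sorted(row.items())) for row in rows)
    duplicate_rows = sum(n - 1 for n in seen.values())

    return {
        "nulls": {col: null_counts[col] for col in sorted(columns)},
        "duplicate_rows": duplicate_rows,
    }


# Hypothetical sample data.
sample = [
    {"email": "a@example.com", "revenue": 120.0},
    {"email": None, "revenue": 80.0},
    {"email": "a@example.com", "revenue": 120.0},
]
report = profile_columns(sample)
print(report)
```

Format violations, outliers, and GDPR-sensitive fields would need column-type awareness, which is why they are left out of this sketch.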
Key capabilities
| Capability | Description |
|---|---|
| Semantic classification | AI classifies columns into 200+ semantic categories (email, revenue, date_of_birth, etc.) |
| Quality analysis | Detects nulls, duplicates, format violations, outliers, and anomalies per column |
| AI readiness scoring | 7-dimension weighted score (0-100) measuring dataset fitness for ML/AI use cases |
| Auto-remediation | Preview and apply fixes: null imputation, deduplication, format standardization, outlier treatment |
| GDPR compliance | Automatic PII detection with 3-layer password screening and data masking |
| Use-case readiness | Assess fitness for 8 ML use cases (churn prediction, fraud detection, recommendation, etc.) |
| Assessment | Issue verifiable SHA-256 assessment reports for datasets scoring 75+ |
| PDF reports | Export AI readiness and GDPR compliance reports |
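The scoring and assessment rows above can be illustrated with a short sketch: a weighted 0-100 score over seven dimensions, a 75+ eligibility check, and a SHA-256 digest of the resulting report. Only the number of dimensions, the 0-100 range, the 75 threshold, and the use of SHA-256 come from this page. The dimension names, the weights, the report fields, and the choice to hash a canonical JSON serialization are all assumptions for illustration.

```python
import hashlib
import json

# Hypothetical dimension names and integer percentage weights (sum to 100).
WEIGHTS = {
    "completeness": 20,
    "uniqueness": 15,
    "validity": 15,
    "consistency": 15,
    "timeliness": 10,
    "balance": 10,
    "privacy": 15,
}

# From the table above: assessment reports are issued for scores of 75+.
ASSESSMENT_THRESHOLD = 75


def readiness_score(dimension_scores: dict[str, float]) -> float:
    """Weighted average of per-dimension scores, each on a 0-100 scale."""
    total_weight = sum(WEIGHTS.values())
    weighted = sum(WEIGHTS[name] * dimension_scores[name] for name in WEIGHTS)
    return round(weighted / total_weight, 1)


def report_digest(report: dict) -> str:
    """SHA-256 over a canonical JSON form, so the same report always hashes the same."""
    canonical = json.dumps(report, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Hypothetical per-dimension scores for one dataset.
scores = {
    "completeness": 90,
    "uniqueness": 80,
    "validity": 85,
    "consistency": 70,
    "timeliness": 60,
    "balance": 75,
    "privacy": 95,
}
score = readiness_score(scores)
report = {
    "dataset": "customers.csv",
    "score": score,
    "eligible": score >= ASSESSMENT_THRESHOLD,
}
digest = report_digest(report)
print(score, report["eligible"], digest)
```

Anyone holding the same report can recompute the digest and compare it, which is one plausible reading of "verifiable" here. See the AI readiness page for the actual scoring methodology.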
Architecture
ORCA is built on:
- Backend: FastAPI + PostgreSQL + Redis + AWS S3
- AI engine: Google Gemini for semantic classification
- Task queue: ARQ for async processing (file analysis, report generation)
- Frontend: React with real-time WebSocket progress updates
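As a rough illustration of how an ARQ-based task queue fits into a stack like this, the sketch below defines one async job and a helper that enqueues it through Redis. The job name `analyze_file`, its return value, and the helper `enqueue_analysis` are hypothetical; only the ARQ calls themselves (`create_pool`, `RedisSettings`, `enqueue_job`, and the `WorkerSettings.functions` convention) are part of the real library. The ARQ import is placed inside the helper so the job can be tried locally without ARQ or Redis installed.

```python
import asyncio


async def analyze_file(ctx: dict, file_id: str) -> dict:
    """Hypothetical job: ARQ passes a context dict as the first argument."""
    # A real job would load the file from storage and run the analysis here.
    await asyncio.sleep(0)
    return {"file_id": file_id, "status": "analyzed"}


class WorkerSettings:
    """Picked up by the ARQ worker CLI, e.g. `arq my_module.WorkerSettings`."""

    functions = [analyze_file]


async def enqueue_analysis(file_id: str) -> None:
    """Queue the job from the API process (requires ARQ and a running Redis)."""
    from arq import create_pool
    from arq.connections import RedisSettings

    pool = await create_pool(RedisSettings())  # defaults to localhost:6379
    await pool.enqueue_job("analyze_file", file_id)


# Run the job function directly, without a worker, to check its behavior.
demo = asyncio.run(analyze_file({}, "file-123"))
print(demo)
```

In a setup like this, the API process returns as soon as the job is queued, and the worker can publish progress that the frontend receives over a WebSocket.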
Next steps
- Quick start: Upload your first file and see results in minutes
- AI readiness: Understand the 7-dimension scoring methodology
- API reference: Integrate ORCA into your data pipeline
- Security: Security and compliance details