
Overview

ORCA can connect to external data sources for continuous quality monitoring. Once connected, ORCA discovers files and tables, runs scheduled scans, and tracks changes over time. You can connect sources through the web UI or the API.

Supported sources

| Source | Type | Auth method | File formats |
|---|---|---|---|
| AWS S3 | Object storage | IAM role or access key | CSV, Parquet, JSON |
| Google Cloud Storage | Object storage | Service account (Workload Identity) | CSV, Parquet, JSON |
| PostgreSQL | Database | Connection string | n/a |
| BigQuery | Data warehouse | Service account | n/a |
| Snowflake | Data warehouse | Username/password or key pair | n/a |

Connecting a source

AWS S3

Required credentials:
| Field | Description |
|---|---|
| Bucket name | S3 bucket name |
| Prefix | Optional path prefix to scope file discovery |
| AWS region | Bucket region (default: eu-north-1) |
| Access key ID | IAM access key (or use IAM role) |
| Secret access key | IAM secret key (or use IAM role) |

Minimum IAM policy for the ORCA service:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:ListBucket",
        "s3:HeadBucket"
      ],
      "Resource": [
        "arn:aws:s3:::your-bucket",
        "arn:aws:s3:::your-bucket/*"
      ]
    }
  ]
}
S3 files are loaded as lazy frames (Polars scan_csv / scan_parquet) for memory-safe processing of large datasets. KMS encryption is supported for uploads.

Google Cloud Storage

Required credentials:
| Field | Description |
|---|---|
| Bucket name | GCS bucket name |
| Prefix | Optional path prefix |
| Service account JSON | Service account key file contents |

Authentication is handled via GOOGLE_APPLICATION_CREDENTIALS. Workload Identity Federation is supported for production deployments without key files.
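
A malformed or truncated key is a common reason for a failed connection, so it can help to check the pasted JSON before submitting it. The sketch below is a local pre-check only, not ORCA's own validation. The function name is hypothetical; the field names are the ones found in Google-issued service account key files.

```python
import json

# Fields present in a Google-issued service account key file.
REQUIRED_KEY_FIELDS = ("type", "project_id", "private_key", "client_email")


def check_service_account_json(raw: str) -> list:
    """Return a list of problems found in a pasted key; empty means it looks usable."""
    try:
        key = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(key, dict):
        return ["expected a JSON object"]
    problems = [f"missing field: {name}" for name in REQUIRED_KEY_FIELDS if not key.get(name)]
    # Key files carry type "service_account"; anything else is a different credential kind.
    if key.get("type") and key["type"] != "service_account":
        problems.append(f"unexpected type: {key['type']!r}")
    return problems
```

Workload Identity Federation uses a credential configuration file with a different type ("external_account"), so this check applies to key files only.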

PostgreSQL

Required credentials:
| Field | Description |
|---|---|
| Host | Database hostname |
| Port | Database port (default: 5432) |
| Database | Database name |
| Username | Database user |
| Password | Database password |
| Schema | Schema to scan (default: public) |
| SSL mode | Connection SSL mode |

When you test the connection, ORCA returns the list of discovered tables with row counts and sizes so you can select which tables to monitor.
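
To verify the same credentials outside ORCA (for example with psql), the individual fields can be combined into a libpq-style connection URI. This is a minimal sketch: the function name is hypothetical, and whether ORCA itself accepts a URI in this form is not stated here.

```python
from typing import Optional
from urllib.parse import quote, urlencode


def postgres_uri(host: str, database: str, username: str, password: str,
                 port: int = 5432, sslmode: Optional[str] = None) -> str:
    """Build a libpq-style connection URI from the individual credential fields."""
    # Percent-encode user and password so characters like '@' or '/' stay unambiguous.
    userinfo = f"{quote(username, safe='')}:{quote(password, safe='')}"
    uri = f"postgresql://{userinfo}@{host}:{port}/{quote(database, safe='')}"
    if sslmode:
        uri += "?" + urlencode({"sslmode": sslmode})
    return uri
```

For example, `postgres_uri("db.example.com", "analytics", "orca_reader", "p@ss/word", sslmode="require")` yields `postgresql://orca_reader:p%40ss%2Fword@db.example.com:5432/analytics?sslmode=require`.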

BigQuery

Required credentials:
| Field | Description |
|---|---|
| Project ID | GCP project ID |
| Dataset | BigQuery dataset name |
| Service account JSON | Service account key file contents |

Snowflake

Required credentials:
| Field | Description |
|---|---|
| Account | Snowflake account identifier |
| Warehouse | Compute warehouse |
| Database | Database name |
| Schema | Schema name |
| Username | Snowflake user |
| Password | Snowflake password |

Testing a connection

Before saving a source, test credentials to verify access:
curl -X POST https://api.orca-klavest.app/api/v1/sources/test-connection \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "type": "postgres",
    "credentials": {
      "host": "db.example.com",
      "port": 5432,
      "database": "analytics",
      "username": "orca_reader",
      "password": "..."
    },
    "config": {
      "schema": "public"
    }
  }'

Creating a source

curl -X POST https://api.orca-klavest.app/api/v1/sources \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Production Warehouse",
    "type": "postgres",
    "credentials": {
      "host": "db.example.com",
      "port": 5432,
      "database": "analytics",
      "username": "orca_reader",
      "password": "..."
    },
    "config": {
      "schema": "public"
    }
  }'
Credentials are encrypted at rest and never logged. Source creation is recorded in the audit log.
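
The test and create requests share the same body apart from the name field, so both can be derived from one source definition. The sketch below builds the two requests with the Python standard library without sending them. The helper name and the ORCA_DB_PASSWORD variable are hypothetical; TOKEN mirrors the variable used in the curl examples.

```python
import json
import os
import urllib.request

API_BASE = "https://api.orca-klavest.app/api/v1"


def build_test_and_create_requests(token: str, source: dict) -> tuple:
    """Return (test_request, create_request) built from one source definition."""
    headers = {"Authorization": f"Bearer {token}", "Content-Type": "application/json"}

    def post(path: str, body: dict) -> urllib.request.Request:
        return urllib.request.Request(
            f"{API_BASE}{path}",
            data=json.dumps(body).encode("utf-8"),
            method="POST",
            headers=headers,
        )

    # The test-connection body carries no "name"; everything else is shared.
    test_body = {k: v for k, v in source.items() if k != "name"}
    return post("/sources/test-connection", test_body), post("/sources", source)


source = {
    "name": "Production Warehouse",
    "type": "postgres",
    "credentials": {
        "host": "db.example.com",
        "port": 5432,
        "database": "analytics",
        "username": "orca_reader",
        # ORCA_DB_PASSWORD is a hypothetical variable name chosen for this sketch.
        "password": os.environ.get("ORCA_DB_PASSWORD", ""),
    },
    "config": {"schema": "public"},
}
test_req, create_req = build_test_and_create_requests(os.environ.get("TOKEN", ""), source)
```

Pass each request to `urllib.request.urlopen` to send it, and send the create request only after the test request succeeds. Response bodies are left unparsed because their format is not described here.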

Scan scheduling

Set up cron-based schedules to scan sources automatically.

Schedule configuration

| Field | Description |
|---|---|
| cron_expression | Standard cron expression (e.g. 0 6 * * * for daily at 06:00 UTC) |
| tables | Optional list of specific tables to scan (default: all) |
| enabled | Toggle the schedule on/off |

Common cron patterns

| Schedule | Cron expression |
|---|---|
| Every hour | 0 * * * * |
| Daily at 06:00 UTC | 0 6 * * * |
| Weekdays at 08:00 UTC | 0 8 * * 1-5 |
| Weekly on Monday at 06:00 UTC | 0 6 * * 1 |
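
To check what a pattern means before saving it, the five fields (minute, hour, day of month, month, day of week) can be matched against a timestamp. The sketch below is a simplified matcher written for illustration, not ORCA's scheduler. It assumes schedules are evaluated in UTC and requires every field to match, whereas classic cron treats day-of-month and day-of-week as alternatives when both are restricted. None of the patterns above restrict both.

```python
from datetime import datetime, timezone


def _field_matches(field: str, value: int, lo: int, hi: int) -> bool:
    """Check one cron field (supports *, n, a-b, comma lists, and /step) against a value."""
    for part in field.split(","):
        rng, _, step = part.partition("/")
        step_n = int(step) if step else 1
        if rng == "*":
            start, end = lo, hi
        elif "-" in rng:
            a, b = rng.split("-")
            start, end = int(a), int(b)
        else:
            start = end = int(rng)
        if start <= value <= end and (value - start) % step_n == 0:
            return True
    return False


def cron_matches(expr: str, when: datetime) -> bool:
    """Return True if the timezone-aware datetime falls on the cron expression (in UTC)."""
    minute, hour, dom, month, dow = expr.split()
    when = when.astimezone(timezone.utc)
    return (
        _field_matches(minute, when.minute, 0, 59)
        and _field_matches(hour, when.hour, 0, 23)
        and _field_matches(dom, when.day, 1, 31)
        and _field_matches(month, when.month, 1, 12)
        # cron counts Sunday as 0; isoweekday() counts Monday as 1 and Sunday as 7.
        and _field_matches(dow, when.isoweekday() % 7, 0, 6)
    )
```

Pass timezone-aware datetimes; a naive datetime would be interpreted in the machine's local time zone by `astimezone`.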

Creating a schedule via API

# SOURCE_ID holds the ID of the source to schedule
curl -X POST "https://api.orca-klavest.app/api/v1/sources/$SOURCE_ID/schedules" \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "cron_expression": "0 6 * * *",
    "enabled": true
  }'
When a scheduled scan completes, ORCA evaluates any data contracts bound to the source and triggers alerts if quality drops below contract thresholds.
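
When schedules are created from scripts, a small builder can catch malformed input before the request is sent. The sketch below serializes the schedule body from the fields in the configuration table. It assumes a five-field cron expression and that tables is sent as a JSON array of table names; the function name is hypothetical.

```python
import json
from typing import Optional, Sequence


def schedule_payload(cron_expression: str, tables: Optional[Sequence[str]] = None,
                     enabled: bool = True) -> str:
    """Serialize a schedule body, rejecting expressions that are not five fields."""
    if len(cron_expression.split()) != 5:
        raise ValueError(f"expected 5 cron fields, got {cron_expression!r}")
    body = {"cron_expression": cron_expression, "enabled": enabled}
    if tables:
        # Omitting "tables" keeps the documented default of scanning everything.
        body["tables"] = list(tables)
    return json.dumps(body)
```

The returned string can be used as the request body of the schedule creation call.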

File discovery and sync state

For object storage sources (S3, GCS), ORCA maintains a sync state for each discovered file:
  • New files are detected on each scan and queued for analysis
  • Modified files (changed Last-Modified or ETag) are re-scanned
  • Deleted files are marked as removed in the sync state
The sync state is stored in the source_file_states table. You can view discovered files and their scan status through the Sources page in the web app or via the API.
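
The three cases above amount to a diff between the previous listing and the current one. The sketch below illustrates that classification with in-memory dictionaries. The FileMeta type and function name are hypothetical, and the actual layout of source_file_states is not described here.

```python
from typing import Dict, List, NamedTuple


class FileMeta(NamedTuple):
    etag: str
    last_modified: str  # timestamp string as reported by the object store


def diff_listing(previous: Dict[str, FileMeta],
                 current: Dict[str, FileMeta]) -> Dict[str, List[str]]:
    """Classify object keys as new, modified, or deleted between two listings."""
    new = sorted(k for k in current if k not in previous)
    deleted = sorted(k for k in previous if k not in current)
    modified = sorted(
        k for k in current
        if k in previous and current[k] != previous[k]  # ETag or Last-Modified changed
    )
    return {"new": new, "modified": modified, "deleted": deleted}
```

Files in the new and modified groups would be queued for scanning, while deleted ones are only marked as removed.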

API endpoints

All source management endpoints require authentication and org membership.
| Method | Endpoint | Description |
|---|---|---|
| POST | /api/v1/sources/test-connection | Test credentials without saving |
| POST | /api/v1/sources | Create a new data source |
| GET | /api/v1/sources | List all sources for the org |
| GET | /api/v1/sources/:id | Get source details |
| PATCH | /api/v1/sources/:id | Update source config or credentials |
| DELETE | /api/v1/sources/:id | Delete a source (admin only) |
| POST | /api/v1/sources/:id/scan | Trigger a manual scan |
| GET | /api/v1/sources/:id/files | List discovered files and sync state |
| POST | /api/v1/sources/:id/schedules | Create a scan schedule |
| GET | /api/v1/sources/:id/schedules | List scan schedules |
| PATCH | /api/v1/sources/:id/schedules/:sid | Update a schedule |
| DELETE | /api/v1/sources/:id/schedules/:sid | Delete a schedule |
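
All of these endpoints share the same base URL and bearer authentication, so one small helper can cover them. The sketch below builds requests for a manual scan and a file listing with the Python standard library without sending them. The helper name and the placeholder values are hypothetical, and whether the scan endpoint expects a request body is not stated here, so none is sent.

```python
import json
import urllib.request
from typing import Optional

API_BASE = "https://api.orca-klavest.app/api/v1"


def orca_request(method: str, path: str, token: str,
                 body: Optional[dict] = None) -> urllib.request.Request:
    """Build an authenticated request; a JSON body is attached only when given."""
    headers = {"Authorization": f"Bearer {token}"}
    data = None
    if body is not None:
        data = json.dumps(body).encode("utf-8")
        headers["Content-Type"] = "application/json"
    return urllib.request.Request(f"{API_BASE}{path}", data=data, method=method, headers=headers)


source_id = "SOURCE_ID"  # placeholder: substitute a real source ID
scan = orca_request("POST", f"/sources/{source_id}/scan", token="TOKEN")
files = orca_request("GET", f"/sources/{source_id}/files", token="TOKEN")
```

Send either request with `urllib.request.urlopen`; an HTTP error status raises `urllib.error.HTTPError`.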

Security

  • Credentials are encrypted at rest using the application secret key
  • Credentials are never included in API responses or logs
  • Source creation and deletion events are recorded in the audit log
  • All source queries are scoped to the authenticated user’s organisation
  • S3 keys are scoped to the org’s prefix to prevent cross-tenant access
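
Prefix scoping of the kind described in the last point can be pictured as a check that every object key stays under the organisation's prefix. The sketch below is an illustration only, not ORCA's implementation. It assumes one top-level prefix per organisation, and the function name is hypothetical.

```python
def key_in_org_prefix(key: str, org_prefix: str) -> bool:
    """Return True only if the object key sits under the organisation's prefix."""
    # The trailing slash stops "org-a" from matching "org-ab/...".
    prefix = org_prefix.strip("/") + "/"
    # Reject "." and ".." segments outright in case any layer normalizes paths.
    if any(segment in (".", "..") for segment in key.split("/")):
        return False
    return key.startswith(prefix)
```

With the prefix "org-a", the key "org-a/uploads/file.csv" passes, while "org-ab/file.csv" and "org-a/../org-b/file.csv" are both rejected.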