Data Sources & Connectors

Overview

ORCA can connect to external data sources for continuous quality monitoring. Once connected, ORCA discovers files and tables, runs scheduled scans, and tracks changes over time. You can connect sources through the web UI or the API.

Supported sources

Source	Type	Auth method	File formats
AWS S3	Object storage	IAM role or access key	CSV, Parquet, JSON
Google Cloud Storage	Object storage	Service account (Workload Identity)	CSV, Parquet, JSON
PostgreSQL	Database	Connection string	—
BigQuery	Data warehouse	Service account	—
Snowflake	Data warehouse	Username/password or key pair	—

Connecting a source

AWS S3

Required credentials:

Field	Description
Bucket name	S3 bucket name
Prefix	Optional path prefix to scope file discovery
AWS region	Bucket region (default: `eu-north-1`)
Access key ID	IAM access key (omit when using an IAM role)
Secret access key	IAM secret key (omit when using an IAM role)

IAM policy — minimum permissions for the ORCA service:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:ListBucket"
      ],
      "Resource": [
        "arn:aws:s3:::your-bucket",
        "arn:aws:s3:::your-bucket/*"
      ]
    }
  ]
}
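
If you generate policies per bucket, the document above can be produced programmatically. The following is a minimal sketch, not part of ORCA: the function name `build_read_only_policy` is hypothetical, and it assumes you want bucket-level and object-level actions in separate statements (`s3:ListBucket` applies to the bucket ARN, `s3:GetObject` to object ARNs) with object access optionally narrowed to the configured prefix.

```python
import json


def build_read_only_policy(bucket: str, prefix: str = "") -> dict:
    """Return a read-only S3 policy document for one bucket (illustrative helper)."""
    bucket_arn = f"arn:aws:s3:::{bucket}"
    clean_prefix = prefix.strip("/")
    # Narrow object access to the prefix when one is given.
    object_arn = f"{bucket_arn}/{clean_prefix}/*" if clean_prefix else f"{bucket_arn}/*"
    return {
        "Version": "2012-10-17",
        "Statement": [
            # Listing is a bucket-level action.
            {"Effect": "Allow", "Action": ["s3:ListBucket"], "Resource": [bucket_arn]},
            # Reading is an object-level action.
            {"Effect": "Allow", "Action": ["s3:GetObject"], "Resource": [object_arn]},
        ],
    }


print(json.dumps(build_read_only_policy("your-bucket", "exports/"), indent=2))
```

Splitting the statements keeps each action paired with the resource type it applies to, which makes the policy easier to audit.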

S3 files are loaded as Polars lazy frames (`scan_csv` / `scan_parquet`), so large datasets are processed without being read fully into memory. KMS encryption is supported for uploads.

Google Cloud Storage

Required credentials:

Field	Description
Bucket name	GCS bucket name
Prefix	Optional path prefix
Service account JSON	Service account key file contents

Authentication uses the `GOOGLE_APPLICATION_CREDENTIALS` environment variable. For production deployments, Workload Identity Federation is supported so that no key files are needed.

PostgreSQL

Required credentials:

Field	Description
Host	Database hostname
Port	Database port (default: 5432)
Database	Database name
Username	Database user
Password	Database password
Schema	Schema to scan (default: `public`)
SSL mode	Connection SSL mode

When you test the connection, ORCA returns the discovered tables with estimated row counts and sizes in bytes, so you can select which tables to monitor.

BigQuery

Required credentials:

Field	Description
Project ID	GCP project ID
Dataset	BigQuery dataset name
Service account JSON	Service account key file contents

Snowflake

Required credentials:

Field	Description
Account	Snowflake account identifier
Warehouse	Compute warehouse
Database	Database name
Schema	Schema name
Username	Snowflake user
Password	Snowflake password

Testing a connection

Before saving a source, test credentials to verify access:

curl -X POST https://api.orca-klavest.app/api/v1/sources/test-connection \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "type": "postgres",
    "credentials": {
      "host": "db.example.com",
      "port": 5432,
      "database": "analytics",
      "username": "orca_reader",
      "password": "..."
    },
    "config": {
      "schema": "public"
    }
  }'

Example response:

{
  "data": {
    "success": true,
    "tables_found": 12,
    "tables": [
      {
        "schema": "public",
        "table_name": "orders",
        "row_estimate": 145230,
        "size_bytes": 52428800
      }
    ]
  }
}
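
To decide which tables to monitor, it can help to turn the response into a short summary. The sketch below parses a response shaped like the example above; the helper name `summarise_tables` is hypothetical, and it assumes that a failed test still returns a `data.success` field set to `false`.

```python
import json

# Sample body copied from the example response (one table shown).
SAMPLE_RESPONSE = """
{
  "data": {
    "success": true,
    "tables_found": 12,
    "tables": [
      {"schema": "public", "table_name": "orders",
       "row_estimate": 145230, "size_bytes": 52428800}
    ]
  }
}
"""


def summarise_tables(response_text: str) -> list[str]:
    """Return one 'schema.table: ~rows, size' line per discovered table."""
    data = json.loads(response_text)["data"]
    if not data.get("success"):
        raise RuntimeError("connection test did not succeed")
    lines = []
    for table in data["tables"]:
        size_mib = table["size_bytes"] / (1024 * 1024)
        lines.append(
            f'{table["schema"]}.{table["table_name"]}: '
            f'~{table["row_estimate"]} rows, {size_mib:.1f} MiB'
        )
    return lines


for line in summarise_tables(SAMPLE_RESPONSE):
    print(line)  # public.orders: ~145230 rows, 50.0 MiB
```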

Creating a source

curl -X POST https://api.orca-klavest.app/api/v1/sources \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Production Warehouse",
    "type": "postgres",
    "credentials": {
      "host": "db.example.com",
      "port": 5432,
      "database": "analytics",
      "username": "orca_reader",
      "password": "..."
    },
    "config": {
      "schema": "public"
    }
  }'

Once the source exists, start a first scan from the CLI with the returned source ID:

orca scan --source-id <returned-uuid> --wait

Credentials are encrypted at rest and never logged. Source creation is recorded in the audit log.
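
The same request can be built from a script so that secrets come from the environment rather than from shell history. This is a sketch using only the Python standard library; the function name and the environment variable names `ORCA_TOKEN` and `ORCA_DB_PASSWORD` are assumptions, and the response is not parsed because its shape is not documented here.

```python
import json
import urllib.request

API_BASE = "https://api.orca-klavest.app/api/v1"


def build_create_source_request(token: str, password: str) -> urllib.request.Request:
    """Build (but do not send) the POST request that creates a PostgreSQL source."""
    payload = {
        "name": "Production Warehouse",
        "type": "postgres",
        "credentials": {
            "host": "db.example.com",
            "port": 5432,
            "database": "analytics",
            "username": "orca_reader",
            "password": password,
        },
        "config": {"schema": "public"},
    }
    return urllib.request.Request(
        f"{API_BASE}/sources",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Sending the request (hypothetical environment variable names):
#   import os
#   request = build_create_source_request(
#       os.environ["ORCA_TOKEN"], os.environ["ORCA_DB_PASSWORD"])
#   with urllib.request.urlopen(request) as response:
#       print(response.status)
```

Building the request separately from sending it makes the payload easy to inspect or unit test without touching the network.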

Scan scheduling

Set up cron-based schedules to scan sources automatically.

Schedule configuration

Field	Description
`cron_expression`	Standard cron expression (e.g. `0 6 * * *` for daily at 06:00 UTC)
`tables`	Optional list of specific tables to scan (default: all)
`enabled`	Toggle the schedule on/off

Common cron patterns

Schedule	Cron expression
Every hour	`0 * * * *`
Daily at 06:00 UTC	`0 6 * * *`
Weekdays at 08:00 UTC	`0 8 * * 1-5`
Weekly on Monday at 06:00 UTC	`0 6 * * 1`
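
To sanity-check a pattern before saving it, you can test whether a given UTC timestamp would match. The matcher below is a simplified sketch, not ORCA's scheduler: it supports only `*`, single values, ranges, and comma lists (no step values such as `*/5`, no `7` for Sunday), and it requires all five fields to match, which differs from classic cron when both day fields are restricted.

```python
from datetime import datetime, timezone


def _field_matches(field: str, value: int) -> bool:
    """Check one cron field; supports '*', 'N', 'A-B', and comma lists."""
    for part in field.split(","):
        if part == "*":
            return True
        if "-" in part:
            low, high = part.split("-")
            if int(low) <= value <= int(high):
                return True
        elif int(part) == value:
            return True
    return False


def cron_matches(expression: str, moment: datetime) -> bool:
    """Return True if a timezone-aware datetime matches a five-field cron expression."""
    minute, hour, day, month, weekday = expression.split()
    moment = moment.astimezone(timezone.utc)
    # Cron counts Sunday as 0 and Monday as 1; isoweekday() counts Sunday as 7.
    cron_weekday = moment.isoweekday() % 7
    return (
        _field_matches(minute, moment.minute)
        and _field_matches(hour, moment.hour)
        and _field_matches(day, moment.day)
        and _field_matches(month, moment.month)
        and _field_matches(weekday, cron_weekday)
    )


monday_0600 = datetime(2024, 1, 1, 6, 0, tzinfo=timezone.utc)  # a Monday
print(cron_matches("0 6 * * 1", monday_0600))    # True
print(cron_matches("0 8 * * 1-5", monday_0600))  # False (wrong hour)
```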

Creating a schedule via API

curl -X POST https://api.orca-klavest.app/api/v1/sources/<source-id>/schedules \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "cron_expression": "0 6 * * *",
    "enabled": true
  }'
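
The request body can also be assembled in code, including the optional `tables` field from the schedule configuration. This sketch uses a hypothetical helper name, and it assumes that `tables` takes plain table names; check the API reference for the exact format.

```python
import json
from typing import Optional, Sequence


def build_schedule_payload(
    cron_expression: str,
    tables: Optional[Sequence[str]] = None,
    enabled: bool = True,
) -> str:
    """Return the JSON body for creating a schedule (illustrative helper)."""
    # A standard cron expression has exactly five whitespace-separated fields.
    if len(cron_expression.split()) != 5:
        raise ValueError("expected a five-field cron expression")
    payload = {"cron_expression": cron_expression, "enabled": enabled}
    # Leave 'tables' out entirely to keep the default of scanning all tables.
    if tables:
        payload["tables"] = list(tables)
    return json.dumps(payload)


print(build_schedule_payload("0 8 * * 1-5", tables=["orders"]))
```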

When a scheduled scan completes, ORCA evaluates any data contracts bound to the source and triggers alerts if quality drops below contract thresholds.

File discovery and sync state

For object storage sources (S3, GCS), ORCA maintains a sync state for each discovered file:

New files are detected on each scan and queued for analysis
Modified files (changed Last-Modified or ETag) are re-scanned
Deleted files are marked as removed in the sync state

The sync state is stored in the `source_file_states` table. You can view discovered files and their scan status through the Sources page in the web app or via the API.
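
The three cases above amount to a diff between the stored state and the latest bucket listing. The following sketch illustrates that comparison; the `FileVersion` type and `classify_files` function are hypothetical and do not reflect ORCA's internal schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileVersion:
    """Identity of one stored object version (illustrative type)."""
    etag: str
    last_modified: str  # ISO 8601 timestamp as reported by the object store


def classify_files(
    previous: dict[str, FileVersion],
    current: dict[str, FileVersion],
) -> dict[str, list[str]]:
    """Compare the stored state with the latest listing, keyed by object key."""
    new = sorted(key for key in current if key not in previous)
    # A file counts as modified when its ETag or Last-Modified value differs.
    modified = sorted(
        key for key in current if key in previous and current[key] != previous[key]
    )
    removed = sorted(key for key in previous if key not in current)
    return {"new": new, "modified": modified, "removed": removed}


previous = {
    "data/a.csv": FileVersion("etag-1", "2024-01-01T00:00:00Z"),
    "data/b.csv": FileVersion("etag-2", "2024-01-01T00:00:00Z"),
}
current = {
    "data/a.csv": FileVersion("etag-9", "2024-01-02T00:00:00Z"),
    "data/c.csv": FileVersion("etag-3", "2024-01-02T00:00:00Z"),
}
print(classify_files(previous, current))
```

In this example `data/c.csv` is new, `data/a.csv` is modified, and `data/b.csv` is removed.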

API endpoints

All source management endpoints require authentication and org membership.

Method	Endpoint	Description
`POST`	`/api/v1/sources/test-connection`	Test credentials without saving
`POST`	`/api/v1/sources`	Create a new data source
`GET`	`/api/v1/sources`	List all sources for the org
`GET`	`/api/v1/sources/:id`	Get source details
`PATCH`	`/api/v1/sources/:id`	Update source config or credentials
`DELETE`	`/api/v1/sources/:id`	Delete a source (admin only)
`POST`	`/api/v1/sources/:id/scan`	Trigger a manual scan
`GET`	`/api/v1/sources/:id/files`	List discovered files and sync state
`POST`	`/api/v1/sources/:id/schedules`	Create a scan schedule
`GET`	`/api/v1/sources/:id/schedules`	List scan schedules
`PATCH`	`/api/v1/sources/:id/schedules/:sid`	Update a schedule
`DELETE`	`/api/v1/sources/:id/schedules/:sid`	Delete a schedule

Security

Credentials are encrypted at rest using the application secret key
Credentials are never included in API responses or logs
Source creation and deletion events are recorded in the audit log
All source queries are scoped to the authenticated user’s organisation
S3 keys are scoped to the org’s prefix to prevent cross-tenant access
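
Prefix scoping of the kind described in the last point is easy to get subtly wrong. The check below is a sketch under stated assumptions, not ORCA's implementation: the function name and the `orgs/<id>` prefix layout are hypothetical. It shows why the comparison should include the trailing slash.

```python
def key_in_org_scope(org_prefix: str, key: str) -> bool:
    """Return True if an object key lives under the organisation's prefix."""
    base = org_prefix.strip("/")
    if not base:
        raise ValueError("org prefix must not be empty")
    # Compare against 'prefix/' so that 'orgs/12' does not match 'orgs/123/...'.
    return key.startswith(base + "/")


print(key_in_org_scope("orgs/12", "orgs/12/exports/a.csv"))   # True
print(key_in_org_scope("orgs/12", "orgs/123/exports/a.csv"))  # False
```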
