Data retention

Overview

Data retention determines how long the raw files you upload to ORCA remain in S3 after analysis completes. It is separate from analysis results: quality scores, classifications, AI Readiness scores, and reports are stored in the database and persist regardless of what you set here. The retention setting only affects the original CSV/Parquet/Excel files. Once they are deleted, you can no longer re-run analysis from them, but every result derived from them remains in your dashboard, reports, history, and contracts. This separation is intentional: it lets you prove the analysis happened for compliance purposes long after the source data is gone.

Retention modes

ORCA supports three modes:

Mode	Behavior	When to use
`analysis_only` (default)	Files deleted ~24 hours after analysis completes	Maximum privacy posture. Default for most workspaces.
`short_term`	Files retained for a configurable number of days, then deleted	When you need to re-run analysis or apply remediation more than 24 h after upload
`full_retention`	Files kept indefinitely until you delete them manually	Long-term storage for trusted, non-PII datasets

The default is analysis_only, and the project security rules explicitly forbid changing this default. It is the privacy-safe option, so keeping files longer must always be an explicit opt-in.

Setting retention

Retention is set per-job at upload time. You can also configure an org-wide default in Settings → Data retention.

Per upload (UI)

On the Upload page, expand Configuration. Set:

Retention mode — pick one of the three modes above
Retention days — only used when mode is short_term (1–365 days)
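
The two settings above can be checked client-side before a job is submitted. The following is a minimal sketch, not ORCA's own validation code: `RetentionMode` and `validate_retention` are hypothetical names, and only the three mode strings and the 1–365 range come from this page. How the server treats a `retention_days` value sent with another mode is not documented here, so the sketch simply drops it.

```python
from enum import Enum
from typing import Optional


class RetentionMode(str, Enum):
    """The three mode strings documented on this page."""
    ANALYSIS_ONLY = "analysis_only"
    SHORT_TERM = "short_term"
    FULL_RETENTION = "full_retention"


def validate_retention(mode: str, days: Optional[int] = None) -> dict:
    """Return a normalized settings dict, or raise ValueError.

    Hypothetical helper that mirrors the documented rules on the client.
    """
    parsed = RetentionMode(mode)  # raises ValueError for an unknown mode
    if parsed is not RetentionMode.SHORT_TERM:
        # retention_days is only used for short_term, so it is dropped here.
        return {"retention_mode": parsed.value}
    if days is None or isinstance(days, bool) or not 1 <= days <= 365:
        raise ValueError("short_term requires retention_days between 1 and 365")
    return {"retention_mode": parsed.value, "retention_days": days}
```

For example, `validate_retention("short_term", 7)` returns a dict with both keys, while `validate_retention("short_term", 0)` raises `ValueError`.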

Per upload (API)

POST /api/v1/jobs
{
  "filenames": ["transactions.csv"],
  "retention_mode": "short_term",
  "retention_days": 7
}

Or override at job start time:

POST /api/v1/jobs/{job_id}/start
{
  "retention_mode": "analysis_only"
}

Once a job is started, the retention mode is immutable: you cannot extend retention after the fact. You can, however, always end retention early by deleting the files manually.
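
As an illustration, the create-job request above could be assembled from Python with the standard library. This is a sketch under stated assumptions: the host `https://orca.example.com`, the bearer-token `Authorization` header, and the helper name `build_create_job_request` are placeholders, since authentication and base URLs are not covered on this page. The request is built but not sent.

```python
import json
import urllib.request

BASE_URL = "https://orca.example.com"  # assumption: replace with your ORCA host


def build_create_job_request(filenames, retention_mode, retention_days=None,
                             token="YOUR_TOKEN"):
    """Build (but do not send) a POST /api/v1/jobs request."""
    body = {"filenames": list(filenames), "retention_mode": retention_mode}
    if retention_days is not None:
        # Only meaningful when retention_mode is "short_term".
        body["retention_days"] = retention_days
    return urllib.request.Request(
        url=f"{BASE_URL}/api/v1/jobs",
        data=json.dumps(body).encode("utf-8"),
        method="POST",
        headers={
            "Content-Type": "application/json",
            # Assumption: the auth scheme is not documented on this page.
            "Authorization": f"Bearer {token}",
        },
    )


req = build_create_job_request(["transactions.csv"], "short_term", 7)
# To actually send it: urllib.request.urlopen(req)
```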

Org-wide default

Admins can set the org default in Settings → Data retention. Per-job settings always override the org default.
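
The precedence rule can be pictured as a simple fallback chain. This is a hedged sketch rather than ORCA's code: `effective_retention_mode` is a hypothetical name, and falling back to analysis_only when neither a per-job value nor an org default exists is an assumption based on the documented default.

```python
from typing import Optional

BUILT_IN_DEFAULT = "analysis_only"  # the documented default mode


def effective_retention_mode(job_mode: Optional[str],
                             org_default: Optional[str]) -> str:
    """Per-job setting wins, then the org default, then the built-in default."""
    if job_mode is not None:
        return job_mode
    if org_default is not None:
        return org_default
    # Assumption: with nothing configured, the documented default applies.
    return BUILT_IN_DEFAULT
```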

How deletion works

A daily ARQ cron task (cleanup_expired_files) runs at 02:00 UTC and looks for jobs whose retention has expired:

Find expired jobs

Query for jobs where retention_mode IN ('short_term', 'analysis_only'), file_deleted_at IS NULL, and completed_at + retention_days (interpreted as a number of days) < now(). For analysis_only, the retention period is hardcoded to 1 day.
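
The same condition can be written as a predicate over a single job. This is an illustrative sketch, not the actual query: the function name `is_expired` and its argument layout are hypothetical, while the column names and the 1-day rule for analysis_only come from the description above.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

ANALYSIS_ONLY_DAYS = 1  # documented: analysis_only is hardcoded to 1 day


def is_expired(retention_mode: str,
               completed_at: Optional[datetime],
               retention_days: Optional[int],
               file_deleted_at: Optional[datetime],
               now: datetime) -> bool:
    """True when the cleanup task would select this job for deletion."""
    if file_deleted_at is not None:
        return False  # files already removed
    if retention_mode == "analysis_only":
        days = ANALYSIS_ONLY_DAYS
    elif retention_mode == "short_term":
        days = retention_days
    else:
        return False  # full_retention is never selected by the cleanup query
    if completed_at is None or days is None:
        return False  # unfinished or incompletely configured job
    return completed_at + timedelta(days=days) < now


now = datetime(2024, 1, 10, 2, 0, tzinfo=timezone.utc)
done = datetime(2024, 1, 8, 12, 0, tzinfo=timezone.utc)
print(is_expired("analysis_only", done, None, None, now))  # True
print(is_expired("short_term", done, 7, None, now))        # False
```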

Audit log entry

Before deleting, the worker writes a files_expiring event to the audit log with the job ID, org ID, and acting user.

Delete from S3

Each file’s S3 object is deleted. If any deletion fails, the job is skipped and retried on the next run — partial deletion is never recorded as success.

Mark as deleted

On success, the worker clears files.s3_key and stamps jobs.file_deleted_at. The job row stays in the database forever; only the S3 object is removed.

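The cleanup is at-least-once: because the cron runs once a day, a file may outlive its expiration by up to about 24 hours (longer if a deletion fails and is retried on the next run), but it is never deleted before its configured retention has elapsed.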
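
The audit, delete, and mark steps above can be sketched as a single pass over already-selected jobs. This is a simplified illustration, not the real cleanup_expired_files task: the dict-based job records, the `delete_object` callable, the `actor` value, and the in-memory audit list are all stand-ins. It also assumes that deleting an already-deleted key succeeds, so retrying a whole job is safe.

```python
def cleanup_pass(expired_jobs, delete_object, audit_log, now, actor="system"):
    """Run one cleanup pass and return the IDs of fully cleaned jobs.

    expired_jobs: dicts with "id", "org_id", "file_deleted_at" and
                  "files" (a list of dicts that each hold an "s3_key").
    delete_object: callable(s3_key) that raises on failure (hypothetical).
    """
    cleaned = []
    for job in expired_jobs:
        # Audit first, so there is a record even if deletion fails.
        audit_log.append({
            "event": "files_expiring",
            "job_id": job["id"],
            "org_id": job["org_id"],
            "actor": actor,  # assumption: how the acting user is recorded
        })
        try:
            for f in job["files"]:
                delete_object(f["s3_key"])
        except Exception:
            # Any failure: leave the job untouched so the next run retries it.
            continue
        # Only after every object is gone: clear keys and stamp the job.
        for f in job["files"]:
            f["s3_key"] = None
        job["file_deleted_at"] = now
        cleaned.append(job["id"])
    return cleaned
```

Because the job row is only modified after every object has been deleted, a job that fails partway still matches the expiry query on the next run, which is what keeps a partial deletion from being recorded as success.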

Manual deletion

You can delete files immediately, regardless of retention mode:

DELETE /api/v1/jobs/{job_id}/files

Or in the UI: open the job detail page and click Delete files. This is irreversible. The audit log records a files_deleted event.

What survives deletion

After the raw files are deleted from S3, the following remain available:

Job metadata, status, completed_at
All column classifications and confidence scores
Quality results and issue counts per column
AI Readiness scores and dimension breakdowns
Generated PDF reports (already rendered to S3 as separate objects with their own retention)
Audit log entries

What does not survive:

The raw CSV/Parquet/Excel file
The ability to re-run analysis with different settings
The ability to apply auto-remediation to that file (remediation needs the source bytes)

Compliance positioning

Requirement	Recommended setting
GDPR data minimisation	`analysis_only` (default)
Right to erasure (Article 17)	`analysis_only` or `short_term` ≤7 days
SOC 2 audit trail	Any mode — audit log persists regardless
Long-term re-analysis on stable data	`full_retention` (only for non-PII datasets)
Reproducibility for ML training data	`full_retention` (consider versioning at the source instead)

For most regulated workspaces, the default analysis_only mode is the right choice: analysis results persist, and raw PII is removed roughly 24 hours after analysis completes.

Tips

Keep the default unless you have a specific reason. analysis_only is the privacy-safe option and the easiest to defend in a security review.
For short_term, pick the shortest window that lets you act. If your fix-and-rerun loop is two days, retention of 3 days is plenty.
Audit any switch to full_retention. It should only be used for non-PII datasets and the decision should be documented internally.
Rotate regularly. Combine short_term retention with scheduled scans so each scan creates a fresh copy and the previous one ages out cleanly.

What’s next?

Security overview — the broader compliance picture including encryption and PII handling
Audit logs — verify when files were deleted
Connectors — set up scheduled scans that work well with short retention windows
