Overview

Data retention determines how long the raw files you upload to ORCA remain in S3 after analysis completes. It is separate from analysis results: quality scores, classifications, AI Readiness scores, and reports are stored in the database and persist regardless of what you set here. The retention setting only affects the original CSV/Parquet/Excel files. Once they are deleted, you can no longer re-run analysis from them, but every result derived from them remains in your dashboard, reports, history, and contracts indefinitely. This separation is intentional: it lets you prove the analysis happened for compliance purposes long after the source data is gone.

Retention modes

ORCA supports three modes:
| Mode | Behavior | When to use |
| --- | --- | --- |
| analysis_only (default) | Files deleted ~24 hours after analysis completes | Maximum privacy posture. Default for most workspaces. |
| short_term | Files retained for a configurable number of days, then deleted | When you need to re-run analysis or apply remediation more than 24 h after upload |
| full_retention | Files kept indefinitely until you delete them manually | Long-term storage for trusted, non-PII datasets |

The default is analysis_only, and the project security rules explicitly forbid changing it: it is the privacy-safe option, so longer retention must always be an explicit choice rather than something you have to switch off.
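To make the three modes concrete, here is a minimal Python sketch of how each one might map to an expiry time. The helper name expires_at and the use of None for "never expires" are assumptions made for illustration, not ORCA's actual implementation. The one-day window for analysis_only and the 1–365 day range for short_term follow the descriptions on this page.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

ANALYSIS_ONLY_DAYS = 1  # analysis_only files expire about 24 hours after completion


def expires_at(mode: str, completed_at: datetime,
               retention_days: Optional[int] = None) -> Optional[datetime]:
    """Return when a job's raw files become eligible for deletion.

    Hypothetical helper: returns None when the files never expire.
    """
    if mode == "analysis_only":
        return completed_at + timedelta(days=ANALYSIS_ONLY_DAYS)
    if mode == "short_term":
        if retention_days is None or not 1 <= retention_days <= 365:
            raise ValueError("short_term requires retention_days between 1 and 365")
        return completed_at + timedelta(days=retention_days)
    if mode == "full_retention":
        return None  # kept until deleted manually
    raise ValueError(f"unknown retention mode: {mode!r}")


done = datetime(2025, 1, 1, 12, 0, tzinfo=timezone.utc)
print(expires_at("analysis_only", done))
print(expires_at("short_term", done, retention_days=7))
print(expires_at("full_retention", done))
```

Note that the expiry time is only when files become eligible for deletion. The actual removal happens on the next cleanup run after that point.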

Setting retention

Retention is set per-job at upload time. You can also configure an org-wide default in Settings → Data retention.

Per upload (UI)

On the Upload page, expand Configuration. Set:
  • Retention mode — pick one of the three modes above
  • Retention days — only used when mode is short_term (1–365 days)

Per upload (API)

POST /api/v1/jobs
{
  "filenames": ["transactions.csv"],
  "retention_mode": "short_term",
  "retention_days": 7
}
Or override at job start time:
POST /api/v1/jobs/{job_id}/start
{
  "retention_mode": "analysis_only"
}
Once a job is started, the retention mode is immutable: you cannot extend retention after the fact. You can still shorten it at any time by deleting the files manually.
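If you build these request bodies from code, a small client-side check can catch invalid combinations before they reach the API. The sketch below is illustrative only: the function name build_job_body is hypothetical, and the checks mirror the rules described on this page rather than any documented server-side validation.

```python
import json
from typing import List, Optional

VALID_MODES = {"analysis_only", "short_term", "full_retention"}


def build_job_body(filenames: List[str], retention_mode: str,
                   retention_days: Optional[int] = None) -> str:
    """Build the JSON body for POST /api/v1/jobs (hypothetical helper)."""
    if retention_mode not in VALID_MODES:
        raise ValueError(f"unknown retention mode: {retention_mode!r}")
    body: dict = {"filenames": filenames, "retention_mode": retention_mode}
    if retention_mode == "short_term":
        if retention_days is None or not 1 <= retention_days <= 365:
            raise ValueError("retention_days must be between 1 and 365 for short_term")
        body["retention_days"] = retention_days
    # retention_days is only used for short_term, so it is omitted for other modes.
    return json.dumps(body)


print(build_job_body(["transactions.csv"], "short_term", 7))
```

Because retention cannot be extended once the job starts, validating the window before submission is cheaper than discovering a mistake afterwards.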

Org-wide default

Admins can set the org default in Settings → Data retention. Per-job settings always override the org default.
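The precedence rule can be pictured as a simple fallback chain. This is a sketch under assumptions: the function name effective_mode is hypothetical, and the final fallback to analysis_only when neither a per-job setting nor an org default exists is inferred from the default described above.

```python
from typing import Optional

PRODUCT_DEFAULT = "analysis_only"  # the built-in default described on this page


def effective_mode(job_mode: Optional[str], org_default: Optional[str]) -> str:
    """Per-job setting wins, then the org default, then the built-in default."""
    return job_mode or org_default or PRODUCT_DEFAULT


print(effective_mode("short_term", "full_retention"))
print(effective_mode(None, "full_retention"))
print(effective_mode(None, None))
```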

How deletion works

A daily ARQ cron task (cleanup_expired_files) runs at 02:00 UTC and looks for jobs whose retention has expired:
1. Find expired jobs
   Query for jobs where retention_mode IN ('short_term', 'analysis_only'), file_deleted_at IS NULL, and completed_at + retention_days < now(). For analysis_only, the retention period is hardcoded to 1 day.

2. Audit log entry
   Before deleting, the worker writes a files_expiring event to the audit log with the job ID, org ID, and acting user.

3. Delete from S3
   Each file’s S3 object is deleted. If any deletion fails, the job is skipped and retried on the next run; partial deletion is never recorded as success.

4. Mark as deleted
   On success, the worker clears files.s3_key and stamps jobs.file_deleted_at. The job row stays in the database forever; only the S3 object is removed.

The cron is at-least-once and runs once a day, so a file may outlive its expiration by up to about a day (until the next 02:00 UTC run), but it is never deleted before its configured retention has elapsed.
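The four steps above can be sketched as a small in-memory simulation. This is not the real cleanup_expired_files task: the Job dataclass, the delete_object callback standing in for the S3 call, and the list used as an audit log are all assumptions for illustration, and the audit entry is simplified to the event name and job ID.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Callable, List, Optional, Tuple


@dataclass
class Job:
    id: str
    retention_mode: str
    completed_at: datetime
    s3_keys: List[Optional[str]]
    retention_days: Optional[int] = None
    file_deleted_at: Optional[datetime] = None


def is_expired(job: Job, now: datetime) -> bool:
    """Step 1: not yet deleted, in an expiring mode, and past its window."""
    if job.file_deleted_at is not None:
        return False
    if job.retention_mode == "analysis_only":
        days = 1  # hardcoded window for analysis_only
    elif job.retention_mode == "short_term" and job.retention_days is not None:
        days = job.retention_days
    else:
        return False  # full_retention never expires
    return job.completed_at + timedelta(days=days) < now


def cleanup_expired_files(jobs: List[Job], now: datetime,
                          delete_object: Callable[[str], None],
                          audit_log: List[Tuple[str, str]]) -> None:
    for job in jobs:
        if not is_expired(job, now):
            continue
        audit_log.append(("files_expiring", job.id))  # step 2: log before deleting
        try:
            for key in job.s3_keys:
                if key is not None:
                    delete_object(key)  # step 3: assumed to raise on failure
        except Exception:
            continue  # job left untouched, so the next run retries it
        job.s3_keys = [None] * len(job.s3_keys)  # step 4: clear the keys
        job.file_deleted_at = now                # and stamp the deletion time


now = datetime(2025, 1, 3, 2, 0, tzinfo=timezone.utc)
jobs = [
    Job("a", "analysis_only", now - timedelta(days=2), ["raw/a.csv"]),
    Job("b", "short_term", now - timedelta(days=2), ["raw/b.csv"], retention_days=7),
    Job("c", "full_retention", now - timedelta(days=400), ["raw/c.csv"]),
]
deleted: List[str] = []
audit: List[Tuple[str, str]] = []
cleanup_expired_files(jobs, now, deleted.append, audit)
print(deleted)  # only job "a" is past its window
```

In this sketch a job is marked as deleted only after every one of its objects has been removed, which is what keeps a partial deletion from being recorded as success. A retried job re-issues the delete for every key, including ones that succeeded on the previous attempt.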

Manual deletion

You can delete files immediately, regardless of retention mode:
DELETE /api/v1/jobs/{job_id}/files
Or in the UI: open the job detail page and click Delete files. This is irreversible. The audit log records a files_deleted event.
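From a script, the same call might be prepared as shown below. The base URL, the job ID, and the bearer-token Authorization header are placeholders and assumptions, since this page does not describe authentication. The request is built but deliberately not sent, because deletion is irreversible.

```python
import urllib.request


def build_delete_files_request(base_url: str, job_id: str,
                               token: str) -> urllib.request.Request:
    """Prepare (but do not send) DELETE /api/v1/jobs/{job_id}/files."""
    url = f"{base_url.rstrip('/')}/api/v1/jobs/{job_id}/files"
    return urllib.request.Request(
        url,
        method="DELETE",
        headers={"Authorization": f"Bearer {token}"},  # auth scheme is an assumption
    )


req = build_delete_files_request("https://orca.example.com", "job_123", "YOUR_TOKEN")
print(req.get_method(), req.full_url)

# Sending the request deletes the files for good, so it is left commented out:
# with urllib.request.urlopen(req) as resp:
#     print(resp.status)
```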

What survives deletion

After files are deleted from S3, the following remain available:
  • Job metadata, status, completed_at
  • All column classifications and confidence scores
  • Quality results and issue counts per column
  • AI Readiness scores and dimension breakdowns
  • Generated PDF reports (already rendered to S3 as separate objects with their own retention)
  • Audit log entries
What does not survive:
  • The raw CSV/Parquet/Excel file
  • The ability to re-run analysis with different settings
  • The ability to apply auto-remediation to that file (remediation needs the source bytes)

Compliance positioning

| Requirement | Recommended setting |
| --- | --- |
| GDPR data minimisation | analysis_only (default) |
| Right to erasure (Article 17) | analysis_only or short_term ≤7 days |
| SOC 2 audit trail | Any mode (audit log persists regardless) |
| Long-term re-analysis on stable data | full_retention (only for non-PII datasets) |
| Reproducibility for ML training data | full_retention (consider versioning at the source instead) |

For most regulated workspaces, the default analysis_only mode is the right choice: analysis results stay indefinitely, while raw files (and any PII in them) are deleted at the first cleanup run after the 24-hour window has passed.

Tips

  • Keep the default unless you have a specific reason. analysis_only is the privacy-safe option and the easiest to defend in a security review.
  • For short_term, pick the shortest window that lets you act. If your fix-and-rerun loop is two days, retention of 3 days is plenty.
  • Audit any switch to full_retention. It should only be used for non-PII datasets and the decision should be documented internally.
  • Rotate regularly. Combine short_term retention with scheduled scans so each scan creates a fresh copy and the previous one ages out cleanly.

What’s next?

  • Security overview — the broader compliance picture including encryption and PII handling
  • Audit logs — verify when files were deleted
  • Connectors — set up scheduled scans that work well with short retention windows