Overview
ORCA automatically builds a knowledge graph from your analysed datasets. The graph maps entities (such as “customer”, “product”, or “order”) across files, detects relationships between columns, and tracks quality lineage over time. This gives you a unified view of how your data connects and where quality issues propagate.
Knowledge graph
What the graph shows
The knowledge graph visualization displays:
- File nodes: each analysed file appears as a node, showing row count, column count, quality score, and issue summary
- Cross-file edges: connections between files that share entities or related columns, with connection health indicators
- Shared columns: the specific columns that link two files together, including containment ratios and confidence scores
- Functional dependencies: within-file column dependencies (e.g. zip_code determines city) attached to their file node
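To make the functional dependency idea concrete, the sketch below checks whether one column determines another in a small in-memory table. It is an illustration only: the function name, the row format, and the sample rows are assumptions, not ORCA's detection logic.

```python
def holds_functional_dependency(rows, determinant, dependent):
    """Return True if every determinant value maps to exactly one dependent value."""
    seen = {}
    for row in rows:
        key = row[determinant]
        value = row[dependent]
        # A second, different dependent value for the same key breaks the dependency.
        if key in seen and seen[key] != value:
            return False
        seen[key] = value
    return True


# Made-up sample rows for illustration.
rows = [
    {"zip_code": "10001", "city": "New York"},
    {"zip_code": "10001", "city": "New York"},
    {"zip_code": "94105", "city": "San Francisco"},
]

print(holds_functional_dependency(rows, "zip_code", "city"))
```

A real detector would likely tolerate a small share of violating rows rather than require a strict match, since dirty data rarely satisfies a dependency exactly.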
Entity detection
When ORCA analyses a file, it identifies semantic entities by examining column names, data types, and value patterns. Columns with names like customer_id, cust_id, and client_id are resolved to the same canonical entity (“customer”). Entity resolution uses:
- Column name similarity and synonyms
- Value distribution fingerprinting
- Semantic classification from the AI engine
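The name-based part of entity resolution can be pictured with a short sketch. The synonym table and the `canonical_entity` helper below are hypothetical and cover only the column-name signal; value fingerprinting and AI classification are not modelled.

```python
import re

# Hypothetical synonym table; ORCA's actual vocabulary is not documented here.
SYNONYMS = {
    "customer": "customer",
    "cust": "customer",
    "client": "customer",
    "product": "product",
    "prod": "product",
    "order": "order",
}


def canonical_entity(column_name):
    """Map a column name to a canonical entity using its name tokens, or None."""
    tokens = [t for t in re.split(r"[^a-z0-9]+", column_name.lower()) if t]
    for token in tokens:
        if token in SYNONYMS:
            return SYNONYMS[token]
    return None


for name in ["customer_id", "cust_id", "client_id", "zip_code"]:
    print(name, "->", canonical_entity(name))
```

Under these assumptions the first three names resolve to the same entity, while `zip_code` stays unresolved.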
Relationship types
ORCA detects several types of cross-file relationships:
| Type | Description |
|---|---|
| equivalent | Columns represent the same data (e.g. customer_id in both files) |
| a_references_b | Column in file A references column in file B (foreign key pattern) |
| b_references_a | Column in file B references column in file A |
| junction | Junction table pattern linking two entities |
| shared_concept | Columns represent the same concept but may have different values |
| functional_dependency | Within-file dependency (determinant/dependent pair) |
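One plausible way to tell these directional types apart is to compare containment in both directions. The sketch below assumes that a containment ratio is the fraction of distinct non-null values in one column that also occur in the other, and it uses a made-up threshold of 0.9. Neither the definition nor the threshold is confirmed by ORCA, and the junction and shared_concept types are not covered.

```python
def containment_ratio(a_values, b_values):
    """Fraction of distinct non-null values in A that also occur in B."""
    a = {v for v in a_values if v is not None}
    b = {v for v in b_values if v is not None}
    return len(a & b) / len(a) if a else 0.0


def guess_relationship(a_values, b_values, threshold=0.9):
    """Guess a relationship type from two-way containment (illustrative heuristic)."""
    a_in_b = containment_ratio(a_values, b_values)
    b_in_a = containment_ratio(b_values, a_values)
    if a_in_b >= threshold and b_in_a >= threshold:
        return "equivalent"
    if a_in_b >= threshold:
        return "a_references_b"
    if b_in_a >= threshold:
        return "b_references_a"
    return None  # not enough overlap for this heuristic to decide


# Made-up sample data: file A is an orders file, file B a customers file.
order_customer_ids = [1, 2, 2, 3]
customer_ids = [1, 2, 3, 4, 5]

print(containment_ratio(order_customer_ids, customer_ids))
print(guess_relationship(order_customer_ids, customer_ids))
```

Here every customer ID used in the orders column exists in the customers column, but not the reverse, which is the shape of a foreign key pointing from A to B.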
Connection health
Each edge in the graph is assigned a health status based on the quality of the shared columns:
| Status | Meaning |
|---|---|
| Healthy | No significant issues in shared columns |
| Warning | Warning-level issues in shared columns, such as substantial null rates or GDPR fields |
| Critical | Format violations, duplicates, or orphaned references in shared columns |
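The status table can be read as a precedence rule: any critical issue outranks any warning. The sketch below encodes that reading. The issue codes are hypothetical labels invented for this example, not identifiers returned by ORCA.

```python
# Hypothetical issue codes, grouped according to the status table above.
CRITICAL_ISSUES = {"format_violation", "duplicate", "orphaned_reference"}
WARNING_ISSUES = {"high_null_rate", "gdpr_field"}


def edge_health(issues):
    """Return the health status for the set of issues found in shared columns."""
    found = set(issues)
    if found & CRITICAL_ISSUES:
        return "Critical"
    if found & WARNING_ISSUES:
        return "Warning"
    return "Healthy"


print(edge_health([]))
print(edge_health(["high_null_rate"]))
print(edge_health(["high_null_rate", "orphaned_reference"]))
```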
Using the graph visualization
Navigate to the Knowledge Graph page from the sidebar. The graph displays all files with cross-file connections.
- Click a file node to see its columns, quality score, entity columns, and functional dependencies
- Click an edge to inspect the shared columns, relationship types, containment ratios, and per-column quality
- Filter by relationship type or file to focus on specific connections
- Re-detect relationships using the action button to refresh connections after uploading new files
Files without cross-file connections are hidden from the graph by default to reduce visual noise. Use the graph diagnostic endpoint to see unlinked file counts.
Data lineage
ORCA provides two levels of lineage tracking:
Column-level lineage (automatic)
As ORCA analyses files over time, it records quality snapshots per entity column. This lets you track how quality metrics change across uploads:
- Quality score trends per entity
- Null rate changes over time
- Issue count history
- Corrections applied at each point
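As a rough illustration of how such snapshots turn into trends, the sketch below computes the change between consecutive uploads. The `QualitySnapshot` fields and the sample values are assumptions made for this example; the actual snapshot schema is not documented here.

```python
from dataclasses import dataclass


@dataclass
class QualitySnapshot:
    uploaded_at: str      # ISO 8601 date, so string order matches time order
    quality_score: float
    null_rate: float
    issue_count: int


def snapshot_deltas(snapshots):
    """Return the change in each metric between consecutive snapshots."""
    ordered = sorted(snapshots, key=lambda s: s.uploaded_at)
    deltas = []
    for prev, curr in zip(ordered, ordered[1:]):
        deltas.append({
            "uploaded_at": curr.uploaded_at,
            "quality_score": round(curr.quality_score - prev.quality_score, 4),
            "null_rate": round(curr.null_rate - prev.null_rate, 4),
            "issue_count": curr.issue_count - prev.issue_count,
        })
    return deltas


# Made-up history, deliberately out of order to show the sort.
history = [
    QualitySnapshot("2024-02-01", 88.5, 0.125, 7),
    QualitySnapshot("2024-01-01", 82.0, 0.25, 12),
]

for delta in snapshot_deltas(history):
    print(delta)
```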
Data flow lineage (manual + detected)
The lineage system models how data flows between sources, transformations, and destinations using a node-and-edge graph.
Node types represent data assets:
| Node type | Description |
|---|---|
| source | Raw data source (S3 bucket, database table, CSV file) |
| transformation | ETL/ELT step or data pipeline stage |
| destination | Final output (data warehouse table, report, model input) |
Edge types represent data flow:
| Edge type | Description |
|---|---|
| derives_from | Target is derived from source (transformation output) |
| copies_to | Data is copied without transformation |
| aggregates | Target aggregates data from source |
Edges can include column mappings that specify which source columns map to which target columns.
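A minimal sketch of an edge with column mappings, plus a sanity check you might run before registering it, could look like this. The `LineageEdge` class, its field names, and `validate_edge` are hypothetical; only the three edge type names come from the table above.

```python
from dataclasses import dataclass, field

EDGE_TYPES = {"derives_from", "copies_to", "aggregates"}


@dataclass
class LineageEdge:
    source: str
    target: str
    edge_type: str
    # Maps a source column name to the target column it feeds.
    column_mappings: dict = field(default_factory=dict)


def validate_edge(edge, columns_by_node):
    """Return a list of problems found in the edge; an empty list means it looks fine."""
    errors = []
    if edge.edge_type not in EDGE_TYPES:
        errors.append(f"unknown edge type: {edge.edge_type}")
    for src_col, dst_col in edge.column_mappings.items():
        if src_col not in columns_by_node.get(edge.source, set()):
            errors.append(f"{edge.source} has no column {src_col}")
        if dst_col not in columns_by_node.get(edge.target, set()):
            errors.append(f"{edge.target} has no column {dst_col}")
    return errors


# Made-up column inventory for two nodes.
columns_by_node = {
    "raw_customers": {"cust_id", "email"},
    "dim_customers": {"customer_id", "email"},
}
edge = LineageEdge("raw_customers", "dim_customers", "derives_from",
                   {"cust_id": "customer_id", "email": "email"})

print(validate_edge(edge, columns_by_node))
```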
Impact analysis
Select any lineage node and run impact analysis to see all downstream nodes that would be affected by a change. The analysis traverses the lineage graph recursively (up to 10 levels deep) and returns:
- All impacted downstream nodes
- The edges connecting them
- The maximum depth of impact
This is useful for understanding blast radius before modifying a data source or pipeline.
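The traversal can be sketched as a depth-limited walk over the downstream edges. The version below uses an iterative breadth-first search rather than recursion, which visits the same nodes under the same depth cap. The function name, the edge representation, and the `raw_orders` and `fact_orders` nodes are assumptions made for illustration; this is not ORCA's implementation.

```python
from collections import deque


def impact_analysis(edges, root, max_depth=10):
    """Collect nodes and edges downstream of root, up to max_depth levels."""
    children = {}
    for source, target in edges:
        children.setdefault(source, []).append(target)

    depth_of = {root: 0}
    impacted_edges = []
    queue = deque([root])
    while queue:
        node = queue.popleft()
        if depth_of[node] >= max_depth:
            continue  # do not expand beyond the depth limit
        for child in children.get(node, []):
            impacted_edges.append((node, child))
            if child not in depth_of:  # guards against cycles and repeat visits
                depth_of[child] = depth_of[node] + 1
                queue.append(child)

    return {
        "impacted_nodes": [n for n in depth_of if n != root],
        "impacted_edges": impacted_edges,
        "depth": max(depth_of.values()),
    }


# Each pair is (upstream node, downstream node).
edges = [
    ("raw_customers", "dim_customers"),
    ("dim_customers", "churn_model_input"),
    ("raw_orders", "fact_orders"),
]

print(impact_analysis(edges, "raw_customers"))
```

Nodes that are not reachable from the selected root, such as `fact_orders` here, are left out of the result.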
API reference
All endpoints require authentication. Base path: /api/v1.
Knowledge graph
| Method | Endpoint | Description |
|---|---|---|
| GET | /knowledge-graph/entities | List all entities in the org |
| GET | /knowledge-graph/entities/{entity_id} | Entity detail with columns and quality history |
| GET | /knowledge-graph/relationships | List relationships (filterable by type, file) |
| GET | /knowledge-graph/relationships/{file_id} | Relationships involving a specific file |
| GET | /knowledge-graph/functional-dependencies/{file_id} | Functional dependencies within a file |
| GET | /knowledge-graph/lineage/{entity_id} | Quality history for an entity over time |
| GET | /knowledge-graph/graph-data | Full graph (nodes + edges) for visualization |
| GET | /knowledge-graph/graph-diagnostic | Diagnostic stats (entity counts, linked/unlinked files) |
| POST | /knowledge-graph/re-detect-relationships | Clear and re-run relationship detection |
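As an example of calling a filterable endpoint, the sketch below builds a URL for the relationships listing. The query parameter names `type` and `file_id` are guesses based on the "filterable by type, file" description; check the actual parameter names before relying on them.

```python
from urllib.parse import urlencode

BASE_URL = "https://api.orca-klavest.app/api/v1"


def relationships_url(relationship_type=None, file_id=None):
    """Build the relationships listing URL with optional filters."""
    params = {}
    if relationship_type:
        params["type"] = relationship_type   # assumed parameter name
    if file_id:
        params["file_id"] = file_id          # assumed parameter name
    url = f"{BASE_URL}/knowledge-graph/relationships"
    return f"{url}?{urlencode(params)}" if params else url


print(relationships_url())
print(relationships_url(relationship_type="equivalent"))
```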
Data lineage
| Method | Endpoint | Description | Auth |
|---|---|---|---|
| POST | /lineage/nodes | Create a lineage node | Admin |
| GET | /lineage/nodes | List lineage nodes (filterable by type, source) | Any user |
| PATCH | /lineage/nodes/{node_id} | Update a lineage node | Admin |
| DELETE | /lineage/nodes/{node_id} | Delete a lineage node (deactivates connected edges) | Admin |
| POST | /lineage/edges | Create a lineage edge | Admin |
| GET | /lineage/edges | List active lineage edges | Any user |
| DELETE | /lineage/edges/{edge_id} | Delete a lineage edge | Admin |
| GET | /lineage/graph | Full lineage graph (nodes + edges) | Any user |
| GET | /lineage/impact/{node_id} | Downstream impact analysis from a node | Any user |
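For the admin-only write endpoints, a request to create a lineage node might be prepared as follows. The request body fields `name` and `node_type` are inferred from the node objects in the impact analysis response and are not confirmed as the creation schema. The sketch only builds the request object and does not send it.

```python
import json
from urllib.request import Request

BASE_URL = "https://api.orca-klavest.app/api/v1"
NODE_TYPES = {"source", "transformation", "destination"}


def create_node_request(token, name, node_type):
    """Build (but do not send) a POST /lineage/nodes request."""
    if node_type not in NODE_TYPES:
        raise ValueError(f"unknown node type: {node_type}")
    # Assumed body shape: field names mirror the node objects in API responses.
    body = json.dumps({"name": name, "node_type": node_type}).encode("utf-8")
    return Request(
        f"{BASE_URL}/lineage/nodes",
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


request = create_node_request("example-token", "raw_customers", "source")
print(request.get_method(), request.full_url)
```

Passing the request to `urllib.request.urlopen` would send it; the token must belong to an admin user for this endpoint.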
Example: run impact analysis
curl https://api.orca-klavest.app/api/v1/lineage/impact/{node_id} \
  -H "Authorization: Bearer $TOKEN"
Response:
{
  "data": {
    "root_node": { "id": "...", "name": "raw_customers", "node_type": "source" },
    "impacted_nodes": [
      { "id": "...", "name": "dim_customers", "node_type": "transformation" },
      { "id": "...", "name": "churn_model_input", "node_type": "destination" }
    ],
    "impacted_edges": [...],
    "depth": 2
  }
}
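Once the response is parsed, it is straightforward to summarize which assets are affected. The sketch below works on a copy of the sample response, with `impacted_edges` left empty because the example elides its contents. The `summarize_impact` helper is hypothetical.

```python
import json

# Copy of the sample response; the elided edge list is represented as empty.
sample_response = """
{
  "data": {
    "root_node": {"id": "...", "name": "raw_customers", "node_type": "source"},
    "impacted_nodes": [
      {"id": "...", "name": "dim_customers", "node_type": "transformation"},
      {"id": "...", "name": "churn_model_input", "node_type": "destination"}
    ],
    "impacted_edges": [],
    "depth": 2
  }
}
"""


def summarize_impact(payload):
    """Group impacted node names by node type."""
    data = payload["data"]
    by_type = {}
    for node in data["impacted_nodes"]:
        by_type.setdefault(node["node_type"], []).append(node["name"])
    return {
        "root": data["root_node"]["name"],
        "depth": data["depth"],
        "by_type": by_type,
    }


summary = summarize_impact(json.loads(sample_response))
print(summary)
```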