Knowledge Graph & Lineage

Overview

ORCA automatically builds a knowledge graph from your analysed datasets. The graph maps entities (such as “customer”, “product”, or “order”) across files, detects relationships between columns, and tracks quality lineage over time. This gives you a unified view of how your data connects and where quality issues propagate.

Knowledge graph

What the graph shows

The knowledge graph visualization displays:

File nodes — each analysed file appears as a node, showing row count, column count, quality score, and issue summary
Cross-file edges — connections between files that share entities or related columns, with connection health indicators
Shared columns — the specific columns that link two files together, including containment ratios and confidence scores
Functional dependencies — within-file column dependencies (e.g. zip_code determines city) attached to their file node
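
The containment ratio shown on shared columns is not defined on this page. A minimal sketch, assuming the common definition (the fraction of distinct non-null values in one column that also occur in the other); the function name and sample values are illustrative, not ORCA's implementation:

```python
def containment_ratio(values_a, values_b):
    """Fraction of distinct non-null values in A that also appear in B.

    Assumed definition for illustration; ORCA's exact formula may differ.
    """
    distinct_a = {v for v in values_a if v is not None}
    distinct_b = {v for v in values_b if v is not None}
    if not distinct_a:
        return 0.0
    return len(distinct_a & distinct_b) / len(distinct_a)


# Hypothetical sample data: customer IDs in an orders file and a customers file.
orders_customer_ids = [1, 2, 2, 3, 9, None]
customers_ids = [1, 2, 3, 4, 5]

print(containment_ratio(orders_customer_ids, customers_ids))  # 0.75
```

Under this definition the ratio is asymmetric: a high value from orders to customers with a lower value in the other direction would be consistent with a foreign key pattern.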

Entity detection

When ORCA analyses a file, it identifies semantic entities by examining column names, data types, and value patterns. Columns with names like customer_id, cust_id, and client_id are resolved to the same canonical entity (“customer”). Entity resolution uses:

Column name similarity and synonyms
Value distribution fingerprinting
Semantic classification from the AI engine
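
To illustrate only the first signal (column name synonyms), here is a minimal sketch. The synonym table and the function name are hypothetical, and the value fingerprinting and AI classification steps are not modelled:

```python
import re

# Hypothetical synonym table; ORCA's real vocabulary is not documented here.
SYNONYMS = {
    "customer": {"customer", "cust", "client"},
    "product": {"product", "prod", "item"},
    "order": {"order", "ord"},
}


def canonical_entity(column_name):
    """Return the canonical entity for a column name, or None if no token matches."""
    # Split on anything that is not a letter or digit, e.g. "cust_id" -> ["cust", "id"].
    tokens = [t for t in re.split(r"[^a-z0-9]+", column_name.lower()) if t]
    for entity, names in SYNONYMS.items():
        if any(token in names for token in tokens):
            return entity
    return None


for name in ["customer_id", "cust_id", "client_id", "zip_code"]:
    print(name, "->", canonical_entity(name))
```

The three customer-style names resolve to the same entity, while zip_code matches nothing and returns None.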

Relationship types

ORCA detects several types of cross-file relationships:

Type	Description
`equivalent`	Columns represent the same data (e.g. `customer_id` in both files)
`a_references_b`	Column in file A references column in file B (foreign key pattern)
`b_references_a`	Column in file B references column in file A
`junction`	Junction table pattern linking two entities
`shared_concept`	Columns represent the same concept but may have different values
`functional_dependency`	Within-file dependency (determinant/dependent pair)
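
A functional dependency such as zip_code determines city can be checked by confirming that each determinant value maps to exactly one dependent value. The sketch below assumes rows are plain dictionaries and uses a hypothetical function name; it is an exact check, whereas a production detector would likely tolerate some noise:

```python
def holds_functional_dependency(rows, determinant, dependent):
    """Return True if every determinant value maps to a single dependent value."""
    seen = {}
    for row in rows:
        key = row[determinant]
        value = row[dependent]
        if key in seen and seen[key] != value:
            return False  # same determinant, two different dependents
        seen.setdefault(key, value)
    return True


# Hypothetical sample rows.
rows = [
    {"zip_code": "10001", "city": "New York"},
    {"zip_code": "94105", "city": "San Francisco"},
    {"zip_code": "10001", "city": "New York"},
]

print(holds_functional_dependency(rows, "zip_code", "city"))  # True
```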

Connection health

Each edge in the graph is assigned a health status based on the quality of the shared columns:

Status	Meaning
Healthy	No significant issues in shared columns
Warning	Substantial null rates or other warning-level issues (for example, GDPR-sensitive fields) in shared columns
Critical	Format violations, duplicates, or orphaned references in shared columns
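
One way to read the table is that the most severe issue found in any shared column decides the status of the edge. The sketch below makes that assumption explicit; the issue labels and the function name are hypothetical and do not reflect ORCA's internal identifiers:

```python
# Hypothetical issue labels grouped by the severities listed in the table.
CRITICAL_ISSUES = {"format_violation", "duplicate", "orphaned_reference"}
WARNING_ISSUES = {"high_null_rate", "gdpr_field"}


def edge_health(column_issues):
    """Map {column name: set of issue labels} to a health status.

    Assumption: the worst issue across all shared columns wins.
    """
    all_issues = set().union(*column_issues.values())
    if all_issues & CRITICAL_ISSUES:
        return "Critical"
    if all_issues & WARNING_ISSUES:
        return "Warning"
    return "Healthy"


print(edge_health({"customer_id": set(), "email": {"gdpr_field"}}))  # Warning
```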

Using the graph visualization

Navigate to the Knowledge Graph page from the sidebar. The graph displays every file that has at least one cross-file connection.

Click a file node to see its columns, quality score, entity columns, and functional dependencies
Click an edge to inspect the shared columns, relationship types, containment ratios, and per-column quality
Filter by relationship type or file to focus on specific connections
Re-detect relationships using the action button to refresh connections after uploading new files

Files without cross-file connections are hidden from the graph by default to reduce visual noise. To see how many files are linked and unlinked, call the graph diagnostic endpoint (GET /knowledge-graph/graph-diagnostic).

Data lineage

ORCA provides two levels of lineage tracking:

Column-level lineage (automatic)

As ORCA analyses files over time, it records quality snapshots per entity column. This lets you track how quality metrics change across uploads:

Quality score trends per entity
Null rate changes over time
Issue count history
Corrections applied at each point
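
The trends above come down to comparing consecutive snapshots. A minimal sketch, assuming a hypothetical snapshot shape (the field names below are not taken from the ORCA API):

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    # Hypothetical fields mirroring the metrics listed above.
    upload: str
    quality_score: float
    null_rate: float
    issue_count: int


def snapshot_deltas(snapshots):
    """Return the change in each metric between consecutive snapshots."""
    changes = []
    for prev, cur in zip(snapshots, snapshots[1:]):
        changes.append({
            "upload": cur.upload,
            "quality_score": round(cur.quality_score - prev.quality_score, 4),
            "null_rate": round(cur.null_rate - prev.null_rate, 4),
            "issue_count": cur.issue_count - prev.issue_count,
        })
    return changes


# Hypothetical history for one entity column across two uploads.
history = [
    Snapshot("upload_1", 82.0, 0.12, 14),
    Snapshot("upload_2", 88.5, 0.05, 6),
]

print(snapshot_deltas(history))
```

A positive quality score delta together with negative null rate and issue count deltas indicates that quality improved between the two uploads.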

Data flow lineage (manual + detected)

The lineage system models how data flows between sources, transformations, and destinations using a node-and-edge graph. Node types represent data assets:

Node type	Description
`source`	Raw data source (S3 bucket, database table, CSV file)
`transformation`	ETL/ELT step or data pipeline stage
`destination`	Final output (data warehouse table, report, model input)

Edge types represent data flow:

Edge type	Description
`derives_from`	Target is derived from source (transformation output)
`copies_to`	Data is copied without transformation
`aggregates`	Target aggregates data from source

Edges can include column mappings that specify which source columns map to which target columns.
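
The node and edge model can be sketched with two small data classes. The type values come from the tables above; the field names (source_id, target_id, column_mappings) are assumptions for illustration and may not match the API payloads:

```python
from dataclasses import dataclass, field

NODE_TYPES = {"source", "transformation", "destination"}
EDGE_TYPES = {"derives_from", "copies_to", "aggregates"}


@dataclass
class LineageNode:
    id: str
    name: str
    node_type: str

    def __post_init__(self):
        if self.node_type not in NODE_TYPES:
            raise ValueError(f"unknown node type: {self.node_type}")


@dataclass
class LineageEdge:
    source_id: str  # hypothetical field name
    target_id: str  # hypothetical field name
    edge_type: str
    # Maps a source column name to the target column it feeds.
    column_mappings: dict = field(default_factory=dict)

    def __post_init__(self):
        if self.edge_type not in EDGE_TYPES:
            raise ValueError(f"unknown edge type: {self.edge_type}")


raw = LineageNode("n1", "raw_customers", "source")
dim = LineageNode("n2", "dim_customers", "transformation")
edge = LineageEdge(raw.id, dim.id, "derives_from", {"cust_id": "customer_id"})
print(edge)
```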

Impact analysis

Select any lineage node and run impact analysis to see all downstream nodes that would be affected by a change. The analysis traverses the lineage graph recursively (up to 10 levels deep) and returns:

All impacted downstream nodes
The edges connecting them
The maximum depth of impact

This is useful for understanding blast radius before modifying a data source or pipeline.
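
The traversal can be sketched as a depth-limited walk over downstream edges. This version uses an iterative breadth-first search rather than recursion, represents edges as hypothetical (source_id, target_id) pairs, and returns the same three pieces of information listed above; it is an illustration, not ORCA's implementation:

```python
from collections import deque

MAX_DEPTH = 10  # depth limit stated above


def impact_analysis(edges, root, max_depth=MAX_DEPTH):
    """Collect everything downstream of root, up to max_depth levels."""
    downstream = {}
    for source, target in edges:
        downstream.setdefault(source, []).append(target)

    depth_of = {root: 0}
    impacted_edges = []
    queue = deque([root])
    while queue:
        node = queue.popleft()
        if depth_of[node] >= max_depth:
            continue  # do not expand beyond the depth limit
        for target in downstream.get(node, []):
            impacted_edges.append((node, target))
            if target not in depth_of:  # also guards against cycles
                depth_of[target] = depth_of[node] + 1
                queue.append(target)

    return {
        "impacted_nodes": [n for n in depth_of if n != root],
        "impacted_edges": impacted_edges,
        "depth": max(depth_of.values()),
    }


# Hypothetical lineage: one chain from raw_customers plus an unrelated edge.
edges = [
    ("raw_customers", "dim_customers"),
    ("dim_customers", "churn_model_input"),
    ("raw_orders", "fact_orders"),
]
result = impact_analysis(edges, "raw_customers")
print(result["impacted_nodes"])  # ['dim_customers', 'churn_model_input']
print(result["depth"])           # 2
```

Breadth-first order means each node is recorded at its shortest distance from the root, so the reported depth is the largest such distance among the impacted nodes.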

API reference

All endpoints require authentication. Base path: /api/v1.

Knowledge graph

Method	Endpoint	Description
`GET`	`/knowledge-graph/entities`	List all entities in the org
`GET`	`/knowledge-graph/entities/{entity_id}`	Entity detail with columns and quality history
`GET`	`/knowledge-graph/relationships`	List relationships (filterable by type, file)
`GET`	`/knowledge-graph/relationships/{file_id}`	Relationships involving a specific file
`GET`	`/knowledge-graph/functional-dependencies/{file_id}`	Functional dependencies within a file
`GET`	`/knowledge-graph/lineage/{entity_id}`	Quality history for an entity over time
`GET`	`/knowledge-graph/graph-data`	Full graph (nodes + edges) for visualization
`GET`	`/knowledge-graph/graph-diagnostic`	Diagnostic stats (entity counts, linked/unlinked files)
`POST`	`/knowledge-graph/re-detect-relationships`	Clear and re-run relationship detection

Data lineage

Method	Endpoint	Description	Auth
`POST`	`/lineage/nodes`	Create a lineage node	Admin
`GET`	`/lineage/nodes`	List lineage nodes (filterable by type, source)	Any user
`PATCH`	`/lineage/nodes/{node_id}`	Update a lineage node	Admin
`DELETE`	`/lineage/nodes/{node_id}`	Delete a lineage node (deactivates connected edges)	Admin
`POST`	`/lineage/edges`	Create a lineage edge	Admin
`GET`	`/lineage/edges`	List active lineage edges	Any user
`DELETE`	`/lineage/edges/{edge_id}`	Delete a lineage edge	Admin
`GET`	`/lineage/graph`	Full lineage graph (nodes + edges)	Any user
`GET`	`/lineage/impact/{node_id}`	Downstream impact analysis from a node	Any user

Example: run impact analysis

curl https://api.orca-klavest.app/api/v1/lineage/impact/{node_id} \
  -H "Authorization: Bearer $TOKEN"

Response:

{
  "data": {
    "root_node": { "id": "...", "name": "raw_customers", "node_type": "source" },
    "impacted_nodes": [
      { "id": "...", "name": "dim_customers", "node_type": "transformation" },
      { "id": "...", "name": "churn_model_input", "node_type": "destination" }
    ],
    "impacted_edges": [...],
    "depth": 2
  }
}
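
The same call can be made from Python with the standard library. This is a sketch that assumes the response has the shape shown above; the helper names are hypothetical, and the sample payload uses an empty list for impacted_edges because the example response elides that field:

```python
import json
import urllib.request
from urllib.parse import quote

BASE_URL = "https://api.orca-klavest.app/api/v1"


def build_impact_request(node_id, token):
    """Build the authenticated GET request for the impact endpoint."""
    url = f"{BASE_URL}/lineage/impact/{quote(node_id, safe='')}"
    return urllib.request.Request(url, headers={"Authorization": f"Bearer {token}"})


def fetch_impact(node_id, token):
    """Send the request and decode the JSON body (requires network access)."""
    with urllib.request.urlopen(build_impact_request(node_id, token)) as resp:
        return json.load(resp)


def summarize_impact(payload):
    """Turn an impact response into a one-line summary."""
    data = payload["data"]
    names = [node["name"] for node in data["impacted_nodes"]]
    return (
        f"{data['root_node']['name']} affects {len(names)} node(s) "
        f"up to depth {data['depth']}: {', '.join(names)}"
    )


# Sample payload copied from the response above; impacted_edges is left empty.
SAMPLE = {
    "data": {
        "root_node": {"id": "...", "name": "raw_customers", "node_type": "source"},
        "impacted_nodes": [
            {"id": "...", "name": "dim_customers", "node_type": "transformation"},
            {"id": "...", "name": "churn_model_input", "node_type": "destination"},
        ],
        "impacted_edges": [],
        "depth": 2,
    }
}

print(summarize_impact(SAMPLE))
# raw_customers affects 2 node(s) up to depth 2: dim_customers, churn_model_input
```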
