
Overview

ORCA automatically builds a knowledge graph from your analysed datasets. The graph maps entities (such as “customer”, “product”, or “order”) across files, detects relationships between columns, and tracks quality lineage over time. This gives you a unified view of how your data connects and where quality issues propagate.

Knowledge graph

What the graph shows

The knowledge graph visualization displays:
  • File nodes — each analysed file appears as a node, showing row count, column count, quality score, and issue summary
  • Cross-file edges — connections between files that share entities or related columns, with connection health indicators
  • Shared columns — the specific columns that link two files together, including containment ratios and confidence scores
  • Functional dependencies — within-file column dependencies (e.g. zip_code determines city) attached to their file node
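
To make the functional dependency idea concrete, here is a minimal sketch of checking whether one column determines another. The function name and row format are illustrative assumptions, not ORCA's internal implementation.

```python
from typing import Hashable, Iterable, Mapping


def holds_functional_dependency(
    rows: Iterable[Mapping[str, Hashable]],
    determinant: str,
    dependent: str,
) -> bool:
    """Return True if each determinant value maps to exactly one dependent value."""
    seen: dict[Hashable, Hashable] = {}
    for row in rows:
        key, value = row[determinant], row[dependent]
        if key in seen and seen[key] != value:
            return False  # same determinant, different dependent: dependency violated
        seen[key] = value
    return True


# Illustrative data only.
rows = [
    {"zip_code": "10001", "city": "New York"},
    {"zip_code": "10001", "city": "New York"},
    {"zip_code": "94105", "city": "San Francisco"},
]
print(holds_functional_dependency(rows, "zip_code", "city"))  # True
```

A real detector would likely tolerate a small share of violating rows rather than require an exact match; this sketch is strict for simplicity.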

Entity detection

When ORCA analyses a file, it identifies semantic entities by examining column names, data types, and value patterns. Columns with names like customer_id, cust_id, and client_id are resolved to the same canonical entity (“customer”). Entity resolution uses:
  • Column name similarity and synonyms
  • Value distribution fingerprinting
  • Semantic classification from the AI engine
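
As a rough illustration of the first signal (column name similarity and synonyms), the sketch below maps column names to a canonical entity using a small synonym table. The table contents and the `canonical_entity` function are hypothetical; ORCA's actual vocabulary and matching logic are not documented here, and value fingerprinting and AI classification are not covered.

```python
import re
from typing import Optional

# Hypothetical synonym table, for illustration only.
ENTITY_SYNONYMS = {
    "customer": {"customer", "cust", "client"},
    "product": {"product", "prod", "item"},
    "order": {"order", "ord"},
}


def canonical_entity(column_name: str) -> Optional[str]:
    """Map a column name such as 'cust_id' to a canonical entity, or None."""
    # Split on any non-alphanumeric separator (underscore, dash, space).
    tokens = [t for t in re.split(r"[^a-z0-9]+", column_name.lower()) if t]
    for entity, synonyms in ENTITY_SYNONYMS.items():
        if any(token in synonyms for token in tokens):
            return entity
    return None


for name in ("customer_id", "cust_id", "client_id", "zip_code"):
    print(name, "->", canonical_entity(name))
```

The three customer-style names resolve to "customer", while zip_code matches no entity and returns None.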

Relationship types

ORCA detects several types of cross-file relationships:

| Type | Description |
| --- | --- |
| equivalent | Columns represent the same data (e.g. customer_id in both files) |
| a_references_b | Column in file A references column in file B (foreign key pattern) |
| b_references_a | Column in file B references column in file A |
| junction | Junction table pattern linking two entities |
| shared_concept | Columns represent the same concept but may have different values |
| functional_dependency | Within-file dependency (determinant/dependent pair) |
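
One way to reason about the reference-style types is through containment: the share of distinct values in one column that also appear in the other. The sketch below is an assumption-laden illustration, not ORCA's detection algorithm. The 0.95 threshold, the function names, and the rule that partial overlap means shared_concept are all hypothetical, and the junction and functional_dependency types are not handled.

```python
from typing import Hashable, Iterable, Optional


def containment(a: Iterable[Hashable], b: Iterable[Hashable]) -> float:
    """Fraction of distinct non-null values in a that also appear in b."""
    set_a = {v for v in a if v is not None}
    set_b = {v for v in b if v is not None}
    if not set_a:
        return 0.0
    return len(set_a & set_b) / len(set_a)


def classify_relationship(
    a_values: Iterable[Hashable],
    b_values: Iterable[Hashable],
    threshold: float = 0.95,  # hypothetical cut-off
) -> Optional[str]:
    """Guess a relationship type from containment in both directions."""
    a_list, b_list = list(a_values), list(b_values)
    a_in_b = containment(a_list, b_list)
    b_in_a = containment(b_list, a_list)
    if a_in_b >= threshold and b_in_a >= threshold:
        return "equivalent"
    if a_in_b >= threshold:
        return "a_references_b"  # every value in A exists in B
    if b_in_a >= threshold:
        return "b_references_a"
    if a_in_b > 0 or b_in_a > 0:
        return "shared_concept"  # assumption: partial overlap only
    return None  # no overlap at all


orders_customer_id = [1, 2, 2, 3]
customers_customer_id = [1, 2, 3, 4]
print(classify_relationship(orders_customer_id, customers_customer_id))  # a_references_b
```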

Connection health

Each edge in the graph is assigned a health status based on the quality of the shared columns:

| Status | Meaning |
| --- | --- |
| Healthy | No significant issues in shared columns |
| Warning | Substantial null rates or other warning-level issues (nulls, GDPR fields) |
| Critical | Format violations, duplicates, or orphaned references in shared columns |
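
The status rules above can be read as a severity ranking in which the worst issue wins. A minimal sketch follows, assuming issues arrive as a set of string labels; the label names are hypothetical and chosen only to mirror the table.

```python
# Hypothetical issue labels mirroring the status table.
CRITICAL_ISSUES = {"format_violation", "duplicate", "orphaned_reference"}
WARNING_ISSUES = {"high_null_rate", "gdpr_field"}


def connection_health(issues: set[str]) -> str:
    """Return the health status for an edge given issues found in its shared columns."""
    if issues & CRITICAL_ISSUES:
        return "Critical"  # any critical issue outranks warnings
    if issues & WARNING_ISSUES:
        return "Warning"
    return "Healthy"


print(connection_health(set()))                                   # Healthy
print(connection_health({"high_null_rate"}))                      # Warning
print(connection_health({"high_null_rate", "orphaned_reference"}))  # Critical
```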

Using the graph visualization

Navigate to the Knowledge Graph page from the sidebar. The graph displays all files with cross-file connections.
  • Click a file node to see its columns, quality score, entity columns, and functional dependencies
  • Click an edge to inspect the shared columns, relationship types, containment ratios, and per-column quality
  • Filter by relationship type or file to focus on specific connections
  • Re-detect relationships using the action button to refresh connections after uploading new files

Files without cross-file connections are hidden from the graph by default to reduce visual noise. To see how many files are unlinked, call the graph diagnostic endpoint (GET /knowledge-graph/graph-diagnostic).

Data lineage

ORCA provides two levels of lineage tracking:

Column-level lineage (automatic)

As ORCA analyses files over time, it records quality snapshots per entity column. This lets you track how quality metrics change across uploads:
  • Quality score trends per entity
  • Null rate changes over time
  • Issue count history
  • Corrections applied at each point
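
To show what tracking these metrics across uploads can look like, the sketch below computes the change between consecutive snapshots. The `Snapshot` fields are assumptions made for illustration; the actual snapshot schema is not specified here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    # Hypothetical fields; ISO 8601 dates sort correctly as strings.
    uploaded_at: str
    quality_score: float
    null_rate: float
    issue_count: int


def snapshot_deltas(snapshots: list[Snapshot]) -> list[dict]:
    """Return metric changes between each pair of consecutive snapshots."""
    ordered = sorted(snapshots, key=lambda s: s.uploaded_at)
    return [
        {
            "uploaded_at": cur.uploaded_at,
            "quality_score_change": round(cur.quality_score - prev.quality_score, 4),
            "null_rate_change": round(cur.null_rate - prev.null_rate, 4),
            "issue_count_change": cur.issue_count - prev.issue_count,
        }
        for prev, cur in zip(ordered, ordered[1:])
    ]


history = [
    Snapshot("2024-02-01", 88.5, 0.05, 9),
    Snapshot("2024-01-01", 82.0, 0.12, 14),
]
print(snapshot_deltas(history))
```

With the illustrative history above, the single delta reports a quality score change of 6.5, a null rate change of -0.07, and five fewer issues.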

Data flow lineage (manual + detected)

The lineage system models how data flows between sources, transformations, and destinations using a node-and-edge graph. Node types represent data assets:

| Node type | Description |
| --- | --- |
| source | Raw data source (S3 bucket, database table, CSV file) |
| transformation | ETL/ELT step or data pipeline stage |
| destination | Final output (data warehouse table, report, model input) |

Edge types represent data flow:

| Edge type | Description |
| --- | --- |
| derives_from | Target is derived from source (transformation output) |
| copies_to | Data is copied without transformation |
| aggregates | Target aggregates data from source |

Edges can include column mappings that specify which source columns map to which target columns.
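
The node-and-edge model can be sketched with two small data classes. The node and edge type values come from the tables above, and `name` and `node_type` appear in the API response example; the remaining field names (`source_id`, `target_id`, `column_mappings`) are assumptions for illustration rather than the documented schema.

```python
from dataclasses import dataclass, field
from typing import Literal

NodeType = Literal["source", "transformation", "destination"]
EdgeType = Literal["derives_from", "copies_to", "aggregates"]


@dataclass
class LineageNode:
    id: str
    name: str
    node_type: NodeType


@dataclass
class LineageEdge:
    source_id: str  # hypothetical field name
    target_id: str  # hypothetical field name
    edge_type: EdgeType
    # Maps a source column name to the target column it feeds.
    column_mappings: dict[str, str] = field(default_factory=dict)


raw = LineageNode(id="n1", name="raw_customers", node_type="source")
dim = LineageNode(id="n2", name="dim_customers", node_type="transformation")
edge = LineageEdge(
    source_id=raw.id,
    target_id=dim.id,
    edge_type="derives_from",
    column_mappings={"cust_id": "customer_id"},
)
print(edge)
```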

Impact analysis

Select any lineage node and run impact analysis to see all downstream nodes that would be affected by a change. The analysis traverses the lineage graph recursively (up to 10 levels deep) and returns:
  • All impacted downstream nodes
  • The edges connecting them
  • The maximum depth of impact
This is useful for understanding the blast radius of a change before you modify a data source or pipeline.
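
A depth-limited downstream traversal of this kind can be sketched as follows. This version uses an iterative breadth-first search rather than recursion, represents edges as plain (source, target) pairs, and is an illustration under those assumptions, not ORCA's implementation.

```python
from collections import deque

MAX_DEPTH = 10  # matches the documented traversal limit


def impact_analysis(root: str, edges: list[tuple[str, str]], max_depth: int = MAX_DEPTH) -> dict:
    """Collect nodes and edges reachable downstream of root within max_depth hops."""
    downstream: dict[str, list[str]] = {}
    for source, target in edges:
        downstream.setdefault(source, []).append(target)

    depth_of = {root: 0}  # also serves as the visited set, so cycles terminate
    impacted_edges: list[tuple[str, str]] = []
    queue = deque([root])
    while queue:
        node = queue.popleft()
        if depth_of[node] >= max_depth:
            continue  # do not expand beyond the depth limit
        for child in downstream.get(node, []):
            impacted_edges.append((node, child))
            if child not in depth_of:
                depth_of[child] = depth_of[node] + 1
                queue.append(child)

    return {
        "impacted_nodes": [n for n in depth_of if n != root],
        "impacted_edges": impacted_edges,
        "depth": max(depth_of.values()),
    }


edges = [
    ("raw_customers", "dim_customers"),
    ("dim_customers", "churn_model_input"),
]
print(impact_analysis("raw_customers", edges))
```

For the two-edge chain above, the result lists dim_customers and churn_model_input as impacted, with a depth of 2.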

API reference

All endpoints require authentication. Base path: /api/v1.

Knowledge graph

| Method | Endpoint | Description |
| --- | --- | --- |
| GET | /knowledge-graph/entities | List all entities in the org |
| GET | /knowledge-graph/entities/{entity_id} | Entity detail with columns and quality history |
| GET | /knowledge-graph/relationships | List relationships (filterable by type, file) |
| GET | /knowledge-graph/relationships/{file_id} | Relationships involving a specific file |
| GET | /knowledge-graph/functional-dependencies/{file_id} | Functional dependencies within a file |
| GET | /knowledge-graph/lineage/{entity_id} | Quality history for an entity over time |
| GET | /knowledge-graph/graph-data | Full graph (nodes + edges) for visualization |
| GET | /knowledge-graph/graph-diagnostic | Diagnostic stats (entity counts, linked/unlinked files) |
| POST | /knowledge-graph/re-detect-relationships | Clear and re-run relationship detection |

Data lineage

| Method | Endpoint | Description | Auth |
| --- | --- | --- | --- |
| POST | /lineage/nodes | Create a lineage node | Admin |
| GET | /lineage/nodes | List lineage nodes (filterable by type, source) | Any user |
| PATCH | /lineage/nodes/{node_id} | Update a lineage node | Admin |
| DELETE | /lineage/nodes/{node_id} | Delete a lineage node (deactivates connected edges) | Admin |
| POST | /lineage/edges | Create a lineage edge | Admin |
| GET | /lineage/edges | List active lineage edges | Any user |
| DELETE | /lineage/edges/{edge_id} | Delete a lineage edge | Admin |
| GET | /lineage/graph | Full lineage graph (nodes + edges) | Any user |
| GET | /lineage/impact/{node_id} | Downstream impact analysis from a node | Any user |

Example: run impact analysis

curl https://api.orca-klavest.app/api/v1/lineage/impact/{node_id} \
  -H "Authorization: Bearer $TOKEN"
Response:
{
  "data": {
    "root_node": { "id": "...", "name": "raw_customers", "node_type": "source" },
    "impacted_nodes": [
      { "id": "...", "name": "dim_customers", "node_type": "transformation" },
      { "id": "...", "name": "churn_model_input", "node_type": "destination" }
    ],
    "impacted_edges": [...],
    "depth": 2
  }
}
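
The same request can be made from Python with the standard library. The sketch below assumes the base URL and bearer-token header shown in the curl example and the `data` envelope shown in the response; the function names and the `ORCA_TOKEN` environment variable are hypothetical, and error handling is omitted.

```python
import json
import os
import urllib.parse
import urllib.request

BASE_URL = "https://api.orca-klavest.app/api/v1"


def build_impact_request(node_id: str, token: str) -> urllib.request.Request:
    """Build the GET request for downstream impact analysis of one node."""
    # Percent-encode the ID so unusual characters cannot alter the path.
    url = f"{BASE_URL}/lineage/impact/{urllib.parse.quote(node_id, safe='')}"
    return urllib.request.Request(
        url,
        headers={"Authorization": f"Bearer {token}"},
        method="GET",
    )


def fetch_impact(node_id: str, token: str, timeout: float = 10.0) -> dict:
    """Send the request and return the 'data' object from the JSON response."""
    request = build_impact_request(node_id, token)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)["data"]


if __name__ == "__main__":
    # ORCA_TOKEN is a hypothetical environment variable name.
    token = os.environ.get("ORCA_TOKEN", "")
    if token:
        impact = fetch_impact("your-node-id", token)
        print(impact["depth"], [n["name"] for n in impact["impacted_nodes"]])
```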