- ⏩ TL;DR
- 🚀 Introduction
- 📖 Understanding Data Lineage further
- 🏦 Solving Data Lineage for Enterprises
- ⭐ Recommended Reads
⏩ TL;DR
- Data lineage tracks data from its origin to destination, crucial for data management and governance frameworks, benefiting data engineers and owners by ensuring quality control and compliance.
- While data provenance focuses on the origin and history of data, data lineage provides a broader view, detailing its journey through processes and transformations.
- Data lineage can be table-level or field-level, with techniques including inferred lineage, tagging, self-contained lineage, and factual lineage, each with its advantages and limitations.
- Various tools like Solidatus, Alation, Pentaho Data Catalog, and Monte Carlo offer data lineage solutions, with factors to consider including data source integrations, column-level lineage support, adherence to standards like OpenLineage, visualisation capabilities, and security measures.
- Enterprises face challenges in implementing robust data lineage due to complex ecosystems, inadequate governance, lack of expertise, and overemphasis on tools. Strategies for building data lineage include metadata management maturity, tool evaluation, education and awareness, and iterative scaling.
🚀 Introduction
Data Lineage
Data lineage refers to the end-to-end view of the flow of data from its source (origin) through various processes and transformations to its target (destination). It traces the lifecycle of data, detailing how it is created, manipulated, utilised and accessed within a system or across systems. Building and understanding the lineage of your data is a critical part of data engineering and analytics projects, impacting both the engineering and the business side of your work.
Data lineage is one of the key components of the overall Data Management and Governance frameworks. The primary audience for data lineage is the data engineers and data owners who need to monitor and maintain quality control over their datasets.
The need for data lineage

With the growth in data usage, data regulations (GDPR), AI and more, understanding and trusting the source of your data is in higher demand than ever. Data engineering teams working on large-scale projects struggle to trace the source of their data. Understanding legacy system code and aligning it with business rules and definitions becomes a difficult task. Data lineage is crucial for understanding the provenance of data, ensuring data quality, complying with regulations, and troubleshooting issues within data pipelines or workflows.
In summary, data lineage helps enterprises with:
- Identifying data quality issues: Data lineage provides transparency into the flow of data, allowing organisations to identify potential issues such as data inconsistencies, inaccuracies, or anomalies. By understanding the origin and transformations applied to data, organisations can implement quality control measures to ensure data accuracy and reliability.
- Enabling effective data governance: It supports effective data governance by providing visibility into data assets, their usage, and lineage relationships. It helps organisations establish data ownership, define data policies, and enforce data quality standards.
- Regulatory compliance: Many industries are subject to regulatory requirements regarding data management and reporting. Data lineage helps organisations demonstrate compliance by providing a comprehensive audit trail of data movement and transformations. This enables organisations to track data lineage for regulatory reporting, such as compliance with GDPR, HIPAA, or financial regulations.
- Impact analysis: Data lineage enables organisations to assess the impact of changes to data sources, processes, or systems. By understanding how changes propagate through data pipelines, organisations can anticipate potential impacts on downstream systems or analyses. This helps mitigate risks and ensures business continuity during system upgrades, migrations, or changes to data schemas.
- Troubleshooting and RCA: When data issues arise, such as discrepancies or errors in reports or analyses, data lineage serves as a valuable tool for troubleshooting and root cause analysis.
Data Provenance
You may have come across the term Data Provenance in sales and marketing slides or in a data governance meeting, so it is worth briefly covering it here.
In short, data provenance refers to the history and origin of data, including its creation, movement, and transformations throughout its lifecycle. Unlike data lineage, data provenance focuses on the metadata of the source data, providing authenticity and historical context.
Difference between Data provenance & Data lineage
| Category | Data Provenance | Data Lineage |
|---|---|---|
| Definition | Primarily emphasises the origin of data and its history. It involves tracing the lineage of data back to its creation point, documenting the processes it has undergone, and identifying any transformations or modifications it has experienced | It is a broader concept that encompasses not only the origin of data but also its entire journey through various processes and transformations. It involves documenting the flow of data from its source through different systems, applications, and transformations to its final destination. |
| Solves concerns | What source does this data come from? Who changed the data source? Who created the data source? | Where is this data coming from? Who is using this data? What happened to this table? |
| Key Target Audience | Data Analysts, Data Owners | Data Engineers, Data Stewards |
📖 Understanding Data Lineage further
Data Lineage types
In practice, data lineage comes in two basic types:
Table-Level Lineage
This is a high-level view of the end-to-end lineage. Table-level lineage illustrates how tables in a data environment are connected to one another. However, this type of lineage does not capture how the individual fields of a table originate.
Field-Level Lineage
Field-level lineage looks at individual column-level information from source to target datasets. It is useful for data observability use cases and enables data engineers to quickly identify and trace issues in the data pipeline at the level of an individual field.
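To make the distinction concrete, the sketch below (with hypothetical table and column names) shows how the two granularities might be represented as edges in a lineage graph:

```python
# Minimal sketch: representing lineage as edges in a directed graph.
# Table and column names below are hypothetical examples.

# Table-level lineage: one edge per (source table -> target table) dependency.
table_level_edges = [
    ("raw.orders", "staging.orders_clean"),
    ("staging.orders_clean", "analytics.daily_revenue"),
]

# Field-level lineage: one edge per (source column -> target column) mapping,
# optionally annotated with the transformation applied.
field_level_edges = [
    ("raw.orders.amount", "staging.orders_clean.amount_usd",
     "CAST + currency conversion"),
    ("staging.orders_clean.amount_usd", "analytics.daily_revenue.revenue",
     "SUM grouped by order_date"),
]

def downstream_tables(table, edges):
    """Return tables that directly consume the given table."""
    return [target for source, target in edges if source == table]

print(downstream_tables("raw.orders", table_level_edges))
# ['staging.orders_clean']
```

Field-level edges are simply a finer-grained version of the same graph, which is why many tools let you expand a table-level connection into its underlying column mappings.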
Data lineage generation techniques
For a data engineering mind, the real question is how to generate data lineage for their organisation’s data pipelines. There are many products in the market that provide out-of-the-box data lineage solutions. However, there is no one-size-fits-all method for building data lineage. Some of the techniques that products typically use are as follows:
Inferred Lineage
This technique of generating lineage is also referred to as automated lineage or pattern-based lineage. Lineage is generated by automatically matching similar data or metadata between two database objects. For example, if two tables contain columns (fields) with similar data and metadata (datatypes, cardinality, etc.), they are likely to have a lineage relationship.
The advantage of this technique is that it is product-agnostic. Since it relies on matching data and metadata, the process does not depend on a specific database product or version (Oracle, MySQL, PostgreSQL, etc.).
The key disadvantage of this approach is incorrect mappings. Inferred lineage is not always accurate: it may miss relationships between fields that are obscured by transformation logic, or it may match fields incorrectly. Hence, the reliability of such approaches diminishes in real-world scenarios.
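As a rough sketch of how pattern-based matching might work, the snippet below scores a candidate column pair on metadata similarity. The column descriptors and scoring weights are illustrative assumptions, not taken from any particular product:

```python
# Illustrative sketch of inferred (pattern-based) lineage: score candidate
# column pairs by how similar their metadata looks. Weights are arbitrary.

from dataclasses import dataclass

@dataclass
class ColumnMeta:
    name: str
    dtype: str
    distinct_count: int   # cardinality observed by profiling

def similarity(a: ColumnMeta, b: ColumnMeta) -> float:
    score = 0.0
    if a.name.lower() == b.name.lower():
        score += 0.5                       # identical names are a strong hint
    if a.dtype == b.dtype:
        score += 0.3                       # same datatype
    lo, hi = sorted([a.distinct_count, b.distinct_count])
    if hi and lo / hi > 0.9:               # similar cardinality
        score += 0.2
    return score

src = ColumnMeta("customer_id", "INTEGER", 10_000)
tgt = ColumnMeta("customer_id", "INTEGER", 9_950)
print(similarity(src, tgt))   # 1.0 -> likely, but not guaranteed, to be a match
```

Note that a high score is still only a guess, which is exactly the weakness described above: transformations that rename or reshape a column break the match.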
Lineage by Tagging
As the name suggests, the lineage-by-tagging approach depends on reading tags from the data pipeline. Tags are traced and queried from the start to the end of the pipeline.
This approach is suitable for closed data pipeline systems where tagging the ETL workflow can be achieved easily. It requires a deep understanding of the pipelines and struggles to generalise across the plethora of ETL products in the market, such as Pentaho, Informatica, and Talend.
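A minimal sketch of the idea, assuming each pipeline step records which upstream steps it reads from and which tags it introduces (step names and tags are hypothetical):

```python
# Minimal sketch of lineage by tagging: each pipeline step records the tags
# of the datasets it reads, and stamps the union of those tags onto its output.

pipeline_steps = [
    {"name": "extract_orders", "inputs": [],                 "tags": {"source:crm"}},
    {"name": "clean_orders",   "inputs": ["extract_orders"], "tags": set()},
    {"name": "revenue_report", "inputs": ["clean_orders"],   "tags": set()},
]

def propagate_tags(steps):
    """Carry tags forward so every step knows which upstream sources fed it."""
    tags_by_step = {}
    for step in steps:  # assumes steps are listed in execution order
        inherited = set().union(*(tags_by_step[i] for i in step["inputs"]))
        tags_by_step[step["name"]] = step["tags"] | inherited
    return tags_by_step

print(propagate_tags(pipeline_steps)["revenue_report"])
# {'source:crm'}
```

The hard part in practice is not the propagation logic but getting every ETL tool in the pipeline to emit and preserve the tags consistently.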
Self-contained Lineage
Another approach to generating lineage is reading the metadata stored in the data stores themselves (databases, object stores, etc.). This approach is applicable to data engineering projects that maintain comprehensive metadata about their data pipelines. However, similar to the technique above, the self-contained lineage approach is suited to closed systems: it fails to generalise to other systems.
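As a toy illustration, the snippet below assumes the platform maintains a hypothetical lineage_edges metadata table and walks it backwards to find upstream sources; SQLite is used purely so the example is self-contained:

```python
# Sketch of self-contained lineage: the data store itself maintains metadata
# describing which objects feed which. The `lineage_edges` table is hypothetical.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE lineage_edges (source TEXT, target TEXT)")
conn.executemany(
    "INSERT INTO lineage_edges VALUES (?, ?)",
    [("raw.orders", "staging.orders_clean"),
     ("staging.orders_clean", "analytics.daily_revenue")],
)

def upstream(table):
    """Walk the metadata table backwards to find every upstream source."""
    sources, frontier = set(), [table]
    while frontier:
        current = frontier.pop()
        rows = conn.execute(
            "SELECT source FROM lineage_edges WHERE target = ?", (current,)
        ).fetchall()
        for (src,) in rows:
            if src not in sources:
                sources.add(src)
                frontier.append(src)
    return sources

print(upstream("analytics.daily_revenue"))
# {'raw.orders', 'staging.orders_clean'}
```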
Factual Lineage
Factual lineage, or lineage by parsing, generates lineage by parsing the data pipeline code itself. This includes applying custom algorithms to parse SQL queries, read ETL logs, and so on. It produces the most accurate picture of the lineage.
The advantage of factual lineage is the accuracy of the generated lineage. Once implemented, it can be reused across other domains of the organisation, as parsing queries and logs is standard across the community. The key disadvantage is complexity: factual lineage can be quite complex to implement, and enterprises need specialised skills and effort to develop a generalised lineage solution.

One example of an end-to-end factual lineage solution (above diagram) is provided by Monte Carlo’s team. At a high level, the approach collects SQL queries, parses them, and generates field-level lineage. The solution involves various technologies such as AWS Kinesis, Snowflake, ANTLR, Elasticsearch, and PostgreSQL as the metadata store. It is a custom solution and supports parsing for databases such as Presto, Redshift, Snowflake, and BigQuery.
Having said that, most enterprises require a custom solution for their needs, involving parsing complex ETL pipelines, SQL queries, and other code. Data catalog products such as Pentaho Data Catalog, along with its professional services team, support the generation of factual lineage.
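To make the parsing idea concrete, here is a deliberately naive sketch that extracts table-level dependencies from a single CREATE TABLE ... AS SELECT statement using a regular expression. Real factual lineage solutions parse the full SQL grammar (for example with ANTLR, as in the Monte Carlo approach above) and handle many statement types and dialects:

```python
# Deliberately naive sketch of factual lineage (lineage by parsing): extract
# table-level dependencies from a CTAS statement with regular expressions.
# Production solutions use a proper SQL parser instead of regex.

import re

sql = """
CREATE TABLE analytics.daily_revenue AS
SELECT o.order_date, SUM(o.amount_usd) AS revenue
FROM staging.orders_clean o
JOIN staging.fx_rates r ON o.currency = r.currency
GROUP BY o.order_date
"""

target = re.search(r"CREATE\s+TABLE\s+([\w.]+)", sql, re.IGNORECASE).group(1)
sources = re.findall(r"(?:FROM|JOIN)\s+([\w.]+)", sql, re.IGNORECASE)

edges = [(src, target) for src in sources]
print(edges)
# [('staging.orders_clean', 'analytics.daily_revenue'),
#  ('staging.fx_rates', 'analytics.daily_revenue')]
```

Even this tiny example hints at where the complexity comes from: subqueries, CTEs, views, and dialect-specific syntax all defeat simple pattern matching, which is why factual lineage demands specialised parsing effort.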
Data lineage tools in the market

There is a catalogue of tools in the market that support data lineage. Some of them are listed below:
- Solidatus
- Alation
- Pentaho Data Catalog (previously Lumada data catalog)
- Collibra
- Monte Carlo
- Informatica Metadata Manager
- IBM Infosphere Information Governance Catalog
- Atlan
- MANTA
- Apache Atlas
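The TL;DR above lists adherence to standards like OpenLineage as one factor to consider when evaluating these tools. As a rough illustration, an OpenLineage run event is essentially a JSON document that ties a job run to its input and output datasets; the field values below are made up, and the exact schema and facets should be checked against the OpenLineage specification:

```python
# Rough illustration of an OpenLineage-style run event as a plain dict.
# Values are made up; consult the OpenLineage specification for the exact
# schema and facet definitions before relying on this shape.

import json

run_event = {
    "eventType": "COMPLETE",
    "eventTime": "2024-01-15T10:30:00Z",
    "producer": "https://example.com/my-pipeline",  # hypothetical producer URI
    "run": {"runId": "3f8e1c5a-0000-0000-0000-000000000000"},
    "job": {"namespace": "my_pipeline", "name": "build_daily_revenue"},
    "inputs": [{"namespace": "warehouse", "name": "staging.orders_clean"}],
    "outputs": [{"namespace": "warehouse", "name": "analytics.daily_revenue"}],
}

print(json.dumps(run_event, indent=2))
```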
🏦 Solving Data Lineage for Enterprises
Why are enterprises failing to leverage data lineage?