Use case: Applying data lineage for tracing a data propagation error

Learn how a small error can have a large-scale impact in complex, interconnected systems, and how comprehensive data lineage can act as both a detective and a roadmap for resolving such issues.

Scenario overview

Consider a hypothetical large healthcare organization.

The organization operates a core Electronic Health Records (EHR) system that stores critical patient information. This system feeds multiple downstream platforms, including billing, patient portals, insurance claims, and other operational tools.

During a routine upgrade, a coding error introduces a transformation issue in the EHR system. The patient_ID column, which uniquely identifies patients, is incorrectly mapped to the caregiver_ID column that identifies healthcare providers.

The domino effect

As the EHR system feeds data to downstream systems, the mapping error has the following impacts:

  1. In the billing system, patients are billed for services provided to their healthcare providers.
  2. The patient portals show the healthcare providers' details instead of patient-specific information, leading to privacy breaches.
  3. Insurance claims get denied due to incorrect patient information.
  4. Patients, healthcare providers, insurance companies, and even the healthcare organization's customer service and IT teams are heavily impacted.

Navigating the problem with data lineage

In this situation, data lineage plays a crucial role in both identifying and correcting the issue. The team relies on the following layers of data lineage:

  • Cross-System Lineage – When the first issues (incorrect billing and privacy breaches) arise, the organization's data governance team uses cross-system lineage to trace the patient_ID column data across all systems. They identify that the error originates in the EHR system, which feeds most downstream systems.
  • Inner-System Lineage – Using inner-system lineage within the EHR system, they realize that the patient_ID column is incorrectly mapped to the caregiver_ID column during a transformation process.
  • End-to-End Column Lineage – To assess the full impact, the team looks at the end-to-end column lineage of the patient_ID column. They map out all the processes, systems, and reports that use this column. This information is vital for communicating with affected parties and directing corrective measures.
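
The impact assessment described in these bullets amounts to a traversal of a column-level lineage graph. The following is a minimal sketch in Python, not the Cloudera Octopai API: the LINEAGE dictionary, the system and report names in it, and the downstream and upstream helpers are all hypothetical and exist only to illustrate the idea.

```python
from collections import deque

# Hypothetical column-level lineage edges (assumption, not real metadata):
# each source column maps to the columns that are derived from it.
LINEAGE = {
    "ehr.patient_ID": ["billing.patient_ID", "portal.patient_ID", "claims.patient_ID"],
    "billing.patient_ID": ["reports.monthly_invoices"],
    "claims.patient_ID": ["reports.claim_denials"],
}


def downstream(column, edges):
    """Return every column reachable from `column`, in breadth-first order."""
    reached = []
    visited = {column}
    queue = deque([column])
    while queue:
        current = queue.popleft()
        for target in edges.get(current, []):
            if target not in visited:
                visited.add(target)
                reached.append(target)
                queue.append(target)
    return reached


def upstream(column, edges):
    """Return the direct sources of `column`, sorted by name."""
    return sorted(src for src, targets in edges.items() if column in targets)


if __name__ == "__main__":
    # Everything that must be reviewed after the EHR mapping error.
    for affected in downstream("ehr.patient_ID", LINEAGE):
        print(affected)
```

Walking downstream from the EHR column lists every system and report to include in the cleanup, while walking upstream from a symptom (for example, a report with wrong values) leads back toward the faulty transformation.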

The IT team corrects the erroneous transformation in the EHR system and initiates a massive cleanup operation in all impacted downstream systems.

Prevention using an integrated platform

An integrated Cloudera Octopai Data Lineage platform that encompasses data discovery, lineage, and a data catalog can significantly improve the ability to prevent such errors by providing greater visibility into data flows and transformations. This visibility improves detection and prevention of potential issues, which enhances overall data quality and reliability.

Different roles use the integrated platform to achieve these improvements by performing the following actions:

  • Data Engineers trace data lineage during system development and maintenance, using the catalog as a reference to check transformation logic and mappings and to avoid incorrect data flows. They follow robust testing protocols before and after deploying any changes in data processing or transformation logic.
  • Data Stewards perform metadata management by ensuring that the platform metadata is up-to-date and accurate. They oversee data quality and leverage the lineage tools to verify proper data flows. They establish and monitor data quality rules aligned with the metadata and lineage information.
  • Data Analysts understand the origin and transformations of data they use for reporting and analytics. This helps them spot potential issues and validate the data they use.
  • IT Security and Compliance understand where sensitive data is stored, transformed, and used. This enables them to enforce access control and monitoring to prevent unauthorized changes, promptly detect and respond to any suspicious activities, and ensure adherence to regulatory requirements.
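
As one illustration of the mapping check that Data Engineers can run before deploying a change, the sketch below compares the column mapping a transformation actually applies against the mapping recorded in the catalog. It is a simplified assumption of how such a test could look: EXPECTED_MAPPING, the faulty mapping, and find_mapping_errors are hypothetical names, and a real catalog would be queried rather than hard-coded.

```python
# Hypothetical catalog entry (assumption): source column -> expected target column.
EXPECTED_MAPPING = {
    "patient_ID": "patient_ID",
    "caregiver_ID": "caregiver_ID",
}


def find_mapping_errors(actual, expected):
    """Return {source: (expected_target, actual_target)} for every mismatch.

    A source column missing from `actual` is reported with None as its target.
    """
    return {
        source: (expected_target, actual.get(source))
        for source, expected_target in expected.items()
        if actual.get(source) != expected_target
    }


if __name__ == "__main__":
    # The mapping introduced by the faulty upgrade in this scenario.
    faulty = {"patient_ID": "caregiver_ID", "caregiver_ID": "caregiver_ID"}
    for source, (want, got) in find_mapping_errors(faulty, EXPECTED_MAPPING).items():
        print(f"{source}: expected target {want}, found {got}")
```

Running a check like this as part of pre-deployment testing would flag the patient_ID mapping before it reaches downstream systems.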

This proactive approach is governed by a rigorous change management process, during which any system change requires an impact analysis supported by data lineage. Automated data quality checks are also used to identify out-of-range values or unusual data distributions that can indicate issues.
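
The automated checks mentioned here can be sketched in a few lines of Python. The example assumes, purely for illustration, that patient IDs are integers within a known range and that a list of valid caregiver IDs is available. The functions out_of_range and overlap_ratio, and the sample values, are hypothetical.

```python
def out_of_range(values, low, high):
    """Return the values that fall outside the inclusive range [low, high]."""
    return [v for v in values if not (low <= v <= high)]


def overlap_ratio(values, reference):
    """Return the fraction of `values` that also appear in `reference`."""
    if not values:
        return 0.0
    reference_set = set(reference)
    return sum(1 for v in values if v in reference_set) / len(values)


if __name__ == "__main__":
    # Hypothetical sample data (assumption): IDs loaded into a downstream
    # patient_ID column and the known caregiver IDs.
    loaded_patient_ids = [7001, 7002, 7003, 100245]
    caregiver_ids = [7001, 7002, 7003, 7004]

    # Assumed valid patient ID range: 100000 to 999999.
    print(out_of_range(loaded_patient_ids, 100000, 999999))
    # A high overlap suggests caregiver IDs were loaded as patient IDs.
    print(overlap_ratio(loaded_patient_ids, caregiver_ids))
```

A sudden rise in out-of-range values, or a high overlap between the patient_ID column and known caregiver IDs, is the kind of unusual distribution that would point to the mapping error in this scenario.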