Data Discovery & Lineage for an Event Streaming Platform

Shahir A. Daya
Jun 23, 2022

By Shahir A. Daya, IBM Distinguished Engineer and Chief Architect, and Shashwat Pandey, IBM Software Engineer

In an earlier article, Mehryar and I described the Event-Driven Book of Reference pattern we leveraged to facilitate application modernization. The implementation of the pattern evolved into a foundational technology platform on top of which every client experience (i.e., business capability, product, etc.) was built and delivered. The platform grew in its capabilities and has a dedicated platform engineering team that evolves it based on a roadmap.

We had broken down the data silos and had all data from the many Systems of Record streaming into the platform to establish an event-driven book of reference in near real-time. One of our goals was for the platform to be self-serve. We needed full-stack product teams to be able to discover what data was available in the book of reference, where it came from (i.e., which System of Record), how and by which component it was processed or transformed, what API could be used to consume it, and what its data classification was, among other things. In addition to making the platform self-serve for product development teams, we needed to be able to quickly answer questions about the data from data owners, risk, compliance and security teams. We realized that we needed a metadata repository. We started to look at what open-source solutions were available and experimented with a few. We ended up selecting Apache Atlas.

In this article, Shash and I will share how we have integrated Apache Atlas, an open-source metadata management and data governance solution, into our platform to provide that data discovery and end-to-end data lineage capability.

What is Apache Atlas?

Apache Atlas is a metadata management and data governance tool. Looking deeper, it is built on a collection of popular, battle-tested open-source projects and provides visibility into the large data infrastructures that exist at enterprises. Originally built to provide clarity into Hadoop environments, Atlas now provides the ability to identify, classify and manage data entities and their relationships across almost any ecosystem.

Key capabilities of Atlas include:

  • Metadata management, including pre-defined entities for Kafka, RDBMS, Hadoop and more
  • Data classification capabilities with the propagation of classification
  • Data Discovery and Lineage user interface
  • REST API and Kafka interfaces for integration

The following figure from the Atlas documentation shows the high-level architecture. There are two ways to integrate with Atlas: messaging and REST APIs. More on that later.

Figure 1 — Atlas High-Level Architecture — Overview (source: https://atlas.apache.org/#/Architecture)

Design Principles

As we designed how we would integrate Apache Atlas into our platform, we established some core design principles:

  • Automatically updated without human intervention — we needed Apache Atlas to be automatically updated and kept in sync with what was deployed to production. We did not want to burden anyone with manually updating the data. We already had manual spreadsheet-based processes, and we wanted to move away from them: they were error-prone, took a lot of effort to keep up to date, and made it slow to answer the questions we were being asked.
  • Near real-time updates to Atlas — we wanted Atlas to reflect what was running in near real-time. We did not want to rely on any batch processing. So, for example, when an AVRO schema gets updated in the Schema Registry, we wanted that update to flow into Atlas immediately.
  • Updates are not lost if the system is down — we needed to ensure that any updates to Atlas would not get lost. Atlas needed to be an accurate representation of production. If Atlas itself was unavailable for whatever reason, we needed updates to it to be held until it was available to process them.
  • Cover the entire end-to-end data supply chain — one of the key aspects of our platform was to democratize the data locked up in the many systems of record. Those systems of record were instrumented to emit events in real-time or leveraged change data capture. The data was ingested and transformed into an industry-aligned data model and projected into materialized views, etc. We needed Atlas to cover the entire data supply chain from source to sink.

High-Level Architecture

Integrating with Apache Atlas could be accomplished in two ways:

  1. Utilizing the Kafka Topics Atlas creates
  2. Using the Atlas REST API

In both cases, a custom service was needed to bridge the gap between the sources of metadata and Apache Atlas. In building this service, we interacted through Atlas API calls because they afforded more granular control when defining, and especially when updating, the relationships between entities in Atlas.

Figure 2 illustrates the high-level architecture of our solution.

Figure 2 — High-Level Architecture
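All of the integrations described in the following sections funnel through the Atlas Bridge Service, which performs entity create-or-update calls against the Atlas v2 REST API. Below is a minimal sketch of such a call in plain Java; the host, credentials and qualifiedName convention are placeholders and the attribute values are illustrative, but the payload follows Atlas's pre-defined kafka_topic type.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.Base64;

// Sketch only: upsert a kafka_topic entity through the Atlas v2 REST API.
// Host, credentials and qualifiedName are placeholders.
public class AtlasEntityUpsert {
    public static void main(String[] args) throws Exception {
        String atlasUrl = "http://atlas.example.com:21000";
        String auth = Base64.getEncoder().encodeToString("admin:admin".getBytes());

        String payload = """
            {
              "entities": [{
                "typeName": "kafka_topic",
                "attributes": {
                  "qualifiedName": "payments.transactions@prod",
                  "name": "payments.transactions",
                  "topic": "payments.transactions",
                  "uri": "payments.transactions"
                }
              }]
            }""";

        // POST /api/atlas/v2/entity/bulk creates the entity if it is new or
        // updates it if an entity with the same qualifiedName already exists.
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create(atlasUrl + "/api/atlas/v2/entity/bulk"))
                .header("Content-Type", "application/json")
                .header("Authorization", "Basic " + auth)
                .POST(HttpRequest.BodyPublishers.ofString(payload))
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode() + " " + response.body());
    }
}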

Integrating the Schema Registry

With Confluent Kafka as the basis for the event-driven environment, one point of integration had to be the Confluent Schema Registry, which maintains and manages all schemas used by streaming applications when producing to and consuming from Kafka topics. The Schema Registry uses the Kafka topic “_schemas” as its storage for every registered schema and subsequent update. A Kafka consumer on that topic therefore provides the most robust method of consuming this metadata: any error in the data causes the consumer to stop on that record until the investigation is complete, and no data is lost because Kafka retains it.
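As a rough illustration, a consumer tailing the “_schemas” topic could look like the sketch below. The broker address, consumer group and the bridge call are placeholders; the important detail is that offsets are committed only after an update has been handed off, so a failure leaves the record in place to be re-processed rather than lost.

import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

// Sketch only: tail the Schema Registry's "_schemas" topic and hand each
// registered schema (or update) to the Atlas Bridge Service.
public class SchemaRegistryTail {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka.example.com:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "atlas-bridge-schemas");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");
        // Commit manually so an Atlas outage leaves records to be re-processed.
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("_schemas"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    // Key is JSON such as {"keytype":"SCHEMA","subject":"payments-value","version":3,...}
                    registerWithAtlas(record.key(), record.value()); // hypothetical bridge call
                }
                consumer.commitSync();
            }
        }
    }

    private static void registerWithAtlas(String key, String value) {
        // Translate the schema JSON into Atlas entities (omitted in this sketch).
        System.out.println("schema update: " + key);
    }
}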

Integrating with Source Code artifacts

Due to the promotion process of Kafka topics through the different environments (DEV, TEST, PROD), topic definitions were kept in a source code repository and compared against the state of the production environment to identify the changes in a given release. These definitions, which included the topic name, size, retention policy and Access Control List (ACL), were all that was needed to create complete entities in Atlas. A POST webhook was therefore created in Git to call the custom-built Atlas Bridge Service each time a “release” branch was pushed to; the bridge service would then pull that branch and register the updates with Atlas.
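The receiving side of that webhook can be as small as the sketch below, which uses the JDK's built-in HTTP server. The port, path, the “release/” branch convention and the registration stub are placeholders standing in for our actual bridge service logic.

import com.sun.net.httpserver.HttpServer;
import java.io.InputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;

// Sketch only: webhook endpoint on the Atlas Bridge Service that Git calls
// whenever a branch is pushed; it reacts only to release branches.
public class TopicWebhookReceiver {
    public static void main(String[] args) throws Exception {
        HttpServer server = HttpServer.create(new InetSocketAddress(8085), 0);
        server.createContext("/webhooks/topics", exchange -> {
            try (InputStream body = exchange.getRequestBody()) {
                String payload = new String(body.readAllBytes(), StandardCharsets.UTF_8);
                // Git webhook payloads carry the pushed ref, e.g. "refs/heads/release/2022-06".
                if (payload.contains("refs/heads/release/")) {
                    registerTopicDefinitions();
                }
            }
            exchange.sendResponseHeaders(204, -1); // acknowledge with no body
            exchange.close();
        });
        server.start();
    }

    private static void registerTopicDefinitions() {
        // Pull the release branch, parse each topic's name, size, retention and
        // ACLs, and upsert the corresponding entities in Atlas (omitted here).
        System.out.println("registering topic definitions with Atlas");
    }
}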

Integrating application components

Schemas and Kafka topics are the foundations of the Event-Driven Book of Reference pattern. However, they are primarily static metadata that changes infrequently and can only express relationships to each other (a given schema will be related to one or more topics). Applications, including Kafka Streams clients and Kafka Connectors, change much more frequently.

Kafka Streams clients can report their topology: a directed acyclic graph (DAG) showing the flow of data, including the Kafka topics the client consumes from and produces to. A CRON job was therefore created to extract those topologies, which each client exposed through an API endpoint, and feed them into the Atlas Bridge Service to create an application entity for the client and its relationships to the Kafka topics.
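The sketch below shows how a Streams client can derive the topics it consumes from and produces to out of its TopologyDescription. The toy topology and the printed output are stand-ins for the real topologies and for the API endpoint that the CRON job scrapes.

import java.util.HashSet;
import java.util.Set;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.TopologyDescription;

// Sketch only: walk a Kafka Streams topology and collect its input and
// output topics, which the bridge service turns into Atlas relationships.
public class TopologyReport {
    public static void main(String[] args) {
        StreamsBuilder builder = new StreamsBuilder();
        builder.stream("source-topic").mapValues(v -> v).to("sink-topic"); // stand-in topology

        Set<String> inputs = new HashSet<>();
        Set<String> outputs = new HashSet<>();
        TopologyDescription description = builder.build().describe();
        for (TopologyDescription.Subtopology sub : description.subtopologies()) {
            for (TopologyDescription.Node node : sub.nodes()) {
                if (node instanceof TopologyDescription.Source source) {
                    inputs.addAll(source.topicSet());
                } else if (node instanceof TopologyDescription.Sink sink) {
                    outputs.add(sink.topic());
                }
            }
        }
        // In the platform, this summary is exposed on an API endpoint and
        // polled by the CRON job that feeds the Atlas Bridge Service.
        System.out.println("consumes: " + inputs + ", produces: " + outputs);
    }
}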

Kafka Connectors, an abstraction on top of Kafka producer/consumer clients, are an extremely efficient way to connect to external data repositories (IBM MQ, Oracle, MySQL, Postgres, etc.) and either consume from or produce to them. While the Kafka Connect application that manages all the connectors provides a REST API, the need to segment our connectors across two different data centers required the creation of a Kafka Connector Deployer application. As the name suggests, the Deployer application manages the deployment of Kafka Connectors, so a feature was added to have it call the Atlas Bridge Service to register each Connector right after deploying it.
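Conceptually, the Deployer's hook boils down to the sketch below: create the connector through the Kafka Connect REST API, then notify the Atlas Bridge Service so the Connector entity and its topic relationships get registered. The URLs, the connector configuration and the bridge endpoint are all placeholders.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Sketch only: deploy a connector, then tell the bridge service about it.
public class ConnectorDeployHook {
    public static void main(String[] args) throws Exception {
        HttpClient http = HttpClient.newHttpClient();
        String connectorConfig = """
            {"name": "payments-sink",
             "config": {"connector.class": "...", "topics": "payments.transactions"}}""";

        // 1. Deploy the connector through the Kafka Connect REST API.
        http.send(HttpRequest.newBuilder()
                .uri(URI.create("http://connect-dc1.example.com:8083/connectors"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(connectorConfig))
                .build(), HttpResponse.BodyHandlers.ofString());

        // 2. Notify the bridge service so it can register the connector in Atlas.
        http.send(HttpRequest.newBuilder()
                .uri(URI.create("http://atlas-bridge.example.com/connectors/register"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(connectorConfig))
                .build(), HttpResponse.BodyHandlers.ofString());
    }
}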

Data Discovery

The Atlas UI allows a user to find Topics, Schemas, Fields and Application components using full-text search, classification/entity filtering, or a combination of both.

Development teams can find the data they are looking for by searching on names from the source side, from the industry-aligned destination side, or anywhere in between.

Figure 3 illustrates a Kafka topic found through a full-text search and the different application components affected by the data within the topic.

Figure 3 — Atlas UI search capability
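The same discovery is also available programmatically. The sketch below runs a full-text search scoped to Kafka topics through Atlas's basic-search REST API; the host, credentials and search term are placeholders.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.Base64;

// Sketch only: full-text search for Kafka topics via the Atlas basic-search API.
public class AtlasSearch {
    public static void main(String[] args) throws Exception {
        String auth = Base64.getEncoder().encodeToString("admin:admin".getBytes());
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://atlas.example.com:21000/api/atlas/v2/search/basic"
                        + "?typeName=kafka_topic&query=payments"))
                .header("Authorization", "Basic " + auth)
                .GET()
                .build();
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        // The result lists matching entities and their guids, which can then be
        // fed into the lineage API for impact analysis.
        System.out.println(response.body());
    }
}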

End-to-End Data Lineage

Data Lineage is one of the most informative features of Apache Atlas; through the relationships that entities hold, Atlas generates a directed acyclic graph (DAG) that follows the path of data through Topics, Streaming Applications, Kafka Connectors, Event Stores and APIs, giving a clear picture of the impact of each of those components.

Field-level lineage was also integrated into Atlas through the spreadsheets that maintain that information. The spreadsheets were uploaded to a source code repository, where a POST webhook would trigger an application to pull the repo, convert the spreadsheets to JSON, and call the Atlas Bridge Service to import them into Atlas. While this is a simple solution that does not gather data directly from the applications, it gave us a view of field-level transformations that could otherwise only have been deciphered through countless hours of digging through code.

Figure 4 — End-to-end field-level data lineage
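For a sense of what the converted spreadsheets look like, the sketch below expresses one field mapping (source field to target field) as a Process-style entity whose inputs and outputs reference the two fields; lineage in Atlas is driven by exactly this inputs/outputs relationship. The "avro_field" type name and the qualifiedNames are illustrative of our conventions, not Atlas built-ins.

// Sketch only: the JSON a spreadsheet row could be converted to before the
// bridge service posts it to Atlas (POST /api/atlas/v2/entity). The type and
// qualifiedName conventions are illustrative.
public class FieldMappingPayload {
    public static void main(String[] args) {
        String processEntity = """
            {
              "entity": {
                "typeName": "Process",
                "attributes": {
                  "qualifiedName": "map:SOR1.CUST_NM->party.fullName@prod",
                  "name": "CUST_NM to party.fullName",
                  "inputs":  [{"typeName": "avro_field",
                               "uniqueAttributes": {"qualifiedName": "SOR1.CUST_NM@prod"}}],
                  "outputs": [{"typeName": "avro_field",
                               "uniqueAttributes": {"qualifiedName": "party.fullName@prod"}}]
                }
              }
            }""";
        System.out.println(processEntity);
    }
}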

Data Classification

Data Classification, an essential piece of understanding the data landscape, can be added manually through the Atlas UI. Atlas will leverage the relationships a given entity has and propagate the classification (if configured to do so) to downstream entities. In addition, the Atlas Bridge Service captures any classification an AVRO field may hold when processing AVRO Schemas and creates that classification when registering the field.

Figure 5 — Data Classification
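The sketch below shows the kind of call the bridge service could make to attach such a classification to an entity it has just registered. The guid, host and the "PII" classification name are placeholders; setting propagate to true lets Atlas carry the tag to downstream entities along the lineage graph.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.Base64;

// Sketch only: attach a propagating classification to an existing entity.
public class AttachClassification {
    public static void main(String[] args) throws Exception {
        String atlasUrl = "http://atlas.example.com:21000";
        String entityGuid = "a1b2c3d4-0000-0000-0000-000000000000"; // resolved via search
        String auth = Base64.getEncoder().encodeToString("admin:admin".getBytes());

        String classifications = """
            [{"typeName": "PII", "propagate": true}]""";

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create(atlasUrl + "/api/atlas/v2/entity/guid/" + entityGuid + "/classifications"))
                .header("Content-Type", "application/json")
                .header("Authorization", "Basic " + auth)
                .POST(HttpRequest.BodyPublishers.ofString(classifications))
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode());
    }
}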

This classification capability enabled us to answer the security team's questions about what PII we had in the platform much more quickly.

Some Closing Remarks

Making a platform easy to use (and build on) and self-serve drives adoption. You want all development teams to use the platform. The platform accelerates their time to market, getting their vertical slices of business capability into the hands of their clients sooner. The more teams use the platform, the better it becomes: assets are reused, hardened and made more resilient, which is great for everybody on the platform, and the benefits of a foundational technology platform are realized. We have started to see these benefits as we promote platform-thinking habits.

As our Apache Atlas implementation gets used more and gets used in ways we had not foreseen, we continue to improve our integration. We are also looking at possibly overlaying the Atlas visualization with real-time metrics from the running platform.

About a year ago, we shared our early work on this with the Confluent team. It is very encouraging to see Stream Governance/Lineage added to the Confluent Cloud offering. It appears that they are also using Apache Atlas.

We hope you found this article a worthwhile read. We are always learning and evolving our platform. If you have your own experiences integrating a metadata management solution such as Apache Atlas into a platform, please share them in the comments.

References

[1] S. Daya and M. Maalem, “An Event-Driven Book of Reference to facilitate Modernization”, Medium, 2022. [Online]. Available: https://shahirdaya.medium.com/an-event-driven-book-of-reference-to-facilitate-modernization-9fc7256722a5. [Accessed: 20-Jun-2022].

[2] “Apache Atlas — Data Governance and Metadata framework for Hadoop”, Atlas.apache.org, 2022. [Online]. Available: https://atlas.apache.org/#/. [Accessed: 20-Jun-2022].

[3] “What Is Apache Atlas: How It Works, and Where It Is Used”, Atlan, 2022. [Online]. Available: https://atlan.com/what-is-apache-atlas/. [Accessed: 20-Jun-2022].

[4] “Stream Governance | Confluent Documentation”, Docs.confluent.io, 2022. [Online]. Available: https://docs.confluent.io/cloud/current/stream-governance/index.html#stream-governance-overview. [Accessed: 21-Jun-2022].

[5] R. Murray, “The Art of Platform Thinking”, ThoughtWorks, 2022. [Online]. Available: https://www.thoughtworks.com/en-us/insights/blog/art-platform-thinking. [Accessed: 23-Jun-2022].

