We're excited to announce the preview of API-driven, OpenLineage-compatible data lineage in Amazon DataZone, which helps you capture, store, and visualize the lineage of data movement and transformations of data assets in Amazon DataZone.
With the Amazon DataZone OpenLineage-compatible API, domain administrators and data producers can capture and store lineage events beyond what is available in Amazon DataZone, including transformations in Amazon Simple Storage Service (Amazon S3), AWS Glue, and other AWS services. This provides a comprehensive view for data consumers browsing in Amazon DataZone, who can gain confidence in an asset's origin, and for data producers, who can assess the impact of changes to an asset by understanding its usage.
In this post, we discuss the latest features of data lineage in Amazon DataZone, its compatibility with OpenLineage, and how to get started capturing lineage from other services such as AWS Glue, Amazon Redshift, and Amazon Managed Workflows for Apache Airflow (Amazon MWAA) into Amazon DataZone through the API.
Why data lineage matters
Data lineage gives you an overarching view into data assets, allowing you to see the origin of objects and their chain of connections. Data lineage enables tracking the movement of data over time, providing a clear understanding of where the data originated, how it has changed, and its final destination within the data pipeline. With transparency around data origination, data consumers gain trust that the data is correct for their use case. Data lineage information is captured at the level of tables, columns, and jobs, allowing you to conduct impact analysis and respond to data issues because, for example, you can see how one field affects downstream sources. This equips you to make well-informed decisions before committing changes and to avoid unwanted changes downstream.
Data lineage in Amazon DataZone is an API-driven, OpenLineage-compatible feature that helps you capture and visualize lineage events from OpenLineage-enabled systems or through the API, to trace data origins, track transformations, and view cross-organizational data consumption. The visualized lineage includes activities inside the Amazon DataZone business data catalog. Lineage captures the assets cataloged, the subscribers to those assets, and the activities that happen outside the business data catalog, which are captured programmatically using the API.
Additionally, Amazon DataZone versions lineage with each event, enabling you to visualize lineage at any point in time or compare transformations across an asset's or job's history. This historical lineage provides a deeper understanding of how data has evolved, which is essential for troubleshooting, auditing, and ensuring the integrity of data assets.
The following screenshot shows an example lineage graph visualized with the Amazon DataZone data catalog.
Introduction to OpenLineage-compatible data lineage
The need to capture data lineage consistently across various analytical services and to combine it into a unified object model is key to uncovering insights from lineage artifacts. OpenLineage is an open source project that offers a framework to collect and analyze lineage. It also offers a reference implementation of an object model to persist metadata, along with integrations to major data and analytics tools.
The following are key concepts in OpenLineage:
- Lineage events – OpenLineage captures lineage information through a series of events. An event is anything that represents a specific operation performed on the data in a data pipeline, such as data ingestion, transformation, or data consumption.
- Lineage entities – Entities in OpenLineage represent the various data objects involved in the lineage process, such as datasets and tables.
- Lineage runs – A lineage run represents a specific run of a data pipeline or a job, encompassing multiple lineage events and entities.
- Lineage form types – Form types, or facets, provide additional metadata or context about lineage entities or events, enabling richer and more descriptive lineage information. OpenLineage offers facets for runs, jobs, and datasets, with the option to build custom facets.
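To make these concepts concrete, the following is a minimal sketch of an OpenLineage run event expressed as a Python dictionary; the job, run, and dataset names are illustrative placeholders, not values from this post.

```python
from datetime import datetime, timezone
import uuid

# A minimal OpenLineage RunEvent: a COMPLETE event for a job run that reads
# one dataset and writes another. Facets could be attached under "facets"
# keys on the run, job, or datasets. All names below are illustrative.
run_event = {
    "eventType": "COMPLETE",  # other event types include START, FAIL, and ABORT
    "eventTime": datetime.now(timezone.utc).isoformat(),
    "run": {"runId": str(uuid.uuid4())},
    "job": {"namespace": "my-pipeline", "name": "daily_inventory_load"},
    "inputs": [{"namespace": "s3://my-bucket", "name": "raw/inventory"}],
    "outputs": [{"namespace": "awsdatacatalog", "name": "awesome_retail_db.inventory"}],
    "producer": "https://github.com/OpenLineage/OpenLineage",
    "schemaURL": "https://openlineage.io/spec/1-0-5/OpenLineage.json#/definitions/RunEvent",
}
```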
The Amazon DataZone data lineage API is OpenLineage compatible and extends OpenLineage's functionality by providing a materialization endpoint to persist the lineage outputs in an extensible object model. OpenLineage offers integrations for certain sources, and integrating those sources with Amazon DataZone is straightforward because the Amazon DataZone data lineage API understands the format and translates it to the lineage data model.
The following diagram illustrates an example of the Amazon DataZone lineage data model.
In Amazon DataZone, every lineage node represents an underlying resource; there is a 1:1 mapping of a lineage node to a logical or physical resource such as a table, view, or asset. The nodes represent a specific job with a specific run, a node for a table or asset, and a node for a subscription target.
Each version of a node captures what happened to the underlying resource at that specific timestamp. In Amazon DataZone, lineage not only tells the story of data movement outside Amazon DataZone, but also represents the lineage of activities within Amazon DataZone, such as asset creation, curation, publishing, and subscription.
To hydrate the lineage model in Amazon DataZone, two types of lineage are captured:
- Lineage activities within Amazon DataZone – This includes assets added to the catalog and published, after which details about the subscriptions are captured automatically. When you're in the producer project context (for example, if the project you have selected is the owning project of the asset you're browsing and you're a member of that project), you will see two states of the dataset node:
- The inventory asset type node defines the asset in the catalog that is in an unpublished stage. Other users can't subscribe to the inventory asset. To learn more, refer to Creating inventory and published data in Amazon DataZone.
- The published asset type represents the actual asset that is discoverable by data consumers across the organization. This is the asset type that can be subscribed to by other project members. If you are a consumer and not part of the producing project of that asset, you will only see the published asset node.
- Lineage activities outside of Amazon DataZone – These can be captured programmatically using the PostLineageEvent API. With these events captured either upstream or downstream of cataloged assets, data producers and consumers get a comprehensive view of data movement to examine the origin of data or its consumption. We discuss how to use the API to capture lineage events later in this post; a minimal sketch follows this list.
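As a minimal sketch (assuming a recent boto3 release that includes the DataZone PostLineageEvent API; the domain ID is a placeholder), publishing an OpenLineage-formatted event such as the one shown earlier could look like the following:

```python
import json

import boto3

def publish_lineage_event(domain_id: str, run_event: dict) -> None:
    """Post an OpenLineage RunEvent dict (like the earlier sketch) to Amazon DataZone."""
    datazone = boto3.client("datazone")  # assumes credentials and Region are configured
    datazone.post_lineage_event(
        domainIdentifier=domain_id,                   # for example, "dzd_xxxxxxxxxxxx"
        event=json.dumps(run_event).encode("utf-8"),  # the event is sent as a JSON blob
    )
```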
There are two different types of lineage nodes available in Amazon DataZone:
- Dataset node – In Amazon DataZone, lineage visualizes nodes that represent tables and views. Depending on the project context, producers are able to view both the inventory and published asset, whereas consumers can only view the published asset. When you first open the Lineage tab on the asset details page, the cataloged dataset node is the starting point for traversing the lineage graph upstream or downstream. Dataset nodes include automated lineage nodes from Amazon DataZone and custom lineage nodes:
- Automated dataset nodes – These nodes include information about AWS Glue or Amazon Redshift assets published in the Amazon DataZone catalog. They're automatically generated and include a corresponding AWS Glue or Amazon Redshift icon within the node.
- Custom dataset nodes – These nodes include information about assets that are not published in the Amazon DataZone catalog. They're created manually by domain administrators (producers) and are represented by a default custom asset icon within the node. These are essentially custom lineage nodes created using the OpenLineage event format.
- Job (run) node – This node captures the details of the job, which represents the latest run of a particular job and its run details. This node also captures multiple runs of the job, which can be viewed on the History tab of the node details. Node details become visible when you choose the icon.
Visualizing lineage in Amazon DataZone
Amazon DataZone offers a comprehensive experience for data producers and consumers. The asset details page provides a graphical representation of lineage, making it easy to visualize data relationships upstream or downstream. The asset details page provides the following capabilities to navigate the graph:
- Column-level lineage – You can expand column-level lineage when it is available in dataset nodes. This automatically shows relationships with upstream or downstream dataset nodes if source column information is available.
- Column search – If a dataset has more than 10 columns, the node presents pagination to navigate to columns not initially presented. To quickly view a particular column, you can search on the dataset node, which then lists just the searched column.
- View dataset nodes only – If you want to filter out the job nodes, you can choose the Open view control icon in the graph viewer and toggle the Display dataset nodes only option. This removes all the job nodes from the graph and lets you navigate just the dataset nodes.
- Details pane – Each lineage node captures and displays the following details:
- Each dataset node has three tabs: Lineage info, Schema, and History. The History tab lists the different versions of the lineage event captured for that node.
- The job node has a details pane to display job details with the tabs Job info and History. The details pane also captures queries or expressions run as part of the job.
- Version tabs – All lineage nodes in Amazon DataZone data lineage have versioning, captured as history, based on the lineage events captured. You can view lineage at a particular timestamp, which opens a new tab on the lineage page to help compare or contrast between timestamps.
The following screenshot shows an example of data lineage visualization.
You can experience the visualization with sample data by choosing Preview on the Lineage tab and choosing the Try sample lineage link. This opens a new browser tab with sample data to experiment with and learn about the feature, with or without a guided tour, as shown in the following screenshot.
Solution overview
Now that we understand the capabilities of the new data lineage feature in Amazon DataZone, let's explore how you can get started with capturing lineage from AWS Glue tables and ETL (extract, transform, and load) jobs, Amazon Redshift, and Amazon MWAA.
The getting started scripts are also available in Amazon DataZone's new GitHub repository.
Prerequisites
For this walkthrough, you should have the following prerequisites:
If the AWS account you use to follow this post uses AWS Lake Formation to manage permissions on the AWS Glue Data Catalog, make sure that you log in as a user with access to create databases and tables. For more information, refer to Implicit Lake Formation permissions.
Launch the CloudFormation stack
To create your resources for this use case using AWS CloudFormation, complete the following steps:
- Launch the CloudFormation stack in us-east-1.
- For Stack name, enter a name for your stack.
- Choose Next.
- Select I acknowledge that AWS CloudFormation might create IAM resources with custom names.
- Choose Create stack.
Wait for the stack to finish provisioning the resources. When you see the CREATE_COMPLETE status, you can proceed to the next steps.
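If you prefer to wait from a terminal, the AWS CLI can block until the stack is ready; replace the stack name with the one you chose:

```bash
aws cloudformation wait stack-create-complete --stack-name <your-stack-name>
```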
Capture lineage from AWS Glue tables
For this example, we use CloudShell, which is a browser-based shell, to run the commands necessary to harvest lineage metadata from AWS Glue tables. Complete the following steps:
- On the AWS Glue console, choose Crawlers in the navigation pane.
- Select the AWSomeRetailCrawler crawler created by the CloudFormation template.
- Choose Run.
When the crawler is complete, you'll see a Succeeded status.
Now let's harvest the lineage metadata using CloudShell.
- Download the extract_glue_crawler_lineage.py file.
- On the Amazon DataZone console, open CloudShell.
- On the Actions menu, choose Upload file.
- Upload the extract_glue_crawler_lineage.py file.
- Run the following commands:
You should get the following results.
- After all the libraries and dependencies are configured, run the following command to harvest the lineage metadata from the inventory table:
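A hypothetical invocation (the original post shows the exact arguments):

```bash
python3 extract_glue_crawler_lineage.py
```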
- The script asks for verification of the settings provided; enter Yes.
You should receive a notification indicating that the script ran successfully.
After you capture the lineage information from the Inventory table, complete the following steps to run the data source.
- On the Amazon DataZone data portal, open the Sales project.
- On the Data tab, choose Data sources in the navigation pane.
- Select your data source job and choose Run.
For this example, we had a data source job called SalesDLDataSourceV2 already created, pointing to the awesome_retail_db database. To learn more about how to create data source jobs, refer to Create and run an Amazon DataZone data source for the AWS Glue Data Catalog.
After the job runs successfully, you should see a confirmation message.
Now let's view the lineage diagram generated by Amazon DataZone.
- On the Data inventory tab, choose the Inventory table.
- On the Inventory asset page, choose the new Lineage tab.
On the Lineage tab, you can see that Amazon DataZone created three nodes:
- Job / Job run – Based on the AWS Glue crawler used to harvest the asset's technical metadata
- Dataset – Based on the S3 object that contains the data related to this asset
- Table – The AWS Glue table created by the crawler
If you choose the Dataset node, Amazon DataZone provides information about the S3 object used to create the asset.
Capture data lineage for AWS Glue ETL jobs
In the previous section, we covered how to generate a data lineage diagram on top of a data asset. Now let's see how we can create one for an AWS Glue job.
The CloudFormation template that we launched earlier created an AWS Glue job called Inventory_Insights. This job gets data from the Inventory table and creates a new table called Inventory_Insights with the aggregated data of all the products available in all the stores.
The CloudFormation template also copied the openlineage-spark_2.12-1.9.1.jar file to the S3 bucket created for this post. This file is necessary to generate lineage metadata from the AWS Glue job. We use version 1.9.1, which is compatible with AWS Glue 3.0, the version used to create the AWS Glue job for this post. If you're using a different version of AWS Glue, you need to download the corresponding OpenLineage Spark plugin file that matches your AWS Glue version.
The OpenLineage Spark plugin is not able to extract data lineage from AWS Glue Spark jobs that use AWS Glue DynamicFrames. Use Spark SQL DataFrames instead, as illustrated in the following sketch.
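For example, a job script that reads and writes through the Spark DataFrame API stays visible to the OpenLineage listener. This is an illustrative sketch only; the S3 paths and column name are placeholders, not the job the CloudFormation template created:

```python
from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())
spark = glue_context.spark_session

# Read and write through Spark SQL DataFrames so the OpenLineage Spark
# listener can observe the job's inputs and outputs. A DynamicFrame read
# (glue_context.create_dynamic_frame.from_catalog) would not be captured.
inventory = spark.read.parquet("s3://<your-bucket>/inventory/")  # placeholder path
insights = inventory.groupBy("product_id").count()               # illustrative aggregation
insights.write.mode("overwrite").parquet("s3://<your-bucket>/inventory_insights/")
```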
- Download the extract_glue_spark_lineage.py file.
- On the Amazon DataZone console, open CloudShell.
- On the Actions menu, choose Upload file.
- Upload the extract_glue_spark_lineage.py file.
- On the CloudShell console, run the following command (if your CloudShell session expired, you can open a new session):
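A hypothetical invocation (the original post shows the exact command):

```bash
python3 extract_glue_spark_lineage.py
```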
- Confirm the information shown by the script by entering yes.
You will see the following message; this means that the script is ready to capture the AWS Glue job lineage metadata after you run the job.
Now let's run the AWS Glue job created by the CloudFormation template.
- On the AWS Glue console, choose ETL jobs in the navigation pane.
- Select the Inventory_Insights job and choose Run job.
On the Job details tab, you'll find that the job has the following configuration:
- Key --conf with value spark.extraListeners=io.openlineage.spark.agent.OpenLineageSparkListener --conf spark.openlineage.transport.type=console --conf spark.openlineage.facets.custom_environment_variables=[AWS_DEFAULT_REGION;GLUE_VERSION;GLUE_COMMAND_CRITERIA;GLUE_PYTHON_VERSION;]
- Key --user-jars-first with value true
- Dependent JARs path set as the S3 path s3://{your bucket}/lib/openlineage-spark_2.12-1.9.1.jar
- The AWS Glue version set as 3.0
During the run of the job, you will see the following output on the CloudShell console.
This means that the script has successfully harvested the lineage metadata from the AWS Glue job.
Now let's create an AWS Glue table based on the data created by the AWS Glue job. For this example, we use an AWS Glue crawler.
- On the AWS Glue console, choose Crawlers in the navigation pane.
- Select the AWSomeRetailCrawler crawler created by the CloudFormation template and choose Run.
When the crawler is complete, you will see the following message.
Now let's open the Amazon DataZone portal to see how the diagram is represented in Amazon DataZone.
- On the Amazon DataZone portal, choose the Sales project.
- On the Data tab, choose Inventory data in the navigation pane.
- Choose the inventory insights asset.
On the Lineage tab, you can see the diagram created by Amazon DataZone. It shows three nodes:
- The AWS Glue crawler used to create the AWS Glue table
- The AWS Glue table created by the crawler
- The Amazon DataZone cataloged asset
- To see the lineage information about the AWS Glue job that you ran to create the inventory_insights table, choose the arrows icon on the left side of the diagram.
Now you can see the full lineage diagram for the Inventory_insights table.
- Choose the blue arrow icon in the inventory node on the left of the diagram.
You can see the evolution of the columns and the transformations they went through.
When you choose any of the nodes that are part of the diagram, you can see more details. For example, the inventory_insights node shows the following information.
Capture lineage from Amazon Redshift
Let's explore how to generate a lineage diagram from Amazon Redshift. In this example, we use AWS Cloud9 because it lets us configure the connection to the virtual private cloud (VPC) where our Redshift cluster resides. For more information about AWS Cloud9, refer to the AWS Cloud9 User Guide.
The CloudFormation template included as part of this post doesn't cover the creation of a Redshift cluster or the creation of the tables used in this section. To learn more about how to create a Redshift cluster, see Step 1: Create a sample Amazon Redshift cluster. We use the following query to create the tables needed for this section of the post:
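The exact DDL appears in the original post's code block; a representative, hypothetical sketch of the kind of table used in this section (the total_sales table visualized later) might be:

```sql
-- Hypothetical DDL; the original post shows the exact statements.
CREATE TABLE total_sales (
    product_id   INTEGER,
    store_id     INTEGER,
    total_amount DECIMAL(12, 2),
    sale_date    DATE
);
```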
Remember to add the IP address of your AWS Cloud9 environment to the security group with access to the Redshift cluster.
- Download the requirements.txt and extract_redshift_lineage.py files.
- On the File menu, choose Upload Local Files.
- Upload the requirements.txt and extract_redshift_lineage.py files.
- Run the following commands:
You should be able to see the following messages.
- To set the AWS credentials, run the following command:
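A hypothetical example using temporary credentials (replace the placeholders with your own values):

```bash
export AWS_ACCESS_KEY_ID=<your-access-key-id>
export AWS_SECRET_ACCESS_KEY=<your-secret-access-key>
export AWS_SESSION_TOKEN=<your-session-token>
export AWS_DEFAULT_REGION=us-east-1
```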
- Run the extract_redshift_lineage.py script to harvest the metadata necessary to generate the lineage diagram; you'll be prompted to enter the user name and password for the connection to your Amazon Redshift database:
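A hypothetical invocation (the original post shows the exact command and arguments):

```bash
python3 extract_redshift_lineage.py
```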
- When you receive a confirmation message, enter yes.
If the configuration was done correctly, you will see the following confirmation message.
Now let's see how the diagram was created in Amazon DataZone.
- On the Amazon DataZone data portal, open the Sales project.
- On the Data tab, choose Data sources.
- Run the data source job.
For this post, we already created a data source job called Sales_DW_Enviroment-default-datasource to add the Redshift data source to our Amazon DataZone project. To learn how to create a data source job, refer to Create and run an Amazon DataZone data source for Amazon Redshift.
After you run the job, you'll see the following confirmation message.
- On the Data tab, choose Inventory data in the navigation pane.
- Choose the total_sales asset.
- Choose the Lineage tab.
Amazon DataZone creates a three-node lineage diagram for the total sales table; you can choose any node to view its details.
- Choose the arrows icon next to the Job / Job run node to view a more complete lineage diagram.
- Choose the Job / Job run node.
The Job info section shows the query that was used to create the total sales table.
Capture lineage from Amazon MWAA
Apache Airflow is an open source platform for developing, scheduling, and monitoring batch-oriented workflows. Amazon MWAA is a managed service for Airflow that lets you use your current Airflow platform to orchestrate your workflows. OpenLineage supports integration with Airflow 2.6.3 using the openlineage-airflow package, which can be enabled on Amazon MWAA as a plugin. Once enabled, the plugin converts Airflow metadata to OpenLineage events, which can be consumed by the Amazon DataZone PostLineageEvent API.
The following diagram shows the setup required in Amazon MWAA to capture data lineage using OpenLineage and publish it to Amazon DataZone.
The workflow uses an Amazon MWAA DAG to invoke a data pipeline. The process is as follows:
- The openlineage-airflow plugin is configured on Amazon MWAA as a lineage backend. Metadata about the DAG run is passed to the plugin, which converts it into OpenLineage format.
- The collected lineage information is written to the Amazon CloudWatch log group for the Amazon MWAA environment.
- A helper function captures the lineage information from the log file and publishes it to Amazon DataZone using the PostLineageEvent API, as sketched after this list.
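A minimal sketch of such a helper follows, assuming boto3 and that each matching log message embeds the OpenLineage event as a JSON payload; the log group name and domain ID are placeholders:

```python
import boto3

LOG_GROUP = "airflow-<env_name>-Task"  # MWAA task log group name (placeholder)
DOMAIN_ID = "dzd_xxxxxxxxxxxx"         # your Amazon DataZone domain ID (placeholder)

logs = boto3.client("logs")
datazone = boto3.client("datazone")

# Find task log events emitted by the OpenLineage ConsoleTransport (console.py).
paginator = logs.get_paginator("filter_log_events")
for page in paginator.paginate(logGroupName=LOG_GROUP, filterPattern='"console.py"'):
    for log_event in page["events"]:
        message = log_event["message"]
        if "{" not in message:
            continue
        # Assume the OpenLineage event is the JSON object embedded in the message.
        payload = message[message.index("{"):]
        datazone.post_lineage_event(
            domainIdentifier=DOMAIN_ID,
            event=payload.encode("utf-8"),
        )
```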
The example used in this post uses Amazon MWAA version 2.6.3 and OpenLineage plugin version 1.4.1. For other Airflow versions supported by OpenLineage, refer to Supported Airflow versions.
Configure the OpenLineage plugin on Amazon MWAA to capture lineage
When harvesting lineage using OpenLineage, a Transport configuration needs to be set up, which tells OpenLineage where to emit the events, for example the console or an HTTP endpoint. You can use ConsoleTransport, which logs the OpenLineage events in the Amazon MWAA task CloudWatch log group; these can then be published to Amazon DataZone using a helper function.
Specify the following in the requirements.txt file added to the S3 bucket configured for Amazon MWAA:
openlineage-airflow==1.4.1
In the Airflow logging configuration section under the MWAA configuration for the Airflow environment, enable Airflow task logs at log level INFO. The following screenshot shows a sample configuration.
A successful configuration adds a plugin to Airflow, which can be verified from the Airflow UI by choosing Plugins on the Admin menu.
In this post, we use a sample DAG to hydrate data to Redshift tables. The following screenshot shows the DAG in graph view.
Run the DAG, and upon successful completion of a run, open the Amazon MWAA task CloudWatch log group for your Airflow environment (airflow-env_name-task) and filter based on the expression console.py to select events emitted by OpenLineage. The following screenshot shows the results.
Publish lineage to Amazon DataZone
Now that you have the lineage events emitted to CloudWatch, the next step is to publish them to Amazon DataZone to associate them with a data asset and visualize them in the business data catalog.
- Download the files requirements.txt and airflow_cw_parse_log.py, and gather environment details such as the AWS Region, Amazon MWAA environment name, and Amazon DataZone domain ID.
- The Amazon MWAA environment name can be obtained from the Amazon MWAA console.
- The Amazon DataZone domain ID can be obtained from the Amazon DataZone service console or from the Amazon DataZone portal.
- Navigate to CloudShell and choose Upload files on the Actions menu to upload the files requirements.txt and extract_airflow_lineage.py.
- After the files are uploaded, run the following script to filter lineage events from the Airflow task logs and publish them to Amazon DataZone:
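A hypothetical invocation; the argument names below are placeholders for illustration, not the script's actual flags:

```bash
pip3 install -r requirements.txt
python3 extract_airflow_lineage.py \
  --region us-east-1 \
  --mwaa-env <your-mwaa-environment-name> \
  --domain-id <your-datazone-domain-id>
```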
The extract_airflow_lineage.py script filters the lineage events from the Amazon MWAA task log group and publishes the lineage to the specified domain within Amazon DataZone.
Visualize lineage in Amazon DataZone
After the lineage is published to Amazon DataZone, open your DataZone project, navigate to the Data tab, and choose a data asset that was accessed by the Amazon MWAA DAG. In this case, it is a subscribed asset.
Navigate to the Lineage tab to visualize the lineage published to Amazon DataZone.
Choose a node to look at additional lineage metadata. In the following screenshot, we can observe that the producer of the lineage has been marked as airflow.
Conclusion
In this post, we shared the preview feature of data lineage in Amazon DataZone, how it works, and how you can capture lineage events from AWS Glue, Amazon Redshift, and Amazon MWAA to be visualized as part of the asset browsing experience.
To learn more about Amazon DataZone and how to get started, refer to the Getting started guide. Check out the YouTube playlist for some of the latest demos of Amazon DataZone and short descriptions of the capabilities available.
About the Authors
Leonardo Gomez is a Principal Analytics Specialist at AWS, with over a decade of experience in data management. Specializing in data governance, he helps customers worldwide maximize their data's potential while promoting data democratization. Connect with him on LinkedIn.
Priya Tiruthani is a Senior Technical Product Manager with Amazon DataZone at AWS. She focuses on improving data discovery and curation required for data analytics. She is passionate about building innovative products to simplify customers' end-to-end data journey, especially around data governance and analytics. Outside of work, she enjoys being outdoors to hike, capture nature's beauty, and recently play pickleball.
Ron Kyker is a Principal Engineer with Amazon DataZone at AWS, where he helps drive innovation, solve complex problems, and set the bar for engineering excellence for his team. Outside of work, he enjoys board gaming with friends and family, movies, and wine tasting.
Srinivasan Kuppusamy is a Senior Cloud Architect – Data at AWS ProServe, where he helps customers solve their business problems using the power of AWS Cloud technology. His areas of interest are data and analytics, data governance, and AI/ML.