Spark lineage
Atlas collects metadata from Spark to represent the lineage among data assets.
The Atlas lineage graph shows the input and output processes that the
current entity participated in, specifically those relationships modeled
as “inputToProcesses” and “outputFromProcesses.” Entities are included if
they were inputs to processes that lead to the current entity or they are
output from processes for which the current entity was an input. In the
context of Spark, a Spark job is modeled as a
spark_application
entity. Each application entity
includes relationships to one or more processes that were executed in the
job. The spark_process
entities are automatically named
"execution-n" where n is an
integer incremented sequentially.
It is possible to have two spark process entities with the same name in a lineage graph; be sure to check the qualified name to make sure you are looking at the appropriate process.