Spark entities created in Apache Atlas

Each Spark entity in Atlas includes detailed metadata collected from Spark.

The following diagram summarizes the entities created in Atlas for Spark operations. The data assets that Spark operations act on are collected through HMS (Hive Metastore). The supertypes that contribute attributes to the entity types are shaded.

Figure 1. Atlas Entity Types for Spark Data Sets
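
The supertypes shown in the diagram, and the attributes they contribute, can also be read directly from the Atlas type system. The following sketch retrieves the spark_application type definition through the Atlas REST API; the host, port, and credentials are placeholders for your environment.

import requests

ATLAS = "http://atlas-host:21000"   # assumption: Atlas server URL for your cluster
AUTH = ("admin", "admin")           # assumption: basic-auth credentials

# Fetch the spark_application type definition; the response lists the type's
# superTypes and the attribute definitions they contribute.
resp = requests.get(
    f"{ATLAS}/api/atlas/v2/types/typedef/name/spark_application",
    auth=AUTH,
)
resp.raise_for_status()
typedef = resp.json()

print("superTypes:", typedef.get("superTypes", []))
print("attributes:", [a["name"] for a in typedef.get("attributeDefs", [])])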


The metadata collected for each entity type is as follows:

Spark Application

Identifier Example content
typeName spark_application
guid System generated ID. This value is used to identify the entity in the Atlas Dashboard URL.
qualifiedName <Spark application ID>
name Spark Job + <Spark application ID>
description Metadata from Spark. Reserved for future use.
displayName Reserved for future use.
owner Metadata from Spark. Reserved for future use.
currentUser Metadata from Spark. In a Kerberized environment, this value contains the principal name.
remoteUser Metadata from Spark. In a Kerberized environment, this value contains the principal name.
userDescription Metadata from Spark. Reserved for future use.
replicatedFrom Reserved for future use.
replicatedTo Reserved for future use.
Relationship: inputs Reserved for future use.
Relationship: outputs Reserved for future use.
Relationship: processes List of Spark process entities created as part of the processing accomplished in this Spark job.
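
Because the qualifiedName of a Spark application is the Spark application ID, a single application entity can be retrieved directly by that ID. The following sketch uses the Atlas REST API to look up a spark_application entity and list its related processes; the host, credentials, and application ID are placeholder values.

import requests

ATLAS = "http://atlas-host:21000"             # assumption: Atlas server URL
AUTH = ("admin", "admin")                     # assumption: basic-auth credentials
APP_ID = "application_1620000000000_0001"     # hypothetical Spark application ID

# Look up the spark_application entity by its unique qualifiedName.
resp = requests.get(
    f"{ATLAS}/api/atlas/v2/entity/uniqueAttribute/type/spark_application",
    params={"attr:qualifiedName": APP_ID},
    auth=AUTH,
)
resp.raise_for_status()
entity = resp.json()["entity"]

print(entity["attributes"]["name"])
# The processes relationship lists the spark_process entities created by this job.
for proc in entity.get("relationshipAttributes", {}).get("processes", []):
    print(proc["guid"], proc.get("displayText"))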

Spark Process

Identifier Example content
typeName spark_process
guid System generated ID. This value is used to identify the entity in the Atlas Dashboard URL.
qualifiedName application-ID-execution-n, where n is a sequential integer assigned by the Spark engine and application-ID is the application ID of the parent Spark job.
name execution-n, where n is a sequential integer assigned by the Spark engine. The number is unique only within the job, so it is possible to have Spark processes with duplicate names in Atlas.
description Metadata from Spark. Reserved for future use.
owner Metadata from Spark. Reserved for future use.
displayName Reserved for future use.
executionId Metadata from Spark.
inputs List of the input tables or views, including each entity’s type name and the qualified name.
outputs List of the output objects, including each entity’s type name and the qualified name.
queryText Metadata from Spark. Reserved for future use.
currUser Metadata from Spark. In a Kerberized environment, this value contains the principal name.
remoteUser Metadata from Spark. In a Kerberized environment, this value contains the principal name.
executionTime Metadata from Spark.
details Query plan text, including parsed logical plan, analyzed logical plan, optimized logical plan, and physical plan.
sparkPlanDescription Physical plan text.
replicatedFrom Reserved for future use.
replicatedTo Reserved for future use.
userDescription Metadata from Spark. Reserved for future use.
Relationship: inputs List of the input tables or views, including each entity’s type name and the qualified name.
Relationship: outputs List of the output objects, including each entity’s type name and the qualified name.
Relationship: application The Spark application entity that describes the Spark job in which this process was created.
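
Once the GUID of a spark_process entity is known (for example, from the Atlas Dashboard URL), its inputs and outputs can also be traversed through the Atlas lineage API. The following sketch is a minimal example; the host, credentials, and GUID are placeholders.

import requests

ATLAS = "http://atlas-host:21000"                      # assumption: Atlas server URL
AUTH = ("admin", "admin")                              # assumption: basic-auth credentials
PROCESS_GUID = "00000000-0000-0000-0000-000000000000"  # hypothetical spark_process GUID

# direction=BOTH returns the upstream inputs and downstream outputs of the process.
resp = requests.get(
    f"{ATLAS}/api/atlas/v2/lineage/{PROCESS_GUID}",
    params={"depth": 3, "direction": "BOTH"},
    auth=AUTH,
)
resp.raise_for_status()
lineage = resp.json()

for guid, header in lineage.get("guidEntityMap", {}).items():
    print(header["typeName"], header.get("attributes", {}).get("qualifiedName"))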

Spark Column Lineage

Identifier Example content
typeName spark_column_lineage
guid System generated ID. This value is used to identify the entity in the Atlas Dashboard URL.
qualifiedName <output_table_name>-<timestamp>-<output_column_name>
name Same as qualifiedName.
Relationship: process The spark_process entity that produced this lineage, connected through the spark_process_column_lineages relationship.
Relationship: inputs List of the input columns, including each entity’s type name and the qualified name.
Relationship: outputs Output column, including each entity type name and the qualified name.
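
To confirm that column-level lineage was captured for a run, the spark_column_lineage entities can be listed with an Atlas basic search. The following sketch is illustrative only; the host and credentials are placeholders.

import requests

ATLAS = "http://atlas-host:21000"   # assumption: Atlas server URL
AUTH = ("admin", "admin")           # assumption: basic-auth credentials

# Basic search for entities of type spark_column_lineage.
query = {
    "typeName": "spark_column_lineage",
    "excludeDeletedEntities": True,
    "limit": 25,
}
resp = requests.post(f"{ATLAS}/api/atlas/v2/search/basic", json=query, auth=AUTH)
resp.raise_for_status()

for header in resp.json().get("entities", []):
    # qualifiedName follows <output_table_name>-<timestamp>-<output_column_name>.
    print(header.get("attributes", {}).get("qualifiedName"))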