Spark entities created in Apache Atlas

Each Spark entity in Atlas includes detailed metadata collected from Spark.

The following diagram summarizes the entities created in Atlas for Spark operations. The data assets that Spark operations act upon are collected through the Hive Metastore (HMS). The supertypes that contribute attributes to the entity types are shaded.

Figure 1. Atlas Entity Types for Hive Server 2 Data Sets

The metadata collected for each entity type is as follows:

Spark Process
Identifier	Example content
typeName	`spark_process`
guid	System-generated ID. This value identifies the entity in the Atlas Dashboard URL.
qualifiedName	`<generated ID>`. The generated ID is distinct from the GUID.
name	`process_<generated ID>`
description	Metadata from Spark.
owner	Metadata from Spark.
ownerType	Metadata from Spark.
inputs	List of the input tables or views, including each entity’s type name and qualified name.
outputs	List of the output objects, including each entity’s type name and qualified name.
executionId	Metadata from Spark.
currUser	Metadata from Spark. In a Kerberized environment, this value contains the principal name.
remoteUser	Metadata from Spark. In a Kerberized environment, this value contains the principal name.
executionTime	Metadata from Spark.
details	Query plan text, including parsed logical plan, analyzed logical plan, optimized logical plan, and physical plan.
sparkPlanDescription	Physical plan text.
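
To illustrate how the attributes in the table fit together, the following is a minimal Python sketch that builds the Atlas v2 REST URL for an entity GUID and summarizes a `spark_process` entity returned as JSON. The host name, the GUID, and every value in `sample` are hypothetical. The payload layout (an `entity` object holding `attributes`, with `inputs` and `outputs` as object references carrying `typeName` and `uniqueAttributes`) is an assumption based on the general Atlas v2 entity format, so verify it against your Atlas version.

```python
from urllib.parse import quote


def entity_url(base_url, guid):
    """Build the Atlas v2 REST URL for a single entity, given its GUID."""
    return f"{base_url.rstrip('/')}/api/atlas/v2/entity/guid/{quote(guid, safe='')}"


def summarize_spark_process(payload):
    """Extract the identifying and lineage attributes of a spark_process entity.

    `payload` is assumed to be the parsed JSON body of an entity lookup.
    """
    entity = payload["entity"]
    if entity.get("typeName") != "spark_process":
        raise ValueError(f"not a spark_process entity: {entity.get('typeName')!r}")
    attrs = entity.get("attributes", {})

    def refs(key):
        # Each reference is reduced to (type name, qualified name).
        return [
            (ref.get("typeName"), ref.get("uniqueAttributes", {}).get("qualifiedName"))
            for ref in attrs.get(key) or []
        ]

    return {
        "guid": entity.get("guid"),
        "name": attrs.get("name"),
        "qualifiedName": attrs.get("qualifiedName"),
        "inputs": refs("inputs"),
        "outputs": refs("outputs"),
    }


# Hypothetical payload; all identifiers below are made up for illustration.
sample = {
    "entity": {
        "typeName": "spark_process",
        "guid": "abc-123",
        "attributes": {
            "qualifiedName": "gen-42",
            "name": "process_gen-42",
            "inputs": [
                {"typeName": "hive_table",
                 "uniqueAttributes": {"qualifiedName": "db.src@cluster"}}
            ],
            "outputs": [
                {"typeName": "hive_table",
                 "uniqueAttributes": {"qualifiedName": "db.dst@cluster"}}
            ],
        },
    }
}

summary = summarize_spark_process(sample)
```

In practice the payload would come from an authenticated HTTP GET against the URL that `entity_url` returns; the sketch omits the network call so that it stays self-contained.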