Spark entities created in Apache Atlas

Each Spark entity in Atlas includes detailed metadata collected from Spark.

The following diagrams summarize the entities created in Atlas for Spark operations. Metadata for the data assets that Spark operations act on is collected through the Hive Metastore (HMS). The supertypes that contribute attributes to the entity types are shaded.

Figure 1. Atlas Entity Types for Hive Server 2 Data Sets


The metadata collected for each entity type is as follows:

Spark Process

Identifier Example content
typeName spark_process
guid System generated ID. This value is used to identify the entity in the Atlas Dashboard URL.
qualifiedName <generated ID>. The generated ID is distinct from the GUID.

name process_<generated ID>
description Metadata from Spark.
owner Metadata from Spark.
ownerType Metadata from Spark.
inputs List of the input tables or views, including each entity’s type name and the qualified name.
outputs List of the output objects, including each entity’s type name and the qualified name.
executionId Metadata from Spark.
currUser Metadata from Spark. In a Kerberized environment, this value contains the principal name.
remoteUser Metadata from Spark. In a Kerberized environment, this value contains the principal name.
executionTime Metadata from Spark.
details Query plan text, including parsed logical plan, analyzed logical plan, optimized logical plan, and physical plan.
sparkPlanDescription Physical plan text.
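
The attributes listed above can be looked up and read programmatically. The following Python sketch is illustrative only and is not taken from the product documentation. It assumes the standard Atlas REST v2 entity endpoints, a hypothetical Atlas address (atlas.example.com:21000), and a hypothetical, hand-written spark_process entity whose inputs and outputs carry a typeName and a uniqueAttributes.qualifiedName, as described in the table. The helper names and all sample values are made up for the example; no request is sent.

```python
from urllib.parse import quote, urlencode

# Hypothetical Atlas address; replace with the host and port of your own Atlas server.
ATLAS_BASE_URL = "http://atlas.example.com:21000"


def entity_by_guid_url(guid, base_url=ATLAS_BASE_URL):
    """Build the REST v2 URL that returns one entity, identified by its GUID."""
    return f"{base_url}/api/atlas/v2/entity/guid/{quote(guid, safe='')}"


def entity_by_qualified_name_url(qualified_name, type_name="spark_process",
                                 base_url=ATLAS_BASE_URL):
    """Build the REST v2 URL that looks up an entity by type name and qualifiedName."""
    query = urlencode({"attr:qualifiedName": qualified_name}, safe=":")
    return f"{base_url}/api/atlas/v2/entity/uniqueAttribute/type/{type_name}?{query}"


def summarize_lineage(entity):
    """Return the process name plus (typeName, qualifiedName) pairs for inputs and outputs."""
    attrs = entity.get("attributes", {})

    def refs(key):
        # Each reference is assumed to look like an Atlas object ID:
        # {"typeName": ..., "uniqueAttributes": {"qualifiedName": ...}}
        return [
            (ref.get("typeName"), ref.get("uniqueAttributes", {}).get("qualifiedName"))
            for ref in attrs.get(key) or []
        ]

    return {"name": attrs.get("name"), "inputs": refs("inputs"), "outputs": refs("outputs")}


# Hypothetical entity, shaped like the attribute table above. All values are invented.
sample_entity = {
    "typeName": "spark_process",
    "attributes": {
        "qualifiedName": "0001",
        "name": "process_0001",
        "inputs": [
            {"typeName": "hive_table",
             "uniqueAttributes": {"qualifiedName": "default.sales@cm"}},
        ],
        "outputs": [
            {"typeName": "hive_table",
             "uniqueAttributes": {"qualifiedName": "default.sales_summary@cm"}},
        ],
    },
}

if __name__ == "__main__":
    print(entity_by_qualified_name_url("0001"))
    print(summarize_lineage(sample_entity))
```

In a Kerberized cluster, an actual request to either URL would also need to authenticate; that part is deliberately left out of the sketch.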