Spark troubleshooting

What do you do if you do not see Atlas metadata from Spark, or if you face related issues?

Spark includes an Atlas "hook", or plugin, called the Spark Atlas Connector (SAC), which runs on every Spark host. To troubleshoot problems, consider the following questions to narrow down where the problem is:

  • Are you missing all metadata?

    Make sure that all the services supporting Atlas are configured and running. For CDP, the configuration is done for you; look in Cloudera Manager to see that Kafka, Solr, and Atlas services are running in the Data Lake.

  • Are you missing all Spark process metadata?

    By default, Spark operations are configured to send metadata to Atlas. To check that these settings have not been rolled back, look at the Spark On YARN service configuration page in Cloudera Manager and make sure that Spark is configured to send metadata to Atlas (the Atlas Service property). Assuming this configuration is enabled, next check the Kafka topic queue to make sure that metadata messages are being produced by Spark and reaching the Kafka topic.
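
    For example, you can watch the ATLAS_HOOK topic with the Kafka console consumer while you run a Spark job. This is only a sketch: the broker address, the client.properties file holding your cluster's security settings, and the Ranger permission to consume from the topic are assumptions that depend on your environment.

    # Consume the next few messages from the ATLAS_HOOK topic to confirm that
    # Spark hook notifications are reaching Kafka. Adjust the broker address and
    # security settings for your cluster; on some distributions the command is
    # kafka-console-consumer.sh.
    kafka-console-consumer \
      --bootstrap-server <broker-host>:9092 \
      --consumer.config client.properties \
      --topic ATLAS_HOOK \
      --max-messages 10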

  • Are you missing only some Spark metadata?

    Because each instance of Spark collects metadata independently of other instances, it is possible that one instance failed to send metadata to Atlas. To determine if this is the problem, check the Kafka topic queue to see if one of the Spark hosts is not sending metadata.
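
    For example, building on the consumer command above, you can count recent hook messages per originating host; a host that never appears in the output is a likely suspect. This assumes the hook notifications are JSON envelopes that include a msgSourceIP field identifying the sending host; verify the field name against a sample message from your Atlas version.

    # Count ATLAS_HOOK messages per source host over the next 200 messages
    # (sketch; adjust the broker address and security settings as above).
    kafka-console-consumer \
      --bootstrap-server <broker-host>:9092 \
      --consumer.config client.properties \
      --topic ATLAS_HOOK \
      --max-messages 200 \
      | grep -o '"msgSourceIP":"[^"]*"' | sort | uniq -c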

  • When you run a standard Spark Structured Streaming application in a DataHub Data Engineering cluster, the Spark job might throw exceptions in the YARN application logs.

    For example:

    WARN clients.NetworkClient: [Producer clientId=producer-1] Error while fetching metadata with correlation id 1 : {ATLAS_HOOK=TOPIC_AUTHORIZATION_FAILED}
    ERROR hook.AtlasHook: Giving up after 3 failed attempts to send notification to Atlas

    The Spark application tries to publish the metadata to the Kafka topic named ATLAS_HOOK. The publish fails because the Spark application user (csso_jonathan in this example) does not have permission to write to that topic.

    To resolve this permission issue, grant the Ranger permission to the application user csso_jonathan as follows:

    Go to Ranger Admin > Kafka policies and grant the "Publish" permission to the csso_jonathan user in the following policies:

      • ATLAS_HOOK
      • ATLAS_ENTITIES
      • ATLAS_SPARK_HOOK
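
    If you prefer to script the change, the Ranger public REST API can create an equivalent policy. The sketch below assumes a Ranger Admin at ranger-host:6080, a Kafka repository named cm_kafka, and admin credentials; in most cases it is simpler to edit the existing policies in the Ranger Admin UI as described above.

    # Create a Kafka policy that grants publish on ATLAS_HOOK to csso_jonathan.
    # Sketch only: host name, service (repository) name, policy name, and
    # credentials are assumptions; repeat or extend for the other topics.
    curl -u admin:<password> -H "Content-Type: application/json" \
      -X POST http://ranger-host:6080/service/public/v2/api/policy \
      -d '{
            "service": "cm_kafka",
            "name": "spark_atlas_hook_publish",
            "resources": { "topic": { "values": ["ATLAS_HOOK"] } },
            "policyItems": [{
              "users": ["csso_jonathan"],
              "accesses": [{ "type": "publish", "isAllowed": true }]
            }]
          }'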

Then re-run the Spark job and verify that the Atlas-related errors no longer appear in the application logs.