Guidelines for Virtual Cluster upkeep

There are upkeep guidelines for Cloudera Data Engineering Spark History Server (SHS) that you'll need to consider.

Lifecycle configuration of Cloudera Data Engineering Spark event logs

The number of Spark event logs (spark.eventLog.enabled), that are produced by Spark jobs that run via Cloudera Data Engineering Virtual Cluster, grows indefinitely with each new Cloudera Data Engineering Spark run. These event logs are not automatically deleted and are stored on the object store under <CDP env storage location>/dex/<Service ID>/<VC ID>/eventlog/.

Some examples of the event log location can look like the following:

  • For Amazon Web Services (AWS): s3a://dex-storage-bucket/datalake/logs/dex/cluster-2xvl4pfp/rdw8q2sh/eventlog/
  • For Azure: abfs://logs@dexstorageaccount.dfs.core.windows.net/dex/cluster-4p54mk8j/22bnm99g/eventlog/
Because the number of event logs continuously increases, the time from when the Cloudera Data Engineering job finishes and when the Spark UI is available for this run on the Virtual Cluster UI may increase. The delay is most apparent in Virtual Clusters with 6,000 or more completed Spark job runs.
To avoid delays in event log availability after Cloudera Data Engineering job runs, you can configure an object store lifecycle policy so that event logs are deleted automatically on the object store. For more information about an Amazon S3 lifecycle policy, see Setting lifecycle configuration on a bucket linked below. For more information about Azure lifecycle management policies, see Configure a lifecycle management policy linked below.