Guidelines for Virtual Cluster upkeep

There are upkeep guidelines for Cloudera Data Engineering (CDE) Spark History Server (SHS) that you'll need to consider.

Lifecycle configuration of CDE Spark event logs

The number of Spark event logs (spark.eventLog.enabled), that are produced by Spark jobs that run via CDE Virtual Cluster, grows indefinitely with each new CDE Spark run. These event logs are not automatically deleted and are stored on the object store under <CDP env storage location>/dex/<Service ID>/<VC ID>/eventlog/.

Some examples of the event log location can look like the following:

  • For Amazon Web Services (AWS): s3a://dex-storage-bucket/datalake/logs/dex/cluster-2xvl4pfp/rdw8q2sh/eventlog/
  • For Azure: abfs://logs@dexstorageaccount.dfs.core.windows.net/dex/cluster-4p54mk8j/22bnm99g/eventlog/
Because the number of event logs continuously increases, the time from when the CDE job finishes and when the Spark UI is available for this run on the Virtual Cluster UI may increase. The delay is most apparent in Virtual Clusters with 6,000 or more completed Spark job runs.
To avoid delays in event log availability after CDE job runs, you can configure an object store lifecycle policy so that event logs are deleted automatically on the object store. For more information about an Amazon S3 lifecycle policy, see Setting lifecycle configuration on a bucket linked below. For more information about Azure lifecycle management policies, see Configure a lifecycle management policy linked below.