Guidelines for Virtual Cluster upkeep
Consider the following upkeep guidelines for the Cloudera Data Engineering Spark History Server (SHS).
Lifecycle configuration of Cloudera Data Engineering Spark event logs
The number of Spark event logs (controlled by spark.eventLog.enabled) produced by Spark jobs that run in a Cloudera Data Engineering Virtual Cluster grows indefinitely with each new Cloudera Data Engineering Spark run. These event logs are not deleted automatically and are stored on the object store under <CDP env storage location>/dex/<Service ID>/<VC ID>/eventlog/.
For example, the event log location can look like the following:
- For Amazon Web Services (AWS): s3a://dex-storage-bucket/datalake/logs/dex/cluster-2xvl4pfp/rdw8q2sh/eventlog/
- For Azure: abfs://logs@dexstorageaccount.dfs.core.windows.net/dex/cluster-4p54mk8j/22bnm99g/eventlog/
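Because these logs accumulate indefinitely, you can expire them with an object store lifecycle rule. The following is a minimal sketch for AWS, assuming the bucket name and prefix from the example above and an arbitrary 30-day retention window; Azure offers analogous lifecycle management policies for storage accounts.
# Hypothetical lifecycle rule: expire event logs under the example prefix after 30 days.
# Caution: this call replaces the bucket's entire lifecycle configuration, so merge
# this rule with any existing rules before applying it.
aws s3api put-bucket-lifecycle-configuration \
  --bucket dex-storage-bucket \
  --lifecycle-configuration '{
    "Rules": [{
      "ID": "expire-cde-spark-event-logs",
      "Filter": {"Prefix": "datalake/logs/dex/cluster-2xvl4pfp/rdw8q2sh/eventlog/"},
      "Status": "Enabled",
      "Expiration": {"Days": 30}
    }]
  }'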
Spark History configuration to reduce I/O consumption
Cloudera recommends cleaning up old Spark History Server (SHS) logs regularly to significantly reduce Input/Output (I/O) consumption. If old SHS logs are not cleaned up, the Spark History Server generates a high volume of read operations, leading to excessive I/O usage regardless of the number of running jobs.
To reduce I/O consumption, follow this procedure:
- Provide the Virtual Cluster (VC) ID in the following command, then run it to open the spark-defaults ConfigMap for editing:
  kubectl edit configmap dex-app-[***VC-ID***]-spark-defaults -n dex-app-xxxxxxxx
- At the end of the spark-defaults.conf section, add:
  spark.history.fs.update.interval: 300s
  You can adjust the time interval as desired. The default value is 10s.
- Restart the dex-app-xxxxxxxx-shs-xxxx pod in the dex-app-xxxxxxxx namespace:
  kubectl delete pod dex-app-xxxxxxxx-shs-xxxx -n dex-app-xxxxxxxx
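After the delete, Kubernetes recreates the SHS pod automatically. As an optional check, assuming the same placeholder namespace and pod naming as above, you can confirm that the new pod is running:
# List the SHS pod in the Virtual Cluster namespace; its STATUS should be Running
kubectl get pods -n dex-app-xxxxxxxx | grep shs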