Guidelines for Virtual Cluster upkeep
Consider the following upkeep guidelines for the Cloudera Data Engineering Spark History Server (SHS).
Lifecycle configuration of Cloudera Data Engineering Spark event logs
The number of Spark event logs (controlled by spark.eventLog.enabled) produced by Spark jobs that run in a Cloudera Data Engineering Virtual Cluster grows indefinitely with each new Cloudera Data Engineering Spark run. These event logs are not deleted automatically and are stored on the object store under <CDP env storage location>/dex/<Service ID>/<VC ID>/eventlog/.
For example, the event log location can look like the following:
- For Amazon Web Services (AWS): s3a://dex-storage-bucket/datalake/logs/dex/cluster-2xvl4pfp/rdw8q2sh/eventlog/
- For Azure: abfs://logs@dexstorageaccount.dfs.core.windows.net/dex/cluster-4p54mk8j/22bnm99g/eventlog/
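Because these logs accumulate indefinitely, you can expire them with an object store lifecycle rule. The following is a minimal sketch for AWS, assuming the bucket name and prefix from the example above and an arbitrary 30-day retention window; Azure offers analogous lifecycle management policies for storage accounts.
# Hypothetical lifecycle rule: expire event logs under the example prefix after 30 days.
# Caution: this call replaces the bucket's entire lifecycle configuration, so merge
# this rule with any existing rules before applying it.
aws s3api put-bucket-lifecycle-configuration \
  --bucket dex-storage-bucket \
  --lifecycle-configuration '{
    "Rules": [{
      "ID": "expire-cde-spark-event-logs",
      "Filter": {"Prefix": "datalake/logs/dex/cluster-2xvl4pfp/rdw8q2sh/eventlog/"},
      "Status": "Enabled",
      "Expiration": {"Days": 30}
    }]
  }'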
Spark History configuration to reduce I/O consumption
Cloudera recommends cleaning up old Spark History Server (SHS) logs regularly to significantly reduce Input/Output (I/O) consumption. If old SHS logs are not cleaned up, the Spark History Server generates a high volume of read operations, leading to excessive I/O usage regardless of the number of running jobs.
To reduce I/O consumption, follow this procedure:
- Provide the Virtual Cluster (VC) ID in the following command, then run it to open the spark-defaults ConfigMap for editing:
  kubectl edit configmap dex-app-[***VC-ID***]-spark-defaults -n dex-app-xxxxxxxx
- At the end of the spark-defaults.conf section, add:
  spark.history.fs.update.interval: 300s
  You can adjust the time interval as desired. The default value is 10s.
- Restart the dex-app-xxxxxxxx-shs-xxxx pod in the dex-app-xxxxxxxx namespace:
  kubectl delete pod dex-app-xxxxxxxx-shs-xxxx -n dex-app-xxxxxxxx
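After the delete, Kubernetes recreates the SHS pod automatically. As an optional check, assuming the same placeholder namespace and pod naming as above, you can confirm that the new pod is running:
# List the SHS pod in the Virtual Cluster namespace; its STATUS should be Running
kubectl get pods -n dex-app-xxxxxxxx | grep shs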