Guidelines for Virtual Cluster upkeep

Consider the following upkeep guidelines for the Cloudera Data Engineering (CDE) Spark History Server (SHS).

Lifecycle configuration of CDE Spark event logs

Spark event logs (enabled with spark.eventLog.enabled) are produced by every Spark job that runs in a CDE Virtual Cluster, so their number grows indefinitely with each new CDE Spark run. These event logs are not deleted automatically; they are stored on the object store under <CDP env storage location>/dex/<Service ID>/<VC ID>/eventlog/.

Event log locations look like the following:

  • For Amazon Web Services (AWS): s3a://dex-storage-bucket/datalake/logs/dex/cluster-2xvl4pfp/rdw8q2sh/eventlog/
  • For Azure: abfs://logs@dexstorageaccount.dfs.core.windows.net/dex/cluster-4p54mk8j/22bnm99g/eventlog/
Because the number of event logs continuously increases, the time between when a CDE job finishes and when its Spark UI becomes available on the Virtual Cluster UI may also increase. The delay is most apparent in Virtual Clusters with 6,000 or more completed Spark job runs.
To avoid delays in event log availability after CDE job runs, you can configure an object store lifecycle policy so that event logs are deleted automatically on the object store. For more information about an Amazon S3 lifecycle policy, see Setting lifecycle configuration on a bucket linked below. For more information about Azure lifecycle management policies, see Configure a lifecycle management policy linked below.
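As a sketch of such a lifecycle policy, the rules below expire event logs after 30 days. The bucket name, storage account, resource group, prefix, and retention period are illustrative assumptions based on the example paths above; substitute the values for your environment.

```shell
# Sketch: expire CDE Spark event logs on S3 after 30 days. The bucket
# name and prefix are assumptions taken from the example AWS path above.
aws s3api put-bucket-lifecycle-configuration \
  --bucket dex-storage-bucket \
  --lifecycle-configuration '{
    "Rules": [{
      "ID": "expire-cde-eventlogs",
      "Status": "Enabled",
      "Filter": { "Prefix": "datalake/logs/dex/cluster-2xvl4pfp/rdw8q2sh/eventlog/" },
      "Expiration": { "Days": 30 }
    }]
  }'

# Azure equivalent: apply a lifecycle management policy to the storage
# account; policy.json would hold a deleteAfterDaysSinceModification rule
# scoped to the eventlog/ prefix. Account and resource group are assumed.
az storage account management-policy create \
  --account-name dexstorageaccount \
  --resource-group my-resource-group \
  --policy @policy.json
```

Scoping the rule to the eventlog/ prefix ensures that only Spark event logs are expired and other job artifacts under the same bucket are untouched.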

Increasing database size on Azure

Every job that runs in a CDE Virtual Cluster adds logs and entries to the cluster's Azure database. When the database fills up, increase its size.
  1. Ensure that there are no jobs running.
  2. Go to the Azure Portal.
  3. Locate the Azure SQL database named <cluster id>, for example, cluster-2cmmds8q.
  4. Navigate to the Pricing Tier section.
  5. Move the slider to the desired value.
  6. Click OK and wait for the scaling to complete.
  7. Resume running your jobs.
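The portal steps above can also be sketched from the command line. The resource group, server name, and target pricing tier below are assumptions; the database name matches the cluster id, as in step 3.

```shell
# Sketch: scale the CDE cluster's Azure SQL database to a larger pricing
# tier. Resource group, server, and target tier (S3) are assumed values;
# the database name matches the cluster id, for example cluster-2cmmds8q.
az sql db update \
  --resource-group my-resource-group \
  --server my-sql-server \
  --name cluster-2cmmds8q \
  --service-objective S3
```

As with the portal flow, make sure no jobs are running before scaling, and wait for the operation to complete before resuming them.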

Increasing database size on Amazon Web Services

Every job that runs in a CDE Virtual Cluster adds logs and entries to the cluster's Amazon Web Services (AWS) database. When the database fills up, increase its size.
  1. Go to the AWS Console.
  2. Navigate to the RDS Service page.
  3. Click Databases and use the filter to find your cluster id, for instance, cluster-w8d65nxp.
  4. Select the target database.
  5. On the database homepage, click Modify in the top right-hand corner.
  6. Scroll down to the database size settings and set the Allocated Storage property to the desired value.
  7. Click Continue.
  8. Set the desired timeframe for maintenance.
  9. Click Continue and wait for the database status to become available.
  10. Resume running your jobs.
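The console steps above can be sketched with the AWS CLI. The instance identifier and target storage size are assumptions; the identifier matches the cluster id, as in step 3.

```shell
# Sketch: increase the allocated storage of the CDE cluster's RDS
# database. The instance identifier (matching the cluster id, for
# example cluster-w8d65nxp) and the 200 GiB target are assumed values.
aws rds modify-db-instance \
  --db-instance-identifier cluster-w8d65nxp \
  --allocated-storage 200 \
  --apply-immediately

# Block until the instance returns to the "available" status,
# mirroring step 9 of the console flow.
aws rds wait db-instance-available \
  --db-instance-identifier cluster-w8d65nxp
```

Without --apply-immediately, the storage change is deferred to the next maintenance window, which corresponds to setting a maintenance timeframe in step 8.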