Spark on Kubernetes

In Cloudera AI, multiple Spark versions are available through Runtime Addons. Data Scientists can select which Spark version to use for any given workload; Cloudera AI then configures the Runtime container and mounts all required dependencies.

Cloudera AI supports fully containerized execution of Spark workloads through Spark's native support for the Kubernetes cluster backend. Users can run Spark either interactively or in batch mode. In both modes, dependency management, including for Spark executors, is handled transparently by Cloudera AI and Kubernetes; no extra configuration is required. In interactive mode, Cloudera AI leverages the cloud provider for scalable project storage, while in batch mode it manages dependencies through container images.
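Under Spark's Kubernetes backend, a batch submission targets the cluster's API server and runs the driver and executors as pods. Cloudera AI generates this configuration automatically; the following spark-submit sketch only illustrates what the backend expects, with the master URL, namespace, and container image left as placeholders:

```shell
# Illustrative sketch of Spark's native Kubernetes submission; Cloudera AI
# supplies the equivalent settings for you. All angle-bracketed values are
# placeholders, not Cloudera defaults.
spark-submit \
  --master k8s://https://<kubernetes-api-server>:6443 \
  --deploy-mode cluster \
  --name spark-pi \
  --conf spark.kubernetes.namespace=<user-namespace> \
  --conf spark.kubernetes.container.image=<spark-image> \
  --conf spark.executor.instances=2 \
  --class org.apache.spark.examples.SparkPi \
  local:///opt/spark/examples/jars/spark-examples.jar
```

The `local://` scheme tells Spark that the application jar is already present inside the container image, which matches the batch-mode model of managing dependencies through images.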

Cloudera AI also supports native cloud autoscaling via Kubernetes. When a cluster does not have the capacity required to run a workload, it can automatically scale up by provisioning additional nodes. Administrators can configure auto-scaling upper limits, which determine how large a compute cluster can grow. Because compute costs increase with cluster size, these upper limits give administrators a way to stay within budget. Autoscaling policies can also account for heterogeneous node types such as GPU nodes.
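The upper limit described above corresponds to a maximum node count on each autoscaling node group. As an illustration only, an eksctl-style node-group definition (field names assumed here, not Cloudera-specific) might bound a CPU pool and a GPU pool separately:

```yaml
# Hypothetical eksctl-style node groups; instance types and bounds are
# illustrative. Cloudera AI exposes equivalent limits through its
# administrative settings.
nodeGroups:
  - name: cpu-workers
    instanceType: m5.2xlarge
    minSize: 1
    maxSize: 10   # upper limit caps cluster growth, and therefore cost
  - name: gpu-workers
    instanceType: p3.2xlarge
    minSize: 0
    maxSize: 4    # heterogeneous GPU pool scales independently
```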

In Cloudera AI, each project is owned by a user or team, and users can launch multiple sessions within a project. Workloads are launched in a separate Kubernetes namespace for each user, ensuring isolation between users at the Kubernetes level.
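Because each user's workloads land in a dedicated namespace, standard Kubernetes tooling can scope inspection and policy to a single user. A minimal sketch (the namespace name is a placeholder, not a documented Cloudera AI naming convention):

```shell
# List only one user's workload pods; other users' pods are not visible
# without access to their namespaces.
kubectl get pods --namespace <user-namespace>

# Namespace-scoped policies (quotas, network policies) apply the same way:
kubectl get resourcequota --namespace <user-namespace>
```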