Spark on Kubernetes

In Cloudera Machine Learning (CML), multiple Spark versions are available through Runtime Addons. Data Scientists can select which Spark version to use for any workload; CML then configures the Runtime container and mounts all of the required dependencies.
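
For example, a user can confirm which Spark version the selected Runtime Addon mounted into a session. The snippet below is a minimal sketch using only the standard PySpark library that CML mounts; the addon itself is chosen in the CML UI, not in code, and the printed version is illustrative.

```python
# Minimal sketch: confirm which Spark version the selected Runtime Addon
# mounted into the session, without starting a Spark job.
import pyspark

print(pyspark.__version__)   # e.g. "3.3.0", depending on the chosen addon
print(pyspark.__file__)      # location of the mounted PySpark distribution
```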

CML supports fully containerized execution of Spark workloads through Spark's support for the Kubernetes cluster backend. Users can run Spark workloads interactively or in batch mode. In both modes, dependency management, including for Spark executors, is handled transparently by CML and Kubernetes, with no extra configuration required. In interactive mode, CML leverages the cloud provider for scalable project storage, while in batch mode it manages dependencies through container images.
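
The sketch below shows what interactive use looks like in practice, assuming a CML session with a Spark Runtime Addon attached and CML's pre-populated Spark configuration for the Kubernetes backend (so no master URL or image settings are passed in code). The application name and workload are illustrative.

```python
# Interactive sketch: the standard PySpark API, with executors scheduled as
# pods on the Kubernetes cluster by the pre-configured backend.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("cml-interactive-example").getOrCreate()

# The DataFrame API is unchanged; only the execution backend differs.
df = spark.range(0, 1_000_000).withColumn("square", F.col("id") * F.col("id"))
print(df.agg(F.sum("square")).collect())

spark.stop()
```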

CML also supports native cloud autoscaling via Kubernetes. When a cluster lacks the capacity to run a workload, it can automatically provision additional nodes. Administrators can configure upper limits that determine how large a compute cluster can grow. Because compute costs rise with cluster size, these limits give administrators a way to stay within budget. Autoscaling policies can also account for heterogeneous node types such as GPU nodes.
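
Autoscaling demand ultimately comes from the executor pods a workload requests. The sketch below bounds that demand with standard Spark dynamic-allocation properties; the property names are standard Spark settings, while the values and how they interact with CML's cluster-level autoscaling limits are illustrative assumptions.

```python
# Sketch: cap executor demand so Kubernetes autoscaling stays within the
# administrator's node limits. Values are illustrative only.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("bounded-autoscaling-example")
    .config("spark.dynamicAllocation.enabled", "true")
    # Shuffle tracking is needed for dynamic allocation on Kubernetes,
    # which has no external shuffle service.
    .config("spark.dynamicAllocation.shuffleTracking.enabled", "true")
    .config("spark.dynamicAllocation.minExecutors", "1")
    .config("spark.dynamicAllocation.maxExecutors", "10")
    .config("spark.executor.cores", "2")
    .config("spark.executor.memory", "4g")
    # For heterogeneous node types, executors can also request GPUs, e.g.:
    # .config("spark.executor.resource.gpu.amount", "1")
    .getOrCreate()
)
```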

In CML, each project is owned by a user or team. Users can launch multiple sessions in a project. Workloads are launched within a separate Kubernetes namespace for each user, thus ensuring isolation between users at the Kubernetes level.
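
For administrators with direct access to the cluster, this per-user isolation can be observed through the Kubernetes API. The sketch below uses the official kubernetes Python client; the namespace name shown is hypothetical, since the actual per-user namespace naming is managed by CML.

```python
# Sketch: list the workload pods running in one user's namespace.
from kubernetes import client, config

config.load_kube_config()   # or config.load_incluster_config() inside the cluster
v1 = client.CoreV1Api()

# Each user's workloads run only in that user's namespace.
for pod in v1.list_namespaced_pod(namespace="cml-user-alice").items:
    print(pod.metadata.name, pod.status.phase)
```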