Spark on Kubernetes

In Cloudera Machine Learning, multiple Spark versions are available through Runtime Addons. Data Scientists can select the version of Spark to be configured for any workload. Cloudera Machine Learning configures the Runtime container and mounts all of the dependencies.

Cloudera Machine Learning supports fully containerized execution of Spark workloads through Spark's support for the Kubernetes cluster backend. Users can run Spark workloads interactively or in batch mode. In both modes, dependency management, including for Spark executors, is handled transparently by Cloudera Machine Learning and Kubernetes, with no extra configuration required. In interactive mode, Cloudera Machine Learning leverages the cloud provider for scalable project storage; in batch mode, it manages dependencies through container images.

Cloudera Machine Learning also supports native cloud autoscaling via Kubernetes. When a cluster lacks the capacity to run a workload, it automatically scales up by provisioning additional nodes. Administrators can configure autoscaling upper limits, which cap how large a compute cluster can grow. Because compute costs grow with cluster size, these upper limits give administrators a way to stay within budget. Autoscaling policies can also account for heterogeneous node types such as GPU nodes.
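The scale-up decision described above can be sketched as follows. This is a purely illustrative model, not Cloudera Machine Learning's actual autoscaler: the function name, the pods-per-node capacity model, and the numbers are all assumptions.

```python
import math

def nodes_to_add(pending_pods, pods_per_node, current_nodes, max_nodes):
    """Illustrative scale-up decision (assumption, not CML's implementation):
    provision enough nodes to place the pending pods, but never exceed the
    administrator-configured upper limit."""
    needed = math.ceil(pending_pods / pods_per_node)
    return max(0, min(needed, max_nodes - current_nodes))

# A cluster of 2 nodes (upper limit 5) with 9 pending pods, 4 pods per node:
# 3 more nodes are needed and the limit allows exactly 3.
print(nodes_to_add(pending_pods=9, pods_per_node=4, current_nodes=2, max_nodes=5))   # 3

# With 20 pending pods the need is 5, but the upper limit caps growth at 3.
print(nodes_to_add(pending_pods=20, pods_per_node=4, current_nodes=2, max_nodes=5))  # 3
```

Capping the scale-up at `max_nodes - current_nodes` is what lets an administrator bound spend: no matter how much demand arrives, the cluster never grows past the configured limit.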

In Cloudera Machine Learning, each project is owned by a user or team. Users can launch multiple sessions in a project. Workloads are launched within a separate Kubernetes namespace for each user, thus ensuring isolation between users at the Kubernetes level.