Developing Apache Spark ApplicationsPDF version

Enabling vectorized query execution

Vectorized query execution is a feature that greatly reduces the CPU usage for typical query operations such as scans, filters, aggregates, and joins. Vectorization is also implemented for the ORC format. Spark also uses Whole Stage Codegen and this vectorization (for Parquet) since Spark 2.0.

Use the following steps to implement the new ORC format and enable vectorization for ORC files with SparkSQL.

  1. In the Cloudera Data Platform (CDP) Management Console, go to Data Hub Clusters.
  2. Find and select the cluster you want to configure.
  3. Click the link for the Cloudera Manager URL.
  4. Go to Clusters > <Cluster Name> > Spark service > Configuration.
  5. Select the Scope > Gateway and Category > Advanced filters.
  6. Add the following properties to Spark Client Advanced Configuration Snippet (Safety Valve) for spark-conf/spark-defaults.conf:
    • spark.sql.orc.enabled=true – Enables the new ORC format to read/write Spark data source tables and files.

    • spark.sql.hive.convertMetastoreOrc=true – Enables the new ORC format to read/write Hive tables.

    • spark.sql.orc.char.enabled=true – Enables the new ORC format to use CHAR types to read Hive tables. By default, STRING types are used for performance reasons. This is an optional configuration for Hive compatibility.

  7. Click Save Changes, and then restart Spark and any other components that require a restart.