Enabling vectorized query execution
Vectorized query execution is a feature that greatly reduces the CPU usage for typical query operations such as scans, filters, aggregates, and joins. Vectorization is also implemented for the ORC format. Spark also uses Whole Stage Codegen and this vectorization (for Parquet) since Spark 2.0.
Use the following steps to implement the new ORC format and enable vectorization for ORC files with SparkSQL.
- In the Cloudera Data Platform (CDP) Management Console, go to Data Hub Clusters.
- Find and select the cluster you want to configure.
- Click the link for the Cloudera Manager URL.
- Go to .
- Select the and filters.
Add the following properties to Spark Client Advanced
Configuration Snippet (Safety Valve) for
spark.sql.orc.enabled=true– Enables the new ORC format to read/write Spark data source tables and files.
spark.sql.hive.convertMetastoreOrc=true– Enables the new ORC format to read/write Hive tables.
spark.sql.orc.char.enabled=true– Enables the new ORC format to use
CHARtypes to read Hive tables. By default,
STRINGtypes are used for performance reasons. This is an optional configuration for Hive compatibility.
- Click Save Changes, and then restart Spark and any other components that require a restart.