Developing Apache Spark Applications

Enable Vectorized Query Execution

Vectorized query execution greatly reduces CPU usage for typical query operations such as scans, filters, aggregates, and joins. Spark has used whole-stage code generation together with vectorization for Parquet since Spark 2.0 (released on July 26, 2016). Vectorization is also implemented for the ORC format.

Use the following steps to enable the new ORC format and vectorization for ORC files in Spark SQL.

  1. On the Ambari Dashboard, select Spark2 > Configs. For Spark shells and applications, click Custom spark2-defaults, then add the following properties. For the Spark Thrift Server, add these properties under Custom spark2-thrift-sparkconf.
    • spark.sql.orc.enabled=true – Enables the new ORC format to read/write Spark data source tables and files.

    • spark.sql.hive.convertMetastoreOrc=true – Enables the new ORC format to read/write Hive tables.

    • spark.sql.orc.char.enabled=true – Enables the new ORC format to use CHAR types to read Hive tables. By default, STRING types are used for performance reasons. This is an optional configuration for Hive compatibility.

  2. Click Save, then restart Spark and any other components that require a restart.
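
The properties from step 1 can also be kept in one place and rendered in the form each tool expects. The following is a minimal sketch, not part of the Ambari procedure: the helper names `as_defaults_lines` and `as_submit_flags` are hypothetical, and passing the properties per application through `spark-submit --conf` is an assumption that this guide does not confirm for these settings.

```python
# Properties and values taken from step 1 above.
ORC_VECTORIZATION_PROPS = {
    "spark.sql.orc.enabled": "true",               # new ORC format for data source tables and files
    "spark.sql.hive.convertMetastoreOrc": "true",  # new ORC format for Hive tables
    "spark.sql.orc.char.enabled": "true",          # optional: CHAR types for Hive compatibility
}


def as_defaults_lines(props):
    """Render key=value lines, as entered under Custom spark2-defaults."""
    return [f"{key}={value}" for key, value in props.items()]


def as_submit_flags(props):
    """Render --conf flags for spark-submit (assumed per-application alternative)."""
    flags = []
    for key, value in props.items():
        flags += ["--conf", f"{key}={value}"]
    return flags
```

For example, `" ".join(as_submit_flags(ORC_VECTORIZATION_PROPS))` yields a string of `--conf key=value` pairs that could be appended to a `spark-submit` command line, while `as_defaults_lines` produces the entries to paste into Ambari.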