Perform the following steps to configure Tez for your Hadoop cluster:

1. Create a tez-site.xml configuration file and place it in the /etc/tez/conf configuration directory. A sample tez-site.xml file is included in the configuration_files/tez folder in the HDP companion files.

2. Create the $TEZ_CONF_DIR environment variable and set it to the location of the tez-site.xml file:

   ```
   export TEZ_CONF_DIR=/etc/tez/conf
   ```
3. Create the $TEZ_JARS environment variable and set it to the location of the Tez jars and their dependencies:

   ```
   export TEZ_JARS=/usr/lib/tez/*:/usr/lib/tez/lib/*
   ```

   Note: Be sure to include the asterisks (*) in the above command.
4. In the tez-site.xml file, configure the tez.lib.uris property with the HDFS paths containing the Tez jar files (a sketch for staging the jars in HDFS follows this procedure):

   ```
   ...
   <property>
     <name>tez.lib.uris</name>
     <value>${fs.default.name}/apps/tez/,${fs.default.name}/apps/tez/lib/</value>
   </property>
   ...
   ```
5. Add $TEZ_CONF_DIR and $TEZ_JARS to the $HADOOP_CLASSPATH environment variable:

   ```
   export HADOOP_CLASSPATH=$TEZ_CONF_DIR:$TEZ_JARS:$HADOOP_CLASSPATH
   ```

   Where:

   - $TEZ_CONF_DIR is the location of tez-site.xml.
   - $TEZ_JARS is the location of the Tez jars and their dependencies.
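The HDFS paths referenced by tez.lib.uris in step 4 must exist and be readable before applications can download the Tez jars from them. The following is a minimal sketch of staging the jars, assuming a local Tez installation under /usr/lib/tez; the /apps/tez target directory is an example, so adjust both paths to match your cluster:

```
# Create the target directories in HDFS (example paths)
hdfs dfs -mkdir -p /apps/tez/lib

# Copy the Tez jars and their dependencies into HDFS
hdfs dfs -put /usr/lib/tez/*.jar /apps/tez/
hdfs dfs -put /usr/lib/tez/lib/*.jar /apps/tez/lib/

# Make the jars readable by all users submitting Tez jobs
hdfs dfs -chmod -R 755 /apps/tez
```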
Table 10.1. Tez Configuration Parameters

Configuration Parameter | Description | Default Value |
---|---|---|
tez.lib.uris | Location of the Tez jars and their dependencies. Tez applications download required jar files from this location, so it should be publicly accessible. | N/A |
tez.am.log.level | Root logging level passed to the Tez Application Master. | INFO |
tez.staging-dir | The staging directory used by Tez when application developers submit DAGs (directed acyclic graphs). Tez creates all temporary files for the DAG job in this directory. | /tmp/${user.name}/staging |
tez.am.resource.memory.mb | The amount of memory in MB that YARN allocates to the Tez Application Master. The size increases with the size of the DAG. | 1536 |
tez.am.java.opts | Java options for the Tez Application Master process. The -Xmx value should be smaller than the value of tez.am.resource.memory.mb, typically by 512 MB, to account for non-JVM memory in the process. | -server -Xmx1024m -Djava.net.preferIPv4Stack=true -XX:+UseNUMA -XX:+UseParallelGC |
tez.am.shuffle-vertex-manager.min-src-fraction | In a shuffle operation over a scatter-gather edge connection, Tez may start data consumer tasks before all data producer tasks complete, in order to overlap the shuffle I/O. This parameter specifies the fraction of producer tasks that must complete before any consumer tasks are scheduled. The fraction is expressed as a decimal, so the default value of 0.2 represents 20%. | 0.2 |
tez.am.shuffle-vertex-manager.max-src-fraction | In a shuffle operation over a scatter-gather edge connection, Tez may start data consumer tasks before all data producer tasks complete, in order to overlap the shuffle I/O. This parameter specifies the fraction of producer tasks that must complete before all consumer tasks are scheduled. The number of consumer tasks ready for scheduling scales linearly between min-fraction and max-fraction. The fraction is expressed as a decimal, so the default value of 0.4 represents 40%. | 0.4 |
tez.am.am-rm.heartbeat.interval-ms.max | Determines how frequently, in milliseconds, the Tez Application Master asks the YARN ResourceManager for resources. A low value can overload the ResourceManager. | 250 |
tez.am.grouping.split-waves | Specifies the number of waves, or the percentage of queue container capacity, used to process a data set, where a value of 1 represents 100% of container capacity. The Tez Application Master considers this parameter value, the available cluster resources, and the resources required by the application to calculate parallelism, or the number of tasks to run. Processing queries with additional containers leads to lower latency. However, resource contention may occur if multiple users run large queries simultaneously. | Tez default: 1.4; Hive default: 1.7 |
tez.am.grouping.min-size | Specifies the lower bound, in bytes, of the size of the primary input to each task when the Tez Application Master determines the parallelism of primary input reading tasks. This property prevents input tasks from being too small, which prevents their parallelism from being too large. | 16777216 (16 MB) |
tez.am.grouping.max-size | Specifies the upper bound, in bytes, of the size of the primary input to each task when the Tez Application Master determines the parallelism of primary input reading tasks. This property prevents input tasks from being too large, which prevents their parallelism from being too small. | 1073741824 (1 GB) |
tez.am.container.reuse.enabled | A container is the unit of resource allocation in YARN. This parameter determines whether Tez reuses the same container to run multiple tasks. Enabling it improves performance by avoiding the memory overhead of reallocating container resources for every task. However, disable it if the tasks contain memory leaks or use static variables. | true |
tez.am.container.reuse.rack-fallback.enabled | Specifies whether to reuse containers for rack-local tasks. This parameter is ignored unless tez.am.container.reuse.enabled is enabled. | true |
tez.am.container.reuse.non-local-fallback.enabled | Specifies whether to reuse containers for non-local tasks. This parameter is ignored unless tez.am.container.reuse.enabled is enabled. | true |
tez.am.container.session.delay-allocation-millis | Determines when a Tez session releases its containers while not actively servicing a query. Specify a value of -1 to never release an idle container in a session. However, containers may still be released if they do not meet resource or locality needs. This parameter is ignored unless tez.am.container.reuse.enabled is enabled. | 10000 (10 seconds) |
tez.am.container.reuse.locality.delay-allocation-millis | The amount of time, in milliseconds, to wait before assigning a container to the next level of locality. The three levels of locality, in ascending order, are NODE, RACK, and NON_LOCAL. | 250 |
tez.task.get-task.sleep.interval-ms.max | Determines the maximum amount of time, in milliseconds, a container agent waits before asking the Tez Application Master for another task. Tez runs an agent on a container in order to launch tasks remotely. A low value may overload the Application Master. | 200 |
tez.session.client.timeout.secs | Specifies the amount of time, in seconds, to wait for the Application Master to start when trying to submit a DAG from the client in session mode. | 180 |
tez.session.am.dag.submit.timeout.secs | Specifies the amount of time, in seconds, that the Tez Application Master waits for a DAG to be submitted before shutting down. This value is used when the Tez Application Master is running in session mode, which allows multiple DAGs to be submitted for execution. The idle time between DAG submissions should not exceed this value. | 300 |
tez.runtime.intermediate-output.should-compress | Specifies whether Tez should compress intermediate output. | false |
tez.runtime.intermediate-output.compress.codec | Specifies the codec to use when compressing intermediate output. This parameter is ignored unless tez.runtime.intermediate-output.should-compress is enabled. | org.apache.hadoop.io.compress.SnappyCodec |
tez.runtime.intermediate-input.is-compressed | Specifies whether intermediate input is compressed. | false |
tez.runtime.intermediate-input.compress.codec | Specifies the codec to use when reading compressed intermediate input. This parameter is ignored unless tez.runtime.intermediate-input.is-compressed is enabled. | org.apache.hadoop.io.compress.SnappyCodec |
tez.yarn.ats.enabled | Specifies whether Tez starts the TimelineClient for sending information to the Timeline Server. | false |
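To illustrate how several of these parameters fit together, here is a sketch of a tez-site.xml fragment that overrides a few of the values from the table. The memory sizes and compression choices are illustrative assumptions, not recommended values for any particular cluster:

```
<!-- Illustrative tez-site.xml fragment; values are example assumptions, not tuning advice. -->
<property>
  <name>tez.am.resource.memory.mb</name>
  <value>2048</value>
</property>
<property>
  <!-- Keep -Xmx roughly 512 MB below tez.am.resource.memory.mb, per the table above. -->
  <name>tez.am.java.opts</name>
  <value>-server -Xmx1536m -Djava.net.preferIPv4Stack=true -XX:+UseNUMA -XX:+UseParallelGC</value>
</property>
<property>
  <!-- Compress intermediate output with the default Snappy codec. -->
  <name>tez.runtime.intermediate-output.should-compress</name>
  <value>true</value>
</property>
<property>
  <name>tez.runtime.intermediate-input.is-compressed</name>
  <value>true</value>
</property>
```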
You can use the tez.queue.name property to specify which queue is used for Tez jobs. Currently, the Capacity Scheduler is the default scheduler in HDP; in general, however, this property is not limited to the Capacity Scheduler and applies to any YARN queue.

If no queues have been configured, the default queue is used, which means that 100% of the cluster capacity is available when running Tez jobs. If queues have been configured, a queue name must be configured for each YARN application.
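For context, the queues themselves are defined in the scheduler's configuration, not in tez-site.xml. The following is a minimal sketch of a capacity-scheduler.xml fragment that defines an "engineering" queue alongside the default queue; the queue name and the capacity split are assumptions chosen for illustration:

```
<!-- Hypothetical capacity-scheduler.xml fragment defining two queues. -->
<property>
  <name>yarn.scheduler.capacity.root.queues</name>
  <value>default,engineering</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.default.capacity</name>
  <value>60</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.engineering.capacity</name>
  <value>40</value>
</property>
```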
Setting tez.queue.name in tez-site.xml applies to all Tez applications that use that configuration file. To assign separate queues to different applications, you need separate tez-site.xml files, or the application can pass this configuration to Tez when submitting the Tez DAG.
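As a sketch of the first approach, a cluster-wide default queue for Tez jobs could be set in tez-site.xml as follows; the "engineering" queue name is simply the example used in the rest of this section, and the snippet mirrors the hive-site.xml example below:

```
<property>
  <name>tez.queue.name</name>
  <value>engineering</value>
</property>
```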
For example, in Hive you can use the tez.queue.name property in hive-site.xml to specify the queue to use for Hive-on-Tez jobs. To assign Hive-on-Tez jobs to the "engineering" queue, add the following property to hive-site.xml:
```
<property>
  <name>tez.queue.name</name>
  <value>engineering</value>
</property>
```
Setting this configuration property in hive-site.xml affects all Hive queries that read that configuration file.
To assign Hive-on-Tez jobs to the "engineering" queue for a single session or script, use the following command in the Hive shell or in a Hive script:

```
set tez.queue.name=engineering;
```
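Putting it together, a short Hive script might look like the following sketch. The table name is hypothetical, and the hive.execution.engine setting is included only on the assumption that the session is not already configured to run on Tez:

```
-- Hypothetical Hive script: route this session's Tez jobs to the engineering queue
set hive.execution.engine=tez;
set tez.queue.name=engineering;
SELECT COUNT(*) FROM web_logs;  -- web_logs is an example table name
```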