Command Line Installation

Configuring Tez

Perform the following steps to configure Tez for your Hadoop cluster:

  1. Create a tez-site.xml configuration file and place it in the /etc/tez/conf configuration directory. A sample tez-site.xml file is included in the configuration_files/tez folder in the HDP companion files.

  2. In the tez-site.xml file, configure the tez.lib.uris property with the HDFS path containing the Tez tarball file.

    ...
    <property>
         <name>tez.lib.uris</name>
         <value>/hdp/apps/<hdp_version>/tez/tez.tar.gz</value>
    </property>
    ...

    Where <hdp_version> is the current HDP version, such as 2.5.6.0.

Table 8.1. Tez Configuration Parameters

Configuration Parameter

Description

Default Value

tez.am.acls.enabled

Enables or disables access control list checks on Application Master (AM) and history data.

true

tez.am.am-rm.heartbeat.interval-ms.max

The maximum heartbeat interval between the AM and RM in milliseconds.

250

tez.am.client.am.port-range

Range of ports that the AM can use when binding for client connections. Leave this blank to use all possible ports.

No default setting. The format is a number range. For example, 10000-19999

tez.am.container.idle.release-timeout-max.millis

The maximum amount of time to hold on to a container if no task can be assigned to it immediately. Only active when reuse is enabled.

20000

tez.am.container.idle.release-timeout-min.millis

The minimum amount of time to hold on to a container that is idle. Only active when reuse is enabled.

10000

tez.am.container.reuse.enabled

Configuration that specifies whether a container should be reused.

true

tez.am.container.reuse.locality.delay-allocation-millis

The amount of time to wait before assigning a container to the next level of locality. NODE -> RACK -> NON_LOCAL

250

tez.am.container.reuse.non-local-fallback.enabled

Specifies whether to reuse containers for non-local tasks. Active only if reuse is enabled.

false

tez.am.container.reuse.rack-fallback.enabled

Specifies whether to reuse containers for rack local tasks. Active only if reuse is enabled.

true

tez.am.launch.cluster-default.cmd-opts

Note: This property should only be set by administrators -- it should not be used by non-administrative users.

Cluster default Java options for the Tez AppMaster process. These are prepended to the properties specified with tez.am.launch.cmd-opts.

-server -Djava.net.preferIPv4Stack=true -Dhdp.version=${hdp.version}

tez.am.launch.cmd-opts

Command line options that are provided during the launch of the Tez AppMaster process. Do not set any Xmx or Xms in these launch options so that Tez can determine them automatically.

-XX:+PrintGCDetails -verbose:gc -XX:+PrintGCTimeStamps -XX:+UseNUMA -XX:+UseParallelGC

tez.am.launch.env

Environment settings for the Tez AppMaster process.

LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$HADOOP_COMMON_HOME/lib/native/

tez.am.log.level

Root logging level passed to the Tez Application Master.

Simple configuration: Set the log level for all loggers. For example, set to INFO. This sets the log level to INFO for all loggers.

Advanced configuration: Set the log level for all classes, along with a different level for some classes. For example, set to DEBUG;org.apache.hadoop.ipc=INFO;org.apache.hadoop.security=INFO

This sets the log level for all loggers to DEBUG, except for org.apache.hadoop.ipc and org.apache.hadoop.security, which are set to INFO.

Note: The global log level must always be the first parameter. For example:

DEBUG;org.apache.hadoop.ipc=INFO;org.apache.hadoop.security=INFO is valid.

org.apache.hadoop.ipc=INFO;org.apache.hadoop.security=INFO is not valid.

INFO

tez.am.max.app.attempts

Specifies the total number of times that the app master is re-run in case recovery is triggered.

2

tez.am.maxtaskfailures.per.node

The maximum number of allowed task attempt failures on a node before it gets marked as blacklisted.

10

tez.am.modify-acls

Enables specified users or groups to modify operations on the AM such as submitting DAGs, pre-warming the session, killing DAGs, or shutting down the session.

Format: comma-separated list of users, followed by a whitespace, and then a comma-separated list of groups. For example, "lhale,msmith administrators,users"

No default setting

tez.am.resource.cpu.vcores

The number of virtual cores to be used by the AppMaster process. Set this to > 1 if the RM Scheduler is configured to support virtual cores.

1

tez.am.resource.memory.mb

The amount of memory to be used by the AppMaster. Used only if the value is not specified explicitly by the DAG definition.

1536

tez.am.session.min.held-containers

The minimum number of containers that are held in session mode. Not active in non-session mode. Enables an idle session that is not running a DAG to hold on to a minimum number of containers to provide fast response times for the next DAG.

0

tez.am.task.max.failed.attempts

The maximum number of attempts that can fail for a particular task before the task fails. This does not count killed attempts. A task failure results in a DAG failure. Must be an integer.

4

tez.am.view-acls

AM view ACLs. This setting enables the specified users or groups to view the status of the AM and all DAGs that run within the AM. Format: a comma-separated list of users, a whitespace, and then a comma-separated list of groups. For example, "lhale,msmith administrators,users"

No default value

tez.cluster.additional.classpath.prefix

Specify additional classpath information to be used for Tez AM and all containers. This is prepended to the classpath before all framework specific components have been specified.

/usr/hdp/${hdp.version}/hadoop/lib/hadoop-lzo-0.6.0.${hdp.version}.jar:/etc/hadoop/conf/secure

tez.container.max.java.heap.fraction

A double value. Tez automatically determines the Xmx for the Java virtual machines that are used to run Tez tasks and Application Masters. This is enabled if the Xmx or Xms values have not been specified in the launch command options. Automatic Xmx calculation is preferred because Tez can determine the best value based on the actual allocation of memory to tasks in the cluster. The value should be greater than 0 and less than 1.

0.8

tez.counters.max

The number of allowed counters for the executing DAG.

2000

tez.counters.max.groups

The number of allowed counter groups for the executing DAG.

1000

tez.generate.debug.artifacts

Generates debug artifacts such as a text representation of the submitted DAG plan.

false

tez.grouping.max-size

Upper size limit (in bytes) of a grouped split, to avoid generating an excessively large split. Replaces tez.am.grouping.max-size

1073741824 (1 GB)

tez.grouping.min-size

Lower size limit (in bytes) of a grouped split, to avoid generating too many splits.

52428800 (50 MB)

tez.grouping.split-waves

The multiplier for available queue capacity when determining the number of tasks for a vertex. With the default value of 1.7 and 100% of the queue available, Tez generates a number of tasks roughly equal to 170% of the available containers on the queue.

1.7

tez.history.logging.service.class

The class to be used for logging history data. Set to org.apache.tez.dag.history.logging.ats.ATSHistoryLoggingService to log to ATS. Set to org.apache.tez.dag.history.logging.impl.SimpleHistoryLoggingService to log to the filesystem specified by ${fs.defaultFS}.

org.apache.tez.dag.history.logging.ats.ATSHistoryLoggingService

tez.lib.uris

Comma-delimited list of locations of the Tez libraries, which are localized for DAGs. Specifying a single .tar.gz or .tgz file assumes that a compressed version of the Tez libraries is being used. The archive is uncompressed into a tezlibs directory when running containers, and tezlibs/;tezlibs/lib/ are added to the classpath (after . and .*). If multiple files are specified, files are localized as regular files, and the contents of directories are localized as regular files (non-recursive).

There is no default value, but it should be set to: /hdp/apps/${hdp.version}/tez/tez.tar.gz, or the location of the tez tarball on HDFS, or the appropriate distributed filesystem path.

tez.queue.name

This property should not be set in tez-site.xml. Instead, it can be provided on the command line when you are launching a job to determine which YARN queue to submit a job to.

No default setting

tez.runtime.compress

Specifies whether intermediate data should be compressed or not.

true

tez.runtime.compress.codec

The codec to be used if compressing intermediate data. Only applicable if tez.runtime.compress is enabled.

org.apache.hadoop.io.compress.SnappyCodec

tez.runtime.io.sort.factor

The number of streams to merge at once while sorting files. This determines the number of open file handles.

10

tez.runtime.io.sort.mb

The size of the sort buffer when output is sorted.

512

tez.runtime.sorter.class

Which sorter implementation to use. Valid values:

  • LEGACY

  • PIPELINED

The legacy sorter implementation is based on the Hadoop MapReduce shuffle implementation. It is restricted to 2GB memory limits.

The pipelined sorter is a more efficient sorter that supports sort buffers larger than 2 GB.

PIPELINED

tez.runtime.sort.spill.percent

The soft limit in the serialization buffer. Once this limit is reached, a thread begins to spill the contents to disk in the background.

Note: Collection is not blocked if this threshold is exceeded while a spill is already in progress, so spills can be larger than this threshold when it is set to less than 0.5.

0.8

tez.runtime.unordered.output.buffer.size-mb

The size of the buffer when output is not sorted.

100

tez.session.am.dag.submit.timeout.secs

Time (in seconds) for which the Tez AM should wait for a DAG to be submitted before shutting down.

300

tez.session.client.timeout.secs

Time (in seconds) to wait for AM to come up when trying to submit a DAG from the client.

-1

tez.shuffle-vertex-manager.max-src-fraction

In case of a ScatterGather connection, once this fraction of source tasks have completed, all tasks on the current vertex can be scheduled. Number of tasks ready for scheduling on the current vertex scales linearly between min-fraction and max-fraction.

0.4

tez.shuffle-vertex-manager.min-src-fraction

In case of a ScatterGather connection, the fraction of source tasks which should complete before tasks for the current vertex are scheduled.

0.2

tez.staging-dir

The staging directory used while submitting DAGs.

/tmp/${user.name}/staging

tez.task.am.heartbeat.counter.interval-ms.max

Time interval at which task counters are sent to the AM.

4000

tez.task.generate.counters.per.io

Sets whether to generate counters per IO or not. Enabling this renames CounterGroups/CounterNames, making them unique per vertex edge instead of unique per vertex.

true

tez.task.get-task.sleep.interval-ms.max

Maximum amount of time, in milliseconds, to wait before a task asks an AM for another task.

200

tez.task.launch.cluster-default.cmd-opts

Note: This property should only be set by administrators -- it should not be used by non-administrative users.

Cluster default Java options for tasks. These are prepended to the properties specified with tez.task.launch.cmd-opts

-server -Djava.net.preferIPv4Stack=true -Dhdp.version=${hdp.version}

tez.task.launch.cmd-opts

Java options for tasks. The Xmx value is derived based on tez.task.resource.memory.mb and is 80% of this value by default. Used only if the value is not specified explicitly by the DAG definition.

-XX:+PrintGCDetails -verbose:gc -XX:+PrintGCTimeStamps -XX:+UseNUMA -XX:+UseParallelGC

tez.task.launch.env

Additional execution environment entries for Tez. This is not an additive property. You must preserve the original value if you want to have access to native libraries. Used only if the value is not specified explicitly by the DAG definition.

LD_LIBRARY_PATH=/usr/hdp/${hdp.version}/hadoop/lib/native:/usr/hdp/${hdp.version}/hadoop/lib/native/Linux-amd64-64/

tez.task.log.level

Root logging level that is passed to the Tez tasks.

Simple configuration: Set the log level for all loggers. For example, set to INFO. This sets the log level to INFO for all loggers.

Advanced configuration: Set the log level for all classes, along with a different level for some classes. For example, set to DEBUG;org.apache.hadoop.ipc=INFO;org.apache.hadoop.security=INFO

This sets the log level for all loggers to DEBUG, except for org.apache.hadoop.ipc and org.apache.hadoop.security, which are set to INFO.

Note: The global log level must always be the first parameter. For example:

DEBUG;org.apache.hadoop.ipc=INFO;org.apache.hadoop.security=INFO is valid.

org.apache.hadoop.ipc=INFO;org.apache.hadoop.security=INFO is not valid.

INFO

tez.task.max-events-per-heartbeat

Maximum number of events to fetch from the AM by the tasks in a single heartbeat.

500

tez.task.resource.cpu.vcores

The number of virtual cores to be used by the Tez tasks. Set this to > 1 if RM Scheduler is configured to support virtual cores.

1

tez.task.resource.memory.mb

The amount of memory to be used by launched tasks. Used only if the value is not specified explicitly by the DAG definition.

1024

tez.use.cluster.hadoop-libs

Specifies whether Tez uses the cluster Hadoop libraries. This property should not be set in tez-site.xml, or if it is set, the value should be false.

false
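
As an illustration of how several of the parameters in the table above fit together, the following is a minimal sketch of a tez-site.xml file. It assumes the standard Hadoop <configuration>/<property> layout. The user and group names are the illustrative ones used in the table, and every value is either a default or an example taken from the table, not a tuning recommendation.

    <configuration>
      <!-- Location of the Tez tarball on HDFS (recommended value from the table) -->
      <property>
        <name>tez.lib.uris</name>
        <value>/hdp/apps/${hdp.version}/tez/tez.tar.gz</value>
      </property>

      <!-- Advanced log level format: the global level comes first, then per-class overrides -->
      <property>
        <name>tez.am.log.level</name>
        <value>DEBUG;org.apache.hadoop.ipc=INFO;org.apache.hadoop.security=INFO</value>
      </property>

      <!-- Simple log level format: one level for all task loggers -->
      <property>
        <name>tez.task.log.level</name>
        <value>INFO</value>
      </property>

      <!-- ACL format: comma-separated users, one whitespace, comma-separated groups -->
      <!-- lhale, msmith, administrators, and users are example names only -->
      <property>
        <name>tez.am.view-acls</name>
        <value>lhale,msmith administrators,users</value>
      </property>

      <!-- Container reuse with its default idle release window, in milliseconds -->
      <property>
        <name>tez.am.container.reuse.enabled</name>
        <value>true</value>
      </property>
      <property>
        <name>tez.am.container.idle.release-timeout-min.millis</name>
        <value>10000</value>
      </property>
      <property>
        <name>tez.am.container.idle.release-timeout-max.millis</name>
        <value>20000</value>
      </property>
    </configuration>

The sketch leaves out tez.queue.name and tez.use.cluster.hadoop-libs because the table states that these properties should not be set in tez-site.xml.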


Note

There are no additional steps required to secure Tez if your cluster is already configured for security.

To monitor the progress of a Tez job or to analyze the history of a Tez job, set up the Tez View in Ambari. For information about setting up the Tez view, see Configuring Your Cluster for Tez View in the HDP Ambari Views Guide.