3. Configure Tez

Perform the following steps to configure Tez for your Hadoop cluster:

  1. Create a tez-site.xml configuration file and place it in the /etc/tez/conf configuration directory. A sample tez-site.xml file is included in the configuration_files/tez folder in the HDP companion files.

  2. Create the $TEZ_CONF_DIR environment variable and set it to the location of the tez-site.xml file.

    export TEZ_CONF_DIR=/etc/tez/conf

  3. Create the $TEZ_JARS environment variable and set it to the location of the Tez .jar files and their dependencies.

    export TEZ_JARS=/usr/hdp/current/tez-client/*:/usr/hdp/current/tez-client/lib/*

  4. In the tez-site.xml file, configure the tez.lib.uris property with the HDFS path containing the Tez tarball file.

    ...
    <property>
         <name>tez.lib.uris</name>
         <value>/hdp/apps/<hdp_version>/tez/tez.tar.gz</value>
    </property>
    ...

    Where <hdp_version> is the current HDP version, such as 2.2.4.2.

  5. Add $TEZ_CONF_DIR and $TEZ_JARS to the $HADOOP_CLASSPATH environment variable.

    export HADOOP_CLASSPATH=$TEZ_CONF_DIR:$TEZ_JARS:$HADOOP_CLASSPATH
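
The environment steps above can be collected into one shell fragment, for example in a profile script. The following is a minimal sketch rather than anything shipped with HDP: it uses the paths shown in steps 2, 3, and 5, the guard that avoids a trailing colon when HADOOP_CLASSPATH starts out empty is an added assumption, and tez_lib_uri is a hypothetical helper that only prints the tez.lib.uris value from step 4 for a given version.

```shell
#!/bin/sh
# Sketch only: collects steps 2, 3, and 5 into one fragment.
# Paths are the ones used in this section; adjust them for your installation.

# Step 2: directory that holds tez-site.xml
export TEZ_CONF_DIR=/etc/tez/conf

# Step 3: Tez jars and their dependencies. The * is a Java classpath
# wildcard, so it is quoted to keep the shell from expanding it.
export TEZ_JARS='/usr/hdp/current/tez-client/*:/usr/hdp/current/tez-client/lib/*'

# Step 5: prepend both to HADOOP_CLASSPATH.
# ${VAR:+...} appends the old value only when it is set and non-empty,
# which avoids a trailing ":" (an empty classpath entry).
export HADOOP_CLASSPATH="$TEZ_CONF_DIR:$TEZ_JARS${HADOOP_CLASSPATH:+:$HADOOP_CLASSPATH}"

# Step 4 helper (hypothetical name): print the tez.lib.uris value
# for the HDP version passed as the first argument.
tez_lib_uri() {
    printf '/hdp/apps/%s/tez/tez.tar.gz\n' "$1"
}

echo "$HADOOP_CLASSPATH"
tez_lib_uri 2.2.4.2   # prints /hdp/apps/2.2.4.2/tez/tez.tar.gz
```

Prepending, as in step 5, means the Tez configuration directory and jars are searched before any entries already on the classpath.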
 

Table 8.1. Tez Configuration Parameters

Configuration Parameter

Description

Default Value

tez.lib.uris

Comma-delimited list of the locations of the Tez libraries which will be localized for DAGs. Specifying a single .tar.gz or .tgz assumes that a compressed version of the Tez libraries is being used. This is uncompressed into a tezlibs directory when running containers, and tezlibs/;tezlibs/lib/ are added to the classpath (after . and .*). If multiple files are specified, files are localized as regular files and the contents of directories are localized as regular files (non-recursive).

/hdp/apps/<hdp_version>/tez/tez.tar.gz

tez.use.cluster.hadoop-libs

Specifies whether Tez will use the cluster Hadoop libraries. This property should not be set in tez-site.xml, or if it is set, the value should be false.

false

tez.cluster.additional.classpath.prefix

Specify additional classpath information to be used for Tez AM and all containers. This will be prepended to the classpath before all framework specific components have been specified.

/usr/hdp/${hdp.version}/hadoop/lib/hadoop-lzo-0.6.0.${hdp.version}.jar:/etc/hadoop/conf/secure

tez.am.log.level

Root logging level passed to the Tez Application Master.

INFO

tez.generate.debug.artifacts

Generates debug artifacts such as a text representation of the submitted DAG plan.

false

tez.staging-dir

The staging directory used while submitting DAGs.

/tmp/${user.name}/staging

tez.am.resource.memory.mb

The amount of memory to be used by the AppMaster. Used only if the value is not specified explicitly by the DAG definition.

TODO-CALCULATE-MEMORY-SETTINGS (placeholder for calculated value). Example value: 1536

tez.am.launch.cluster-default.cmd-opts

Cluster default Java options for the Tez AppMaster process. These will be prepended to the properties specified with tez.am.launch.cmd-opts.

Note: this property should only be set by administrators -- it should not be used by non-administrative users.

-server -Djava.net.preferIPv4Stack=true -Dhdp.version=${hdp.version}

tez.task.resource.memory.mb

The amount of memory to be used by launched tasks. Used only if the value is not specified explicitly by the DAG definition.

1024

tez.task.launch.cluster-default.cmd-opts

Cluster default Java options for tasks. These will be prepended to the properties specified with tez.task.launch.cmd-opts.

Note: this property should only be set by administrators -- it should not be used by non-administrative users.

-server -Djava.net.preferIPv4Stack=true -Dhdp.version=${hdp.version}

tez.task.launch.cmd-opts

Java options for tasks. The Xmx value is derived based on tez.task.resource.memory.mb and is 80% of this value by default. Used only if the value is not specified explicitly by the DAG definition.

-XX:+PrintGCDetails -verbose:gc -XX:+PrintGCTimeStamps -XX:+UseNUMA -XX:+UseParallelGC

tez.task.launch.env

Additional execution environment entries for Tez. This is not an additive property. You must preserve the original value if you want to have access to native libraries. Used only if the value is not specified explicitly by the DAG definition.

LD_LIBRARY_PATH=/usr/hdp/${hdp.version}/hadoop/lib/native:/usr/hdp/${hdp.version}/hadoop/lib/native/Linux-amd64-64

tez.am.grouping.max-size

Specifies the upper size limit of the primary input to each task when the Tez Application Master determines the parallelism of primary input reading tasks. This configuration property prevents input tasks from being too large, which prevents their parallelism from being too small.

1073741824

tez.shuffle-vertex-manager.min-src-fraction

In case of a ScatterGather connection, the fraction of source tasks which should complete before tasks for the current vertex are scheduled.

0.2

tez.shuffle-vertex-manager.max-src-fraction

In case of a ScatterGather connection, once this fraction of source tasks have completed, all tasks on the current vertex can be scheduled. Number of tasks ready for scheduling on the current vertex scales linearly between min-fraction and max-fraction.

0.4

tez.am.am-rm.heartbeat.interval-ms.max

The maximum heartbeat interval between the AM and RM in milliseconds.

250

tez.grouping.split-waves

The multiplier for available queue capacity when determining the number of tasks for a Vertex. The default value of 1.7, with 100% of the queue available, implies generating a number of tasks roughly equal to 170% of the available containers on the queue.

1.7

tez.grouping.min-size

Lower size limit (in bytes) of a grouped split, to avoid generating too many splits.

16777216

tez.grouping.max-size

Upper size limit (in bytes) of a grouped split, to avoid generating an excessively large split.

1073741824

tez.am.container.reuse.enabled

Configuration that specifies whether a container should be reused.

true

tez.am.container.reuse.rack-fallback.enabled

Specifies whether to reuse containers for rack local tasks. Active only if reuse is enabled.

true

tez.am.container.reuse.non-local-fallback.enabled

Specifies whether to reuse containers for non-local tasks. Active only if reuse is enabled.

false

tez.am.container.idle.release-timeout-min.millis

The minimum amount of time to hold on to a container that is idle. Only active when reuse is enabled.

10000

tez.am.container.idle.release-timeout-max.millis

The maximum amount of time to hold on to a container if no task can be assigned to it immediately. Only active when reuse is enabled.

20000

tez.am.container.reuse.locality.delay-allocation-millis

The amount of time (in milliseconds) to wait before assigning a container to the next level of locality. The levels are tried in the order NODE -> RACK -> NON_LOCAL.

250

tez.am.max.app.attempts

Specifies the total number of times the app master will run in case recovery is triggered.

2

tez.am.maxtaskfailures.per.node

The maximum number of allowed task attempt failures on a node before it gets marked as blacklisted.

10

tez.task.am.heartbeat.counter.interval-ms.max

Time interval (in milliseconds) at which task counters are sent to the AM.

4000

tez.task.get-task.sleep.interval-ms.max

Maximum amount of time, in milliseconds, to wait before a task asks an AM for another task.

200

tez.task.max-events-per-heartbeat

Maximum number of events to fetch from the AM by the tasks in a single heartbeat.

500

tez.session.client.timeout.secs

Time (in seconds) to wait for AM to come up when trying to submit a DAG from the client.

-1

tez.session.am.dag.submit.timeout.secs

Time (in seconds) for which the Tez AM should wait for a DAG to be submitted before shutting down.

300

tez.counters.max

The number of allowed counters for the executing DAG.

2000

tez.counters.max.groups

The number of allowed counter groups for the executing DAG.

1000

tez.runtime.compress

Specifies whether intermediate data should be compressed or not.

true

tez.runtime.compress.codec

The codec to be used if compressing intermediate data. Only applicable if tez.runtime.compress is enabled.

org.apache.hadoop.io.compress.SnappyCodec

tez.runtime.io.sort.mb

The size of the sort buffer when output is sorted.

512

tez.runtime.unordered.output.buffer.size-mb

The size of the buffer when output is not sorted.

100

tez.history.logging.service.class

The class to be used for logging history data. Set to org.apache.tez.dag.history.logging.ats.ATSHistoryLoggingService to log to ATS. Set to org.apache.tez.dag.history.logging.impl.SimpleHistoryLoggingService to log to the filesystem specified by ${fs.defaultFS}.

org.apache.tez.dag.history.logging.ats.ATSHistoryLoggingService
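
Two entries in Table 8.1 are defined relative to other values: the Xmx setting derived from tez.task.resource.memory.mb, and the task count implied by tez.grouping.split-waves. As a rough illustration of the arithmetic only, the sketch below works through both. The helper names and the figure of 100 available containers are hypothetical, and the rounding Tez itself applies may differ.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical; this mirrors the arithmetic
# described in Table 8.1, not the Tez implementation.

# Heap size implied by the "80% of tez.task.resource.memory.mb" rule.
# Shell arithmetic is integer-only, so the result is truncated.
task_xmx_mb() {
    echo $(( $1 * 80 / 100 ))
}

# Approximate number of tasks for a vertex:
# available containers multiplied by tez.grouping.split-waves.
# awk handles the fractional multiplier; %.0f rounds to the nearest integer.
split_wave_tasks() {
    awk -v c="$1" -v w="$2" 'BEGIN { printf "%.0f\n", c * w }'
}

task_xmx_mb 1024           # default task memory from Table 8.1; prints 819
split_wave_tasks 100 1.7   # 100 containers (assumed), default waves; prints 170
```

With the default tez.task.resource.memory.mb of 1024, the derived heap is therefore about 819 MB, leaving the remainder of the container for non-heap memory.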


Note

There are no additional steps required to secure Tez if your cluster is already configured for security.
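
Returning to Table 8.1, the two tez.shuffle-vertex-manager fractions are easier to read with numbers. The sketch below is a plain linear interpolation between the two thresholds, written under the assumption that "scales linearly" means exactly that. The function name and the vertex size of 100 tasks are hypothetical, and this is not the Tez scheduler code.

```shell
#!/bin/sh
# Sketch only: linear interpolation between min-src-fraction and
# max-src-fraction, as described in Table 8.1.
# Usage: schedulable_tasks <completed_fraction> <min_fraction> <max_fraction> <total_tasks>
schedulable_tasks() {
    awk -v f="$1" -v lo="$2" -v hi="$3" -v n="$4" 'BEGIN {
        if (f < lo)       frac = 0                      # below min: nothing scheduled yet
        else if (f >= hi) frac = 1                      # at or above max: all tasks
        else              frac = (f - lo) / (hi - lo)   # linear in between
        printf "%d\n", int(frac * n + 1e-9)             # truncate, guarding against float noise
    }'
}

# Defaults from Table 8.1 (0.2 and 0.4), with a hypothetical 100-task vertex.
schedulable_tasks 0.1 0.2 0.4 100   # prints 0
schedulable_tasks 0.3 0.2 0.4 100   # prints 50
schedulable_tasks 0.4 0.2 0.4 100   # prints 100
```

Raising the minimum fraction delays the start of downstream tasks, while narrowing the gap between the two fractions makes the ramp from none to all tasks steeper.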

