Perform the following steps to configure Tez for your Hadoop cluster:
Create a tez-site.xml configuration file and place it in the /etc/tez/conf configuration directory. A sample tez-site.xml file is included in the configuration_files/tez folder in the HDP companion files.
Create the $TEZ_CONF_DIR environment variable and set it to the location of the tez-site.xml file.
export TEZ_CONF_DIR=/etc/tez/conf
Create the $TEZ_JARS environment variable and set it to the location of the Tez .jar files and their dependencies.
export TEZ_JARS=/usr/hdp/current/tez-client/*:/usr/hdp/current/tez-client/lib/*
In the tez-site.xml file, configure the tez.lib.uris property with the HDFS path containing the Tez tarball file.
...
<property>
  <name>tez.lib.uris</name>
  <value>/hdp/apps/<hdp_version>/tez/tez.tar.gz</value>
</property>
...
Where <hdp_version> is the current HDP version, such as 2.2.0.0.
Add $TEZ_CONF_DIR and $TEZ_JARS to the $HADOOP_CLASSPATH environment variable.
export HADOOP_CLASSPATH=$TEZ_CONF_DIR:$TEZ_JARS:$HADOOP_CLASSPATH
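The environment setup in the steps above can be collected into a single script, for example a file sourced from a login profile. The following is a minimal sketch that assumes the default HDP paths shown in this section. The quoting, the handling of an empty $HADOOP_CLASSPATH, and the warning message are additions for illustration and are not part of the documented procedure.

```shell
#!/usr/bin/env bash
# Sketch: Tez client environment, assuming the default HDP paths from this section.

# Directory that holds tez-site.xml.
export TEZ_CONF_DIR=/etc/tez/conf

# Single quotes keep the '*' characters literal so that the shell never
# expands them; the value is passed on as a classpath wildcard.
export TEZ_JARS='/usr/hdp/current/tez-client/*:/usr/hdp/current/tez-client/lib/*'

# Prepend the Tez entries. The ${VAR:+...} form appends the previous value
# only when it is non-empty, which avoids a trailing ':' in the result.
export HADOOP_CLASSPATH="$TEZ_CONF_DIR:$TEZ_JARS${HADOOP_CLASSPATH:+:$HADOOP_CLASSPATH}"

# Warn (without failing) if the configuration file is not in place yet.
if [ ! -f "$TEZ_CONF_DIR/tez-site.xml" ]; then
  echo "warning: $TEZ_CONF_DIR/tez-site.xml not found" >&2
fi
```

Sourcing the file (rather than executing it) is what makes the variables visible in the current shell.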
Table 8.1. Tez Configuration Parameters
Configuration Parameter | Description | Default Value |
---|---|---|
tez.lib.uris | Comma-delimited list of the locations of the Tez libraries which will be localized for DAGs. Specifying a single .tar.gz or .tgz assumes that a compressed version of the Tez libraries is being used. This is uncompressed into a tezlibs directory when running containers, and tezlibs/;tezlibs/lib/ are added to the classpath (after . and .*). If multiple files are specified, files are localized as regular files and the contents of directories are localized as regular files (non-recursive). | /hdp/apps/<hdp_version>/tez/tez.tar.gz |
tez.use.cluster.hadoop-libs | Specifies whether Tez will use the cluster Hadoop libraries. This property should not be set in tez-site.xml, or if it is set, the value should be false. | false |
tez.cluster.additional.classpath.prefix | Specify additional classpath information to be used for Tez AM and all containers. This will be prepended to the classpath before all framework specific components have been specified. | /usr/hdp/${hdp.version}/hadoop/lib/hadoop-lzo-0.6.0.${hdp.version}.jar:/etc/hadoop/conf/secure |
tez.am.log.level | Root logging level passed to the Tez Application Master. | INFO |
tez.generate.debug.artifacts | Generates debug artifacts such as a text representation of the submitted DAG plan. | false |
tez.staging-dir | The staging directory used while submitting DAGs. | /tmp/${user.name}/staging |
tez.am.resource.memory.mb | The amount of memory to be used by the AppMaster. Used only if the value is not specified explicitly by the DAG definition. | TODO-CALCULATE-MEMORY-SETTINGS (placeholder for calculated value). Example value: 1536 |
tez.am.launch.cluster-default.cmd-opts | Cluster default Java options for the Tez AppMaster process. These will be prepended to the properties specified with tez.am.launch.cmd-opts. Note: this property should only be set by administrators -- it should not be used by non-administrative users. | -server -Djava.net.preferIPv4Stack=true -Dhdp.version=${hdp.version} |
tez.task.resource.memory.mb | The amount of memory to be used by launched tasks. Used only if the value is not specified explicitly by the DAG definition. | 1024 |
tez.task.launch.cluster-default.cmd-opts | Cluster default Java options for tasks. These will be prepended to the properties specified with tez.task.launch.cmd-opts. Note: this property should only be set by administrators -- it should not be used by non-administrative users. | -server -Djava.net.preferIPv4Stack=true -Dhdp.version=${hdp.version} |
tez.task.launch.cmd-opts | Java options for tasks. The Xmx value is derived based on tez.task.resource.memory.mb and is 80% of this value by default. Used only if the value is not specified explicitly by the DAG definition. | -XX:+PrintGCDetails -verbose:gc -XX:+PrintGCTimeStamps -XX:+UseNUMA -XX:+UseParallelGC |
tez.task.launch.env | Additional execution environment entries for Tez. This is not an additive property. You must preserve the original value if you want to have access to native libraries. Used only if the value is not specified explicitly by the DAG definition. | LD_LIBRARY_PATH=/usr/hdp/${hdp.version}/hadoop/lib/native:/usr/hdp/${hdp.version}/hadoop/lib/native/Linux-amd64-64 |
tez.am.grouping.max-size | Specifies the upper size limit of the primary input to each task when the Tez Application Master determines the parallelism of primary input reading tasks. This configuration property prevents input tasks from being too large, which prevents their parallelism from being too small. | 1073741824 |
tez.shuffle-vertex-manager.min-src-fraction | In case of a ScatterGather connection, the fraction of source tasks which should complete before tasks for the current vertex are scheduled. | 0.2 |
tez.shuffle-vertex-manager.max-src-fraction | In case of a ScatterGather connection, once this fraction of source tasks have completed, all tasks on the current vertex can be scheduled. Number of tasks ready for scheduling on the current vertex scales linearly between min-fraction and max-fraction. | 0.4 |
tez.am.am-rm.heartbeat.interval-ms.max | The maximum heartbeat interval between the AM and RM in milliseconds. | 250 |
tez.grouping.split-waves | The multiplier for available queue capacity when determining the number of tasks for a vertex. With the default value of 1.7 and 100% of the queue available, Tez generates a number of tasks roughly equal to 170% of the available containers on the queue. | 1.7 |
tez.grouping.min-size | Lower size limit (in bytes) of a grouped split, to avoid generating too many splits. | 16777216 |
tez.grouping.max-size | Upper size limit (in bytes) of a grouped split, to avoid generating an excessively large split. | 1073741824 |
tez.am.container.reuse.enabled | Configuration that specifies whether a container should be reused. | true |
tez.am.container.reuse.rack-fallback.enabled | Specifies whether to reuse containers for rack local tasks. Active only if reuse is enabled. | true |
tez.am.container.reuse.non-local-fallback.enabled | Specifies whether to reuse containers for non-local tasks. Active only if reuse is enabled. | false |
tez.am.container.idle.release-timeout-min.millis | The minimum amount of time to hold on to a container that is idle. Only active when reuse is enabled. | 10000 |
tez.am.container.idle.release-timeout-max.millis | The maximum amount of time to hold on to a container if no task can be assigned to it immediately. Only active when reuse is enabled. | 20000 |
tez.am.container.reuse.locality.delay-allocation-millis | The amount of time to wait before assigning a container to the next level of locality. NODE -> RACK -> NON_LOCAL | 250 |
tez.am.max.app.attempts | Specifies the total number of times the app master will run in case recovery is triggered. | 2 |
tez.am.maxtaskfailures.per.node | The maximum number of allowed task attempt failures on a node before it gets marked as blacklisted. | 10 |
tez.task.am.heartbeat.counter.interval-ms.max | Time interval at which task counters are sent to the AM. | 4000 |
tez.task.get-task.sleep.interval-ms.max | Maximum amount of time, in milliseconds, to wait before a task asks an AM for another task. | 200 |
tez.task.max-events-per-heartbeat | Maximum number of events to fetch from the AM by the tasks in a single heartbeat. | 500 |
tez.session.client.timeout.secs | Time (in seconds) to wait for AM to come up when trying to submit a DAG from the client. | -1 |
tez.session.am.dag.submit.timeout.secs | Time (in seconds) for which the Tez AM should wait for a DAG to be submitted before shutting down. | 300 |
tez.counters.max | The number of allowed counters for the executing DAG. | 2000 |
tez.counters.max.groups | The number of allowed counter groups for the executing DAG. | 1000 |
tez.runtime.compress | Specifies whether intermediate data should be compressed or not. | true |
tez.runtime.compress.codec | The codec to be used if compressing intermediate data. Only applicable if tez.runtime.compress is enabled. | org.apache.hadoop.io.compress.SnappyCodec |
tez.runtime.io.sort.mb | The size of the sort buffer when output is sorted. | 512 |
tez.runtime.unordered.output.buffer.size-mb | The size of the buffer when output is not sorted. | 100 |
tez.history.logging.service.class | The class to be used for logging history data. Set to org.apache.tez.dag.history.logging.ats.ATSHistoryLoggingService to log to ATS. Set to org.apache.tez.dag.history.logging.impl.SimpleHistoryLoggingService to log to the filesystem specified by ${fs.defaultFS}. | org.apache.tez.dag.history.logging.ats.ATSHistoryLoggingService |
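Several descriptions in the table imply simple arithmetic. The sketch below works through a few of them using the default values from the table. The container count and the completed-source fraction are hypothetical inputs, and the linear interpolation is only one reading of the "scales linearly" description, not a copy of the Tez scheduling logic.

```shell
#!/usr/bin/env bash
# Sketch: worked arithmetic for a few table rows, using the listed defaults.

# tez.task.launch.cmd-opts: Xmx defaults to 80% of tez.task.resource.memory.mb.
task_memory_mb=1024
task_xmx_mb=$(( task_memory_mb * 80 / 100 ))   # integer division
echo "Task heap (Xmx): ${task_xmx_mb} MB"

# tez.grouping.min-size and tez.grouping.max-size are given in bytes.
min_split_mib=$(( 16777216 / 1024 / 1024 ))
max_split_mib=$(( 1073741824 / 1024 / 1024 ))
echo "Grouped split size: ${min_split_mib} MiB to ${max_split_mib} MiB"

# tez.grouping.split-waves: target tasks = multiplier x available containers.
available_containers=100   # hypothetical queue capacity
split_waves=1.7
target_tasks=$(LC_ALL=C awk -v w="$split_waves" -v c="$available_containers" \
  'BEGIN { printf "%.0f", w * c }')
echo "Target tasks for ${available_containers} containers: ${target_tasks}"

# tez.shuffle-vertex-manager.{min,max}-src-fraction: share of the current
# vertex's tasks that may be scheduled, interpolated linearly between the two.
min_fraction=0.2
max_fraction=0.4
completed_fraction=0.3     # hypothetical share of finished source tasks
ready_fraction=$(LC_ALL=C awk -v f="$completed_fraction" \
  -v lo="$min_fraction" -v hi="$max_fraction" 'BEGIN {
    if (f <= lo)      r = 0
    else if (f >= hi) r = 1
    else              r = (f - lo) / (hi - lo)
    printf "%.2f", r
  }')
echo "Schedulable share of tasks: ${ready_fraction}"
```

With these inputs, the task heap comes out slightly below 820 MB because of integer division, and half of the current vertex's tasks are eligible once 30% of the source tasks have completed.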
Note: There are no additional steps required to secure Tez if your cluster is already configured for security.