Non-Ambari Cluster Installation Guide

Installing Spark

When you install Spark, the following directories will be created:

  • /usr/hdp/current/spark-client for submitting Spark jobs

  • /usr/hdp/current/spark-history for launching Spark history processes, such as the Spark History Server

  • /usr/hdp/current/spark-thriftserver for the Spark Thrift Server
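
Once the installation steps below are complete, you can verify that these links exist. A minimal check, assuming the standard /usr/hdp layout listed above (each link appears only if the corresponding package was installed):

```shell
# list the Spark component links created under /usr/hdp/current
ls -ld /usr/hdp/current/spark-client \
       /usr/hdp/current/spark-history \
       /usr/hdp/current/spark-thriftserver
```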

To install Spark:

  1. Search for Spark in the HDP repo:

    • For RHEL or CentOS:

      yum search spark

    • For SLES:

      zypper search spark

    • For Ubuntu and Debian:

      apt-cache search spark

    This will show all the versions of Spark available. For example:

    spark_2_3_6_0_$BUILD-master.noarch : Server for Spark master
    spark_2_3_6_0_$BUILD-python.noarch : Python client for Spark
    spark_2_3_6_0_$BUILD-worker.noarch : Server for Spark worker
    spark_2_3_6_0_$BUILD.noarch : Lightning-Fast Cluster Computing
  2. Install the version corresponding to the HDP version you currently have installed.

    • For RHEL or CentOS:

      yum install spark_<version>-master spark_<version>-python

    • For SLES:

      zypper install spark_<version>-master spark_<version>-python

    • For Ubuntu and Debian:

      apt-get install spark_<version>-master spark_<version>-python
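
    On any platform, you can confirm that the install succeeded by asking the client scripts for their version. A quick sanity check, assuming the client link from the directory list above:

```shell
# print the installed Spark version via the client scripts
/usr/hdp/current/spark-client/bin/spark-submit --version
```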

  3. Before you launch the Spark Shell or the Spark Thrift Server, make sure that the JAVA_HOME environment variable is set:

    export JAVA_HOME=<path to JDK 1.8>
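
    For example, on a node where the JDK is installed under /usr/jdk64 (an example path; substitute the location of your own JDK 1.8 install), you might run:

```shell
# example only: adjust the path to your JDK 1.8 installation
export JAVA_HOME=/usr/jdk64/jdk1.8.0_112

# launch the Spark Shell from the client install
/usr/hdp/current/spark-client/bin/spark-shell
```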