Chapter 2. Planning for a DPS Installation
Before installing DPS, you must consider and prepare various aspects of your HDP environment, such as the operating system version, cluster security, and the node configuration requirements for DPS Platform and its associated services.
Review the following information before starting your installation to ensure your environment is properly configured for a successful DPS installation.
Support Matrix information
You can find the most current information about interoperability for this release on the Support Matrix. The Support Matrix tool provides information about:
HDP and Ambari
Operating Systems
Databases
Browsers
JDKs
To access the tool, go to: https://supportmatrix.hortonworks.com.
Requirements for DataPlane Service Platform Host
Hortonworks DataPlane Service (DPS™) is composed of a platform (DPS Platform) and the services that plug into the platform (DLM, DSS, etc.), which are all installed on the same host node. DPS also includes engines and agents that are installed on the clusters used with DPS.
You should install the DPS Platform on a host remote to the cluster. The DPS Platform host must meet the requirements identified in the following sections.
All clusters registered with DPS must be managed by Apache Ambari.
DPS Support Matrix Information
See the Requirements for DataPlane Service Platform Host and the Hortonworks Support Matrix for details regarding supported operating systems, databases, software, etc.
Required Docker Versions
Docker 1.12.x
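As a quick check before proceeding, you can confirm the Docker version installed on the intended DPS Platform host (a minimal sketch; output varies by distribution):

```bash
# Confirm Docker is installed and report its version on the DPS Platform host.
docker --version

# Confirm the Docker daemon is running.
systemctl status docker --no-pager
```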
Other Software Requirements
On each DPS Platform host, ensure that the following software is available:
yum
rpm
tar
wget
bash shell
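You can verify that these utilities are present on the host with a quick package query (a minimal sketch; install any missing package with your platform's package manager):

```bash
# Check that the required utilities are installed on the DPS Platform host.
rpm -q yum rpm tar wget bash

# A missing package can typically be added with, for example:
#   yum install -y wget
```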
Processing and Memory Requirements
The DPS Platform host requires the following:
Multicore processor, with minimum 8 cores
Minimum 16 GB RAM
See the HDP and Ambari Support Matrices for requirements.
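You can confirm that a candidate host meets these minimums with standard operating system commands, for example:

```bash
# Report the number of available processor cores (minimum 8 for DPS Platform).
nproc

# Report total memory in gigabytes (minimum 16 GB for DPS Platform).
free -g
```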
Port and Network Requirements
Have the following ports available and open:
Port Number | Purpose | Required to be open? |
---|---|---|
80 | Where DPS Platform runs. | Yes |
443 | For SSL-based communication. | Yes |
8443 | Where the Apache Knox instance for login runs. | Yes |
8500 | For debugging using Consul. This port must be available, but it is optional to have it open. | No |
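The following is a minimal sketch of opening the required ports and checking for conflicts, assuming a RHEL/CentOS host running firewalld; adapt the commands to the firewall tooling used in your environment:

```bash
# Open the required DPS Platform ports (run as root).
firewall-cmd --permanent --add-port=80/tcp
firewall-cmd --permanent --add-port=443/tcp
firewall-cmd --permanent --add-port=8443/tcp
firewall-cmd --reload

# Confirm that no other process is already listening on the required ports.
ss -ltn | grep -E ':(80|443|8443|8500)\b'
```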
It is recommended that you use a DNS server to resolve host names. If you resolve host names from an /etc/hosts file, you must add the names to the hosts files of each DPS container. Follow the instructions in Add Host Entries to /etc/hosts Files in DPS Administration.
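For reference, host entries use the standard /etc/hosts format shown below; the IP addresses and host names are placeholders for illustration only, and the container-specific procedure is covered in Add Host Entries to /etc/hosts Files:

```bash
# Illustrative /etc/hosts entries (placeholder addresses and host names).
172.16.10.11   ambari1.example.com    ambari1
172.16.10.12   namenode1.example.com  namenode1
```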
LDAP and AD Support Requirements
To use LDAP and AD, you must use the same LDAP and AD instance across all HDP clusters managed by DataPlane, as well as for DataPlane Service itself.
HDP 2.6.3 Apache Component Requirements
The following additional Apache components are required for DPS Platform support:
Component | Purpose | Comments |
---|---|---|
Knox | User authentication with LDAP (SSO) | Knox must be enabled on clusters before you can register the clusters with DPS. |
Ambari | Cluster registration in DPS | All clusters used with DPS must be using Ambari. |
SmartSense Requirements
A SmartSense ID is required to install DPS Services (DLM and DSS).
You can retrieve the SmartSense ID from the Hortonworks Support Portal, under the Tools tab.
Additional DPS Requirements and Recommendations
Understanding the requirements and recommendations indicated below can help to avoid common issues during and after DPS installation.
Prior to starting installation, you must have downloaded the required tarballs and MPacks from the customer portal, following the instructions provided as part of the product procurement process.
You need to have root access to the nodes on which all DPS services will be installed.
If you are using AWS, do not use the public DNS to access DPS.
Use a public IP address or set up and use a DNS (Route 53) fully qualified domain name (FQDN).
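One way to confirm that name resolution is set up correctly is to check the FQDN of the DPS host and resolve it through DNS, for example (the host name shown is a placeholder):

```bash
# Report the fully qualified domain name of the DPS Platform host.
hostname -f

# Verify that the FQDN resolves through DNS (placeholder name).
nslookup dps.example.com
```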
Every host name used with DPS must be resolvable by DNS or configured in the /etc/hosts file on the DPS container, so that host names can be resolved between all cluster nodes. Using a DNS server is the recommended method, unless you are using Amazon Web Services (AWS). If the instances are added to /etc/hosts, you must explicitly register the cluster host names within the DPS Docker containers; it is not sufficient to have the host names included in the /etc/hosts file on the DPS Platform host. See the DPS Platform Administration guide for instructions.
If you are not using the LDAP server packaged with DPS, you need the corporate LDAP settings to configure LDAP.
Ensure you have the correct settings if using your own LDAP, as most of the settings cannot be changed in DPS after they are set.
Use the default Knox user.
If you choose a server to host Knox that is not the one the Ambari Server defaulted to, proxyuser rules will change and you will be prompted for a restart.
When enabling DPS Platform and DLM, and installing Knox, follow the automated Ambari placement recommendations to avoid requiring a restart.
Important: Do not edit the cluster name from Ambari after registering the cluster with DPS Platform.
DPS Service Requirements for HDP Clusters
DPS Support Matrix Information
See the Requirements for DataPlane Service Platform Host and the Hortonworks Support Matrix for details regarding supported HDP configurations.
All clusters used with DPS must be managed by Ambari.
Configuring Cluster Security for DPS Services
Following are lists of the minimum required actions that you must perform on each HDP cluster as part of configuring security for DPS and onboarding clusters for each of the DPS services. You can perform any additional security-related tasks as appropriate for your environment and company policies.
Table 2.1. Minimum Security Requirements Checklist for DPS
Task | Instructions | Found in... | Comments |
---|---|---|---|
Enable Knox in Ambari | Install Knox | Apache Knox Gateway User's Guide | Services required in the Knox topology for DPS are Ambari, AmbariUI, JobTracker, NameNode, Ranger, RangerUI, and ResourceManager |
Enable Ranger in Ambari | Installing Ranger Using Ambari | HDP Security guide | |
Configure a reverse proxy with Knox | Configuring the Knox Gateway | HDP Security guide | The Knox Gateway is not required, but is recommended |
Configure SSO topology | Form-based Identity Provider (IdP) | HDP Security guide | |
Configure LDAP with Ambari | Configuring Ambari Authentication with LDAP or Active Directory Authentication | HDP Security guide | |
Synchronize required LDAP users and groups with Ambari | Synchronizing LDAP Users and Groups | HDP Security guide | You must disable LDAP pagination; users registering clusters in DPS must have the Admin role in Ambari |
Configure LDAP with Ranger | Configuring Ranger Authentication with UNIX, LDAP, or AD | HDP Security guide | Required for DSS and if using Ranger with DLM |
Configure LDAP with Knox for proxy authentication | Setting Up LDAP Authentication | HDP Security guide | |
Configure Knox for HA | Setting Up Knox Services for HA | HDP Security guide | Required only if clusters are configured for HA |
Configure Knox SSO for Ambari | Setting up Knox SSO for Ambari | HDP Security guide | If done on an existing cluster, at login you will see a Knox page and must log in with your LDAP credentials |
If you are performing Hive replication with the Data Lifecycle Manager (DLM) service, ensure that the following tasks were completed during cluster installation. You must configure Apache Ranger on clusters used in replicating Hive databases.
Table 2.2. Minimum Security Requirements Checklist for DLM
Task | Instructions | Found in... | Comments |
---|---|---|---|
Configure LDAP with Ranger | Configuring Ranger Authentication with UNIX, LDAP, or AD | HDP Security guide | Required if using Ranger with DLM |
Configure user synchronization for policy administration | Configure Ranger User Sync | HDP Security guide | Required only if using Ranger |
Configure Ranger plugin for HDFS | Enabling Ranger Plugins: HDFS | HDP Security guide | Required only if using Ranger |
Configure Ranger plugin for Hive | Enabling Ranger Plugins: Hive | HDP Security guide | Required only if using Ranger |
Configure Ranger plugin for Knox | Enabling Ranger Plugins: Knox | HDP Security guide | Required only if using Ranger |
Configure Ranger HDFS plugin for Kerberos | Ranger Plugins--Kerberos: HDFS | HDP Security guide | Required only if using Ranger |
Configure Ranger Hive plugin for Kerberos | Ranger Plugins--Kerberos: Hive | HDP Security guide | Required only if using Ranger |
Configure Ranger Knox plugin for Kerberos | Ranger Plugins--Kerberos: Knox | HDP Security guide | Required only if using Ranger |
Configure Knox SSO for Ranger | Setting up Knox SSO for Ranger | HDP Security guide |
If you are using the Data Steward Studio (DSS) service, ensure that the following tasks were completed during cluster installation. You must configure Apache Atlas and Apache Knox SSO before you can use DSS.
Table 2.3. Minimum Security Requirements Checklist for DSS
Task | Instructions | Found in... | Comments |
---|---|---|---|
Enable Atlas in Ambari | Installing and Configuring Apache Atlas Using Ambari | HDP Data Governance guide | |
Configure LDAP with Atlas | Customize Services | HDP Data Governance guide | Adapt the instructions for Ranger |
Configure Ranger plugin for Atlas | Enabling Ranger Plugins: Atlas | HDP Security guide | |
Configure Knox SSO for Atlas | Setting up Knox SSO for Atlas | HDP Security guide | |
Configure Knox SSO for Ranger | Setting up Knox SSO for Ranger | HDP Security guide |
Data Lifecycle Manager (DLM) Installation Requirements and Recommendations
The clusters on which you install the Data Lifecycle Manager (DLM) Engine must meet the requirements identified in the following sections. After the DLM Engine is installed and properly configured on a cluster, the cluster can be used for DLM replication.
Important: Clusters used as source and destination in a DLM replication relationship must have exactly the same configurations for LDAP, Kerberos, Ranger, Knox, HA, etc.
DLM Support Matrix Information
See the Requirements for Clusters Used With Data Lifecycle Manager Engine and the Hortonworks Support Matrix for details regarding supported operating systems, databases, software, etc.
Port and Network Requirements
Have the following ports available and open:
Default Port Number | Purpose | Comments | Required to be open? |
---|---|---|---|
25968 | Port for DLM Engine (Beacon) service on hosts. | Accessibility is required from all clusters. “Beacon” is the internal name for the DLM Engine. You will see the name Beacon in some paths, commands, etc. | Yes |
8020 | NameNode host | | Yes |
50010 | All DataNode hosts | | Yes |
8080 | Ambari server host | | Yes |
10000 | HiveServer2 host | Binary mode port (Thrift) | Yes |
10001 | HiveServer2 host | HTTP mode port | Yes |
2181 | ZooKeeper hosts | | Yes |
6080 | Ranger port | | Yes |
8443 | Knox port | | Yes |
8050 | YARN port | | Yes |
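After the DLM Engine is installed, a simple reachability check from a host in the peer cluster can confirm that the Beacon port is open; the host name below is a placeholder:

```bash
# From a host in the other cluster, confirm the DLM Engine (Beacon) port is reachable.
nc -vz dlm-engine-host.example.com 25968
```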
HDP 2.6.3 Apache Component Requirements
The following additional Apache components are required for DLM support:
Component | Purpose | Comments |
---|---|---|
Hive 1 | For replicating Hive database content | Hive 2 queries are supported, but for replication, HiveServer 2 with Hive 1 is always used. |
HDFS | For replicating HDFS data. | |
Knox | Authentication federation from DPS | Knox must be enabled on clusters before you can register the clusters with DPS. |
Ranger | Authorization on clusters during replication | Ranger is optional for HDFS replication, but required for Hive replication. |
Additional DLM Requirements and Recommendations
Understanding the requirements and recommendations indicated below can help to avoid common issues during and after installation of the DLM service.
Apache Hive should be installed during initial installation, unless you are certain you will not use Hive replication in the future.
If you decide to install Hive after creating HDFS replication policies in Data Lifecycle Manager, all HDFS replication policies must be deleted and then recreated after adding Hive.
Clusters used in DLM replication must have symmetrical configurations.
That is, each cluster in a replication relationship must be configured exactly the same for Kerberos, LDAP, High Availability (HA), Apache Ranger, and so forth.
Data Steward Studio (DSS) Installation Requirements and Recommendations
The clusters on which you install the DSS Profiler Agent must meet the requirements identified in the following sections. After the Profiler Agent is installed and properly configured on a cluster, the cluster can be used by DSS.
Data Steward Studio (DSS) is provided as Evaluation Software with Hortonworks DPS 1.0. Evaluation Software is provided without charge and pursuant to the DataPlane Service Terms of Use. Evaluation Software may only be used for internal business, evaluation, and non-production purposes. Feedback on Evaluation Software is welcomed and may be submitted through your regular support channels.
DSS Support Matrix Information
See the Requirements for Data Steward Studio Profiler and the Hortonworks Support Matrix for details regarding supported operating systems, databases, software, etc.
Other Software Requirements
DSS has no additional software requirements.
Port and Network Requirements
Have the following ports available and open:
Port Number | Purpose | Required? |
---|---|---|
21900 | Profiler Web service runs on this port | This is required for DataPlane to access profiled data from the profiler datastore. |
8999 | Livy runs on this port | Livy is the observer for profilers and is required for submitting profiler jobs. |
21000 | Atlas | Required if you are installing in a different DMZ. |
6080 | Ranger | Required if you are installing in a different DMZ. |
8443 | Knox | Required if you are installing in a different DMZ. |
8080 | Ambari | Yes |
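As a basic reachability check once the cluster is configured, you can confirm that the Livy server used by the profilers responds on its default port; the host name is a placeholder and this does not validate the full profiler setup:

```bash
# Query the Livy REST endpoint used to submit profiler jobs.
curl -s http://livy-host.example.com:8999/sessions
```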
HDP 2.6.3 Apache Component Requirements
The following additional Apache components are required for DSS support:
Component | Purpose | Comments |
---|---|---|
Atlas | For Hive Metadata availability and storage of univariate statistics | |
Ranger | For access logs availability for usage profiling | |
Spark 2 | For Profiler computation – both univariate and Ranger profilers | |
Livy Server 2 | Job Server for Profilers | |
HDFS | For registering and sharing Profiler .jars | Co-located on the Profiler Agent Node. |
Hive | For column profiling |