Using CDH with Isilon Storage
EMC Isilon is a storage service with a distributed filesystem that can be used in place of HDFS to provide storage for CDH services.
- Supported Versions
- Differences Between Isilon HDFS and CDH HDFS
- Preliminary Steps on the Isilon Service
- Installing Cloudera Manager with Isilon
- Installing a Secure Cluster with Isilon
- Upgrading a Cluster with Isilon
- Using Impala with Isilon Storage
- Configuring Replication with Kerberos and Isilon
For Cloudera and Isilon compatibility information, see Product Compatibility for EMC Isilon in the product compatibility matrix.
Differences Between Isilon HDFS and CDH HDFS
The following features of HDFS are not implemented with Isilon OneFS:
- HDFS caching
- HDFS encryption
- HDFS ACLs
Preliminary Steps on the Isilon Service
Before installing a Cloudera Manager cluster to use Isilon storage, perform the following steps on the Isilon OneFS system. For detailed information on setting up Isilon OneFS for Cloudera Manager, see Cloudera and Isilon Implementation.
- Create an Isilon access zone with HDFS support. For example:
- CDH includes a default setting in hdfs-site.xml to support the WRT checksum type on datanodes. To support WRT, you must run the following OneFS command:
isi hdfs --checksum-type=crc32
- Create two directories to be used by all CDH services:
- Create a tmp directory in the access zone:
- Create the supergroup group and the hdfs user.
- Create a tmp directory and set ownership to hdfs:supergroup and permissions to 1777. For example:
cd hdfs_root_directory
isi_run -z zone_id mkdir tmp
isi_run -z zone_id chown hdfs:supergroup tmp
isi_run -z zone_id chmod 1777 tmp
- Create a user directory in the access zone and set ownership to hdfs:supergroup and permissions to 755. For example:
cd hdfs_root_directory
isi_run -z zone_id mkdir user
isi_run -z zone_id chown hdfs:supergroup user
isi_run -z zone_id chmod 755 user
- Create the service-specific users, groups, or directories for each CDH service you plan to use. Create the directories in the access zone you have created.
- ZooKeeper: nothing required.
- HBase
- Create the hbase group with hbase user.
- Create the root directory for HBase. For example:
hdfs_root_directory/hbase hbase:hbase 755
- YARN (MR2)
- Create the mapred group with mapred user.
- Create the cmjobuser user and add it to the hadoop group. For example:
isi-cloudera-1# isi auth users create cmjobuser --zone subnet1
isi-cloudera-1# isi auth users modify cmjobuser --add-group hadoop --zone subnet1
- Create the cloudera-scm user and add it to the supergroup group. For example:
isi-cloudera-1# isi auth users create cloudera-scm --zone subnet1
isi-cloudera-1# isi auth users modify cloudera-scm --add-group supergroup --zone subnet1
- Create the history directory for YARN. For example:
hdfs_root_directory/user/history mapred:hadoop 777
- Create the remote application log directory for YARN. For example:
hdfs_root_directory/tmp/logs mapred:hadoop 775
- Create the cmjobuser directory for YARN. For example:
hdfs_root_directory/user/cmjobuser cmjobuser:hadoop 775
- Create the cloudera-scm directory for YARN. For example:
hdfs_root_directory/user/cloudera-scm cloudera-scm:supergroup 775
- Create the tmp/cmYarnContainerMetrics directory. For example:
hdfs_root_directory/tmp/cmYarnContainerMetrics cmjobuser:supergroup 775
- Create the tmp/cmYarnContainerMetricsAggregate directory. For example:
hdfs_root_directory/tmp/cmYarnContainerMetricsAggregate cloudera-scm:supergroup 775
- Oozie
- Create the oozie group with oozie user.
- Create the user directory for Oozie. For example:
hdfs_root_directory/user/oozie oozie:oozie 775
- Flume
- Create the flume group with flume user.
- Create the user directory for Flume. For example:
hdfs_root_directory/user/flume flume:flume 775
- Hive
- Create the hive group with hive user.
- Create the user directory for Hive. For example:
hdfs_root_directory/user/hive hive:hive 775
- Create the warehouse directory for Hive. For example:
hdfs_root_directory/user/hive/warehouse hive:hive 1777
- Create a temporary directory for Hive. For example:
hdfs_root_directory/tmp/hive hive:supergroup 777
- Solr
- Create the solr group with solr user.
- Create the data directory for Solr. For example:
hdfs_root_directory/solr solr:solr 775
- Sqoop
- Create the sqoop group with sqoop2 user.
- Create the user directory for Sqoop. For example:
hdfs_root_directory/user/sqoop2 sqoop2:sqoop 775
- Hue
- Create the hue group with hue user.
- Create the sample group with sample user.
- Create the user directory for Hue. For example:
hdfs_root_directory/user/hue hue:hue 775
- Spark
- Create the spark group with spark user.
- Create the user directory for Spark. For example:
hdfs_root_directory/user/spark spark:spark 751
- Create the application history directory for Spark. For example:
hdfs_root_directory/user/spark/applicationHistory spark:spark 1777
- Map the hdfs user to root on the Isilon service. For example:
isiloncluster1-1# isi zone zones modify --user-mapping-rules="hdfs=>root" --zone zone1
isiloncluster1-1# isi services isi_hdfs_d disable ; isi services isi_hdfs_d enable
The service 'isi_hdfs_d' has been disabled.
The service 'isi_hdfs_d' has been enabled.
If you are using Cloudera Manager, also map the cloudera-scm user to root on the Isilon service. For example:
isiloncluster1-1# isi zone zones modify --user-mapping-rules="cloudera-scm=>root" --zone zone1
isiloncluster1-1# isi services isi_hdfs_d disable ; isi services isi_hdfs_d enable
The service 'isi_hdfs_d' has been disabled.
The service 'isi_hdfs_d' has been enabled.
- Create the following proxy users for the Flume, Impala, Hive, Hue, and Oozie services:
Service Proxy User Users to Add as Members Flume
Create the proxy users on the Isilon system by running the following command as root:
isi hdfs proxyusers create username --add-user username1 --add-user username2 ... --zone access_zone
For example, to create proxy users for Hue:
isi hdfs proxyusers create hue --add-user oozie --add-user yarn --add-user impala --add-user hive --zone subnet1
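The command pattern above can be generated for any proxy user with a small helper. This is a minimal dry-run sketch: proxyuser_cmd is a hypothetical function name, it only prints the isi hdfs proxyusers create command shown above rather than running it, and the zone and member names are taken from the Hue example.

```shell
#!/bin/sh
# Hypothetical helper: prints (does not run) the proxy user creation command.
# Usage: proxyuser_cmd ZONE PROXY_USER MEMBER...
proxyuser_cmd() {
    zone=$1
    proxy=$2
    shift 2
    cmd="isi hdfs proxyusers create $proxy"
    # Append one --add-user flag per member, in the order given.
    for member in "$@"; do
        cmd="$cmd --add-user $member"
    done
    printf '%s --zone %s\n' "$cmd" "$zone"
}

# Reproduces the Hue example from the text.
proxyuser_cmd subnet1 hue oozie yarn impala hive
```

Printing the command first makes it possible to review the member list before running it as root on the Isilon system.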
Once the users, groups, and directories are created in Isilon OneFS, you can install Cloudera Manager.
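The directory entries in the preliminary steps are given as a path, an owner:group pair, and a permission mode. As a minimal sketch, the helper below turns one such row into the mkdir, chown, and chmod commands used in the tmp and user examples. The name dir_cmds is hypothetical, zone_id is the same placeholder used in the text, and the function only prints the commands so they can be reviewed before being run from hdfs_root_directory on the OneFS system. It assumes that the tmp and user directories already exist.

```shell
#!/bin/sh
# Hypothetical helper: prints the three isi_run commands for one directory row.
# Usage: dir_cmds ZONE_ID RELATIVE_PATH OWNER:GROUP MODE
dir_cmds() {
    printf 'isi_run -z %s mkdir %s\n' "$1" "$2"
    printf 'isi_run -z %s chown %s %s\n' "$1" "$3" "$2"
    printf 'isi_run -z %s chmod %s %s\n' "$1" "$4" "$2"
}

# Rows copied from the Hive entries above, relative to hdfs_root_directory.
# Parent directories are listed before their children.
while read -r path owner mode; do
    dir_cmds zone_id "$path" "$owner" "$mode"
done <<'EOF'
user/hive hive:hive 775
user/hive/warehouse hive:hive 1777
tmp/hive hive:supergroup 777
EOF
```

The same loop can be fed the rows for any other service listed in the preliminary steps.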
Installing Cloudera Manager with Isilon
- The simplest installation procedure, suitable for development or proof of concept, is Installation Path A, which uses embedded databases that are installed as part of the Cloudera Manager installation process.
- For production environments, Installation Path B - Installation Using Cloudera Manager Parcels or Packages describes configuring external databases for Cloudera Manager and CDH storage.
If you choose parcel installation on the Cluster Installation screen, the installation wizard points to the latest parcels of CDH available.
On the installation wizard Cluster Setup page, click Custom Services, and select the services to install in the cluster. Be sure to select the Isilon service; do not select the HDFS service, and do not check Include Cloudera Navigator at the bottom of the Cluster Setup page. On the Role Assignments page, specify the hosts that will serve as gateway roles for the Isilon service. You can add gateway roles to one, some, or all nodes in the cluster.
Installing a Secure Cluster with Isilon
- Create an insecure Cloudera Manager cluster as described above in Installing Cloudera Manager with Isilon.
- Follow the Isilon documentation to enable Kerberos for your access zone: Cloudera CDH with Isilon and Active Directory Kerberos Implementation. This includes adding a Kerberos authentication provider to your Isilon access zone.
- Follow the instructions in Configuring Authentication in Cloudera Manager to configure a secure cluster with Kerberos.
Upgrading a Cluster with Isilon
- If required, upgrade OneFS to a version compatible with the version of CDH to which you are upgrading. See Product Compatibility for EMC Isilon in the product compatibility matrix. For OneFS upgrade instructions, see the EMC Isilon documentation.
- (Optional) Upgrade Cloudera Manager. See Cloudera Upgrade.
The Cloudera Manager minor version must always be equal to or greater than the CDH minor version because older versions of Cloudera Manager may not support features in newer versions of CDH. For example, if you want to upgrade to CDH 5.4.8 you must first upgrade to Cloudera Manager 5.4 or higher.
- Upgrade CDH. See Upgrading CDH and Managed Services Using Cloudera Manager.
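The version rule in the upgrade steps can be checked before starting. The following is a minimal sketch, not a Cloudera tool: cm_supports_cdh is a hypothetical helper that assumes plain dotted version strings with at least a major and a minor part, such as 5.4 or 5.4.8. Treating a higher major version as sufficient is also an assumption; the text only states the rule for minor versions.

```shell
#!/bin/sh
# Hypothetical helper: succeeds when the Cloudera Manager major.minor version
# is equal to or greater than the CDH major.minor version.
# Usage: cm_supports_cdh CM_VERSION CDH_VERSION
cm_supports_cdh() {
    cm_major=${1%%.*}          # text before the first dot
    cm_rest=${1#*.}            # text after the first dot
    cm_minor=${cm_rest%%.*}    # second component only
    cdh_major=${2%%.*}
    cdh_rest=${2#*.}
    cdh_minor=${cdh_rest%%.*}
    if [ "$cm_major" -ne "$cdh_major" ]; then
        [ "$cm_major" -gt "$cdh_major" ]
    else
        [ "$cm_minor" -ge "$cdh_minor" ]
    fi
}

# The example from the text: CDH 5.4.8 requires Cloudera Manager 5.4 or higher.
if cm_supports_cdh 5.4 5.4.8; then
    echo "Cloudera Manager version is sufficient"
else
    echo "Upgrade Cloudera Manager first"
fi
```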
Using Impala with Isilon Storage
You can use Impala to query data files that reside on EMC Isilon storage devices, rather than in HDFS. This capability allows convenient query access to a storage system where you might already be managing large volumes of data. The combination of the Impala query engine and Isilon storage is certified on CDH versions 5.4.4 through 5.15.
To set the default block size on the Isilon device to 256 MB, run the following command:
isi hdfs settings modify --default-block-size=256MB
The typical use case for Impala and Isilon together is to use Isilon for the default filesystem, replacing HDFS entirely. In this configuration, when you create a database, table, or partition, the data always resides on Isilon storage and you do not need to specify any special LOCATION attribute. If you do specify a LOCATION attribute, its value refers to a path within the Isilon filesystem. For example:
-- If the default filesystem is Isilon, all Impala data resides there
-- and all Impala databases and tables are located there.
CREATE TABLE t1 (x INT, s STRING);
-- You can specify LOCATION for database, table, or partition,
-- using values from the Isilon filesystem.
CREATE DATABASE d1 LOCATION '/some/path/on/isilon/server/d1.db';
CREATE TABLE d1.t2 (a TINYINT, b BOOLEAN);
Impala can write to, delete, and rename data files and database, table, and partition directories on Isilon storage. Therefore, Impala statements such as CREATE TABLE, DROP TABLE, CREATE DATABASE, DROP DATABASE, ALTER TABLE, and INSERT work the same with Isilon storage as with HDFS.
When the Impala spill-to-disk feature is activated by a query that approaches the memory limit, Impala writes all the temporary data to a local (not Isilon) storage device. Because the I/O bandwidth for the temporary data depends on the number of local disks, and clusters using Isilon storage might not have as many local disks attached, pay special attention on Isilon-enabled clusters to any queries that use the spill-to-disk feature. Where practical, tune the queries or allocate extra memory for Impala to avoid spilling. Although you can specify an Isilon storage device as the destination for the temporary data for the spill-to-disk feature, that configuration is not recommended due to the need to transfer the data both ways using remote I/O.
When tuning Impala queries on HDFS, you typically try to avoid any remote reads. When the data resides on Isilon storage, all the I/O consists of remote reads. Do not be alarmed when you see non-zero numbers for remote read measurements in query profile output. The benefit of the Impala and Isilon integration is primarily convenience of not having to move or copy large volumes of data to HDFS, rather than raw query performance. You can increase the performance of Impala I/O for Isilon systems by increasing the value for the num_remote_hdfs_io_threads configuration parameter, in the Cloudera Manager user interface for clusters using Cloudera Manager, or through the --num_remote_hdfs_io_threads startup option for the impalad daemon on clusters not using Cloudera Manager.
For information about managing Isilon storage devices through Cloudera Manager, see Using CDH with Isilon Storage.
- In the HDFS Client Advanced Configuration Snippet (Safety Valve) for hdfs-site.xml and the Cluster-wide Advanced Configuration Snippet (Safety Valve) for core-site.xml properties for the Isilon service, set the value of the dfs.client.file-block-storage-locations.timeout.millis property to 10000.
- In the Isilon Cluster-wide Advanced Configuration Snippet (Safety Valve) for core-site.xml property for the Isilon service, set the value of the hadoop.security.token.service.use_ip property to FALSE.
- If you see errors that reference the .Trash directory, make sure that the Use Trash property is selected.
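Advanced Configuration Snippet (Safety Valve) fields for hdfs-site.xml and core-site.xml accept entries in the standard Hadoop property XML form. As a minimal sketch, the hypothetical helper property_xml below prints that XML for the two properties named above, so the output can be pasted into the corresponding field. The two-space indentation is only a formatting choice.

```shell
#!/bin/sh
# Hypothetical helper: prints one Hadoop <property> element.
# Usage: property_xml NAME VALUE
property_xml() {
    printf '<property>\n  <name>%s</name>\n  <value>%s</value>\n</property>\n' "$1" "$2"
}

# The two settings described in the list above.
property_xml dfs.client.file-block-storage-locations.timeout.millis 10000
property_xml hadoop.security.token.service.use_ip false
```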
Configuring Replication with Kerberos and Isilon
- Create a custom Kerberos Keytab and Kerberos principal that the replication jobs use to authenticate to storage and other CDH services. See Configuring Authentication.
- In Cloudera Manager, select .
- Search for and enter values for the following properties:
- Custom Kerberos Keytab Location – Enter the location of the Custom Kerberos Keytab.
- Custom Kerberos Principal Name – Enter the principal name to use for replication between secure clusters.
- When you create a replication schedule, enter the Custom Kerberos Principal Name in the Run As Username field. See Configuring Replication of HDFS Data and Configuring Replication of Hive Data.
- Ensure that both the source and destination clusters have the same set of users and groups. When you set ownership of files (or when maintaining ownership), if a user or group does not exist, the chown command fails on Isilon. See Performance and Scalability Limitations.
- Cloudera recommends that you do not select the Replicate Impala Metadata option for Hive replication schedules. If you need to use this feature, create a custom principal of the form hdfs/hostname@realm or impala/hostname@realm.
- Add the following property and value to the HDFS Service Advanced Configuration Snippet (Safety Valve) for hdfs-site.xml and Cluster-wide Advanced Configuration Snippet (Safety Valve) for core-site.xml properties:
hadoop.security.token.service.use_ip = false
- If a replication job fails with an error similar to the following:
java.io.IOException: Failed on local exception: java.io.IOException: org.apache.hadoop.security.AccessControlException: Client cannot authenticate via:[TOKEN, KERBEROS]; Host Details : local host is: "foo.mycompany.com/220.127.116.11"; destination host is: "myisilon-1.mycompany.com":8020;
Set the Isilon cluster-wide time-to-live setting to a higher value on the destination cluster for the replication. Note that higher values may affect load balancing in the Isilon cluster by causing workloads to be less distributed. A value of 60 is a good starting point. For example:
isi networks modify pool subnet4:nn4 --ttl=60
You can view the settings for a subnet with a command similar to the following:
isi networks list pools --subnet subnet3 -v