Deploying Cloudera Search
When you deploy Cloudera Search, SolrCloud partitions your data set into multiple indexes and processes, using ZooKeeper to simplify management; the result is a cluster of coordinating Solr servers.
This section assumes that you have already completed the process of installing Search using either Installing Cloudera Search with Cloudera Manager or Installing Cloudera Search without Cloudera Manager. You are now about to distribute the processes across multiple hosts. Before completing this process, you may want to review Choosing where to Deploy the Cloudera Search Processes.
Installing and Starting ZooKeeper Server
SolrCloud mode uses a ZooKeeper Service as a highly available, central location for cluster management. For a small cluster, running a ZooKeeper node collocated with the NameNode is recommended. For larger clusters, you may wish to run multiple ZooKeeper servers. For more information, see Installing the ZooKeeper Packages.
Initializing Solr
Once your ZooKeeper Service is running, configure each Solr node with the ZooKeeper Quorum address or addresses. Provide the ZooKeeper Quorum address for each ZooKeeper server you have deployed. This could be a single address in smaller deployments, or multiple addresses if you chose to deploy additional servers.
Configure the ZooKeeper Quorum address in /etc/default/solr. Edit the property to configure the nodes with the address of the ZooKeeper service. You must make this configuration change for every Solr Server host. An example configuration with three ZooKeeper hosts might appear as follows:
SOLR_ZK_ENSEMBLE=<zkhost1>:2181,<zkhost2>:2181,<zkhost3>:2181/solr
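If you manage many hosts, the ensemble string can be assembled in a few lines of shell. This is just a convenience sketch; the hostnames below are placeholders for your own ZooKeeper hosts:

```shell
# Build the SOLR_ZK_ENSEMBLE value from a list of ZooKeeper hosts.
# The hostnames are placeholders; substitute your own.
ZK_HOSTS="zkhost1 zkhost2 zkhost3"
ZK_PORT=2181
ENSEMBLE=""
for h in $ZK_HOSTS; do
  # Append host:port, comma-separated after the first entry.
  ENSEMBLE="${ENSEMBLE:+$ENSEMBLE,}$h:$ZK_PORT"
done
SOLR_ZK_ENSEMBLE="$ENSEMBLE/solr"
echo "$SOLR_ZK_ENSEMBLE"
```

Paste the resulting line into /etc/default/solr on each Solr Server host.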
Configuring Solr for Use with HDFS
To set up Solr for use with your established HDFS service, perform the following configurations:
- Configure the HDFS URI for Solr to use as a backing store in /etc/default/solr. Edit the following property to configure the location of Solr index data in HDFS. Do this on every Solr Server host:
SOLR_HDFS_HOME=hdfs://namenodehost:8020/solr
Be sure to replace namenodehost with the hostname of your HDFS NameNode (as specified by fs.default.name or fs.defaultFS in your conf/core-site.xml file); you may also need to change the port number from the default (8020). On an HA-enabled cluster, make sure the HDFS URI reflects the nameservice your cluster uses. In that case, fs.default.name contains a nameservice rather than a hostname, for example hdfs://nameservice1.
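If you are unsure which URI to use, you can read fs.defaultFS straight out of core-site.xml. The following sketch inlines a throwaway sample file so it runs anywhere; on a real host, point CORE_SITE at /etc/hadoop/conf/core-site.xml instead:

```shell
# Extract fs.defaultFS from a core-site.xml and derive SOLR_HDFS_HOME.
# A sample file is inlined for illustration; on a cluster host set
# CORE_SITE=/etc/hadoop/conf/core-site.xml instead.
CORE_SITE=$(mktemp)
cat > "$CORE_SITE" <<'EOF'
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://nameservice1</value>
  </property>
</configuration>
EOF
# Take the <value> element on the line after the fs.defaultFS <name>.
DEFAULT_FS=$(grep -A1 '<name>fs.defaultFS</name>' "$CORE_SITE" \
  | sed -n 's|.*<value>\(.*\)</value>.*|\1|p')
echo "SOLR_HDFS_HOME=${DEFAULT_FS}/solr"
rm -f "$CORE_SITE"
```

On an HA cluster this prints the nameservice form of the URI, matching the guidance above.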
- In some cases, such as configuring Solr to work with HDFS High Availability (HA), you may want to configure Solr's HDFS client. You can do this by setting the HDFS configuration directory in /etc/default/solr. Locate the appropriate HDFS configuration directory on each node, and edit the following property with the absolute path to this directory. Do this on every Solr Server host:
SOLR_HDFS_CONFIG=/etc/hadoop/conf
Be sure to replace the path with the correct directory containing the proper HDFS configuration files, core-site.xml and hdfs-site.xml.
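A quick sanity check confirms that the directory you point SOLR_HDFS_CONFIG at actually contains both files. The sketch below uses a throwaway directory so it runs anywhere; in practice you would set SOLR_HDFS_CONFIG=/etc/hadoop/conf:

```shell
# Verify that an HDFS client configuration directory contains the files
# Solr needs. A temporary directory stands in for /etc/hadoop/conf here.
SOLR_HDFS_CONFIG=$(mktemp -d)
touch "$SOLR_HDFS_CONFIG/core-site.xml" "$SOLR_HDFS_CONFIG/hdfs-site.xml"
STATUS=ok
for f in core-site.xml hdfs-site.xml; do
  if [ -e "$SOLR_HDFS_CONFIG/$f" ]; then
    echo "found $f"
  else
    echo "missing $f" >&2
    STATUS=incomplete
  fi
done
echo "config check: $STATUS"
rm -rf "$SOLR_HDFS_CONFIG"
```

Run the loop against the real directory on each Solr Server host before restarting Solr.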
Configuring Solr for Use with Secure HDFS
- For information on setting up a secure CDH cluster for CDH 4, see the CDH 4 Security Guide.
- For information on setting up a secure CDH cluster for CDH 5, see the CDH 5 Security Guide.
- Create the Kerberos principals and keytab files for every node in your cluster:
- Create the Solr principal using either kadmin or kadmin.local. (For information on using kadmin or kadmin.local, see Create and Deploy the Kerberos Principals and Keytab Files for CDH 4, or Create and Deploy the Kerberos Principals and Keytab Files for CDH 5.)
kadmin: addprinc -randkey solr/fully.qualified.domain.name@YOUR-REALM.COM
kadmin: xst -norandkey -k solr.keytab solr/fully.qualified.domain.name
- Deploy the Kerberos keytab files on every node in your cluster:
- Copy or move the keytab files to a directory that Solr can access, such as /etc/solr/conf.
$ sudo mv solr.keytab /etc/solr/conf/
$ sudo chown solr:hadoop /etc/solr/conf/solr.keytab
$ sudo chmod 400 /etc/solr/conf/solr.keytab
- Add Kerberos-related settings to /etc/default/solr on every node in your cluster, substituting appropriate values:
SOLR_KERBEROS_ENABLED=true
SOLR_KERBEROS_KEYTAB=/etc/solr/conf/solr.keytab
SOLR_KERBEROS_PRINCIPAL=solr/fully.qualified.domain.name@YOUR-REALM.COM
Creating the /solr Directory in HDFS
Before starting the Cloudera Search server, you need to create the /solr directory in HDFS. Because the Cloudera Search master runs as solr:solr, it does not have the required permissions to create a top-level directory, so create it as the hdfs user:
$ sudo -u hdfs hadoop fs -mkdir /solr
$ sudo -u hdfs hadoop fs -chown solr /solr
Initializing ZooKeeper Namespace
Initialize the ZooKeeper namespace for Solr:
$ solrctl init
Starting Solr
Start the Solr server on every host:
$ sudo service solr-server restart
You can verify that the Solr process is running with jps:
$ sudo jps -lm
31407 sun.tools.jps.Jps -lm
31236 org.apache.catalina.startup.Bootstrap start
Runtime Solr Configuration
To start using Solr for indexing data, you must configure a collection to hold the index. A collection configuration requires a solrconfig.xml file, a schema.xml file, and any helper files referenced from those XML files. The solrconfig.xml file contains all of the Solr settings for a given collection, and the schema.xml file specifies the schema that Solr uses when indexing documents. For more details on configuring the schema for your data set, see http://wiki.apache.org/solr/SchemaXml.
Generate a template instance directory:
$ solrctl instancedir --generate $HOME/solr_configs
You can customize the configuration by directly editing the solrconfig.xml and schema.xml files created in $HOME/solr_configs/conf.
These configuration files are compatible with the standard Solr tutorial example documents.
Upload the instance directory to ZooKeeper:
$ solrctl instancedir --create collection1 $HOME/solr_configs
$ solrctl instancedir --list
If you used the earlier --create command to create collection1, the --list command should return collection1.
Users who are familiar with Apache Solr might configure a collection directly in solr home: /var/lib/solr. While this is possible, Cloudera discourages this and recommends using solrctl instead.
Creating Your First Solr Collection
Create the collection, replacing {{numOfShards}} with the number of shards you want:
$ solrctl collection --create collection1 -s {{numOfShards}}
You should be able to check that the collection is active. For example, for the server myhost.example.com, you should be able to navigate to http://myhost.example.com:8983/solr/collection1/select?q=*%3A*&wt=json&indent=true and verify that the collection is active. Similarly, you should also be able to observe the topology of your SolrCloud using a URL similar to: http://myhost.example.com:8983/solr/#/~cloud.
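The query URL above can be assembled from its parts; note that the colon in the match-all query *:* must be percent-encoded as %3A. The host and collection names in this sketch are illustrative:

```shell
# Assemble the select URL for a collection. The match-all query is *:*,
# with ':' percent-encoded as %3A. Host and collection are illustrative.
HOST=myhost.example.com
PORT=8983
COLLECTION=collection1
QUERY='*%3A*'
URL="http://${HOST}:${PORT}/solr/${COLLECTION}/select?q=${QUERY}&wt=json&indent=true"
echo "$URL"
```

From any host that can reach the Solr server, fetching this URL (for example with curl) should return a JSON response for an active collection.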
Adding Another Collection with Replication
To support scaling for query load, create a second collection with replication. Having multiple servers with replicated collections distributes the request load for each shard. Create one shard cluster with a replication factor of two. Your cluster must have at least two running servers to support this configuration, so ensure Cloudera Search is installed on at least two servers before continuing with this process. A replication factor of two causes two copies of the index files to be stored in two different locations.
- Generate the configuration files for the collection:
$ solrctl instancedir --generate $HOME/solr_configs2
- Upload the instance directory to ZooKeeper:
$ solrctl instancedir --create collection2 $HOME/solr_configs2
- Create the second collection:
$ solrctl collection --create collection2 -s 1 -r 2
- Verify the collection is live and that your one shard is being served by two nodes. For example, for the server myhost.example.com, you should receive content from: http://myhost.example.com:8983/solr/#/~cloud.