Deploying Cloudera Search
When you deploy Cloudera Search, SolrCloud partitions your data set into multiple indexes and processes and uses ZooKeeper to simplify management. The result is a cluster of coordinating Solr servers.
Installing and Starting ZooKeeper Server
SolrCloud mode uses a ZooKeeper Service as a highly available, central location for cluster management. For a small cluster, running a ZooKeeper host collocated with the NameNode is recommended. For larger clusters, you may want to run multiple ZooKeeper servers. For more information, see Installing the ZooKeeper Packages.
Initializing Solr
Once the ZooKeeper Service is running, configure each Solr host with the ZooKeeper Quorum address of every ZooKeeper server. This is a single address in smaller deployments, or multiple addresses if you deploy additional servers.
Configure the ZooKeeper Quorum address in solr-env.sh. The file location varies by installation type. If you accepted default file locations, the solr-env.sh file can be found in:
- Parcels: /opt/cloudera/parcels/CDH-*/etc/default/solr
- Packages: /etc/default/solr
Edit the SOLR_ZK_ENSEMBLE property in this file to specify the address of the ZooKeeper service. You must make this configuration change on every Solr Server host. The following example shows a configuration with three ZooKeeper hosts:
SOLR_ZK_ENSEMBLE=<zkhost1>:2181,<zkhost2>:2181,<zkhost3>:2181/solr
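The value follows a simple pattern: comma-separated host:port pairs, followed once by the /solr path. As a minimal sketch, assuming port 2181 as in the example above and placeholder hostnames, the string could be assembled as follows. The function name build_zk_ensemble is hypothetical and is not part of Cloudera Search.

```shell
#!/bin/sh
# Hypothetical helper (not part of Cloudera Search): join ZooKeeper hosts
# into a value suitable for SOLR_ZK_ENSEMBLE.
# Usage: build_zk_ensemble <path> <host> [<host> ...]
build_zk_ensemble() {
    zk_path=$1
    shift
    ensemble=""
    for host in "$@"; do
        # Append host:port, comma-separated (port 2181 is assumed).
        ensemble="${ensemble:+$ensemble,}$host:2181"
    done
    # The path appears once, after the last host.
    printf '%s%s\n' "$ensemble" "$zk_path"
}

# Hostnames below are placeholders.
echo "SOLR_ZK_ENSEMBLE=$(build_zk_ensemble /solr zk1.example.com zk2.example.com zk3.example.com)"
```

The printed line can then be copied into solr-env.sh on each Solr Server host.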
Configuring Solr for Use with HDFS
To use Solr with your established HDFS service, perform the following configurations:
- Configure the HDFS URI for Solr to use as a backing store in /etc/default/solr or /opt/cloudera/parcels/CDH-*/etc/default/solr. On every Solr Server host, edit the following property to configure the location of Solr index data in HDFS:
SOLR_HDFS_HOME=hdfs://namenodehost:8020/solr
Replace namenodehost with the hostname of your HDFS NameNode (as specified by fs.default.name or fs.defaultFS in your conf/core-site.xml file). You may also need to change the port number from the default (8020). On an HA-enabled cluster, ensure that the HDFS URI you use reflects the designated name service utilized by your cluster. This value should be reflected in fs.default.name; instead of a hostname, you would see hdfs://nameservice1 or something similar.
- In some cases, such as for configuring Solr to work with HDFS High Availability (HA), you may want to configure the Solr HDFS client by setting the HDFS configuration directory in /etc/default/solr or /opt/cloudera/parcels/CDH-*/etc/default/solr. On every Solr Server host, locate the appropriate HDFS configuration directory and edit the following property with the absolute path to this directory:
SOLR_HDFS_CONFIG=/etc/hadoop/conf
Replace the path with the correct directory containing the proper HDFS configuration files, core-site.xml and hdfs-site.xml.
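Before pointing SOLR_HDFS_CONFIG at a directory, it can help to confirm that both files are actually present. The following is a minimal sketch; check_hdfs_conf is a hypothetical helper name, and /etc/hadoop/conf is only the example path used above.

```shell
#!/bin/sh
# Hypothetical check: confirm that a directory contains the two HDFS
# client configuration files named in this section.
check_hdfs_conf() {
    conf_dir=$1
    for f in core-site.xml hdfs-site.xml; do
        if [ ! -f "$conf_dir/$f" ]; then
            echo "missing: $conf_dir/$f"
            return 1
        fi
    done
    echo "ok: $conf_dir"
}

# Example path from this section; adjust for your installation.
check_hdfs_conf /etc/hadoop/conf || true
```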
Configuring Solr to Use Secure HDFS
For information on setting up a secure CDH cluster, see the CDH 5 Security Guide.
In addition to the previous steps in Configuring Solr for Use with HDFS, perform the following steps if security is enabled:
- Create the Kerberos principals and Keytab files for every host in your cluster:
- Create the Solr principal using either kadmin or kadmin.local.
kadmin: addprinc -randkey solr/fully.qualified.domain.name@YOUR-REALM.COM
kadmin: xst -norandkey -k solr.keytab solr/fully.qualified.domain.name
For more information, see Step 4: Create and Deploy the Kerberos Principals and Keytab Files.
- Deploy the Kerberos Keytab files on every host in your cluster:
- Copy or move the keytab files to a directory that Solr can access, such as /etc/solr/conf.
$ sudo mv solr.keytab /etc/solr/conf/
$ sudo chown solr:hadoop /etc/solr/conf/solr.keytab
$ sudo chmod 400 /etc/solr/conf/solr.keytab
- Add Kerberos-related settings to /etc/default/solr or /opt/cloudera/parcels/CDH-*/etc/default/solr on every host in your cluster, substituting appropriate values. For a package-based installation, use something similar to the following:
SOLR_KERBEROS_ENABLED=true
SOLR_KERBEROS_KEYTAB=/etc/solr/conf/solr.keytab
SOLR_KERBEROS_PRINCIPAL=solr/fully.qualified.domain.name@YOUR-REALM.COM
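Because the principal contains the fully qualified domain name, the settings differ from host to host. As a rough sketch, the three lines could be generated per host as shown below. The function name solr_kerberos_settings, the hostname, and the realm are illustrative assumptions; the keytab path is the one used in this section.

```shell
#!/bin/sh
# Hypothetical helper: print the Kerberos settings for one Solr host.
# Usage: solr_kerberos_settings <fqdn> <realm>
solr_kerberos_settings() {
    fqdn=$1
    realm=$2
    printf 'SOLR_KERBEROS_ENABLED=true\n'
    printf 'SOLR_KERBEROS_KEYTAB=/etc/solr/conf/solr.keytab\n'
    printf 'SOLR_KERBEROS_PRINCIPAL=solr/%s@%s\n' "$fqdn" "$realm"
}

# Placeholder host and realm; on a real host, `hostname -f` usually
# returns the fully qualified name.
solr_kerberos_settings search01.example.com EXAMPLE.COM
```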
Creating the /solr Directory in HDFS
Before starting the Cloudera Search server, you need to create the /solr directory in HDFS. The Cloudera Search master runs as solr:solr, so it does not have the required permissions to create a top-level directory.
$ sudo -u hdfs hadoop fs -mkdir /solr
$ sudo -u hdfs hadoop fs -chown solr /solr
Initializing the ZooKeeper Namespace
$ solrctl init
Starting Solr
$ sudo service solr-server restart
$ sudo jps -lm
31407 sun.tools.jps.Jps -lm
31236 org.apache.catalina.startup.Bootstrap start
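The check can be scripted by searching the jps output for the Bootstrap entry. This is a minimal sketch: solr_is_running is a hypothetical helper, and it assumes that the Bootstrap line shown in the example output belongs to the Solr server. Any other Tomcat-based service on the same host would also match.

```shell
#!/bin/sh
# Hypothetical helper: reads `jps -lm` output on stdin and succeeds if the
# Bootstrap entry from the example output is present.
solr_is_running() {
    grep -qF 'org.apache.catalina.startup.Bootstrap start'
}

# Sample input copied from the example output; on a real host, pipe the
# output of `sudo jps -lm` into the function instead.
printf '%s\n' \
    '31407 sun.tools.jps.Jps -lm' \
    '31236 org.apache.catalina.startup.Bootstrap start' \
    | solr_is_running && echo "Solr appears to be running"
```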
Runtime Solr Configuration
To start using Solr to index data, you must configure a collection to hold the index. A configuration for a collection requires a solrconfig.xml file, a schema.xml file, and any helper files referenced from the XML files. The solrconfig.xml file contains all of the Solr settings for a given collection, and the schema.xml file specifies the schema that Solr uses when indexing documents. For more details on how to configure a collection for your data set, see http://wiki.apache.org/solr/SchemaXml.
$ solrctl instancedir --generate $HOME/solr_configs
You can customize the generated configuration by directly editing the solrconfig.xml and schema.xml files created in $HOME/solr_configs/conf.
These configuration files are compatible with the standard Solr tutorial example documents.
$ solrctl instancedir --create collection1 $HOME/solr_configs
$ solrctl instancedir --list
If you used the earlier --create command to create collection1, the --list command should return collection1.
Creating Your First Solr Collection
$ solrctl collection --create collection1 -s {{numOfShards}}
Verify that the collection is active. For example, for the server myhost.example.com, go to http://myhost.example.com:8983/solr/collection1/select?q=*%3A*&wt=json&indent=true and confirm that the query returns a response. Similarly, you can view the topology of your SolrCloud using a URL similar to http://myhost.example.com:8983/solr/#/~cloud.
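The same query can be issued from the command line. The sketch below only builds the URL, so it runs without a live server. solr_select_url is a hypothetical helper, and port 8983 and the hostname are taken from the example above.

```shell
#!/bin/sh
# Hypothetical helper: build the match-all select URL for a collection.
# Usage: solr_select_url <host> <collection>
solr_select_url() {
    # %%3A prints a literal %3A, the URL-encoded colon in q=*:*
    printf 'http://%s:8983/solr/%s/select?q=*%%3A*&wt=json&indent=true\n' "$1" "$2"
}

url=$(solr_select_url myhost.example.com collection1)
echo "$url"
# On a host that can reach the server, the query could then be run with:
#   curl "$url"
```

Quoting the URL matters when passing it to a tool such as curl, because the & characters would otherwise be interpreted by the shell.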
Adding Another Collection with Replication
To support scaling for the query load, create a second collection with replication. Having multiple servers with replicated collections distributes the request load for each shard. Create a single-shard collection with a replication factor of two. Your cluster must have at least two running servers to support this configuration, so ensure that Cloudera Search is installed on at least two servers. A replication factor of two causes two copies of the index files to be stored in two different locations.
- Generate the config files for the collection:
$ solrctl instancedir --generate $HOME/solr_configs2
- Upload the instance directory to ZooKeeper:
$ solrctl instancedir --create collection2 $HOME/solr_configs2
- Create the second collection:
$ solrctl collection --create collection2 -s 1 -r 2
- Verify that the collection is live and that the one shard is served by two hosts. For example, for the server myhost.example.com, you should receive content from: http://myhost.example.com:8983/solr/#/~cloud.
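The sizing in this section follows from simple arithmetic: each shard is stored once per replica, so a collection consists of the number of shards multiplied by the replication factor. As a small sketch (total_cores is a hypothetical helper, not a solrctl feature):

```shell
#!/bin/sh
# Hypothetical helper: number of index copies (cores) a collection needs.
# Usage: total_cores <shards> <replication_factor>
total_cores() {
    shards=$1
    replicas=$2
    echo $((shards * replicas))
}

# The values used in this section: -s 1 -r 2
echo "cores for -s 1 -r 2: $(total_cores 1 2)"
```

With one shard and a replication factor of two, the result is two cores, which is consistent with the requirement for at least two running servers when each server hosts one copy.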
Creating Replicas of Existing Shards
You can create additional replicas of existing shards using a command of the following form:
$ solrctl --zk <zkensemble> --solr <target solr server> core \
  --create <new core name> -p collection=<collection> -p shard=<shard to replicate>
For example, to create a new replica of shard1 in the collection named collection1, use the following command:
$ solrctl --zk myZKEnsemble:2181/solr --solr mySolrServer:8983/solr core \
  --create collection1_shard1_replica2 -p collection=collection1 -p shard=shard1
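The core name in the example appears to combine the collection name, the shard name, and a replica number. Assuming that pattern, a new core name could be derived as sketched below. replica_core_name is a hypothetical helper, and the naming pattern is inferred from the example rather than a documented requirement.

```shell
#!/bin/sh
# Hypothetical helper: compose a core name in the style of the example,
# <collection>_<shard>_replica<N>.
replica_core_name() {
    printf '%s_%s_replica%s\n' "$1" "$2" "$3"
}

core=$(replica_core_name collection1 shard1 2)
# Dry run: print the command instead of executing it.
echo "solrctl --zk myZKEnsemble:2181/solr --solr mySolrServer:8983/solr core --create $core -p collection=collection1 -p shard=shard1"
```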
Adding a New Shard to a Solr Server
You can use solrctl to add a new shard to a specified Solr server:
$ solrctl --solr http://<target_solr_server>:8983/solr core --create <core_name> \
  -p dataDir=hdfs://<nameservice>/<index_hdfs_path> -p collection.configName=<config_name> \
  -p collection=<collection_name> -p numShards=<int> -p shard=<shard_id>
Where:
- target_solr_server: The server to host the new shard
- core_name: <collection_name><shard_id><replica_id>
- shard_id: New shard identifier
For example, to add a second shard named shard2 to a Solr server named mySolrServer, where the collection is named myCollection, use the following command:
$ solrctl --solr http://mySolrServer:8983/solr core --create myCore \
  -p dataDir=hdfs://namenode/solr/myCollection/index -p collection.configName=myConfig \
  -p collection=myCollection -p numShards=2 -p shard=shard2