Solr and HDFS - the Block Cache

Cloudera Search enables Solr to store indexes in an HDFS filesystem. To maintain performance, an HDFS block cache has been implemented using Least Recently Used (LRU) semantics. This enables Solr to cache HDFS index files on read and write, storing the portions of the file in JVM direct memory (off heap) by default, or optionally in the JVM heap.

Batch jobs typically do not use the cache, while Solr servers (when serving queries or indexing documents) should. When running indexing using MapReduce (MR), the MR jobs themselves do not use the block cache. Block write caching is turned off by default and should be left disabled.

Tuning of this cache is complex and best practices are continually being refined. In general, allocate a cache that is about 10-20% of the amount of memory available on the system. For example, when running HDFS and Solr on a host with 96 GB of memory, allocate 10-20 GB of memory using solr.hdfs.blockcache.slab.count. As index sizes grow you may need to tune this parameter to maintain optimal performance.

Configure Index Caching

The following parameters control caching. They can be configured at the Solr process level by setting the respective Java system property or by editing solrconfig.xml directly.

If the parameters are set at the collection level (using solrconfig.xml), the first collection loaded by the Solr server takes precedence, and block cache settings in all other collections are ignored. Because you cannot control the order in which collections are loaded, you must make sure to set identical block cache settings in every collection solrconfig.xml. Block cache parameters set at the collection level in solrconfig.xml also take precedence over parameters at the process level.

Parameter Cloudera Manager Setting Default Description
solr.hdfs.blockcache.global Not directly configurable. Cloudera Manager automatically enables the global block cache. To override this setting, you must use the Solr Service Environment Advanced Configuration Snippet (Safety Valve). true If enabled, one HDFS block cache is used for each collection on a host. If blockcache.global is disabled, each SolrCore on a host creates its own private HDFS block cache. Enabling this parameter simplifies managing HDFS block cache memory.
solr.hdfs.blockcache.enabled HDFS Block Cache true Enable the block cache.
solr.hdfs.blockcache.read.enabled Not directly configurable. If the block cache is enabled, Cloudera Manager automatically enables the read cache. To override this setting, you must use the Solr Service Environment Advanced Configuration Snippet (Safety Valve). true Enable the read cache.
solr.hdfs.blockcache.write.enabled Not directly configurable. If the block cache is enabled, Cloudera Manager automatically disables the write cache. false Enable the write cache.
solr.hdfs.blockcache.direct.memory.allocation HDFS Block Cache Off-Heap Memory true Enable direct memory allocation. If this is false, heap is used.
solr.hdfs.blockcache.blocksperbank HDFS Block Cache Blocks per Slab 16384 Number of blocks per cache slab. The size of the cache is 8 KB (the block size) times the number of blocks per slab times the number of slabs.
solr.hdfs.blockcache.slab.count HDFS Block Cache Number of Slabs 1 Number of slabs per block cache. The size of the cache is 8 KB (the block size) times the number of blocks per slab times the number of slabs.

Solr HDFS optimizes caching when performing NRT indexing using Lucene's NRTCachingDirectory.

Lucene caches a newly created segment if both of the following conditions are true:
  • The segment is the result of a flush or a merge and the estimated size of the merged segment is <= solr.hdfs.nrtcachingdirectory.maxmergesizemb.
  • The total cached bytes is <= solr.hdfs.nrtcachingdirectory.maxcachedmb.
The following parameters control NRT caching behavior:
Parameter Default Description
solr.hdfs.nrtcachingdirectory.enable true Whether to enable the NRTCachingDirectory.
solr.hdfs.nrtcachingdirectory.maxcachedmb 192 Size of the cache in megabytes.
solr.hdfs.nrtcachingdirectory.maxmergesizemb 16 Maximum segment size to cache.
This is an example solrconfig.xml file with defaults:
 <directoryFactory name="DirectoryFactory">
    <bool name="solr.hdfs.blockcache.enabled">${solr.hdfs.blockcache.enabled:true}</bool>
    <int name="solr.hdfs.blockcache.slab.count">${solr.hdfs.blockcache.slab.count:1}</int>
    <bool name="solr.hdfs.blockcache.direct.memory.allocation">${solr.hdfs.blockcache.direct.memory.allocation:true}</bool>
    <int name="solr.hdfs.blockcache.blocksperbank">${solr.hdfs.blockcache.blocksperbank:16384}</int>
    <bool name="solr.hdfs.blockcache.read.enabled">${solr.hdfs.blockcache.read.enabled:true}</bool>
    <bool name="solr.hdfs.nrtcachingdirectory.enable">${solr.hdfs.nrtcachingdirectory.enable:true}</bool>
    <int name="solr.hdfs.nrtcachingdirectory.maxmergesizemb">${solr.hdfs.nrtcachingdirectory.maxmergesizemb:16}</int>
    <int name="solr.hdfs.nrtcachingdirectory.maxcachedmb">${solr.hdfs.nrtcachingdirectory.maxcachedmb:192}</int>
</directoryFactory>