Explaining SOLR, SOLRCloud to the rookie by Exdiscoran
Unlike other commenters on SOLR, a popular search platform, I will not show how to install SOLR or repeat what is already on the Apache SOLR website. Instead, I will use simple language to explain the following.
- What is SOLR and Apache Lucene?
- What are SOLRCloud and Zookeeper, and why does SOLRCloud use Zookeeper?
- What is HDFS?
- What are the advantages and disadvantages of using SOLR on HDFS?
Throughout this document, Node means a Linux server.
What is SOLR or SOLRCloud?
SOLR is an executable that you download to a node, unzip, and start as a daemon process. Once started, SOLR binds to an HTTP port and exposes a service that accepts text files. When you provide a text file to SOLR, it reads the file and builds an inverted index (sometimes called a reverse index) of its contents. To build this index, it uses the Apache Lucene library.
An inverted index is a sorted list of words, each pointing to the names of the files that contain it.
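To make the idea concrete, here is a minimal sketch of such an index in Python. It is not how Lucene stores its data on disk; the file names and the function `build_inverted_index` are made up for illustration, and splitting on whitespace stands in for real text analysis.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each lowercased word to the sorted list of file names that contain it."""
    postings = defaultdict(set)
    for name, text in docs.items():
        for word in text.lower().split():
            postings[word].add(name)
    # Sort the words and each list of file names, like a sorted term dictionary.
    return {word: sorted(names) for word, names in sorted(postings.items())}

# Hypothetical input files and their contents.
docs = {
    "doc1.txt": "solr uses lucene",
    "doc2.txt": "lucene builds the index",
}
index = build_inverted_index(docs)
print(index["lucene"])  # ['doc1.txt', 'doc2.txt']
print(index["solr"])    # ['doc1.txt']
```

A search for a word is then a lookup in this structure rather than a scan of every file.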
Before SOLR, during the initial days of search farms, developers built search functionality for their websites directly on top of Lucene. When data became too large, developers scaled vertically by upgrading their nodes for larger file storage and memory. However, a solution to scale horizontally is more cost effective and manageable. SOLR provides such a solution. It allows developers to add new nodes to their search farms and scale horizontally. To achieve horizontal scale, SOLR offers two parameters – Shards and Replicas.
- Shards: If you imagine each file as a document with a unique ID, you can think of sharding as the process of spreading documents evenly among the available nodes. For example, DOC ID: 1 might be converted to an inverted index and stored on Node x, while DOC ID: 2 is converted to an inverted index and stored on Node y. The logic inside SOLR that spreads documents among shards is either round-robin or based on a hash of a value from the document (in SOLRCloud, by default the document's unique ID). Depending on the algorithm, it is possible that DOC ID: 1 and DOC ID: 2 end up in the same shard.
When you set up a SOLR index, you specify how many shards it needs. Ideally, you will want more nodes in a SOLR cluster than shards.
- Replicas: Replicas are duplicate copies of a shard. In other words, there are multiple copies of DOC ID: 1 and DOC ID: 2. Ideally, replicas are stored on different nodes than the shard itself. They are useful for spreading the request load or for taking over as the primary shard when a failure occurs. We talk about primary shards and leader nodes later in this article.
When you set up a SOLR index, you also specify how many replicas it needs.
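The two ideas above can be sketched in a few lines of Python. This is only an illustration under my own assumptions: MD5 is a stand-in for whatever hash SOLR really uses, and `shard_for`, `place_copies` and the node names are hypothetical, not SOLR APIs.

```python
import hashlib

def shard_for(doc_id, num_shards):
    """Pick a shard from a stable hash of the document ID (stand-in hash, not SOLR's)."""
    digest = hashlib.md5(str(doc_id).encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_shards

def place_copies(num_shards, copies_per_shard, nodes):
    """Put every copy of a shard on a different node, walking the node list in order."""
    if copies_per_shard > len(nodes):
        raise ValueError("need at least as many nodes as copies per shard")
    return {
        shard: [nodes[(shard + i) % len(nodes)] for i in range(copies_per_shard)]
        for shard in range(num_shards)
    }

nodes = ["nodeA", "nodeB", "nodeC"]            # hypothetical node names
layout = place_copies(num_shards=2, copies_per_shard=2, nodes=nodes)
print(layout)                                  # which nodes hold each shard's copies
print(shard_for("DOC-1", 2), shard_for("DOC-2", 2))
```

Because the hash depends only on the document ID, the same document always lands in the same shard, and two different documents may well share a shard.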
There is no relation between the number of shards and the number of replicas. Shards are purely aimed at splitting indices so that the memory consumed and storage used stay within a node’s capability. Replicas are aimed at replacing a broken shard or serving traffic that the existing nodes cannot handle.
Decide on the number of shards based on the number of documents you expect for an index and the amount of memory a node has. A single Lucene index can have approximately 200 billion words or 2 billion documents.
Decide the number of replicas based on the traffic you expect to receive.
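As a rough back-of-the-envelope sketch, the two decisions above could be written as follows. Every input here is an assumption you would have to measure yourself (index bytes per document, usable capacity per node, queries per second one copy can serve); the function names are mine, and the document limit is the approximate figure quoted above.

```python
LUCENE_MAX_DOCS = 2_000_000_000  # approximate per-index document limit quoted above

def ceil_div(a, b):
    """Integer division rounded up."""
    return -(-a // b)

def shards_needed(expected_docs, index_bytes_per_doc, usable_bytes_per_node):
    """Enough shards to respect both the document limit and a node's capacity."""
    by_doc_limit = ceil_div(expected_docs, LUCENE_MAX_DOCS)
    by_capacity = ceil_div(expected_docs * index_bytes_per_doc, usable_bytes_per_node)
    return max(1, by_doc_limit, by_capacity)

def replicas_needed(peak_queries_per_sec, queries_per_sec_per_copy):
    """Enough copies of each shard to absorb the expected peak traffic."""
    return max(1, ceil_div(peak_queries_per_sec, queries_per_sec_per_copy))

# Hypothetical numbers: 5 billion docs, ~1 KB of index per doc, 500 GB usable per node.
print(shards_needed(5_000_000_000, 1_000, 500 * 10**9))  # 10
# Hypothetical numbers: 900 queries/sec at peak, 400 queries/sec per copy.
print(replicas_needed(900, 400))                         # 3
```

Treat the result as a starting point for load testing, not as a rule.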
In the diagram below, the developer provides the name Mycollection to SOLR when indexing documents. SOLR calls Lucene’s write APIs to index the document. The name of the index will be Mycollection. In other words, on a SOLR node you will find a folder called Mycollection that contains files written by Lucene in inverted index format.
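For a feel of where the name, the shard count and the replica count come together, here is a small sketch that builds a Collections API request URL. It assumes SOLR is running in SOLRCloud mode on `localhost:8983`, which is my assumption and not something shown in this article; the helper `create_collection_url` is hypothetical, and the snippet only builds the URL without sending it.

```python
from urllib.parse import urlencode

def create_collection_url(base_url, name, num_shards, replication_factor):
    """Build (but do not send) a Collections API CREATE request."""
    params = {
        "action": "CREATE",
        "name": name,
        "numShards": num_shards,
        "replicationFactor": replication_factor,
    }
    return f"{base_url}/admin/collections?{urlencode(params)}"

# Assumed base URL of a local SOLR node.
url = create_collection_url("http://localhost:8983/solr", "Mycollection", 2, 2)
print(url)
```

Opening such a URL against a running cluster asks SOLR to create Mycollection with two shards and two copies of each.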
SOLRCloud is the upgraded version of SOLR. Earlier versions of SOLR were called SOLRCore, and versions from 4-Beta onward are called SOLRCloud. SOLRCloud offers a new set of distributed capabilities to make SOLR highly available and fault tolerant.
SOLRCore could not automatically route write and search requests to shards or replicas. Users had to manually specify the node names in request URLs. Elasticsearch, a competing technology that also uses Lucene underneath, did not have this limitation. SOLRCloud removes it and automatically routes write and search requests to the right shards or replicas. To achieve this, SOLRCloud uses Zookeeper to maintain and synchronize state between the various SOLRCloud nodes.
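The difference shows up in the request URL. The sketch below builds both styles of search request: the older one lists every shard location in the `shards` parameter, while the SOLRCloud one only names the collection. Host names, the core name `core1` and both helper functions are hypothetical, and nothing is sent over the network.

```python
from urllib.parse import urlencode

def manual_distributed_query(host, core, query, shard_hosts):
    """Older style: the caller lists every shard location itself."""
    shards = ",".join(f"{h}/solr/{core}" for h in shard_hosts)
    return f"http://{host}/solr/{core}/select?" + urlencode({"q": query, "shards": shards})

def cloud_query(host, collection, query):
    """SOLRCloud style: ask any node, name only the collection."""
    return f"http://{host}/solr/{collection}/select?" + urlencode({"q": query})

print(manual_distributed_query("node1:8983", "core1", "solr", ["node1:8983", "node2:8983"]))
print(cloud_query("node1:8983", "Mycollection", "solr"))
```

With the first style, every client has to change when a node is added or removed; with the second, the cluster works out where the shards live.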
Zookeeper elects a leader node for every shard. A shard’s leader node receives writes for that shard and propagates them to the shard’s replicas. Zookeeper marks a node dead if it does not send sufficient keep-alives, and then elects different nodes as leaders for the shards that were previously led by the dead node. Thus, Zookeeper is central to identifying a SOLRCloud cluster’s current state. It does not, however, store any Lucene index data. All search data is still kept inside Lucene indices residing in the file system of the SOLRCloud nodes.
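Here is a toy model of that behaviour in Python. It does not talk to a real Zookeeper and is not how the election is actually implemented; `elect_leaders`, the shard names and the node names are all made up to show the effect: a live leader keeps its role, and a shard whose leader died gets a new one from its remaining live copies.

```python
def elect_leaders(replica_nodes, live_nodes, current=None):
    """Return {shard: leader}; keep a live leader, otherwise promote the first live copy."""
    current = current or {}
    leaders = {}
    for shard, nodes in replica_nodes.items():
        leader = current.get(shard)
        if leader not in live_nodes:
            # No leader yet, or the old leader stopped sending keep-alives.
            leader = next((n for n in nodes if n in live_nodes), None)
        leaders[shard] = leader
    return leaders

# Hypothetical cluster: which nodes hold a copy of each shard.
replica_nodes = {"shard1": ["nodeA", "nodeB"], "shard2": ["nodeB", "nodeC"]}
live = {"nodeA", "nodeB", "nodeC"}

before = elect_leaders(replica_nodes, live)
print(before)  # {'shard1': 'nodeA', 'shard2': 'nodeB'}

live.discard("nodeB")  # nodeB is marked dead
after = elect_leaders(replica_nodes, live, before)
print(after)   # {'shard1': 'nodeA', 'shard2': 'nodeC'}
```

Note that only the mapping of shards to leaders changes here; no index data moves, which matches the point that Zookeeper holds state and not search data.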
HDFS and SOLRCloud
Hadoop Distributed File System (HDFS) is another technology that shards and replicates large files among multiple nodes. However, unlike Lucene, it does not maintain an inverted index but works at the binary level, splitting files into blocks of 128 MB each. Imagine having a large 10 GB Lucene inverted index stored in HDFS. Suddenly, you do not require large hard disks as part of a SOLRCloud node itself. You can instead have powerful processor nodes with little to no storage space, all reading and writing data to a separate Hadoop cluster. This opens up many possibilities, like using SOLRCloud to build a NoSQL database. We can also start new SOLR replica nodes without having to copy index data: we just start a new SOLR replica node and point it to the HDFS file system.
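The block arithmetic for that 10 GB index is easy to check. The sketch below treats the index as one big file, which is a simplification (a Lucene index is really a set of files), and assumes an HDFS replication factor of 3, the usual default; the function name is mine.

```python
import math

BLOCK_SIZE_MB = 128  # block size mentioned above

def hdfs_block_count(file_size_mb, block_size_mb=BLOCK_SIZE_MB):
    """Number of blocks needed to hold a file of the given size."""
    return math.ceil(file_size_mb / block_size_mb)

index_size_mb = 10 * 1024                  # the 10 GB index from the text
blocks = hdfs_block_count(index_size_mb)
stored_copies = blocks * 3                 # assumed replication factor of 3

print(blocks)         # 80
print(stored_copies)  # 240
```

So the index becomes 80 blocks, and with three copies of each the Hadoop cluster stores 240 blocks in total, spread over its data nodes.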
There is, however, a disadvantage to using HDFS. The Namenode server, which is Hadoop’s brain, is a single point of failure. You can reduce this risk by setting up Hadoop in HA (high availability) mode, but this makes your architecture more complex. Also, every read or write of index data by a SOLRCloud node becomes a remote call to the Hadoop cluster. HDFS, though, has made efforts to minimize this latency by providing a client-side distributed file system cache similar to the local file system cache.
I suggest the following books to learn SOLRCloud, Zookeeper and HDFS.
If you would like to hear my monotonous voice on the same topic, visit: