Running ElasticSearch in Production

(Updated version. Originally Published on: Nov 3, 2013) Here are some things to keep in mind as you go about designing your ElasticSearch cluster. Many of these are from real life experiences and IMO are the basic common sense items you should consider. In addition to these settings noted here, there maybe other settings that are relevant to your use case.

Turn off multicast. Node discovery can happen using either multicast or unicast. For production (or even lower environments IMO) turn off multicast. If you are running with master nodes then the data nodes can all have a static list of the master nodes.

Delete All Indexes (Nah). To ensure that someone does not issue a DELETE operation on all indexes (* or _all) set action.destructive_requires_name to true.

Avoiding Split-Brain. This is where the cluster (for reasons known or unknown) splits up. Say you have 10 node cluster. 4 nodes disconnect from the cluster, but they are still able to see each other. These 4 nodes then create another cluster with the same name (happy happy). They will even elect a new master among themselves. We now have two clusters with the same name; one with 6 nodes and other with 4 nodes. Each has a master node too. This is what is called split-brain issue with ES clusters.To avoid this, set the ES parameter discovery.zen.minimum_master_nodes to half the number of nodes PLUS 1. So in our example here, the 4 nodes that broke off will not equal to 6 (which is half the number of nodes + 1). Thus no split brain. We will have to fix the core issue on the 4 nodes (maybe a restart) and let them re-join the main cluster.

Use SSD to improve I/O. If you have high write frequency, avoiding I/O bottlenecks is highly desirable. Using an SSD storage is highly recommended.

Use the right JDK– Don’t use older JDK’s even if its compatible with ElasticSearch. Use the latest Oracle JDK (as approved by the particular release of ElasticSearch). Don’t use Open JDK or other implementations.

Use dedicated master nodes. For production systems use dedicated master nodes and mark the data nodes as data only (not master eligible). Make sure to run master nodes on their own serves and not run other processes on that server. Don’t send indexing or query traffic to the master nodes. Their only purpose in life is to keep the cluster healthy.

Client Nodes. These are special ElasticSearch nodes that are neither data or master eligible (node.data: false and node.master: false). Client nodes are cluster aware and therefore can act as smart load balancers. You can send your queries to the client nodes which can then take on the expensive task of gathering responses to the query results from each of the data nodes. Thus relieving the data nodes from that responsibility.

Number of shards. An ElasticSearch index is made up of 1 or more shards. By default an index is split into 5 shards. An individual shard is a Lucene index. You cannot change the number of shards at run-time for an index. To change the number of shards you will have to reindex the data to a new index.

Number of Replicas. Set it to something other than zero (for production definitely). If you need to design a highly available and fault-tolerant search; you will need to create replicas. You can change the replica count at runtime if you choose. Though make sure to factor in storage capacity if you decide to add 2 or more replicas. If the node on which the primary shard blows up; then the cluster will continue to work since a replica copy exists which can now serve as primary index.

Allocating Physical Memory. Allocate half the memory to ES and leave the other half to the OS. This will help with file system caching. Also set the min and max JVM heap size to the same value.

indices.memory.index_buffer_size defaults to 10% of heap size. Increase if you have a heavy write index.

Index Management (optional). If possible create day/weekly/monthly indexes so that you can simply discard them as time rolls on. Otherwise you would have to put in TTL’s or delete operations which would then require you to run potentially expensive optimize operations. This will be useful for larger indexes. Use Curator to schedule index management operations.

Avoid memory swapping: ElasticSearch performance can suffer when the OS decides to swap out unused application memory. Disable swapping by setting OS level settings or set the following in ElasticSearch config –  bootstrap.mlockall: true

Fielddata Cache: When running certain queries, all unique field values are loaded into memory for performance reasons. If left unbounded (the default) you can potentially bring down a node or your entire cluster. Set indices.fielddata.cache.size to an upper bound (size or percent value).  FieldData Docs.

Doc Values: (Will be the default instead of fielddata cache in ElasticSearch 2.0)If you can live with a “slightly” slower query response then throw out field data cache and use doc values instead. With field data you are limited by the amount of heap memory (either all the available heap or the cache size you allocated). With doc values are written to disk and built at index time (vs the dynamic nature of the same with field data). Downside is slightly larger index sizes and additional time to retrieve the same from disk.

Circuit Breakers: Set the field data & per request circuit breakers to ensure that either of the operations do not take up too much memory. FieldData Docs.

ulimit: You will need to increase the number of open file descriptors (-n 12346) and maybe the max locked memory (-l unlimited) settings for the user account running ElasticSearch.

Application Performance Management / Monitoring / Alerting: Use appropriate tools to gain insight into server metrics. Example, you could use NewRelic which can give you both server as well as JVM metrics for your cluster. If you have purchased support from Elastic.co then you can use a combination of Marvel + Watcher to implement both passive as well as active monitoring/alerting. If you are using the free version then you could create some scripts that periodically query cluster/index stats and then send alerts if predefined thresholds are crossed.

File Paths: Change the paths to logs and data in the elasticsearch.yml file. Example:
path:
   logs: /var/log/elasticsearch
   data: /var/data/elasticsearch

Cluster/Node names: Make sure to provide unique names to each of your nodes and also name your cluster. While the Marvel character names are fun (love it); random node names in production cluster will create headaches.

In Conclusion. ElasticSearch has great defaults to get started. But once past the initial experimentation stage you must spend some time to tweak the settings for your needs.