Combining Docker and Elasticsearch makes many things in life easier. Depending on the scale, there are three scenarios for deploying Elasticsearch:

- 1. Single ES instance on single machine
- 2. Multiple ES instances on single machine 
- 3. Multiple ES instances on multiple machines

Docker has three mechanisms, one for each of the scenarios above:

- 1. Docker commands
- 2. Docker Compose
- 3. Docker Swarm or Kubernetes

I played around with Docker and Elasticsearch for a while and learned quite a few hard lessons. I summarize some of them below for future reference.

1. Choose the right Docker base image

  - One great thing about using Docker is that Elasticsearch can be painlessly [upgraded](https://www.elastic.co/guide/en/elasticsearch/reference/current/setup-upgrade.html).
  - It may be better to use the Alpine versions from the [official Dockerhub repo](https://hub.docker.com/_/elasticsearch/): no extra extensions and a smaller size.
  - Commit customized images to a registry, so they can be rolled out across the cluster or reused by other clusters.

2. Persistence of three things

  - Elasticsearch is configured through `elasticsearch.yml`, and its logging is controlled by `log4j2.properties`.
  - Three things, namely the configuration, the data and the logs, should be mounted as volumes, as in the Docker Compose file below.

  ```
  version: '3'
  services:
    elasticsearch:
      image: your-repo/elasticsearch:5.4.0
      environment:
        - cluster.name=docker-cluster
        - bootstrap.memory_lock=true
        - "ES_JAVA_OPTS=-Xms8g -Xmx8g"
        - node.name=main01
        - network.publish_host=${MAIN01IP}
      ulimits:
        nofile:
          soft: 65536
          hard: 65536
        memlock:
          soft: -1
          hard: -1
      volumes:
        - ${YOUR_PATH}/elasticsearch.yml:/usr/share/elasticsearch/config/elasticsearch.yml
        - ${YOUR_PATH}/data:/usr/share/elasticsearch/data
        - ${YOUR_PATH}/logs:/usr/share/elasticsearch/logs
        - ${YOUR_PATH}/backup:/usr/share/elasticsearch/backup
      ports:
        - 9200:9200
        - 9300:9300
  ```
  - If snapshots are desired, the backup volume should also be mounted (as above) and listed under `path.repo` in `elasticsearch.yml`.
  
3.  Three levels of backup:
    - Snapshot important indexes
    - Version control
    - Delete by alias 
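
  A minimal sketch of the snapshot level, using the official Python client and assuming the mounted `backup` volume from the Compose file above is listed in `path.repo` (the repository, snapshot, and index names here are hypothetical):

  ```
  # Register a shared filesystem repository on the mounted backup volume,
  # then snapshot only the important indexes into it
  from elasticsearch import Elasticsearch

  es = Elasticsearch(['http://localhost:9200'])

  es.snapshot.create_repository(repository='my_backup', body={
      'type': 'fs',
      'settings': {'location': '/usr/share/elasticsearch/backup'}
  })

  es.snapshot.create(repository='my_backup', snapshot='snapshot_1',
                     body={'indices': 'blogs'}, wait_for_completion=True)
  ```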
    

4. Three levels of isolation
  
  - It is always the first challenge to decide the entry point for a data source, since a wrong decision means re-indexing. Elasticsearch has three levels of isolation:
  - 1. Cluster level: multiple clusters can be formed for various purposes, although they may coexist on the same physical machines. The data on different clusters are completely isolated, unless one cluster exhausts resources that other clusters also rely on.
  - 2. Index level: data separated into different indices live on different shards. It is still possible to search and aggregate across them, such as `GET /idxA,idxB/type1,type2/_search?q=Q1`, but the indices are logically distant and cross-index operations are costly. The most important parameters for an index are the `number_of_shards` and the `number_of_replicas`.
```
# Python representation
from elasticsearch_dsl import Index

# One primary shard is enough for a small index; replicas can be added
# later, since number_of_replicas is a dynamic setting
blogs = Index('blogs')
blogs.settings(
    number_of_shards=1,
    number_of_replicas=0
)
```

  - 3. Type level: unlike a table in a relational database, a `type` itself works only as a filter. But it is the minimal unit that holds the `mapping`, which rules how documents are indexed and searched. As Elasticsearch 6 will do by default, the `_all` field should always be disabled; this saves up to 50% of disk usage (if all fields are text fields).
```
# Python representation
from elasticsearch_dsl import DocType, Date, Integer, Keyword, Text, MetaField

@blogs.doc_type
class Article(DocType):
    title = Text(analyzer='standard', fields={'raw': Keyword()})
    body = Text(analyzer='snowball')
    tags = Keyword()
    published_from = Date()
    lines = Integer()

    class Meta:
        # Disable the _all field to save disk space
        all = MetaField(enabled=False)
```
  
5. Shard management  
  - Try to index once and for all, since reindexing is painful. Elasticsearch first locates the shard a document lives on: according to the formula `shard = hash(routing) % number_of_primary_shards`, the routing string can be used to pin a particular document to a specific shard, as sketched below. Two settings worth tuning:
    - set `cluster.routing.allocation.same_shard.host=true` so that two copies of the same shard never land on the same host
    - disable `cluster.routing.allocation.allow_rebalance` at rush hours
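
  A minimal custom-routing sketch with the official Python client (the index, type, and routing values are hypothetical):

  ```
  # Documents sharing a routing value land on the same shard, so queries
  # that pass the same value only touch that shard
  from elasticsearch import Elasticsearch

  es = Elasticsearch(['http://localhost:9200'])

  # shard = hash(routing) % number_of_primary_shards
  es.index(index='blogs', doc_type='article', routing='user42',
           body={'title': 'Routing demo'})

  # This search hits only the shard that holds the 'user42' documents
  es.search(index='blogs', routing='user42',
            body={'query': {'match': {'title': 'demo'}}})
  ```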

  - Time-based indices are an alternative, while the disadvantage is unbalanced shards (one may be 20GB while others only hold a couple of MB).
  
  - The Rollover and Shrink APIs help keep such time-based indices manageable, as sketched below.
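
  A minimal rollover sketch (the alias name and conditions are hypothetical):

  ```
  # Writes go through an alias; once a condition is met, a new index is
  # created behind the alias so that shard sizes stay comparable
  from elasticsearch import Elasticsearch

  es = Elasticsearch(['http://localhost:9200'])

  es.indices.rollover(alias='logs_write', body={
      'conditions': {'max_age': '7d', 'max_docs': 10000000}
  })
  ```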
  - The advantages of customized routing are faster queries and the ease of phasing out cold data later on. Cold data can be migrated with the reroute API:
  
  ```
  POST /_cluster/reroute
  {
      "commands" : [
          {
              "move" : {
                  "index" : "test", "shard" : 0,
                  "from_node" : "node1", "to_node" : "node2"
              }
      ]
  }  
  ```
  - I think the simplest management tool is the Chrome version of Elasticsearch Head. 
  


6. Use Docker to squeeze hardware

- In my experience, Elasticsearch is mostly memory bound, since every segment costs expensive memory. Two rules should be kept in mind: give the JVM heap no more than half of the machine's RAM, leaving the rest to the filesystem cache that Lucene relies on, and keep the heap below about 32GB so that the JVM can use compressed object pointers.

- There are some hardware beasts such as the Dell 730xd, which are good candidates for multiple instances on a single machine. I have a 6-node docker-compose demo file here.

7. Use Elasticsearch to monitor Elasticsearch

- There are a lot of options to monitor an Elasticsearch cluster, such as Graphite and cAdvisor, but Elasticsearch and Kibana themselves are probably the easiest.
- Sometimes a simple Python script is enough for node-level monitoring:
```
import json
import time
from datetime import datetime

import requests
from elasticsearch import Elasticsearch

# The monitoring cluster that stores the stats
es = Elasticsearch(["IP-ADDRESS-01"])

while True:
    # Poll node-level stats from the cluster being monitored
    r = requests.get('http://IP-ADDRESS-02:9200/_nodes/stats?pretty')
    current = json.loads(r.text)
    current.update({'timestamp': datetime.utcnow()})
    es.index(index='es-stats', doc_type='data', body=current)
    time.sleep(1)
```

8. Avoid split brain


- Cluster name is very important: nodes find other nodes with the same cluster name to form a cluster. Node names should be unique.
- Add the masters' IPs to `discovery.zen.ping.unicast.hosts`. Each node will then scan ports 9300 to 9305 on those hosts.
- Set `discovery.zen.minimum_master_nodes` with the equation `(number of master-eligible nodes / 2) + 1`. If there are 3 masters, then
   `discovery.zen.minimum_master_nodes: 2`
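- Since `discovery.zen.minimum_master_nodes` is a dynamic setting, it can also be adjusted on a live cluster; a minimal sketch with the Python client (the host is hypothetical):

```
# Set the quorum on a live cluster; the value 2 assumes 3
# master-eligible nodes, per the (masters / 2) + 1 rule
from elasticsearch import Elasticsearch

es = Elasticsearch(['http://localhost:9200'])
es.cluster.put_settings(body={
    'persistent': {'discovery.zen.minimum_master_nodes': 2}
})
```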

9. Use Nginx as the Docker cluster's load balancer

- Indexing in Elasticsearch takes quite a lot of computation power, because ingesting data requires an extra tokenization step. The indexer application is usually not the rate limiter, since it can easily be multi-processed; in other words, indexing is CPU bound on the ES cluster.
- The usual way to speed up this process is to stream bulk indexing requests from a queue service such as [Kafka](https://thoughts.t37.net/how-we-reindexed-36-billions-documents-in-5-days-within-the-same-elasticsearch-cluster-cd9c054d1db8), as sketched below.
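- A minimal bulk-indexing sketch with the `elasticsearch.helpers` module (the index name and document generator are hypothetical; in practice the documents would be consumed from the queue):

```
# Stream documents into Elasticsearch in batches instead of one by one
from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk

es = Elasticsearch(['http://localhost:9200'])

def actions():
    # In practice, yield messages consumed from Kafka here
    for i in range(1000):
        yield {
            '_index': 'blogs',
            '_type': 'article',
            '_source': {'title': 'doc %d' % i},
        }

bulk(es, actions(), chunk_size=500)
```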
- A simple alternative is to use Nginx as a cheap load balancer so that every CPU is fully utilized. I have seen an ES cluster's throughput increase from 2k documents per second (DPS) to 5k DPS after applying this approach.

```
http {
    upstream docker_cluster {
        server docker_ip_1:9200;
        server docker_ip_2:9200;
        server docker_ip_3:9200;
    }
    server {
        listen 9200;
        location / {
            proxy_pass http://docker_cluster;
        }
    }
}
```
- Or, furthermore, adjust the servers with `weight` based on their hardware limits.

```
http {
    upstream docker_cluster {
        server docker_ip_1:9200 weight=5;
        server docker_ip_2:9200 weight=3;
        server docker_ip_3:9200 weight=2;
    }
}
```