suensummit/elasticsearch.md

## elasticsearch.md

      
##TUNING##
Configuration

System:
Set the file descriptor limit to 32K or 64K, and allow the elasticsearch user to lock memory:
vim /etc/security/limits.conf
elasticsearch - nofile 65535
elasticsearch - memlock unlimited

Use the following command to check:
curl localhost:9200/_nodes/process?pretty

"process" : {
     "refresh_interval_in_millis" : 1000,
     "id" : 2697,
     "max_file_descriptors" : 65535,
     "mlockall" : true
 }
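
The check above can also be scripted. The following is a minimal sketch, assuming the `_nodes/process` response nests each node's `process` block under `nodes.<node id>`; the function name and the `node-1` id are made up for illustration.

```python
def check_process_limits(nodes_response, min_fds=65535):
    """Return a list of human-readable problems, one entry per failed check."""
    problems = []
    for node_id, node in nodes_response.get("nodes", {}).items():
        process = node.get("process", {})
        fds = process.get("max_file_descriptors", 0)
        if fds < min_fds:
            problems.append(
                f"{node_id}: max_file_descriptors is {fds}, expected at least {min_fds}"
            )
        if not process.get("mlockall", False):
            problems.append(f"{node_id}: mlockall is not enabled")
    return problems


# Sample shaped like the output shown above; "node-1" is a hypothetical node id.
sample = {
    "nodes": {
        "node-1": {
            "process": {
                "refresh_interval_in_millis": 1000,
                "id": 2697,
                "max_file_descriptors": 65535,
                "mlockall": True,
            }
        }
    }
}

print(check_process_limits(sample))  # [] means both settings took effect
```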

Raise the mmap count limit. The command below applies only until the next reboot; to set the value permanently, update the vm.max_map_count setting in /etc/sysctl.conf.
sysctl -w vm.max_map_count=262144
# If you installed Elasticsearch using a package (.deb, .rpm), this setting
# will be changed automatically. To verify, run sysctl vm.max_map_count.

Disable swap
Set vm.swappiness to 0 in /etc/sysctl.conf:
vm.swappiness = 0
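
To confirm that both kernel settings are persisted, the contents of /etc/sysctl.conf can be checked with a small script. This is only a sketch: it parses text passed in as a string (the sample `conf` is invented), and the helper names are hypothetical.

```python
def parse_sysctl(text):
    """Parse sysctl.conf-style text into a dict of key -> value strings."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and comments (sysctl.conf accepts both # and ;).
        if not line or line.startswith(("#", ";")):
            continue
        key, sep, value = line.partition("=")
        if sep:
            settings[key.strip()] = value.strip()
    return settings


def missing_kernel_settings(text, expected):
    """Return the expected settings that are absent or set to another value."""
    actual = parse_sysctl(text)
    return {k: v for k, v in expected.items() if actual.get(k) != v}


# Values taken from the notes above.
EXPECTED = {"vm.max_map_count": "262144", "vm.swappiness": "0"}

conf = """
# Elasticsearch tuning
vm.max_map_count = 262144
"""
print(missing_kernel_settings(conf, EXPECTED))  # {'vm.swappiness': '0'}
```

On a real machine the text would come from reading /etc/sysctl.conf.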

Disk Performance

For the SSDs on r3 instances, it may be better to mount with the discard option, since they support TRIM:
vim /etc/fstab
/dev/xvdb /mnt ext4 defaults,noatime,nodiratime,discard 0 0

Use the noop scheduler for SSDs:
echo noop | sudo tee /sys/block/xvdc/queue/scheduler

ES Settings

vim /etc/default/elasticsearch
Use half of the machine's memory for the JVM heap, but do not exceed 32g:
ES_HEAP_SIZE=15g
MAX_OPEN_FILES=65535
MAX_LOCKED_MEMORY=unlimited
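
The heap rule above (half of the machine's memory, capped at 32g) is simple arithmetic. A minimal sketch follows; the function names and the two machine sizes are assumptions chosen for illustration.

```python
import math


def recommended_heap_gb(total_ram_gb, cap_gb=32):
    """Half of the machine's RAM, rounded down, never above cap_gb."""
    half = math.floor(total_ram_gb / 2)
    return max(1, min(half, cap_gb))


def heap_setting(total_ram_gb):
    """Format the value as a line for /etc/default/elasticsearch."""
    return f"ES_HEAP_SIZE={recommended_heap_gb(total_ram_gb)}g"


print(heap_setting(30.5))  # ES_HEAP_SIZE=15g
print(heap_setting(244))   # ES_HEAP_SIZE=32g (the cap applies)
```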

vim /etc/elasticsearch/elasticsearch.yml
Never swap: lock the process memory
bootstrap.mlockall: true

Indexing performance (the value after # on each line is the one being replaced)
"indices.memory.index_buffer_size": "30%",    #10%
"index.translog.flush_threshold_ops": 50000,  #1000
"index.refresh_interval": "5s",               #1s
#"index.store.type": "mmapfs"

Raise the store throttle throughput from 20mb to 100mb per second:
PUT /_cluster/settings
{
    "persistent" : {
        "indices.store.throttle.max_bytes_per_sec" : "100mb"
    }
}
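
The same settings change can be prepared from a script. This sketch only builds the HTTP request with the standard library and does not send it; `build_throttle_request` is a hypothetical helper, and `http://localhost:9200` is assumed from the curl example earlier in these notes.

```python
import json
import urllib.request


def build_throttle_request(rate="100mb", base_url="http://localhost:9200"):
    """Build (but do not send) the PUT /_cluster/settings request."""
    body = {"persistent": {"indices.store.throttle.max_bytes_per_sec": rate}}
    return urllib.request.Request(
        url=f"{base_url}/_cluster/settings",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


req = build_throttle_request()
print(req.get_method(), req.full_url)

# To send it against a running node:
# with urllib.request.urlopen(req) as resp:
#     print(resp.read().decode("utf-8"))
```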

Mapping


Elasticsearch stores the original document in the _source field; this can be disabled if it is not needed.


Elasticsearch also puts the analyzed data of all fields into the _all field; this can be disabled as well if it is not needed.
{
  '_id': 1,
  'title': 'this is first blog',
  'author': 'kakashi',
  'content': 'test 123'
}
After being stored in ES, it becomes:
{
  '_id': 1,
  '_all': 'this, is, first, blog, kakashi, test, 123',
  'title': 'this, is, first, blog',
  'author': 'kakashi',
  'content': 'test, 123',
  '_source': {
      'title': 'this is first blog',
      'author': 'kakashi',
      'content': 'test 123'
  }
}
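
The transformation above can be imitated with a toy model. This is an illustration only, not how Elasticsearch stores data internally: the `analyze` function is a stand-in that lowercases and splits on whitespace, and both function names are made up.

```python
def analyze(text):
    """Toy analyzer: lowercase and split on whitespace."""
    return text.lower().split()


def to_stored(doc, all_enabled=True, source_enabled=True):
    """Mimic the stored form: analyzed fields, plus optional _all and _source."""
    fields = {k: v for k, v in doc.items() if k != "_id"}
    stored = {"_id": doc["_id"]}
    if all_enabled:
        # _all collects the terms of every field into one searchable field.
        stored["_all"] = [term for value in fields.values() for term in analyze(value)]
    for name, value in fields.items():
        stored[name] = analyze(value)
    if source_enabled:
        # _source keeps the original, unanalyzed document.
        stored["_source"] = dict(fields)
    return stored


doc = {"_id": 1, "title": "this is first blog", "author": "kakashi", "content": "test 123"}
print(to_stored(doc)["_all"])
# ['this', 'is', 'first', 'blog', 'kakashi', 'test', '123']
```

Passing `all_enabled=False` or `source_enabled=False` shows what is dropped when either field is turned off.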


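If _source is disabled, the store option on each field decides whether that field is stored:
{
   "tweet" : {
     "properties" : {
         "message" : {
             "type" : "string",
             "store" : true,
             "index" : "analyzed"
         }
     }
   }
}


The biggest difference between _source and store: with _source, values can be updated through the update API.


For analyzed fields, if no score (relevance) needs to be computed, norms can be disabled, which saves a lot of memory.


index_options decides whether term frequencies and positions are stored.


For fields that do not need to be indexed, use "index": "no"; for fields that do not need tokenization, use "not_analyzed".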
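
The mapping options above can be combined in one mapping. The sketch below assumes the older string-type mapping syntax that these notes use (`"norms": {"enabled": false}`, `"index_options": "docs"`); the `user` and `raw_payload` fields and the function name are hypothetical.

```python
import json


def build_mapping():
    """Mapping for a 'tweet' type that trades features for memory and disk."""
    return {
        "tweet": {
            "_source": {"enabled": False},  # do not keep the original document
            "_all": {"enabled": False},     # do not build the catch-all field
            "properties": {
                "message": {
                    "type": "string",
                    "store": True,                # stored, since _source is off
                    "index": "analyzed",
                    "norms": {"enabled": False},  # no relevance scoring needed
                    "index_options": "docs",      # no term frequencies or positions
                },
                # Exact-match field: indexed as a single term, not tokenized.
                "user": {"type": "string", "index": "not_analyzed"},
                # Retrievable but not searchable.
                "raw_payload": {"type": "string", "index": "no", "store": True},
            },
        }
    }


print(json.dumps(build_mapping(), indent=2))
```

Note that with `_source` disabled the update API can no longer be used, so this combination suits write-once data.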


Ways to create a mapping


Using a template ("template" is a pattern matched against index (db) names; "blog" is the type (table) name):
PUT _template/blog-template
{
  "template": "db*",
  "mappings": {
     "blog": {
        "properties": {
          "author": {
            "type": "string",
            "index": "not_analyzed"
          },
          "content": {
            "type": "string"
          }
        }
     }
  }
}
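
To see which index names a pattern such as "db*" would cover, the wildcard match can be approximated locally. This is a rough sketch: it uses Python's `fnmatchcase`, which also treats `?` and `[...]` as special, so it only approximates the matching Elasticsearch performs; the index names are invented.

```python
from fnmatch import fnmatchcase


def template_applies(pattern, index_name):
    """Approximate check of whether a template pattern covers an index name."""
    return fnmatchcase(index_name, pattern)


for name in ["db", "db-2014", "blog"]:
    print(name, template_applies("db*", name))
# db True
# db-2014 True
# blog False
```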


Get the mapping: GET db/_mapping/


Modify the mapping of db directly: PUT db/_mapping


Indexing


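Use bulk indexing, ideally keeping each bulk request between 1MB and 5MB.
Less important data can use bulk UDP indexing (when losing some data is tolerable).
When reindexing, refresh_interval can be set to -1, with a manual refresh during bulk indexing.
Index warmers can be used to speed up searches (http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/indices-warmers.html)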
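
Keeping bulk requests within a size budget can be done by batching on the client. The sketch below builds newline-delimited bulk bodies (one action line plus one document line per entry) and starts a new body when the next entry would exceed the limit. The function name, the sample documents, and the small 16KB limit used in the demo are assumptions for illustration.

```python
import json


def bulk_bodies(docs, index, doc_type, max_bytes=5 * 1024 * 1024):
    """Yield bulk request bodies, each at most max_bytes unless one entry is larger."""
    lines, size = [], 0
    action = json.dumps({"index": {"_index": index, "_type": doc_type}})
    for doc in docs:
        entry = action + "\n" + json.dumps(doc) + "\n"
        entry_size = len(entry.encode("utf-8"))
        # Flush the current batch before it would grow past the limit.
        if lines and size + entry_size > max_bytes:
            yield "".join(lines)
            lines, size = [], 0
        lines.append(entry)
        size += entry_size
    if lines:
        yield "".join(lines)


docs = [{"author": "kakashi", "content": "test %d" % i} for i in range(1000)]
bodies = list(bulk_bodies(docs, "db", "blog", max_bytes=16 * 1024))
print(len(bodies), max(len(b.encode("utf-8")) for b in bodies))
```

Each yielded body would then be sent as one request to the `_bulk` endpoint.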

Sharding & Replica


More shards and machines -> more indexing capacity
More replicas and machines -> more read capacity
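
When planning machines, it helps to count how many shard copies an index produces: every primary shard has one copy per replica in addition to itself. A minimal sketch, with hypothetical function names and example numbers:

```python
def index_settings(shards, replicas):
    """Settings body for creating an index with the given layout."""
    return {"settings": {"number_of_shards": shards, "number_of_replicas": replicas}}


def total_shard_copies(shards, replicas):
    """Primaries plus all replica copies that must be placed on machines."""
    return shards * (1 + replicas)


print(index_settings(5, 1))
print(total_shard_copies(5, 1))  # 10
```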

##REFERENCE##

http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/setup-configuration.html
https://blog.codecentric.de/en/2014/05/elasticsearch-indexing-performance-cheatsheet/
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/mapping-core-types.html