Skip to content

Instantly share code, notes, and snippets.

@AlloVince
Last active July 14, 2022 12:52
Show Gist options
  • Star 1 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save AlloVince/35dc272fe0f6703b195425e07f22ed38 to your computer and use it in GitHub Desktop.
Save AlloVince/35dc272fe0f6703b195425e07f22ed38 to your computer and use it in GitHub Desktop.
npm install -g reveal-md && wget -q -O elastic.md https://gist.githubusercontent.com/AlloVince/35dc272fe0f6703b195425e07f22ed38/raw/elastic.md && reveal-md elastic.md

8102年如何入门Elastic

2018.9 @AlloVince


什么是Elastic

数据搜索/ETL/可视化的开源技术栈


  • 2000 Elasticsearch (Lucene)
  • 2012 ELK Stack
  • 2015 Elastic company -> ELK 5.0
  • 2017 ECE (Elastic Cloud Enterprise)
  • 2018 X-Pack open source

Elastic主技术栈

  • Elasticsearch
  • Logstash
  • Kibaba
  • Beats
  • X-Pack



Elastic的用途

  • Searching
  • Log Analytics
  • APM / Tracing
  • Metrics Monitoring
  • Business Analytics

Demo: [Flights] Global Flight Dashboard


安装和启动

Docker一键安装全家桶

git clone https://github.com/elastic/stack-docker.git
docker-compose -f setup.yml up

全部使用RESTFul接口, 使用Curl即可调试

GET localhost:9200

{
  "name" : "-BRIEeP",
  "cluster_name" : "docker-cluster",
  "cluster_uuid" : "WN9OpZpbR6iqmyfSCqn3gQ",
  "version" : {
    "number" : "6.4.0",
    ...
  },
  "tagline" : "You Know, for Search"
}

轻量安装

services:
  elastic:
    image: docker.elastic.co/elasticsearch/elasticsearch:6.4.0
    container_name: elastic
    environment:
      - cluster.name=docker-cluster
      - bootstrap.memory_lock=true
      - "ES_JAVA_OPTS=-Xms512m -Xmx512m"
    volumes:
      - ./elastic/elastic:/usr/share/elasticsearch/data
    ports:
      - 9200:9200
  kibana:
    image: docker.elastic.co/kibana/kibana:6.4.0
    container_name: kibana
    environment:
      ELASTICSEARCH_URL: http://elastic:9200
    volumes:
      - ./elastic/kibana.yml:/usr/share/kibana/config/kibana.yml
    ports:
      - 5601:5601
    depends_on:
      - elastic

Elastic 基本概念

  • Cluster <--> 集群
  • Node <--> 节点
  • Index <--> 类比数据表
  • Type <--> 对Index归类,已废弃
  • Document <--> 类比单行数据
  • Mapping <--> 映射, 关联原始数据与Elastic类型
  • Field <--> 字段, 一个mapping可以有多个fields
  • Shards <--> 分片

与数据库的区别


数据库索引: 顺序扫描

InnoDB主索引


InnoDB辅助索引


数据库:

侧重数据精确查找和范围筛选, 读写比较均衡


全文索引

  • 正排索引(ForwardIndex): fast indexing, less efficient query's
  • 倒排索引(InvertIndex): fast query, slower indexing

倒排索引: 词典+倒排链表


倒排索引的特点

  • 多个词叠加AND/OR搜索只需要合并链表
  • 容易支持跨字段搜索
  • 索引建立后可重复使用
  • 空间占用大
  • 建立索引慢

Elastic:

侧重数据的搜索, 牺牲写入换取较快的读取


Elastic 基础


对一个文档进行索引

PUT cn/article/1
{
  "title": "中国人民万岁"
}

查看索引情况

GET cn

列表

GET cn/_search

批量创建

POST _bulk
{ "index": { "_index": "cn", "_type": "article", "_id": 1 } }
{ "title": "中国人民万岁" }
{ "index": { "_index": "cn", "_type": "article", "_id": 2 } }
{ "title": "全世界都有中国人" }
{ "index": { "_index": "cn", "_type": "article", "_id": 3 } }
{ "title": "国与国关系" }
{ "index": { "_index": "cn", "_type": "article", "_id": 4 } }
{ "title": "中国人口多" }
{ "index": { "_index": "cn", "_type": "article", "_id": 5 } }
{ "title": "中国地大物博" }

搜索

GET cn/article/_search
{
  "query" : { "match" : { "title" : "中国人" }}
}

搜索结果分析

GET cn/article/_search
{
  "explain": true,
  "query" : { "match" : { "title" : "中国人" }}
}

搜索引擎实战


通用搜索引擎Workflow


中文分词

安装语法分析

elasticsearch-plugin install analysis-icu

重启检查分词器安装

POST _analyze
{
  "tokenizer": "icu_tokenizer",
  "text":      "中国人"
}

语法分析器构成

  • Analyzer
    • Character filter * N (字符过滤)
    • Tokenizer * 1 (分词)
    • Token filter * N (词过滤)



对索引字段配置分词

PUT cn 
{
  "settings": {
    "analysis": {
      "analyzer": {
        "cn_analyzer": {
          "type": "custom",
          "tokenizer": "icu_tokenizer"
        }
      }
    }
  },
  "mappings": {
    "article": {
      "properties": {
        "title": { 
          "type": "text",
          "analyzer":"cn_analyzer"
        }
      }
    }
  }
}

Elastic字段类型

  • String: text, keyword
  • Numeric: long, integer, short, byte, double, float, half_float, scaled_float
  • Date
  • Boolean
  • Binary
  • Range: integer_range, float_range, long_range, double_range, date_range

检查字段分词已经生效

POST cn/_analyze
{
  "field": "title",
  "text": "中国人"
}

搜索进阶


动态模板

Keep schema free


PUT cn
{
  "settings": {
    "number_of_shards": 1, 
    "analysis": {
      "analyzer": {
        "cn_analyzer": {
          "type": "custom",
          "tokenizer": "icu_tokenizer"
        }
      }
    }
  },
  "mappings": {
    "article": {
      "dynamic_templates": [
        {
          "strings": {
            "match_mapping_type": "string",
            "mapping": {
              "type": "text",
              "analyzer": "cn_analyzer",
              "fields": {
                "raw": {
                  "type":  "keyword",
                  "ignore_above": 256
                }
              }
            }
          }
        }
      ]
    }
  }
}

精确匹配

GET cn/article/_search
{
  "query" : { "term": { "title.raw": "国与国关系" }}
}

多字段权重

multi_match: {
  query: keyword,
  type: 'cross_fields',
  operator: 'and',
  fields: ['title', 'tags', 'summary^0.5']
}

组合搜索

  query: {
    bool: {
      filter: {
        term: {
          'termKeywords.raw': keyword
        }
      },
      should: ...
    }
  }

衰减函数

function_score: {
  query: {
    multi_match: {
      ....
    }
  },
  functions: [
    {
      exp: {
        timestamp: {
          origin: Datetime.now(),
          offset: 86400 * 60,
          scale: 86400 * 60
        }
      }
    }
  ],
  boost_mode: 'multiply'
}


  • 全文检索
    • match: 基本单字段and/or搜索
    • multi_match: 多字段匹配
    • match_phrase: 短语匹配, 搜索的所有词必须存在且顺序一致

  • 精确匹配
    • term
    • terms
    • range
    • wildcard
    • pregexp

  • 组合检索
    • Bool
      • must
      • filter
      • should
      • must_not
    • Function Score

相关度排序

  • TF/IDF
  • Cosine Similarity (余弦近似度)

实例


数据同步

  • Kafka Connector
  • Logstash Plugin
  • Custom script

Logstash Plugins

  • Input plugin
  • Output plugin

Logstash JDBC

  1. 下载JDBC库
  • Docker进入Logstash容器运行 yum install mysql-connector-java
  • 复制 /usr/share/java/mysql-connector-java.jar
  • 后续可以挂载运行
  1. 配置Logstash

input {
  jdbc {
    jdbc_driver_library => "/usr/share/java/mysql-connector-java.jar"
    jdbc_driver_class => "com.mysql.jdbc.Driver"
    jdbc_connection_string => "jdbc:mysql://docker.for.mac.localhost:3306/accounts"
    jdbc_user => "root"
    jdbc_password => "password"
    schedule => "* * * * *"
    statement => "SELECT * FROM accounts WHERE id >= :sql_last_value"
    use_column_value => true
    tracking_column_type => "numeric"
    tracking_column => "id"
    last_run_metadata_path => "syncpoint_table"
  }
}

output {
  elasticsearch {
    hosts    => [ 'elasticsearch' ]
    user     => 'elastic'
    password => "${ELASTIC_PASSWORD}"
    index => 'wxmp'
    document_id => "%{id}"
    document_type => 'accounts'
    ssl => true
    cacert => '/usr/share/logstash/config/certs/ca/ca.crt'
  }
}

简单数据同步并没有意义

  • 无法直接用于搜索
  • 数据增删改
  • 出错重试

准实时数据ETL

  • 数据库端
    • Iterator API
    • Single Item Read API
    • Binlog -> Message Queue
  • ES端
    • Iterator All Items
    • Pull Single Item (Upsert)
    • ETL
    • Receive notify

X-Pack

  • Security: 角色及权限
  • Alerting: 根据日志数据创建报警
  • Monitoring: ELK栈CPU、内存、磁盘等使用状况
  • Reporting: 导出Kibana数据表为PDF
  • Graph: 数据可视化高阶图表
  • Machine Learning: 非监督学习发现数据异常
  • SQL: 使用SQL语法查询数据

POST /_xpack/sql?format=txt
{
    "query": "SHOW tables"
}
POST /_xpack/sql?format=txt
{
    "query": "SELECT title FROM cn"
}

  • < 6.3: 闭源, 作为独立组件安装
  • >= 6.3: 开源, 强制集成, 提供30天试用, License只能申请

日志收集

  • Logstash (Golang)
  • Filebeat (Java)
  • Filebeat + Logstash


数据分析


评价

优点:

  • 技术栈全面, 仍然是最好的搜索和日志分析解决方案
  • RESTFul接口, 理解上手容易
  • SDK齐全,开发简单
  • 集群自动连接, 门槛低

缺点:

  • 细节屏蔽多, 深入有难度
  • 资源占用较大
  • 商业化导致免费版本不友好
  • JSON作为Query格式比较啰嗦

Refers

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment