Notes on using Elasticsearch analyzer and search_analyzer

Environment

  • elasticsearch 6.4.3

Example

First, tokenize the following text with the ik analyzer (ik_smart):

POST http://34.0.7.184:9200/_analyze/

{
  "analyzer": "ik_smart",
  "text": "关于加快建设合肥地铁七号线的通知说明"
}

Tokenization result:

{
  "tokens": [
    { "token": "关于", "start_offset": 0,  "end_offset": 2,  "type": "CN_WORD", "position": 0 },
    { "token": "加快", "start_offset": 2,  "end_offset": 4,  "type": "CN_WORD", "position": 1 },
    { "token": "建设", "start_offset": 4,  "end_offset": 6,  "type": "CN_WORD", "position": 2 },
    { "token": "合肥", "start_offset": 6,  "end_offset": 8,  "type": "CN_WORD", "position": 3 },
    { "token": "地铁", "start_offset": 8,  "end_offset": 10, "type": "CN_WORD", "position": 4 },
    { "token": "七号", "start_offset": 10, "end_offset": 12, "type": "CN_WORD", "position": 5 },
    { "token": "线",   "start_offset": 12, "end_offset": 13, "type": "CN_CHAR", "position": 6 },
    { "token": "的",   "start_offset": 13, "end_offset": 14, "type": "CN_CHAR", "position": 7 },
    { "token": "通知", "start_offset": 14, "end_offset": 16, "type": "CN_WORD", "position": 8 },
    { "token": "说明", "start_offset": 16, "end_offset": 18, "type": "CN_WORD", "position": 9 }
  ]
}
  • If the field's analyzer is set to ik_smart (or analyzer and search_analyzer are both ik_smart), then each of the tokens above can be searched and returns the document; highlighting can be enabled to see exactly what matched.

  • If analyzer is ik_smart but search_analyzer is standard, then words such as 通知, 说明, and 七号 cannot be found, while a single character such as 线 can. The reason: the standard analyzer splits Chinese query text into individual characters, so a search for 通知 is looked up as the tokens 通 and 知, neither of which exists in the inverted index built by ik_smart; 线, on the other hand, was indexed by ik_smart as a single-character token (type CN_CHAR above), so the one-character query token matches. A mapping that reproduces this case is sketched right below.
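
A minimal sketch of a mapping for this second case, assuming the index attachment_libs, type attachment_info, and field fileName that appear in the search results below:

PUT http://34.0.7.184:9200/attachment_libs

{
  "mappings": {
    "attachment_info": {
      "properties": {
        "fileName": {
          "type": "text",
          "analyzer": "ik_smart",
          "search_analyzer": "standard"
        }
      }
    }
  }
}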

For example, searching for 关于:

POST http://34.0.7.184:9200/attachment_libs/_search

{
  "query": {
    "multi_match": {
      "query": "关于",
      "fields": [ "fileName^1.0" ],
      "type": "best_fields",
      "operator": "OR",
      "slop": 0,
      "prefix_length": 0,
      "max_expansions": 50,
      "zero_terms_query": "NONE",
      "auto_generate_synonyms_phrase_query": true,
      "fuzzy_transpositions": true,
      "boost": 1
    }
  },
  "_source": {
    "includes": [ "fileName" ],
    "excludes": [ "data" ]
  },
  "highlight": {
    "pre_tags": [ "<span style = 'color:red'>" ],
    "post_tags": [ "</span>" ],
    "fields": {
      "*": {}
    }
  }
}

The result returned is:

{
  "took": 2,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 0,
    "max_score": null,
    "hits": []
  }
}
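
The empty result can be double-checked by running the same query string through the stock _analyze API with the standard analyzer; it splits 关于 into the single-character tokens 关 and 于, neither of which is in the ik_smart-built index:

POST http://34.0.7.184:9200/_analyze/

{
  "analyzer": "standard",
  "text": "关于"
}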

Searching for 线, by contrast, returns:

{
  "took": 5,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 0.2876821,
    "hits": [
      {
        "_index": "attachment_libs",
        "_type": "attachment_info",
        "_id": "fd45d5be-c314-488a-99d3-041acc015377",
        "_score": 0.2876821,
        "_source": {
          "fileName": "关于加快建设合肥地铁七号线的通知说明"
        },
        "highlight": {
          "fileName": [
            "关于加快建设合肥地铁七号<span style = 'color:red'>线</span>的通知说明"
          ]
        }
      }
    ]
  }
}

Summary

  • Analyzers are used in two situations: at index time, when a text field is tokenized and the resulting tokens are written to the inverted index; and at query time, when the query input against a text field is first tokenized and those tokens are then looked up in the inverted index.
  • If you want different analyzers at index time and at query time, Elasticsearch supports that too: just add the search_analyzer parameter on the field. The resolution order is as follows (a sketch for verifying it appears after this list):
    1. At index time, Elasticsearch only checks whether the field defines analyzer; if it does, that analyzer is used, otherwise the built-in default (standard) is used.
    2. At query time, Elasticsearch first checks whether the field defines search_analyzer; if it does not, it falls back to analyzer; only if neither is defined does it use the built-in default.
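
A minimal sketch for checking which index-time analyzer a field resolves to, using the _analyze API's field parameter (it applies the same analyzer the field would use at index time); the index and field names are the ones from the example above:

GET http://34.0.7.184:9200/attachment_libs/_analyze

{
  "field": "fileName",
  "text": "关于加快建设合肥地铁七号线的通知说明"
}

Against the mapping sketched earlier, this should return the ik_smart tokens shown at the top of this post, confirming that search_analyzer only changes the query side.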