TrumanWong

Elasticsearch Full-Text Search: The match Query

TrumanWong
1/8/2023

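Full-text search: match query with a single term

When running a full-text search in Elasticsearch, use a match query to look for specific words in a given field. For example:

# Prepare the data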
POST /myindex-match-search/_bulk
{"index": {"_id": 1}}
{"title": "The flower and the dog"}
{"index": {"_id": 2}}
{"title": "The flower and the dog are beautiful"}
{"index": {"_id": 3}}
{"title": "the dog are beautiful"}

# Run a match query
GET /myindex-match-search/_search
{
  "query": {
    "match": {
      "title": "flower"
    }
  }
}

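The statement above executes the match query in the following steps:

  1. Check the field type.

    The title field is of type text (its content is analyzed), which means the field is analyzed both at index time and at query time, and an inverted index is built at index time.

  2. Analyze the query string.

    The query string flower is passed through the standard analyzer, which outputs the single term flower. Because there is only one term, the match query executes as a single low-level term query.

  3. Find the matching documents.

    The term query looks up flower in the inverted index and retrieves the set of documents that contain that term.

  4. Score each document.

    The term query computes a score for each document.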
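The analyze and lookup steps above can be sketched in a few lines of Python. This is only an illustration, not how Elasticsearch is implemented: the `analyze` function is a rough stand-in for the standard analyzer (lowercasing and splitting on non-word characters), and `build_inverted_index` and `match_single_term` are hypothetical helper names.

```python
import re
from collections import defaultdict

# The three sample documents from the bulk request above.
DOCS = {
    1: "The flower and the dog",
    2: "The flower and the dog are beautiful",
    3: "the dog are beautiful",
}

def analyze(text):
    """Rough stand-in for the standard analyzer: lowercase, split on non-word characters."""
    return [tok for tok in re.split(r"\W+", text.lower()) if tok]

def build_inverted_index(docs):
    """Map each term to the set of document ids whose title contains it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in analyze(text):
            index[term].add(doc_id)
    return index

def match_single_term(index, query):
    """Analyze the query string, then look the single resulting term up in the index."""
    terms = analyze(query)
    if len(terms) != 1:
        raise ValueError("this sketch only handles a single-term query")
    return sorted(index.get(terms[0], set()))

INDEX = build_inverted_index(DOCS)
print(match_single_term(INDEX, "flower"))  # [1, 2]
```

Scoring (step 4) is left out here; the lookup only decides which documents are candidates.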

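With a match query, a document's score depends on the length of the field content: other things being equal, the shorter the field, the higher the score. The result is shown below:

[Figure: query result]

As shown, the documents with _id 1 and _id 2 both satisfy the query, but the document with _id 1 has shorter content, so it scores higher.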
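The effect of field length can be reproduced with a textbook BM25 formula, which is the default similarity in recent Elasticsearch versions. The sketch below is an approximation under stated assumptions: it uses the common defaults k1 = 1.2 and b = 0.75, exact token counts, and a single shard, whereas Lucene stores field lengths in a lossy encoding, so real scores will differ somewhat. The function name `bm25` is hypothetical.

```python
import math

# Tokens of the title field for the three sample documents
# (assuming lowercase tokens and no stop-word removal).
DOCS = {
    1: "the flower and the dog".split(),
    2: "the flower and the dog are beautiful".split(),
    3: "the dog are beautiful".split(),
}

K1 = 1.2   # term-frequency saturation (common default)
B = 0.75   # strength of length normalization (common default)

def bm25(term, doc_id, docs=DOCS, k1=K1, b=B):
    """Textbook BM25 score of one term for one document."""
    tokens = docs[doc_id]
    tf = tokens.count(term)
    if tf == 0:
        return 0.0
    n_docs = len(docs)
    doc_freq = sum(1 for toks in docs.values() if term in toks)
    avg_len = sum(len(toks) for toks in docs.values()) / n_docs
    idf = math.log(1 + (n_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    # Longer-than-average fields get a larger denominator, hence a lower score.
    length_norm = 1 - b + b * len(tokens) / avg_len
    return idf * tf * (k1 + 1) / (tf + k1 * length_norm)

for doc_id in DOCS:
    print(doc_id, round(bm25("flower", doc_id), 4))
```

Both matching documents contain flower once, so the only difference between their scores comes from the length term: 5 tokens for _id 1 versus 7 for _id 2.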

Full-text search: match query with multiple terms

For example:

GET /myindex-match-search/_search
{
  "query": {
    "match": {
      "title": "flower dog"
    }
  }
}

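Because the match query has to look for two terms (flower and dog), it internally runs two term queries and merges their results into the final result. To do this, it wraps the two term queries in a bool query, as follows: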

GET /myindex-match-search/_search
{
  "query": {
    "bool": {
      "should": [
        {
          "term": {
            "title": "flower"
          }
        },
        {
          "term": {
            "title": "dog"
          }
        }
      ]
    }
  }
}

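The two queries above return the same result, shown below:

[Figure: query result]

As shown, the documents with _id 1 and _id 2 both match the two terms, and the document with _id 1 has shorter content, so it scores higher. The document with _id 3 matches only one term, so it scores lower.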
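The rewrite from a multi-term match into a bool query can be sketched as a small Python function that builds the equivalent request body. This is an illustrative approximation, not Elasticsearch's internal code: `analyze` is a simplified stand-in for the standard analyzer and `match_to_bool` is a hypothetical helper name.

```python
import json
import re

def analyze(text):
    """Simplified stand-in for the standard analyzer."""
    return [tok for tok in re.split(r"\W+", text.lower()) if tok]

def match_to_bool(field, query, operator="or"):
    """Build the query a match on `field` is roughly equivalent to.

    One term  -> a single term query.
    Many terms -> a bool query with one term clause per analyzed term,
                  under "should" for operator "or" and "must" for "and".
    """
    clauses = [{"term": {field: term}} for term in analyze(query)]
    if len(clauses) == 1:
        return clauses[0]
    occur = "must" if operator == "and" else "should"
    return {"bool": {occur: clauses}}

print(json.dumps({"query": match_to_bool("title", "flower dog")}, indent=2))
```

Because every should clause that matches contributes its own score, a document matching both terms accumulates two term scores while a document matching one term gets only one.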

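Full-text search: controlling the precision of match

Using the index data from the previous examples, suppose a user supplies three query terms and wants documents that contain any two of them. Setting the operator to and or to or does not fit: and requires all three terms, while or accepts a document that matches just one. The match query supports the minimum_should_match option (the minimum number of terms that must match), which can be set to a specific number. For example: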

GET /myindex-match-search/_search
{
  "query": {
    "match": {
      "title": {
        "query": "flower dog the",
        "minimum_should_match": 3
      }
    }
  }
}

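The result is shown below:

[Figure: query result]

As shown, the result meets the requirement of the query: a document must match all three terms. Note that in practice it is more common to set the option to a percentage, because we cannot control how many words a user types into the query. For example: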

GET /myindex-match-search/_search
{
  "query": {
    "match": {
      "title": {
        "query": "flower dog the",
        "minimum_should_match": "80%"
      }
    }
  }
}

The result is shown below:

[Figure: query result]
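To see how the two settings differ on the sample data, the following Python sketch counts how many of the query terms each title contains and keeps the documents that reach the required minimum. It assumes lowercase whitespace tokenization with no stop-word removal (so the is an ordinary term) and distinct query terms; `matched_term_count` and `search` are hypothetical helper names.

```python
# The three sample documents from the bulk request.
DOCS = {
    1: "The flower and the dog",
    2: "The flower and the dog are beautiful",
    3: "the dog are beautiful",
}

def matched_term_count(doc_text, query):
    """Number of distinct query terms that occur in the document."""
    doc_terms = set(doc_text.lower().split())
    query_terms = set(query.lower().split())
    return len(doc_terms & query_terms)

def search(query, required):
    """Ids of documents that match at least `required` query terms."""
    return [doc_id for doc_id, text in DOCS.items()
            if matched_term_count(text, query) >= required]

QUERY = "flower dog the"
n_terms = len(QUERY.split())               # 3 optional clauses
required_80_percent = n_terms * 80 // 100  # 2.4 rounded down to 2

print(search(QUERY, 3))                    # [1, 2]
print(search(QUERY, required_80_percent))  # [1, 2, 3]
```

With three terms, 80% works out to two required terms, so the document with _id 3 (which contains the and dog but not flower) qualifies under the percentage setting but not under the fixed value 3.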

The value types accepted by the minimum_should_match parameter are listed in the following table:

| Type | Example | Description |
| --- | --- | --- |
| Integer | 3 | Indicates a fixed value regardless of the number of optional clauses. |
| Negative integer | -2 | Indicates that the total number of optional clauses, minus this number, should be mandatory. |
| Percentage | 75% | Indicates that this percent of the total number of optional clauses are necessary. The number computed from the percentage is rounded down and used as the minimum. |
| Negative percentage | -25% | Indicates that this percent of the total number of optional clauses can be missing. The number computed from the percentage is rounded down, before being subtracted from the total to determine the minimum. |
| Combination | 3<90% | A positive integer, followed by the less-than symbol, followed by any of the previously mentioned specifiers is a conditional specification. It indicates that if the number of optional clauses is equal to (or less than) the integer, they are all required, but if it’s greater than the integer, the specification applies. In this example: if there are 1 to 3 clauses they are all required, but for 4 or more clauses only 90% are required. |
| Multiple combinations | 2<-25% 9<-3 | Multiple conditional specifications can be separated by spaces, each one only being valid for numbers greater than the one before it. In this example: if there are 1 or 2 clauses both are required, if there are 3-9 clauses all but 25% are required, and if there are more than 9 clauses, all but three are required. |
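The rules in the table can be turned into a small calculator that returns how many optional clauses must match for a given clause count. This is a sketch written from the descriptions above, not a copy of the Elasticsearch implementation; the function name `min_should_match` is hypothetical, and edge cases such as malformed specifications are not handled.

```python
def min_should_match(clause_count, spec):
    """Number of optional clauses that must match, per the table's rules."""
    spec = str(spec).strip()
    result = clause_count
    if "<" in spec:
        # Conditional specification(s), e.g. "3<90%" or "2<-25% 9<-3".
        for part in spec.split():
            bound, inner = part.split("<", 1)
            if clause_count <= int(bound):
                return result  # at or below the bound: keep the previous result
            result = min_should_match(clause_count, inner)
        return result
    if spec.endswith("%"):
        percent = int(spec[:-1])
        magnitude = clause_count * abs(percent) // 100  # rounded down
        value = -magnitude if percent < 0 else magnitude
    else:
        value = int(spec)
    # Negative values say how many clauses may be missing.
    result = clause_count + value if value < 0 else value
    return max(result, 0)

print(min_should_match(3, "80%"))           # 2
print(min_should_match(4, "3<90%"))         # 3
print(min_should_match(10, "2<-25% 9<-3"))  # 7
```

For the three-term query used earlier, "80%" gives 2 and the plain integer 3 gives 3, which is why the two requests select different sets of documents.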