MySQL全文索引(一)查询模式

TrumanWong
11/26/2024
TrumanWong

语法

MySQL数据库支持全文搜索Full-Text Search的查询,语法如下:

MATCH (col1,col2,...) AGAINST (expr [search_modifier])
search_modifier:
  {
       IN NATURAL LANGUAGE MODE
     | IN NATURAL LANGUAGE MODE WITH QUERY EXPANSION
     | IN BOOLEAN MODE
     | WITH QUERY EXPANSION
  }

MySQL数据库通过MATCH()…AGAINST()语法支持全文检索的查询,MATCH指定了需要被查询的列,AGAINST指定了使用何种方法去进行查询。本文将对各种查询模式进行介绍。

Natural Language

全文搜索通过MATCH函数进行查询,默认采用自然语言Natural Language模式,其表示查询带有指定word的文档。例如,查询articlescontent字段中含MySQL的记录,如果不使用全文搜索,通常情况我们可能会使用like查询:

mysql> select * from articles where content like '%MySQL%';

查看SQL语句的查询计划,可以看到这条SQL语句不会使用到索引:

mysql> explain select * from articles where content like '%MySQL%'\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: articles
   partitions: NULL
         type: ALL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 107
     filtered: 11.11
        Extra: Using where
1 row in set, 1 warning (0.00 sec)

如果使用全文搜索,可以使用以下SQL

mysql> select * from articles where MATCH(content) AGAINST('MySQL' IN NATURAL LANGUAGE MODE);

由于全文搜索默认的查询模式为NATURAL LANGUAGE MODE,因此可以省略查询修饰符:

mysql> select * from articles where MATCH(content) AGAINST('MySQL');

查看查询计划,可以看到查询使用了全文所搜,type列显示了fulltext,说明使用全文检索的倒排索引:

mysql> explain select * from articles where MATCH(content) AGAINST('MySQL')\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: articles
   partitions: NULL
         type: fulltext
possible_keys: full_text_content
          key: full_text_content
      key_len: 0
          ref: const
         rows: 1
     filtered: 100.00
        Extra: Using where; Ft_hints: sorted
1 row in set, 1 warning (0.00 sec)

如果查询列没有创建全文索引,则执行MATCH函数会抛出如下异常:

mysql> explain select * from articles where MATCH(title) AGAINST('MySQL')\G
ERROR 1191 (HY000): Can't find FULLTEXT index matching the column list

全文搜索的查询返回结果是根据相关性Relevance降序排序的,即相关性最高的结果排在第一位。相关性的值是一个非负的浮点数字,0表示没有任何的相关性。根据MySQL官方的文档可知,其相关性的计算依据以下四个条件:

为了统计Match查询的结果数,可以使用如下SQL

mysql> select count(*) from articles where MATCH(content) AGAINST ('MYSQL') ;
+----------+
| count(*) |
+----------+
|        6 |
+----------+
1 row in set (0.00 sec)

但你可能会发现以下SQL可能查询更快:

mysql> select count(IF(MATCH(content) AGAINST('MYSQL'), 1, null)) as count from articles;
+-------+
| count |
+-------+
|     6 |
+-------+
1 row in set (0.00 sec)

虽然结果都相同,但第一个查询会做一些额外的工作,如按相关性对结果进行排序。如果搜索匹配的行很少,索引查找可能会使第一个查询更快。第二个查询执行全表扫描,如果搜索词出现在大多数行中,则这可能比索引查找更快。

我们也可以查看相关性:

mysql> select id, MATCH(content) AGAINST ('MySQL') as score from articles where MATCH(content) AGAINST ('MySQL');
+----+--------------------+
| id | score              |
+----+--------------------+
| 86 |  46.96748352050781 |
| 91 |  21.91815948486328 |
| 35 | 15.655828475952148 |
| 58 | 12.524662971496582 |
| 61 | 3.1311657428741455 |
| 45 | 1.5655828714370728 |
+----+--------------------+
6 rows in set (0.00 sec)

对于InnodDB存储引擎的全文搜索,还需要考虑以下因素:

InnoDB默认停用词stopword列表相对较短,可以通过以下SQL查询默认InnoDB停用词列表:

mysql> SELECT * FROM INFORMATION_SCHEMA.INNODB_FT_DEFAULT_STOPWORD;
+-------+
| value |
+-------+
| a     |
| about |
| an    |
| are   |
| as    |
| at    |
| be    |
| by    |
| com   |
| de    |
| en    |
| for   |
| from  |
| how   |
| i     |
| in    |
| is    |
| it    |
| la    |
| of    |
| on    |
| or    |
| that  |
| the   |
| this  |
| to    |
| was   |
| what  |
| when  |
| where |
| who   |
| will  |
| with  |
| und   |
| the   |
| www   |
+-------+
36 rows in set (0.00 sec)

如果词在stopword中,则不对该词进行查询,如对this这个词进行查询,结果如下:

mysql> CREATE TABLE opening_lines (
    -> id INT UNSIGNED AUTO_INCREMENT NOT NULL PRIMARY KEY,
    -> opening_line TEXT(500),
    -> author VARCHAR(200),
    -> title VARCHAR(200)
    -> ) ENGINE=InnoDB;
Query OK, 0 rows affected (0.03 sec)

mysql>  CREATE FULLTEXT INDEX idx ON opening_lines(opening_line);
Query OK, 0 rows affected, 1 warning (0.09 sec)
Records: 0  Duplicates: 0  Warnings: 1

mysql> INSERT INTO opening_lines(opening_line,author,title) VALUES
    -> ('Call me Ishmael.','Herman Melville','Moby-Dick'),
    -> ('A screaming comes across the sky.','Thomas Pynchon','Gravity\'s Rainbow'),
    -> ('I am an invisible man.','Ralph Ellison','Invisible Man'),
    -> ('Where now? Who now? When now?','Samuel Beckett','The Unnamable'),
    -> ('It was love at first sight.','Joseph Heller','Catch-22'),
    -> ('All this happened, more or less.','Kurt Vonnegut','Slaughterhouse-Five'),
    -> ('Mrs. Dalloway said she would buy the flowers herself.','Virginia Woolf','Mrs. Dalloway'),
    -> ('It was a pleasure to burn.','Ray Bradbury','Fahrenheit 451');
Query OK, 8 rows affected (0.00 sec)
Records: 8  Duplicates: 0  Warnings: 0

mysql> select id, opening_line, MATCH(opening_line) AGAINST('this') as score from opening_lines;
+----+-------------------------------------------------------+-------+
| id | opening_line                                          | score |
+----+-------------------------------------------------------+-------+
|  1 | Call me Ishmael.                                      |     0 |
|  2 | A screaming comes across the sky.                     |     0 |
|  3 | I am an invisible man.                                |     0 |
|  4 | Where now? Who now? When now?                         |     0 |
|  5 | It was love at first sight.                           |     0 |
|  6 | All this happened, more or less.                      |     0 |
|  7 | Mrs. Dalloway said she would buy the flowers herself. |     0 |
|  8 | It was a pleasure to burn.                            |     0 |
+----+-------------------------------------------------------+-------+
8 rows in set (0.00 sec)

可以看到,this虽然在文档6中出现,但由于其是stopword,所以相关性为0。

注意:在InnoDB存储引擎中,参数innodb_ft_min_token_size的默认值为3,参数innodb_ft_max_token_size的默认值为84。因此默认情况下,长度小于 3 个字符或长度大于 84 个字符的word会被全文索引忽略。但这适用于ngram解析器,ngramngram_token_size参数控制,该参数默认值为2.

MyISAM的全文索引停用词列表可以查看storage/myisam/ft_static.c文件。

Boolean

MySQL数据库允许使用IN BOOLEAN MODE修饰符来进行全文检索。当使用该修饰符时,查询字符串的前后字符会有特殊的含义,例如下面的语句要求查询有字符串MySQL但没有YourSQL的文档,其中+-分别表示这个单词必须出现,或者一定不存在:

mysql> CREATE TABLE articles (
    ->   id INT UNSIGNED AUTO_INCREMENT NOT NULL PRIMARY KEY,
    ->   title VARCHAR(200),
    ->   body TEXT,
    ->   FULLTEXT (title,body)
    -> )  ENGINE=InnoDB;
Query OK, 0 rows affected (0.06 sec)

mysql> INSERT INTO articles (title,body) VALUES
    ->   ('MySQL Tutorial','This database tutorial ...'),
    ->   ("How To Use MySQL",'After you went through a ...'),
    ->   ('Optimizing Your Database','In this database tutorial ...'),
    ->   ('MySQL vs. YourSQL','When comparing databases ...'),
    ->   ('MySQL Security','When configured properly, MySQL ...'),
    ->   ('Database, Database, Database','database database database'),
    ->   ('1001 MySQL Tricks','1. Never run mysqld as root. 2. ...'),
    ->   ('MySQL Full-Text Indexes', 'MySQL fulltext indexes use a ..');
Query OK, 8 rows affected (0.01 sec)
Records: 8  Duplicates: 0  Warnings: 0

mysql> select * from articles where MATCH(title, body) AGAINST('+MySQL -YourSQL' IN BOOLEAN MODE);
+----+-------------------------+-------------------------------------+
| id | title                   | body                                |
+----+-------------------------+-------------------------------------+
|  5 | MySQL Security          | When configured properly, MySQL ... |
|  8 | MySQL Full-Text Indexes | MySQL fulltext indexes use a ..     |
|  1 | MySQL Tutorial          | This database tutorial ...          |
|  2 | How To Use MySQL        | After you went through a ...        |
|  7 | 1001 MySQL Tricks       | 1. Never run mysqld as root. 2. ... |
+----+-------------------------+-------------------------------------+
5 rows in set (0.01 sec)

Boolean全文索引支持以下操作符:

示例

查询有MySQL又有YourSQL的记录

mysql> select * from articles where MATCH(title, body) AGAINST('+MySQL +YourSQL' IN BOOLEAN MODDE);
+----+-------------------+------------------------------+
| id | title             | body                         |
+----+-------------------+------------------------------+
|  4 | MySQL vs. YourSQL | When comparing databases ... |
+----+-------------------+------------------------------+
1 row in set (0.00 sec)

查询有TutorialSecurity的记录

mysql> select * from articles where MATCH(title, body) AGAINST('Tutorial Security' IN BOOLEAN
MODE);
+----+--------------------------+-------------------------------------+
| id | title                    | body                                |
+----+--------------------------+-------------------------------------+
|  5 | MySQL Security           | When configured properly, MySQL ... |
|  1 | MySQL Tutorial           | This database tutorial ...          |
|  3 | Optimizing Your Database | In this database tutorial ...       |
+----+--------------------------+-------------------------------------+
3 rows in set (0.00 sec)

@distance查询

mysql> select * from articles where MATCH(title, body) AGAINST('"After through" @4' IN BOOLEAN MODE);
+----+------------------+------------------------------+
| id | title            | body                         |
+----+------------------+------------------------------+
|  2 | How To Use MySQL | After you went through a ... |
+----+------------------+------------------------------+
1 row in set (0.00 sec)

mysql> select * from articles where MATCH(title, body) AGAINST('"After through" @3' IN BOOLEAN MODE);
Empty set (0.00 sec)

mysql> select * from articles where MATCH(title, body) AGAINST('"Never root" @5' IN BOOLEAN MODDE);
+----+-------------------+-------------------------------------+
| id | title             | body                                |
+----+-------------------+-------------------------------------+
|  7 | 1001 MySQL Tricks | 1. Never run mysqld as root. 2. ... |
+----+-------------------+-------------------------------------+
1 row in set (0.00 sec)

mysql> select * from articles where MATCH(title, body) AGAINST('"Never root" @4' IN BOOLEAN MODDE);
Empty set (0.00 sec)

可以看到,从After开始计数,到through的距离为3个单词,因此@4可以返回结果,而@3没有返回任何结果,后面的Never root的查询也是同理。

查询是否有MySQLTutorial,且Tutorial相关性增加

mysql> select id, title, body, MATCH(title, body) AGAINST('MySQL >Tutorial' IN BOOLEAN MODE) ass score from articles where MATCH(title, body) AGAINST('MySQL >Tutorial' IN BOOLEAN MODE);
+----+--------------------------+-------------------------------------+----------------------+
| id | title                    | body                                | score                |
+----+--------------------------+-------------------------------------+----------------------+
|  1 | MySQL Tutorial           | This database tutorial ...          |   1.7405622005462646 |
|  3 | Optimizing Your Database | In this database tutorial ...       |   1.3624762296676636 |
|  5 | MySQL Security           | When configured properly, MySQL ... | 0.031219376251101494 |
|  8 | MySQL Full-Text Indexes  | MySQL fulltext indexes use a ..     | 0.031219376251101494 |
|  2 | How To Use MySQL         | After you went through a ...        | 0.015609688125550747 |
|  4 | MySQL vs. YourSQL        | When comparing databases ...        | 0.015609688125550747 |
|  7 | 1001 MySQL Tricks        | 1. Never run mysqld as root. 2. ... | 0.015609688125550747 |
+----+--------------------------+-------------------------------------+----------------------+

可以看到文档5虽然有两个MySQL,但是没有tutorial,因此相关性没有文档13高。

降低相关性查询

mysql> select id, title, body, MATCH(title, body) AGAINST('MySQL >Tutorial <After' IN BOOLEAN MODE) as score from arrticles where MATCH(title, body) AGAINST('MySQL >Tutorial <After' IN BOOLEAN MODE);
+----+--------------------------+-------------------------------------+----------------------+
| id | title                    | body                                | score                |
+----+--------------------------+-------------------------------------+----------------------+
|  1 | MySQL Tutorial           | This database tutorial ...          |   1.7405622005462646 |
|  3 | Optimizing Your Database | In this database tutorial ...       |   1.3624762296676636 |
|  5 | MySQL Security           | When configured properly, MySQL ... | 0.031219376251101494 |
|  8 | MySQL Full-Text Indexes  | MySQL fulltext indexes use a ..     | 0.031219376251101494 |
|  4 | MySQL vs. YourSQL        | When comparing databases ...        | 0.015609688125550747 |
|  7 | 1001 MySQL Tricks        | 1. Never run mysqld as root. 2. ... | 0.015609688125550747 |
|  2 | How To Use MySQL         | After you went through a ...        | -0.16881877183914185 |
+----+--------------------------+-------------------------------------+----------------------+
7 rows in set (0.00 sec)

可以看到,当增加<After条件后,文档2的相关性变为了负,这是因为虽然其中存在MySQL单词,但是也存在After单词,所以根据查询条件,其相关性变为了负相关。

查询以data开头的记录

mysql> select * from articles where MATCH(title, body) AGAINST('data*' IN BOOLEAN MODE);
+----+------------------------------+-------------------------------+
| id | title                        | body                          |
+----+------------------------------+-------------------------------+
|  6 | Database, Database, Database | database database database    |
|  3 | Optimizing Your Database     | In this database tutorial ... |
|  1 | MySQL Tutorial               | This database tutorial ...    |
|  4 | MySQL vs. YourSQL            | When comparing databases ...  |
+----+------------------------------+-------------------------------+
4 rows in set (0.00 sec)

可以看到返回结果均包含data开头的记录。

匹配包含MySQL Tutorial的记录

mysql> select * from articles where MATCH(title, body) AGAINST('"MySQL Tutorial"' IN BOOLEAN MODE);
+----+----------------+----------------------------+
| id | title          | body                       |
+----+----------------+----------------------------+
|  1 | MySQL Tutorial | This database tutorial ... |
+----+----------------+----------------------------+
1 row in set (0.00 sec)

mysql> select * from articles where MATCH(title, body) AGAINST('MySQL Tutorial' IN BOOLEAN MODE);
+----+--------------------------+-------------------------------------+
| id | title                    | body                                |
+----+--------------------------+-------------------------------------+
|  1 | MySQL Tutorial           | This database tutorial ...          |
|  3 | Optimizing Your Database | In this database tutorial ...       |
|  5 | MySQL Security           | When configured properly, MySQL ... |
|  8 | MySQL Full-Text Indexes  | MySQL fulltext indexes use a ..     |
|  2 | How To Use MySQL         | After you went through a ...        |
|  4 | MySQL vs. YourSQL        | When comparing databases ...        |
|  7 | 1001 MySQL Tricks        | 1. Never run mysqld as root. 2. ... |
+----+--------------------------+-------------------------------------+
7 rows in set (0.00 sec)

可以看到,第一条SQL使用"MySQL Tutorial",因此将MySQL Tutorial视为一个word,所以只有文档1符合条件;第二条SQL将其视为两个单词,所有返回了7条记录。

带查询扩展的全文搜索Query Expansion

全文搜索支持查询扩展(特别是它的变体“盲查询扩展”)。当搜索短语太短时这通常很有用,这通常意味着用户依赖全文搜索引擎缺乏的隐含知识。例如,搜索 “ database ”的用户可能真正意味着 “ MySQL ”、“ Oracle ”、“ DB2 ”和“ RDBMS ”都是应与 “ databases ”匹配的词组也应该归还。这是隐含知识。

盲查询扩展Blind query expansion(也称为自动相关性反馈automatic relevance feedback)通过添加WITH QUERY EXPANSIONIN NATURAL LANGUAGE MODE WITH QUERY EXPANSION跟随搜索短语来启用。它通过执行两次搜索来工作:

因此,如果其中一个文档包含单词 database和单词MySQL,则第二次搜索会找到包含单词 MySQL的文档,即使它们不包含单词 database。 以下示例显示了这种差异:

mysql> select * from articles where MATCH(title, body) AGAINST('database');
+----+------------------------------+-------------------------------+
| id | title                        | body                          |
+----+------------------------------+-------------------------------+
|  6 | Database, Database, Database | database database database    |
|  3 | Optimizing Your Database     | In this database tutorial ... |
|  1 | MySQL Tutorial               | This database tutorial ...    |
+----+------------------------------+-------------------------------+
3 rows in set (0.00 sec)

mysql> select * from articles where MATCH(title, body) AGAINST('database' WITH QUERY EXPANSION);
+----+------------------------------+-------------------------------------+
| id | title                        | body                                |
+----+------------------------------+-------------------------------------+
|  3 | Optimizing Your Database     | In this database tutorial ...       |
|  6 | Database, Database, Database | database database database          |
|  1 | MySQL Tutorial               | This database tutorial ...          |
|  5 | MySQL Security               | When configured properly, MySQL ... |
|  8 | MySQL Full-Text Indexes      | MySQL fulltext indexes use a ..     |
|  2 | How To Use MySQL             | After you went through a ...        |
|  4 | MySQL vs. YourSQL            | When comparing databases ...        |
|  7 | 1001 MySQL Tricks            | 1. Never run mysqld as root. 2. ... |
+----+------------------------------+-------------------------------------+
8 rows in set (0.00 sec)

可以看到,第一次查询返回了3条记录,body字段包含database关键字。第二次查询开启了Query Expansion,最后得到8条结果,除了之前包含database的记录,也有包含titlebody字段中包含MySQL的文档。这就是Query Expansion

注意:由于Query Expansion的全文检索可能带来许多非相关性的查询,因此在使用时,可能需要非常谨慎。