MySQL
数据库支持全文搜索Full-Text Search
的查询,语法如下:
MATCH (col1,col2,...) AGAINST (expr [search_modifier])
search_modifier:
{
IN NATURAL LANGUAGE MODE
| IN NATURAL LANGUAGE MODE WITH QUERY EXPANSION
| IN BOOLEAN MODE
| WITH QUERY EXPANSION
}
MySQL
数据库通过MATCH()…AGAINST()
语法支持全文检索的查询,MATCH
指定了需要被查询的列,AGAINST
指定了使用何种方法去进行查询。本文将对各种查询模式进行介绍。
Natural Language
全文搜索通过MATCH
函数进行查询,默认采用自然语言Natural Language
模式,其表示查询带有指定word
的文档。例如,查询articles
表content
字段中含MySQL
的记录,如果不使用全文搜索,通常情况我们可能会使用like
查询:
mysql> select * from articles where content like '%MySQL%';
查看SQL
语句的查询计划,可以看到这条SQL
语句不会使用到索引:
mysql> explain select * from articles where content like '%MySQL%'\G
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: articles
partitions: NULL
type: ALL
possible_keys: NULL
key: NULL
key_len: NULL
ref: NULL
rows: 107
filtered: 11.11
Extra: Using where
1 row in set, 1 warning (0.00 sec)
如果使用全文搜索,可以使用以下SQL
:
mysql> select * from articles where MATCH(content) AGAINST('MySQL' IN NATURAL LANGUAGE MODE);
由于全文搜索默认的查询模式为NATURAL LANGUAGE MODE
,因此可以省略查询修饰符:
mysql> select * from articles where MATCH(content) AGAINST('MySQL');
查看查询计划,可以看到查询使用了全文所搜,type
列显示了fulltext
,说明使用全文检索的倒排索引:
mysql> explain select * from articles where MATCH(content) AGAINST('MySQL')\G
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: articles
partitions: NULL
type: fulltext
possible_keys: full_text_content
key: full_text_content
key_len: 0
ref: const
rows: 1
filtered: 100.00
Extra: Using where; Ft_hints: sorted
1 row in set, 1 warning (0.00 sec)
如果查询列没有创建全文索引,则执行MATCH
函数会抛出如下异常:
mysql> explain select * from articles where MATCH(title) AGAINST('MySQL')\G
ERROR 1191 (HY000): Can't find FULLTEXT index matching the column list
全文搜索的查询返回结果是根据相关性Relevance
降序排序的,即相关性最高的结果排在第一位。相关性的值是一个非负的浮点数字,0
表示没有任何的相关性。根据MySQL
官方的文档可知,其相关性的计算依据以下四个条件:
为了统计Match
查询的结果数,可以使用如下SQL
:
mysql> select count(*) from articles where MATCH(content) AGAINST ('MYSQL') ;
+----------+
| count(*) |
+----------+
| 6 |
+----------+
1 row in set (0.00 sec)
但你可能会发现以下SQL
可能查询更快:
mysql> select count(IF(MATCH(content) AGAINST('MYSQL'), 1, null)) as count from articles;
+-------+
| count |
+-------+
| 6 |
+-------+
1 row in set (0.00 sec)
虽然结果都相同,但第一个查询会做一些额外的工作,如按相关性对结果进行排序。如果搜索匹配的行很少,索引查找可能会使第一个查询更快。第二个查询执行全表扫描,如果搜索词出现在大多数行中,则这可能比索引查找更快。
我们也可以查看相关性:
mysql> select id, MATCH(content) AGAINST ('MySQL') as score from articles where MATCH(content) AGAINST ('MySQL');
+----+--------------------+
| id | score |
+----+--------------------+
| 86 | 46.96748352050781 |
| 91 | 21.91815948486328 |
| 35 | 15.655828475952148 |
| 58 | 12.524662971496582 |
| 61 | 3.1311657428741455 |
| 45 | 1.5655828714370728 |
+----+--------------------+
6 rows in set (0.00 sec)
对于InnodDB
存储引擎的全文搜索,还需要考虑以下因素:
InnoDB
默认停用词stopword
列表相对较短,可以通过以下SQL
查询默认InnoDB
停用词列表:
mysql> SELECT * FROM INFORMATION_SCHEMA.INNODB_FT_DEFAULT_STOPWORD;
+-------+
| value |
+-------+
| a |
| about |
| an |
| are |
| as |
| at |
| be |
| by |
| com |
| de |
| en |
| for |
| from |
| how |
| i |
| in |
| is |
| it |
| la |
| of |
| on |
| or |
| that |
| the |
| this |
| to |
| was |
| what |
| when |
| where |
| who |
| will |
| with |
| und |
| the |
| www |
+-------+
36 rows in set (0.00 sec)
如果词在stopword
中,则不对该词进行查询,如对this
这个词进行查询,结果如下:
mysql> CREATE TABLE opening_lines (
-> id INT UNSIGNED AUTO_INCREMENT NOT NULL PRIMARY KEY,
-> opening_line TEXT(500),
-> author VARCHAR(200),
-> title VARCHAR(200)
-> ) ENGINE=InnoDB;
Query OK, 0 rows affected (0.03 sec)
mysql> CREATE FULLTEXT INDEX idx ON opening_lines(opening_line);
Query OK, 0 rows affected, 1 warning (0.09 sec)
Records: 0 Duplicates: 0 Warnings: 1
mysql> INSERT INTO opening_lines(opening_line,author,title) VALUES
-> ('Call me Ishmael.','Herman Melville','Moby-Dick'),
-> ('A screaming comes across the sky.','Thomas Pynchon','Gravity\'s Rainbow'),
-> ('I am an invisible man.','Ralph Ellison','Invisible Man'),
-> ('Where now? Who now? When now?','Samuel Beckett','The Unnamable'),
-> ('It was love at first sight.','Joseph Heller','Catch-22'),
-> ('All this happened, more or less.','Kurt Vonnegut','Slaughterhouse-Five'),
-> ('Mrs. Dalloway said she would buy the flowers herself.','Virginia Woolf','Mrs. Dalloway'),
-> ('It was a pleasure to burn.','Ray Bradbury','Fahrenheit 451');
Query OK, 8 rows affected (0.00 sec)
Records: 8 Duplicates: 0 Warnings: 0
mysql> select id, opening_line, MATCH(opening_line) AGAINST('this') as score from opening_lines;
+----+-------------------------------------------------------+-------+
| id | opening_line | score |
+----+-------------------------------------------------------+-------+
| 1 | Call me Ishmael. | 0 |
| 2 | A screaming comes across the sky. | 0 |
| 3 | I am an invisible man. | 0 |
| 4 | Where now? Who now? When now? | 0 |
| 5 | It was love at first sight. | 0 |
| 6 | All this happened, more or less. | 0 |
| 7 | Mrs. Dalloway said she would buy the flowers herself. | 0 |
| 8 | It was a pleasure to burn. | 0 |
+----+-------------------------------------------------------+-------+
8 rows in set (0.00 sec)
可以看到,this
虽然在文档6中出现,但由于其是stopword
,所以相关性为0。
InnoDB
存储引擎中,参数innodb_ft_min_token_size
的默认值为3,参数innodb_ft_max_token_size
的默认值为84。因此默认情况下,长度小于 3 个字符或长度大于 84 个字符的word
会被全文索引忽略。但这适用于ngram
解析器,ngram
受ngram_token_size
参数控制,该参数默认值为2.MyISAM
的全文索引停用词列表可以查看storage/myisam/ft_static.c
文件。
Boolean
MySQL
数据库允许使用IN BOOLEAN MODE
修饰符来进行全文检索。当使用该修饰符时,查询字符串的前后字符会有特殊的含义,例如下面的语句要求查询有字符串MySQL
但没有YourSQL
的文档,其中+
和-
分别表示这个单词必须出现,或者一定不存在:
mysql> CREATE TABLE articles (
-> id INT UNSIGNED AUTO_INCREMENT NOT NULL PRIMARY KEY,
-> title VARCHAR(200),
-> body TEXT,
-> FULLTEXT (title,body)
-> ) ENGINE=InnoDB;
Query OK, 0 rows affected (0.06 sec)
mysql> INSERT INTO articles (title,body) VALUES
-> ('MySQL Tutorial','This database tutorial ...'),
-> ("How To Use MySQL",'After you went through a ...'),
-> ('Optimizing Your Database','In this database tutorial ...'),
-> ('MySQL vs. YourSQL','When comparing databases ...'),
-> ('MySQL Security','When configured properly, MySQL ...'),
-> ('Database, Database, Database','database database database'),
-> ('1001 MySQL Tricks','1. Never run mysqld as root. 2. ...'),
-> ('MySQL Full-Text Indexes', 'MySQL fulltext indexes use a ..');
Query OK, 8 rows affected (0.01 sec)
Records: 8 Duplicates: 0 Warnings: 0
mysql> select * from articles where MATCH(title, body) AGAINST('+MySQL -YourSQL' IN BOOLEAN MODE);
+----+-------------------------+-------------------------------------+
| id | title | body |
+----+-------------------------+-------------------------------------+
| 5 | MySQL Security | When configured properly, MySQL ... |
| 8 | MySQL Full-Text Indexes | MySQL fulltext indexes use a .. |
| 1 | MySQL Tutorial | This database tutorial ... |
| 2 | How To Use MySQL | After you went through a ... |
| 7 | 1001 MySQL Tricks | 1. Never run mysqld as root. 2. ... |
+----+-------------------------+-------------------------------------+
5 rows in set (0.01 sec)
Boolean
全文索引支持以下操作符:
MySQL
又有YourSQL
的记录mysql> select * from articles where MATCH(title, body) AGAINST('+MySQL +YourSQL' IN BOOLEAN MODDE);
+----+-------------------+------------------------------+
| id | title | body |
+----+-------------------+------------------------------+
| 4 | MySQL vs. YourSQL | When comparing databases ... |
+----+-------------------+------------------------------+
1 row in set (0.00 sec)
Tutorial
或Security
的记录mysql> select * from articles where MATCH(title, body) AGAINST('Tutorial Security' IN BOOLEAN
MODE);
+----+--------------------------+-------------------------------------+
| id | title | body |
+----+--------------------------+-------------------------------------+
| 5 | MySQL Security | When configured properly, MySQL ... |
| 1 | MySQL Tutorial | This database tutorial ... |
| 3 | Optimizing Your Database | In this database tutorial ... |
+----+--------------------------+-------------------------------------+
3 rows in set (0.00 sec)
@distance
查询mysql> select * from articles where MATCH(title, body) AGAINST('"After through" @4' IN BOOLEAN MODE);
+----+------------------+------------------------------+
| id | title | body |
+----+------------------+------------------------------+
| 2 | How To Use MySQL | After you went through a ... |
+----+------------------+------------------------------+
1 row in set (0.00 sec)
mysql> select * from articles where MATCH(title, body) AGAINST('"After through" @3' IN BOOLEAN MODE);
Empty set (0.00 sec)
mysql> select * from articles where MATCH(title, body) AGAINST('"Never root" @5' IN BOOLEAN MODDE);
+----+-------------------+-------------------------------------+
| id | title | body |
+----+-------------------+-------------------------------------+
| 7 | 1001 MySQL Tricks | 1. Never run mysqld as root. 2. ... |
+----+-------------------+-------------------------------------+
1 row in set (0.00 sec)
mysql> select * from articles where MATCH(title, body) AGAINST('"Never root" @4' IN BOOLEAN MODDE);
Empty set (0.00 sec)
可以看到,从After
开始计数,到through
的距离为3个单词,因此@4
可以返回结果,而@3
没有返回任何结果,后面的Never root
的查询也是同理。
MySQL
或Tutorial
,且Tutorial
相关性增加mysql> select id, title, body, MATCH(title, body) AGAINST('MySQL >Tutorial' IN BOOLEAN MODE) ass score from articles where MATCH(title, body) AGAINST('MySQL >Tutorial' IN BOOLEAN MODE);
+----+--------------------------+-------------------------------------+----------------------+
| id | title | body | score |
+----+--------------------------+-------------------------------------+----------------------+
| 1 | MySQL Tutorial | This database tutorial ... | 1.7405622005462646 |
| 3 | Optimizing Your Database | In this database tutorial ... | 1.3624762296676636 |
| 5 | MySQL Security | When configured properly, MySQL ... | 0.031219376251101494 |
| 8 | MySQL Full-Text Indexes | MySQL fulltext indexes use a .. | 0.031219376251101494 |
| 2 | How To Use MySQL | After you went through a ... | 0.015609688125550747 |
| 4 | MySQL vs. YourSQL | When comparing databases ... | 0.015609688125550747 |
| 7 | 1001 MySQL Tricks | 1. Never run mysqld as root. 2. ... | 0.015609688125550747 |
+----+--------------------------+-------------------------------------+----------------------+
可以看到文档5虽然有两个MySQL
,但是没有tutorial
,因此相关性没有文档1
和3
高。
mysql> select id, title, body, MATCH(title, body) AGAINST('MySQL >Tutorial <After' IN BOOLEAN MODE) as score from arrticles where MATCH(title, body) AGAINST('MySQL >Tutorial <After' IN BOOLEAN MODE);
+----+--------------------------+-------------------------------------+----------------------+
| id | title | body | score |
+----+--------------------------+-------------------------------------+----------------------+
| 1 | MySQL Tutorial | This database tutorial ... | 1.7405622005462646 |
| 3 | Optimizing Your Database | In this database tutorial ... | 1.3624762296676636 |
| 5 | MySQL Security | When configured properly, MySQL ... | 0.031219376251101494 |
| 8 | MySQL Full-Text Indexes | MySQL fulltext indexes use a .. | 0.031219376251101494 |
| 4 | MySQL vs. YourSQL | When comparing databases ... | 0.015609688125550747 |
| 7 | 1001 MySQL Tricks | 1. Never run mysqld as root. 2. ... | 0.015609688125550747 |
| 2 | How To Use MySQL | After you went through a ... | -0.16881877183914185 |
+----+--------------------------+-------------------------------------+----------------------+
7 rows in set (0.00 sec)
可以看到,当增加<After
条件后,文档2
的相关性变为了负,这是因为虽然其中存在MySQL
单词,但是也存在After
单词,所以根据查询条件,其相关性变为了负相关。
data
开头的记录mysql> select * from articles where MATCH(title, body) AGAINST('data*' IN BOOLEAN MODE);
+----+------------------------------+-------------------------------+
| id | title | body |
+----+------------------------------+-------------------------------+
| 6 | Database, Database, Database | database database database |
| 3 | Optimizing Your Database | In this database tutorial ... |
| 1 | MySQL Tutorial | This database tutorial ... |
| 4 | MySQL vs. YourSQL | When comparing databases ... |
+----+------------------------------+-------------------------------+
4 rows in set (0.00 sec)
可以看到返回结果均包含data
开头的记录。
MySQL Tutorial
的记录mysql> select * from articles where MATCH(title, body) AGAINST('"MySQL Tutorial"' IN BOOLEAN MODE);
+----+----------------+----------------------------+
| id | title | body |
+----+----------------+----------------------------+
| 1 | MySQL Tutorial | This database tutorial ... |
+----+----------------+----------------------------+
1 row in set (0.00 sec)
mysql> select * from articles where MATCH(title, body) AGAINST('MySQL Tutorial' IN BOOLEAN MODE);
+----+--------------------------+-------------------------------------+
| id | title | body |
+----+--------------------------+-------------------------------------+
| 1 | MySQL Tutorial | This database tutorial ... |
| 3 | Optimizing Your Database | In this database tutorial ... |
| 5 | MySQL Security | When configured properly, MySQL ... |
| 8 | MySQL Full-Text Indexes | MySQL fulltext indexes use a .. |
| 2 | How To Use MySQL | After you went through a ... |
| 4 | MySQL vs. YourSQL | When comparing databases ... |
| 7 | 1001 MySQL Tricks | 1. Never run mysqld as root. 2. ... |
+----+--------------------------+-------------------------------------+
7 rows in set (0.00 sec)
可以看到,第一条SQL
使用"MySQL Tutorial"
,因此将MySQL Tutorial
视为一个word
,所以只有文档1
符合条件;第二条SQL
将其视为两个单词,所有返回了7条记录。
Query Expansion
全文搜索支持查询扩展(特别是它的变体“盲查询扩展”)。当搜索短语太短时这通常很有用,这通常意味着用户依赖全文搜索引擎缺乏的隐含知识。例如,搜索 “ database ”的用户可能真正意味着 “ MySQL ”、“ Oracle ”、“ DB2 ”和“ RDBMS ”都是应与 “ databases ”匹配的词组也应该归还。这是隐含知识。
盲查询扩展Blind query expansion
(也称为自动相关性反馈automatic relevance feedback
)通过添加WITH QUERY EXPANSION
或IN NATURAL LANGUAGE MODE WITH QUERY EXPANSION
跟随搜索短语来启用。它通过执行两次搜索来工作:
因此,如果其中一个文档包含单词 database
和单词MySQL
,则第二次搜索会找到包含单词 MySQL
的文档,即使它们不包含单词 database
。 以下示例显示了这种差异:
mysql> select * from articles where MATCH(title, body) AGAINST('database');
+----+------------------------------+-------------------------------+
| id | title | body |
+----+------------------------------+-------------------------------+
| 6 | Database, Database, Database | database database database |
| 3 | Optimizing Your Database | In this database tutorial ... |
| 1 | MySQL Tutorial | This database tutorial ... |
+----+------------------------------+-------------------------------+
3 rows in set (0.00 sec)
mysql> select * from articles where MATCH(title, body) AGAINST('database' WITH QUERY EXPANSION);
+----+------------------------------+-------------------------------------+
| id | title | body |
+----+------------------------------+-------------------------------------+
| 3 | Optimizing Your Database | In this database tutorial ... |
| 6 | Database, Database, Database | database database database |
| 1 | MySQL Tutorial | This database tutorial ... |
| 5 | MySQL Security | When configured properly, MySQL ... |
| 8 | MySQL Full-Text Indexes | MySQL fulltext indexes use a .. |
| 2 | How To Use MySQL | After you went through a ... |
| 4 | MySQL vs. YourSQL | When comparing databases ... |
| 7 | 1001 MySQL Tricks | 1. Never run mysqld as root. 2. ... |
+----+------------------------------+-------------------------------------+
8 rows in set (0.00 sec)
可以看到,第一次查询返回了3
条记录,body
字段包含database
关键字。第二次查询开启了Query Expansion
,最后得到8
条结果,除了之前包含database
的记录,也有包含title
或body
字段中包含MySQL
的文档。这就是Query Expansion
。
Query Expansion
的全文检索可能带来许多非相关性的查询,因此在使用时,可能需要非常谨慎。