An index lookup requires three steps:
(1) the tree traversal;
(2) following the leaf node chain;
(3) fetching the table data.
The tree traversal is the only step that has an upper bound for the number of accessed blocks—the index depth. The other two steps might need to access many blocks—they cause a slow index lookup.
索引查找需要三个步骤:(1)遍历树;(2)跟随叶子节点链;(3)获取表数据。树遍历是唯一具有访问块数上限(索引深度)的步骤。其他两个步骤可能需要访问许多块,2、3步骤会导致索引查找缓慢。
The Oracle database is rather verbose in this respect and has three distinct operations that describe a basic index lookup:
INDEX UNIQUE SCAN
The INDEX UNIQUE SCAN performs the tree traversal only. The Oracle database uses this operation if a unique constraint ensures that the search criteria will match no more than one entry.
【primary key或unique index(创建唯一约束就会有unique index)】确保查询条件匹配最多1个记录。
INDEX RANGE SCAN
The INDEX RANGE SCAN performs the tree traversal and follows the leaf node chain to find all matching entries. This is the fallback operation if multiple entries could possibly match the search criteria.
TABLE ACCESS BY INDEX ROWID
The TABLE ACCESS BY INDEX ROWID operation retrieves the row from the table. This operation is (often) performed for every matched record from a preceding index scan operation.
The important point is that an INDEX RANGE SCAN can potentially read a large part of an index. If there is one more table access for each row, the query can become slow even when using an index.
1.1 Primary Key
CREATE TABLE employees (
employee_id int NOT NULL,
first_name VARCHAR(10) NOT NULL,
last_name VARCHAR(10) NOT NULL,
date_of_birth DATE NOT NULL,
phone_number VARCHAR(10) NOT NULL,
PRIMARY KEY (`employee_id`)
)ENGINE=InnoDB DEFAULT CHARSET=utf8;
The database automatically creates an index for the primary key. That means there is an index on the EMPLOYEE_ID column, even though there is no create index statement.
database会自动为主键创建索引。即使没有create index statement。
使用主键检索员工信息
SELECT first_name, last_name
FROM employees
WHERE employee_id = 1;
Mysql中const与Oracle中index unque scan等效。在通过主键索引查到employee_id = 1
还要access by index rowid,因为它与index unque scan连用,只查一次,并没有触发表访问。也就是index unque scan + access by index rowid不会造成慢访问。
1.2 concatenated index
主键包含多列,需要注意的是:级联索引的列顺序对其可用性有很大影响。
联级索引也是B树,即有序。数据库根据其在索引定义中的位置考虑每列,以对索引条目进行排序。第一列是主要的排序标准,第二列仅在第一列中的两个条目具有相同值时确定顺序,依此类推。
假设两公司合并,另一家公司的员工被添加到employees表中,存在的问题:employee_id不能保证唯一。需要额外增加标识符来拓展主键。因此新的主键包含两列:employee_id、subsidiary_id创新建立唯一性。
create unique index employees_pk on
employees(employee_id,subsidiary_id);
当查询使用完整的primary key时,数据库会使用"index unque scan",与列数无关。但是仅使用部分列时会出现问题:
SELECT first_name, last_name
FROM employees
WHERE subsidiary_id = 20;
**执行:**此时会用"table access full"(对应Mysql中all),数据库读取整个表,并用where 子句评估每一行。执行时间与表大小有关。
不用索引理由:subsidiary_id = 20
的数据并没有存储在相邻的位置(位置随意),因为第一排序是employee_id。尽管subsidiary_id=20的均在叶子节点,但对于此查询没有用。
解决方案:可以为它创建索引,但更简单的方法是将employee_id,subsidiary_id的索引顺序调换,当subsidiary_id为第一索引时,相同的在叶子节点是相邻的,可以利用B树找到其位置。
2.1 Case-Insensitive Search Using UPPER or LOWER
collation & character set
A collation is a set of rules associated to a character set (utf8. latin1…) to determine the order of characters, and if two characters are equal. Equal characters could be for example ‘a’ and ‘á’, or ‘a’ and ‘A’ – in the latter care, a collation is case insensitive.
collation:字符集的排序规则和相等的定义。
每个字符集会有个默认排序规则,且是不区分大小写的。
# 可以用建立在last_name上索引:
SELECT* from employees WHERE
last_name=UPPER('jenny');
# 不能用建立在last_name上索引:会进行全表扫描
SELECT* from employees WHERE
upper(last_name)=UPPER('jenny');
**不能用建立在last_name上索引的原因:**在数据库视角如下面代码所示。upper是个函数,可以说该函数的参数与结果没有太大联系。
SELECT first_name, last_name, phone_number
FROM employees
WHERE BLACKBOX(...) = 'WINAND'
解决方法:1.新增列 2.加索引
//mysql不支持在函数上建索引
ALTER TABLE employees
ADD COLUMN last_name_up VARCHAR(255) AS (UPPER(last_name));
CREATE INDEX emp_up_name ON employees (last_name_up);
2.2 User-Defined Functions
用户可以自定义函数,能够用在任何查询处,但是只要参数相同,返回值则相同的函数才能用于建立索引。
# 自定义函数
CREATE FUNCTION get_age(date_of_birth DATE)
RETURN NUMBER
AS
BEGIN
RETURN
TRUNC(MONTHS_BETWEEN(SYSDATE, date_of_birth)/12);
# months_between(data1.data2)两个日期之间间隔的月份
# trunc截断
END
#用户自定义函数可以用在任何查询中
SELECT first_name, last_name, get_age(date_of_birth)
FROM employees
WHERE get_age(date_of_birth) = 42
but you cannot use the function GET_AGE in an index definition because it is not deterministic. That means the result of the function call is not fully determined by its parameters. Only functions that always return the same result for the same parameters—functions that are deterministic—can be indexed.
但是GET_AGE()不能用在索引定义中,因为它是不确定的。具体是指它的结果不取决于参数。只要参数相同,返回值则相同的函数才能用于建立索引。
The reason behind this limitation is simple. When inserting a new row, the database calls the function and stores the result in the index and there it stays, unchanged. There is no periodic process that updates the index. The database updates the indexed age only when the date of birth is changed by an update statement. After the next birthday, the age that is stored in the index will be wrong.
这种限制的理由:插入一条记录,数据库会调用这个函数,将其值存储在索引中,并保持不变,没有定期更新索引的过程。只有date of birth值update,才会更新对应索引值。下一个生日(再过一年),索引存储的年龄值将是错误的。
3.1Greater, Less and BETWEEN
The biggest performance risk of an INDEX RANGE SCAN is the leaf node traversal. It is therefore the golden rule of indexing to keep the scanned index range as small as possible. You can check that by asking yourself where an index scan starts and where it ends.
最大性能风险是叶节点的"index range scan".索引的黄金准则是让范围尽可能小。自查索引扫描的起止范围。
①sql语句明确起止范围,若对date_of_birth建立索引,则会在指定范围内展开搜索范围已确定,无法再缩小。
SELECT first_name, last_name, date_of_birth
FROM employees
WHERE date_of_birth >= TO_DATE(?, 'YYYY-MM-DD')
AND date_of_birth <= TO_DATE(?, 'YYYY-MM-DD')
②起止不明确
SELECT first_name, last_name, date_of_birth
FROM employees
WHERE date_of_birth >= "1971-01-01"
AND date_of_birth <= "1971-01-09"
AND subsidiary_id = 27;
若要在这两列建立索引,根据之前"1.2 concatenated index"中的介绍,列的先后顺序会影响性能。
1)若索引顺序先date_of_birth,后subsidiary_id
先按date_of_birth 排序,若相同,则按subsidiary_id 排序。可以说subsidiary_id整体看是无序的,因此扫描范围是date_of_birth的范围。
2)若索引顺序先subsidiary_id,后date_of_birth
subsidiary_id=1
限制了索引第一索引是单值。在第一索引值相同情况下,第二列有序。"27 26-SEP-72"指针是小于26-SEP-72的,可以有;"27 26-SEP-72"的范围是26-SEP-72至26-SEP-72的,无需所有。搜索范围只有"27 26-SEP-72"的子page。从而只需扫描一个叶子节点即可。
经验:index for equality first—then for ranges.
3.2Indexing LIKE Filters
LIKE filters can only use the characters before the first wild card during tree traversal. The remaining characters are just filter predicates that do not narrow the scanned index range. A single LIKE expression can therefore contain two predicate types: (1) the part before the first wild card as an access predicate; (2) the other characters as a filter predicate.
LIKE过滤器只能在树遍历期间使用第一个通配符之前的字符。其余字符只是过滤谓词,不会缩小扫描索引范围。LIKE表达式可以包含两种谓词类型:(1)第一个通配符之前的部分作为访问谓词;[确定范围](2)其他字符作为过滤谓词。
所以,第一个通配符前的字符越明确,则扫描索引的范围越小,从而使索引更快。
是否区别大小写和collation有关。
3.3 bitmap index(mysql不支持,oracle支持)
可以创建两个单独的索引,每列一个。然后,数据库先扫描两个索引,然后再合并结果。(需要大量的内存和CPU时间来组合中间结果)
在软件中经常会有将任意条件组合作为筛选的条件。如下图:
在实际中,不可能穷尽所有的筛选条件,即使可以成本太高。位图索引就是在单独列创建索引可以取得较好的性能。若事先知道查询,可以定制多列的联合查询,它比组合查询要快。
适用条件:①更新少或只读的表,因为维护位图会占用大量资源
②列是枚举类型,经常这些列组合查询
并发写必会造成死锁。
更详细例子
参考:https://dev.mysql.com/doc/refman/5.7/en/explain-output.html#explain-join-types.