I originally assumed Python's dict was just a textbook hash table, so I never studied it closely while going through the basic types. As I dug deeper into the Python virtual machine, though, dict objects kept showing up: the VM itself uses dicts heavily, including (but not limited to) argument passing and the storage and lookup of class methods and instance attributes, all of which is described in detail in Objects/dictnotes.txt. In Python 2 and the early Python 3 releases the dict really was a conventional hash table, built from a sparse array, a counter and a probing lookup routine. Starting with Python 3.6 the dict was reimplemented around a new dense-array design, and that is still how it works today. The new layout saves a great deal of memory, makes iteration faster, and preserves insertion order, which is why dicts are now ordered.
The dense-array design was proposed by Raymond Hettinger. In 2012 Raymond described it briefly in an email that has since been widely cited; PyPy shipped the design first in a 2014 release, and the official CPython implementation followed with the same scheme in Python 3.6 in 2016.
Raymond's email is short and to the point; here is the idea restated in plain language.
Take this dict as an example:
d = {'timmy': 'red', 'barry': 'green', 'guido': 'blue'}
The sparse-array scheme stores it like this:
entries = [['--', '--', '--'],
[-8522787127447073495, 'barry', 'green'],
['--', '--', '--'],
['--', '--', '--'],
['--', '--', '--'],
[-9092791511155847987, 'timmy', 'red'],
['--', '--', '--'],
[-6480567542315338377, 'guido', 'blue']]
The table capacity here is 8. After hashing, timmy lands in the sixth slot, barry in the second and guido in the eighth. This does deliver the O(1) lookup of a hash table, but the table is mostly holes, and every empty slot still occupies the full width of an entry. That wastes memory, and iterating over the dict has to skip all those holes, which costs time as well. So Raymond suggested storing the dict like this instead:
indices = [None, 1, None, None, None, 0, None, 2]
entries = [[-9092791511155847987, 'timmy', 'red'],
[-8522787127447073495, 'barry', 'green'],
[-6480567542315338377, 'guido', 'blue']]
A separate indices table now acts as the hash table, while the real key/value pairs live in entries. Using the same dict: barry hashes to the second slot, so the second element of indices is filled in; barry was the second key inserted, so it sits at position 1 of entries (counting from 0), and that 1 is what gets stored in indices. The same goes for timmy and guido. In other words, an entries array stores the inserted key/value pairs densely, in insertion order, and the hash table itself only stores each pair's index into that array. The toy sketch below makes this concrete.
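To make the layout concrete, here is a toy pure-Python sketch of the idea (an illustration only, not CPython's code; the DenseDict name and the linear probing are simplifications):

EMPTY = -1

class DenseDict:
    """Toy model of the indices + entries layout (not CPython's real code)."""

    def __init__(self):
        self.indices = [EMPTY] * 8      # sparse hash table: slot -> position in entries
        self.entries = []               # dense list of [hash, key, value] in insertion order

    def _probe(self, key, h):
        """Return the slot holding `key`, or the first empty slot."""
        mask = len(self.indices) - 1
        i = h & mask
        while self.indices[i] != EMPTY and self.entries[self.indices[i]][1] != key:
            i = (i + 1) & mask          # simple linear probing for the toy
        return i

    def __setitem__(self, key, value):
        h = hash(key)
        i = self._probe(key, h)
        if self.indices[i] == EMPTY:                    # new key: append a dense entry
            self.indices[i] = len(self.entries)
            self.entries.append([h, key, value])
        else:                                           # existing key: overwrite the value
            self.entries[self.indices[i]][2] = value

    def __getitem__(self, key):
        i = self._probe(key, hash(key))
        if self.indices[i] == EMPTY:
            raise KeyError(key)
        return self.entries[self.indices[i]][2]

    def __iter__(self):                                 # iteration order == insertion order
        return (e[1] for e in self.entries)

d = DenseDict()
d['timmy'] = 'red'; d['barry'] = 'green'; d['guido'] = 'blue'
print(list(d), d['barry'])    # ['timmy', 'barry', 'guido'] green

Resizing is left out of the toy; the real implementation grows the table and rebuilds the indices when the usable slots run out, which is covered below.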
It is easy to see that this layout saves a lot of memory without giving up the hash table's lookup performance. According to Raymond, the compacted layout achieves memory compression in the range of 30% to 95%, and the indices array can be sized to fit the dict: for fewer than 256 entries an 8-bit index array suffices, for 256 to 65535 entries a 16-bit array, and so on, saving even more memory. The new scheme also speeds up iteration, because entries has no holes. Growing and shrinking gets cheaper too, since only indices has to be rebuilt while entries can stay put. On top of that, the denser structure produces less memory fragmentation and, because it occupies less memory, better cache hit rates. In short, it is a win on almost every axis. A small helper below sketches how the index width might be chosen.
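As a rough sketch of the index-width idea (based on the thresholds above, not the exact CPython rules):

def bytes_per_index(table_size):
    """How wide each slot of the indices array needs to be (sketch)."""
    if table_size <= 0xff:          # up to 256 slots: 1-byte indices
        return 1
    if table_size <= 0xffff:        # up to 65536 slots: 2-byte indices
        return 2
    if table_size <= 0xffffffff:    # up to 2**32 slots: 4-byte indices
        return 4
    return 8                        # beyond that: 8-byte indices

for size in (8, 1024, 100_000):
    print(size, bytes_per_index(size))   # 8 -> 1, 1024 -> 2, 100000 -> 4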
With Raymond's proposal covered, let's look at how the Python 3 dict actually implements it. In software, simplicity and efficiency are usually at odds, and the Python 3 dict firmly chose efficiency, which is a hint that the implementation is fairly involved. Let's go through it in detail. In Objects/dictobject.c the dict object is defined as follows:
/* The ma_values pointer is NULL for a combined table
* or points to an array of PyObject* for a split table
*/
typedef struct {
PyObject_HEAD
/* Number of items in the dictionary */
Py_ssize_t ma_used;
/* Dictionary version: globally unique, value change each time
the dictionary is modified */
#ifdef Py_BUILD_CORE
/* Bits 0-7 are for dict watchers.
* Bits 8-11 are for the watched mutation counter (used by tier2 optimization)
* The remaining bits (12-63) are the actual version tag. */
uint64_t ma_version_tag;
#else
Py_DEPRECATED(3.12) uint64_t ma_version_tag;
#endif
PyDictKeysObject *ma_keys;
/* If ma_values is NULL, the table is "combined": keys and values
are stored in ma_keys.
If ma_values is not NULL, the table is split:
keys are stored in ma_keys and values are stored in ma_values */
PyDictValues *ma_values;
} PyDictObject;
ma_used is the number of items in the dict; ma_version_tag can be ignored here. ma_keys and ma_values are not simply "the keys" and "the values": when ma_values is NULL the dict is a combined table and both keys and values are stored inside ma_keys; when ma_values is not NULL the dict is a split table, with keys stored in ma_keys and values stored in ma_values. Split tables are used mainly for instance attribute lookup, where the keys (ma_keys) are shared across all instances of a class while each instance keeps its own values (ma_values); that will be analyzed in detail later. PyDictKeysObject is defined as follows:
struct _dictkeysobject {
Py_ssize_t dk_refcnt;
/* Size of the hash table (dk_indices). It must be a power of 2. */
uint8_t dk_log2_size;
/* Size of the hash table (dk_indices) by bytes. */
uint8_t dk_log2_index_bytes;
/* Kind of keys */
uint8_t dk_kind;
#ifdef Py_GIL_DISABLED
/* Lock used to protect shared keys */
PyMutex dk_mutex;
#endif
/* Version number -- Reset to 0 by any modification to keys */
uint32_t dk_version;
/* Number of usable entries in dk_entries. */
Py_ssize_t dk_usable;
/* Number of used entries in dk_entries. */
Py_ssize_t dk_nentries;
/* Actual hash table of dk_size entries. It holds indices in dk_entries,
or DKIX_EMPTY(-1) or DKIX_DUMMY(-2).
Indices must be: 0 <= indice < USABLE_FRACTION(dk_size).
The size in bytes of an indice depends on dk_size:
- 1 byte if dk_size <= 0xff (char*)
- 2 bytes if dk_size <= 0xffff (int16_t*)
- 4 bytes if dk_size <= 0xffffffff (int32_t*)
- 8 bytes otherwise (int64_t*)
Dynamically sized, SIZEOF_VOID_P is minimum. */
char dk_indices[]; /* char is required to avoid strict aliasing. */
/* "PyDictKeyEntry or PyDictUnicodeEntry dk_entries[USABLE_FRACTION(DK_SIZE(dk))];" array follows:
see the DK_ENTRIES() / DK_UNICODE_ENTRIES() functions below */
};
PyDictKeysObject is the core of the dict object. Its fields are what the comments in the struct describe: dk_refcnt is a reference count (a keys object can be shared by several dicts), dk_log2_size is the log2 of the hash table size, dk_log2_index_bytes is the log2 of the index table's size in bytes, dk_kind records the key kind, dk_version is a version number reset by any modification, dk_usable and dk_nentries count the free and used entry slots, and dk_indices is the hash table itself, holding indices into the entry array that follows it in memory. PyDictValues, used only by split tables, is defined as follows:
struct _dictvalues {
uint8_t capacity;
uint8_t size;
uint8_t embedded;
uint8_t valid;
PyObject *values[1];
};
The fields at the top of PyDictValues are bookkeeping (capacity, size, whether the values are embedded and still valid), and at the end there is a values array that stores the values themselves. This object is only used when the dict is a split table; as mentioned above, its main real-world use is instance attribute storage, which will be analyzed in detail when we get there. The sketch below contrasts the combined and split layouts.
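To keep the combined/split distinction in mind for later, here is a purely conceptual Python sketch (the SharedKeys and SplitDict names are made up; real split tables live inside the C object layout):

class SharedKeys:
    """Stands in for a PyDictKeysObject shared by several split dicts (illustration only)."""
    def __init__(self):
        self.keys = []                      # insertion-ordered attribute names, stored once

    def index_of(self, key):
        if key not in self.keys:
            self.keys.append(key)
        return self.keys.index(key)

class SplitDict:
    """Stands in for a split-table dict: keys shared, values per instance."""
    def __init__(self, shared):
        self.ma_keys = shared               # shared with every other instance of the class
        self.ma_values = []                 # private to this instance

    def set(self, key, value):
        ix = self.ma_keys.index_of(key)
        while len(self.ma_values) <= ix:
            self.ma_values.append(None)
        self.ma_values[ix] = value

    def get(self, key):
        return self.ma_values[self.ma_keys.keys.index(key)]

shared = SharedKeys()                       # one keys object per class
a, b = SplitDict(shared), SplitDict(shared)
a.set("x", 1); b.set("x", 2); b.set("y", 3)
print(shared.keys, a.ma_values, b.ma_values)   # ['x', 'y'] [1] [2, 3]

A combined table, by contrast, keeps everything (keys, hashes and values) together in a single entries array owned by the dict itself.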
According to the Python C API, the function that creates a dict is PyDict_New; its source is as follows:
PyObject *
PyDict_New(void)
{
PyInterpreterState *interp = _PyInterpreterState_GET();
/* We don't incref Py_EMPTY_KEYS here because it is immortal. */
return new_dict(interp, Py_EMPTY_KEYS, NULL, 0, 0);
}
PyDict_New calls new_dict, passing in the interpreter state and Py_EMPTY_KEYS. The source of new_dict is:
static PyObject *
new_dict(PyInterpreterState *interp,
PyDictKeysObject *keys, PyDictValues *values,
Py_ssize_t used, int free_values_on_failure)
{
PyDictObject *mp;
assert(keys != NULL);
#ifdef WITH_FREELISTS
struct _Py_dict_freelist *freelist = get_dict_freelist();
if (freelist->numfree > 0) {
mp = freelist->items[--freelist->numfree];
assert (mp != NULL);
assert (Py_IS_TYPE(mp, &PyDict_Type));
OBJECT_STAT_INC(from_freelist);
_Py_NewReference((PyObject *)mp);
}
else
#endif
{
mp = PyObject_GC_New(PyDictObject, &PyDict_Type);
if (mp == NULL) {
dictkeys_decref(interp, keys, false);
if (free_values_on_failure) {
free_values(values, false);
}
return NULL;
}
}
mp->ma_keys = keys;
mp->ma_values = values;
mp->ma_used = used;
mp->ma_version_tag = DICT_NEXT_VERSION(interp);
ASSERT_CONSISTENT(mp);
return (PyObject *)mp;
}
If free lists are enabled, new_dict first tries to reuse a dict object from the free list; otherwise it allocates a new one with PyObject_GC_New. The process is straightforward; the main thing to note is that a newly created dict uses the predefined Py_EMPTY_KEYS as its ma_keys, marking it as an empty dict. The free-list pattern itself is sketched below.
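The free-list logic in new_dict follows the usual caching pattern; here is a minimal Python sketch of that pattern (the names and the limit of 80 are illustrative, not CPython code):

class Dict:
    pass

_freelist = []              # stands in for the per-interpreter dict free list
MAX_FREELIST = 80           # assumed cap, for illustration

def new_dict():
    if _freelist:                   # like freelist->numfree > 0
        return _freelist.pop()      # reuse a recently freed object, skip the allocator
    return Dict()                   # like PyObject_GC_New(...)

def free_dict(d):
    if len(_freelist) < MAX_FREELIST:
        _freelist.append(d)         # keep it around for the next new_dict()

d = new_dict()
free_dict(d)
assert new_dict() is d              # the freed object is handed back out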
Let's use PyDict_SetItemString as the entry point to walk through dict insertion. Insertion covers the creation of ma_keys and the storage of indices and key/value entries, and when inserting into a non-empty dict it also involves resizing and lookup, so once insertion is understood you have understood roughly 80% of the dict.
PyDict_SetItemString is the string-key flavor of the insert function. Following the call chain PyDict_SetItemString -> PyDict_SetItem -> _PyDict_SetItem_Take2 -> setitem_take2_lock_held we end up in setitem_take2_lock_held, which calls insert_to_emptydict when the dict is empty and insertdict when it is not.
Let's start with the empty-dict case. The source of insert_to_emptydict is:
static int
insert_to_emptydict(PyInterpreterState *interp, PyDictObject *mp,
PyObject *key, Py_hash_t hash, PyObject *value)
{
assert(mp->ma_keys == Py_EMPTY_KEYS);
ASSERT_DICT_LOCKED(mp);
int unicode = PyUnicode_CheckExact(key);
PyDictKeysObject *newkeys = new_keys_object(
interp, PyDict_LOG_MINSIZE, unicode);
if (newkeys == NULL) {
Py_DECREF(key);
Py_DECREF(value);
return -1;
}
uint64_t new_version = _PyDict_NotifyEvent(
interp, PyDict_EVENT_ADDED, mp, key, value);
/* We don't decref Py_EMPTY_KEYS here because it is immortal. */
assert(mp->ma_values == NULL);
MAINTAIN_TRACKING(mp, key, value);
size_t hashpos = (size_t)hash & (PyDict_MINSIZE-1);
dictkeys_set_index(newkeys, hashpos, 0);
if (unicode) {
PyDictUnicodeEntry *ep = DK_UNICODE_ENTRIES(newkeys);
ep->me_key = key;
STORE_VALUE(ep, value);
}
else {
PyDictKeyEntry *ep = DK_ENTRIES(newkeys);
ep->me_key = key;
ep->me_hash = hash;
STORE_VALUE(ep, value);
}
STORE_USED(mp, mp->ma_used + 1);
mp->ma_version_tag = new_version;
newkeys->dk_usable--;
newkeys->dk_nentries++;
// We store the keys last so no one can see them in a partially inconsistent
// state so that we don't need to switch the keys to being shared yet for
// the case where we're inserting from the non-owner thread. We don't use
// set_keys here because the transition from empty to non-empty is safe
// as the empty keys will never be freed.
FT_ATOMIC_STORE_PTR_RELEASE(mp->ma_keys, newkeys);
return 0;
}
The key step in insert_to_emptydict is the call to new_keys_object to create ma_keys. Its source is as follows:
static PyDictKeysObject*
new_keys_object(PyInterpreterState *interp, uint8_t log2_size, bool unicode)
{
PyDictKeysObject *dk;
Py_ssize_t usable;
int log2_bytes;
size_t entry_size = unicode ? sizeof(PyDictUnicodeEntry) : sizeof(PyDictKeyEntry);
assert(log2_size >= PyDict_LOG_MINSIZE);
usable = USABLE_FRACTION((size_t)1<<log2_size);
if (log2_size < 8) {
log2_bytes = log2_size;
}
else if (log2_size < 16) {
log2_bytes = log2_size + 1;
}
#if SIZEOF_VOID_P > 4
else if (log2_size >= 32) {
log2_bytes = log2_size + 3;
}
#endif
else {
log2_bytes = log2_size + 2;
}
#ifdef WITH_FREELISTS
struct _Py_dictkeys_freelist *freelist = get_dictkeys_freelist();
if (log2_size == PyDict_LOG_MINSIZE && unicode && freelist->numfree > 0) {
dk = freelist->items[--freelist->numfree];
OBJECT_STAT_INC(from_freelist);
}
else
#endif
{
dk = PyMem_Malloc(sizeof(PyDictKeysObject)
+ ((size_t)1 << log2_bytes)
+ entry_size * usable);
if (dk == NULL) {
PyErr_NoMemory();
return NULL;
}
}
#ifdef Py_REF_DEBUG
_Py_IncRefTotal(_PyThreadState_GET());
#endif
dk->dk_refcnt = 1;
dk->dk_log2_size = log2_size;
dk->dk_log2_index_bytes = log2_bytes;
dk->dk_kind = unicode ? DICT_KEYS_UNICODE : DICT_KEYS_GENERAL;
#ifdef Py_GIL_DISABLED
dk->dk_mutex = (PyMutex){0};
#endif
dk->dk_nentries = 0;
dk->dk_usable = usable;
dk->dk_version = 0;
memset(&dk->dk_indices[0], 0xff, ((size_t)1 << log2_bytes));
memset(&dk->dk_indices[(size_t)1 << log2_bytes], 0, entry_size * usable);
return dk;
}
Here the log2_size argument is PyDict_LOG_MINSIZE and unicode is 1 (assuming string keys). The function first computes log2_bytes. Both log2_size and log2_bytes are base-2 logarithms of real sizes: log2_size is the log2 of the number of hash slots, and log2_bytes is the log2 of the index table's size in bytes. As the comments on _dictkeysobject explain, each index occupies 1 byte (char), 2 bytes (int16), 4 bytes (int32) or 8 bytes (int64) as the table grows, so the relationship is log2_bytes = log2_size + 0/1/2/3 respectively, which is exactly what the branches on log2_size compute.
Ignoring the free-list path, look at how the memory for ma_keys is allocated. The allocation size is sizeof(PyDictKeysObject) + ((size_t)1 << log2_bytes) + entry_size * usable: the first term is the PyDictKeysObject header itself, and the rest covers the indices and entries arrays that follow it in memory, so both of those sizes have to be computed. (size_t)1 << log2_bytes shifts 1 left by log2_bytes, i.e. 2 to the power of log2_bytes, which is the byte size of the index table. Since we are assuming string keys, entry_size is sizeof(PyDictUnicodeEntry). usable is the number of usable entry slots, computed as usable = USABLE_FRACTION((size_t)1 << log2_size): 1 << log2_size is the total number of hash slots, and USABLE_FRACTION keeps about two thirds of them (its definition is ((n << 1) / 3)), which is the dict's load-factor limit. The small sketch below redoes this arithmetic in Python.
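Here is that arithmetic redone in Python (USABLE_FRACTION and the byte-width branches mirror the C code described above; the header and entry sizes are made-up placeholders):

def usable_fraction(n):
    """CPython's USABLE_FRACTION: about two thirds of the table size."""
    return (n << 1) // 3

def log2_index_bytes(log2_size):
    """Mirror the branches in new_keys_object: 1/2/4/8-byte indices."""
    if log2_size < 8:
        return log2_size            # 1-byte indices
    if log2_size < 16:
        return log2_size + 1        # 2-byte indices
    if log2_size < 32:
        return log2_size + 2        # 4-byte indices
    return log2_size + 3            # 8-byte indices

PyDict_LOG_MINSIZE = 3              # the default table has 2**3 == 8 slots
HEADER = 48                         # assumed sizeof(PyDictKeysObject), illustrative only
UNICODE_ENTRY = 16                  # assumed sizeof(PyDictUnicodeEntry), illustrative only

log2_size = PyDict_LOG_MINSIZE
usable = usable_fraction(1 << log2_size)            # 5 usable entries out of 8 slots
total = HEADER + (1 << log2_index_bytes(log2_size)) + UNICODE_ENTRY * usable
print(usable, total)                                # 5, header + 8 index bytes + 5 entries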
After the ma_keys object is created, its fields are filled in, and memset initializes the memory of indices (to 0xff, so every slot reads as DKIX_EMPTY, i.e. -1) and of entries (to zero). At this point the meaning of each ma_keys field is quite clear.
Back in insert_to_emptydict, the hash position for the new key is computed by simply ANDing the hash with the table size minus one, which maps the hash into the table's range: the default empty-dict table has 8 slots with positions 0 to 7, so hash & (PyDict_MINSIZE-1) yields the slot. Then dictkeys_set_index is called to write the index; its source is:
/* write to indices. */
static inline void
dictkeys_set_index(PyDictKeysObject *keys, Py_ssize_t i, Py_ssize_t ix)
{
int log2size = DK_LOG_SIZE(keys);
assert(ix >= DKIX_DUMMY);
assert(keys->dk_version == 0);
if (log2size < 8) {
assert(ix <= 0x7f);
STORE_INDEX(keys, 8, i, ix);
}
else if (log2size < 16) {
assert(ix <= 0x7fff);
STORE_INDEX(keys, 16, i, ix);
}
#if SIZEOF_VOID_P > 4
else if (log2size >= 32) {
STORE_INDEX(keys, 64, i, ix);
}
#endif
else {
assert(ix <= 0x7fffffff);
STORE_INDEX(keys, 32, i, ix);
}
}
Setting an index mostly comes down to picking the right width: depending on the table size, STORE_INDEX writes the index value into slot i as an 8-, 16-, 32- or 64-bit integer. Since we are inserting into an empty dict, the value written here is 0, the position of the new entry in the entry array.
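As a quick sanity check of the masking trick used above to compute hashpos (it works because the table size is always a power of two), using the three hashes from Raymond's example:

size = 8                     # PyDict_MINSIZE, the default table size
mask = size - 1              # 0b111

hashes = {'timmy': -9092791511155847987,
          'barry': -8522787127447073495,
          'guido': -6480567542315338377}
for name, h in hashes.items():
    assert h & mask == h % size          # masking == modulo when size is a power of two
    print(name, h & mask)                # timmy 5, barry 1, guido 7 -- the slots used earlier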
To recap, the dict consists of the indices table plus the entries, so after writing the index the entry itself still has to be filled in, which is straightforward. DK_UNICODE_ENTRIES and DK_ENTRIES return the same location inside newkeys, just cast to different pointer types: PyDictUnicodeEntry* for string-key tables and PyDictKeyEntry* for general tables. For a string key only me_key is set and STORE_VALUE stores me_value (a string's hash is cached inside the string object itself); for a general key the hash is stored in me_hash as well, so it never has to be recomputed. Finally dk_usable is decremented, dk_nentries is incremented, and the freshly created newkeys is published as the dict's ma_keys. That completes insertion into an empty dict.
If the dict is not empty, insertdict is called instead; its source is:
static int
insertdict(PyInterpreterState *interp, PyDictObject *mp,
PyObject *key, Py_hash_t hash, PyObject *value)
{
PyObject *old_value;
ASSERT_DICT_LOCKED(mp);
if (DK_IS_UNICODE(mp->ma_keys) && !PyUnicode_CheckExact(key)) {
if (insertion_resize(interp, mp, 0) < 0)
goto Fail;
assert(mp->ma_keys->dk_kind == DICT_KEYS_GENERAL);
}
if (_PyDict_HasSplitTable(mp)) {
Py_ssize_t ix = insert_split_key(mp->ma_keys, key, hash);
if (ix != DKIX_EMPTY) {
insert_split_value(interp, mp, key, value, ix);
Py_DECREF(key);
Py_DECREF(value);
return 0;
}
/* No space in shared keys. Resize and continue below. */
if (insertion_resize(interp, mp, 1) < 0) {
goto Fail;
}
}
Py_ssize_t ix = _Py_dict_lookup(mp, key, hash, &old_value);
if (ix == DKIX_ERROR)
goto Fail;
MAINTAIN_TRACKING(mp, key, value);
if (ix == DKIX_EMPTY) {
assert(!_PyDict_HasSplitTable(mp));
/* Insert into new slot. */
assert(old_value == NULL);
if (insert_combined_dict(interp, mp, hash, key, value) < 0) {
goto Fail;
}
STORE_USED(mp, mp->ma_used + 1);
ASSERT_CONSISTENT(mp);
return 0;
}
if (old_value != value) {
uint64_t new_version = _PyDict_NotifyEvent(
interp, PyDict_EVENT_MODIFIED, mp, key, value);
assert(old_value != NULL);
assert(!_PyDict_HasSplitTable(mp));
if (DK_IS_UNICODE(mp->ma_keys)) {
PyDictUnicodeEntry *ep = &DK_UNICODE_ENTRIES(mp->ma_keys)[ix];
STORE_VALUE(ep, value);
}
else {
PyDictKeyEntry *ep = &DK_ENTRIES(mp->ma_keys)[ix];
STORE_VALUE(ep, value);
}
mp->ma_version_tag = new_version;
}
Py_XDECREF(old_value); /* which **CAN** re-enter (see issue #22653) */
ASSERT_CONSISTENT(mp);
Py_DECREF(key);
return 0;
Fail:
Py_DECREF(value);
Py_DECREF(key);
return -1;
}
insertdict decides whether the dict needs to be resized, but the insertion_resize call at the top of the function is not really about size: it is about the key kind. A dict starts out as a unicode (string-key) table, and if the key being inserted is not a string the table has to be converted to the general kind; insertion_resize also performs that conversion. Its source is:
static int
insertion_resize(PyInterpreterState *interp, PyDictObject *mp, int unicode)
{
return dictresize(interp, mp, calculate_log2_keysize(GROWTH_RATE(mp)), unicode);
}
insertion_resize calls dictresize, passing the dict, the new table size and whether the keys are unicode. The new size log2_newsize is computed with calculate_log2_keysize(GROWTH_RATE(mp)): GROWTH_RATE determines the target table size from the growth factor, which in the current version is three times the number of used entries, i.e. ma_used * 3, and calculate_log2_keysize essentially computes the position of the most significant bit of that number, equivalent to floor(log2(x)) + 1, giving the exponent of the smallest power of two large enough to hold it. A sketch of this computation follows.
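A small sketch of that size computation, under the assumption that calculate_log2_keysize reduces to a bit-length calculation as described:

PyDict_LOG_MINSIZE = 3

def growth_rate(ma_used):
    """Current growth rule: aim for a table of at least three times the used entries."""
    return ma_used * 3

def calculate_log2_keysize(minsize):
    """Log2 of the smallest power of two >= minsize, never below the minimum size."""
    return max(PyDict_LOG_MINSIZE, (minsize - 1).bit_length())

# A dict with 5 items asks for a table of at least 15 slots -> 16 slots (2**4),
# of which roughly two thirds are usable entry slots.
print(calculate_log2_keysize(growth_rate(5)))    # 4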
With the new size in hand, enter dictresize; its source is:
/*
Restructure the table by allocating a new table and reinserting all
items again. When entries have been deleted, the new table may
actually be smaller than the old one.
If a table is split (its keys and hashes are shared, its values are not),
then the values are temporarily copied into the table, it is resized as
a combined table, then the me_value slots in the old table are NULLed out.
After resizing, a table is always combined.
This function supports:
- Unicode split -> Unicode combined or Generic
- Unicode combined -> Unicode combined or Generic
- Generic -> Generic
*/
static int
dictresize(PyInterpreterState *interp, PyDictObject *mp,
uint8_t log2_newsize, int unicode)
{
PyDictKeysObject *oldkeys, *newkeys;
PyDictValues *oldvalues;
ASSERT_DICT_LOCKED(mp);
if (log2_newsize >= SIZEOF_SIZE_T*8) {
PyErr_NoMemory();
return -1;
}
assert(log2_newsize >= PyDict_LOG_MINSIZE);
oldkeys = mp->ma_keys;
oldvalues = mp->ma_values;
if (!DK_IS_UNICODE(oldkeys)) {
unicode = 0;
}
ensure_shared_on_resize(mp);
/* NOTE: Current odict checks mp->ma_keys to detect resize happen.
* So we can't reuse oldkeys even if oldkeys->dk_size == newsize.
* TODO: Try reusing oldkeys when reimplement odict.
*/
/* Allocate a new table. */
newkeys = new_keys_object(interp, log2_newsize, unicode);
if (newkeys == NULL) {
return -1;
}
// New table must be large enough.
assert(newkeys->dk_usable >= mp->ma_used);
Py_ssize_t numentries = mp->ma_used;
if (oldvalues != NULL) {
LOCK_KEYS(oldkeys);
PyDictUnicodeEntry *oldentries = DK_UNICODE_ENTRIES(oldkeys);
/* Convert split table into new combined table.
* We must incref keys; we can transfer values.
*/
if (newkeys->dk_kind == DICT_KEYS_GENERAL) {
// split -> generic
PyDictKeyEntry *newentries = DK_ENTRIES(newkeys);
for (Py_ssize_t i = 0; i < numentries; i++) {
int index = get_index_from_order(mp, i);
PyDictUnicodeEntry *ep = &oldentries[index];
assert(oldvalues->values[index] != NULL);
newentries[i].me_key = Py_NewRef(ep->me_key);
newentries[i].me_hash = unicode_get_hash(ep->me_key);
newentries[i].me_value = oldvalues->values[index];
}
build_indices_generic(newkeys, newentries, numentries);
}
else { // split -> combined unicode
PyDictUnicodeEntry *newentries = DK_UNICODE_ENTRIES(newkeys);
for (Py_ssize_t i = 0; i < numentries; i++) {
int index = get_index_from_order(mp, i);
PyDictUnicodeEntry *ep = &oldentries[index];
assert(oldvalues->values[index] != NULL);
newentries[i].me_key = Py_NewRef(ep->me_key);
newentries[i].me_value = oldvalues->values[index];
}
build_indices_unicode(newkeys, newentries, numentries);
}
UNLOCK_KEYS(oldkeys);
set_keys(mp, newkeys);
dictkeys_decref(interp, oldkeys, IS_DICT_SHARED(mp));
set_values(mp, NULL);
if (oldvalues->embedded) {
assert(oldvalues->embedded == 1);
assert(oldvalues->valid == 1);
FT_ATOMIC_STORE_UINT8(oldvalues->valid, 0);
}
else {
free_values(oldvalues, IS_DICT_SHARED(mp));
}
}
else { // oldkeys is combined.
if (oldkeys->dk_kind == DICT_KEYS_GENERAL) {
// generic -> generic
assert(newkeys->dk_kind == DICT_KEYS_GENERAL);
PyDictKeyEntry *oldentries = DK_ENTRIES(oldkeys);
PyDictKeyEntry *newentries = DK_ENTRIES(newkeys);
if (oldkeys->dk_nentries == numentries) {
memcpy(newentries, oldentries, numentries * sizeof(PyDictKeyEntry));
}
else {
PyDictKeyEntry *ep = oldentries;
for (Py_ssize_t i = 0; i < numentries; i++) {
while (ep->me_value == NULL)
ep++;
newentries[i] = *ep++;
}
}
build_indices_generic(newkeys, newentries, numentries);
}
else { // oldkeys is combined unicode
PyDictUnicodeEntry *oldentries = DK_UNICODE_ENTRIES(oldkeys);
if (unicode) { // combined unicode -> combined unicode
PyDictUnicodeEntry *newentries = DK_UNICODE_ENTRIES(newkeys);
if (oldkeys->dk_nentries == numentries && mp->ma_keys->dk_kind == DICT_KEYS_UNICODE) {
memcpy(newentries, oldentries, numentries * sizeof(PyDictUnicodeEntry));
}
else {
PyDictUnicodeEntry *ep = oldentries;
for (Py_ssize_t i = 0; i < numentries; i++) {
while (ep->me_value == NULL)
ep++;
newentries[i] = *ep++;
}
}
build_indices_unicode(newkeys, newentries, numentries);
}
else { // combined unicode -> generic
PyDictKeyEntry *newentries = DK_ENTRIES(newkeys);
PyDictUnicodeEntry *ep = oldentries;
for (Py_ssize_t i = 0; i < numentries; i++) {
while (ep->me_value == NULL)
ep++;
newentries[i].me_key = ep->me_key;
newentries[i].me_hash = unicode_get_hash(ep->me_key);
newentries[i].me_value = ep->me_value;
ep++;
}
build_indices_generic(newkeys, newentries, numentries);
}
}
set_keys(mp, newkeys);
if (oldkeys != Py_EMPTY_KEYS) {
#ifdef Py_REF_DEBUG
_Py_DecRefTotal(_PyThreadState_GET());
#endif
assert(oldkeys->dk_kind != DICT_KEYS_SPLIT);
assert(oldkeys->dk_refcnt == 1);
free_keys_object(oldkeys, IS_DICT_SHARED(mp));
}
}
STORE_KEYS_USABLE(mp->ma_keys, mp->ma_keys->dk_usable - numentries);
STORE_KEYS_NENTRIES(mp->ma_keys, numentries);
ASSERT_CONSISTENT(mp);
return 0;
}
As the comment block explains, dictresize handles several conversions; here we mainly care about resizing a combined unicode table and converting unicode to generic.
dictresize first saves the old keys and values, then calls new_keys_object to build a new ma_keys. In our case the dict is combined, so oldvalues is NULL, the old key kind is unicode and the new kind is generic, which means the combined unicode -> generic branch. The size change itself already happened inside new_keys_object; what this branch does is copy the entries from the old ma_keys into the new one and then call build_indices_generic to rebuild the index table. build_indices_generic is worth a closer look; its source is:
static void
build_indices_generic(PyDictKeysObject *keys, PyDictKeyEntry *ep, Py_ssize_t n)
{
size_t mask = DK_MASK(keys);
for (Py_ssize_t ix = 0; ix != n; ix++, ep++) {
Py_hash_t hash = ep->me_hash;
size_t i = hash & mask;
for (size_t perturb = hash; dictkeys_get_index(keys, i) != DKIX_EMPTY;) {
perturb >>= PERTURB_SHIFT;
i = mask & (i*5 + perturb + 1);
}
dictkeys_set_index(keys, i, ix);
}
}
Building the indices means finding a slot for each key: start at hash & mask, and as long as the slot is already taken (its index is not DKIX_EMPTY) keep walking the collision probe chain via perturb >>= PERTURB_SHIFT; i = (i*5 + perturb + 1) & mask until a free slot turns up, then write the entry's position there with dictkeys_set_index. A Python rendering of this loop follows.
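Here is that probing loop rendered in Python (a sketch; PERTURB_SHIFT is 5, and -1 stands in for DKIX_EMPTY):

PERTURB_SHIFT = 5

def build_indices(indices, entries):
    """Rebuild the hash slots for a dense list of (hash, key, value) entries."""
    mask = len(indices) - 1
    for ix, (h, key, value) in enumerate(entries):
        i = h & mask
        perturb = h & 0xFFFFFFFFFFFFFFFF        # treat the hash as unsigned, like size_t
        while indices[i] != -1:                 # -1 plays the role of DKIX_EMPTY
            perturb >>= PERTURB_SHIFT
            i = (i * 5 + perturb + 1) & mask    # CPython's collision probe recurrence
        indices[i] = ix

indices = [-1] * 8
entries = [(-9092791511155847987, 'timmy', 'red'),
           (-8522787127447073495, 'barry', 'green'),
           (-6480567542315338377, 'guido', 'blue')]
build_indices(indices, entries)
print(indices)      # [-1, 1, -1, -1, -1, 0, -1, 2] -- the layout shown at the top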
With the entries copied and the indices rebuilt, the resize is essentially done; what remains is housekeeping: the new ma_keys is assigned to the dict, and its dk_usable and dk_nentries counters are updated from numentries.
Back in insertdict: the insertion_resize above mainly converted the string-key table into a general one, though it also showed how a resize works. The next step is to look the key up in the dict to see whether it is already present. The source of _Py_dict_lookup is:
Py_ssize_t
_Py_dict_lookup(PyDictObject *mp, PyObject *key, Py_hash_t hash, PyObject **value_addr)
{
PyDictKeysObject *dk;
DictKeysKind kind;
Py_ssize_t ix;
_Py_CRITICAL_SECTION_ASSERT_OBJECT_LOCKED(mp);
start:
dk = mp->ma_keys;
kind = dk->dk_kind;
if (kind != DICT_KEYS_GENERAL) {
if (PyUnicode_CheckExact(key)) {
#ifdef Py_GIL_DISABLED
if (kind == DICT_KEYS_SPLIT) {
// A split dictionaries keys can be mutated by other
// dictionaries but if we have a unicode key we can avoid
// locking the shared keys.
ix = unicodekeys_lookup_unicode_threadsafe(dk, key, hash);
if (ix == DKIX_KEY_CHANGED) {
LOCK_KEYS(dk);
ix = unicodekeys_lookup_unicode(dk, key, hash);
UNLOCK_KEYS(dk);
}
}
else {
ix = unicodekeys_lookup_unicode(dk, key, hash);
}
#else
ix = unicodekeys_lookup_unicode(dk, key, hash);
#endif
}
else {
INCREF_KEYS_FT(dk);
LOCK_KEYS_IF_SPLIT(dk, kind);
ix = unicodekeys_lookup_generic(mp, dk, key, hash);
UNLOCK_KEYS_IF_SPLIT(dk, kind);
DECREF_KEYS_FT(dk, IS_DICT_SHARED(mp));
if (ix == DKIX_KEY_CHANGED) {
goto start;
}
}
if (ix >= 0) {
if (kind == DICT_KEYS_SPLIT) {
*value_addr = mp->ma_values->values[ix];
}
else {
*value_addr = DK_UNICODE_ENTRIES(dk)[ix].me_value;
}
}
else {
*value_addr = NULL;
}
}
else {
ix = dictkeys_generic_lookup(mp, dk, key, hash);
if (ix == DKIX_KEY_CHANGED) {
goto start;
}
if (ix >= 0) {
*value_addr = DK_ENTRIES(dk)[ix].me_value;
}
else {
*value_addr = NULL;
}
}
return ix;
}
_Py_dict_lookup first looks up the index and then fetches the value. Take the unicode-key path first: for a unicode key it calls unicodekeys_lookup_unicode directly, whose source is:
static Py_ssize_t _Py_HOT_FUNCTION
unicodekeys_lookup_unicode(PyDictKeysObject* dk, PyObject *key, Py_hash_t hash)
{
return do_lookup(NULL, dk, key, hash, compare_unicode_unicode);
}
unicodekeys_lookup_unicode just calls do_lookup, passing the ma_keys object dk, the key being looked up, its hash, and a string comparison callback compare_unicode_unicode. Into do_lookup:
static inline Py_ALWAYS_INLINE Py_ssize_t
do_lookup(PyDictObject *mp, PyDictKeysObject *dk, PyObject *key, Py_hash_t hash,
int (*check_lookup)(PyDictObject *, PyDictKeysObject *, void *, Py_ssize_t ix, PyObject *key, Py_hash_t))
{
void *ep0 = _DK_ENTRIES(dk);
size_t mask = DK_MASK(dk);
size_t perturb = hash;
size_t i = (size_t)hash & mask;
Py_ssize_t ix;
for (;;) {
ix = dictkeys_get_index(dk, i);
if (ix >= 0) {
int cmp = check_lookup(mp, dk, ep0, ix, key, hash);
if (cmp < 0) {
return cmp;
} else if (cmp) {
return ix;
}
}
else if (ix == DKIX_EMPTY) {
return DKIX_EMPTY;
}
perturb >>= PERTURB_SHIFT;
i = mask & (i*5 + perturb + 1);
// Manual loop unrolling
ix = dictkeys_get_index(dk, i);
if (ix >= 0) {
int cmp = check_lookup(mp, dk, ep0, ix, key, hash);
if (cmp < 0) {
return cmp;
} else if (cmp) {
return ix;
}
}
else if (ix == DKIX_EMPTY) {
return DKIX_EMPTY;
}
perturb >>= PERTURB_SHIFT;
i = mask & (i*5 + perturb + 1);
}
Py_UNREACHABLE();
}
The lookup starts by ANDing the key's hash with mask to get a slot, then enters the loop: it reads the index stored at that slot and calls the supplied comparison callback check_lookup to decide whether the keys match, where two keys are considered equal if they are the same object, or if their hashes match and their (string) values compare equal. Each probe ends in one of three ways: (1) the slot holds DKIX_EMPTY, so the key is not in the dict and DKIX_EMPTY is returned; (2) the slot holds a valid index and the comparison succeeds, so that index ix is returned (a negative comparison result signals an error and is propagated); (3) the slot is occupied but the keys do not match (or the slot is a DKIX_DUMMY tombstone), so the loop moves on to the next position in the probe chain, again via perturb >>= PERTURB_SHIFT; i = (i*5 + perturb + 1) & mask.
Once the key lookup returns the index ix, value_addr is set to the corresponding entry's me_value when ix is valid, and to NULL otherwise. A lookup sketch in Python follows.
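Seen from the lookup side, do_lookup for a combined table boils down to something like the following sketch (simplified: error paths, locking and the manual loop unrolling are ignored):

PERTURB_SHIFT = 5
DKIX_EMPTY = -1

def lookup(indices, entries, key, h):
    """Return the entries index for `key`, or DKIX_EMPTY if it is not present."""
    mask = len(indices) - 1
    i = h & mask
    perturb = h & 0xFFFFFFFFFFFFFFFF
    while True:
        ix = indices[i]
        if ix == DKIX_EMPTY:
            return DKIX_EMPTY                       # hit an empty slot: the key is absent
        if ix >= 0:
            eh, ekey, evalue = entries[ix]
            if ekey is key or (eh == h and ekey == key):   # the check_lookup comparison
                return ix
        perturb >>= PERTURB_SHIFT
        i = (i * 5 + perturb + 1) & mask            # keep walking the probe chain

indices = [-1, 1, -1, -1, -1, 0, -1, 2]
entries = [(-9092791511155847987, 'timmy', 'red'),
           (-8522787127447073495, 'barry', 'green'),
           (-6480567542315338377, 'guido', 'blue')]
print(lookup(indices, entries, 'guido', -6480567542315338377))   # 2
print(lookup(indices, entries, 'eve', hash('eve')))              # -1 (DKIX_EMPTY)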
Split tables were already handled near the top of insertdict; here we focus on insertion into a combined dict. Back in insertdict, the lookup has returned the index ix and stored any existing value in old_value, and the insertion proper begins. If ix is DKIX_EMPTY the key is not yet in the dict and it can be inserted directly; enter insert_combined_dict, whose source is:
static inline int
insert_combined_dict(PyInterpreterState *interp, PyDictObject *mp,
Py_hash_t hash, PyObject *key, PyObject *value)
{
if (mp->ma_keys->dk_usable <= 0) {
/* Need to resize. */
if (insertion_resize(interp, mp, 1) < 0) {
return -1;
}
}
uint64_t new_version = _PyDict_NotifyEvent(
interp, PyDict_EVENT_ADDED, mp, key, value);
mp->ma_keys->dk_version = 0;
Py_ssize_t hashpos = find_empty_slot(mp->ma_keys, hash);
dictkeys_set_index(mp->ma_keys, hashpos, mp->ma_keys->dk_nentries);
if (DK_IS_UNICODE(mp->ma_keys)) {
PyDictUnicodeEntry *ep;
ep = &DK_UNICODE_ENTRIES(mp->ma_keys)[mp->ma_keys->dk_nentries];
STORE_KEY(ep, key);
STORE_VALUE(ep, value);
}
else {
PyDictKeyEntry *ep;
ep = &DK_ENTRIES(mp->ma_keys)[mp->ma_keys->dk_nentries];
STORE_KEY(ep, key);
STORE_VALUE(ep, value);
STORE_HASH(ep, hash);
}
mp->ma_version_tag = new_version;
STORE_KEYS_USABLE(mp->ma_keys, mp->ma_keys->dk_usable - 1);
STORE_KEYS_NENTRIES(mp->ma_keys, mp->ma_keys->dk_nentries + 1);
assert(mp->ma_keys->dk_usable >= 0);
return 0;
}
The insert first checks whether the dict needs to grow; if dk_usable has run out it calls insertion_resize, which, as analyzed above, builds a new ma_keys and refills the indices, exactly in line with Raymond's scheme. Then find_empty_slot walks the collision probe chain for the current hash until it finds a free slot and returns it as hashpos. With the slot determined, the rest is familiar: write the index, then store the entry (key and value, plus the hash for general tables), and finally update dk_usable and dk_nentries.
If ix is not DKIX_EMPTY, the key already exists in the dict. If the new value is identical to the old one nothing is done; otherwise the corresponding entry's value is replaced, and the assignments that follow are all operations we have seen before.
With insertion covered, roughly 80% of the dict has been covered. Lookup on its own is unremarkable: it is exactly the machinery we just walked through.
Deletion is not much more exciting either: find the key, set its index slot to DKIX_DUMMY and clear the corresponding value. The index slot is set to DKIX_DUMMY rather than back to DKIX_EMPTY so that the collision probe chain is not broken; the probe chain is described in detail in 《Python源码剖析》. And with that, the post-3.6 dict implementation is essentially fully covered. A tiny illustration of the tombstone trick closes things out below.
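Finally, a toy illustration of why the DKIX_DUMMY tombstone matters (made-up hash values and simplified linear probing, not CPython code):

DKIX_EMPTY, DKIX_DUMMY = -1, -2

# Two keys whose hashes collide into the same slot of an 8-slot table.
indices = [DKIX_EMPTY] * 8
entries = [(3, 'a', 1), (11, 'b', 2)]       # 3 & 7 == 11 & 7 == 3
indices[3] = 0                              # 'a' took slot 3 first
indices[4] = 1                              # 'b' was pushed to the next probe position

def lookup(key, h):
    i, mask = h & 7, 7
    while indices[i] != DKIX_EMPTY:
        ix = indices[i]
        if ix >= 0 and entries[ix][1] == key:
            return entries[ix][2]
        i = (i + 1) & mask                  # simplified probe for the toy
    raise KeyError(key)

# Delete 'a' the way CPython does: leave a DUMMY marker (and clear the value),
# do not reset the slot to EMPTY.
indices[3] = DKIX_DUMMY
print(lookup('b', 11))                      # 2 -- probing walks over the tombstone

# If slot 3 were reset to EMPTY instead, the probe for 'b' would stop there
# and wrongly report that 'b' is missing: the collision chain would be broken.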