[Devel] [PATCH RH9 01/13] kmapset: set of key-value mappings with build-in

Pavel Tikhomirov ptikhomirov at virtuozzo.com
Tue Sep 21 19:04:19 MSK 2021


From: Stanislav Kinsburskiy <skinsbursky at virtuozzo.com>

This implements effective storage for large set of small mappings.
'map' contains sorted list of 'links' (key -> value pairs).
'set' contains rb-tree of maps which is used for searching duplicate mappings.
'key' contains list of all its links, this simplifies removing all links from
all mappings assiciated with particular key.

Signed-off-by: Konstantin Khlebnikov <khlebnikov at openvz.org>

===============================

kmapset: Fix NULL pointer derefference in kmapset_set_value()

map->links hlist is ordered ascending (from less to bigger).
If we're adding a new_link, whose key is the bigger than
everything, we traverse all the hlist and exit with
old_link = NULL. This lead to kernel panic:

[71531.680537] BUG: unable to handle kernel NULL pointer dereference at 0000000000000018
[71531.680553] IP: [<ffffffff812b9540>] kmapset_set_value+0x100/0x170
[71531.680566] PGD aad36067 PUD 9ee7d067 PMD 0
[71531.680576] Oops: 0000 [#1] SMP
[71531.680586] Modules linked in: veth tun ip6t_rpfilter ip6t_REJECT ipt_REJECT xt_conntrack ebtable_nat ebtable_broute ebtable_filter ebtables ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_raw ip6table_filter ip6_tables iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_raw iptable_filter ip_tables crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd lpc_ich mfd_core virtio_balloon shpchp mperf serio_raw pcspkr ext4 mbcache jbd2 sd_mod sr_mod crc_t10dif cdrom crct10dif_common ata_generic pata_acpi ahci ata_piix libahci libata e1000 virtio_pci virtio virtio_ring dm_mirror dm_region_hash dm_log dm_mod ip6_vzprivnet ip6_vznetstat pio_kaio pio_nfs pio_direct
[71531.680743]  pfmt_raw pfmt_ploop1 ploop ip_vznetstat ip_vzprivnet vziolimit vzcon vzlinkdev vzevent vzlist vzstat vznetstat vznetdev vzcompat vzmon vzdev bridge stp llc
[71531.680784] CPU: 1 PID: 46241 Comm: vzctl ve: 0 Not tainted 3.10.0-123.1.2.vz7.5.16 #1 5.16
[71531.680802] Hardware name: Parallels Software International Inc. Parallels Virtual Platform/Parallels Virtual Platform, BIOS 6.10.24057.1139071 04/30/2015
[71531.680809] task: ffff8800a7834bc0 ti: ffff8800a7a28000 task.ti: ffff8800a7a28000
[71531.680815] RIP: 0010:[<ffffffff812b9540>]  [<ffffffff812b9540>] kmapset_set_value+0x100/0x170
[71531.680824] RSP: 0018:ffff8800a7a29dd0  EFLAGS: 00010246
[71531.680829] RAX: 0000000000000000 RBX: ffff8800368ae440 RCX: ffff8800a7a29fd8
[71531.680833] RDX: ffff8800368ae2d8 RSI: 00000000000000d0 RDI: ffffffff81ce6240
[71531.680837] RBP: ffff8800a7a29df8 R08: 00000000000208e0 R09: ffff880303803b00
[71531.680841] R10: ffffffff812b9470 R11: 0000000000000246 R12: ffff8800368ae7c0
[71531.680845] R13: ffff8802fc370918 R14: ffffffff81ce6240 R15: 0000000000000005
[71531.680850] FS:  00007fb7078dfbc0(0000) GS:ffff880303e40000(0000) knlGS:0000000000000000
[71531.680857] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[71531.680862] CR2: 0000000000000018 CR3: 00000000aadde000 CR4: 00000000000406e0
[71531.680867] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[71531.680871] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[71531.680876] Stack:
[71531.680880]  0000000000000000 ffff8800a78c72d0 0000000000000005 0000000000000000
[71531.680893]  ffff88003648f021 ffff8800a7a29e60 ffffffff81237486 ffff8802fc370918
[71531.680906]  ffff8800368ae7c0 ffff88003648f000 0000000000000024 ffff8802fc370620
[71531.680921] Call Trace:
[71531.680932]  [<ffffffff81237486>] sysfs_perms_write+0x196/0x360
[71531.680942]  [<ffffffff810e26dd>] cgroup_file_write+0x9d/0x310
[71531.680950]  [<ffffffff811be4f8>] ? __sb_start_write+0x58/0x110
[71531.680956]  [<ffffffff811bbcad>] vfs_write+0xbd/0x1e0
[71531.680961]  [<ffffffff811bc6f8>] SyS_write+0x58/0xb0
[71531.680967]  [<ffffffff815da959>] system_call_fastpath+0x16/0x1b
[71531.680972] Code: 41 5d 41 5e 41 5f 5d c3 0f 1f 00 49 8d 44 24 20 48 c7 43 18 00 00 00 00 48 89 43 20 48 8d 43 18 49 89 44 24 20 eb a7 0f 1f 40 00 <48> 8b 04 25 18 00 00 00 48 8d 53 18 48 c7 43 20 18 00 00 00 48
[71531.681092] RIP  [<ffffffff812b9540>] kmapset_set_value+0x100/0x170
[71531.681099]  RSP <ffff8800a7a29dd0>
[71531.681103] CR2: 0000000000000018

This patch adds the biggest key link on the right place (after the last existing link).

https://jira.sw.ru/browse/PSBM-34437

Signed-off-by: Kirill Tkhai <ktkhai at odin.com>
Reviewed-by: Vladimir Davydov <vdavydov at parallels.com>
Signed-off-by: Stanislav Kinsburskiy <skinsbursky at virtuozzo.com>

+++
kmapset: lost map->size update in kmapset_del_value()

If map->size is not properly updated
kmapset_cmp() can crash on access to non-exisitng links.

https://jira.sw.ru/browse/PSBM-127478

Signed-off-by: Vasily Averin <vvs at virtuozzo.com>

khorenko@: reordered "map->size--;" and "kfree_rcu(link, rcu_head);"
in kmapset_del_value() to keep the same order as in kmapset_unlink().

As both operations are under same lock, it does not change the logic,
just makes the code cleaner.

(cherry-picked from vz8 commit 72df2f1b656525392d1588d8307ea558c0742baa)
Signed-off-by: Pavel Tikhomirov <ptikhomirov at virtuozzo.com>
---
 include/linux/kmapset.h | 105 ++++++++++++
 lib/Makefile            |   2 +
 lib/kmapset.c           | 342 ++++++++++++++++++++++++++++++++++++++++
 3 files changed, 449 insertions(+)
 create mode 100644 include/linux/kmapset.h
 create mode 100644 lib/kmapset.c

diff --git a/include/linux/kmapset.h b/include/linux/kmapset.h
new file mode 100644
index 000000000000..947407ccd779
--- /dev/null
+++ b/include/linux/kmapset.h
@@ -0,0 +1,105 @@
+/*
+ *  include/linux/kmapset.h
+ *
+ *  Copyright (c) 2013-2015 Parallels IP Holdings GmbH
+ *  Copyright (c) 2017-2021 Virtuozzo International GmbH. All rights reserved.
+ *
+ */
+
+#ifndef _LINUX_KMAPSET_H
+#define _LINUX_KMAPSET_H
+
+#include <linux/kernel.h>
+#include <linux/rbtree.h>
+#include <linux/rculist.h>
+#include <linux/kref.h>
+
+struct kmapset_map;
+
+struct kmapset_set {
+	struct mutex		mutex;
+	struct rb_root		tree;
+	unsigned long		default_value;
+};
+
+struct kmapset_map {
+	struct kref		kref;
+	unsigned		size;
+	struct kmapset_set	*set;
+	unsigned long		default_value;
+	unsigned long		hash;
+	struct hlist_head	links;
+	union {
+		struct rb_node		node;
+		struct rcu_head		rcu_head;
+	};
+};
+
+struct kmapset_key {
+	struct hlist_head	links;
+};
+
+struct kmapset_link {
+	struct kmapset_map	*map;
+	struct kmapset_key	*key;
+	unsigned long		value;
+	struct hlist_node	map_link;
+	union {
+		struct hlist_node	key_link;
+		struct rcu_head		rcu_head;
+	};
+};
+
+static inline void kmapset_lock(struct kmapset_set *set)
+{
+	mutex_lock(&set->mutex);
+}
+
+static inline void kmapset_unlock(struct kmapset_set *set)
+{
+	mutex_unlock(&set->mutex);
+}
+
+struct kmapset_map *kmapset_new(struct kmapset_set *set);
+
+static inline void kmapset_init_set(struct kmapset_set *set)
+{
+	mutex_init(&set->mutex);
+	set->tree = RB_ROOT;
+	set->default_value = 0;
+}
+
+static inline void kmapset_init_map(struct kmapset_map *map,
+		struct kmapset_set *set)
+{
+	kref_init(&map->kref);
+	map->size = 0;
+	map->set = set;
+	map->default_value = set->default_value;
+	INIT_HLIST_HEAD(&map->links);
+	RB_CLEAR_NODE(&map->node);
+}
+
+static inline void kmapset_init_key(struct kmapset_key *key)
+{
+	 INIT_HLIST_HEAD(&key->links);
+}
+
+struct kmapset_map *kmapset_get(struct kmapset_map *map);
+void kmapset_put(struct kmapset_map *map);
+
+struct kmapset_map *kmapset_dup(struct kmapset_map *old);
+struct kmapset_map *kmapset_commit(struct kmapset_map *map);
+
+struct kmapset_link *kmapset_lookup(struct kmapset_map *map,
+		struct kmapset_key *key);
+unsigned long kmapset_get_value(struct kmapset_map *map,
+		struct kmapset_key *key);
+int kmapset_set_value(struct kmapset_map *map,
+		struct kmapset_key *key, unsigned long value);
+bool kmapset_del_value(struct kmapset_map *map, struct kmapset_key *key);
+void kmapset_set_default(struct kmapset_map *map, unsigned long value);
+
+void kmapset_unlink(struct kmapset_key *key, struct kmapset_set *set);
+
+#endif /* _LINUX_KMAPSET_H */
diff --git a/lib/Makefile b/lib/Makefile
index 5efd1b435a37..834393b7a964 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -130,6 +130,8 @@ obj-$(CONFIG_TEST_LIVEPATCH) += livepatch/
 
 obj-$(CONFIG_KUNIT) += kunit/
 
+obj-y += kmapset.o
+
 ifeq ($(CONFIG_DEBUG_KOBJECT),y)
 CFLAGS_kobject.o += -DDEBUG
 CFLAGS_kobject_uevent.o += -DDEBUG
diff --git a/lib/kmapset.c b/lib/kmapset.c
new file mode 100644
index 000000000000..5fc692149014
--- /dev/null
+++ b/lib/kmapset.c
@@ -0,0 +1,342 @@
+/*
+ *  lib/kmapset.h
+ *
+ *  Copyright (c) 2013-2015 Parallels IP Holdings GmbH
+ *  Copyright (c) 2017-2021 Virtuozzo International GmbH. All rights reserved.
+ *
+ */
+
+#include <linux/mutex.h>
+#include <linux/kmapset.h>
+#include <linux/slab.h>
+#include <linux/hash.h>
+
+struct kmapset_map *kmapset_new(struct kmapset_set *set)
+{
+	struct kmapset_map *map;
+
+	map = kmalloc(sizeof(struct kmapset_map), GFP_KERNEL);
+	if (!map)
+		return NULL;
+	kmapset_init_map(map, set);
+	return map;
+}
+
+static void kmapset_free(struct kmapset_map *map)
+{
+	struct kmapset_link *link;
+	struct hlist_node *next;
+
+	hlist_for_each_entry_safe(link, next, &map->links, map_link)
+		kfree_rcu(link, rcu_head);
+	kfree_rcu(map, rcu_head);
+}
+
+static long kmapset_cmp(struct kmapset_map *map_a, struct kmapset_map *map_b)
+{
+	struct kmapset_link *link_a, *link_b;
+
+	if (map_a->hash != map_b->hash)
+		return map_a->hash - map_b->hash;
+
+	if (map_a->size != map_b->size)
+		return map_a->size - map_b->size;
+
+	link_a = hlist_entry(map_a->links.first,
+			struct kmapset_link, map_link);
+	link_b = hlist_entry(map_b->links.first,
+			struct kmapset_link, map_link);
+	while (&link_a->map_link) {
+		if (link_a->key != link_b->key)
+			return (long)link_a->key - (long)link_b->key;
+		if (link_a->value != link_b->value)
+			return link_a->value - link_b->value;
+		link_a = list_entry(link_a->map_link.next,
+				struct kmapset_link, map_link);
+		link_b = list_entry(link_b->map_link.next,
+				struct kmapset_link, map_link);
+	}
+
+	return map_a->default_value - map_b->default_value;
+}
+
+static inline bool kmapset_hashed(struct kmapset_map *map)
+{
+	return !RB_EMPTY_NODE(&map->node);
+}
+
+static bool kmapset_hash(struct kmapset_map *map, struct kmapset_map **old)
+{
+	struct rb_node **p = &map->set->tree.rb_node;
+	struct rb_node *parent = NULL;
+	struct kmapset_map *cur;
+	struct kmapset_link *link;
+	long diff;
+
+	map->hash = hash_long(map->default_value, BITS_PER_LONG);
+	hlist_for_each_entry(link, &map->links, map_link)
+		map->hash ^= hash_ptr(link->key, BITS_PER_LONG) *
+			     hash_long(link->value, BITS_PER_LONG);
+
+	while (*p) {
+		parent = *p;
+		cur = rb_entry(parent, struct kmapset_map, node);
+		diff = kmapset_cmp(map, cur);
+		if (diff < 0)
+			p = &(*p)->rb_left;
+		else
+			p = &(*p)->rb_right;
+		if (!diff && old) {
+			*old = cur;
+			return true;
+		}
+	}
+	rb_link_node(&map->node, parent, p);
+	rb_insert_color(&map->node, &map->set->tree);
+	return false;
+}
+
+static void kmapset_unhash(struct kmapset_map *map)
+{
+	rb_erase(&map->node, &map->set->tree);
+	RB_CLEAR_NODE(&map->node);
+}
+
+static void kmapset_rehash(struct kmapset_map *map)
+{
+	if (kmapset_hashed(map)) {
+		kmapset_unhash(map);
+		kmapset_hash(map, NULL);
+	}
+}
+
+struct kmapset_map *kmapset_get(struct kmapset_map *map)
+{
+	if (map)
+		kref_get(&map->kref);
+	return map;
+}
+
+static void kmapset_release(struct kref *kref)
+{
+	struct kmapset_map *map = container_of(kref, struct kmapset_map, kref);
+	struct kmapset_set *set = map->set;
+	struct kmapset_link *link;
+
+	if (kmapset_hashed(map))
+		kmapset_unhash(map);
+	hlist_for_each_entry(link, &map->links, map_link)
+		hlist_del(&link->key_link);
+	mutex_unlock(&set->mutex);
+
+	kmapset_free(map);
+}
+
+void kmapset_put(struct kmapset_map *map)
+{
+	if (map)
+		kref_put_mutex(&map->kref, kmapset_release, &map->set->mutex);
+}
+
+/*
+ * kmapset_commit - hash new map into set or lookup existing copy\
+ *
+ * after committing map must stay immutable
+ */
+struct kmapset_map *kmapset_commit(struct kmapset_map *map)
+{
+	struct kmapset_set *set = map->set;
+	struct kmapset_map *ret = map;
+
+	kmapset_lock(set);
+	if (kmapset_hash(map, &ret)) {
+		kmapset_get(ret);
+		kmapset_release(&map->kref);
+	} else
+		kmapset_unlock(set);
+
+	return ret;
+}
+
+/*
+ * kmapset_copy - copy content of one set to another
+ */
+static int kmapset_copy(struct kmapset_map *dst, struct kmapset_map *src)
+{
+	struct kmapset_set *set = src->set;
+	struct kmapset_link *old_link, *new_link;
+	struct hlist_node *next;
+	int i;
+
+	for (i = src->size; i; i--) {
+		new_link = kmalloc(sizeof(struct kmapset_link), GFP_KERNEL);
+		if (!new_link)
+			return -ENOMEM;
+		hlist_add_head(&new_link->map_link, &dst->links);
+	}
+
+	kmapset_lock(set);
+	dst->default_value = src->default_value;
+	new_link = hlist_entry(dst->links.first, struct kmapset_link, map_link);
+	hlist_for_each_entry(old_link, &src->links, map_link) {
+		new_link->key = old_link->key;
+		new_link->value = old_link->value;
+		new_link->map = dst;
+		dst->size++;
+		hlist_add_head(&new_link->key_link, &new_link->key->links);
+		new_link = hlist_entry(new_link->map_link.next,
+				struct kmapset_link, map_link);
+	}
+	kmapset_unlock(set);
+
+	while (&new_link->map_link) {
+		next = new_link->map_link.next;
+		hlist_del(&new_link->map_link);
+		kfree(new_link);
+		new_link = hlist_entry(next, struct kmapset_link, map_link);
+	}
+
+	return 0;
+}
+
+struct kmapset_map *kmapset_dup(struct kmapset_map *map)
+{
+	struct kmapset_map *new;
+
+	new = kmapset_new(map->set);
+	if (!new)
+		return NULL;
+
+	if (kmapset_copy(new, map)) {
+		kmapset_free(new);
+		return NULL;
+	}
+
+	return new;
+}
+
+/*
+ * kmapset_value - lookup link object for given key
+ *
+ * requires kmapset_lock or rcu_read_lock
+ */
+struct kmapset_link *
+kmapset_lookup(struct kmapset_map *map, struct kmapset_key *key)
+{
+	struct kmapset_link *link;
+
+	hlist_for_each_entry_rcu(link, &map->links, map_link) {
+		if (link->key == key)
+			return link;
+		if (link->key > key)
+			break;
+	}
+	return NULL;
+}
+
+/*
+ * kmapset_get_value - retrieve value for given key
+ */
+unsigned long
+kmapset_get_value(struct kmapset_map *map, struct kmapset_key *key)
+{
+	struct kmapset_link *link;
+	unsigned long value;
+
+	rcu_read_lock();
+	link = kmapset_lookup(map, key);
+	value = link ? link->value : map->default_value;
+	rcu_read_unlock();
+	return value;
+}
+
+int kmapset_set_value(struct kmapset_map *map,
+		struct kmapset_key *key, unsigned long value)
+{
+	struct kmapset_set *set = map->set;
+	struct kmapset_link *new_link, *old_link, *last_link = NULL;
+
+	new_link = kmalloc(sizeof(struct kmapset_link), GFP_KERNEL);
+	if (!new_link)
+		return -ENOMEM;
+
+	new_link->key = key;
+	new_link->value = value;
+	new_link->map = map;
+
+	kmapset_lock(set);
+	if (hlist_empty(&map->links)) {
+		hlist_add_head_rcu(&new_link->map_link, &map->links);
+	} else {
+		hlist_for_each_entry(old_link, &map->links, map_link) {
+			last_link = old_link;
+			if (old_link->key < key)
+				continue;
+			if (old_link->key == key) {
+				old_link->value = value;
+				kfree(new_link);
+				goto out;
+			}
+			hlist_add_before_rcu(&new_link->map_link,
+					     &old_link->map_link);
+			goto add;
+		}
+		hlist_add_behind_rcu(&new_link->map_link, &last_link->map_link);
+	}
+add:
+	hlist_add_head(&new_link->key_link, &new_link->key->links);
+	map->size++;
+out:
+	kmapset_unlock(set);
+
+	return 0;
+}
+
+bool kmapset_del_value(struct kmapset_map *map, struct kmapset_key *key)
+{
+	struct kmapset_set *set = map->set;
+	struct kmapset_link *link;
+	bool ret = false;
+
+	kmapset_lock(set);
+	link = kmapset_lookup(map, key);
+	if (link) {
+		hlist_del_rcu(&link->map_link);
+		hlist_del(&link->key_link);
+		map->size--;
+		kfree_rcu(link, rcu_head);
+		ret = true;
+	}
+	kmapset_unlock(set);
+	return ret;
+}
+
+void kmapset_set_default(struct kmapset_map *map, unsigned long value)
+{
+	struct kmapset_set *set = map->set;
+
+	kmapset_lock(set);
+	map->default_value = value;
+	kmapset_unlock(set);
+}
+
+/*
+ * kmapset_unlink - unlink key from all maps in set
+ */
+void kmapset_unlink(struct kmapset_key *key, struct kmapset_set *set)
+{
+	struct kmapset_link *link;
+	struct kmapset_map *map;
+	struct hlist_node *next;
+
+	kmapset_lock(set);
+	hlist_for_each_entry_safe(link, next, &key->links, key_link) {
+		map = link->map;
+		hlist_del(&link->key_link);
+		hlist_del_rcu(&link->map_link);
+		map->size--;
+		kfree_rcu(link, rcu_head);
+		kmapset_rehash(map);
+	}
+	kmapset_unlock(set);
+}
-- 
2.31.1



More information about the Devel mailing list