[Devel] [PATCH RHEL COMMIT] kmapset: set of key-value mappings with build-in
Konstantin Khorenko
khorenko at virtuozzo.com
Wed Sep 22 14:49:41 MSK 2021
The commit is pushed to "branch-rh9-5.14.vz9.1.x-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git
after ark-5.14
------>
commit b80ac9499d25668805b561754b776ba16b7b44b5
Author: Stanislav Kinsburskiy <skinsbursky at virtuozzo.com>
Date: Wed Sep 22 14:49:41 2021 +0300
kmapset: set of key-value mappings with build-in
This implements effective storage for large set of small mappings.
'map' contains sorted list of 'links' (key -> value pairs).
'set' contains rb-tree of maps which is used for searching duplicate mappings.
'key' contains list of all its links, this simplifies removing all links from
all mappings assiciated with particular key.
Signed-off-by: Konstantin Khlebnikov <khlebnikov at openvz.org>
===============================
kmapset: Fix NULL pointer derefference in kmapset_set_value()
map->links hlist is ordered ascending (from less to bigger).
If we're adding a new_link, whose key is the bigger than
everything, we traverse all the hlist and exit with
old_link = NULL. This lead to kernel panic:
[71531.680537] BUG: unable to handle kernel NULL pointer dereference at 0000000000000018
[71531.680553] IP: [<ffffffff812b9540>] kmapset_set_value+0x100/0x170
[71531.680566] PGD aad36067 PUD 9ee7d067 PMD 0
[71531.680576] Oops: 0000 [#1] SMP
[71531.680586] Modules linked in: veth tun ip6t_rpfilter ip6t_REJECT ipt_REJECT xt_conntrack ebtable_nat ebtable_broute ebtable_filter ebtables ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_raw ip6table_filter ip6_tables iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_raw iptable_filter ip_tables crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd lpc_ich mfd_core virtio_balloon shpchp mperf serio_raw pcspkr ext4 mbcache jbd2 sd_mod sr_mod crc_t10dif cdrom crct10dif_common ata_generic pata_acpi ahci ata_piix libahci libata e1000 virtio_pci virtio virtio_ring dm_mirror dm_region_hash dm_log dm_mod ip6_vzprivnet ip6_vznetstat pio_kaio pio_nfs pio_direct
[71531.680743] pfmt_raw pfmt_ploop1 ploop ip_vznetstat ip_vzprivnet vziolimit vzcon vzlinkdev vzevent vzlist vzstat vznetstat vznetdev vzcompat vzmon vzdev bridge stp llc
[71531.680784] CPU: 1 PID: 46241 Comm: vzctl ve: 0 Not tainted 3.10.0-123.1.2.vz7.5.16 #1 5.16
[71531.680802] Hardware name: Parallels Software International Inc. Parallels Virtual Platform/Parallels Virtual Platform, BIOS 6.10.24057.1139071 04/30/2015
[71531.680809] task: ffff8800a7834bc0 ti: ffff8800a7a28000 task.ti: ffff8800a7a28000
[71531.680815] RIP: 0010:[<ffffffff812b9540>] [<ffffffff812b9540>] kmapset_set_value+0x100/0x170
[71531.680824] RSP: 0018:ffff8800a7a29dd0 EFLAGS: 00010246
[71531.680829] RAX: 0000000000000000 RBX: ffff8800368ae440 RCX: ffff8800a7a29fd8
[71531.680833] RDX: ffff8800368ae2d8 RSI: 00000000000000d0 RDI: ffffffff81ce6240
[71531.680837] RBP: ffff8800a7a29df8 R08: 00000000000208e0 R09: ffff880303803b00
[71531.680841] R10: ffffffff812b9470 R11: 0000000000000246 R12: ffff8800368ae7c0
[71531.680845] R13: ffff8802fc370918 R14: ffffffff81ce6240 R15: 0000000000000005
[71531.680850] FS: 00007fb7078dfbc0(0000) GS:ffff880303e40000(0000) knlGS:0000000000000000
[71531.680857] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[71531.680862] CR2: 0000000000000018 CR3: 00000000aadde000 CR4: 00000000000406e0
[71531.680867] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[71531.680871] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[71531.680876] Stack:
[71531.680880] 0000000000000000 ffff8800a78c72d0 0000000000000005 0000000000000000
[71531.680893] ffff88003648f021 ffff8800a7a29e60 ffffffff81237486 ffff8802fc370918
[71531.680906] ffff8800368ae7c0 ffff88003648f000 0000000000000024 ffff8802fc370620
[71531.680921] Call Trace:
[71531.680932] [<ffffffff81237486>] sysfs_perms_write+0x196/0x360
[71531.680942] [<ffffffff810e26dd>] cgroup_file_write+0x9d/0x310
[71531.680950] [<ffffffff811be4f8>] ? __sb_start_write+0x58/0x110
[71531.680956] [<ffffffff811bbcad>] vfs_write+0xbd/0x1e0
[71531.680961] [<ffffffff811bc6f8>] SyS_write+0x58/0xb0
[71531.680967] [<ffffffff815da959>] system_call_fastpath+0x16/0x1b
[71531.680972] Code: 41 5d 41 5e 41 5f 5d c3 0f 1f 00 49 8d 44 24 20 48 c7 43 18 00 00 00 00 48 89 43 20 48 8d 43 18 49 89 44 24 20 eb a7 0f 1f 40 00 <48> 8b 04 25 18 00 00 00 48 8d 53 18 48 c7 43 20 18 00 00 00 48
[71531.681092] RIP [<ffffffff812b9540>] kmapset_set_value+0x100/0x170
[71531.681099] RSP <ffff8800a7a29dd0>
[71531.681103] CR2: 0000000000000018
This patch adds the biggest key link on the right place (after the last existing link).
https://jira.sw.ru/browse/PSBM-34437
Signed-off-by: Kirill Tkhai <ktkhai at odin.com>
Reviewed-by: Vladimir Davydov <vdavydov at parallels.com>
Signed-off-by: Stanislav Kinsburskiy <skinsbursky at virtuozzo.com>
+++
kmapset: lost map->size update in kmapset_del_value()
If map->size is not properly updated
kmapset_cmp() can crash on access to non-exisitng links.
https://jira.sw.ru/browse/PSBM-127478
Signed-off-by: Vasily Averin <vvs at virtuozzo.com>
khorenko@: reordered "map->size--;" and "kfree_rcu(link, rcu_head);"
in kmapset_del_value() to keep the same order as in kmapset_unlink().
As both operations are under same lock, it does not change the logic,
just makes the code cleaner.
(cherry-picked from vz8 commit 72df2f1b656525392d1588d8307ea558c0742baa)
Signed-off-by: Pavel Tikhomirov <ptikhomirov at virtuozzo.com>
---
include/linux/kmapset.h | 105 +++++++++++++++
lib/Makefile | 2 +
lib/kmapset.c | 342 ++++++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 449 insertions(+)
diff --git a/include/linux/kmapset.h b/include/linux/kmapset.h
new file mode 100644
index 000000000000..947407ccd779
--- /dev/null
+++ b/include/linux/kmapset.h
@@ -0,0 +1,105 @@
+/*
+ * include/linux/kmapset.h
+ *
+ * Copyright (c) 2013-2015 Parallels IP Holdings GmbH
+ * Copyright (c) 2017-2021 Virtuozzo International GmbH. All rights reserved.
+ *
+ */
+
+#ifndef _LINUX_KMAPSET_H
+#define _LINUX_KMAPSET_H
+
+#include <linux/kernel.h>
+#include <linux/rbtree.h>
+#include <linux/rculist.h>
+#include <linux/kref.h>
+
+struct kmapset_map;
+
+struct kmapset_set {
+ struct mutex mutex;
+ struct rb_root tree;
+ unsigned long default_value;
+};
+
+struct kmapset_map {
+ struct kref kref;
+ unsigned size;
+ struct kmapset_set *set;
+ unsigned long default_value;
+ unsigned long hash;
+ struct hlist_head links;
+ union {
+ struct rb_node node;
+ struct rcu_head rcu_head;
+ };
+};
+
+struct kmapset_key {
+ struct hlist_head links;
+};
+
+struct kmapset_link {
+ struct kmapset_map *map;
+ struct kmapset_key *key;
+ unsigned long value;
+ struct hlist_node map_link;
+ union {
+ struct hlist_node key_link;
+ struct rcu_head rcu_head;
+ };
+};
+
+static inline void kmapset_lock(struct kmapset_set *set)
+{
+ mutex_lock(&set->mutex);
+}
+
+static inline void kmapset_unlock(struct kmapset_set *set)
+{
+ mutex_unlock(&set->mutex);
+}
+
+struct kmapset_map *kmapset_new(struct kmapset_set *set);
+
+static inline void kmapset_init_set(struct kmapset_set *set)
+{
+ mutex_init(&set->mutex);
+ set->tree = RB_ROOT;
+ set->default_value = 0;
+}
+
+static inline void kmapset_init_map(struct kmapset_map *map,
+ struct kmapset_set *set)
+{
+ kref_init(&map->kref);
+ map->size = 0;
+ map->set = set;
+ map->default_value = set->default_value;
+ INIT_HLIST_HEAD(&map->links);
+ RB_CLEAR_NODE(&map->node);
+}
+
+static inline void kmapset_init_key(struct kmapset_key *key)
+{
+ INIT_HLIST_HEAD(&key->links);
+}
+
+struct kmapset_map *kmapset_get(struct kmapset_map *map);
+void kmapset_put(struct kmapset_map *map);
+
+struct kmapset_map *kmapset_dup(struct kmapset_map *old);
+struct kmapset_map *kmapset_commit(struct kmapset_map *map);
+
+struct kmapset_link *kmapset_lookup(struct kmapset_map *map,
+ struct kmapset_key *key);
+unsigned long kmapset_get_value(struct kmapset_map *map,
+ struct kmapset_key *key);
+int kmapset_set_value(struct kmapset_map *map,
+ struct kmapset_key *key, unsigned long value);
+bool kmapset_del_value(struct kmapset_map *map, struct kmapset_key *key);
+void kmapset_set_default(struct kmapset_map *map, unsigned long value);
+
+void kmapset_unlink(struct kmapset_key *key, struct kmapset_set *set);
+
+#endif /* _LINUX_KMAPSET_H */
diff --git a/lib/Makefile b/lib/Makefile
index 5efd1b435a37..834393b7a964 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -130,6 +130,8 @@ obj-$(CONFIG_TEST_LIVEPATCH) += livepatch/
obj-$(CONFIG_KUNIT) += kunit/
+obj-y += kmapset.o
+
ifeq ($(CONFIG_DEBUG_KOBJECT),y)
CFLAGS_kobject.o += -DDEBUG
CFLAGS_kobject_uevent.o += -DDEBUG
diff --git a/lib/kmapset.c b/lib/kmapset.c
new file mode 100644
index 000000000000..5fc692149014
--- /dev/null
+++ b/lib/kmapset.c
@@ -0,0 +1,342 @@
+/*
+ * lib/kmapset.h
+ *
+ * Copyright (c) 2013-2015 Parallels IP Holdings GmbH
+ * Copyright (c) 2017-2021 Virtuozzo International GmbH. All rights reserved.
+ *
+ */
+
+#include <linux/mutex.h>
+#include <linux/kmapset.h>
+#include <linux/slab.h>
+#include <linux/hash.h>
+
+struct kmapset_map *kmapset_new(struct kmapset_set *set)
+{
+ struct kmapset_map *map;
+
+ map = kmalloc(sizeof(struct kmapset_map), GFP_KERNEL);
+ if (!map)
+ return NULL;
+ kmapset_init_map(map, set);
+ return map;
+}
+
+static void kmapset_free(struct kmapset_map *map)
+{
+ struct kmapset_link *link;
+ struct hlist_node *next;
+
+ hlist_for_each_entry_safe(link, next, &map->links, map_link)
+ kfree_rcu(link, rcu_head);
+ kfree_rcu(map, rcu_head);
+}
+
+static long kmapset_cmp(struct kmapset_map *map_a, struct kmapset_map *map_b)
+{
+ struct kmapset_link *link_a, *link_b;
+
+ if (map_a->hash != map_b->hash)
+ return map_a->hash - map_b->hash;
+
+ if (map_a->size != map_b->size)
+ return map_a->size - map_b->size;
+
+ link_a = hlist_entry(map_a->links.first,
+ struct kmapset_link, map_link);
+ link_b = hlist_entry(map_b->links.first,
+ struct kmapset_link, map_link);
+ while (&link_a->map_link) {
+ if (link_a->key != link_b->key)
+ return (long)link_a->key - (long)link_b->key;
+ if (link_a->value != link_b->value)
+ return link_a->value - link_b->value;
+ link_a = list_entry(link_a->map_link.next,
+ struct kmapset_link, map_link);
+ link_b = list_entry(link_b->map_link.next,
+ struct kmapset_link, map_link);
+ }
+
+ return map_a->default_value - map_b->default_value;
+}
+
+static inline bool kmapset_hashed(struct kmapset_map *map)
+{
+ return !RB_EMPTY_NODE(&map->node);
+}
+
+static bool kmapset_hash(struct kmapset_map *map, struct kmapset_map **old)
+{
+ struct rb_node **p = &map->set->tree.rb_node;
+ struct rb_node *parent = NULL;
+ struct kmapset_map *cur;
+ struct kmapset_link *link;
+ long diff;
+
+ map->hash = hash_long(map->default_value, BITS_PER_LONG);
+ hlist_for_each_entry(link, &map->links, map_link)
+ map->hash ^= hash_ptr(link->key, BITS_PER_LONG) *
+ hash_long(link->value, BITS_PER_LONG);
+
+ while (*p) {
+ parent = *p;
+ cur = rb_entry(parent, struct kmapset_map, node);
+ diff = kmapset_cmp(map, cur);
+ if (diff < 0)
+ p = &(*p)->rb_left;
+ else
+ p = &(*p)->rb_right;
+ if (!diff && old) {
+ *old = cur;
+ return true;
+ }
+ }
+ rb_link_node(&map->node, parent, p);
+ rb_insert_color(&map->node, &map->set->tree);
+ return false;
+}
+
+static void kmapset_unhash(struct kmapset_map *map)
+{
+ rb_erase(&map->node, &map->set->tree);
+ RB_CLEAR_NODE(&map->node);
+}
+
+static void kmapset_rehash(struct kmapset_map *map)
+{
+ if (kmapset_hashed(map)) {
+ kmapset_unhash(map);
+ kmapset_hash(map, NULL);
+ }
+}
+
+struct kmapset_map *kmapset_get(struct kmapset_map *map)
+{
+ if (map)
+ kref_get(&map->kref);
+ return map;
+}
+
+static void kmapset_release(struct kref *kref)
+{
+ struct kmapset_map *map = container_of(kref, struct kmapset_map, kref);
+ struct kmapset_set *set = map->set;
+ struct kmapset_link *link;
+
+ if (kmapset_hashed(map))
+ kmapset_unhash(map);
+ hlist_for_each_entry(link, &map->links, map_link)
+ hlist_del(&link->key_link);
+ mutex_unlock(&set->mutex);
+
+ kmapset_free(map);
+}
+
+void kmapset_put(struct kmapset_map *map)
+{
+ if (map)
+ kref_put_mutex(&map->kref, kmapset_release, &map->set->mutex);
+}
+
+/*
+ * kmapset_commit - hash new map into set or lookup existing copy\
+ *
+ * after committing map must stay immutable
+ */
+struct kmapset_map *kmapset_commit(struct kmapset_map *map)
+{
+ struct kmapset_set *set = map->set;
+ struct kmapset_map *ret = map;
+
+ kmapset_lock(set);
+ if (kmapset_hash(map, &ret)) {
+ kmapset_get(ret);
+ kmapset_release(&map->kref);
+ } else
+ kmapset_unlock(set);
+
+ return ret;
+}
+
+/*
+ * kmapset_copy - copy content of one set to another
+ */
+static int kmapset_copy(struct kmapset_map *dst, struct kmapset_map *src)
+{
+ struct kmapset_set *set = src->set;
+ struct kmapset_link *old_link, *new_link;
+ struct hlist_node *next;
+ int i;
+
+ for (i = src->size; i; i--) {
+ new_link = kmalloc(sizeof(struct kmapset_link), GFP_KERNEL);
+ if (!new_link)
+ return -ENOMEM;
+ hlist_add_head(&new_link->map_link, &dst->links);
+ }
+
+ kmapset_lock(set);
+ dst->default_value = src->default_value;
+ new_link = hlist_entry(dst->links.first, struct kmapset_link, map_link);
+ hlist_for_each_entry(old_link, &src->links, map_link) {
+ new_link->key = old_link->key;
+ new_link->value = old_link->value;
+ new_link->map = dst;
+ dst->size++;
+ hlist_add_head(&new_link->key_link, &new_link->key->links);
+ new_link = hlist_entry(new_link->map_link.next,
+ struct kmapset_link, map_link);
+ }
+ kmapset_unlock(set);
+
+ while (&new_link->map_link) {
+ next = new_link->map_link.next;
+ hlist_del(&new_link->map_link);
+ kfree(new_link);
+ new_link = hlist_entry(next, struct kmapset_link, map_link);
+ }
+
+ return 0;
+}
+
+struct kmapset_map *kmapset_dup(struct kmapset_map *map)
+{
+ struct kmapset_map *new;
+
+ new = kmapset_new(map->set);
+ if (!new)
+ return NULL;
+
+ if (kmapset_copy(new, map)) {
+ kmapset_free(new);
+ return NULL;
+ }
+
+ return new;
+}
+
+/*
+ * kmapset_value - lookup link object for given key
+ *
+ * requires kmapset_lock or rcu_read_lock
+ */
+struct kmapset_link *
+kmapset_lookup(struct kmapset_map *map, struct kmapset_key *key)
+{
+ struct kmapset_link *link;
+
+ hlist_for_each_entry_rcu(link, &map->links, map_link) {
+ if (link->key == key)
+ return link;
+ if (link->key > key)
+ break;
+ }
+ return NULL;
+}
+
+/*
+ * kmapset_get_value - retrieve value for given key
+ */
+unsigned long
+kmapset_get_value(struct kmapset_map *map, struct kmapset_key *key)
+{
+ struct kmapset_link *link;
+ unsigned long value;
+
+ rcu_read_lock();
+ link = kmapset_lookup(map, key);
+ value = link ? link->value : map->default_value;
+ rcu_read_unlock();
+ return value;
+}
+
+int kmapset_set_value(struct kmapset_map *map,
+ struct kmapset_key *key, unsigned long value)
+{
+ struct kmapset_set *set = map->set;
+ struct kmapset_link *new_link, *old_link, *last_link = NULL;
+
+ new_link = kmalloc(sizeof(struct kmapset_link), GFP_KERNEL);
+ if (!new_link)
+ return -ENOMEM;
+
+ new_link->key = key;
+ new_link->value = value;
+ new_link->map = map;
+
+ kmapset_lock(set);
+ if (hlist_empty(&map->links)) {
+ hlist_add_head_rcu(&new_link->map_link, &map->links);
+ } else {
+ hlist_for_each_entry(old_link, &map->links, map_link) {
+ last_link = old_link;
+ if (old_link->key < key)
+ continue;
+ if (old_link->key == key) {
+ old_link->value = value;
+ kfree(new_link);
+ goto out;
+ }
+ hlist_add_before_rcu(&new_link->map_link,
+ &old_link->map_link);
+ goto add;
+ }
+ hlist_add_behind_rcu(&new_link->map_link, &last_link->map_link);
+ }
+add:
+ hlist_add_head(&new_link->key_link, &new_link->key->links);
+ map->size++;
+out:
+ kmapset_unlock(set);
+
+ return 0;
+}
+
+bool kmapset_del_value(struct kmapset_map *map, struct kmapset_key *key)
+{
+ struct kmapset_set *set = map->set;
+ struct kmapset_link *link;
+ bool ret = false;
+
+ kmapset_lock(set);
+ link = kmapset_lookup(map, key);
+ if (link) {
+ hlist_del_rcu(&link->map_link);
+ hlist_del(&link->key_link);
+ map->size--;
+ kfree_rcu(link, rcu_head);
+ ret = true;
+ }
+ kmapset_unlock(set);
+ return ret;
+}
+
+void kmapset_set_default(struct kmapset_map *map, unsigned long value)
+{
+ struct kmapset_set *set = map->set;
+
+ kmapset_lock(set);
+ map->default_value = value;
+ kmapset_unlock(set);
+}
+
+/*
+ * kmapset_unlink - unlink key from all maps in set
+ */
+void kmapset_unlink(struct kmapset_key *key, struct kmapset_set *set)
+{
+ struct kmapset_link *link;
+ struct kmapset_map *map;
+ struct hlist_node *next;
+
+ kmapset_lock(set);
+ hlist_for_each_entry_safe(link, next, &key->links, key_link) {
+ map = link->map;
+ hlist_del(&link->key_link);
+ hlist_del_rcu(&link->map_link);
+ map->size--;
+ kfree_rcu(link, rcu_head);
+ kmapset_rehash(map);
+ }
+ kmapset_unlock(set);
+}
More information about the Devel
mailing list