[Devel] [PATCH RHEL7 COMMIT] ms/netfilter: ipset: fix race condition between swap/destroy and kernel side add/del/test

Konstantin Khorenko khorenko at virtuozzo.com
Mon Sep 30 15:21:53 MSK 2024


The commit is pushed to "branch-rh7-3.10.0-1160.119.1.vz7.224.x-ovz" and will appear at git at bitbucket.org:openvz/vzkernel.git
after rh7-3.10.0-1160.119.1.vz7.224.3
------>
commit f5d151859238b985d8ed566ca0d3abbe42215c40
Author: Jozsef Kadlecsik <kadlec at netfilter.org>
Date:   Wed Sep 25 17:35:40 2024 +0800

    ms/netfilter: ipset: fix race condition between swap/destroy and kernel side add/del/test
    
    Linkui Xiao reported that there's a race condition when ipset swap and destroy is
    called, which can lead to crash in add/del/test element operations. Swap then
    destroy are usual operations to replace a set with another one in a production
    system. The issue can in some cases be reproduced with the script:
    
    ipset create hash_ip1 hash:net family inet hashsize 1024 maxelem 1048576
    ipset add hash_ip1 172.20.0.0/16
    ipset add hash_ip1 192.168.0.0/16
    iptables -A INPUT -m set --match-set hash_ip1 src -j ACCEPT
    while [ 1 ]
    do
            # ... Ongoing traffic...
            ipset create hash_ip2 hash:net family inet hashsize 1024 maxelem 1048576
            ipset add hash_ip2 172.20.0.0/16
            ipset swap hash_ip1 hash_ip2
            ipset destroy hash_ip2
            sleep 0.05
    done
    
    In the race case the possible order of the operations are
    
            CPU0                    CPU1
            ip_set_test
                                    ipset swap hash_ip1 hash_ip2
                                    ipset destroy hash_ip2
            hash_net_kadt
    
    Swap replaces hash_ip1 with hash_ip2 and then destroy removes hash_ip2 which
    is the original hash_ip1. ip_set_test was called on hash_ip1 and because destroy
    removed it, hash_net_kadt crashes.
    
    The fix is to force ip_set_swap() to wait for all readers to finish accessing the
    old set pointers by calling synchronize_rcu().
    
    The first version of the patch was written by Linkui Xiao <xiaolinkui at kylinos.cn>.
    
    v2: synchronize_rcu() is moved into ip_set_swap() in order not to burden
        ip_set_destroy() unnecessarily when all sets are destroyed.
    v3: Florian Westphal pointed out that all netfilter hooks run with rcu_read_lock() held
        and em_ipset.c wraps the entire ip_set_test() in rcu read lock/unlock pair.
        So there's no need to extend the rcu read locked area in ipset itself.
    
    Closes: https://lore.kernel.org/all/69e7963b-e7f8-3ad0-210-7b86eebf7f78@netfilter.org/
    Reported by: Linkui Xiao <xiaolinkui at kylinos.cn>
    Signed-off-by: Jozsef Kadlecsik <kadlec at netfilter.org>
    Signed-off-by: Pablo Neira Ayuso <pablo at netfilter.org>
    
    https://virtuozzo.atlassian.net/browse/PSBM-155867
    (cherry picked from commit 28628fa952fefc7f2072ce6e8016968cc452b1ba)
    Signed-off-by: Pavel Tikhomirov <ptikhomirov at virtuozzo.com>
    
    =================
    Patchset description:
    netfilter: ipset: Fix possible cause of memory corruption
    
    Patch [1] fixes possible race between swap/destroy and add/del/test.
    
    Here is is possible order of events when this race can lead to double
    free with kfree_rcu on already freed hash bucket:
    
     # Thread 1
      +-> ip_set_add
        +-> set = ip_set_rcu_get(xt_net(par), index)
        < pause >
    
     # Thread 2
    ipset swap
    ipset destroy
      +-> mtype_destroy
        +-> mtype_ahash_destroy
          +-> n = __ipset_dereference(hbucket(t, i))
          +-> kfree(n)
    
     # Thread 1
        < unpause >
        +-> ip_set_lock(set)
        +-> hash_net4_kadt
          +-> mtype_variant->adt[adt]
            +-> net4_kadt_add (mtype_add)
              +-> n = rcu_dereference_bh(hbucket(t, key))
              +-> if (n->pos >= n->size)
                +-> old = n
              +-> if (old != ERR_PTR(-ENOENT))
                +-> kfree_rcu(old, rcu)
    
    That can in it's turn lead to possible rcu free list corruption if this
    double fried memory is reused just after rcu_free and before actual rcu
    callback.
    
    note1: The patch [1] has a reproducer but, sadly, I was unable to
    reproduce the situation, even adding mdelay(100) to ip_set_add/del/test.
    
    note2: All other patches are fixups to the original fixing patch.
    
    note3: We don't have proof that this a fix to original issue, we only
    know that ipset hash buckets were several times seen double freed just
    before crash, which might indirectly indicate that the original problem
    is related to ipset hash buckets.
    
    Signed-off-by: Pavel Tikhomirov <ptikhomirov at virtuozzo.com>
    https://virtuozzo.atlassian.net/browse/PSBM-155867
    
    Alexander Maltsev (1):
      netfilter: ipset: Add list flush to cancel_gc
    
    Eric Dumazet (1):
      netns: add pre_exit method to struct pernet_operations
    
    Jozsef Kadlecsik (5):
      netfilter: ipset: fix race condition between swap/destroy and kernel
        side add/del/test [1]
      netfilter: ipset: fix performance regression in swap operation
      netfilter: ipset: Missing gc cancellations fixed
      netfilter: ipset: Fix race between namespace cleanup and gc in the
        list:set type
      netfilter: ipset: Fix suspicious rcu_dereference_protected()
---
 net/netfilter/ipset/ip_set_core.c | 14 +++++++-------
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/net/netfilter/ipset/ip_set_core.c b/net/netfilter/ipset/ip_set_core.c
index d47d97839fa7..ca30e6495fc7 100644
--- a/net/netfilter/ipset/ip_set_core.c
+++ b/net/netfilter/ipset/ip_set_core.c
@@ -64,6 +64,8 @@ MODULE_ALIAS_NFNL_SUBSYS(NFNL_SUBSYS_IPSET);
 	ip_set_dereference((inst)->ip_set_list)[id]
 #define ip_set_ref_netlink(inst,id)	\
 	rcu_dereference_raw((inst)->ip_set_list)[id]
+#define ip_set_dereference_nfnl(p)	\
+	rcu_dereference_check(p, lockdep_nfnl_is_held(NFNL_SUBSYS_IPSET))
 
 /* The set types are implemented in modules and registered set types
  * can be found in ip_set_type_list. Adding/deleting types is
@@ -548,15 +550,10 @@ __ip_set_put_netlink(struct ip_set *set)
 static inline struct ip_set *
 ip_set_rcu_get(struct net *net, ip_set_id_t index)
 {
-	struct ip_set *set;
 	struct ip_set_net *inst = ip_set_pernet(net);
 
-	rcu_read_lock();
-	/* ip_set_list itself needs to be protected */
-	set = rcu_dereference(inst->ip_set_list)[index];
-	rcu_read_unlock();
-
-	return set;
+	/* ip_set_list and the set pointer need to be protected */
+	return ip_set_dereference_nfnl(inst->ip_set_list)[index];
 }
 
 static inline void
@@ -1259,6 +1256,9 @@ ip_set_swap(struct sock *ctnl, struct sk_buff *skb,
 	ip_set(inst, to_id) = from;
 	write_unlock_bh(&ip_set_ref_lock);
 
+	/* Make sure all readers of the old set pointers are completed. */
+	synchronize_rcu();
+
 	return 0;
 }
 


More information about the Devel mailing list