[Users] Some details of vSwap implementation in Virtuozzo 7: "tcache" in details

Daniel Pearson daniel at knownhost.com
Thu Jul 30 15:50:14 MSK 2020


Konstantin,

Thanks for your previous information. We've set aside a good bit of time
to dig into this issue further, and while we are not kernel developers, I
believe we are fairly close to determining the root cause. We will likely
need your assistance to fully isolate it.

So, here is an example from one of our nodes, and the problem child we
have found:

Excerpt from /proc/meminfo:
MemTotal:       196485548 kB
MemFree:         4268192 kB
MemAvailable:   122103908 kB
Slab:           126500392 kB
SReclaimable:   113541672 kB

Reviewing this information: we have 192 GB of RAM in this system,
supposedly 122 GB of "MemAvailable", and of that, 113 GB is sitting in
reclaimable slab space.
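
For a quick sanity check, this is roughly how we pull those numbers
(just a sketch; it assumes the standard kB units in /proc/meminfo):

awk '/^(MemTotal|SReclaimable):/ { v[$1] = $2 }
     END {
         # values are in kB; 1048576 kB = 1 GB
         printf "SReclaimable: %.1f GB (%.0f%% of MemTotal)\n",
                v["SReclaimable:"] / 1048576,
                100 * v["SReclaimable:"] / v["MemTotal:"]
     }' /proc/meminfo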

Based on this alone, we should be able to rely on memory pressure to
reclaim that 113 GB of memory, with very little, if any, memory going to
swap. However, this does not happen, as evidenced by dozens of nodes
digging into swap. So let's dig further.

Looking at the slab space in detail with slabtop, we are provided with
the following:

Active / Total Objects (% used)    : 752665180 / 754941816 (99.7%)
  Active / Total Slabs (% used)      : 31612963 / 31612963 (100.0%)
  Active / Total Caches (% used)     : 152 / 181 (84.0%)
  Active / Total Size (% used)       : 123299873.38K / 123780755.78K (99.6%)
  Minimum / Average / Maximum Object : 0.01K / 0.16K / 15.25K

   OBJS ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME
570457209 569865854  99%    0.19K 27164629       21 108658516K dentry
166341504 166231690  99%    0.06K 2599086       64  10396344K kmalloc-64
3776724 3380646  89%    0.19K 179844       21    719376K kmalloc-192
3325284 3220909  96%    1.07K 1108428        3   4433712K ext4_inode_cache
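
(For anyone wanting to reproduce this view, we take a one-shot snapshot
sorted by cache size; the flags below are from procps-ng slabtop:)

slabtop --once --sort=c | head -n 15

# the raw counts are also in /proc/slabinfo (as root) if slabtop
# is unavailable; column 2 is active_objs
sort -k2 -nr /proc/slabinfo | head -n 5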

So this is where things start to get interesting.

570457209 569865854  99%    0.19K 27164629       21 108658516K dentry

This claims we have 570 million dentry cache entries in the kernel, and
a whopping 569 million of those are active?

But how is this possible, when a total inode count across all customers
and containers gives only 34,970,144 inodes? Somehow the kernel is
caching loads of dentries for files that do not exist.
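
(How we arrived at that inode count - a sketch; /vz/root/<CTID> is the
usual container root mount on our nodes, adjust the glob to your layout,
and note that any bind mounts would be double-counted:)

total=0
for ct in /vz/root/*; do
    used=$(df --output=iused "$ct" | tail -n 1)
    total=$((total + used))
done
echo "inodes in use across all containers: $total"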

We can partially inspect this via the following file; however, the
negative-dentry counter was only recently introduced in the 3.10 branch
and does not appear to be entirely accurate.

cat /proc/sys/fs/dentry-state
569791041       569357616       45      0       47217194        0

So this claims roughly 47 million negative entries out of 569 million -
it reports some, but not a huge percentage of them. On other servers this
number actually reports drastically higher than the total dentry count,
so we will ignore the 5th column for now.
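
For clarity, the six fields come from fs/dcache.c (nr_dentry, nr_unused,
age_limit, want_pages, nr_negative, dummy - the fifth being the
backported negative counter), so a labelled read looks like this:

awk '{ printf "total: %d\nunused: %d\nnegative: %d\n", $1, $2, $5 }' \
    /proc/sys/fs/dentry-state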

In testing we have replicated this behavior to a degree. You can create
a snapshot of a VPS, mount it to a new location, stat all of the files,
and commit millions of additional entries to the dentry_cache. Ploop, in
this instance, does properly clean these entries up when the image is
unmounted, so this part is good.
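
Roughly, the test looked like this (the container ID, paths, and exact
ploop invocations here are illustrative, not copy-paste commands):

cat /proc/sys/fs/dentry-state        # note field 1 (total dentries)

ploop snapshot /vz/private/1001/root.hdd/DiskDescriptor.xml
ploop mount -m /mnt/ct1001-test /vz/private/1001/root.hdd/DiskDescriptor.xml

# touch every dentry once
find /mnt/ct1001-test -xdev -print0 | xargs -0 stat > /dev/null 2>&1

cat /proc/sys/fs/dentry-state        # total jumps by roughly the file count

ploop umount /vz/private/1001/root.hdd/DiskDescriptor.xml
cat /proc/sys/fs/dentry-state        # and here it drops back down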

However, we kept testing common things we do. One common thing we run
from time to time is a live ploop compact, since ploop images continually
grow based on file usage within them and must be resized from time to
time. Here's an example:

The node has a total user space of 31,205,700 inodes.

Dentry cache starts out at the following:

cat /proc/sys/fs/dentry-state
56585091 56124555 45 0 54216828 0

We then ran a pcompact across all containers to look for orphaned space.


pcompact completed:

cat /proc/sys/fs/dentry-state
98663884 96136064 45 0 63732825 0

We have gained 42 million dentry entries from the pcompact process.
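
(The measurement itself is trivial - field 1 of dentry-state before and
after; run pcompact however your scheduled job already does, the bare
invocation below is illustrative:)

before=$(awk '{ print $1 }' /proc/sys/fs/dentry-state)
pcompact    # run exactly as the cron job does
after=$(awk '{ print $1 }' /proc/sys/fs/dentry-state)
echo "dentries added by pcompact: $((after - before))"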


Overall, I believe that at the beginning of our conversation you alluded
to this problem, without going into detail on it, when you mentioned the
following:
"Most often it's some kind of homebrew backup processes - just because 
it's their job to read
files from disk while performing a backup - and thus they generate a lot 
of pagecache."

Based on my research this is an over-simplification of the core issue.
Once you begin to research dentry_cache and the issues surrounding it,
you notice a pattern: on Linux servers that deal with large numbers of
inodes and very frequent access patterns (i.e. lots of busy virtual
machines running web servers, temporary files, cache files, etc.), the
dentry_cache problem presents itself.

Now, in the dcache.c code base you find the tunable
"vfs_cache_pressure", which can be adjusted to encourage reclaim of the
dentry_cache.

Great, we have a solution... except it does not work, as confirmed here:
https://access.redhat.com/solutions/55818

An excerpt from this solution specifically states:
"We can see high dentry_cache usage on the systems those who are running 
some programs which are opening and closing huge number of files. 
Sometimes high dentry_cache leads the system to run out of memory in 
such situations performance gets severly impacted as the system will 
start using swap space."

As well as:

Diagnostic Steps
- Testing attempted using vm.vfs_cache_pressure values >100 has no effect.
- Testing the workaround of "echo 2 > /proc/sys/vm/drop_caches"
  immediately reclaims almost all dentry_cache memory.
- Use slabtop to monitor how much dentry_cache is in use.
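
For completeness, this is how we measure what that workaround actually
frees (straightforward, nothing VZ-specific about it):

grep -E '^(Slab|SReclaimable):' /proc/meminfo
sync                               # flush dirty data first
echo 2 > /proc/sys/vm/drop_caches  # 2 = reclaimable slab (dentries/inodes)
grep -E '^(Slab|SReclaimable):' /proc/meminfo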


So, let us continue down the rabbit hole of dentry_cache, and we come
across this discussion with Linus about the topic:
https://patchwork.kernel.org/patch/9884869/#20866043

Long story short, Linus suggests that the responsibility for cleaning up
/ preventing the buildup of negative/bad dentry entries falls mostly on
user space, with rm given as the example.

This problem uniquely impacts shared-kernel virtualization systems like
VZ/ploop, Docker, LXC, etc., which share a common kernel and filesystem
layer. Because this cache is intentionally not separated per cgroup, you
end up with a massive mess. *If* the kernel and memory pressure actually
cleaned these entries properly we would not have any issue, but the fact
is this does not work. No amount of cache_pressure removes any
significant amount of the reclaimable slab space. And if the cache
pressure doesn't do it, that explains why an application requiring extra
memory doesn't either.

As an example, we took a similar node and used the above-referenced
/proc/sys/vm/drop_caches to purge the dentry_cache, and out of ~100 GB of
slab space we regained 80+ GB. The dentry_cache is now slowly creeping
back up, growing 10-20 million entries per day, but after 5 days I still
have 80 GB of free memory for customer applications and only ~40 GB of
buffers/cache, most of which still seems to be dentry_cache related.
Also, after dropping caches and freeing memory, the total consumed swap
finally begins to drop instead of continually growing. This particular
node, which had 100 GB in the dentry_cache, also had 20 GB of swap in
use. Yet now it operates beautifully with 80 GB of free memory.
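
We are tracking the regrowth with a trivial loop like the one below
(the log path is arbitrary):

while sleep 3600; do
    dent=$(awk '{ print $1 }' /proc/sys/fs/dentry-state)
    swap=$(awk '/^SwapTotal:/ { t = $2 } /^SwapFree:/ { f = $2 }
                END { print t - f }' /proc/meminfo)
    echo "$(date '+%F %T') dentries=$dent swap_used_kb=$swap"
done >> /var/log/dentry-growth.log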

This problem seems to be big enough that, while we have not yet verified
it, Red Hat has even patched their kernel to include limits for at least
negative dentries - https://access.redhat.com/solutions/4982351

There is also a separate suggested patch here that has had much 
discussion as well on the issue - https://lwn.net/Articles/814535/

I would appreciate your review of and insight into this. I am not a
kernel developer, much less a programmer, but the key things I take away
from this are as follows:


1) Systems are reporting a substantial amount of reclaimable memory that
cannot, in practice, be reclaimed.

2) After ~100 days of uptime and ~30 million real inodes, the
dentry_cache grows to an insane size, numbering in the hundreds of
millions of entries and consuming 100+ GB of system memory. We have some
nodes with close to 1 billion entries.

3) vfs_cache_pressure has no impact on recovering this space.

4) Real-world active applications are somehow given a lower priority
than the dentry_cache / reclaimable slab space and are forced into swap
instead of the dentry_cache / slab being cleared.

5) ploop compact seems to dramatically increase the unreclaimable
dentry_cache and may be one of the core applications adding to the
bloat, though many applications appear to generate dentry bloat.


We are attempting to dig further into this using SystemTap and the
information provided here - https://access.redhat.com/articles/2850581 -
however, the VZ kernels do not play well with this base parser, which is
preventing us from getting accurate details at this time. The temporary
fix is to purge the entire inode/dentry cache, but this is a band-aid for
a problem that shouldn't exist; all of the current mechanisms that are
supposed to handle this automatically, quite frankly, don't work.
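
In case it is useful, this is the kind of probe we have been attempting
(a sketch only - it needs kernel-debuginfo matching the running VZ
kernel, which is exactly where we are currently stuck):

stap -e '
global hits
probe kernel.function("d_alloc") { hits[execname()]++ }
probe timer.s(60) {
    foreach (name in hits- limit 10)
        printf("%-20s %d\n", name, hits[name])
    exit()
}'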


Thanks for your time and I look forward to your reply.



On 7/23/20 12:37 PM, Konstantin Khorenko wrote:
> On 07/23/2020 06:34 PM, CoolCold wrote:
>> Hello!
>>
>> 1st - great work guys! Dealign with LXC and even LXD makes me miss my 
>> old good OpenVZ box because of tech excellence! Keep going!
>> 2nd - my 2 cents for content - I"m not a native speaker, but still 
>> suggest some small fixes.
>
> 1. Thank you very much for the feedback!
> And you are very welcome back to use OpenVZ instead of LXC again. :)
>
> 2. And many thanks for content corrections!
> i've just created a wiki page for tcache - decided this info should be 
> saved somewhere publicly available.
> i've also added a section how to enabled/disable tcache for Containers.
>
> And you are very welcome to edit the wiki page as well. :)
>
> https://wiki.openvz.org/Tcache
>
> --
> Best regards,
>
> Konstantin Khorenko,
> Virtuozzo Linux Kernel Team
>
>> On Thu, Jul 23, 2020 at 9:52 PM Konstantin Khorenko 
>> <khorenko at virtuozzo.com <mailto:khorenko at virtuozzo.com>> wrote:
>>
>>     On 07/22/2020 03:04 PM, Daniel Pearson wrote:
>>
>>     >> b) you can disable tcache for this Container
>>     >> memcg::memory.disable_cleancache
>>     >>       (raise your hand if you wish me to explain what tcache is)
>>     >
>>     > I'm all for additional information as it can help to form proper
>>     > opinions if you don't mind providing it.
>>
>>     Hope after reading it you'll catch yourself on an idea that now
>>     you are aware of one more
>>     small feature which makes VZ is really cool and that there are a
>>     lot of things which
>>     just work somewhere in the background simply (and silently)
>>     making it possible for you
>>     to utilize the hardware at maximum. :)
>>
>>     Tcache
>>     ======
>>
>>     Brief tech explanation:
>>     =======================
>>     Transcendent file cache (tcache) is a driver for cleancache
>>     https://www.kernel.org/doc/html/v4.18/vm/cleancache.html ,
>>     which stores reclaimed pages in memory unmodified. Its purpose is to
>>     adopt pages evicted from a memory cgroup on _local_ pressure
>>     (inside a Container),
>>     so that they can be fetched back later without costly disk accesses.
>>
>>     Detailed explanation:
>>     =====================
>>     Tcache is intended increase the overall Hardware Node performance
>>     only
>>
>> Intented "to"
>>
>>     on undercommitted Nodes, i.e. sum of all Containers memory limits
>>     on the Node
>>
>> i.e. "where total sum of all Containers memory limit values placed on 
>> the Node"
>>
>>     is less than Hardware Node RAM size.
>>
>>     Imagine a situation: you have a Node with 1Tb of RAM,
>>     you run 500 Containers on it limited by 1Gb of memory each (no
>>     swap for simplicity).
>>     Let's consider Container to be more or less identical, similar
>>     load, similar activity inside.
>>     => normally those Containers must use 500Gb of physical RAM at
>>     max, right,
>>     and 500Gb will be just free on the Node.
>>
>>     You think it's simple situation - ok, the node is underloaded,
>>     let's put more Containers there,
>>     but that's not always true - it depends on what is the bottleneck
>>     on the Node,
>>     which depends on real workload of Containers running on the Node.
>>     But most often in real life - the disk becomes the bottleneck
>>     first, not the RAM, not the CPU.
>>
>>     Example: let's assume all those Containers run, say, cPanel,
>>     which by default collect some stats
>>     every, say, 15 minutes - the stat collection process is run via
>>     crontab.
>>
>>     (Side note: randomizing times of crontab jobs - is a good idea,
>>     but who usually does this
>>     for Containers? We did it for application templates we shipped in
>>     Virtuozzo, but lot of
>>     software is just installed and configured inside Containers, we
>>     cannot do this. And often
>>     Hosting Providers are not allowed to touch data in Containers -
>>     so most often cron jobs are
>>     not randomized.)
>>
>>     Ok, it does not matter how, but let's assume we get such a
>>     workload - every, say, 15 minutes
>>     (it's important that data access it quite rare), each Container
>>     accesses many small files,
>>     let it be just 100 small files to gather stats and save it somewhere.
>>     In 500 Containers. Simultaneously.
>>     In parallel with other regular i/o workload.
>>     On HDDs.
>>
>>     It's a nightmare for the disk subsystem, you know: 500 Containers
>>     x 100 files = 50,000 reads; if an HDD provides 100 IOPS,
>>     it will take 50000/100/60 = 8.(3) minutes(!) to handle.
>>     OK, there could be RAID, let it is able to handle 300 IOPS, it
>>     results in
>>     2.(7) minutes, and we forgot about other regular i/o,
>>     so it means every 15 minutes, the Node became almost unresponsive
>>     for several minutes
>>     until it handles all that random i/o generated by stats collection.
>>
>>     You can ask - but why _every_ 15 minutes? You've read once a file
>>     and it resides in the
>>     Container pagecache!
>>     That's true, but here comes _15 minutes_ period. The larger
>>     period - the worse.
>>     If a Container is active enough, it just reads more and more
>>     files - website data,
>>     pictures, video clips, files of a fileserver, don't know.
>>     The thing is in 15 minutes it's quite possible a Container reads
>>     more than its RAM limit
>>     (remember - only 1Gb in our case!), and thus all old pagecache is
>>     dropped, substituted
>>     with the fresh one.
>>     And thus in 15 minutes it's quite possible you'll have to read
>>     all those 100 files in each
>>     Container from disk.
>>
>>     And here comes tcache to save us: let's don't completely drop
>>     pagecache which is
>>     reclaimed from a Container (on local(!) reclaim), but save this
>>     pagecache in
>>     a special cache (tcache) on the Host in case there is free RAM on
>>     the Host.
>>
>>     And in 15 minutes when all Containers start to access lot of
>>     small files again -
>>     those files' data will be fetched back into Container pagecache
>>     without reading from the
>>     physical disk - voila, we save IOPS, no Node gets stuck anymore.
>>
>>     Q: can a Container be so active (i.e. read so much from disk)
>>     that this "useful"
>>     pagecache is dropped even from tcache.
>>
>> missing question mark - ?
>>
>>     A: Yes. But tcache extends the "safe" period.
>>
>>     Q: mainstream? LXC/Proxmox?
>>     A: No, it's Virtuozzo/OpenVZ specific.
>>         "cleancache" - the base for tcache it in mainstream, it's
>>     used for Xen.
>>         But we (VZ) wrote a driver for it and use it for Containers
>>     as well.
>>
>>     Q: i use SSD, not HDD, does tcache help me?
>>     A: SSD can provide much more IOPS, thus the Node's performance
>>     increase caused by tcache
>>         is less, but still reading from RAM (tcache is in RAM) is
>>     faster than reading from SSD.
>>
>> is less "significant"
>>
>>
>>
>>     >> c) you can limit the max amount of memory which can be used for
>>     >> pagecache for this Container
>>     >>       memcg::memory.cache.limit_in_bytes
>>     >
>>     > This seems viable to test as well. Currently it seems to be
>>     utilizing a
>>     > high number 'unlimited' default. I assume the only way to set
>>     this is to
>>     > directly interact with the memory cgroup and not via a standard ve
>>     > config value?
>>
>>     Yes, you are right.
>>     We use this setting for some internal system cgroups running
>>     processes
>>     which are known to generate a lot of pagecache which won't be
>>     used later for sure.
>>
>>      From my perspective it's not fair to apply such a setting to a
>>     Container
>>     globally - well, CT owner pay for an amount of RAM, it should be
>>     able to use
>>     this RAM for whatever he wants to - even for pagecache,
>>     so limiting the pagecache for a Container is not a tweak we
>>     advise to be used
>>     against a Container => no standard config parameter.
>>
>>     Note: disabling tcache for a Container is completely fair,
>>     you disable just an optimization for the whole Hardware Node
>>     performance,
>>     but all RAM configured for a Container - is still available to
>>     the Container.
>>     (but also no official config value for that - most often it
>>     helps, not hurts)
>>
>>
>>     > I assume regardless if we utilized vSwap or not, we would
>>     likely still
>>     > experience these additional swapping issues, presumably from
>>     pagecache
>>     > applications, or would the usage of vSwap intercept some of
>>     these items
>>     > thus preventing them from being swapped to disk?
>>
>>     vSwap - is the optimization for swapping process _local to a
>>     Container_,
>>     it can prevent some Container anonymous pages to be written to
>>     the physical swap,
>>     if _local_ Container reclaim decides to swapout something.
>>
>>     At the moment you experience swapping on the Node level.
>>     Even if some Container's processes are put to the physical swap,
>>     it's a decision of the global reclaim mechanism,
>>     so it's completely unrelated to vSwap =>
>>     even if you assign some swappages to Containers and thus enable
>>     vSwap for those Containers,
>>     it should not influence global Node-level memory
>>     pressure in any way and
>>     will not result in any difference in the swapping rate into
>>     physical swap.
>>
>>     Hope that helps.
>>
>>     --
>>     Best regards,
>>
>>     Konstantin Khorenko,
>>     Virtuozzo Linux Kernel Team
>>
>>
>>
>> -- 
>> Best regards,
>> [COOLCOLD-RIPN]
>>
>>
>
>


-- 
Sincerely
Daniel C Pearson
COO KnownHost, LLC
https://www.knownhost.com
