<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
</head>
<body>
<div class="moz-cite-prefix">Konstantin,</div>
<div class="moz-cite-prefix"><br>
Thanks for your previous information. We have set aside a good bit
of time to dig into this issue further, and while we are not kernel
developers, I believe we are close to pinning down the root cause.
We will likely need your assistance to fully isolate it.</div>
<div class="moz-cite-prefix"><br>
</div>
<div class="moz-cite-prefix">So, here is an example of one of our
nodes and the problem child that we have found:</div>
<div class="moz-cite-prefix"><br>
</div>
<div class="moz-cite-prefix">Excerpt from /proc/meminfo:</div>
<div class="moz-cite-prefix">MemTotal: 196485548 kB<br>
MemFree: 4268192 kB<br>
MemAvailable: 122103908 kB<br>
</div>
<div class="moz-cite-prefix">Slab: 126500392 kB<br>
SReclaimable: 113541672 kB<br>
</div>
<div class="moz-cite-prefix"><br>
</div>
<div class="moz-cite-prefix">So, reviewing this information here, we
have 192gb of ram in this system, we have supposedly 122GB of
"MemAvaliable" and of that 113GB of stored in reclaimable slab
space.</div>
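<div class="moz-cite-prefix"><br>
</div>
<div class="moz-cite-prefix">(For reference, that ratio comes straight
out of /proc/meminfo; a quick shell check, nothing VZ-specific
assumed:)</div>
<pre>awk '/^(MemTotal|MemFree|MemAvailable|Slab|SReclaimable):/' /proc/meminfo
# share of "available" memory that is really just reclaimable slab
awk '/^MemAvailable:/ {a=$2} /^SReclaimable:/ {s=$2} END {printf "%.0f%% of MemAvailable is SReclaimable\n", s*100/a}' /proc/meminfo</pre>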
<div class="moz-cite-prefix"><br>
</div>
<div class="moz-cite-prefix">So based on this alone, we should be
able to rely on memory pressure to reclaim that 113GB worth of
memory and very little if any memory goes to swap. However, this
does not happen, as is proof from dozens of nodes digging into
swap. So let's dig further. <br>
</div>
<div class="moz-cite-prefix"><br>
</div>
<div class="moz-cite-prefix">Looking at the Slab space in detail
with slabtop we are provided with the following:</div>
<div class="moz-cite-prefix"><br>
</div>
<div class="moz-cite-prefix">Active / Total Objects (% used) :
752665180 / 754941816 (99.7%)<br>
Active / Total Slabs (% used) : 31612963 / 31612963 (100.0%)<br>
Active / Total Caches (% used) : 152 / 181 (84.0%)<br>
Active / Total Size (% used) : 123299873.38K /
123780755.78K (99.6%)<br>
Minimum / Average / Maximum Object : 0.01K / 0.16K / 15.25K<br>
<br>
OBJS ACTIVE USE OBJ SIZE SLABS OBJ/SLAB CACHE SIZE
NAME <br>
570457209 569865854 99% 0.19K 27164629 21 108658516K
dentry<br>
166341504 166231690 99% 0.06K 2599086 64 10396344K
kmalloc-64<br>
3776724 3380646 89% 0.19K 179844 21 719376K
kmalloc-192<br>
3325284 3220909 96% 1.07K 1108428 3 4433712K
ext4_inode_cache</div>
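<div class="moz-cite-prefix"><br>
</div>
<div class="moz-cite-prefix">(The dentry figure can also be pulled
non-interactively; a small sketch assuming the standard
/proc/slabinfo 2.1 column layout of name, active_objs, num_objs,
objsize:)</div>
<pre>awk '$1 == "dentry" {printf "dentry: %d active / %d total objects, %d bytes each\n", $2, $3, $4}' /proc/slabinfo
slabtop -o -s c | head -n 15    # one-shot slabtop, sorted by cache size</pre>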
<div class="moz-cite-prefix"><br>
</div>
<div class="moz-cite-prefix">So this is where things start to get
interesting. <br>
</div>
<div class="moz-cite-prefix"><br>
</div>
<div class="moz-cite-prefix">570457209 569865854 99% 0.19K
27164629 21 108658516K dentry</div>
<div class="moz-cite-prefix"><br>
</div>
<div class="moz-cite-prefix">This claims we have 570 Million dentry
cache entries in the kernel, and a whopping 569 Million of those
are active? <br>
</div>
<div class="moz-cite-prefix"><br>
</div>
<div class="moz-cite-prefix">But how is this possible, when taking a
total inode count of all customers & containers we only have
34970144 total inodes. So somehow the kernel is caching loads
non-existent dentry entries. <br>
</div>
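<div class="moz-cite-prefix"><br>
</div>
<div class="moz-cite-prefix">(That inode total was tallied per
container filesystem; roughly like the sketch below, where the
/vz/root/* mount pattern is only an illustration and should be
adjusted to however the container roots are actually mounted:)</div>
<pre>df --output=iused,target 2&gt;/dev/null | awk '$2 ~ /^\/vz\/root\// {sum += $1} END {print sum " inodes in use across container filesystems"}'</pre>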
<div class="moz-cite-prefix"><br>
</div>
<div class="moz-cite-prefix">We can somewhat print this value by
using the following file, however this was recently introduced in
the 3.10 branch and does not appear to be entirely accurate.<br>
</div>
<div class="moz-cite-prefix"><br>
</div>
<div class="moz-cite-prefix">cat /proc/sys/fs/dentry-state <br>
569791041 569357616 45 0 47217194 0</div>
<div class="moz-cite-prefix"><br>
</div>
<div class="moz-cite-prefix">So this claims 45 million negative
entires out of 569 million, so it reports some, but not a huge
percentage of them. Other servers this number actually reports
drastically higher than the total dentry_cache so we will ignore
the 5th column for now.</div>
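<div class="moz-cite-prefix"><br>
</div>
<div class="moz-cite-prefix">(Our reading of those columns, which may
well be off on this kernel: field names as per
Documentation/sysctl/fs.txt, with the 5th column assumed to be the
negative-dentry counter on kernels that carry that accounting:)</div>
<pre>read nr_dentry nr_unused age_limit want_pages nr_negative dummy &lt; /proc/sys/fs/dentry-state
echo "total=$nr_dentry unused=$nr_unused negative=$nr_negative"</pre>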
<div class="moz-cite-prefix"><br>
</div>
<div class="moz-cite-prefix">In testing we have replicated this
behavior to a degree. You can create a snapshot of a VPS, mount it
to a new location, stat all of the files and commit millions of
additional entries to the dentry_cache. Ploop in this instance,
does properly clean these entries up when the image is un-mounted
so this part is good.</div>
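<div class="moz-cite-prefix"><br>
</div>
<div class="moz-cite-prefix">Roughly, that reproduction looked like
the sketch below; the device and mountpoint are placeholders for
however the snapshot is actually exposed, the point being that
stat-ing a few million files inflates dentry-state and unmounting
prunes it again:</div>
<pre>cat /proc/sys/fs/dentry-state                   # baseline
mount -o ro /dev/ploopXXXXXp1 /mnt/snaptest     # placeholder device/path for the mounted snapshot
find /mnt/snaptest -xdev -print0 | xargs -0 stat &gt; /dev/null
cat /proc/sys/fs/dentry-state                   # millions of new entries
umount /mnt/snaptest
cat /proc/sys/fs/dentry-state                   # entries are pruned again after unmount</pre>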
<div class="moz-cite-prefix"><br>
</div>
<div class="moz-cite-prefix">However, we kept testing common things
we do. One common thing that we run from time to time is a live
ploop compact, since ploop images can continually grow based on
file usage within them and they must be re-sized from time to
time. Here's an example:</div>
<div class="moz-cite-prefix"><br>
</div>
<div class="moz-cite-prefix">Node has a total user space of 31205700
inodes.</div>
<div class="moz-cite-prefix"><br>
</div>
<div class="moz-cite-prefix">Dentry cache starts out at the
following:</div>
<div class="moz-cite-prefix"><br>
</div>
<div class="moz-cite-prefix">cat /proc/sys/fs/dentry-state <br>
56585091 56124555 45 0 54216828 0 <br>
</div>
<div class="moz-cite-prefix"><br>
</div>
<div class="moz-cite-prefix">We then ran a pcompact across all
containers to look for orphaned space.</div>
<div class="moz-cite-prefix"><br>
</div>
<div class="moz-cite-prefix"><br>
</div>
<div class="moz-cite-prefix">pcompact completed:</div>
<div class="moz-cite-prefix"><br>
</div>
<div class="moz-cite-prefix">cat /proc/sys/fs/dentry-state <br>
98663884 96136064 45 0 63732825 0</div>
<div class="moz-cite-prefix"><br>
</div>
<div class="moz-cite-prefix">We have gained 42 million dentry
entries from the pcompact process. </div>
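<div class="moz-cite-prefix"><br>
</div>
<div class="moz-cite-prefix">(The before/after numbers were captured
simply by sampling the first dentry-state column around the run;
trivially:)</div>
<pre>before=$(awk '{print $1}' /proc/sys/fs/dentry-state)
# ... run pcompact across the containers here, as we normally do ...
after=$(awk '{print $1}' /proc/sys/fs/dentry-state)
echo "dentry entries gained: $((after - before))"</pre>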
<div class="moz-cite-prefix"><br>
</div>
<div class="moz-cite-prefix"><br>
</div>
<div class="moz-cite-prefix">Overall, I believe in the beginning of
our conversation you eluded to this problem without going into
detail on it when you mentioned the following:</div>
<div class="moz-cite-prefix">"Most often it's some kind of homebrew
backup processes - just because it's their job to read
<br>
files from disk while performing a backup - and thus they generate
a lot of pagecache."</div>
<div class="moz-cite-prefix"><br>
</div>
<div class="moz-cite-prefix">Based on my research this is an
over-simplification of the core issue. Once you begin to research
dentry_cache and the issues surrounding it you begin to notice a
pattern. Linux servers that deal with larges amounts of inodes and
very frequent access schedules (i.e. lots of busy virtual machines
running as web servers, temporary files, cache files etc) the
problem with dentry_cache presents its self.</div>
<div class="moz-cite-prefix"><br>
</div>
<div class="moz-cite-prefix">Now, in the dcache.c code base, you do
find the following variable "vfs_cache_pressure" which can be
adjusted to reclaim dentry_cache</div>
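<div class="moz-cite-prefix"><br>
</div>
<div class="moz-cite-prefix">It is set like any other sysctl; values
above the default of 100 are documented to make the kernel prefer
reclaiming dentries and inodes:</div>
<pre>sysctl vm.vfs_cache_pressure            # default: vm.vfs_cache_pressure = 100
sysctl -w vm.vfs_cache_pressure=1000    # bias reclaim towards dentry/inode caches
# (or persistently via /etc/sysctl.conf / sysctl.d)</pre>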
<div class="moz-cite-prefix"><br>
</div>
<div class="moz-cite-prefix">Great, we have a solution.. except it
does not work as confirmed here: <a
href="https://access.redhat.com/solutions/55818">https://access.redhat.com/solutions/55818</a></div>
<div class="moz-cite-prefix"><br>
</div>
<div class="moz-cite-prefix">An excerpt from this solution
specifically states:</div>
<div class="moz-cite-prefix">"We can see high dentry_cache usage on
the systems those who are running some programs which are opening
and closing huge number of files. Sometimes high dentry_cache
leads the system to run out of memory in such situations
performance gets severly impacted as the system will start using
swap space."</div>
<div class="moz-cite-prefix"><br>
</div>
<div class="moz-cite-prefix">As well as:<br>
</div>
<div class="moz-cite-prefix"><br>
</div>
<div class="moz-cite-prefix">Diagnostic Steps<br>
Testing attempted using vm.vfs_cache_pressure values >100 has
no effect.<br>
Testing using the workaround of echo 2 >
/proc/sys/vm/drop_caches immediately reclaims almost all
dentry_cache memory.<br>
use slabtop to monitor how much dentry_cache is in use:<br>
</div>
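<div class="moz-cite-prefix"><br>
</div>
<div class="moz-cite-prefix">That matches what we ran ourselves; for
clarity, the exact workaround and the follow-up check (drop_caches=2
frees reclaimable slab objects such as dentries and inodes):</div>
<pre>sync
echo 2 &gt; /proc/sys/vm/drop_caches      # free reclaimable slab objects (dentries and inodes)
awk '$1 == "dentry"' /proc/slabinfo      # confirm the dentry cache has shrunk
grep -E '^(MemFree|SReclaimable):' /proc/meminfo</pre>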
<div class="moz-cite-prefix"><br>
</div>
<div class="moz-cite-prefix"><br>
</div>
<div class="moz-cite-prefix">So, let us continue down the rabbit
hole of dentry_cache and we come across this discussion w/ Linus
about the topic: <a
href="https://patchwork.kernel.org/patch/9884869/#20866043">https://patchwork.kernel.org/patch/9884869/#20866043</a></div>
<div class="moz-cite-prefix"><br>
</div>
<div class="moz-cite-prefix">Long and short Linus suggests the
responsibility to clean / prevent dentry_cache buildup of
negative/bad entries falls mostly on the user space, as the
example of rm was given. <br>
</div>
<div class="moz-cite-prefix"><br>
</div>
<div class="moz-cite-prefix">This problem uniquely impacts para
virtualized systems like VZ/Ploop / docker / LXC etc that share a
common kernel and file system layer. Because this cache is
intentionally not separated per cgroup you end up with a massive
mess. *If* the kernel and memory pressure actually cleaned these
entries properly we would not have any issue, but the fact is this
does not work. No amount of cache_pressure removes any significant
amount of the reclaimable slab space. If the cache pressure
doesn't do it, that explains why an application requiring extra
memory doesn't either.</div>
<div class="moz-cite-prefix"><br>
</div>
<div class="moz-cite-prefix">As an example we took a similar node,
used the above referenced /proc/sys/vm/drop_caches to purge the
dentry_cache and out of 100~gb of slab space we have regained
80+GB of that. Now the dentry_cache is slowly creeping back up and
the number of entries is growing 10-20 million per day, but after
5 days so far I still have 80GB of free memory for customer
applications and only 40~gb of buffers/cache , most of which still
seems to be dentry_cache related.<br>
Also, after dropping caches & freeing memory up the total
consumed swap does begin to finally drop instead of continually
grow. This particular node which had 100gb in the dentry_cache
also had 20gb worth of swap. Yet now it operates beautifully with
80GB of free memory. <br>
</div>
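<div class="moz-cite-prefix"><br>
</div>
<div class="moz-cite-prefix">To watch the regrowth rate on that node
we now log a few counters on a schedule; a minimal sketch (the log
path is arbitrary):</div>
<pre>while :; do
    printf '%s dentries=%s MemFree=%skB SwapFree=%skB\n' \
        "$(date -Is)" \
        "$(awk '{print $1}' /proc/sys/fs/dentry-state)" \
        "$(awk '/^MemFree:/ {print $2}' /proc/meminfo)" \
        "$(awk '/^SwapFree:/ {print $2}' /proc/meminfo)" &gt;&gt; /var/log/dentry-growth.log
    sleep 3600    # hourly
done</pre>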
<div class="moz-cite-prefix"><br>
</div>
<p>This problem seems to be big enough that, while we have not yet
verified it ourselves, Red Hat has even patched their kernel to
include limits for at least negative dentries - <a
href="https://access.redhat.com/solutions/4982351">https://access.redhat.com/solutions/4982351</a><br>
</p>
<p>There is also a separately proposed patch set that has seen much
discussion on the issue as well - <a
href="https://lwn.net/Articles/814535/">https://lwn.net/Articles/814535/</a></p>
<p>I would appreciate your review of and insight into this. I am not
a kernel developer, much less a programmer, but the key things I
take away from this are as follows:</p>
<p><br>
</p>
<p>1) Systems are reporting a substantial amount of reclaimable
memory that cannot be reclaimed</p>
<p>2) After ~100 days of uptime and ~30 million inodes, the
dentry_cache grows to an insane size, numbering in the hundreds of
millions of entries and consuming 100+ GB of system memory. We have
some nodes with close to 1 billion entries.<br>
</p>
<p>3) vfs_cache_pressure has no impact in recovering this space</p>
<p>4) Real-world active applications are somehow given a lower
priority than dentry_cache / reclaimable slab space and are forced
into swap instead of the dentry_cache / slab being cleared.</p>
<p>5) ploop compact seems to dramatically increase the unreclaimable
dentry_cache and may be one of the core applications adding to the
bloat, but many applications appear to generate dentry bloat.<br>
</p>
<p><br>
</p>
<p>We are attempting to dig further into this using SystemTap and the
information provided here -
<a class="moz-txt-link-freetext" href="https://access.redhat.com/articles/2850581">https://access.redhat.com/articles/2850581</a>; however, the VZ kernels
do not play well with this base script and are preventing us from
getting accurate details at this time. The temporary fix is to purge
all of the inode/dentry cache, but this is a band-aid for a problem
that shouldn't exist; all of the current mechanisms that are supposed
to handle this automatically just, quite frankly, don't work. <br>
</p>
<p><br>
</p>
<p>Thanks for your time and I look forward to your reply. <br>
</p>
<br>
<div class="moz-cite-prefix"><br>
</div>
<div class="moz-cite-prefix">On 7/23/20 12:37 PM, Konstantin
Khorenko wrote:<br>
</div>
<blockquote type="cite"
cite="mid:eb7cdb3b-64a8-cabe-5ae6-400f4917c08c@virtuozzo.com">
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
On 07/23/2020 06:34 PM, CoolCold wrote:<br>
<blockquote
cite="mid:CAGqmV7rmsMc-Dd+p_pJxSGXA7oiv2q0QHnX9Ghm6Te=xjmfqLw@mail.gmail.com"
type="cite">
<div dir="ltr">
<div dir="ltr">
<div>Hello!</div>
<div><br>
</div>
<div>1st - great work guys! Dealign with LXC and even LXD
makes me miss my old good OpenVZ box because of tech
excellence! Keep going!<br>
</div>
<div>2nd - my 2 cents for content - I"m not a native
speaker, but still suggest some small fixes.</div>
</div>
</div>
</blockquote>
<br>
1. Thank you very much for the feedback!<br>
And you are very welcome back to use OpenVZ instead of LXC again.
:)<br>
<br>
2. And many thanks for content corrections!<br>
i've just created a wiki page for tcache - decided this info
should be saved somewhere publicly available.<br>
i've also added a section how to enabled/disable tcache for
Containers.<br>
<br>
And you are very welcome to edit the wiki page as well. :)<br>
<br>
<a href="https://wiki.openvz.org/Tcache" moz-do-not-send="true">https://wiki.openvz.org/Tcache</a><br>
<br>
<pre class="moz-signature" cols="179">--
Best regards,
Konstantin Khorenko,
Virtuozzo Linux Kernel Team
</pre>
<br>
<blockquote
cite="mid:CAGqmV7rmsMc-Dd+p_pJxSGXA7oiv2q0QHnX9Ghm6Te=xjmfqLw@mail.gmail.com"
type="cite">
<div dir="ltr">
<div class="gmail_quote">
<div dir="ltr" class="gmail_attr">On Thu, Jul 23, 2020 at
9:52 PM Konstantin Khorenko <<a moz-do-not-send="true"
href="mailto:khorenko@virtuozzo.com">khorenko@virtuozzo.com</a>>
wrote:<br>
</div>
<blockquote class="gmail_quote" style="margin:0px 0px 0px
0.8ex;border-left:1px solid
rgb(204,204,204);padding-left:1ex">On 07/22/2020 03:04 PM,
Daniel Pearson wrote:<br>
<br>
>> b) you can disable tcache for this Container<br>
>> memcg::memory.disable_cleancache<br>
>> (raise your hand if you wish me to explain
what tcache is)<br>
><br>
> I'm all for additional information as it can help to
form proper<br>
> opinions if you don't mind providing it.<br>
<br>
Hope after reading it you'll catch yourself on an idea
that now you are aware of one more<br>
small feature which makes VZ is really cool and that there
are a lot of things which<br>
just work somewhere in the background simply (and
silently) making it possible for you<br>
to utilize the hardware at maximum. :)<br>
<br>
Tcache<br>
======<br>
<br>
Brief tech explanation:<br>
=======================<br>
Transcendent file cache (tcache) is a driver for
cleancache<br>
<a moz-do-not-send="true"
href="https://www.kernel.org/doc/html/v4.18/vm/cleancache.html"
rel="noreferrer" target="_blank">https://www.kernel.org/doc/html/v4.18/vm/cleancache.html</a>
,<br>
which stores reclaimed pages in memory unmodified. Its
purpose it to<br>
adopt pages evicted from a memory cgroup on _local_
pressure (inside a Container),<br>
so that they can be fetched back later without costly disk
accesses.<br>
<br>
Detailed explanation:<br>
=====================<br>
Tcache is intended increase the overall Hardware Node
performance only<br>
</blockquote>
<div>Intented "to"<br>
</div>
<blockquote class="gmail_quote" style="margin:0px 0px 0px
0.8ex;border-left:1px solid
rgb(204,204,204);padding-left:1ex"> on undercommitted
Nodes, i.e. sum of all Containers memory limits on the
Node<br>
</blockquote>
<div>i.e. "where total sum of all Containers memory limit
values placed on the Node" <br>
</div>
<blockquote class="gmail_quote" style="margin:0px 0px 0px
0.8ex;border-left:1px solid
rgb(204,204,204);padding-left:1ex"> is less than Hardware
Node RAM size.<br>
<br>
Imagine a situation: you have a Node with 1Tb of RAM,<br>
you run 500 Containers on it limited by 1Gb of memory each
(no swap for simplicity).<br>
Let's consider Container to be more or less identical,
similar load, similar activity inside.<br>
=> normally those Containers must use 500Gb of physical
RAM at max, right,<br>
and 500Gb will be just free on the Node.<br>
<br>
You think it's simple situation - ok, the node is
underloaded, let's put more Containers there,<br>
but that's not always true - it depends on what is the
bottleneck on the Node,<br>
which depends on real workload of Containers running on
the Node.<br>
But most often in real life - the disk becomes the
bottleneck first, not the RAM, not the CPU.<br>
<br>
Example: let's assume all those Containers run, say,
cPanel, which by default collect some stats<br>
every, say, 15 minutes - the stat collection process is
run via crontab.<br>
<br>
(Side note: randomizing times of crontab jobs - is a good
idea, but who usually does this<br>
for Containers? We did it for application templates we
shipped in Virtuozzo, but lot of<br>
software is just installed and configured inside
Containers, we cannot do this. And often<br>
Hosting Providers are not allowed to touch data in
Containers - so most often cron jobs are<br>
not randomized.)<br>
<br>
Ok, it does not matter how, but let's assume we get such a
workload - every, say, 15 minutes<br>
(it's important that data access it quite rare), each
Container accesses many small files,<br>
let it be just 100 small files to gather stats and save it
somewhere.<br>
In 500 Containers. Simultaneously.<br>
In parallel with other regular i/o workload.<br>
On HDDs.<br>
<br>
It's nightmare for disk subsystem, you know, if an HDD
provides 100 IOPS,<br>
it will take 50000/100/60 = 8.(3) minutes(!) to handle.<br>
OK, there could be RAID, let it is able to handle 300
IOPS, it results in<br>
2.(7) minutes, and we forgot about other regular i/o,<br>
so it means every 15 minutes, the Node became almost
unresponsive for several minutes<br>
until it handles all that random i/o generated by stats
collection.<br>
<br>
You can ask - but why _every_ 15 minutes? You've read once
a file and it resides in the<br>
Container pagecache!<br>
That's true, but here comes _15 minutes_ period. The
larger period - the worse.<br>
If a Container is active enough, it just reads more and
more files - website data,<br>
pictures, video clips, files of a fileserver, don't know.<br>
The thing is in 15 minutes it's quite possible a Container
reads more than its RAM limit<br>
(remember - only 1Gb in our case!), and thus all old
pagecache is dropped, substituted<br>
with the fresh one.<br>
And thus in 15 minutes it's quite possible you'll have to
read all those 100 files in each<br>
Container from disk.<br>
<br>
And here comes tcache to save us: let's don't completely
drop pagecache which is<br>
reclaimed from a Container (on local(!) reclaim), but save
this pagecache in<br>
a special cache (tcache) on the Host in case there is free
RAM on the Host.<br>
<br>
And in 15 minutes when all Containers start to access lot
of small files again -<br>
those files data will be get back into Container pagecache
without reading from<br>
physical disk - viola, we saves IOPS, no Node stuck
anymore.<br>
<br>
Q: can a Container be so active (i.e. read so much from
disk) that this "useful"<br>
pagecache is dropped even from tcache.<br>
</blockquote>
<div>missing question mark - ? <br>
</div>
<blockquote class="gmail_quote" style="margin:0px 0px 0px
0.8ex;border-left:1px solid
rgb(204,204,204);padding-left:1ex"> A: Yes. But tcache
extends the "safe" period.<br>
<br>
Q: mainstream? LXC/Proxmox?<br>
A: No, it's Virtuozzo/OpenVZ specific.<br>
"cleancache" - the base for tcache it in mainstream,
it's used for Xen.<br>
But we (VZ) wrote a driver for it and use it for
Containers as well.<br>
<br>
Q: i use SSD, not HDD, does tcache help me?<br>
A: SSD can provide much more IOPS, thus the Node's
performance increase caused by tcache<br>
is less, but still reading from RAM (tcache is in RAM)
is faster than reading from SSD.<br>
</blockquote>
<div>is less "significant"<br>
</div>
<blockquote class="gmail_quote" style="margin:0px 0px 0px
0.8ex;border-left:1px solid
rgb(204,204,204);padding-left:1ex"> <br>
<br>
>> c) you can limit the max amount of memory which
can be used for<br>
>> pagecache for this Container<br>
>> memcg::memory.cache.limit_in_bytes<br>
><br>
> This seems viable to test as well. Currently it seems
to be utilizing a<br>
> high number 'unlimited' default. I assume the only
way to set this is to<br>
> directly interact with the memory cgroup and not via
a standard ve<br>
> config value?<br>
<br>
Yes, you are right.<br>
We use this setting for some internal system cgroups
running processes<br>
which are known to generate a lot of pagecache which won't
be used later for sure.<br>
<br>
From my perspective it's not fair to apply such a setting
to a Container<br>
globally - well, CT owner pay for an amount of RAM, it
should be able to use<br>
this RAM for whatever he wants to - even for pagecache,<br>
so limiting the pagecache for a Container is not a tweak
we is advised to be used<br>
against a Container => no standard config parameter.<br>
<br>
Note: disabling tcache for a Container is completely fair,<br>
you disable just an optimization for the whole Hardware
Node performance,<br>
but all RAM configured for a Container - is still
available to the Container.<br>
(but also no official config value for that - most often
it helps, not hurts)<br>
<br>
<br>
> I assume regardless if we utilized vSwap or not, we
would likely still<br>
> experience these additional swapping issues,
presumably from pagecache<br>
> applications, or would the usage of vSwap intercept
some of these items<br>
> thus preventing them from being swapped to disk?<br>
<br>
vSwap - is the optimization for swapping process _local to
a Container_,<br>
it can prevent some Container anonymous pages to be
written to the physical swap,<br>
if _local_ Container reclaim decides to swapout something.<br>
<br>
At the moment you experience swapping on the Node level.<br>
Even if some Container's processes are put to the physical
swap,<br>
it's a decision of the global reclaim mechanism,<br>
so it's completely unrelated to vSwap =><br>
even if you assign some swappages to Containers and thus
enable vSwap for those Containers,<br>
i should not influence anyhow on global Node level memory
pressure and<br>
will not result in any difference in the swapping rate
into physical swap.<br>
<br>
Hope that helps.<br>
<br>
--<br>
Best regards,<br>
<br>
Konstantin Khorenko,<br>
Virtuozzo Linux Kernel Team<br>
_______________________________________________<br>
Users mailing list<br>
<a moz-do-not-send="true" href="mailto:Users@openvz.org"
target="_blank">Users@openvz.org</a><br>
<a moz-do-not-send="true"
href="https://lists.openvz.org/mailman/listinfo/users"
rel="noreferrer" target="_blank">https://lists.openvz.org/mailman/listinfo/users</a><br>
</blockquote>
</div>
<br clear="all">
<br>
-- <br>
<div dir="ltr" class="gmail_signature">Best regards,<br>
[COOLCOLD-RIPN] </div>
</div>
<br>
<fieldset class="mimeAttachmentHeader"></fieldset>
<br>
<pre wrap="">_______________________________________________
Users mailing list
<a class="moz-txt-link-abbreviated" href="mailto:Users@openvz.org" moz-do-not-send="true">Users@openvz.org</a>
<a class="moz-txt-link-freetext" href="https://lists.openvz.org/mailman/listinfo/users" moz-do-not-send="true">https://lists.openvz.org/mailman/listinfo/users</a>
</pre>
</blockquote>
<br>
<br>
<fieldset class="mimeAttachmentHeader"></fieldset>
<pre class="moz-quote-pre" wrap="">_______________________________________________
Users mailing list
<a class="moz-txt-link-abbreviated" href="mailto:Users@openvz.org">Users@openvz.org</a>
<a class="moz-txt-link-freetext" href="https://lists.openvz.org/mailman/listinfo/users">https://lists.openvz.org/mailman/listinfo/users</a>
</pre>
</blockquote>
<p><br>
</p>
<pre class="moz-signature" cols="72">--
Sincerely
Daniel C Pearson
COO KnownHost, LLC
<a class="moz-txt-link-freetext" href="https://www.knownhost.com">https://www.knownhost.com</a></pre>
</body>
</html>