<div dir="ltr"><div dir="ltr"><div>Hello!</div><div><br></div><div>1st - great work guys! Dealign with LXC and even LXD makes me miss my old good OpenVZ box because of tech excellence! Keep going!<br></div><div>2nd - my 2 cents for content - I"m not a native speaker, but still suggest some small fixes.</div><div><br></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Thu, Jul 23, 2020 at 9:52 PM Konstantin Khorenko <<a href="mailto:khorenko@virtuozzo.com">khorenko@virtuozzo.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">On 07/22/2020 03:04 PM, Daniel Pearson wrote:<br>
<br>
>> b) you can disable tcache for this Container<br>
>> memcg::memory.disable_cleancache<br>
>> (raise your hand if you wish me to explain what tcache is)<br>
><br>
> I'm all for additional information as it can help to form proper<br>
> opinions if you don't mind providing it.<br>
<br>
Hope after reading it you'll catch yourself on an idea that now you are aware of one more<br>
small feature which makes VZ is really cool and that there are a lot of things which<br>
just work somewhere in the background simply (and silently) making it possible for you<br>
to utilize the hardware at maximum. :)<br>
<br>
Tcache<br>
======<br>
<br>
Brief tech explanation:<br>
=======================<br>
Transcendent file cache (tcache) is a driver for cleancache<br>
<a href="https://www.kernel.org/doc/html/v4.18/vm/cleancache.html" rel="noreferrer" target="_blank">https://www.kernel.org/doc/html/v4.18/vm/cleancache.html</a> ,<br>
which stores reclaimed pages in memory unmodified. Its purpose it to<br>
adopt pages evicted from a memory cgroup on _local_ pressure (inside a Container),<br>
so that they can be fetched back later without costly disk accesses.<br>
<br>
Detailed explanation:<br>
=====================<br>
Tcache is intended increase the overall Hardware Node performance only<br></blockquote><div>Intented "to"<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
on undercommitted Nodes, i.e. sum of all Containers memory limits on the Node<br></blockquote><div>i.e. "where total sum of all Containers memory limit values placed on the Node" <br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
is less than Hardware Node RAM size.<br>
<br>
Imagine a situation: you have a Node with 1Tb of RAM,<br>
you run 500 Containers on it limited by 1Gb of memory each (no swap for simplicity).<br>
Let's consider Container to be more or less identical, similar load, similar activity inside.<br>
=> normally those Containers must use 500Gb of physical RAM at max, right,<br>
and 500Gb will be just free on the Node.<br>
<br>
You think it's simple situation - ok, the node is underloaded, let's put more Containers there,<br>
but that's not always true - it depends on what is the bottleneck on the Node,<br>
which depends on real workload of Containers running on the Node.<br>
But most often in real life - the disk becomes the bottleneck first, not the RAM, not the CPU.<br>
<br>
Example: let's assume all those Containers run, say, cPanel, which by default collect some stats<br>
every, say, 15 minutes - the stat collection process is run via crontab.<br>
<br>
(Side note: randomizing times of crontab jobs - is a good idea, but who usually does this<br>
for Containers? We did it for application templates we shipped in Virtuozzo, but lot of<br>
software is just installed and configured inside Containers, we cannot do this. And often<br>
Hosting Providers are not allowed to touch data in Containers - so most often cron jobs are<br>
not randomized.)<br>
<br>
Ok, it does not matter how, but let's assume we get such a workload - every, say, 15 minutes<br>
(it's important that data access it quite rare), each Container accesses many small files,<br>
let it be just 100 small files to gather stats and save it somewhere.<br>
In 500 Containers. Simultaneously.<br>
In parallel with other regular i/o workload.<br>
On HDDs.<br>
<br>
It's nightmare for disk subsystem, you know, if an HDD provides 100 IOPS,<br>
it will take 50000/100/60 = 8.(3) minutes(!) to handle.<br>
OK, there could be RAID, let it is able to handle 300 IOPS, it results in<br>
2.(7) minutes, and we forgot about other regular i/o,<br>
so it means every 15 minutes, the Node became almost unresponsive for several minutes<br>
until it handles all that random i/o generated by stats collection.<br>
<br>
You can ask - but why _every_ 15 minutes? You've read once a file and it resides in the<br>
Container pagecache!<br>
That's true, but here comes _15 minutes_ period. The larger period - the worse.<br>
If a Container is active enough, it just reads more and more files - website data,<br>
pictures, video clips, files of a fileserver, don't know.<br>
The thing is in 15 minutes it's quite possible a Container reads more than its RAM limit<br>
(remember - only 1Gb in our case!), and thus all old pagecache is dropped, substituted<br>
with the fresh one.<br>
And thus in 15 minutes it's quite possible you'll have to read all those 100 files in each<br>
Container from disk.<br>
<br>
And here comes tcache to save us: let's don't completely drop pagecache which is<br>
reclaimed from a Container (on local(!) reclaim), but save this pagecache in<br>
a special cache (tcache) on the Host in case there is free RAM on the Host.<br>
<br>
And in 15 minutes when all Containers start to access lot of small files again -<br>
those files data will be get back into Container pagecache without reading from<br>
physical disk - viola, we saves IOPS, no Node stuck anymore.<br>
<br>
Q: can a Container be so active (i.e. read so much from disk) that this "useful"<br>
pagecache is dropped even from tcache.<br></blockquote><div>missing question mark - ? <br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
A: Yes. But tcache extends the "safe" period.<br>
<br>
Q: mainstream? LXC/Proxmox?<br>
A: No, it's Virtuozzo/OpenVZ specific.<br>
"cleancache" - the base for tcache it in mainstream, it's used for Xen.<br>
But we (VZ) wrote a driver for it and use it for Containers as well.<br>
<br>
Q: i use SSD, not HDD, does tcache help me?<br>
A: SSD can provide much more IOPS, thus the Node's performance increase caused by tcache<br>
is less, but still reading from RAM (tcache is in RAM) is faster than reading from SSD.<br></blockquote><div>is less "significant"<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<br>
<br>
>> c) you can limit the max amount of memory which can be used for<br>
>> pagecache for this Container<br>
>> memcg::memory.cache.limit_in_bytes<br>
><br>
> This seems viable to test as well. Currently it seems to be utilizing a<br>
> high number 'unlimited' default. I assume the only way to set this is to<br>
> directly interact with the memory cgroup and not via a standard ve<br>
> config value?<br>
<br>
Yes, you are right.<br>
We use this setting for some internal system cgroups running processes<br>
which are known to generate a lot of pagecache which won't be used later for sure.<br>
<br>
From my perspective it's not fair to apply such a setting to a Container<br>
globally - well, CT owner pay for an amount of RAM, it should be able to use<br>
this RAM for whatever he wants to - even for pagecache,<br>
so limiting the pagecache for a Container is not a tweak we is advised to be used<br>
against a Container => no standard config parameter.<br>
<br>
Note: disabling tcache for a Container is completely fair,<br>
you disable just an optimization for the whole Hardware Node performance,<br>
but all RAM configured for a Container - is still available to the Container.<br>
(but also no official config value for that - most often it helps, not hurts)<br>
<br>
<br>
> I assume regardless if we utilized vSwap or not, we would likely still<br>
> experience these additional swapping issues, presumably from pagecache<br>
> applications, or would the usage of vSwap intercept some of these items<br>
> thus preventing them from being swapped to disk?<br>
<br>
vSwap - is the optimization for swapping process _local to a Container_,<br>
it can prevent some Container anonymous pages to be written to the physical swap,<br>
if _local_ Container reclaim decides to swapout something.<br>
<br>
At the moment you experience swapping on the Node level.<br>
Even if some Container's processes are put to the physical swap,<br>
it's a decision of the global reclaim mechanism,<br>
so it's completely unrelated to vSwap =><br>
even if you assign some swappages to Containers and thus enable vSwap for those Containers,<br>
i should not influence anyhow on global Node level memory pressure and<br>
will not result in any difference in the swapping rate into physical swap.<br>
<br>
Hope that helps.<br>
<br>
--<br>
Best regards,<br>
<br>
Konstantin Khorenko,<br>
Virtuozzo Linux Kernel Team<br>
_______________________________________________<br>
Users mailing list<br>
<a href="mailto:Users@openvz.org" target="_blank">Users@openvz.org</a><br>
<a href="https://lists.openvz.org/mailman/listinfo/users" rel="noreferrer" target="_blank">https://lists.openvz.org/mailman/listinfo/users</a><br>
</blockquote></div><br clear="all"><br>-- <br><div dir="ltr" class="gmail_signature">Best regards,<br>[COOLCOLD-RIPN] </div></div>