<div dir="ltr"><div dir="ltr"><div>Hello!</div><div><br></div><div>1st - great work guys! Dealign with LXC and even LXD makes me miss my old good OpenVZ box because of tech excellence! Keep going!<br></div><div>2nd - my 2 cents for content - I"m not a native speaker, but still suggest some small fixes.</div><div><br></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Thu, Jul 23, 2020 at 9:52 PM Konstantin Khorenko <<a href="mailto:khorenko@virtuozzo.com">khorenko@virtuozzo.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">On 07/22/2020 03:04 PM, Daniel Pearson wrote:<br>
>
> >> b) you can disable tcache for this Container
> >> memcg::memory.disable_cleancache
> >> (raise your hand if you wish me to explain what tcache is)
> >
> > I'm all for additional information as it can help to form proper
> > opinions if you don't mind providing it.
> I hope that after reading this you'll catch yourself thinking that you are now aware of
> one more small feature which makes VZ really cool, and that there are a lot of things
> which just work somewhere in the background, simply (and silently) making it possible
> for you to utilize the hardware at maximum. :)
>
> Tcache
> ======
>
> Brief tech explanation:
> =======================
> Transcendent file cache (tcache) is a driver for cleancache
> https://www.kernel.org/doc/html/v4.18/vm/cleancache.html ,
> which stores reclaimed pages in memory unmodified. Its purpose is to
> adopt pages evicted from a memory cgroup on _local_ pressure (inside a Container),
> so that they can be fetched back later without costly disk accesses.
>
> Detailed explanation:
> =====================
> Tcache is intended increase the overall Hardware Node performance only

intended "to" increase
> on undercommitted Nodes, i.e. sum of all Containers memory limits on the Node

i.e. "where the total sum of all Containers' memory limit values placed on the Node"
> is less than Hardware Node RAM size.
>
> Imagine a situation: you have a Node with 1Tb of RAM,
> and you run 500 Containers on it, each limited to 1Gb of memory (no swap for simplicity).
> Let's consider the Containers to be more or less identical: similar load, similar activity inside.
> => normally those Containers must use 500Gb of physical RAM at most, right,
> and the other 500Gb will be just free on the Node.
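
(A quick sanity check for whether your own Node is undercommitted in this sense - the
machine.slice layout below is my assumption, and note that Containers with no limit
report a huge number that will skew the sum:

    # sum all Containers' memory limits and compare with the Node's RAM
    cat /sys/fs/cgroup/memory/machine.slice/*/memory.limit_in_bytes | paste -sd+ - | bc
    grep MemTotal /proc/meminfo
)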
>
> You think it's a simple situation - ok, the node is underloaded, let's put more Containers there,
> but that's not always true - it depends on what the bottleneck on the Node is,
> which depends on the real workload of the Containers running on the Node.
> And most often in real life the disk becomes the bottleneck first, not the RAM, not the CPU.
>
> Example: let's assume all those Containers run, say, cPanel, which by default collects some stats
> every, say, 15 minutes - the stat collection process is run via crontab.
>
> (Side note: randomizing the times of crontab jobs is a good idea, but who usually does this
> for Containers? We did it for the application templates we shipped in Virtuozzo, but a lot of
> software is just installed and configured inside Containers, where we cannot do this. And often
> Hosting Providers are not allowed to touch data in Containers - so most often cron jobs are
> not randomized.)
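
(My side note on the side note: for anyone who does control the Container contents, a
common trick is to prepend a random sleep to the job so 500 Containers don't all hit the
disk in the same second. A minimal sketch - the job path and window are made up, and note
the backslash, since cron treats a bare % specially:

    # /etc/cron.d/stats - spread a 15-minute job over a ~9-minute window
    # $RANDOM is a bash-ism, hence the explicit bash -c
    */15 * * * * root bash -c 'sleep $((RANDOM \% 540)); /usr/local/bin/collect-stats'

Not a VZ feature, just plain cron hygiene.)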
>
> Ok, it does not matter how, but let's assume we get such a workload - every, say, 15 minutes
> (it's important that the data access is quite rare), each Container accesses many small files,
> let it be just 100 small files, to gather stats and save them somewhere.
> In 500 Containers. Simultaneously.
> In parallel with other regular i/o workload.
> On HDDs.
>
> It's a nightmare for the disk subsystem, you know: if an HDD provides 100 IOPS,
> it will take 50000/100/60 = 8.(3) minutes(!) to handle.
> OK, there could be RAID; let's say it is able to handle 300 IOPS, that results in
> 2.(7) minutes, and we forgot about the other regular i/o,
> so it means every 15 minutes the Node becomes almost unresponsive for several minutes
> until it handles all that random i/o generated by stats collection.
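
(The same arithmetic, spelled out so you can plug in your own numbers - the IOPS figures
are the example's assumptions, not measurements:

    # 500 Containers * 100 files = 50000 random reads; stall = reads / IOPS
    echo '50000/100' | bc    # plain HDD:   500 seconds (~8.3 min)
    echo '50000/300' | bc    # RAID @ 300:  166 seconds (~2.8 min)
)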
>
> You can ask - but why _every_ 15 minutes? You've read a file once and it resides in the
> Container pagecache!
> That's true, but here comes the _15 minutes_ period. The larger the period - the worse.
> If a Container is active enough, it just reads more and more files - website data,
> pictures, video clips, files on a fileserver, you name it.
> The thing is, in 15 minutes it's quite possible a Container reads more than its RAM limit
> (remember - only 1Gb in our case!), and thus all the old pagecache is dropped, substituted
> with the fresh one.
> And thus in 15 minutes it's quite possible you'll have to read all those 100 files in each
> Container from disk again.
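
(If you want to watch this churn on a live box: the Container's memcg exposes its current
pagecache size - the machine.slice path and CT id below are my guesses, adjust to taste:

    # "cache" is the page cache currently charged to this Container's memcg (cgroup v1)
    grep -w cache /sys/fs/cgroup/memory/machine.slice/1001/memory.stat

If that number turns over by more than the CT's 1Gb limit between two stats runs, those
100 small files have certainly been evicted in the meantime.)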
>
> And here comes tcache to save us: let's not completely drop the pagecache which is
> reclaimed from a Container (on local(!) reclaim), but save this pagecache in
> a special cache (tcache) on the Host, in case there is free RAM on the Host.
>
> And in 15 minutes, when all Containers start to access a lot of small files again,
> that file data will get back into the Container pagecache without reading from the
> physical disk - voila, we save IOPS, and the Node doesn't get stuck anymore.
>
> Q: can a Container be so active (i.e. read so much from disk) that this "useful"
> pagecache is dropped even from tcache.

missing question mark - ?
> A: Yes. But tcache extends the "safe" period.
>
> Q: mainstream? LXC/Proxmox?
> A: No, it's Virtuozzo/OpenVZ specific.
> "cleancache" - the base for tcache - is in mainstream, where it's used for Xen.
> But we (VZ) wrote a driver for it and use it for Containers as well.
>
> Q: I use SSD, not HDD, does tcache help me?
> A: SSD can provide much more IOPS, thus the Node's performance increase caused by tcache
> is less, but still reading from RAM (tcache is in RAM) is faster than reading from SSD.

is less "significant"
>
>
> >> c) you can limit the max amount of memory which can be used for
> >> pagecache for this Container
> >> memcg::memory.cache.limit_in_bytes
> >
> > This seems viable to test as well. Currently it seems to be utilizing a
> > high number 'unlimited' default. I assume the only way to set this is to
> > directly interact with the memory cgroup and not via a standard ve
> > config value?
>
> Yes, you are right.
> We use this setting for some internal system cgroups running processes
> which are known to generate a lot of pagecache that won't be used later for sure.
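
(For the archives, "directly interact" boils down to writing the cgroup file by hand;
the mount point and CT id here are illustrative guesses:

    # cap this Container's pagecache at 256M (VZ-specific memcg knob)
    echo $((256*1024*1024)) > /sys/fs/cgroup/memory/machine.slice/1001/memory.cache.limit_in_bytes

    # read it back - "unlimited" shows up as a very large number
    cat /sys/fs/cgroup/memory/machine.slice/1001/memory.cache.limit_in_bytes

And I'd expect a manually written value not to survive a Container restart unless you
script it somewhere.)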
>
> From my perspective it's not fair to apply such a setting to a Container
> globally - well, the CT owner pays for an amount of RAM and should be able to use
> this RAM for whatever he wants - even for pagecache,
> so limiting the pagecache for a Container is not a tweak that is advised to be used
> against a Container => no standard config parameter.
>
> Note: disabling tcache for a Container is completely fair,
> you just disable an optimization for the overall Hardware Node performance,
> but all the RAM configured for the Container is still available to the Container.
> (But there is also no official config value for that - most often tcache helps, not hurts.)
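
(Tying this back to the "memcg::memory.disable_cleancache" hint at the top: as far as I
can tell it's a per-memcg boolean, so disabling should look like the below - the cgroup
path is again my assumption, correct me if the knob expects something else:

    # opt this Container out of tcache; its locally reclaimed pagecache is then just dropped
    echo 1 > /sys/fs/cgroup/memory/machine.slice/1001/memory.disable_cleancache
)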
>
>
> > I assume regardless if we utilized vSwap or not, we would likely still
> > experience these additional swapping issues, presumably from pagecache
> > applications, or would the usage of vSwap intercept some of these items
> > thus preventing them from being swapped to disk?
>
> vSwap is an optimization for the swapping process _local to a Container_:
> it can prevent some of the Container's anonymous pages from being written to the
> physical swap, if the _local_ Container reclaim decides to swap out something.
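
(For context for other readers: vSwap is what a Container gets when you configure swap
for it; syntax as I remember it from the OpenVZ wiki, CT id and sizes arbitrary:

    # give CT 101 1Gb of RAM and 512Mb of vSwap
    vzctl set 101 --ram 1G --swap 512M --save

As Konstantin explains next, this only shapes the CT-local reclaim, not the Node's.)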
>
> At the moment you are experiencing swapping at the Node level.
> Even if some Container's processes are put into the physical swap,
> it's a decision of the global reclaim mechanism,
> so it's completely unrelated to vSwap =>
> even if you assign some swappages to Containers and thus enable vSwap for those Containers,
> it should not influence the global Node-level memory pressure in any way and
> will not result in any difference in the rate of swapping into physical swap.
>
> Hope that helps.
>
> --
> Best regards,
>
> Konstantin Khorenko,
> Virtuozzo Linux Kernel Team

--
Best regards,
[COOLCOLD-RIPN]