[Users] Some details of vSwap implementation in Virtuozzo 7: "tcache" in details

Konstantin Khorenko khorenko at virtuozzo.com
Thu Jul 23 17:47:41 MSK 2020


On 07/22/2020 03:04 PM, Daniel Pearson wrote:

>> b) you can disable tcache for this Container
>> memcg::memory.disable_cleancache
>>       (raise your hand if you wish me to explain what tcache is)
>
> I'm all for additional information as it can help to form proper
> opinions if you don't mind providing it.

Hope that after reading this you'll catch yourself on the idea that you are now aware of one more
small feature which makes VZ really cool, and that there are a lot of things which
just work somewhere in the background, simply (and silently) making it possible for you
to utilize the hardware at maximum. :)

Tcache
======

Brief tech explanation:
=======================
Transcendent file cache (tcache) is a driver for cleancache
https://www.kernel.org/doc/html/v4.18/vm/cleancache.html ,
which stores reclaimed pages in memory unmodified. Its purpose is to
adopt pages evicted from a memory cgroup on _local_ pressure (inside a Container),
so that they can be fetched back later without costly disk accesses.

Detailed explanation:
=====================
Tcache is intended to increase the overall Hardware Node performance only
on undercommitted Nodes, i.e. Nodes where the sum of all Containers' memory limits
is less than the Hardware Node RAM size.

Imagine a situation: you have a Node with 1Tb of RAM,
and you run 500 Containers on it, each limited to 1Gb of memory (no swap for simplicity).
Let's consider the Containers to be more or less identical: similar load, similar activity inside.
=> normally those Containers will use 500Gb of physical RAM at most,
and the other 500Gb will just sit free on the Node.

You might think it's a simple situation - OK, the Node is underloaded, let's put more Containers there -
but that's not always true: it depends on what the bottleneck on the Node is,
which in turn depends on the real workload of the Containers running on the Node.
Most often in real life the disk becomes the bottleneck first, not the RAM and not the CPU.

Example: let's assume all those Containers run, say, cPanel, which by default collects some stats
every, say, 15 minutes - the stats collection process is run via crontab.

(Side note: randomizing the times of crontab jobs is a good idea, but who usually does this
for Containers? We did it for the application templates we shipped in Virtuozzo, but a lot of
software is just installed and configured inside Containers, where we cannot do this. And often
Hosting Providers are not allowed to touch data in Containers - so most often cron jobs are
not randomized.)

OK, it does not matter how, but let's assume we get such a workload: every, say, 15 minutes
(it's important that the data access is quite rare), each Container accesses many small files -
let it be just 100 small files - to gather stats and save them somewhere.
In 500 Containers. Simultaneously.
In parallel with the other regular i/o workload.
On HDDs.

It's a nightmare for the disk subsystem: if an HDD provides 100 IOPS,
it will take 500 * 100 / 100 / 60 = 8.(3) minutes(!) to handle all 50000 reads.
OK, there could be a RAID; let's say it can handle 300 IOPS - that still results in
2.(7) minutes, and we forgot about the other regular i/o.
So every 15 minutes the Node becomes almost unresponsive for several minutes,
until it handles all the random i/o generated by the stats collection.
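The arithmetic above can be checked in a few lines, using the figures straight from the example (500 Containers, 100 small files each, a single HDD at 100 IOPS vs. a RAID at 300 IOPS):

```python
# Back-of-the-envelope check of the stats-collection i/o storm.

containers = 500
files_per_ct = 100                         # small files read by each stats job
total_reads = containers * files_per_ct    # 50,000 random reads, one IOP each

def minutes_to_drain(iops):
    """Time a disk subsystem at the given IOPS needs to serve all reads."""
    return total_reads / iops / 60

print(f"single HDD (100 IOPS): {minutes_to_drain(100):.1f} min")  # ~8.3 min
print(f"RAID       (300 IOPS): {minutes_to_drain(300):.1f} min")  # ~2.8 min
```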

You may ask: but why _every_ 15 minutes? You've read a file once and it resides in the
Container's pagecache!
That's true, but here the _15 minutes_ period comes into play. The longer the period - the worse.
If a Container is active enough, it just reads more and more files: website data,
pictures, video clips, files on a fileserver, whatever.
The thing is, in 15 minutes a Container may well read more than its RAM limit
(remember - only 1Gb in our case!), and thus all the old pagecache is dropped, substituted
with the fresh one.
And thus in 15 minutes it's quite possible you'll have to read all those 100 files in each
Container from disk again.

And here tcache comes to save us: instead of completely dropping pagecache which is
reclaimed from a Container (on local(!) reclaim), we save this pagecache in
a special cache (tcache) on the Host, as long as there is free RAM on the Host.

And in 15 minutes, when all Containers start to access their lots of small files again,
the file data is brought back into the Container pagecache without reading from the
physical disk - voila, we save IOPS, and the Node doesn't get stuck anymore.
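The put-on-reclaim / get-on-miss flow can be sketched as a toy model. This is only an illustration of the idea, not the real driver: the class, its capacity handling, and the "a successful get removes the page" rule are all simplifications made up for this sketch:

```python
from collections import OrderedDict

class TCacheModel:
    """Toy model of tcache: a host-wide, capacity-bounded store for clean
    pages evicted from Containers by *local* reclaim (not the real driver)."""

    def __init__(self, capacity_pages):
        self.capacity = capacity_pages
        self.store = OrderedDict()           # (ct_id, file, idx) -> page data

    def put(self, ct_id, file, idx, data):
        """Called when a clean page is reclaimed inside a Container."""
        key = (ct_id, file, idx)
        self.store[key] = data
        self.store.move_to_end(key)
        if len(self.store) > self.capacity:  # host RAM full: drop oldest page
            self.store.popitem(last=False)

    def get(self, ct_id, file, idx):
        """Called on a pagecache miss; page data, or None (go read the disk).
        In this model a hit is exclusive: the page moves back to the CT."""
        return self.store.pop((ct_id, file, idx), None)

tc = TCacheModel(capacity_pages=2)
tc.put(1, "stats.db", 0, b"A")
tc.put(1, "stats.db", 1, b"B")
assert tc.get(1, "stats.db", 0) == b"A"    # served from RAM, no disk i/o
assert tc.get(1, "stats.db", 0) is None    # fetched once, then gone
```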

Q: can a Container be so active (i.e. read so much from disk) that this "useful"
pagecache is dropped even from tcache?
A: Yes. But tcache extends the "safe" period.

Q: is it in mainstream? LXC/Proxmox?
A: No, it's Virtuozzo/OpenVZ specific.
    "cleancache" - the base for tcache - is in mainstream, where it's used for Xen.
    But we (VZ) wrote a driver for it and use it for Containers as well.

Q: I use SSD, not HDD; does tcache help me?
A: An SSD provides many more IOPS, so the Node's performance increase caused by tcache
    is smaller, but reading from RAM (tcache resides in RAM) is still faster than reading from an SSD.
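For completeness, here is a sketch of how the `memory.disable_cleancache` knob quoted at the top could be poked from a script. The memory-cgroup mount point and the per-Container cgroup path are assumptions (they depend on the VZ version and cgroup layout on your Node); treat this as an illustration, not a supported interface:

```python
import os

# Assumed legacy (v1) memory-cgroup mount point -- verify on your Node.
MEMCG_ROOT = "/sys/fs/cgroup/memory"

def cleancache_knob(ct_cgroup):
    """Build the path to a Container's memory.disable_cleancache file."""
    return os.path.join(MEMCG_ROOT, ct_cgroup, "memory.disable_cleancache")

def disable_tcache(ct_cgroup):
    """Write '1' to the knob, turning tcache off for this Container."""
    with open(cleancache_knob(ct_cgroup), "w") as f:
        f.write("1")

if __name__ == "__main__":
    # "machine.slice/CT-1001" is a hypothetical cgroup path for a Container.
    print(cleancache_knob("machine.slice/CT-1001"))
```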


>> c) you can limit the max amount of memory which can be used for
>> pagecache for this Container
>>       memcg::memory.cache.limit_in_bytes
>
> This seems viable to test as well. Currently it seems to be utilizing a
> high number 'unlimited' default. I assume the only way to set this is to
> directly interact with the memory cgroup and not via a standard ve
> config value?

Yes, you are right.
We use this setting for some internal system cgroups running processes
which are known to generate a lot of pagecache that will never be used again.
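As an illustration, capping such an internal cgroup could look roughly like this. The knob name comes from the quoted text above; the cgroup path and the 512 MiB figure are made up for the example:

```python
import os

MEMCG_ROOT = "/sys/fs/cgroup/memory"   # assumed v1 memory-cgroup mount point

def mib_to_bytes(mib):
    """Convert mebibytes to bytes, as the knob expects a byte count."""
    return mib * 1024 ** 2

def set_cache_limit(cgroup, mib):
    """Cap a cgroup's pagecache via memory.cache.limit_in_bytes."""
    path = os.path.join(MEMCG_ROOT, cgroup, "memory.cache.limit_in_bytes")
    with open(path, "w") as f:
        f.write(str(mib_to_bytes(mib)))

# e.g. set_cache_limit("system.slice/backup.service", 512)
# "system.slice/backup.service" is a hypothetical internal system cgroup.
```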

 From my perspective it's not fair to apply such a setting to a Container
globally: the CT owner pays for an amount of RAM and should be able to use
this RAM for whatever he wants - even for pagecache.
So limiting the pagecache for a Container is not a tweak we advise to be used
against a Container => no standard config parameter.

Note: disabling tcache for a Container is completely fair:
you disable just an optimization for the overall Hardware Node performance,
while all the RAM configured for the Container is still available to the Container.
(But there is no official config value for that either - most often tcache helps, not hurts.)


> I assume regardless if we utilized vSwap or not, we would likely still
> experience these additional swapping issues, presumably from pagecache
> applications, or would the usage of vSwap intercept some of these items
> thus preventing them from being swapped to disk?

vSwap is an optimization of the swapping process _local to a Container_:
it can prevent some Container anonymous pages from being written to the physical swap
if the _local_ Container reclaim decides to swap something out.

At the moment you experience swapping at the Node level.
Even if some Container's pages are put into the physical swap,
it's a decision of the global reclaim mechanism,
so it's completely unrelated to vSwap =>
even if you assign some swappages to Containers and thus enable vSwap for those Containers,
it should not influence the global Node-level memory pressure in any way and
will not result in any difference in the rate of swapping into the physical swap.

Hope that helps.

--
Best regards,

Konstantin Khorenko,
Virtuozzo Linux Kernel Team

