[Users] Some details of vSwap implementation in Virtuozzo 7: "tcache" in details

Konstantin Khorenko khorenko at virtuozzo.com
Thu Jul 23 20:37:34 MSK 2020


On 07/23/2020 06:34 PM, CoolCold wrote:
> Hello!
>
> 1st - great work guys! Dealing with LXC and even LXD makes me miss my good old OpenVZ box because of its tech excellence! Keep going!
> 2nd - my 2 cents for content - I'm not a native speaker, but still suggest some small fixes.

1. Thank you very much for the feedback!
And you are very welcome to come back and use OpenVZ instead of LXC again. :)

2. And many thanks for content corrections!
I've just created a wiki page for tcache - I decided this info should be saved somewhere publicly available.
I've also added a section on how to enable/disable tcache for Containers.

And you are very welcome to edit the wiki page as well. :)

https://wiki.openvz.org/Tcache
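
Just in case, here is a minimal sketch of how the knob discussed below
(memcg::memory.disable_cleancache) can be toggled from the Host. The cgroup
path is an assumption - the real one depends on your Node's cgroup layout -
and I assume writing "1" disables the cache (the knob name suggests so):

    # toggle_tcache.py - a minimal sketch; the per-Container memcg path
    # is hypothetical, adjust it to your Node's actual cgroup layout.
    import sys
    from pathlib import Path

    MEMCG_ROOT = Path("/sys/fs/cgroup/memory")  # assumes cgroup-v1 memory controller

    def set_tcache(ct_cgroup: str, enabled: bool) -> None:
        """Enable or disable tcache for one Container's memory cgroup."""
        knob = MEMCG_ROOT / ct_cgroup / "memory.disable_cleancache"
        # Writing "1" disables cleancache (and thus tcache) for this memcg,
        # "0" re-enables it.
        knob.write_text("0" if enabled else "1")

    if __name__ == "__main__":
        # e.g.: python3 toggle_tcache.py machine.slice/ct-1001 off
        set_tcache(sys.argv[1], sys.argv[2] == "on")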

--
Best regards,

Konstantin Khorenko,
Virtuozzo Linux Kernel Team


> On Thu, Jul 23, 2020 at 9:52 PM Konstantin Khorenko <khorenko at virtuozzo.com> wrote:
>
>     On 07/22/2020 03:04 PM, Daniel Pearson wrote:
>
>     >> b) you can disable tcache for this Container
>     >> memcg::memory.disable_cleancache
>     >>       (raise your hand if you wish me to explain what tcache is)
>     >
>     > I'm all for additional information as it can help to form proper
>     > opinions if you don't mind providing it.
>
>     Hope that after reading it you'll catch yourself thinking that you are now aware of one more
>     small feature which makes VZ really cool, and that there are a lot of things which
>     just work somewhere in the background, simply (and silently) making it possible for you
>     to utilize the hardware to the maximum. :)
>
>     Tcache
>     ======
>
>     Brief tech explanation:
>     =======================
>     Transcendent file cache (tcache) is a driver for cleancache
>     https://www.kernel.org/doc/html/v4.18/vm/cleancache.html ,
>     which stores reclaimed pages in memory unmodified. Its purpose is to
>     adopt pages evicted from a memory cgroup on _local_ pressure (inside a Container),
>     so that they can be fetched back later without costly disk accesses.
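>
>     The semantics can be illustrated with a toy userspace model (plain Python,
>     not the kernel code; all names here are made up for illustration):
>
>         # Toy model of tcache: clean (unmodified) pagecache pages evicted from
>         # a Container are parked in Host RAM, keyed by file and page offset.
>         class ToyTcache:
>             def __init__(self, capacity_pages):
>                 self.capacity = capacity_pages
>                 self.store = {}  # (fs_id, inode, page_index) -> page bytes
>
>             def put_page(self, key, data):
>                 """Local reclaim in a Container evicts a clean page."""
>                 if len(self.store) < self.capacity:  # only while Host RAM is free
>                     self.store[key] = data
>
>             def get_page(self, key):
>                 """Pagecache miss: fetch the page back without a disk read."""
>                 return self.store.pop(key, None)  # moves back into the Container
>
>             def invalidate_page(self, key):
>                 """The on-disk page changed, the cached copy is stale."""
>                 self.store.pop(key, None)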
>
>     Detailed explanation:
>     =====================
>     Tcache is intended to increase the overall Hardware Node performance only
>
> Intended "to"
>
>     on undercommitted Nodes, i.e. where the total sum of all Containers' memory limits on the Node
>
> i.e. "where total sum of all Containers memory limit values placed on the Node"
>
>     is less than the Hardware Node RAM size.
>
>     Imagine a situation: you have a Node with 1TB of RAM,
>     and you run 500 Containers on it, each limited to 1GB of memory (no swap for simplicity).
>     Let's consider the Containers to be more or less identical: similar load, similar activity inside.
>     => normally those Containers will use 500GB of physical RAM at max, right?
>     And the other 500GB will be just free on the Node.
>
>     You may think it's a simple situation - OK, the Node is underloaded, let's put more Containers there -
>     but that's not always true: it depends on what the bottleneck on the Node is,
>     which depends on the real workload of the Containers running on the Node.
>     And most often in real life the disk becomes the bottleneck first - not the RAM, not the CPU.
>
>     Example: let's assume all those Containers run, say, cPanel, which by default collects some stats
>     every, say, 15 minutes - the stat collection process is run via crontab.
>
>     (Side note: randomizing the times of crontab jobs is a good idea - see the sketch below - but who usually
>     does this for Containers? We did it for the application templates we shipped in Virtuozzo, but a lot of
>     software is just installed and configured inside Containers, so we cannot do this. And often
>     Hosting Providers are not allowed to touch data in Containers - so most often cron jobs are
>     not randomized.)
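>
>     (Just to illustrate the kind of randomization meant above - a made-up
>     helper, not a Virtuozzo tool: give each Container's stats job a stable
>     random minute offset within the 15-minute window.)
>
>         # random_cron.py - print a crontab schedule with a per-CT stable offset
>         import zlib
>
>         def stats_cron_line(ct_id, period_min=15):
>             offset = zlib.crc32(ct_id.encode()) % period_min  # stable per CT
>             minutes = ",".join(str(m) for m in range(offset, 60, period_min))
>             return minutes + " * * * * /usr/local/bin/collect_stats"  # hypothetical job
>
>         print(stats_cron_line("ct-1001"))  # e.g. "7,22,37,52 * * * * ..."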
>
>     OK, it does not matter how, but let's assume we get such a workload: every, say, 15 minutes
>     (it's important that data access is quite rare), each Container accesses many small files -
>     let it be just 100 small files - to gather stats and save them somewhere.
>     In 500 Containers. Simultaneously.
>     In parallel with the other regular i/o workload.
>     On HDDs.
>
>     It's a nightmare for the disk subsystem, you know: that's 500 * 100 = 50000 reads,
>     and if an HDD provides 100 IOPS, it will take 50000/100/60 = 8.(3) minutes(!) to handle them.
>     OK, there could be a RAID; let it be able to handle 300 IOPS - that results in
>     2.(7) minutes, and we forgot about the other regular i/o.
>     So every 15 minutes the Node becomes almost unresponsive for several minutes
>     until it handles all that random i/o generated by the stats collection.
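>
>     (The same arithmetic, spelled out:)
>
>         # 500 Containers x 100 small files, each one a random read
>         reads = 500 * 100                    # 50000 IOs
>         for iops in (100, 300):              # a single HDD vs. a small RAID
>             print(iops, "IOPS ->", round(reads / iops / 60, 1), "min")
>         # prints: 100 IOPS -> 8.3 min
>         #         300 IOPS -> 2.8 min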
>
>     You may ask - but why _every_ 15 minutes? You've read a file once and it resides in the
>     Container pagecache!
>     That's true, but here the _15 minutes_ period comes into play. The larger the period - the worse.
>     If a Container is active enough, it just reads more and more files: website data,
>     pictures, video clips, files on a fileserver, whatever.
>     The thing is, in 15 minutes it's quite possible that a Container reads more than its RAM limit
>     (remember - only 1GB in our case!), and thus all the old pagecache is dropped, substituted
>     with the fresh one.
>     And thus in 15 minutes it's quite possible you'll have to read all those 100 files in each
>     Container from disk again.
>
>     And here tcache comes to save us: let's not completely drop pagecache which is
>     reclaimed from a Container (on local(!) reclaim), but save this pagecache in
>     a special cache (tcache) on the Host, in case there is free RAM on the Host.
>
>     And in 15 minutes, when all the Containers start to access lots of small files again,
>     those files' data will be brought back into the Container pagecache without reading from the
>     physical disk - voila, we save IOPS, and the Node does not get stuck anymore.
>
>     Q: can a Container be so active (i.e. read so much from disk) that this "useful"
>     pagecache is dropped even from tcache?
>
> missing question mark - ?
>
>     A: Yes. But tcache extends the "safe" period.
>
>     Q: mainstream? LXC/Proxmox?
>     A: No, it's Virtuozzo/OpenVZ specific.
>         "cleancache" - the base for tcache - is in the mainstream kernel, where it's used for Xen.
>         But we (VZ) wrote a driver for it and use it for Containers as well.
>
>     Q: I use an SSD, not an HDD - does tcache help me?
>     A: An SSD can provide many more IOPS, thus the Node's performance increase caused by tcache
>         is less significant, but still, reading from RAM (tcache is in RAM) is faster than reading from an SSD.
>
> is less "significant"
>
>
>
>     >> c) you can limit the max amount of memory which can be used for
>     >> pagecache for this Container
>     >>       memcg::memory.cache.limit_in_bytes
>     >
>     > This seems viable to test as well. Currently it seems to default to a
>     > high, effectively 'unlimited', value. I assume the only way to set this is to
>     > interact directly with the memory cgroup and not via a standard ve
>     > config value?
>
>     Yes, you are right.
>     We use this setting for some internal system cgroups running processes
>     which are known to generate a lot of pagecache that won't be used later for sure.
>
>     From my perspective it's not fair to apply such a setting to a Container
>     globally: the CT owner pays for an amount of RAM, so he should be able to use
>     this RAM for whatever he wants - even for pagecache.
>     So limiting the pagecache for a Container is not a tweak we advise using
>     against a Container => no standard config parameter.
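>
>     (If you still want to experiment with it for such a system cgroup, here is a
>     minimal sketch - the cgroup path is hypothetical, only the knob name
>     memory.cache.limit_in_bytes comes from the discussion above:)
>
>         from pathlib import Path
>
>         # Cap the pagecache of one memcg at 256MB; the path is an assumption,
>         # adjust it to your actual cgroup layout.
>         cg = Path("/sys/fs/cgroup/memory/system.slice/backup.service")
>         (cg / "memory.cache.limit_in_bytes").write_text(str(256 * 1024 * 1024))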
>
>     Note: disabling tcache for a Container is completely fair:
>     you just disable an optimization for the overall Hardware Node performance,
>     but all the RAM configured for the Container is still available to the Container.
>     (But there is no official config value for that either - most often tcache helps, not hurts.)
>
>
>     > I assume that, regardless of whether we utilize vSwap or not, we would likely still
>     > experience these additional swapping issues, presumably from pagecache
>     > applications - or would the usage of vSwap intercept some of these items,
>     > thus preventing them from being swapped to disk?
>
>     vSwap is an optimization of the swapping process _local to a Container_:
>     it can prevent some of a Container's anonymous pages from being written to the physical swap
>     if the _local_ Container reclaim decides to swap something out.
>
>     At the moment you experience swapping at the Node level.
>     Even if some of a Container's processes are put into the physical swap,
>     it's a decision of the global reclaim mechanism,
>     so it's completely unrelated to vSwap =>
>     even if you assign some swap pages to Containers and thus enable vSwap for those Containers,
>     it should not influence the global Node-level memory pressure in any way and
>     will not result in any difference in the rate of swapping into the physical swap.
>
>     Hope that helps.
>
>     --
>     Best regards,
>
>     Konstantin Khorenko,
>     Virtuozzo Linux Kernel Team
>
>
>
> -- 
> Best regards,
> [COOLCOLD-RIPN]
>
>
> _______________________________________________
> Users mailing list
> Users at openvz.org
> https://lists.openvz.org/mailman/listinfo/users
