[Users] Some details of vSwap implementation in Virtuozzo 7: "tcache" in details

CoolCold coolthecold at gmail.com
Thu Jul 23 18:34:27 MSK 2020


Hello!

1st - great work, guys! Dealing with LXC and even LXD makes me miss my good
old OpenVZ box because of its technical excellence! Keep going!
2nd - my 2 cents on the content - I'm not a native speaker, but I still
suggest some small fixes.


On Thu, Jul 23, 2020 at 9:52 PM Konstantin Khorenko <khorenko at virtuozzo.com>
wrote:

> On 07/22/2020 03:04 PM, Daniel Pearson wrote:
>
> >> b) you can disable tcache for this Container
> >> memcg::memory.disable_cleancache
> >>       (raise your hand if you wish me to explain what tcache is)
> >
> > I'm all for additional information as it can help to form proper
> > opinions if you don't mind providing it.
>
> Hope that after reading it you'll catch yourself thinking that you are now
> aware of one more
> small feature which makes VZ really cool, and that there are a lot of
> things which
> just work somewhere in the background, simply (and silently) making it
> possible for you
> to utilize the hardware to the maximum. :)
>
> Tcache
> ======
>
> Brief tech explanation:
> =======================
> Transcendent file cache (tcache) is a driver for cleancache
> https://www.kernel.org/doc/html/v4.18/vm/cleancache.html ,
> which stores reclaimed pages in memory unmodified. Its purpose is to
> adopt pages evicted from a memory cgroup on _local_ pressure (inside a
> Container),
> so that they can be fetched back later without costly disk accesses.
>
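(A quick illustration for option "b" mentioned above - this is just my own
sketch, not official tooling: the knob is the memory.disable_cleancache file
inside the Container's memory cgroup; the cgroup path below is a guess, check
where your node actually mounts it.)

    # Hedged sketch: disable tcache for one Container via its memcg knob.
    # The path under /sys/fs/cgroup/memory is an assumption for illustration.
    from pathlib import Path

    def disable_tcache(memcg_path: str) -> None:
        """Write 1 to memory.disable_cleancache inside the Container's memcg."""
        (Path(memcg_path) / "memory.disable_cleancache").write_text("1\n")

    # hypothetical path for a Container's memory cgroup:
    # disable_tcache("/sys/fs/cgroup/memory/machine.slice/<CT-UUID>")
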
> Detailed explanation:
> =====================
> Tcache is intended increase the overall Hardware Node performance only
>
Missing "to" - "intended *to* increase"

> on undercommitted Nodes, i.e. sum of all Containers memory limits on the
> Node
>
i.e. "where total sum of all Containers memory limit values placed on the
Node"

> is less than Hardware Node RAM size.
>
> Imagine a situation: you have a Node with 1Tb of RAM,
> you run 500 Containers on it, each limited to 1Gb of memory (no swap for
> simplicity).
> Let's consider the Containers to be more or less identical: similar load,
> similar activity inside.
> => normally those Containers must use 500Gb of physical RAM at max, right,
> and 500Gb will just be free on the Node.
>
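(A side sketch of the "undercommitted" check described above, under the
assumption that Container limits can be read straight from the memory
cgroups; vzlist/prlctl should give the same numbers.)

    # Hedged sketch: is the sum of all Container memory limits below host RAM?
    from pathlib import Path

    def host_ram_bytes() -> int:
        with open("/proc/meminfo") as f:
            for line in f:
                if line.startswith("MemTotal:"):
                    return int(line.split()[1]) * 1024  # kB -> bytes
        raise RuntimeError("MemTotal not found")

    def sum_container_limits(memcg_root: str) -> int:
        # memcg_root is an assumed location of per-Container memory cgroups
        return sum(int(p.read_text())
                   for p in Path(memcg_root).glob("*/memory.limit_in_bytes"))

    # undercommitted (tcache has spare RAM to work with) if this is True:
    # sum_container_limits("/sys/fs/cgroup/memory/machine.slice") < host_ram_bytes()
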
> You'd think it's a simple situation - ok, the Node is underloaded, let's put
> more Containers there,
> but that's not always true - it depends on what the bottleneck on the Node
> is,
> which depends on the real workload of the Containers running on the Node.
> But most often in real life the disk becomes the bottleneck first, not
> the RAM, not the CPU.
>
> Example: let's assume all those Containers run, say, cPanel, which by
> default collects some stats
> every, say, 15 minutes - the stats collection process is run via crontab.
>
> (Side note: randomizing the times of crontab jobs is a good idea, but who
> usually does this
> for Containers? We did it for the application templates we shipped in
> Virtuozzo, but a lot of
> software is just installed and configured inside Containers, so we cannot do
> this there. And often
> Hosting Providers are not allowed to touch data in Containers - so most
> often cron jobs are
> not randomized.)
>
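(On that side note about randomizing cron times - a minimal sketch of the
idea, with a hypothetical stats command; a plain
"sleep $((RANDOM % 900)) && ..." wrapper in the crontab line achieves the
same thing.)

    # Hedged sketch: spread the start of a periodic job over the 15-minute
    # window so hundreds of Containers don't hit the disk in the same second.
    import random
    import subprocess
    import time

    def run_with_jitter(cmd, window_seconds=900):
        time.sleep(random.uniform(0, window_seconds))  # random delay first
        return subprocess.call(cmd)

    # e.g. called from cron (path is hypothetical):
    # run_with_jitter(["/usr/local/bin/collect-stats"])
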
> Ok, it does not matter how, but let's assume we get such a workload -
> every, say, 15 minutes
> (it's important that data access is quite rare), each Container accesses
> many small files,
> let it be just 100 small files to gather stats and save it somewhere.
> In 500 Containers. Simultaneously.
> In parallel with other regular i/o workload.
> On HDDs.
>
> It's a nightmare for the disk subsystem, you know: if an HDD provides 100
> IOPS,
> it will take 50000/100/60 = 8.(3) minutes(!) to handle.
> OK, there could be a RAID; let's say it is able to handle 300 IOPS, that
> results in
> 2.(7) minutes, and we forgot about the other regular i/o,
> so it means that every 15 minutes the Node becomes almost unresponsive for
> several minutes
> until it handles all that random i/o generated by the stats collection.
>
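(The back-of-envelope numbers above, spelled out - same arithmetic, nothing
new:)

    # 500 Containers x 100 small files = 50 000 random reads per period
    containers, files_each = 500, 100
    for iops in (100, 300):  # single HDD vs. a small RAID
        minutes = containers * files_each / iops / 60
        print(f"{iops} IOPS -> {minutes:.1f} minutes of pure random reads")
    # 100 IOPS -> 8.3 minutes, 300 IOPS -> 2.8 minutes, other i/o not counted
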
> You can ask - but why _every_ 15 minutes? You've read a file once and it
> resides in the
> Container pagecache!
> That's true, but here comes the _15 minutes_ period. The larger the period -
> the worse.
> If a Container is active enough, it just reads more and more files - website
> data,
> pictures, video clips, files of a fileserver, whatever.
> The thing is, in 15 minutes it's quite possible a Container reads more than
> its RAM limit
> (remember - only 1Gb in our case!), and thus all the old pagecache is
> dropped, substituted
> with the fresh one.
> And thus in 15 minutes it's quite possible you'll have to read all those
> 100 files in each
> Container from disk.
>
> And here comes tcache to save us: let's not completely drop pagecache
> which is
> reclaimed from a Container (on local(!) reclaim), but save this pagecache
> in
> a special cache (tcache) on the Host in case there is free RAM on the Host.
>
> And in 15 minutes, when all Containers start to access a lot of small files
> again,
> those files' data will get back into the Container pagecache without reading
> from
> the physical disk - voila, we save IOPS, the Node doesn't get stuck anymore.
>
> Q: can a Container be so active (i.e. read so much from disk) that this
> "useful"
> pagecache is dropped even from tcache.
>
missing question mark - ?

> A: Yes. But tcache extends the "safe" period.
>
> Q: mainstream? LXC/Proxmox?
> A: No, it's Virtuozzo/OpenVZ specific.
>     "cleancache" - the base for tcache it in mainstream, it's used for Xen.
>     But we (VZ) wrote a driver for it and use it for Containers as well.
>
> Q: I use SSD, not HDD, does tcache help me?
> A: SSD can provide much more IOPS, thus the Node's performance increase
> caused by tcache
>     is less, but still reading from RAM (tcache is in RAM) is faster than
> reading from SSD.
>
is less "significant"

>
>
> >> c) you can limit the max amount of memory which can be used for
> >> pagecache for this Container
> >>       memcg::memory.cache.limit_in_bytes
> >
> > This seems viable to test as well. Currently it seems to be utilizing a
> > high number 'unlimited' default. I assume the only way to set this is to
> > directly interact with the memory cgroup and not via a standard ve
> > config value?
>
> Yes, you are right.
> We use this setting for some internal system cgroups running processes
> which are known to generate a lot of pagecache which won't be used later
> for sure.
>
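(And for option "c" - since, as explained below, there is no standard ve
config parameter for it, one would poke the memcg directly; again just my
hedged sketch, the cgroup path is a guess:)

    # Hedged sketch: cap the pagecache of one memcg via memory.cache.limit_in_bytes
    from pathlib import Path

    def limit_pagecache(memcg_path: str, limit_bytes: int) -> None:
        knob = Path(memcg_path) / "memory.cache.limit_in_bytes"
        knob.write_text(f"{limit_bytes}\n")

    # e.g. cap a hypothetical log-churning system cgroup at 256 MiB of pagecache:
    # limit_pagecache("/sys/fs/cgroup/memory/system.slice/some.service", 256 * 1024**2)
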
>  From my perspective it's not fair to apply such a setting to a Container
> globally - well, the CT owner pays for an amount of RAM and should be able
> to use
> this RAM for whatever he wants - even for pagecache,
> so limiting the pagecache for a Container is not a tweak that is advised to
> be used
> against a Container => no standard config parameter.
>
> Note: disabling tcache for a Container is completely fair,
> you just disable an optimization for the whole Hardware Node's performance,
> but all the RAM configured for a Container is still available to the
> Container.
> (but there is also no official config value for that - most often it helps,
> not hurts)
>
>
> > I assume that regardless of whether we utilize vSwap or not, we would
> > likely still
> > experience these additional swapping issues, presumably from pagecache
> > applications, or would the usage of vSwap intercept some of these items,
> > thus preventing them from being swapped to disk?
>
> vSwap is an optimization for the swapping process _local to a Container_:
> it can prevent some Container anonymous pages from being written to the
> physical swap,
> if the _local_ Container reclaim decides to swap out something.
>
> At the moment you experience swapping at the Node level.
> Even if some Container's processes are put into the physical swap,
> it's a decision of the global reclaim mechanism,
> so it's completely unrelated to vSwap =>
> even if you assign some swappages to Containers and thus enable vSwap for
> those Containers,
> it should not influence the global Node-level memory pressure in any way and
> will not result in any difference in the swapping rate into the physical swap.
>
> Hope that helps.
>
> --
> Best regards,
>
> Konstantin Khorenko,
> Virtuozzo Linux Kernel Team
>


-- 
Best regards,
[COOLCOLD-RIPN]