[Users] ZFS vs ploop

Kir Kolyshkin kir at openvz.org
Thu Jul 23 00:39:02 PDT 2015


On 07/22/2015 11:59 PM, Сергей Мамонов wrote:
> >1. creating then removing data (vzctl compact takes care of that)
> >So, #1 is solved
>
> Only partially, in fact.
> 1. Compact eats a lot of resources, because of the heavy use of the disk.
> 2. You need to compact your ploop very, very regularly.
>
> On our nodes, where we run compact every day on 3-5 TB /vz/ partitions,
> the daily delta is about 4-20% of the space!
> Every day it has to clean 300-500+ GB.
>
> And it does not clean everything; for example:
>
> [root@evo12 ~]# vzctl compact 75685
> Trying to find free extents bigger than 0 bytes
> Waiting
> Call FITRIM, for minlen=33554432
> Call FITRIM, for minlen=16777216
> Call FITRIM, for minlen=8388608
> Call FITRIM, for minlen=4194304
> Call FITRIM, for minlen=2097152
> Call FITRIM, for minlen=1048576
> 0 clusters have been relocated
> [root@evo12 ~]# ls -lhat /vz/private/75685/root.hdd/root.hdd
> -rw------- 1 root root 43G Jul 20 20:45 /vz/private/75685/root.hdd/root.hdd
> [root@evo12 ~]# vzctl exec 75685 df -h /
> Filesystem         Size  Used Avail Use% Mounted on
> /dev/ploop32178p1   50G   26G   21G  56% /
> [root@evo12 ~]# vzctl --version
> vzctl version 4.9.2

This is either #2 or #3 from my list, or both.
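
One rough way to see whether free-space fragmentation is the culprit here is
to look at the free-extent histogram of the filesystem inside the image, e.g.
with e2freefrag (from e2fsprogs) run on the mounted ploop device. If most free
extents are smaller than the 1 MB minimum that compact can discard (see the
FITRIM minlen values above), there is simply nothing left for it to reclaim:

  e2freefrag /dev/ploop32178p1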

>
> >My point was, the feature works fine for many people despite this bug.
>
> Not fine, but we need it very much, for migration and not only that. So
> we use it anyway; in fact we have no alternative.
> And this is just one of the bugs. Live migration regularly fails, because
> vzctl cannot restore the container correctly after suspend.

You really need to file bugs if you want fixes.

> CPT is a pain, in fact. But I want to believe that CRIU will fix everything =)
>
> And ext4-only with ploop is not a good setup, and not a modern one either.
> For example, on some big nodes we have several /vz/ partitions, because
> the RAID controller cannot put all the disks into one RAID10 logical
> device. Several /vz/ partitions are not convenient, and this is less
> flexible than a single zpool, for example.
>
>
>
> 2015-07-23 5:44 GMT+03:00 Kir Kolyshkin <kir at openvz.org>:
>
>
>
>     On 07/22/2015 10:08 AM, Gena Makhomed wrote:
>
>         On 22.07.2015 8:39, Kir Kolyshkin wrote:
>
>                 1) currently even suspend/resume does not work reliably:
>                 https://bugzilla.openvz.org/show_bug.cgi?id=2470
>                 - I can't suspend and resume containers without bugs,
>                 and as a result I also can't use it for live migration.
>
>
>             Valid point, we need to figure it out. What I don't understand
>             is how lots of users are enjoying live migration despite
>             this bug.
>             Me, personally, I never came across this.
>
>
>         Nevertheless, steps to reproduce the bug 100% of the time are
>         provided in the bug report.
>
>
>     I was not saying anything about the bug report being bad/incomplete.
>     My point was, the feature works fine for many people despite this bug.
>
>
>                 2) I see many bug reports on Google about this feature:
>                 "openvz live migration kernel panic" - so I prefer a
>                 planned downtime of containers at night instead of
>                 unexpected and very painful kernel panics and complete
>                 reboots in the middle of the working day
>                 (with data loss, data corruption and other "amenities")
>
>
>             Unlike the previous item, which is valid, this is pure FUD.
>
>
>         Compare two situations:
>
>         1) Live migration is not used at all
>
>         2) Live migration is used and containers are migrated between HNs
>
>         In which situation is the probability of getting a kernel panic
>         higher?
>
>         If you say "the probabilities are equal", this means that the
>         OpenVZ live migration code has no errors at all.
>
>         Is that plausible? Especially considering the volume and
>         complexity of the OpenVZ live migration code, and the grandiosity
>         of this task.
>
>         If you say "for (1) the probability is lower and for (2) it is
>         higher" - that is exactly what I think.
>
>         I don't use live migration because I don't want kernel panics.
>
>
>     Following your logic, if you don't want kernel panics, you might want
>     to not use advanced filesystems such as ZFS, not use containers,
>     cgroups, namespaces, etc. The ultimate solution here, of course,
>     is to not use the kernel at all -- this will totally guarantee
>     no kernel panics at all, ever.
>
>     On a serious note, I find your logic flawed.
>
>
>         And you say that "this is pure FUD"? Why?
>
>
>     Because it is not based on your experience or correct statistics,
>     but rather on something you saw on Google followed by some
>     flawed logic.
>
>
>
>
>                 4) from a technical point of view, it is possible
>                 to do live migration using ZFS, so "live migration"
>                 is currently the only advantage of ploop over ZFS
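>
>                 A rough sketch of what that could look like, using
>                 checkpoint/restore plus incremental zfs send -- not a
>                 true live migration, since the container is frozen for
>                 the final sync. The dataset name vz/ct155 and the dump
>                 path are just examples, and the CT config is assumed
>                 to be present on the destination already:
>
>                 zfs snapshot vz/ct155@pre
>                 zfs send vz/ct155@pre | ssh dst zfs recv -F vz/ct155
>                 vzctl chkpnt 155 --dumpfile /vz/dump/155.dmp
>                 zfs snapshot vz/ct155@fin
>                 zfs send -i @pre vz/ct155@fin | ssh dst zfs recv -F vz/ct155
>                 scp /vz/dump/155.dmp dst:/vz/dump/
>                 ssh dst vzctl restore 155 --dumpfile /vz/dump/155.dmp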
>
>
>             I wouldn't say so. If you have some real-world comparison
>             of ZFS vs ploop, feel free to share -- like density or
>             performance measurements done in a controlled environment.
>
>
>         Ok.
>
>         My experience with ploop:
>
>         DISKSPACE was limited to 256 GiB, and the real data used inside
>         the container was near 40-50% of the 256 GiB limit, but the
>         ploop image was a lot bigger: it used near 256 GiB of space on
>         the hardware node. Overhead ~ 50-60%
>
>         I found a workaround for this: run "/usr/sbin/vzctl compact $CT"
>         via cron every night, and now the ploop image has less overhead.
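>
>         A minimal sketch of such a cron job (the script path and the
>         03:30 schedule are just my choice; it loops over the running
>         containers and compacts them one at a time, so the disk load is
>         spread out rather than hitting all containers in parallel):
>
>         # /etc/cron.d/ploop-compact
>         # 30 3 * * * root /usr/local/sbin/compact-all.sh
>
>         #!/bin/sh
>         # compact-all.sh - compact every running container, sequentially
>         for CT in $(vzlist -H -o ctid); do
>             /usr/sbin/vzctl compact "$CT"
>         done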
>
>         current state:
>
>         on the hardware node:
>
>         # du -b /vz/private/155/root.hdd
>         205963399961    /vz/private/155/root.hdd
>
>         inside the container:
>
>         # df -B1
>         Filesystem           1B-blocks         Used   Available Use% Mounted on
>         /dev/ploop38149p1 270426705920 163129053184 94928560128  64% /
>
>         ====================================
>
>         used space, bytes: 163129053184
>
>         image size, bytes: 205963399961
>
>         The disk space overhead of the "ext4 over ploop over ext4"
>         solution is near 26%, or near 40 GiB if you look at this
>         overhead in absolute numbers.
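>
>         The same arithmetic as a shell one-liner, using the du and df
>         numbers above (overhead = (image size - used) / used):
>
>         # echo $(( (205963399961 - 163129053184) * 100 / 163129053184 ))%
>         26%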
>
>         This is the main disadvantage of ploop.
>
>         And this disadvantage can't be avoided - it is "by design".
>
>
>     To anyone reading this, there are a few things here worth noting.
>
>     a. Such overhead is caused by three things:
>     1. creating then removing data (vzctl compact takes care of that)
>     2. filesystem fragmentation (we have some experimental patches to ext4
>        plus an ext4 defragmenter to solve it, but currently this is
>        still in the research stage)
>     3. initial filesystem layout (which depends on the initial ext4 fs
>        size, including the inode requirement)
>
>     So, #1 is solved, #2 is solvable, and #3 is a limitation of the
>     filesystem used; it can be mitigated by properly choosing the
>     initial size of a newly created ploop.
>
>     An example of the #3 effect is this: if you create a very large
>     filesystem initially (say, 16TB) and then downsize it (say, to 1TB),
>     the filesystem metadata overhead will be quite big. The same thing
>     happens if you ask for lots of inodes (here "lots" means more than
>     the default value, which is 1 inode per 16K of disk space). This
>     happens because the ext4 filesystem is not designed to shrink.
>     Therefore, to have the lowest possible overhead you have to choose
>     the initial filesystem size carefully. Yes, this is not a solution
>     but a workaround.
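>
>     A rough illustration of that workaround (the sizes are arbitrary;
>     recent vzctl 4.x lets you set the disk size at create time,
>     otherwise put DISKSPACE into the config used for creation). The
>     idea is to create the ploop at roughly its intended size instead
>     of creating it big and shrinking it later, and not to ask for more
>     inodes than you actually need:
>
>     CT=155
>     # create the container with its intended disk size right away
>     vzctl create $CT --ostemplate centos-7-x86_64 --layout ploop \
>         --diskspace 50G
>     # growing later is cheap, so start small and grow on demand:
>     # vzctl set $CT --diskspace 100G --save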
>
>     Also note that ploop was not designed with any specific filesystem
>     in mind -- it is universal, so #3 can be solved by moving to a
>     different fs in the future.
>
>     Next, you can actually use shared base deltas for containers;
>     although this is not enabled by default, it is quite possible and
>     works in practice. The key is to create a base delta and use it
>     for multiple containers (via hardlinks).
>
>     Here is a quick and dirty example:
>
>     SRCID=50                    # "Donor" container ID
>     vztmpl-dl centos-7-x86_64   # to make sure we use the latest template
>     vzctl create $SRCID --ostemplate centos-7-x86_64
>     vzctl snapshot $SRCID       # base delta becomes read-only, top delta added
>     for CT in $(seq 1001 2000); do
>         mkdir -p /vz/private/$CT/root.hdd /vz/root/$CT
>         # hardlink the shared base delta into the new CT ...
>         ln /vz/private/$SRCID/root.hdd/root.hdd \
>             /vz/private/$CT/root.hdd/root.hdd
>         # ... and copy the rest (DiskDescriptor.xml, top delta) per CT
>         cp -nr /vz/private/$SRCID/root.hdd /vz/private/$CT/
>         cp /etc/vz/conf/$SRCID.conf /etc/vz/conf/$CT.conf
>     done
>     vzctl set $SRCID --disabled yes --save   # make sure we don't use it
>
>     This will create 1000 containers (so make sure your host has enough
>     RAM), each holding about 650MB of files, so about 650GB in total.
>     Host disk space used will be about 650 + 1000*1 MB before start
>     (i.e. about 2GB), or about 650 + 1000*30 MB after start (i.e. about
>     32GB). So:
>
>     real data used inside the containers is near 650 GB
>     real space used on the hard disk is near 32 GB
>
>     So, 20x disk space savings, and this result is reproducible. Surely
>     it will get worse over time etc., and this way of using ploop is
>     neither official nor supported/recommended, but that's not the point
>     here. The points are:
>      - this is a demonstration of what you could do with ploop
>      - this shows why you shouldn't trust any numbers
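>
>     If you want to check such numbers yourself, one quick (and rough)
>     way, using the container range from the example above, is:
>
>     # space actually consumed on the host by the private areas
>     du -sh /vz/private
>     # total "used" space as reported from inside the containers
>     for CT in $(seq 1001 2000); do
>         vzctl exec $CT df -P -B1 / | awk 'NR==2 {print $3}'
>     done | awk '{s += $1} END {printf "%.0f GB\n", s / 1e9}'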
>
>         =======================================================================
>
>         My experience with ZFS:
>
>         real data used inside the container is near 62 GiB,
>         real space used on the hard disk is near 11 GiB.
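>
>         For a comparable picture on ZFS, the relevant per-dataset
>         properties can be read directly (the dataset name below is
>         just an example):
>
>         zfs get -H -o property,value \
>             used,logicalused,referenced,compressratio vz/private/155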
>
>
>     So, you are not even comparing apples to apples here. You just took
>     two different containers, certainly of different sizes, and probably
>     also with different data sets and usage history. I'm not saying it's
>     invalid, but if you want a meaningful (rather than anecdotal)
>     comparison, you need to use the same data sets, the same operations
>     on the data, etc., try to optimize each case, and then compare.