[Users] ZFS vs ploop

Сергей Мамонов mrqwer88 at gmail.com
Thu Jul 23 06:22:53 PDT 2015


And many have been added to Bugzilla. And many have already been fixed by you and other guys
from the OpenVZ team.
But unfortunately the overall picture has not changed dramatically yet.
Some people are still afraid to use it.

PS Suspending a container has failed when iptables-save is missing since 2007 )
https://bugzilla.openvz.org/show_bug.cgi?id=3154
With a missing ip6tables-save, however, it works correctly.
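
If I read that report correctly, a rough way to see the symptom is something like this
(the CT ID is arbitrary, paths assume a CentOS-like template, and the rename merely
simulates a container without the tool):

CT=101
vzctl exec $CT 'mv /sbin/iptables-save /sbin/iptables-save.off'   # simulate a CT lacking iptables-save
vzctl suspend $CT                                                 # fails while dumping the container state
vzctl exec $CT 'mv /sbin/iptables-save.off /sbin/iptables-save'
vzctl suspend $CT && vzctl resume $CT                             # works again once the tool is back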

2015-07-23 10:39 GMT+03:00 Kir Kolyshkin <kir at openvz.org>:

>  On 07/22/2015 11:59 PM, Сергей Мамонов wrote:
>
>      >1. creating then removing data (vzctl compact takes care of that)
> >So, #1 is solved
>
> Only partially, in fact.
>  1. Compact "eats" a lot of resources, because of the heavy disk I/O.
>  2. You need to compact your ploop images very regularly.
>
>  On our nodes, where we run compact every day, the daily delta on a 3-5T /vz/
> is about 4-20% of the space!
>  Every day it has to clean 300-500+ GB.
>
>  And it does not clean everything, for example -
>
>  [root at evo12 ~]# vzctl compact 75685
> Trying to find free extents bigger than 0 bytes
> Waiting
> Call FITRIM, for minlen=33554432
> Call FITRIM, for minlen=16777216
> Call FITRIM, for minlen=8388608
> Call FITRIM, for minlen=4194304
> Call FITRIM, for minlen=2097152
> Call FITRIM, for minlen=1048576
> 0 clusters have been relocated
> [root at evo12 ~]# ls -lhat /vz/private/75685/root.hdd/root.hdd
> -rw------- 1 root root 43G Jul 20 20:45 /vz/private/75685/root.hdd/root.hdd
> [root at evo12 ~]# vzctl exec 75685 df -h /
> Filesystem         Size  Used Avail Use% Mounted on
> /dev/ploop32178p1   50G   26G   21G  56% /
> [root at evo12 ~]# vzctl --version
> vzctl version 4.9.2
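>
> For scale, the leftover in the figures just above, computed with plain shell arithmetic (GiB, rounded):
>
> echo $(( (43 - 26) * 100 / 26 ))   # prints 65 -> roughly 65% overhead relative to the 26G of real data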
>
>
> This is either #2 or #3 from my list, or both.
>
>
> >My point was, the feature works fine for many people despite this bug.
>
>  Not fine, but we need it very much for migration and not only that. So we
> use it anyway; in fact we have no alternative.
>  And this is only one of the bugs. Live migration regularly fails, because vzctl cannot
> restore the container correctly after suspend.
>
>
> You really need to file bugs if you want fixes.
>
>
>  CPT is a pain, in fact. But I want to believe that CRIU will fix everything =)
>
>  And ext4 only with ploop is not a good case, and not a modern one either.
>  For example, on some big nodes we have several /vz/ partitions, because the raid
> controller cannot put all disks into one raid10 logical device. And several /vz/
> partitions are not comfortable to work with.
> And it is less flexible than one zpool, for example.
>
>
>
> 2015-07-23 5:44 GMT+03:00 Kir Kolyshkin <kir at openvz.org>:
>
>>
>>
>> On 07/22/2015 10:08 AM, Gena Makhomed wrote:
>>
>>> On 22.07.2015 8:39, Kir Kolyshkin wrote:
>>>
>>>>> 1) currently even suspend/resume does not work reliably:
>>>>> https://bugzilla.openvz.org/show_bug.cgi?id=2470
>>>>> - I can't suspend and resume containers without bugs,
>>>>> and as a result I also can't use it for live migration.
>>>>>
>>>>
>>>> Valid point, we need to figure it out. What I don't understand
>>>> is how lots of users are enjoying live migration despite this bug.
>>>> Me, personally, I never came across this.
>>>>
>>>
>>> Nevertheless, steps to reproduce the bug 100% of the time are provided in the bug report.
>>>
>>
>>  I was not saying anything about the bug report being bad/incomplete.
>> My point was, the feature works fine for many people despite this bug.
>>
>>
>>>>> 2) I see in Google many bug reports about this feature:
>>>>> "openvz live migration kernel panic" - so I prefer to take
>>>>> planned downtime of containers at night instead
>>>>> of unexpected and very painful kernel panics and
>>>>> complete reboots in the middle of the working day
>>>>> (with data loss, data corruption and other "amenities").
>>>>>
>>>>
>>>> Unlike the previous item, which is valid, this is pure FUD.
>>>>
>>>
>>> Compare two situations:
>>>
>>> 1) Live migration not used at all
>>>
>>> 2) Live migration used and containers migrated between HN
>>>
>>> In which situation is the possibility of getting a kernel panic higher?
>>>
>>> If you say "the possibilities are equal", this means
>>> that the OpenVZ live migration code has no errors at all.
>>>
>>> Is that plausible? Especially considering the OpenVZ live migration
>>> code volume, its complexity and the grandiosity of this task.
>>>
>>> If you say "for (1) the possibility is lower and for (2)
>>> the possibility is higher" - that is exactly what I think.
>>>
>>> I don't use live migration because I don't want kernel panics.
>>>
>>
>>  Following your logic, if you don't want kernel panics, you might want
>> to not use advanced filesystems such as ZFS, not use containers,
>> cgroups, namespaces, etc. The ultimate solution here, of course,
>> is to not use the kernel at all -- this will totally guarantee no kernel
>> panics at all, ever.
>>
>> On a serious note, I find your logic flawed.
>>
>>
>>> And you say that "this is pure FUD"? Why?
>>>
>>
>>  Because it is not based on your experience or correct statistics,
>> but rather on something you saw on Google followed by some
>> flawed logic.
>>
>>
>>
>>>
>>>>> 4) from a technical point of view it is possible
>>>>> to do live migration using ZFS, so "live migration"
>>>>> is currently the only advantage of ploop over ZFS
>>>>>
>>>>
>>>> I wouldn't say so. If you have some real world comparison
>>>> of zfs vs ploop, feel free to share. Like density or performance
>>>> measurements, done in a controlled environment.
>>>>
>>>
>>> Ok.
>>>
>>> My experience with ploop:
>>>
>>> DISKSPACE was limited to 256 GiB, real data used inside the container
>>> was near 40-50% of that 256 GiB limit, but the ploop image was a lot bigger:
>>> it used nearly 256 GiB of space on the hardware node. Overhead ~50-60%.
>>>
>>> I found a workaround for this: run "/usr/sbin/vzctl compact $CT"
>>> via cron every night, and now the ploop image has less overhead (see the sketch below).
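>>>
>>> A rough sketch of such a nightly job (the schedule, the script path and the vzlist
>>> call are just one way to set it up):
>>>
>>> # /etc/cron.d/ploop-compact (hypothetical):
>>> #   30 3 * * * root /usr/local/sbin/compact-all.sh
>>> #
>>> # /usr/local/sbin/compact-all.sh:
>>> #!/bin/sh
>>> # compact every running container's ploop image
>>> for CT in $(vzlist -H -o ctid); do
>>>     /usr/sbin/vzctl compact "$CT"
>>> done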
>>>
>>> current state:
>>>
>>> on hardware node:
>>>
>>> # du -b /vz/private/155/root.hdd
>>> 205963399961    /vz/private/155/root.hdd
>>>
>>> inside container:
>>>
>>> # df -B1
>>> Filesystem               1B-blocks          Used    Available Use%
>>> Mounted on
>>> /dev/ploop38149p1     270426705920  163129053184  94928560128  64% /
>>>
>>> ====================================
>>>
>>> used space, bytes: 163129053184
>>>
>>> image size, bytes: 205963399961
>>>
>>> "ext4 over ploop over ext4" solution disk space overhead is near 26%,
>>> or is near 40 GiB, if see this disk space overhead in absolute numbers.
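>>>
>>> For reference, a quick check of those figures from the du/df numbers above (plain shell arithmetic):
>>>
>>> echo $(( 205963399961 - 163129053184 ))                          # ~42.8e9 bytes, i.e. ~40 GiB of overhead
>>> echo $(( (205963399961 - 163129053184) * 100 / 163129053184 ))   # prints 26 -> ~26% overhead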
>>>
>>> This is the main disadvantage of ploop.
>>>
>>> And this disadvantage can't be avoided - it is "by design".
>>>
>>
>>  To anyone reading this, there are a few things here worth noting.
>>
>> a. Such overhead is caused by three things:
>> 1. creating then removing data (vzctl compact takes care of that)
>> 2. filesystem fragmentation (we have some experimental patches to ext4
>>     plus an ext4 defragmenter to solve it, but currently it's still in the
>> research stage)
>> 3. initial filesystem layout (which depends on the initial ext4 fs size,
>> including the inode requirement)
>>
>> So, #1 is solved, #2 is solvable, and #3 is a limitation of the file
>> system used and can be mitigated
>> by properly choosing the initial size of a newly created ploop.
>>
>> An example of the #3 effect is this: if you create a very large filesystem
>> initially (say, 16TB) and then
>> downsize it (say, to 1TB), the filesystem metadata overhead will be quite
>> big. The same thing happens
>> if you ask for lots of inodes (here "lots" means more than the default
>> value, which is 1 inode
>> per 16K of disk space). This happens because the ext4 filesystem is not
>> designed to shrink.
>> Therefore, to have the lowest possible overhead you have to choose the
>> initial filesystem size
>> carefully. Yes, this is not a solution but a workaround.
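>>
>> A minimal sketch of what "choosing the initial size carefully" means in practice
>> (the CT IDs, the template name and the sizes are purely illustrative):
>>
>> # avoid: create a huge filesystem and then shrink it -- ext4 keeps the big layout's metadata
>> vzctl create 201 --ostemplate centos-7-x86_64 --diskspace 16000G
>> vzctl set 201 --diskspace 1000G --save
>>
>> # better: create the container at (or close to) the size you actually intend to use
>> vzctl create 202 --ostemplate centos-7-x86_64 --diskspace 1000G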
>>
>> Also note that ploop was not designed with any specific filesystem in
>> mind, it is
>> universal, so #3 can be solved by moving to a different fs in the future.
>>
>> Next, you can actually use shared base deltas for containers; although this is not
>> enabled by default, it is quite possible and works in practice. The key is
>> to create a base delta
>> and use it for multiple containers (via hardlinks).
>>
>> Here is a quick and dirty example:
>>
>> SRCID=50 # "Donor" container ID
>> vztmpl-dl centos-7-x86_64 # to make sure we use the latest
>> vzctl create $SRCID --ostemplate centos-7-x86_64
>> vzctl snapshot $SRCID
>> for CT in $(seq 1000 2000); do \
>>       mkdir -p /vz/private/$CT/root.hdd /vz/root/$CT; \
>>       ln /vz/private/$SRCID/root.hdd/root.hdd /vz/private/$CT/root.hdd/root.hdd; \
>>       cp -nr /vz/private/$SRCID/root.hdd /vz/private/$CT/; \
>>       cp /etc/vz/conf/$SRCID.conf /etc/vz/conf/$CT.conf; \
>>    done
>> vzctl set $SRCID --disabled yes --save # make sure we don't use it
>>
>> This will create about 1000 containers (so make sure your host has enough RAM),
>> each having about 650MB of files, so 650GB in total. Host disk space used
>> will be
>> about 650 + 1000*1 MB before start (i.e. about 2GB), or about 650 +
>> 1000*30 MB
>> after start (i.e. about 32GB). So:
>>
>> real data used inside containers near 650 GB
>> real space used on hard disk is near 32 GB
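>>
>> A couple of quick sanity checks for those numbers (CT 50 is the donor from the snippet
>> above; the actual output will of course vary):
>>
>> stat -c %h /vz/private/50/root.hdd/root.hdd   # hardlink count around 1000, i.e. the base delta is shared
>> df -h /vz                                     # host-side usage stays in the tens of GB, not 650 GB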
>>
>> So, 20x disk space savings, and this result is reproducible. Surely it
>> will get worse
>> over time etc., and this way of using ploop is neither official nor
>> supported/recommended,
>> but it's not the point here. The points are:
>>  - this is a demonstration of what you could do with ploop
>>  - this shows why you shouldn't trust any numbers
>>
>>  =======================================================================
>>>
>>> My experience with ZFS:
>>>
>>> real data used inside container near 62 GiB,
>>> real space used on hard disk is near 11 GiB.
>>>
>>
>>  So, you are not even comparing apples to apples here. You just took two
>> different containers, certainly of different sizes, probably also
>> different data sets
>> and usage history. Not saying it's invalid, but if you want to have a
>> meaningful
>> (rather than anecdotal) comparison, you need to use the same data sets, the same
>> operations on the data etc., try to optimize each case, and then compare the results.
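>>
>> A very rough sketch of what a like-for-like test could look like (the CT IDs, the
>> dataset generator, the sizes and the ZFS dataset name are all made up for illustration):
>>
>> # generate an identical synthetic data set inside a ploop-backed CT (301) and a ZFS-backed CT (302)
>> for CT in 301 302; do
>>     vzctl exec $CT 'mkdir -p /srv/test; for i in $(seq 1 100); do dd if=/dev/urandom of=/srv/test/f$i bs=1M count=100; done'
>>     vzctl exec $CT 'rm -f /srv/test/f1?'     # create-then-delete, to exercise space reclaim
>>     vzctl exec $CT 'df -B1 /'                # usage as seen from inside the container
>> done
>> du -sB1 /vz/private/301/root.hdd             # host-side size of the ploop image
>> zfs list -o name,used,refer vz/private/302   # host-side size of the ZFS dataset (name assumed)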
>>
>>
>>
>>
> _______________________________________________
> Users mailing list
> Users at openvz.org
> https://lists.openvz.org/mailman/listinfo/users
>
>