[Users] ZFS vs ploop

Kir Kolyshkin kir at openvz.org
Wed Jul 22 19:44:52 PDT 2015



On 07/22/2015 10:08 AM, Gena Makhomed wrote:
> On 22.07.2015 8:39, Kir Kolyshkin wrote:
>
>>> 1) Currently, even suspend/resume does not work reliably:
>>> https://bugzilla.openvz.org/show_bug.cgi?id=2470
>>> - I can't suspend and resume containers without hitting bugs,
>>> and as a result I also can't use it for live migration.
>>
>> Valid point, we need to figure it out. What I don't understand
>> is how lots of users are enjoying live migration despite this bug.
>> Personally, I have never come across it.
>
> Nevertheless, steps to reproduce the bug 100% of the time are provided in the bug report.

I was not saying anything about the bug report being bad/incomplete.
My point was, the feature works fine for many people despite this bug.

>
>>> 2) I see many bug reports on Google about this feature:
>>> "openvz live migration kernel panic" - so I prefer to take
>>> planned downtime of containers at night instead
>>> of unexpected and very painful kernel panics and
>>> complete reboots in the middle of the working day
>>> (with data loss, data corruption, and other "amenities").
>>
>> Unlike the previous item, which is valid, this is pure FUD.
>
> Compare two situations:
>
> 1) Live migration is not used at all
>
> 2) Live migration is used and containers are migrated between HNs
>
> In which situation is the possibility of a kernel panic higher?
>
> If you say "the possibilities are equal", that means
> the OpenVZ live migration code has no errors at all.
>
> Is that feasible? Especially if you consider the volume and
> complexity of the OpenVZ live migration code, and the grandiosity of this task.
>
> If you say "for (1) the possibility is lower and for (2)
> it is higher" - that is exactly what I think.
>
> I don't use live migration because I don't want kernel panics.

Following your logic, if you don't want kernel panics, you might want
to avoid advanced filesystems such as ZFS, and also containers,
cgroups, namespaces, etc. The ultimate solution here, of course,
is to not use the kernel at all -- this totally guarantees no kernel
panics at all, ever.

On a serious note, I find your logic flawed.

>
> And you say that "this is pure FUD"? Why?

Because it is not based on your own experience or on sound statistics,
but rather on something you saw on Google, followed by some
flawed logic.

>
>
>>> 4) From a technical point of view, it is possible
>>> to do live migration using ZFS, so "live migration"
>>> is currently the only advantage of ploop over ZFS.
>>
>> I wouldn't say so. If you have some real-world comparison
>> of ZFS vs ploop, feel free to share it -- for example, density
>> or performance measurements done in a controlled environment.
>
> Ok.
>
> My experience with ploop:
>
> DISKSPACE was limited to 256 GiB, and the real data used inside the
> container was near 40-50% of the 256 GiB limit, but the ploop image was
> a lot bigger - it used nearly 256 GiB of space on the hardware node.
> Overhead ~ 50-60%.
>
> I found a workaround for this: run "/usr/sbin/vzctl compact $CT"
> via cron every night, and now the ploop image has less overhead.
>
> current state:
>
> on hardware node:
>
> # du -b /vz/private/155/root.hdd
> 205963399961    /vz/private/155/root.hdd
>
> inside container:
>
> # df -B1
> Filesystem               1B-blocks          Used    Available Use% Mounted on
> /dev/ploop38149p1     270426705920  163129053184  94928560128  64% /
>
> ====================================
>
> used space, bytes: 163129053184
>
> image size, bytes: 205963399961
>
> "ext4 over ploop over ext4" solution disk space overhead is near 26%,
> or is near 40 GiB, if see this disk space overhead in absolute numbers.
>
> This is main disadvantage of ploop.
>
> And this disadvantage can't be avoided - it is "by design".

To anyone reading this, there are a few things here worth noting.

a. Such overhead is caused by three things:
1. creating and then removing data (vzctl compact takes care of that)
2. filesystem fragmentation (we have some experimental patches to ext4,
   plus an ext4 defragmenter, to solve it, but this is currently still at
   the research stage)
3. initial filesystem layout (which depends on the initial ext4 fs size,
   including the inode requirement)

So, #1 is solved, #2 is solvable, and #3 is a limitation of the underlying
filesystem and can be mitigated by properly choosing the initial size of a
newly created ploop.
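
For #1, a minimal nightly cron job (just a sketch; vzlist -H -o ctid lists
the running containers) can be as simple as:

# compact all running containers every night (e.g. from /etc/cron.daily)
for CT in $(vzlist -H -o ctid); do
    /usr/sbin/vzctl compact $CT
done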

An example of the #3 effect: if you create a very large filesystem
initially (say, 16TB) and then downsize it (say, to 1TB), the filesystem
metadata overhead will be quite big. The same thing happens if you ask for
lots of inodes (here "lots" means more than the default value, which is
1 inode per 16K of disk space). This happens because the ext4 filesystem
is not designed to shrink. Therefore, to have the lowest possible overhead,
you have to choose the initial filesystem size carefully. Yes, this is not
a solution but a workaround.
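
For illustration only (the CT ID and sizes below are made up): if you expect
a container to need around 256 GiB, create it at that size right away rather
than creating it huge and shrinking it later; growing afterwards is cheap,
it is the shrink path that leaves extra ext4 metadata behind:

vzctl create 155 --ostemplate centos-7-x86_64 --diskspace 256G
# later, if more space is needed, grow the filesystem in place
vzctl set 155 --diskspace 512G --save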

Also note that ploop was not designed with any specific filesystem in mind;
it is universal, so #3 can be solved by moving to a different fs in the future.

Next, you can actually use shared base deltas for containers. Although this
is not enabled by default, it is quite possible and works in practice. The
key is to create a base delta and use it for multiple containers (via
hardlinks).

Here is a quick and dirty example:

SRCID=50 # "Donor" container ID
vztmpl-dl centos-7-x86_64 # to make sure we use the latest
vzctl create $SRCID --ostemplate centos-7-x86_64
vzctl snapshot $SRCID
for CT in $(seq 1000 2000); do \
    mkdir -p /vz/private/$CT/root.hdd /vz/root/$CT; \
    ln /vz/private/$SRCID/root.hdd/root.hdd /vz/private/$CT/root.hdd/root.hdd; \
    cp -nr /vz/private/$SRCID/root.hdd /vz/private/$CT/; \
    cp /etc/vz/conf/$SRCID.conf /etc/vz/conf/$CT.conf; \
done
vzctl set $SRCID --disabled yes --save # make sure we don't use it

This will create 1000 containers (so make sure your host has enough RAM),
each holding about 650 MB of files, so about 650 GB in total. Host disk
space used will be about 650 + 1000*1 MB before start (i.e. about 2 GB),
or about 650 + 1000*30 MB after start (i.e. about 32 GB). So:

real data used inside containers: near 650 GB
real space used on the hard disk: near 32 GB

So, a 20x disk space saving, and this result is reproducible. Surely it
will get worse over time etc., and this way of using ploop is neither
official nor supported/recommended, but that's not the point here. The
points are:
  - this is a demonstration of what you can do with ploop
  - this shows why you shouldn't trust any numbers
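
If you want to check numbers like these yourself, a rough way (assuming the
loop above and that the containers are running) is:

# real space used on the host -- the hardlinked base delta is counted once
du -sh /vz/private
# sum up the space used inside all containers
for CT in $(seq 1000 2000); do
    vzctl exec $CT df -P -B1 / | tail -1
done | awk '{sum += $3} END {printf "used inside CTs: %.0f GB\n", sum/1e9}'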

> =======================================================================
>
> My experience with ZFS:
>
> real data used inside the container: near 62 GiB
> real space used on the hard disk: near 11 GiB

So, you are not even comparing apples to apples here. You just took two
different containers, certainly of different sizes, and probably also with
different data sets and usage histories. I'm not saying it's invalid, but
if you want a meaningful (rather than anecdotal) comparison, you need to
use the same data sets and the same operations on the data, try to
optimize each case, and then compare.
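
As a starting point, such a test could look something like this (purely
illustrative; the CT IDs, the ZFS dataset name and the workload are made up):

# run the same workload in a ploop-backed and a ZFS-backed container
for CT in $PLOOP_CT $ZFS_CT; do
    vzctl exec $CT 'dd if=/dev/urandom of=/var/tmp/blob bs=1M count=10240; rm /var/tmp/blob'
done
vzctl compact $PLOOP_CT                # let ploop return the freed blocks
du -b /vz/private/$PLOOP_CT/root.hdd   # space the ploop image takes on the host
zfs list -o name,used,refer,compressratio tank/private/$ZFS_CT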




