[Users] ZFS vs ploop
Kir Kolyshkin
kir at openvz.org
Thu Jul 23 00:39:02 PDT 2015
On 07/22/2015 11:59 PM, Сергей Мамонов wrote:
> >1. creating then removing data (vzctl compact takes care of that)
> >So, #1 is solved
>
> Only partially, in fact.
> 1. Compact eats a lot of resources, because of its heavy disk usage.
> 2. You need to compact your ploop very, very regularly.
>
> On our nodes we run compact every day; with a 3-5T /vz/ the daily
> delta is about 4-20% of the space!
> Every day it has to clean 300-500+ GB.
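> (One rough way to see how much a full compact run reclaims -- an untested
> sketch, assuming the standard /vz/private/$CT/root.hdd/root.hdd layout:)
>
>    # total size of all ploop images before and after compacting everything
>    BEFORE=$(du -cb /vz/private/*/root.hdd/root.hdd | tail -1 | cut -f1)
>    for CT in $(vzlist -H -o ctid); do vzctl compact $CT; done
>    AFTER=$(du -cb /vz/private/*/root.hdd/root.hdd | tail -1 | cut -f1)
>    echo "reclaimed: $(( (BEFORE - AFTER) / 1024 / 1024 / 1024 )) GB"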
>
> And it does not clean everything. For example:
>
> [root at evo12 ~]# vzctl compact 75685
> Trying to find free extents bigger than 0 bytes
> Waiting
> Call FITRIM, for minlen=33554432
> Call FITRIM, for minlen=16777216
> Call FITRIM, for minlen=8388608
> Call FITRIM, for minlen=4194304
> Call FITRIM, for minlen=2097152
> Call FITRIM, for minlen=1048576
> 0 clusters have been relocated
> [root at evo12 ~]# ls -lhat /vz/private/75685/root.hdd/root.hdd
> -rw------- 1 root root 43G Jul 20 20:45 /vz/private/75685/root.hdd/root.hdd
> [root at evo12 ~]# vzctl exec 75685 df -h /
> Filesystem Size Used Avail Use% Mounted on
> /dev/ploop32178p1 50G 26G 21G 56% /
> [root at evo12 ~]# vzctl --version
> vzctl version 4.9.2
This is either #2 or #3 from my list, or both.
>
> >My point was, the feature works fine for many people despite this bug.
>
> It is not fine, but we need it very much for migration and not only that. So
> we use it anyway; we have no alternative, in fact.
> And this is just one of the bugs. Live migration regularly fails, because vzctl
> cannot restore the container correctly after suspend.
You really need to file bugs if you want fixes.
> CPT is a pain, in fact. But I want to believe that CRIU will fix everything =)
>
> And being limited to ext4 with ploop is not a good situation, and not a modern one either.
> For example, on some big nodes we have several /vz/ partitions, because the RAID
> controller cannot put all the disks into one RAID10 logical device. Several
> /vz/ partitions are not comfortable to work with,
> and they are less flexible than one zpool, for example.
>
>
>
> 2015-07-23 5:44 GMT+03:00 Kir Kolyshkin <kir at openvz.org>:
>
>
>
> On 07/22/2015 10:08 AM, Gena Makhomed wrote:
>
> On 22.07.2015 8:39, Kir Kolyshkin wrote:
>
> 1) currently even suspend/resume does not work reliably:
> https://bugzilla.openvz.org/show_bug.cgi?id=2470
> - I can't suspend and resume containers without hitting bugs,
> and as a result I also can't use it for live migration.
>
>
> Valid point, we need to figure it out. What I don't understand
> is how lots of users are enjoying live migration despite
> this bug.
> Me, personally, I never came across this.
>
>
> Nevertheless, steps to reproduce the bug 100% of the time are provided in the bug report.
>
>
> I was not saying anything about the bug report being bad/incomplete.
> My point was, the feature works fine for many people despite this bug.
>
>
> 2) I see in Google many bug reports about this feature
> ("openvz live migration kernel panic"), so I prefer a
> planned downtime of containers at night instead
> of unexpected and very painful kernel panics and
> complete reboots in the middle of the working day
> (with data loss, data corruption and other "amenities").
>
>
> Unlike the previous item, which is valid, this is pure FUD.
>
>
> Compare two situations:
>
> 1) Live migration not used at all
>
> 2) Live migration is used and containers are migrated between hardware nodes
>
> In which situation is the probability of getting a kernel panic higher?
>
> If you say "the probabilities are equal", this means
> that the OpenVZ live migration code has no errors at all.
>
> Is that feasible? Especially considering the volume and complexity of the
> OpenVZ live migration code, and the grandiosity of the task.
>
> If you say "for (1) the probability is lower and for (2) it is higher" -
> that is exactly what I think.
>
> I don't use live migration because I don't want kernel panics.
>
>
> Following your logic, if you don't want kernel panics, you might want
> to not use advanced filesystems such as ZFS, not use containers,
> cgroups, namespaces, etc. The ultimate solution here, of course,
> is to not use the kernel at all -- this will totally guarantee
> no kernel panics at all, ever.
>
> On a serious note, I find your logic flawed.
>
>
> And you say that "this is pure FUD"? Why?
>
>
> Because it is not based on your experience or correct statistics,
> but rather on something you saw on Google followed by some
> flawed logic.
>
>
>
>
> 4) from a technical point of view it is possible
> to do live migration using ZFS, so "live migration"
> is currently the only advantage of ploop over ZFS
>
>
> I wouldn't say so. If you have some real world comparison
> of zfs vs ploop, feel free to share. Like density or
> performance
> measurements, done in a controlled environment.
>
>
> Ok.
>
> My experience with ploop:
>
> DISKSPACE was limited to 256 GiB, real data used inside the container
> was near 40-50% of the 256 GiB limit, but the ploop image was a lot bigger -
> it used near 256 GiB of space on the hardware node. Overhead ~ 50-60%.
>
> I found a workaround for this: run "/usr/sbin/vzctl compact $CT"
> via cron every night, and now the ploop image has less overhead.
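> (For example, something like this nightly cron entry -- just a rough sketch,
> the cron.d file name and time are arbitrary, and it assumes every container
> on the node is ploop-based:)
>
>    # /etc/cron.d/ploop-compact: compact all containers at 03:00
>    0 3 * * * root for CT in $(/usr/sbin/vzlist -H -o ctid); do /usr/sbin/vzctl compact $CT; done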
>
> current state:
>
> on hardware node:
>
> # du -b /vz/private/155/root.hdd
> 205963399961 /vz/private/155/root.hdd
>
> inside container:
>
> # df -B1
> Filesystem           1B-blocks         Used    Available Use% Mounted on
> /dev/ploop38149p1 270426705920 163129053184  94928560128  64% /
>
> ====================================
>
> used space, bytes: 163129053184
>
> image size, bytes: 205963399961
>
> "ext4 over ploop over ext4" solution disk space overhead is
> near 26%,
> or is near 40 GiB, if see this disk space overhead in absolute
> numbers.
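> (Worked out from the numbers above:
> 205963399961 - 163129053184 = 42834346777 bytes, i.e. ~42.8 GB or roughly 40 GiB,
> and 42834346777 / 163129053184 ~ 0.26, i.e. ~26% on top of the data actually used.)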
>
> This is the main disadvantage of ploop.
>
> And this disadvantage can't be avoided - it is "by design".
>
>
> To anyone reading this, there are a few things here worth noting.
>
> a. Such overhead is caused by three things:
> 1. creating then removing data (vzctl compact takes care of that)
> 2. filesystem fragmentation (we have some experimental patches to ext4
> plus an ext4 defragmenter to solve it, but currently it's
> still in the research stage)
> 3. initial filesystem layout (which depends on initial ext4 fs
> size, including inode requirement)
>
> So, #1 is solved, #2 is solvable, and #3 is a limitation of the
> filesystem used and can be mitigated
> by properly choosing the initial size of a newly created ploop.
>
> An example of the #3 effect is this: if you create a very large
> filesystem initially (say, 16TB) and then
> downsize it (say, to 1TB), the filesystem metadata overhead will be
> quite big. The same thing happens
> if you ask for lots of inodes (here "lots" means more than the
> default, which is 1 inode
> per 16K of disk space). This happens because the ext4 filesystem is
> not designed to shrink.
> Therefore, to have the lowest possible overhead you have to choose the
> initial filesystem size
> carefully. Yes, this is not a solution but a workaround.
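> For instance (just a sketch; the container ID, template and sizes below are
> arbitrary examples):
>
>    # create the container with the disk size you actually intend to use,
>    # instead of creating it oversized and shrinking it later
>    vzctl create 101 --ostemplate centos-7-x86_64 --diskspace 50G
>
>    # growing later is cheap; it is shrinking that leaves ext4 metadata behind
>    vzctl set 101 --diskspace 100G --save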
>
> Also note that ploop was not designed with any specific
> filesystem in mind; it is
> universal, so #3 can be solved by moving to a different fs in the
> future.
>
> Next thing: you can actually use shared base deltas for
> containers. Although this is not
> enabled by default, it is quite possible and works in practice. The
> key is to create a base delta
> and use it for multiple containers (via hardlinks).
>
> Here is a quick and dirty example:
>
> SRCID=50                     # "Donor" container ID
> vztmpl-dl centos-7-x86_64    # to make sure we use the latest template
> vzctl create $SRCID --ostemplate centos-7-x86_64
> vzctl snapshot $SRCID        # root.hdd becomes a read-only base delta
> for CT in $(seq 1001 2000); do    # 1000 new container IDs
>     mkdir -p /vz/private/$CT/root.hdd /vz/root/$CT
>     # hardlink the shared base delta...
>     ln /vz/private/$SRCID/root.hdd/root.hdd /vz/private/$CT/root.hdd/root.hdd
>     # ...then copy the rest (top delta, DiskDescriptor.xml); -n keeps the hardlink
>     cp -nr /vz/private/$SRCID/root.hdd /vz/private/$CT/
>     cp /etc/vz/conf/$SRCID.conf /etc/vz/conf/$CT.conf
> done
> vzctl set $SRCID --disabled yes --save   # make sure we don't use it
>
> This will create 1000 containers (so make sure your host has
> enough RAM),
> each having about 650MB of files, so 650GB in total. Host disk space
> used will be
> about 650 + 1000*1 MB before start (i.e. about 2GB), or about 650
> + 1000*30 MB
> after start (i.e. about 32GB). So:
>
> real data used inside the containers is near 650 GB
> real space used on the hard disk is near 32 GB
>
> So, 20x disk space savings, and this result is reproducible.
> Surely it will get worse
> over time etc., and this way of using ploop is neither official
> nor supported/recommended,
> but it's not the point here. The points are:
> - this is a demonstration of what you could do with ploop
> - this shows why you shouldn't trust any numbers
>
> =======================================================================
>
> My experience with ZFS:
>
> real data used inside the container is near 62 GiB,
> real space used on the hard disk is near 11 GiB.
>
>
> So, you are not even comparing apples to apples here. You just
> took two
> different containers, certainly of different sizes, probably also
> different data sets
> and usage history. Not saying it's invalid, but if you want to
> have a meaningful
> (rather than anecdotal) comparison, you need to use the same data
> sets, the same
> operations on the data etc., try to optimize each case, and compare the results.
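>
> Roughly along these lines (just a sketch; $PLOOP_CT and $ZFS_CT are
> placeholders for two otherwise identical containers, one ploop-backed and
> one whose private area lives on a ZFS dataset, and /data.bin is an
> arbitrary test file):
>
>    # same data and same operations in both containers
>    for CT in $PLOOP_CT $ZFS_CT; do
>        vzctl exec $CT 'dd if=/dev/urandom of=/data.bin bs=1M count=10240'
>        vzctl exec $CT 'rm -f /data.bin'
>    done
>    vzctl compact $PLOOP_CT              # let each side do its own housekeeping
>    du -sh /vz/private/$PLOOP_CT/root.hdd
>    zfs list -o name,used,refer          # compare with the ZFS dataset's usage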
>
>
>
>
> _______________________________________________
> Users mailing list
> Users at openvz.org
> https://lists.openvz.org/mailman/listinfo/users