[Users] ZFS vs ploop

Kir Kolyshkin kir at openvz.org
Fri Jul 24 15:06:04 PDT 2015


On 07/24/2015 05:41 AM, Gena Makhomed wrote:
>
>> To anyone reading this, there are a few things here worth noting.
>>
>> a. Such overhead is caused by three things:
>> 1. creating then removing data (vzctl compact takes care of that)
>> 2. filesystem fragmentation (we have some experimental patches to ext4
>>      plus an ext4 defragmenter to solve it, but currently it's still in
>> research stage)
>> 3. initial filesystem layout (which depends on initial ext4 fs size,
>> including inode requirement)
>>
>> So, #1 is solved, #2 is solvable, and #3 is a limitation of the
>> filesystem used and can be mitigated
>> by properly choosing the initial size of a newly created ploop.
>
> this container is compacted every night; during the working day
> only new static files are added to the container, so it does not
> perform many "creating then removing data" operations.
>
> current state:
>
> on hardware node:
>
> # du -b /vz/private/155/root.hdd
> 203547480857    /vz/private/155/root.hdd
>
> inside container:
>
> # df -B1
> Filesystem               1B-blocks          Used    Available Use% Mounted on
> /dev/ploop55410p1     270426705920  163581190144  94476423168  64% /
>
>
> used space, bytes: 163581190144
>
> image size, bytes: 203547480857
>
> overhead: ~ 37 GiB, ~ 19.6%
>
> container was compacted at 03:00
> by command /usr/sbin/vzctl compact 155
>
> running container compaction again right now:
> 9443 clusters have been relocated
>
> result:
>
> used space, bytes: 163604983808
>
> image size, bytes: 193740149529
>
> overhead: ~ 28 GiB, ~ 15.5%
>
> I think it is not a good idea to run ploop compaction more frequently
> than once per day at night - so, to plan disk space on the hardware
> node for all ploop images, we need to take into account not the
> minimal value of the overhead but the maximal one, reached after
> 24 hours of the container working in normal mode.
>
> so the real overhead of ploop can only be measured
> after at least 24 hours of the container being in the running state.
>
>> An example of the #3 effect is this: if you create a very large filesystem
>> initially (say, 16TB) and then downsize it (say, to 1TB), filesystem
>> metadata overhead will be quite big. Same thing happens if you ask
>> for lots of inodes (here "lots" means more than a default value which
>> is 1 inode per 16K of disk space). This happens because ext4
>> filesystem is not designed to shrink. Therefore, to have the lowest
>> possible overhead you have to choose the initial filesystem size
>> carefully. Yes, this is not a solution but a workaround.
>
> as you can see by inodes:
>
> # df -i
> Filesystem Inodes IUsed IFree IUse% Mounted on
> /dev/ploop55410p1 16777216 1198297 15578919 8% /
>
> initial filesystem size was 256 GiB:
>
> (16777216 * 16 * 1024) / 1024.0 / 1024.0 / 1024.0 == 256 GiB.
>
> current filesystem size is also 256 GiB:
>
> # cat /etc/vz/conf/155.conf | grep DISKSPACE
> DISKSPACE="268435456:268435456"
>
> so there is no extra "filesystem metadata overhead".

Agree, this looks correct.

>
> what am I doing wrong, and how can I decrease the ploop overhead here?

Most probably it's because of filesystem fragmentation (my item #2 above).
We are currently working on that. For example, see this report:

  https://lwn.net/Articles/637428/
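
In the meantime, if you want a rough idea of how fragmented the filesystem
inside the container is, e4defrag from e2fsprogs can print a fragmentation
report without moving any data (this is just the stock ext4 tool, not the
experimental defragmenter mentioned above, and it assumes e4defrag is
available inside the container):

  # vzctl enter 155
  # e4defrag -c /

A high fragmentation score there would be consistent with fragmentation
being the source of the remaining overhead.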

>
> I found only one way: migrate to ZFS with lz4 compression turned on.
>
>> Also note, that ploop was not designed with any specific filesystem in
>> mind, it is universal, so #3 can be solved by moving to a different 
>> fs in the future.
>
> XFS currently does not support filesystem shrinking at all:
> http://xfs.org/index.php/Shrinking_Support

Actually ext4 doesn't support shrinking either; in ploop we worked around it
using a hidden balloon file. It appears to work pretty well, the only
downside being that if you initially create a very large ploop and then
shrink it considerably, the ext4 metadata overhead will be larger.
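
So, as a practical workaround, pick the intended size at creation time
instead of creating a large ploop and shrinking it later. A sketch of what
that could look like (the CT ID and template name here are just
placeholders):

  # vzctl create 155 --layout ploop --ostemplate centos-7-x86_64 --diskspace 256G

That way the ext4 filesystem inside the ploop image is created at (close to)
its final size and never has to be ballooned down, so its metadata overhead
stays at the minimum for that size.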

>
> BTRFS is not production-ready, and no other variants
> except ext4 are available for use with ploop in the near future.


