[Users] flashcache

Pavel Odintsov pavel.odintsov at gmail.com
Thu Jul 10 01:34:40 PDT 2014


Hello!

Your scheme is fine, but you can't divide I/O load with the cgroup blkio
controller (ioprio/iolimit/iopslimit) between different folders; between
different ZVOLs you can.
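
For example, with a per-container ZVOL you could throttle its IOPS through
the blkio cgroup directly. A rough sketch - the ZVOL path, cgroup mount
point, cgroup name and major:minor numbers below are only placeholders:

    # find the ZVOL's block device numbers (zd* devices, typically major 230)
    ls -lH /dev/zvol/vz/ct101
    # create a cgroup and cap reads/writes on that device to 500 IOPS
    mkdir /sys/fs/cgroup/blkio/ct101
    echo "230:0 500" > /sys/fs/cgroup/blkio/ct101/blkio.throttle.read_iops_device
    echo "230:0 500" > /sys/fs/cgroup/blkio/ct101/blkio.throttle.write_iops_device

The container's processes then have to run inside that cgroup; there is no
equivalent knob you can point at a plain folder on a shared filesystem.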

I can imagine the following problems with a per-folder scheme:
1) You can't limit the number of inodes in different folders (ZFS has no
per-filesystem inode limit like ext4, but a huge number of files in a
container could break the node - see the quota sketch after this list;
http://serverfault.com/questions/503658/can-you-set-inode-quotas-in-zfs)
2) Problems with the system cache, which is shared by all containers on the
hardware node
3) Problems with live migration, because inode numbers _have to_ change
between nodes
4) ZFS behaviour with Linux software is very STRANGE in some cases (DIRECT_IO)
5) ext4 has good tooling support from vzctl (fsck, resize2fs)
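
Regarding 1), the closest substitute ZFS offers is a per-dataset space
quota, which only bounds the number of files indirectly through disk space
(a sketch; the dataset name is just an example):

    zfs set quota=50G vz/private/101
    zfs get quota,used vz/private/101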

My reasoning is along the lines of the simfs vs ploop comparison:
http://openvz.org/images/f/f3/Ct_in_a_file.pdf

On Thu, Jul 10, 2014 at 12:06 PM, Pavel Snajdr <lists at snajpa.net> wrote:
> On 07/09/2014 06:58 PM, Kir Kolyshkin wrote:
>> On 07/08/2014 11:54 PM, Pavel Snajdr wrote:
>>> On 07/08/2014 07:52 PM, Scott Dowdle wrote:
>>>> Greetings,
>>>>
>>>> ----- Original Message -----
>>>>> (offtopic) We can not use ZFS. Unfortunately, NAS with something like
>>>>> Nexenta is to expensive for us.
>>>> From what I've gathered from a few presentations, ZFS on Linux (http://zfsonlinux.org/) is as stable as, and more performant than, it is on the OpenSolaris forks... so you can build your own if you can spare the people to learn the best practices.
>>>>
>>>> I don't have a use for ZFS myself so I'm not really advocating it.
>>>>
>>>> TYL,
>>>>
>>> Hi all,
>>>
>>> we run tens of OpenVZ nodes (bigger boxes, 256G RAM, 12+ cores, at least
>>> 90 CTs each). We used to run ext4+flashcache, but ext4 proved to be a
>>> bottleneck. That was the primary motivation behind ploop, as far as I know.
>>>
>>> We switched to ZFS on Linux around the time ploop was announced and I
>>> haven't had second thoughts since. ZFS really *is*, in my experience, the
>>> best filesystem there is at the moment for this kind of deployment -
>>> especially if you use dedicated SSDs for ZIL and L2ARC, although the
>>> latter is less important. You will know what I'm talking about when you
>>> try this on boxes with lots of CTs doing LAMP load - databases and their
>>> synchronous writes are the real problem, which ZFS with a dedicated ZIL
>>> device solves.
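>>>
>>> For instance, dedicated log and cache vdevs can be attached to an
>>> existing pool like this (a sketch with placeholder partition names):
>>>
>>>   zpool add vz log mirror sdc3 sdd3    # mirrored SLOG for synchronous writes
>>>   zpool add vz cache sdc5 sdd5         # L2ARC devices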
>>>
>>> Also there is the ARC cache, which is smarter than the Linux VFS cache -
>>> we're able to achieve about a 99% hit rate about 99% of the time, even
>>> under high load.
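>>>
>>> For illustration, one rough way to check this on ZFS on Linux is via its
>>> kstats:
>>>
>>>   awk '$1=="hits"{h=$3} $1=="misses"{m=$3} END{printf "ARC hit rate: %.1f%%\n", h*100/(h+m)}' /proc/spl/kstat/zfs/arcstats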
>>>
>>> Having said all that, I recommend everyone give ZFS a chance, but I'm
>>> aware this is yet another piece of out-of-mainline code, and that doesn't
>>> suit everyone equally well.
>>>
>>
>> Are you using per-container ZVOL or something else?
>
> That would mean I'd need to put another filesystem on top of ZFS, which
> would in turn add another unnecessary layer of indirection. ZFS is pooled
> storage, like btrfs, so we give one dataset to each container.
>
> vzctl tries to move the VE_PRIVATE folder around, so we had to add one
> more directory level to hold the VE_PRIVATE data (see the first ls below).
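>
> Roughly, a new container's dataset is set up like this (a sketch; CT 101
> matches the production example below):
>
>   zfs create vz/private/101
>   mkdir /vz/private/101/private
>   # then point VE_PRIVATE at /vz/private/101/private in the CT's config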
>
> Example from production:
>
> [root at node2.prg.vpsfree.cz]
>  ~ # zpool status vz
>   pool: vz
>  state: ONLINE
>   scan: scrub repaired 0 in 1h24m with 0 errors on Tue Jul  8 16:22:17 2014
> config:
>
>         NAME        STATE     READ WRITE CKSUM
>         vz          ONLINE       0     0     0
>           mirror-0  ONLINE       0     0     0
>             sda     ONLINE       0     0     0
>             sdb     ONLINE       0     0     0
>           mirror-1  ONLINE       0     0     0
>             sde     ONLINE       0     0     0
>             sdf     ONLINE       0     0     0
>           mirror-2  ONLINE       0     0     0
>             sdg     ONLINE       0     0     0
>             sdh     ONLINE       0     0     0
>         logs
>           mirror-3  ONLINE       0     0     0
>             sdc3    ONLINE       0     0     0
>             sdd3    ONLINE       0     0     0
>         cache
>           sdc5      ONLINE       0     0     0
>           sdd5      ONLINE       0     0     0
>
> errors: No known data errors
>
> [root at node2.prg.vpsfree.cz]
>  ~ # zfs list
> NAME              USED  AVAIL  REFER  MOUNTPOINT
> vz                432G  2.25T    36K  /vz
> vz/private        427G  2.25T   111K  /vz/private
> vz/private/101   17.7G  42.3G  17.7G  /vz/private/101
> <snip>
> vz/root           104K  2.25T   104K  /vz/root
> vz/template      5.38G  2.25T  5.38G  /vz/template
>
> [root at node2.prg.vpsfree.cz]
>  ~ # zfs get compressratio vz/private/101
> NAME            PROPERTY       VALUE  SOURCE
> vz/private/101  compressratio  1.38x  -
>
> [root at node2.prg.vpsfree.cz]
>  ~ # ls /vz/private/101
> private
>
> [root at node2.prg.vpsfree.cz]
>  ~ # ls /vz/private/101/private/
> aquota.group  aquota.user  b  bin  boot  dev  etc  git  home  lib
> <snip>
>
> [root at node2.prg.vpsfree.cz]
>  ~ # cat /etc/vz/conf/101.conf | grep -P "PRIVATE|ROOT"
> VE_ROOT="/vz/root/101"
> VE_PRIVATE="/vz/private/101/private"
>
>



-- 
Sincerely yours, Pavel Odintsov

