[Users] flashcache

Pavel Snajdr lists at snajpa.net
Thu Jul 10 02:08:19 PDT 2014


On 07/10/2014 10:34 AM, Pavel Odintsov wrote:
> Hello!
> 
> Your scheme is fine, but with cgroup blkio (ioprio/iolimit/iopslimit)
> you can't divide I/O load between different folders; between
> different ZVOLs you can.

Not true; IO limits work as they should (if we're talking about vzctl
set --iolimit/--iopslimit). I've prodded the ZoL guys to add IO
accounting support, so it is there.
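
Setting the limits themselves is just the standard vzctl call; the CT ID
and values below are made-up examples, not our production settings:

 ~ # vzctl set 101 --iolimit 10M --iopslimit 300 --save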

> 
> I could imagine the following problems with a per-folder scheme:
> 1) You can't limit the number of inodes in different folders (there is
> no inode limit in ZFS like there is in ext4, but a huge number of files
> in a container could break the node;

How? ZFS doesn't have a practical limit on the number of files (2^48 per
filesystem isn't really a limit)

> http://serverfault.com/questions/503658/can-you-set-inode-quotas-in-zfs)
> 2) Problems with the system cache, which is shared by all containers on the HWN

This is exactly not a problem but a *HUGE* benefit; you'd need to see it
in practice :) The Linux VFS cache is really dumb in comparison to ARC.
ARC's hit rates just can't be matched with what Linux currently offers.
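
If you want to see the hit rates for yourself, ZoL exports the ARC
counters under /proc; a quick-and-dirty calculation from the overall
hits/misses counters (just a sketch, no tuning implied) is:

 ~ # awk '/^hits / {h=$3} /^misses / {m=$3}
      END {printf "ARC hit rate: %.2f%%\n", h*100/(h+m)}' \
      /proc/spl/kstat/zfs/arcstats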

> 3) Problems with live migration, because you _would_ have to change inode
> numbers on different nodes

Why? ZFS send/receive can make a bit-by-bit identical copy of the FS.
I thought the point of migration was for the CT not to notice any
change, so I don't see why the inode numbers should change.
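
The mechanics of the transfer itself are nothing more than this (the
target hostname, CT ID and snapshot name are made up; suspending or
stopping the CT around it is a separate topic):

 ~ # zfs snapshot vz/private/101@migrate
 ~ # zfs send vz/private/101@migrate | ssh node2 zfs receive vz/private/101

For the final sync you send an incremental stream (zfs send -i) on top
of the first full copy, which is what keeps the downtime short.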

> 4) ZFS behaviour with Linux software is in some cases very STRANGE (DIRECT_IO)

How exactly? I haven't seen a problem with any userspace software, other
than MySQL defaulting to native AIO, which ZFS doesn't support (*yet*,
they have it in their plans); MySQL just falls back to the older method.
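
If you'd rather be explicit about it than rely on the fallback, InnoDB
can be told not to use native AIO or O_DIRECT at all. These are real
InnoDB options, but treat the snippet as an illustration, not our
recommended config; in the CT's /etc/my.cnf, [mysqld] section:

 innodb_use_native_aio = 0
 innodb_flush_method = fsync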

> 5) ext4 has good support from vzctl (fsck, resize2fs)

Yeah, but ext4 sucks big time. At least in my use case.

We've implemented most of vzctl's create/destroy/etc. functionality in
our vpsAdmin software instead.
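
To give you an idea, with the per-dataset layout shown further down in
this thread, "create" boils down to roughly this (a simplified sketch;
the CT ID, template name and compression setting are made-up examples):

 ~ # zfs create -o compression=lz4 vz/private/102
 ~ # mkdir /vz/private/102/private
 ~ # tar xzf /vz/template/cache/centos-6-x86_64.tar.gz \
      -C /vz/private/102/private

plus writing out the CT config with VE_PRIVATE pointing at that extra
"private" subdirectory (see the 101.conf excerpt below).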

Guys, can I ask you to keep your minds open instead of fighting with
pointless arguments? :) Give ZFS a try and then decide for yourselves.

I think the community would benefit greatly if ZFS weren't fought as
something alien in the Linux world, which in my experience is what
every Linux zealot I talk to about ZFS ends up doing.
This is just not fair. It's primarily about technology, about the best
tool for the job. If we could implement something like this in Linux
without ties to CDDL and possibly Oracle patents, that would be awesome,
yet nobody has done it. BTRFS is nowhere near ZFS when it comes to
running larger-scale deployments, and in some regards I don't think it
ever will be, just looking at the way it's been designed.

I'm not trying to flame here, I'm trying to open you guys up to the fact
that there really is a better alternative than what you're currently
seeing. And if it has some technological drawbacks like the ones you're
trying to point out, instead of treating them as something that can't be
changed, and thus everyone should use "your best solution(tm)", try to
think of ways to change it for the better.

> 
> My ideas like simfs vs ploop comparison:
> http://openvz.org/images/f/f3/Ct_in_a_file.pdf

Again, you have to see ZFS doing its magic in production under really
heavy load, otherwise you won't understand. The arbitrary benchmarks
I've seen show ZFS as slower than ext4, but they aren't tuned for the
kind of use case I'm talking about.

/snajpa

> 
> On Thu, Jul 10, 2014 at 12:06 PM, Pavel Snajdr <lists at snajpa.net> wrote:
>> On 07/09/2014 06:58 PM, Kir Kolyshkin wrote:
>>> On 07/08/2014 11:54 PM, Pavel Snajdr wrote:
>>>> On 07/08/2014 07:52 PM, Scott Dowdle wrote:
>>>>> Greetings,
>>>>>
>>>>> ----- Original Message -----
>>>>>> (offtopic) We can not use ZFS. Unfortunately, NAS with something like
>>>>>> Nexenta is to expensive for us.
>>>>> From what I've gathered from a few presentations, ZFS on Linux (http://zfsonlinux.org/) is as stable as, and more performant than, it is on the OpenSolaris forks... so you can build your own if you can spare the people to learn the best practices.
>>>>>
>>>>> I don't have a use for ZFS myself so I'm not really advocating it.
>>>>>
>>>>> TYL,
>>>>>
>>>> Hi all,
>>>>
>>>> we run tens of OpenVZ nodes (bigger boxes, 256G RAM, 12+ cores, 90 CTs at
>>>> least). We used to run ext4+flashcache, but ext4 proved to be a
>>>> bottleneck. That was the primary motivation behind ploop, as far as I know.
>>>>
>>>> We switched to ZFS on Linux around the time ploop was announced and I
>>>> haven't had second thoughts since. ZFS really *is*, in my experience, the
>>>> best filesystem there is at the moment for this kind of deployment -
>>>> especially if you use dedicated SSDs for the ZIL and L2ARC, although the
>>>> latter is less important. You will know what I'm talking about when you
>>>> try this on boxes with lots of CTs doing LAMP load - databases and their
>>>> synchronous writes are the real problem, which ZFS with a dedicated ZIL
>>>> device solves.
>>>>
>>>> Also there is the ARC caching, which is smarter than the Linux VFS cache -
>>>> we're able to achieve about a 99% hit rate about 99% of the time,
>>>> even under high loads.
>>>>
>>>> Having said all that, I recommend that everyone give ZFS a chance, but I'm
>>>> aware this is yet more out-of-mainline code and that doesn't suit
>>>> everyone that well.
>>>>
>>>
>>> Are you using per-container ZVOL or something else?
>>
>> That would mean I'd need to put another filesystem on top of ZFS, which
>> would in turn mean adding another unnecessary layer of indirection. ZFS
>> is pooled storage like BTRFS; we're giving one dataset to each
>> container.
>>
>> vzctl tries to move the VE_PRIVATE folder around, so we had to add one
>> more directory level to put the VE_PRIVATE data into (see the first ls).
>>
>> Example from production:
>>
>> [root at node2.prg.vpsfree.cz]
>>  ~ # zpool status vz
>>   pool: vz
>>  state: ONLINE
>>   scan: scrub repaired 0 in 1h24m with 0 errors on Tue Jul  8 16:22:17 2014
>> config:
>>
>>         NAME        STATE     READ WRITE CKSUM
>>         vz          ONLINE       0     0     0
>>           mirror-0  ONLINE       0     0     0
>>             sda     ONLINE       0     0     0
>>             sdb     ONLINE       0     0     0
>>           mirror-1  ONLINE       0     0     0
>>             sde     ONLINE       0     0     0
>>             sdf     ONLINE       0     0     0
>>           mirror-2  ONLINE       0     0     0
>>             sdg     ONLINE       0     0     0
>>             sdh     ONLINE       0     0     0
>>         logs
>>           mirror-3  ONLINE       0     0     0
>>             sdc3    ONLINE       0     0     0
>>             sdd3    ONLINE       0     0     0
>>         cache
>>           sdc5      ONLINE       0     0     0
>>           sdd5      ONLINE       0     0     0
>>
>> errors: No known data errors
>>
>> [root at node2.prg.vpsfree.cz]
>>  ~ # zfs list
>> NAME              USED  AVAIL  REFER  MOUNTPOINT
>> vz                432G  2.25T    36K  /vz
>> vz/private        427G  2.25T   111K  /vz/private
>> vz/private/101   17.7G  42.3G  17.7G  /vz/private/101
>> <snip>
>> vz/root           104K  2.25T   104K  /vz/root
>> vz/template      5.38G  2.25T  5.38G  /vz/template
>>
>> [root at node2.prg.vpsfree.cz]
>>  ~ # zfs get compressratio vz/private/101
>> NAME            PROPERTY       VALUE  SOURCE
>> vz/private/101  compressratio  1.38x  -
>>
>> [root at node2.prg.vpsfree.cz]
>>  ~ # ls /vz/private/101
>> private
>>
>> [root at node2.prg.vpsfree.cz]
>>  ~ # ls /vz/private/101/private/
>> aquota.group  aquota.user  b  bin  boot  dev  etc  git  home  lib
>> <snip>
>>
>> [root at node2.prg.vpsfree.cz]
>>  ~ # cat /etc/vz/conf/101.conf | grep -P "PRIVATE|ROOT"
>> VE_ROOT="/vz/root/101"
>> VE_PRIVATE="/vz/private/101/private"