[Users] flashcache

Pavel Odintsov pavel.odintsov at gmail.com
Thu Jul 10 02:35:15 PDT 2014


>Not true, IO limits are working as they should (if we're talking vzctl
>set --iolimit/--iopslimit). I've kicked the ZoL guys around to add IO
>accounting support, so it is there.

Could you share your tests with us? For standard folders like simfs
these limits work badly in a large number of cases.
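
For reference, the limits in question are applied per container; a
minimal sketch (CTID 101 and the values are only an illustration):

vzctl set 101 --iolimit 10M --save
vzctl set 101 --iopslimit 500 --save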

>How? ZFS doesn't have a limit on number of files (2^48 isn't a limit really)

Is it OK when a customer creates a billion small files on a 10GB VPS
and you then try to archive it for backup? On a slow disk system it's
a real nightmare, because the huge number of disk operations kills
your I/O.
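
For comparison, ext4 lets you cap the inode count when the filesystem
is created, which is what protects the node in the ext4/ploop case; a
minimal sketch, where the device name and the number are only an
illustration:

# limit the filesystem to ~650k inodes at creation time
# (the device name here is hypothetical)
mkfs.ext4 -N 655360 /dev/sdX1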

>Why? ZFS send/receive is able to do a bit-by-bit identical copy of the FS,
>I thought the point of migration is for the CT not to notice any
>change, I don't see why the inode numbers should change.

Do you have a really working zero-downtime vzmigrate on ZFS?
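
The ZFS side of such a migration would presumably look roughly like
the sketch below (hostnames, dataset paths and snapshot names are just
placeholders); the hard part is everything vzmigrate does around it:

zfs snapshot vz/private/101@migrate1
zfs send vz/private/101@migrate1 | ssh node2 zfs receive vz/private/101
# stop the CT, ship the final increment, then start the CT on the target
zfs snapshot vz/private/101@migrate2
zfs send -i @migrate1 vz/private/101@migrate2 | ssh node2 zfs receive -F vz/private/101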

>How exactly? I haven't seen a problem with any userspace software, other
>than MySQL defaulting to AIO (it falls back to an older method), which
>ZFS doesn't support (*yet*, they have it in their plans).

I'm speaking primarily about MySQL. I have thousands of containers and
I can't switch MySQL to another mode for all customers; it's
impossible.
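
A possible workaround would be a my.cnf change along these lines,
which is exactly what I can't roll out to thousands of customers:

[mysqld]
# avoid the native AIO path that ZFS does not support yet
innodb_use_native_aio = 0
innodb_flush_method = fsync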

> L2ARC cache really smart

Yep, fine, I know that. But can you account L2ARC cache usage per
customer? OpenVZ can do it for the page cache via a flag:

sysctl -a | grep pagecache_isolation
ubc.pagecache_isolation = 0

But one customer can eat almost all of the L2ARC cache and displace
other customers' data.
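
The closest ZFS-side knobs I know of are the per-dataset cache
policies, which are coarse on/off switches rather than real
per-customer accounting; a sketch (the dataset name is taken from the
example later in this thread):

zfs set primarycache=all vz/private/101
# keep this container's data blocks out of L2ARC, cache metadata only
zfs set secondarycache=metadata vz/private/101

and on the OpenVZ side the isolation flag above would be enabled with:

sysctl -w ubc.pagecache_isolation=1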

I'm not against ZFS, but I'm against using ZFS as the underlying
filesystem for containers. We caught ~100 kernel bugs with simfs on
ext4 when customers do strange things.

But ext4 has a few thousand developers and they fix these issues ASAP,
while ZFS on Linux has only 3-5 developers, which is VERY slow.
Because of this I recommend using ext4 with ploop, because that
solution is rock stable, or ZFS ZVOLs with ext4 on top, because that
solution is more reliable and more predictable than placing containers
directly on ZFS filesystems.
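
A sketch of the ZVOL variant I mean (pool name, volume name and size
are only an illustration):

# one ZVOL per container, with a plain ext4 on top of it
zfs create -V 20G vz/ct101
mkfs.ext4 /dev/zvol/vz/ct101
mount /dev/zvol/vz/ct101 /vz/private/101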


On Thu, Jul 10, 2014 at 1:08 PM, Pavel Snajdr <lists at snajpa.net> wrote:
> On 07/10/2014 10:34 AM, Pavel Odintsov wrote:
>> Hello!
>>
>> Your scheme is fine, but you can't divide I/O load with cgroup blkio
>> (ioprio/iolimit/iopslimit) between different folders, while between
>> different ZVOLs you can.
>
> Not true, IO limits are working as they should (if we're talking vzctl
> set --iolimit/--iopslimit). I've kicked the ZoL guys around to add IO
> accounting support, so it is there.
>
>>
>> I could imagine following problems for per folder scheme:
>> 1) Can't limit the number of inodes in different folders (there is no
>> inode limit in ZFS like there is in ext4, but a big number of files in
>> a container could break the node;
>
> How? ZFS doesn't have a limit on number of files (2^48 isn't a limit really)
>
>> http://serverfault.com/questions/503658/can-you-set-inode-quotas-in-zfs)
>> 2) Problems with the system cache, which is used by all containers on the HWN together
>
> This exactly isn't a problem, but a *HUGE* benefit, you'd need to see it
> in practice :) Linux VFS cache is really dumb in comparison to ARC.
> ARC's hitrates just can't be done with what linux currently offers.
>
>> 3) Problems with live migration because you _should_ change inode
>> numbers on different nodes
>
> Why? ZFS send/receive is able to do a bit-by-bit identical copy of the FS,
> I thought the point of migration is for the CT not to notice any
> change, I don't see why the inode numbers should change.
>
>> 4) ZFS behaviour with linux software in some cases is very STRANGE (DIRECT_IO)
>
> How exactly? I haven't seen a problem with any userspace software, other
> than MySQL defaulting to AIO (it falls back to an older method), which
> ZFS doesn't support (*yet*, they have it in their plans).
>
>> 5) ext4 has good support from vzctl (fsck, resize2fs)
>
> Yeah, but ext4 sucks big time. At least in my use-case.
>
> We've implemented most of vzctl create/destroy/etc. functionality in our
> vpsAdmin software instead.
>
> Guys, can I ask you to keep your mind open instead of fighting with
> pointless arguments? :) Give ZFS a try and then decide for yourselves.
>
> I think the community would benefit greatly if ZFS wouldn't be fought as
> something alien in the Linux world, which in my experience is what
> every Linux zealot I talk to about ZFS is doing.
> This is just not fair. It's primarily about technology, primarily about
> the best tool for the job. If we can implement something like this in
> Linux but without having ties to CDDL and possibly Oracle patents, that
> would be awesome, yet nobody has done such a thing yet. BTRFS is nowhere
> near ZFS when it comes to running larger scale deployments and in some
> regards I don't think it will ever match ZFS, just looking at the way
> it's been designed.
>
> I'm not trying to flame here, I'm trying to open you guys to the fact
> that there really is a better alternative than what you're currently seeing.
> And if it has some technological drawbacks like the ones you're trying
> to point out, instead of pointing at them as something that can't be
> changed and thus everyone should use "your best solution(tm)", try to
> think of ways to change it for the better.
>
>>
>> My ideas like simfs vs ploop comparison:
>> http://openvz.org/images/f/f3/Ct_in_a_file.pdf
>
> Again, you have to see ZFS doing its magic in production under a really
> heavy load, otherwise you won't understand. Any arbitrary benchmarks
> I've seen show ZFS is slower than ext4, but these are not tuned for such
> use cases as I'm talking about.
>
> /snajpa
>
>>
>> On Thu, Jul 10, 2014 at 12:06 PM, Pavel Snajdr <lists at snajpa.net> wrote:
>>> On 07/09/2014 06:58 PM, Kir Kolyshkin wrote:
>>>> On 07/08/2014 11:54 PM, Pavel Snajdr wrote:
>>>>> On 07/08/2014 07:52 PM, Scott Dowdle wrote:
>>>>>> Greetings,
>>>>>>
>>>>>> ----- Original Message -----
>>>>>>> (offtopic) We cannot use ZFS. Unfortunately, a NAS with something like
>>>>>>> Nexenta is too expensive for us.
>>>>>> From what I've gathered from a few presentations, ZFS on Linux (http://zfsonlinux.org/) is as stable but more performant than it is on the OpenSolaris forks... so you can build your own if you can spare the people to learn the best practices.
>>>>>>
>>>>>> I don't have a use for ZFS myself so I'm not really advocating it.
>>>>>>
>>>>>> TYL,
>>>>>>
>>>>> Hi all,
>>>>>
>>>>> we run tens of OpenVZ nodes (bigger boxes, 256G RAM, 12+ cores, at least
>>>>> 90 CTs). We used to run ext4+flashcache, but ext4 has proven to be a
>>>>> bottleneck. That was the primary motivation behind ploop as far as I know.
>>>>>
>>>>> We've switched to ZFS on Linux around the time Ploop was announced and I
>>>>> didn't have second thoughts since. ZFS really *is* in my experience the
>>>>> best filesystem there is at the moment for this kind of deployment  -
>>>>> especially if you use dedicated SSDs for ZIL and L2ARC, although the
>>>>> latter is less important. You will know what I'm talking about when you
>>>>> try this on boxes with lots of CTs doing LAMP load - databases and their
>>>>> synchronous writes are the real problem, which ZFS with dedicated ZIL
>>>>> device solves.
>>>>>
>>>>> Also there is the ARC caching, which is smarter than the Linux VFS cache -
>>>>> we're able to achieve about a 99% hit rate about 99% of the time,
>>>>> even under high loads.
>>>>>
>>>>> Having said all that, I recommend everyone to give ZFS a chance, but I'm
>>>>> aware this is yet another out-of-mainline code and that doesn't suit
>>>>> everyone that well.
>>>>>
>>>>
>>>> Are you using per-container ZVOL or something else?
>>>
>>> That would mean I'd need to do another filesystem on top of ZFS, which
>>> would in turn mean I'd add another unnecessary layer of indirection. ZFS
>>> is a pooled storage like BTRFS is, we're giving one dataset to each
>>> container.
>>>
>>> vzctl tries to move the VE_PRIVATE folder around, so we had to add one
>>> more directory to put the VE_PRIVATE data into (see the first ls).
>>>
>>> Example from production:
>>>
>>> [root at node2.prg.vpsfree.cz]
>>>  ~ # zpool status vz
>>>   pool: vz
>>>  state: ONLINE
>>>   scan: scrub repaired 0 in 1h24m with 0 errors on Tue Jul  8 16:22:17 2014
>>> config:
>>>
>>>         NAME        STATE     READ WRITE CKSUM
>>>         vz          ONLINE       0     0     0
>>>           mirror-0  ONLINE       0     0     0
>>>             sda     ONLINE       0     0     0
>>>             sdb     ONLINE       0     0     0
>>>           mirror-1  ONLINE       0     0     0
>>>             sde     ONLINE       0     0     0
>>>             sdf     ONLINE       0     0     0
>>>           mirror-2  ONLINE       0     0     0
>>>             sdg     ONLINE       0     0     0
>>>             sdh     ONLINE       0     0     0
>>>         logs
>>>           mirror-3  ONLINE       0     0     0
>>>             sdc3    ONLINE       0     0     0
>>>             sdd3    ONLINE       0     0     0
>>>         cache
>>>           sdc5      ONLINE       0     0     0
>>>           sdd5      ONLINE       0     0     0
>>>
>>> errors: No known data errors
>>>
>>> [root at node2.prg.vpsfree.cz]
>>>  ~ # zfs list
>>> NAME              USED  AVAIL  REFER  MOUNTPOINT
>>> vz                432G  2.25T    36K  /vz
>>> vz/private        427G  2.25T   111K  /vz/private
>>> vz/private/101   17.7G  42.3G  17.7G  /vz/private/101
>>> <snip>
>>> vz/root           104K  2.25T   104K  /vz/root
>>> vz/template      5.38G  2.25T  5.38G  /vz/template
>>>
>>> [root at node2.prg.vpsfree.cz]
>>>  ~ # zfs get compressratio vz/private/101
>>> NAME            PROPERTY       VALUE  SOURCE
>>> vz/private/101  compressratio  1.38x  -
>>>
>>> [root at node2.prg.vpsfree.cz]
>>>  ~ # ls /vz/private/101
>>> private
>>>
>>> [root at node2.prg.vpsfree.cz]
>>>  ~ # ls /vz/private/101/private/
>>> aquota.group  aquota.user  b  bin  boot  dev  etc  git  home  lib
>>> <snip>
>>>
>>> [root at node2.prg.vpsfree.cz]
>>>  ~ # cat /etc/vz/conf/101.conf | grep -P "PRIVATE|ROOT"
>>> VE_ROOT="/vz/root/101"
>>> VE_PRIVATE="/vz/private/101/private"
>>>
>>>
>>>
>>
>>
>>
>



-- 
Sincerely yours, Pavel Odintsov

