[Users] flashcache
Pavel Odintsov
pavel.odintsov at gmail.com
Thu Jul 10 03:25:52 PDT 2014
Thank you for your answers! It's really useful information.
On Thu, Jul 10, 2014 at 2:08 PM, Pavel Snajdr <lists at snajpa.net> wrote:
> On 07/10/2014 11:35 AM, Pavel Odintsov wrote:
>>> Not true, IO limits are working as they should (if we're talking vzctl
>>> set --iolimit/--iopslimit). I've kicked the ZoL guys around to add IO
>>> accounting support, so it is there.
>>
>> Could you share your tests with us? For standard folders like simfs, these
>> limits work badly in a large number of cases.
>
> If you can give me concrete tests to run, sure, I'm curious to see if
> you're right - then we'd have something concrete to fix :)
>
>>
>>> How? ZFS doesn't have a limit on number of files (2^48 isn't a limit really)
>>
>> Is it OK when your customer creates a billion small files on a 10GB VPS
>> and you then try to archive it for backup? On a slow disk system it's a
>> real nightmare, because the sheer number of disk operations kills your
>> I/O.
>
> zfs snapshot <dataset>@<snapname>
> zfs send <dataset>@<snapname> > your-file
>   # or pipe it straight to the backup box:
> zfs send <dataset>@<snapname> | ssh backuper zfs recv <backupdataset>
>
> That's done at the block level. No need to run rsync anymore; it's a lot
> faster this way.
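>
> For example, something along these lines (the dataset, snapshot and target
> names here are only for illustration):
>
>   # full copy of one CT's dataset to the backup box
>   zfs snapshot vz/private/101@backup-2014-07-10
>   zfs send vz/private/101@backup-2014-07-10 | ssh backuper zfs recv backup/101
>
>   # the next run only sends the blocks changed since the previous snapshot
>   zfs snapshot vz/private/101@backup-2014-07-11
>   zfs send -i @backup-2014-07-10 vz/private/101@backup-2014-07-11 \
>     | ssh backuper zfs recv backup/101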
>
>>
>>> Why? ZFS send/receive is able to make a bit-by-bit identical copy of the
>>> FS. I thought the point of migration is that the CT doesn't notice any
>>> change, so I don't see why the inode numbers should change.
>>
>> Do you really have a working zero-downtime vzmigrate on ZFS?
>
> Nope, vzmigrate isn't zero downtime. Since vzctl/vzmigrate don't support
> ZFS, we're implementing this our own way in vpsAdmin, which in its 2.0
> reimplementation will go open source under the GPL.
>
>>
>>> How exactly? I haven't seen a problem with any userspace software, other
>>> than MySQL defaulting to AIO (it falls back to an older method), which
>>> ZFS doesn't support (*yet*; they have it in their plans).
>>
>> I'm speaking primarily about MySQL. I have thousands of containers, and I
>> can't retune MySQL into another mode for all customers; it's impossible.
>
> As I said, this is under development and will improve.
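>
> Until then, a sketch of the usual workaround on a ZFS-backed node is to
> switch InnoDB away from AIO and O_DIRECT in my.cnf (just these two standard
> settings, not a complete config):
>
>   [mysqld]
>   # ZoL doesn't implement AIO yet, so don't let InnoDB try to use it
>   innodb_use_native_aio = 0
>   # ZFS doesn't support O_DIRECT either, so flush with plain fsync
>   innodb_flush_method = fsync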
>
>>
>>> The L2ARC cache is really smart
>>
>> Yep, fine, I knew that already. But can you account for L2ARC cache usage
>> per customer? OpenVZ can do it via a flag:
>> sysctl -a | grep pagecache_isola
>> ubc.pagecache_isolation = 0
>
> I can't account for caches per CT, but I haven't had any need to do so.
>
> L2ARC != ARC. ARC lives in system RAM; L2ARC is meant to sit on SSD and
> hold the least significant part of the ARC's contents - when memory runs
> low, that data gets pushed from ARC to L2ARC.
>
> ARC has two primary lists of cached data - most frequently used (MFU) and
> most recently used (MRU) - and a boundary between them marks which data can
> be pushed out in a low-memory situation.
>
> Unlike the Linux VFS cache, copying one big file doesn't push out all of
> the other useful data.
>
> Thanks to this distinction between MRU and MFU, ARC achieves far better hit rates.
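>
> You can watch this for yourself on a ZoL box - the ARC counters are
> exported in /proc (a quick sketch, field names as in the 0.6.x releases):
>
>   grep -E '^(hits|misses|mru_hits|mfu_hits) ' /proc/spl/kstat/zfs/arcstats
>   # overall hit rate computed from the same counters
>   awk '/^hits / {h=$3} /^misses / {m=$3} END {print "ARC hit rate:", 100*h/(h+m), "%"}' /proc/spl/kstat/zfs/arcstats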
>
>>
>> But one customer can eat almost all of the L2ARC cache and displace other
>> customers' data.
>
> Yes, but ZFS keeps track of what's being used, so useful data can't be
> pushed out that easily; things naturally balance themselves due to the way
> the ARC mechanism works.
>
>>
>> I'm not against ZFS, but I am against using ZFS as the underlying system
>> for containers. We caught ~100 kernel bugs with simfs on ext4 when
>> customers did strange things.
>
> I haven't encountered any problems, especially with vzquota disabled (no
> need for it - ZFS has its own quotas, which never need to be recalculated
> the way vzquota's do).
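>
> Roughly how that looks with the layout we use (the CT ID and size are only
> an example):
>
>   # the CT's space limit is just a property on its dataset
>   zfs set refquota=60G vz/private/101
>   zfs get used,available,refquota vz/private/101   # instant, nothing to recalculate
>
>   # and vzquota itself is switched off globally in /etc/vz/vz.conf
>   DISK_QUOTA=no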
>
>>
>> But ext4 has a few thousand developers and they fix these issues ASAP,
>> while ZFS on Linux has only 3-5 developers, which is VERY slow. Because of
>> this I recommend either ext4 with ploop, because that solution is rock
>> stable, or ZFS ZVOLs with ext4 on top, because that is more reliable and
>> more predictable than placing containers directly on ZFS filesystems.
>
> ZFS itself is a stable and mature filesystem; it first shipped in
> production with Solaris in 2006. It's still being developed upstream as
> OpenZFS, and that code is shared between the primary version - Illumos -
> and the ports: FreeBSD, OS X, Linux.
>
> So what really needs work, and is still being developed, is the way ZFS
> runs under the Linux kernel, but with the recent 0.6.3 release things have
> become mature enough to be used in production without any fears. Of course,
> no software is without bugs, but I can say with absolute certainty that ZFS
> will never eat your data; the only problem you can encounter is with
> memory management, which is done really differently in Linux than in ZFS's
> original habitat - Solaris.
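>
> The main thing to tune there is a cap on the ARC size, so it doesn't fight
> the containers for RAM - for example (the value is only an example for a
> 256G box):
>
>   # /etc/modprobe.d/zfs.conf
>   options zfs zfs_arc_max=68719476736   # 64 GiB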
>
> /snajpa
>
>>
>>
>> On Thu, Jul 10, 2014 at 1:08 PM, Pavel Snajdr <lists at snajpa.net> wrote:
>>> On 07/10/2014 10:34 AM, Pavel Odintsov wrote:
>>>> Hello!
>>>>
>>>> Your scheme is fine, but you can't divide I/O load with the blkio cgroup
>>>> (ioprio/iolimit/iopslimit) between different folders; between different
>>>> ZVOLs you can.
>>>
>>> Not true, IO limits are working as they should (if we're talking vzctl
>>> set --iolimit/--iopslimit). I've kicked the ZoL guys around to add IO
>>> accounting support, so it is there.
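>>>
>>> E.g. (the limits are just example values):
>>>
>>>   vzctl set 101 --iolimit 20M --iopslimit 500 --save
>>>   cat /proc/bc/101/ioacct   # per-CT I/O accounting on an OpenVZ kernel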
>>>
>>>>
>>>> I could imagine the following problems with the per-folder scheme:
>>>> 1) You can't limit the number of inodes in different folders (there is
>>>> no inode limit in ZFS like there is in ext4, but a huge number of files
>>>> in a container could break the node;
>>>
>>> How? ZFS doesn't have a limit on number of files (2^48 isn't a limit really)
>>>
>>>> http://serverfault.com/questions/503658/can-you-set-inode-quotas-in-zfs)
>>>> 2) Problems with the system cache, which is used by all containers on the HWN together
>>>
>>> This exactly isn't a problem, it's a *HUGE* benefit - you'd need to see it
>>> in practice :) The Linux VFS cache is really dumb in comparison to ARC.
>>> ARC's hit rates just can't be achieved with what Linux currently offers.
>>>
>>>> 3) Problems with live migration, because you _have to_ change inode
>>>> numbers on different nodes
>>>
>>> Why? ZFS send/receive is able to make a bit-by-bit identical copy of the
>>> FS. I thought the point of migration is that the CT doesn't notice any
>>> change, so I don't see why the inode numbers should change.
>>>
>>>> 4) ZFS behaviour with Linux software is very STRANGE in some cases (DIRECT_IO)
>>>
>>> How exactly? I haven't seen a problem with any userspace software, other
>>> than MySQL defaulting to AIO (it falls back to an older method), which
>>> ZFS doesn't support (*yet*; they have it in their plans).
>>>
>>>> 5) ext4 has good support from vzctl (fsck, resize2fs)
>>>
>>> Yeah, but ext4 sucks big time. At least in my use-case.
>>>
>>> We've implemented most of vzctl create/destroy/etc. functionality in our
>>> vpsAdmin software instead.
>>>
>>> Guys, can I ask you to keep your minds open instead of fighting with
>>> pointless arguments? :) Give ZFS a try and then decide for yourselves.
>>>
>>> I think the community would benefit greatly if ZFS weren't fought as
>>> something alien in the Linux world, which, in my experience, is what every
>>> Linux zealot I talk to about ZFS does.
>>> This is just not fair. It's primarily about technology, primarily about
>>> the best tool for the job. If we could implement something like this in
>>> Linux without ties to the CDDL and possibly Oracle patents, that would be
>>> awesome, yet nobody has done such a thing yet. BTRFS is nowhere near ZFS
>>> when it comes to running larger-scale deployments, and in some regards I
>>> don't think it will ever match ZFS, just looking at the way it's been
>>> designed.
>>>
>>> I'm not trying to flame here, I'm trying to open you guys up to the fact
>>> that there really is a better alternative than what you're currently
>>> seeing. And if it has some technological drawbacks like the ones you're
>>> trying to point out, then instead of presenting them as something that
>>> can't be changed - and thus everyone should use "your best solution(tm)" -
>>> try to think of ways to change it for the better.
>>>
>>>>
>>>> My reasoning is along the lines of the simfs vs ploop comparison:
>>>> http://openvz.org/images/f/f3/Ct_in_a_file.pdf
>>>
>>> Again, you have to see ZFS doing its magic in production under a really
>>> heavy load, otherwise you won't understand. The arbitrary benchmarks I've
>>> seen show ZFS as slower than ext4, but they aren't tuned for the kind of
>>> use case I'm talking about.
>>>
>>> /snajpa
>>>
>>>>
>>>> On Thu, Jul 10, 2014 at 12:06 PM, Pavel Snajdr <lists at snajpa.net> wrote:
>>>>> On 07/09/2014 06:58 PM, Kir Kolyshkin wrote:
>>>>>> On 07/08/2014 11:54 PM, Pavel Snajdr wrote:
>>>>>>> On 07/08/2014 07:52 PM, Scott Dowdle wrote:
>>>>>>>> Greetings,
>>>>>>>>
>>>>>>>> ----- Original Message -----
>>>>>>>>> (offtopic) We can not use ZFS. Unfortunately, a NAS with something
>>>>>>>>> like Nexenta is too expensive for us.
>>>>>>>> From what I've gathered from a few presentations, ZFS on Linux (http://zfsonlinux.org/) is as stable as, and more performant than, it is on the OpenSolaris forks... so you can build your own if you can spare the people to learn the best practices.
>>>>>>>>
>>>>>>>> I don't have a use for ZFS myself so I'm not really advocating it.
>>>>>>>>
>>>>>>>> TYL,
>>>>>>>>
>>>>>>> Hi all,
>>>>>>>
>>>>>>> we run tens of OpenVZ nodes (bigger boxes: 256G RAM, 12+ cores, at least
>>>>>>> 90 CTs). We used to run ext4+flashcache, but ext4 proved to be a
>>>>>>> bottleneck. That was the primary motivation behind ploop, as far as I know.
>>>>>>>
>>>>>>> We switched to ZFS on Linux around the time ploop was announced, and I
>>>>>>> haven't had second thoughts since. ZFS really *is*, in my experience, the
>>>>>>> best filesystem there is at the moment for this kind of deployment -
>>>>>>> especially if you use dedicated SSDs for the ZIL and L2ARC, although the
>>>>>>> latter is less important. You will know what I'm talking about when you
>>>>>>> try this on boxes with lots of CTs doing LAMP load - databases and their
>>>>>>> synchronous writes are the real problem, which ZFS with a dedicated ZIL
>>>>>>> device solves.
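>>>>>>>
>>>>>>> Adding those to an existing pool is a one-liner each, roughly (the
>>>>>>> device/partition names here are just an example):
>>>>>>>
>>>>>>>   zpool add vz log mirror sdc3 sdd3   # mirrored SLOG device for the ZIL
>>>>>>>   zpool add vz cache sdc5 sdd5        # L2ARC on SSD partitions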
>>>>>>>
>>>>>>> Also there is the ARC caching, which is smarter than the Linux VFS cache -
>>>>>>> we're able to achieve about a 99% hit rate about 99% of the time, even
>>>>>>> under high load.
>>>>>>>
>>>>>>> Having said all that, I recommend that everyone give ZFS a chance, but I'm
>>>>>>> aware this is yet more out-of-mainline code, and that doesn't suit everyone
>>>>>>> that well.
>>>>>>>
>>>>>>
>>>>>> Are you using per-container ZVOL or something else?
>>>>>
>>>>> That would mean I'd need to run another filesystem on top of ZFS, which
>>>>> would in turn add another unnecessary layer of indirection. ZFS is pooled
>>>>> storage, like BTRFS; we're giving one dataset to each container.
>>>>>
>>>>> vzctl tries to move the VE_PRIVATE folder around, so we had to add one
>>>>> more directory to put the VE_PRIVATE data into (see the first ls).
>>>>>
>>>>> Example from production:
>>>>>
>>>>> [root at node2.prg.vpsfree.cz]
>>>>> ~ # zpool status vz
>>>>> pool: vz
>>>>> state: ONLINE
>>>>> scan: scrub repaired 0 in 1h24m with 0 errors on Tue Jul 8 16:22:17 2014
>>>>> config:
>>>>>
>>>>>         NAME        STATE     READ WRITE CKSUM
>>>>>         vz          ONLINE       0     0     0
>>>>>           mirror-0  ONLINE       0     0     0
>>>>>             sda     ONLINE       0     0     0
>>>>>             sdb     ONLINE       0     0     0
>>>>>           mirror-1  ONLINE       0     0     0
>>>>>             sde     ONLINE       0     0     0
>>>>>             sdf     ONLINE       0     0     0
>>>>>           mirror-2  ONLINE       0     0     0
>>>>>             sdg     ONLINE       0     0     0
>>>>>             sdh     ONLINE       0     0     0
>>>>>         logs
>>>>>           mirror-3  ONLINE       0     0     0
>>>>>             sdc3    ONLINE       0     0     0
>>>>>             sdd3    ONLINE       0     0     0
>>>>>         cache
>>>>>           sdc5      ONLINE       0     0     0
>>>>>           sdd5      ONLINE       0     0     0
>>>>>
>>>>> errors: No known data errors
>>>>>
>>>>> [root at node2.prg.vpsfree.cz]
>>>>> ~ # zfs list
>>>>> NAME             USED  AVAIL  REFER  MOUNTPOINT
>>>>> vz               432G  2.25T    36K  /vz
>>>>> vz/private       427G  2.25T   111K  /vz/private
>>>>> vz/private/101  17.7G  42.3G  17.7G  /vz/private/101
>>>>> <snip>
>>>>> vz/root          104K  2.25T   104K  /vz/root
>>>>> vz/template     5.38G  2.25T  5.38G  /vz/template
>>>>>
>>>>> [root at node2.prg.vpsfree.cz]
>>>>> ~ # zfs get compressratio vz/private/101
>>>>> NAME PROPERTY VALUE SOURCE
>>>>> vz/private/101 compressratio 1.38x -
>>>>>
>>>>> [root at node2.prg.vpsfree.cz]
>>>>> ~ # ls /vz/private/101
>>>>> private
>>>>>
>>>>> [root at node2.prg.vpsfree.cz]
>>>>> ~ # ls /vz/private/101/private/
>>>>> aquota.group aquota.user b bin boot dev etc git home lib
>>>>> <snip>
>>>>>
>>>>> [root at node2.prg.vpsfree.cz]
>>>>> ~ # cat /etc/vz/conf/101.conf | grep -P "PRIVATE|ROOT"
>>>>> VE_ROOT="/vz/root/101"
>>>>> VE_PRIVATE="/vz/private/101/private"
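>>>>>
>>>>> A new CT's dataset gets created along these lines (CT 102 and the
>>>>> properties here are just an example):
>>>>>
>>>>>   zfs create -o compression=lz4 -o atime=off vz/private/102
>>>>>   mkdir /vz/private/102/private   # the extra dir VE_PRIVATE points into
>>>>>   # and in /etc/vz/conf/102.conf:
>>>>>   #   VE_ROOT="/vz/root/102"
>>>>>   #   VE_PRIVATE="/vz/private/102/private"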
>>>>>
>>>>>
--
Sincerely yours, Pavel Odintsov