[Users] flashcache
Pavel Odintsov
pavel.odintsov at gmail.com
Thu Jul 10 01:34:40 PDT 2014
Hello!
Your scheme is fine, but with the cgroup blkio controller
(ioprio/iolimit/iopslimit) you cannot divide I/O load between different
folders; between different ZVOLs you can.
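For illustration, a minimal sketch of what per-ZVOL throttling can look like
with the cgroup v1 blkio controller (the ZVOL name zd16, the 230:16 numbers,
the cgroup path and the PID below are assumptions, not taken from this thread):

  # A ZVOL shows up as a real block device (/dev/zdN), so it has a
  # major:minor number that the blkio.throttle files accept; a plain
  # per-folder dataset does not.
  ls -l /dev/zvol/vz/ct101           # -> ../../zd16 (assumed name)
  lsblk -d -n -o MAJ:MIN /dev/zd16   # -> 230:16 (assumed numbers)

  mkdir -p /sys/fs/cgroup/blkio/ct101
  # cap reads from that ZVOL at ~50 MB/s for processes in the group
  echo "230:16 52428800" > /sys/fs/cgroup/blkio/ct101/blkio.throttle.read_bps_device
  # move the container's init process into the group (PID is illustrative)
  echo 12345 > /sys/fs/cgroup/blkio/ct101/tasks

A per-folder dataset exposes no block device, so there is nothing to point
these throttle files at - which is exactly the limitation above.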
I can imagine the following problems with a per-folder scheme:
1) You can't limit the number of inodes in different folders (ZFS has no
inode limit the way ext4 does, but a huge number of files in a container
could still break the node; see the sketch after this list and
http://serverfault.com/questions/503658/can-you-set-inode-quotas-in-zfs)
2) Problems with the system cache, which is shared by all containers on the hardware node (HWN)
3) Problems with live migration, because inode numbers _will_ change
between different nodes
4) In some cases ZFS behaviour with Linux software is very strange (e.g. direct I/O / O_DIRECT)
5) ext4 has good support from vzctl (fsck, resize2fs)
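As a rough sketch for point 1: ZFS can cap a dataset's disk space, but not its
inode/file count (the dataset name matches the layout quoted below; the sizes
are made up):

  # per-dataset space quotas exist...
  zfs set quota=60G vz/private/101      # includes descendants and snapshots
  zfs set refquota=60G vz/private/101   # counts only data the dataset itself references
  zfs get quota,refquota vz/private/101

  # ...but there is no inode quota; you can only observe the usage
  df -i /vz/private/101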
My reasoning is similar to the simfs vs. ploop comparison:
http://openvz.org/images/f/f3/Ct_in_a_file.pdf
On Thu, Jul 10, 2014 at 12:06 PM, Pavel Snajdr <lists at snajpa.net> wrote:
> On 07/09/2014 06:58 PM, Kir Kolyshkin wrote:
>> On 07/08/2014 11:54 PM, Pavel Snajdr wrote:
>>> On 07/08/2014 07:52 PM, Scott Dowdle wrote:
>>>> Greetings,
>>>>
>>>> ----- Original Message -----
>>>>> (offtopic) We can not use ZFS. Unfortunately, NAS with something like
>>>>> Nexenta is to expensive for us.
>>>> From what I've gathered from a few presentations, ZFS on Linux (http://zfsonlinux.org/) is as stable as, and more performant than, it is on the OpenSolaris forks... so you can build your own if you can spare the people to learn the best practices.
>>>>
>>>> I don't have a use for ZFS myself so I'm not really advocating it.
>>>>
>>>> TYL,
>>>>
>>> Hi all,
>>>
>>> we run tens of OpenVZ nodes (bigger boxes: 256G RAM, 12+ cores, at least
>>> 90 CTs each). We used to run ext4+flashcache, but ext4 has proven to be a
>>> bottleneck. That was the primary motivation behind ploop, as far as I know.
>>>
>>> We switched to ZFS on Linux around the time ploop was announced and I
>>> haven't had second thoughts since. ZFS really *is* in my experience the
>>> best filesystem there is at the moment for this kind of deployment -
>>> especially if you use dedicated SSDs for ZIL and L2ARC, although the
>>> latter is less important. You will know what I'm talking about when you
>>> try this on boxes with lots of CTs doing LAMP load - databases and their
>>> synchronous writes are the real problem, which ZFS with a dedicated ZIL
>>> device solves.
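A minimal sketch of how such dedicated log/cache devices are attached to an
existing pool; the device names below match the logs/cache sections of the
zpool status output quoted further down:

  # mirrored ZIL (slog) for synchronous writes, on two SSD partitions
  zpool add vz log mirror sdc3 sdd3
  # L2ARC read cache on two more SSD partitions (cache vdevs are not mirrored)
  zpool add vz cache sdc5 sdd5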
>>>
>>> Also there is the ARC cache, which is smarter than the Linux VFS cache -
>>> we're able to achieve about a 99% hit rate about 99% of the time,
>>> even under high loads.
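One way to verify such hit rates on ZFS on Linux is to read the ARC kstats
directly; a minimal sketch using the standard /proc/spl/kstat/zfs/arcstats file:

  # hit rate = hits / (hits + misses), from the cumulative ARC counters
  awk '$1 == "hits" || $1 == "misses" { v[$1] = $3 }
       END { printf "ARC hit rate: %.1f%%\n", 100 * v["hits"] / (v["hits"] + v["misses"]) }' \
      /proc/spl/kstat/zfs/arcstats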
>>>
>>> Having said all that, I recommend everyone give ZFS a chance, but I'm
>>> aware this is yet more out-of-mainline code and that doesn't suit
>>> everyone that well.
>>>
>>
>> Are you using per-container ZVOL or something else?
>
> That would mean I'd need to put another filesystem on top of ZFS, which
> would in turn add another unnecessary layer of indirection. ZFS is pooled
> storage, like BTRFS; we give one dataset to each container.
>
> vzctl tries to move the VE_PRIVATE folder around, so we had to add one
> more directory to put the VE_PRIVATE data into (see the first ls below).
>
> Example from production:
>
> [root at node2.prg.vpsfree.cz]
> ~ # zpool status vz
>   pool: vz
>  state: ONLINE
>   scan: scrub repaired 0 in 1h24m with 0 errors on Tue Jul 8 16:22:17 2014
> config:
>
>         NAME        STATE     READ WRITE CKSUM
>         vz          ONLINE       0     0     0
>           mirror-0  ONLINE       0     0     0
>             sda     ONLINE       0     0     0
>             sdb     ONLINE       0     0     0
>           mirror-1  ONLINE       0     0     0
>             sde     ONLINE       0     0     0
>             sdf     ONLINE       0     0     0
>           mirror-2  ONLINE       0     0     0
>             sdg     ONLINE       0     0     0
>             sdh     ONLINE       0     0     0
>         logs
>           mirror-3  ONLINE       0     0     0
>             sdc3    ONLINE       0     0     0
>             sdd3    ONLINE       0     0     0
>         cache
>           sdc5      ONLINE       0     0     0
>           sdd5      ONLINE       0     0     0
>
> errors: No known data errors
>
> [root at node2.prg.vpsfree.cz]
> ~ # zfs list
> NAME             USED  AVAIL  REFER  MOUNTPOINT
> vz               432G  2.25T    36K  /vz
> vz/private       427G  2.25T   111K  /vz/private
> vz/private/101  17.7G  42.3G  17.7G  /vz/private/101
> <snip>
> vz/root          104K  2.25T   104K  /vz/root
> vz/template     5.38G  2.25T  5.38G  /vz/template
>
> [root at node2.prg.vpsfree.cz]
> ~ # zfs get compressratio vz/private/101
> NAME            PROPERTY       VALUE  SOURCE
> vz/private/101  compressratio  1.38x  -
>
> [root at node2.prg.vpsfree.cz]
> ~ # ls /vz/private/101
> private
>
> [root at node2.prg.vpsfree.cz]
> ~ # ls /vz/private/101/private/
> aquota.group aquota.user b bin boot dev etc git home lib
> <snip>
>
> [root at node2.prg.vpsfree.cz]
> ~ # cat /etc/vz/conf/101.conf | grep -P "PRIVATE|ROOT"
> VE_ROOT="/vz/root/101"
> VE_PRIVATE="/vz/private/101/private"
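Putting the quoted example together, provisioning a dataset for a new
container in this layout could look roughly like this (CTID 102, the lz4
compression and the 60G refquota are assumptions, not taken from the thread):

  zfs create vz/private/102                # one dataset per container
  zfs set compression=lz4 vz/private/102   # assumed; the compressratio above implies compression is on
  zfs set refquota=60G vz/private/102      # assumed size
  mkdir /vz/private/102/private            # extra level so vzctl can move VE_PRIVATE around
  mkdir -p /vz/root/102
  # and in 102.conf:
  #   VE_ROOT="/vz/root/102"
  #   VE_PRIVATE="/vz/private/102/private"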
>
>
--
Sincerely yours, Pavel Odintsov