[Devel] double faults in Virtuozzo KVM

Roman Kagan rkagan at virtuozzo.com
Fri Sep 29 10:04:43 MSK 2017


On Fri, Sep 29, 2017 at 12:02:37AM +0300, Denis Kirjanov wrote:
> On Thursday, September 28, 2017, Roman Kagan <rkagan at virtuozzo.com> wrote:
> > On Thu, Sep 28, 2017 at 05:55:51PM +0300, Denis Kirjanov wrote:
> > > Hi, we're seeing double faults in async_page_fault.
> >
> > async_page_fault is the #PF handler in KVM guests.  It filters out
> > specially crafted #PF's from the host; the rest fall through to the
> > regular #PF handler.  So most likely you're seeing genuine #PFs,
> > unrelated to virtualization.
> >
> > > _Some_ of them related to the fact that during the faults RSP points
> > > to userspace and it leads to double-fault scenario.
> >
> > The postmortem you quote doesn't support that.
> 
> 
> I'll post a relevant trace
> 
> >
> > > Is it known problem?
> >
> > There used to be a bug in async pagefault machinery which caused L0
> > hypervisor to inject async pagefaults into L2 guest instead of L1.  This
> > must've been fixed in sufficiently recent kernels.
> 
> 
> Yep, I saw the patch, and IMHO it's about a different thing.  The patch
> fixes a wrong #PF injected into an unrelated guest, which leaves that
> guest stuck with 'CPU stuck' messages since it can't get the requested page

Not quite.

The idea of async_pf is that when a guest task hits a page which is
present in the guest page tables but absent in the hypervisor ones, the
hypervisor, instead of descheduling the whole vCPU thread until the
fault is resolved, injects a specially crafted #PF into the guest so
that the guest can deschedule that task and put it on a waiting list,
but otherwise continue working.  Once the fault is resolved in the
hypervisor, it injects another #PF matching the first one, and the guest
looks up the task and resumes it.  The bug was that those special #PF's
were occasionally injected into the L2 guest instead of L1.  If the
guest received the first kind of async_pf but not the second, the task
would remain stuck forever.  If, vice versa, the first one was missing,
the second one wouldn't match any suspended task and would be considered
a regular #PF by the guest kernel, so an arbitrary task would receive a
bogus #PF.

Anyway, every #PF in Linux guests, including genuine guest ones, goes
through async_page_fault, so its presence in the stack traces is
expected.

And that bug is only relevant in the presence of nested KVM, and it was
fixed in vzlinux.

> > I'd guess the problem is with your kernel.  Doesn't it reproduce on bare
> > metal?

I still stand by this.  What guest kernel are you using?  What are your
reasons to blame the hypervisor and not the kernel?

Roman.

