[Devel] RE: i2o hardware hangs (ASR-2010S)

Mon Aug 14 07:28:47 PDT 2006

Others calls in the driver to shost_for_each_device unlock the host_lock
while in the loop, makes sense to do the same in that loop as well given
that these actions are taken when the adapter is quiesced. I worry,
though, completion of the commands with QUEUE_FULL may result in them
being turned around immediately which could clutter up the list. Could
you experiment with this change:

static void adpt_fail_posted_scbs(adpt_hba* pHba)
{
        struct scsi_cmnd*       cmd = NULL;
        struct scsi_device*     d;

 #if (LINUX_VERSION_CODE >= KERNEL_VERSION(2,5,65))
 # if ((LINUX_VERSION_CODE > KERNEL_VERSION(2,6,0)) ||
defined(shost_for_each_device))
+        spin_unlock(pHba->host->host_lock);
         shost_for_each_device(d, pHba->host) {
 # else
         list_for_each_entry(d, &pHba->host->my_devices, siblings) {
 # endif
                unsigned long flags;
                spin_lock_irqsave(&d->list_lock, flags);
                list_for_each_entry(cmd, &d->cmd_list, list) {
                        if (cmd->serial_number == 0) {
                                continue;
                        }
                        cmd->result = (DID_OK << 16) | (QUEUE_FULL <<
1);
                        cmd->scsi_done(cmd);
                }
                spin_unlock_irqrestore(&d->list_lock, flags);
        }
+# if ((LINUX_VERSION_CODE > KERNEL_VERSION(2,6,0)) ||
defined(shost_for_each_device))
+       spin_lock(pHba->host->host_lock);
+# endif
 #else
        d = pHba->host->host_queue;

Sincerely -- Mark Salyzyn

> -----Original Message-----
> From: Vasily Averin [mailto:vvs at sw.ru] 
> Sent: Monday, August 14, 2006 10:02 AM
> To: Salyzyn, Mark
> Cc: Markus Lidel; devel at openvz.org
> Subject: Re: i2o hardware hangs (ASR-2010S)
> 
> 
> Hello Mark,
> 
> I've tested your driver and unfortunately found bug in scsi 
> host reset handler:
> 
> adpt_reset (on kernels <= KERNEL_VERSION(2,6,12) it called 
> with host_lock taken)
>  adpt_hba_reset
>   adpt_fail_posted_scbs
>    shost_for_each_device
>     __scsi_iterate_devices
>      spin_lock_irqsave(shost->host_lock, flags); <<<<< deadlock
> 
> Also I've noticed that adpt_hba_reset() can be called also 
> from adpt_ioctl() and
> it have taken host_lock too on the kernel >= KERNEL_VERSION(2,5,65).
> 
> However currently I do not understand how to fix this issue correctly.
> 
> Thank you,
> 	Vasily Averin
> 
> Salyzyn, Mark wrote:
> > I had sent you the driver source in a previous email, I am 
> sending it
> > again. Please keep me in the loop since latest model 
> kernels (we have
> > customers that confirm 2.6.16) may require changes in the driver to
> > compile.
> > 
> > Since the kernel.org policy is to focus on the i2o driver 
> being beefed
> > up, no patches or changes are accepted for the dpt_i2o 
> driver into the
> > kernel. Sad that we had just finished a stint beefing up the dpt_i2o
> > driver just before that decision was made ...
> > 
> > The comments about error recovery were meant as a starting point, it
> > looks like Markus will have the final say.
> > 
> > As for the timeouts, I referred to DASD (Disk) targets. 3 minute for
> > RAID devices in a rolling timeout  is used to deal with 
> situations that
> > require a complete spin up of all component drives, or to deal with
> > worst case error recovery scenarios. Individual DASD targets, on the
> > other hand, should report back within 30 seconds for I/O. None DASD
> > targets are all direct, and thus should respect any 
> timeouts set by the
> > system (if any).
> > 
> > Sincerely -- Mark Salyzyn
> > 
> >>-----Original Message-----
> >>From: Vasily Averin [mailto:vvs at sw.ru] 
> >>Sent: Tuesday, August 08, 2006 5:48 AM
> >>To: Salyzyn, Mark
> >>Cc: Markus Lidel; devel at openvz.org
> >>Subject: Re: i2o hardware hangs (ASR-2010S)
> >>
> >>
> >>Mark,
> >>
> >>Salyzyn, Mark wrote:
> >>>Vasily, it will necessarily be up to you as to whether you 
> switch to
> >>>dpt_i2o to get the hardening you require today, or work out 
> >>a deal with
> >>>Markus to add timeout/reset functionality to the i2o driver.
> >>Of course, you are right. Currently our customers have bad 2 
> >>alternatives:
> >>- be tolerate to these hangs
> >>- if they can't bear it -- replace i2o hardware
> >>
> >>Therefore first at all I'm going to add third possible 
> >>alternative, dpt_i2o driver.
> >>
> >>Mark, could you please send me latest version of your driver 
> >>directly? Or can I
> >>probably take it from mainstream?
> >>
> >>The next task is help Markus in i2o error/reset handler 
> >>implementation.
> >>
> >>>My recommendations for the i2o driver reset procedure is to use a
> >>>rolling timeout, every new command completion resets the 
> >>global timer.
> >>>This will allow starved or long commands to process. Once 
> >>the timer hits
> >>>3 minutes for RAID (Block or SCSI) targets that have multiple
> >>>inheritances, 30 seconds for SCSI DASD targets, or some 
> >>insmod tunable,
> >>>it resets the adapter. I recommend that when we hit ten 
> >>seconds, or some
> >>>insmod tunable, that we call a card specific health check 
> >>routine. I do
> >>>not recommend health check polling because we have noticed 
> >>a reduction
> >>>in Adapter performance in some systems and generic i2o cards would
> >>>require a command to check, so that is why I tie it to the 
> >>ten seconds
> >>>past last completion. For the DPT/Adaptec series of 
> >>adapters, it checks
> >>>the BlinkLED status (code fragment in dpt_i2o driver at
> >>>adpt_read_blink_led), and if set, immediately record the 
> >>fact and resets
> >>>the adapter. For cards other than the DPT/Adaptec series, I 
> >>recommend a
> >>>short timeout Get Status request to see if the Firmware is in a run
> >>>state and is responsive to this simple command. The reset 
> >>code will need
> >>>to retry all commands itself, I do not believe the block 
> >>system has an
> >>>error status that can be used for it to retry the commands. 
> >>If the Reset
> >>>Iop in the reset adapter code is unresponsive, then the 
> >>known targets
> >>>need to be placed offline.
> >>Sorry, I do not have your big experience in scsi and do not 
> >>know nothing in i2o.
> >>However are you sure than 3 min is enough for timeout? As far 
> >>as I know some
> >>scsi commands (for example rewind on tapes) can last during a 
> >>very long time.
> >>
> >>Also I have some other questions but currently I'm not fell 
> >>that I'm ready for
> >>this discussion.
> >>
> >>Thank you,
> >>	Vasily Averin
> >>
> >>SWsoft Virtuozzo/OpenVZ Linux kernel team
> > 
> 
>