[Devel] Re: [Bugme-new] [Bug 14150] New: BUG: soft lockup - CPU#3 stuck for 61s!, while running cpu controller latency testcase on two containers parallaly

Rishikesh risrajak at linux.vnet.ibm.com
Thu Sep 17 23:54:31 PDT 2009


Rishikesh wrote:
> Dhaval Giani wrote:
>> On Fri, Sep 11, 2009 at 02:28:13PM -0700, Andrew Morton wrote:
>>  
>>> (switched to email.  Please respond via emailed reply-to-all, not 
>>> via the
>>> bugzilla web interface).
>>>
>>> On Thu, 10 Sep 2009 09:32:30 GMT
>>> bugzilla-daemon at bugzilla.kernel.org wrote:
>>>
>>>    
>>>> http://bugzilla.kernel.org/show_bug.cgi?id=14150
>>>>
>>>>            Summary: BUG: soft lockup - CPU#3 stuck for 61s!, while 
>>>> running
>>>>                     cpu controller latency testcase on two containers
>>>>                     parallaly
>>>>            Product: Process Management
>>>>            Version: 2.5
>>>>     Kernel Version: 2.6.31-rc7
>>>>           Platform: All
>>>>         OS/Version: Linux
>>>>               Tree: Mainline
>>>>             Status: NEW
>>>>           Severity: high
>>>>           Priority: P1
>>>>          Component: Scheduler
>>>>         AssignedTo: mingo at elte.hu
>>>>         ReportedBy: risrajak at linux.vnet.ibm.com
>>>>                 CC: serue at us.ibm.com, iranna.ankad at in.ibm.com,
>>>>                     risrajak at in.ibm.com
>>>>         Regression: No
>>>>
>>>>
>>>> Created an attachment (id=23055)
>>>>  --> (http://bugzilla.kernel.org/attachment.cgi?id=23055)
>>>> Config-file-used
>>>>
>>>> Hitting this soft lock issue while running this scenario on 
>>>> 2.6.31-rc7 kernel
>>>> on SystemX 32 bit on multiple machines.
>>>>
>>>> Scenario:
>>>>     - While running cpu controller latency testcase from LTP same 
>>>> time on two
>>>> containers.
>>>>
>>>> Steps:
>>>> 1. Compile ltp-full-20090731.tgz on host.
>>>> 2. Create two container (Used lxc tool
>>>> (http://sourceforge.net/projects/lxc/lxc-0.6.3.tar.gz) for creating 
>>>> container )
>>>> e.g:
>>>>     lxc-create -n foo1
>>>>     lxc-create -n foo2
>>>> On first shell:
>>>>     lxc-execute -n foo1 -f /usr/etc/lxc/lxc-macvlan.conf /bin/bash
>>>> on Second shell:
>>>>     lxc-execute -n foo2 -f /usr/etc/lxc/lxc-macvlan.conf /bin/bash
>>>>
>>>> 3. Either you run cpu_latency testcase alone or run "./runltp -f 
>>>> controllers"
>>>> at same time on both the containers.
>>>> 4. After testcase execution completes, you can see this message in 
>>>> dmesg.
>>>>
>>>> Expected Result:
>>>>     - Should not reproduce soft lock up issue.
>>>> - This reproduces 3 times out of 5 tries.
>>>>
>>>> hrtimer: interrupt too slow, forcing clock min delta to 5843235 ns
>>>> hrtimer: interrupt too slow, forcing clock min delta to 5842476 ns
>>>> Clocksource tsc unstable (delta = 18749057581 ns)
>>>> BUG: soft lockup - CPU#3 stuck for 61s! [cpuctl_latency_:17174]
>>>> Modules linked in: bridge stp llc bnep sco l2cap bluetooth sunrpc ipv6
>>>> p4_clockmod dm_multipath uinput qla2xxx ata_generic pata_acpi 
>>>> usb_storage e1000
>>>> scsi_transport_fc joydev scsi_tgt i2c_piix4 pata_serverworks pcspkr 
>>>> serio_raw
>>>> mptspi mptscsih mptbase scsi_transport_spi radeon ttm drm 
>>>> i2c_algo_bit i2c_core
>>>> [last unloaded: scsi_wait_scan]
>>>>
>>>> Pid: 17174, comm: cpuctl_latency_ Tainted: G        W  (2.6.31-rc7 
>>>> #1) IBM
>>>> eServer BladeCenter HS40 -[883961X]-                    EIP: 
>>>> 0060:[<c058aded>] EFLAGS: 00000283 CPU: 3
>>>> EIP is at find_next_bit+0x9/0x79
>>>> EAX: c2c437a0 EBX: f3d433c0 ECX: 00000000 EDX: 00000020
>>>> ESI: c2c436bc EDI: 00000000 EBP: f063be6c ESP: f063be64
>>>>  DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
>>>> CR0: 80050033 CR2: 008765a4 CR3: 314d7000 CR4: 000006d0
>>>> DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
>>>> DR6: ffff0ff0 DR7: 00000400
>>>> Call Trace:
>>>>  [<c0427b6e>] cpumask_next+0x17/0x19
>>>>  [<c042c28d>] tg_shares_up+0x53/0x149
>>>>  [<c0424082>] ? tg_nop+0x0/0xc
>>>>  [<c0424082>] ? tg_nop+0x0/0xc
>>>>  [<c042406e>] walk_tg_tree+0x63/0x77
>>>>  [<c042c23a>] ? tg_shares_up+0x0/0x149
>>>>  [<c042e836>] update_shares+0x5d/0x65
>>>>  [<c0432af3>] rebalance_domains+0x114/0x460
>>>>  [<c0403393>] ? restore_all_notrace+0x0/0x18
>>>>  [<c0432e75>] run_rebalance_domains+0x36/0xa3
>>>>  [<c043c324>] __do_softirq+0xbc/0x173
>>>>  [<c043c416>] do_softirq+0x3b/0x5f
>>>>  [<c043c52d>] irq_exit+0x3a/0x68
>>>>  [<c0417846>] smp_apic_timer_interrupt+0x6d/0x7b
>>>>  [<c0403c9b>] apic_timer_interrupt+0x2f/0x34
>>>> BUG: soft lockup - CPU#2 stuck for 61s! [watchdog/2:11]
>>>> Modules linked in: bridge stp llc bnep sco l2cap bluetooth sunrpc ipv6
>>>> p4_clockmod dm_multipath uinput qla2xxx ata_generic pata_acpi 
>>>> usb_storage e1000
>>>> scsi_transport_fc joydev scsi_tgt i2c_piix4 pata_serverworks pcspkr 
>>>> serio_raw
>>>> mptspi mptscsih mptbase scsi_transport_spi radeon ttm drm 
>>>> i2c_algo_bit i2c_core
>>>> [last unloaded: scsi_wait_scan]
>>>>
>>>> Pid: 11, comm: watchdog/2 Tainted: G        W  (2.6.31-rc7 #1) IBM 
>>>> eServer
>>>> BladeCenter HS40 -[883961X]-                    EIP: 
>>>> 0060:[<c042c313>] EFLAGS: 00000246 CPU: 2
>>>> EIP is at tg_shares_up+0xd9/0x149
>>>> EAX: 00000000 EBX: f09b3c00 ECX: f0baac00 EDX: 00000100
>>>> ESI: 00000002 EDI: 00000400 EBP: f6cb7de0 ESP: f6cb7db8
>>>>  DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
>>>> CR0: 8005003b CR2: 08070680 CR3: 009c8000 CR4: 000006d0
>>>> DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
>>>> DR6: ffff0ff0 DR7: 00000400
>>>> Call Trace:
>>>>  [<c0424082>] ? tg_nop+0x0/0xc
>>>>  [<c0424082>] ? tg_nop+0x0/0xc
>>>>  [<c042406e>] walk_tg_tree+0x63/0x77
>>>>  [<c042c23a>] ? tg_shares_up+0x0/0x149
>>>>  [<c042e836>] update_shares+0x5d/0x65
>>>>  [<c0432af3>] rebalance_domains+0x114/0x460
>>>>  [<c0432e75>] run_rebalance_domains+0x36/0xa3
>>>>  [<c043c324>] __do_softirq+0xbc/0x173
>>>>  [<c043c416>] do_softirq+0x3b/0x5f
>>>>  [<c043c52d>] irq_exit+0x3a/0x68
>>>>  [<c0417846>] smp_apic_timer_interrupt+0x6d/0x7b
>>>>  [<c0403c9b>] apic_timer_interrupt+0x2f/0x34
>>>>  [<c0430d37>] ? finish_task_switch+0x5d/0xc4
>>>>  [<c0744b11>] schedule+0x74c/0x7b2
>>>>  [<c0590e28>] ? trace_hardirqs_on_thunk+0xc/0x10
>>>>  [<c0403393>] ? restore_all_notrace+0x0/0x18
>>>>  [<c0471e19>] ? watchdog+0x0/0x79
>>>>  [<c0471e19>] ? watchdog+0x0/0x79
>>>>  [<c0471e63>] watchdog+0x4a/0x79
>>>>  [<c0449a53>] kthread+0x70/0x75
>>>>  [<c04499e3>] ? kthread+0x0/0x75
>>>>  [<c0403e93>] kernel_thread_helper+0x7/0x10
>>>> [root at hs40 ltp-full-20090731]# uname -a
>>>> Linux hs40.in.ibm.com 2.6.31-rc7 #1 SMP Thu Sep 3 10:14:41 IST 2009 
>>>> i686 i686
>>>> i386 GNU/Linux
>>>> [root at hs40 ltp-full-20090731]#
>>>>
>>>>       
>>
>> We have been unable to reproduce it on current -tip. Rishi, are you able
>> to reproduce it on -tip?
>>
>> thanks,
>>   
> I am not able to create container with lxc on -tip kernel with config 
> file attached. As soon as i am executing "lxc-execute ..." it hangs 
> and only way to recover is to hard reboot system.

enabled the nmi_watchdog on tip kernel and found this call trace on 
system while executing "lxc-execute ..." command ( creating container).

x335a.in.ibm.com login: BUG: NMI Watchdog detected LOCKUP on CPU1, ip 
c075f208, registers:
Modules linked in: nfs lockd nfs_acl auth_rpcgss bridge stp llc bnep sco 
l2cap bluetooth autofs4 sunrpc ipv6 p4_clockmod dm_multipath uinput 
ata_generic pata_acpi pata_serverworks i2c_piix4 floppy tg3 i2c_core 
pcspkr serio_raw mptspi mptscsih mptbase scsi_transport_spi [last 
unloaded: scsi_wait_scan]

Pid: 0, comm: swapper Not tainted (2.6.31-tip #2) eserver xSeries 335 
-[867641X]-
EIP: 0060:[<c075f208>] EFLAGS: 00000097 CPU: 1
EIP is at _spin_lock_irqsave+0x2b/0x39
EAX: 00002625 EBX: c3128e00 ECX: 00000000 EDX: 00000200
ESI: 00000086 EDI: 00000400 EBP: f70a9d58 ESP: f70a9d50
 DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
Process swapper (pid: 0, ti=f70a8000 task=f7086780 task.ti=f70a8000)
Stack:
 f6461a80 00000001 f70a9d94 c043335a 00000200 00000400 c31243d8 c3124350
<0> c31244c0 c3128e00 c09f3e00 00000086 00000001 00000800 f6461a80 c31243d8
<0> c042c0d4 f70a9dac c042c0c0 c0433226 c31243d8 000b71b0 00000000 f70a9dd0
Call Trace:
 [<c043335a>] ? tg_shares_up+0x134/0x1bb
 [<c042c0d4>] ? tg_nop+0x0/0xc
 [<c042c0c0>] ? walk_tg_tree+0x63/0x77
 [<c0433226>] ? tg_shares_up+0x0/0x1bb
 [<c04328f4>] ? update_shares+0x69/0x71
 [<c04335ec>] ? select_task_rq_fair+0x153/0x5dc
 [<c043b905>] ? try_to_wake_up+0x7f/0x28a
 [<c043bb20>] ? default_wake_function+0x10/0x12
 [<c042ce2d>] ? __wake_up_common+0x37/0x5e
 [<c0430d62>] ? complete+0x30/0x43
 [<c0453281>] ? wakeme_after_rcu+0x10/0x12
 [<c0482997>] ? __rcu_process_callbacks+0x141/0x201
 [<c0482a7c>] ? rcu_process_callbacks+0x25/0x44
 [<c0445167>] ? __do_softirq+0xbc/0x173
 [<c0445259>] ? do_softirq+0x3b/0x5f
 [<c0445371>] ? irq_exit+0x3a/0x6d
 [<c041dd82>] ? smp_apic_timer_interrupt+0x6d/0x7b
 [<c0408d7b>] ? apic_timer_interrupt+0x2f/0x34
 [<c0425854>] ? native_safe_halt+0xa/0xc
 [<c040e63b>] ? default_idle+0x4a/0x7c
 [<c0460b8d>] ? tick_nohz_restart_sched_tick+0x115/0x123
 [<c0407637>] ? cpu_idle+0x58/0x79
 [<c075a578>] ? start_secondary+0x19c/0x1a1
Code: 55 89 e5 56 53 0f 1f 44 00 00 89 c3 9c 58 8d 74 26 00 89 c6 fa 90 
8d 74 26 00 e8 96 2b d3 ff b8 00 01 00 00 f0 66 0f c1 03 38 e0 <74> 06 
f3 90 8a 03 eb f6 89 f0 5b 5e 5d c3 55 89 e5 53 0f 1f 44
---[ end trace 8a14be6828557ade ]---
Kernel panic - not syncing: Non maskable interrupt
Pid: 0, comm: swapper Tainted: G      D    2.6.31-tip #2
Call Trace:
 [<c075d4b7>] ? printk+0x14/0x1d
 [<c075d3fc>] panic+0x3e/0xe5
 [<c075fe2a>] die_nmi+0x86/0xd8
 [<c0760324>] nmi_watchdog_tick+0x107/0x16e
 [<c075f8bf>] do_nmi+0xa3/0x27f
 [<c075f485>] nmi_stack_correct+0x28/0x2d
 [<c075f208>] ? _spin_lock_irqsave+0x2b/0x39
 [<c043335a>] tg_shares_up+0x134/0x1bb
 [<c042c0d4>] ? tg_nop+0x0/0xc
 [<c042c0c0>] walk_tg_tree+0x63/0x77
 [<c0433226>] ? tg_shares_up+0x0/0x1bb
 [<c04328f4>] update_shares+0x69/0x71
 [<c04335ec>] select_task_rq_fair+0x153/0x5dc
 [<c043b905>] try_to_wake_up+0x7f/0x28a
 [<c043bb20>] default_wake_function+0x10/0x12
 [<c042ce2d>] __wake_up_common+0x37/0x5e
 [<c0430d62>] complete+0x30/0x43
 [<c0453281>] wakeme_after_rcu+0x10/0x12
 [<c0482997>] __rcu_process_callbacks+0x141/0x201
 [<c0482a7c>] rcu_process_callbacks+0x25/0x44
 [<c0445167>] __do_softirq+0xbc/0x173
 [<c0445259>] do_softirq+0x3b/0x5f
 [<c0445371>] irq_exit+0x3a/0x6d
 [<c041dd82>] smp_apic_timer_interrupt+0x6d/0x7b
 [<c0408d7b>] apic_timer_interrupt+0x2f/0x34
 [<c0425854>] ? native_safe_halt+0xa/0xc
 [<c040e63b>] default_idle+0x4a/0x7c
 [<c0460b8d>] ? tick_nohz_restart_sched_tick+0x115/0x123
 [<c0407637>] cpu_idle+0x58/0x79
 [<c075a578>] start_secondary+0x19c/0x1a1
Rebooting in 1 seconds..


> I am not sure about tip but i am able to create the problem pretty 
> easily on 2.6.31-rc7 with that config file.
>
> Only the changes i have done in the config file from (2.6.31-rc7) is :
>    - Disabled KVM as it was giving me error on -tip kernel.
>    - Applied following patch : 
> http://www.gossamer-threads.com/lists/linux/kernel/1129527
>
> Please let me know if you are able to recreate it on -tip with 
> following config.
> ------------------------------------------------------------------------
>
> _______________________________________________
> Containers mailing list
> Containers at lists.linux-foundation.org
> https://lists.linux-foundation.org/mailman/listinfo/containers

_______________________________________________
Containers mailing list
Containers at lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers




More information about the Devel mailing list