from calling __do_IRQ with a NULL pt_regs: Unable to handle kernel NULL pointer dereference (address 0000000000000040) Call Trace: [] show_stack+0x50/0xa0 sp=e0000009c922fa00 bsp=e0000009c92214d0 [] show_regs+0x820/0x860 sp=e0000009c922fbd0 bsp=e0000009c9221478 [] die+0x1a0/0x2e0 sp=e0000009c922fbd0 bsp=e0000009c9221438 [] ia64_do_page_fault+0x950/0xa80 sp=e0000009c922fbd0 bsp=e0000009c92213d8 [] ia64_native_leave_kernel+0x0/0x270 sp=e0000009c922fc60 bsp=e0000009c92213d8 [] profile_tick+0xd0/0x1c0 sp=e0000009c922fe30 bsp=e0000009c9221398 [] timer_interrupt+0x170/0x3e0 sp=e0000009c922fe30 bsp=e0000009c9221330 [] handle_IRQ_event+0x80/0x120 sp=e0000009c922fe30 bsp=e0000009c92212f8 [] __do_IRQ+0x160/0x4a0 sp=e0000009c922fe30 bsp=e0000009c9221290 [] ia64_process_pending_intr+0x2b0/0x360 sp=e0000009c922fe30 bsp=e0000009c9221208 [] fixup_irqs+0xf0/0x2a0 sp=e0000009c922fe30 bsp=e0000009c92211a8 [] __cpu_disable+0x140/0x240 sp=e0000009c922fe30 bsp=e0000009c9221168 [] take_cpu_down+0x50/0xa0 sp=e0000009c922fe30 bsp=e0000009c9221148 [] stop_cpu+0xd0/0x200 sp=e0000009c922fe30 bsp=e0000009c92210f0 [] kthread+0xc0/0x140 sp=e0000009c922fe30 bsp=e0000009c92210c8 [] kernel_thread_helper+0xd0/0x100 sp=e0000009c922fe30 bsp=e0000009c92210a0 [] start_kernel_thread+0x20/0x40 sp=e0000009c922fe30 bsp=e0000009c92210a0 I don't like this revert because it is fragile. ia64 is getting lucky because we seem to only ever process timer interrupts in this path, but if we ever race with an IPI here, we definitely use RCU and have the potential of hitting an oops that Paul describes above. Patching ia64's timer_interrupt() to check for NULL pt_regs is insufficient though, as we still hit the above oops. As a short term solution, I do think that this revert is the right answer. The revert hold up under repeated testing (24+ hour test runs) with this setup: - 8-way rx6600 - randomly toggling CPU online/offline state every 2 seconds - running CPU exercisers, memory hog, disk exercisers, and network stressors - average system load around ~160 In the long term, we really need to figure out why we set pt_regs = NULL in ia64_process_pending_intr(). If it turns out that it is unnecessary to do so, then we could safely re-introduce e7b14036 (along with some other logic to be smarter about migrating interrupts). One final note: x86 also removes the disabled CPU from cpu_online_map and then re-enables interrupts for 1ms, presumably to handle any pending interrupts: arch/x86/kernel/irq_32.c (and irq_64.c): cpu_disable_common: [remove cpu from cpu_online_map] fixup_irqs(): for_each_irq: [break CPU affinities] local_irq_enable(); mdelay(1); local_irq_disable(); So they are doing implicitly what ia64 is doing explicitly. Signed-off-by: Alex Chiang Signed-off-by: Tony Luck a†Þ¯”-x