# git rev-parse -q --verify 801980f6497946048709b9b09771a1729551d705^{commit} 801980f6497946048709b9b09771a1729551d705 already have revision, skipping fetch # git checkout -q -f -B kisskb 801980f6497946048709b9b09771a1729551d705 # git clean -qxdf # < git log -1 # commit 801980f6497946048709b9b09771a1729551d705 # Author: Michael Roth # Date: Tue Aug 11 11:15:44 2020 -0500 # # powerpc/pseries/hotplug-cpu: wait indefinitely for vCPU death # # For a power9 KVM guest with XIVE enabled, running a test loop # where we hotplug 384 vcpus and then unplug them, the following traces # can be seen (generally within a few loops) either from the unplugged # vcpu: # # cpu 65 (hwid 65) Ready to die... # Querying DEAD? cpu 66 (66) shows 2 # list_del corruption. next->prev should be c00a000002470208, but was c00a000002470048 # ------------[ cut here ]------------ # kernel BUG at lib/list_debug.c:56! # Oops: Exception in kernel mode, sig: 5 [#1] # LE SMP NR_CPUS=2048 NUMA pSeries # Modules linked in: fuse nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 ... # CPU: 66 PID: 0 Comm: swapper/66 Kdump: loaded Not tainted 4.18.0-221.el8.ppc64le #1 # NIP: c0000000007ab50c LR: c0000000007ab508 CTR: 00000000000003ac # REGS: c0000009e5a17840 TRAP: 0700 Not tainted (4.18.0-221.el8.ppc64le) # MSR: 800000000282b033 CR: 28000842 XER: 20040000 # ... # NIP __list_del_entry_valid+0xac/0x100 # LR __list_del_entry_valid+0xa8/0x100 # Call Trace: # __list_del_entry_valid+0xa8/0x100 (unreliable) # free_pcppages_bulk+0x1f8/0x940 # free_unref_page+0xd0/0x100 # xive_spapr_cleanup_queue+0x148/0x1b0 # xive_teardown_cpu+0x1bc/0x240 # pseries_mach_cpu_die+0x78/0x2f0 # cpu_die+0x48/0x70 # arch_cpu_idle_dead+0x20/0x40 # do_idle+0x2f4/0x4c0 # cpu_startup_entry+0x38/0x40 # start_secondary+0x7bc/0x8f0 # start_secondary_prolog+0x10/0x14 # # or on the worker thread handling the unplug: # # pseries-hotplug-cpu: Attempting to remove CPU , drc index: 1000013a # Querying DEAD? cpu 314 (314) shows 2 # BUG: Bad page state in process kworker/u768:3 pfn:95de1 # cpu 314 (hwid 314) Ready to die... # page:c00a000002577840 refcount:0 mapcount:-128 mapping:0000000000000000 index:0x0 # flags: 0x5ffffc00000000() # raw: 005ffffc00000000 5deadbeef0000100 5deadbeef0000200 0000000000000000 # raw: 0000000000000000 0000000000000000 00000000ffffff7f 0000000000000000 # page dumped because: nonzero mapcount # Modules linked in: kvm xt_CHECKSUM ipt_MASQUERADE xt_conntrack ... # CPU: 0 PID: 548 Comm: kworker/u768:3 Kdump: loaded Not tainted 4.18.0-224.el8.bz1856588.ppc64le #1 # Workqueue: pseries hotplug workque pseries_hp_work_fn # Call Trace: # dump_stack+0xb0/0xf4 (unreliable) # bad_page+0x12c/0x1b0 # free_pcppages_bulk+0x5bc/0x940 # page_alloc_cpu_dead+0x118/0x120 # cpuhp_invoke_callback.constprop.5+0xb8/0x760 # _cpu_down+0x188/0x340 # cpu_down+0x5c/0xa0 # cpu_subsys_offline+0x24/0x40 # device_offline+0xf0/0x130 # dlpar_offline_cpu+0x1c4/0x2a0 # dlpar_cpu_remove+0xb8/0x190 # dlpar_cpu_remove_by_index+0x12c/0x150 # dlpar_cpu+0x94/0x800 # pseries_hp_work_fn+0x128/0x1e0 # process_one_work+0x304/0x5d0 # worker_thread+0xcc/0x7a0 # kthread+0x1ac/0x1c0 # ret_from_kernel_thread+0x5c/0x80 # # The latter trace is due to the following sequence: # # page_alloc_cpu_dead # drain_pages # drain_pages_zone # free_pcppages_bulk # # where drain_pages() in this case is called under the assumption that # the unplugged cpu is no longer executing. To ensure that is the case, # and early call is made to __cpu_die()->pseries_cpu_die(), which runs a # loop that waits for the cpu to reach a halted state by polling its # status via query-cpu-stopped-state RTAS calls. It only polls for 25 # iterations before giving up, however, and in the trace above this # results in the following being printed only .1 seconds after the # hotplug worker thread begins processing the unplug request: # # pseries-hotplug-cpu: Attempting to remove CPU , drc index: 1000013a # Querying DEAD? cpu 314 (314) shows 2 # # At that point the worker thread assumes the unplugged CPU is in some # unknown/dead state and procedes with the cleanup, causing the race # with the XIVE cleanup code executed by the unplugged CPU. # # Fix this by waiting indefinitely, but also making an effort to avoid # spurious lockup messages by allowing for rescheduling after polling # the CPU status and printing a warning if we wait for longer than 120s. # # Fixes: eac1e731b59ee ("powerpc/xive: guest exploitation of the XIVE interrupt controller") # Suggested-by: Michael Ellerman # Signed-off-by: Michael Roth # Tested-by: Greg Kurz # Reviewed-by: Thiago Jung Bauermann # Reviewed-by: Greg Kurz # [mpe: Trim oopses in change log slightly for readability] # Signed-off-by: Michael Ellerman # Link: https://lore.kernel.org/r/20200811161544.10513-1-mdroth@linux.vnet.ibm.com # < /opt/cross/kisskb/korg/gcc-4.9.4-nolibc/powerpc64-linux/bin/powerpc64-linux-gcc --version # < /opt/cross/kisskb/korg/gcc-4.9.4-nolibc/powerpc64-linux/bin/powerpc64-linux-ld --version # < git log --format=%s --max-count=1 801980f6497946048709b9b09771a1729551d705 # < make -s -j 48 ARCH=powerpc O=/kisskb/build/powerpc-fixes_40x_walnut_defconfig_powerpc-gcc4.9 CROSS_COMPILE=/opt/cross/kisskb/korg/gcc-4.9.4-nolibc/powerpc64-linux/bin/powerpc64-linux- 40x/walnut_defconfig # < make -s -j 48 ARCH=powerpc O=/kisskb/build/powerpc-fixes_40x_walnut_defconfig_powerpc-gcc4.9 CROSS_COMPILE=/opt/cross/kisskb/korg/gcc-4.9.4-nolibc/powerpc64-linux/bin/powerpc64-linux- help # make -s -j 48 ARCH=powerpc O=/kisskb/build/powerpc-fixes_40x_walnut_defconfig_powerpc-gcc4.9 CROSS_COMPILE=/opt/cross/kisskb/korg/gcc-4.9.4-nolibc/powerpc64-linux/bin/powerpc64-linux- olddefconfig # make -s -j 48 ARCH=powerpc O=/kisskb/build/powerpc-fixes_40x_walnut_defconfig_powerpc-gcc4.9 CROSS_COMPILE=/opt/cross/kisskb/korg/gcc-4.9.4-nolibc/powerpc64-linux/bin/powerpc64-linux- /kisskb/src/block/genhd.c: In function 'diskstats_show': /kisskb/src/block/genhd.c:1667:1: warning: the frame size of 1160 bytes is larger than 1024 bytes [-Wframe-larger-than=] } ^ Completed OK # rm -rf /kisskb/build/powerpc-fixes_40x_walnut_defconfig_powerpc-gcc4.9 # Build took: 0:00:38.455061