Unset CONFIG_THERMAL_STATISTICS to prevent kernel crash (#199)
Fix sonic-net/sonic-buildimage#6866

Unset CONFIG_THERMAL_STATISTICS.
Reason:
When kernel thermal zones bind to the cooling device with CONFIG_THERMAL_STATISTICS=y, the kernel crashes because of an out-of-bounds update:
trans_table is a two-dimensional table sized by the maximum cooling state (10).
If statistics are enabled, thermal_cooling_device_stats_update() is called and tries to update an entry outside the table:
stats->trans_table[stats->state * stats->max_states + new_state]++
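The index arithmetic above can be sketched in userspace C. This is a minimal illustration, not the kernel code: `cooling_stats`, `trans_index` and `trans_in_bounds` are hypothetical names, and the table is assumed to hold `max_states * max_states` counters.

```c
#include <stddef.h>

/* Hypothetical stand-in for the kernel's cooling-device statistics;
 * only the two fields used in the expression above are modelled. */
struct cooling_stats {
    unsigned int state;      /* current cooling state */
    unsigned int max_states; /* number of states the table was sized for */
};

/* Flat index of the transition counter, same arithmetic as above. */
static size_t trans_index(const struct cooling_stats *s, unsigned int new_state)
{
    return (size_t)s->state * s->max_states + new_state;
}

/* Returns 1 when both states fit the table, 0 otherwise.
 * Assumption: the table has max_states * max_states entries. */
static int trans_in_bounds(const struct cooling_stats *s, unsigned int new_state)
{
    return s->state < s->max_states && new_state < s->max_states;
}
```

With `max_states` equal to 10 the table would have 100 entries under this assumption, so a transition from state 9 to a state of 12 maps to index 102, which lies past the end of the allocation.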

The kernel crashes with the following stack trace:

```
[  269.474092] watchdog: watchdog1: watchdog did not stop!
[  269.533625] list_del corruption. prev->next should be ffff9e136bd57418, but was 677ac660ffffffff

[  269.543482] kernel BUG at lib/list_debug.c:53!
[  269.548458] invalid opcode: 0000 [#1] SMP PTI
[  269.553326] CPU: 1 PID: 8890 Comm: kexec Tainted: G           OE     4.19.0-9-2-amd64 #1 Debian 4.19.118-2+deb10u1
[  269.564891] Hardware name: Mellanox Technologies Ltd. MSN4700/VMOD0010, BIOS 5.11 11/03/2020
[  269.574323] RIP: 0010:__list_del_entry_valid.cold.1+0x34/0x4c
[  269.580740] Code: 9f 29 a5 e8 68 7a d0 ff 0f 0b 48 c7 c7 20 a0 29 a5 e8 5a 7a d0 ff 0f 0b 48 89 f2 48 89 fe 48 c7 c7 e0 9f 29 a5 e8 46 7a d0 ff <0f> 0b 48 89 fe 48 c7 c7 a8 9f 29 a5 e8 35 7a d0 ff 0f 0b 90 90 90
[  269.601726] RSP: 0018:ffffaddb83b5fdc0 EFLAGS: 00010246
[  269.607561] RAX: 0000000000000054 RBX: ffff9e136bd57418 RCX: 0000000000000000
[  269.615531] RDX: 0000000000000000 RSI: ffff9e136fa566b8 RDI: ffff9e136fa566b8
[  269.623500] RBP: ffff9e1364bd5070 R08: 00000000000005ce R09: 0000000000000004
[  269.631470] R10: 0000000000000766 R11: ffffffffa59f66ad R12: ffff9e136bd57400
[  269.639440] R13: ffffffffa52c6a12 R14: ffff9e1364bd30d0 R15: 0000000000000000
[  269.647410] FS:  00007f97227af740(0000) GS:ffff9e136fa40000(0000) knlGS:0000000000000000
[  269.656441] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  269.662857] CR2: 000055cfdb69e158 CR3: 00000004677f6001 CR4: 00000000003606e0
[  269.670820] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  269.678790] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  269.686760] Call Trace:
[  269.689489]  device_shutdown+0xc1/0x210
[  269.693773]  kernel_kexec+0x51/0x96
[  269.697666]  __do_sys_reboot+0x1be/0x210
[  269.702045]  ? kmem_cache_free+0x1aa/0x1d0
[  269.706618]  ? __dentry_kill+0x121/0x170
[  269.710998]  ? _cond_resched+0x15/0x30
[  269.715181]  ? dentry_kill+0x4d/0x190
[  269.719260]  ? _cond_resched+0x15/0x30
[  269.723444]  do_syscall_64+0x53/0x110
[  269.727531]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  269.733172] RIP: 0033:0x7f97228a3373
[  269.737161] Code: 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 89 fa be 69 19 12 28 bf ad de e1 fe b8 a9 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 05 c3 0f 1f 40 00 48 8b 15 e9 9a 0c 00 f7 d8
[  269.758147] RSP: 002b:00007ffe11d30fa8 EFLAGS: 00000202 ORIG_RAX: 00000000000000a9
[  269.766602] RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 00007f97228a3373
[  269.774572] RDX: 0000000045584543 RSI: 0000000028121969 RDI: 00000000fee1dead
[  269.782541] RBP: 0000000000000002 R08: 0000000000000004 R09: 000055cfdb69e160
[  269.790511] R10: fffffffffffffb8e R11: 0000000000000202 R12: 00007ffe11d31238
[  269.798482] R13: 0000000000000000 R14: 0000000000000000 R15: 00000000ffffffff
[  269.806443] Modules linked in: nft_chain_route_ipv4(E) xt_TCPMSS(E) sx_bfd(OE) sx_netdev(OE) psample(E) dummy(E) sx_core(OE) 8021q(E) garp(E) mrp(E) mst_pciconf(OE) mst_pci(OE) xt_hl(E) xt_tcpudp(E) ip6_tables(E) nft_compat(E) nft_counter(E) xt_conntrack(E) nf_nat(E) nf_conntrack_netlink(E) nf_conntrack(E) nf_defrag_ipv6(E) nf_defrag_ipv4(E) libcrc32c(E) xfrm_user(E) xfrm_algo(E) intel_rapl(E) mlxsw_minimal(E) sb_edac(E) mlxsw_i2c(E) x86_pkg_temp_thermal(E) mlxsw_core(E) intel_powerclamp(E) devlink(E) kvm_intel(E) bonding(E) kvm(E) i2c_mux_reg(E) i2c_mux(E) mlxreg_hotplug(E) mlxreg_io(E) leds_mlxreg(E) i2c_mlxcpld(E) mlxreg_fan(E) mxm_wmi(E) irqbypass(E) crct10dif_pclmul(E) crc32_pclmul(E) evdev(E) mlx_platform(E) ghash_clmulni_intel(E) intel_cstate(E) sg(E) intel_uncore(E) iTCO_wdt(E) pcspkr(E)
[  269.885239]  intel_rapl_perf(E) ioatdma(E) iTCO_vendor_support(E) pcc_cpufreq(E) wmi(E) ebt_vlan(E) ebtable_broute(E) bridge(E) stp(E) llc(E) ebtable_nat(E) nf_tables(E) button(E) nfnetlink(E) ebtable_filter(E) ebtables(E) xdpe12284(E) at24(E) ledtrig_timer(E) tmp102(E) lm75(E) coretemp(E) max1363(E) industrialio_triggered_buffer(E) kfifo_buf(E) industrialio(E) tps53679(E) pmbus(E) pmbus_core(E) i2c_dev(E) ip_tables(E) x_tables(E) autofs4(E) loop(E) ext4(E) crc16(E) mbcache(E) jbd2(E) crc32c_generic(E) fscrypto(E) ecb(E) sd_mod(E) nvme(E) nvme_core(E) nls_utf8(E) nls_cp437(E) nls_ascii(E) vfat(E) fat(E) overlay(E) squashfs(E) zstd_decompress(E) xxhash(E) crc32c_intel(E) gpio_ich(E) ahci(E) aesni_intel(E) libahci(E) aes_x86_64(E) crypto_simd(E) xhci_pci(E) ehci_pci(E) libata(E) igb(E) ehci_hcd(E)
[  269.964036]  xhci_hcd(E) cryptd(E) glue_helper(E) scsi_mod(E) i2c_algo_bit(E) i2c_i801(E) lpc_ich(E) dca(E) mfd_core(E) usbcore(E) usb_common(E)
[  269.978536] ---[ end trace 8f56c678b52f9aee ]---
[  269.983698] RIP: 0010:__list_del_entry_valid.cold.1+0x34/0x4c
[  269.990123] Code: 9f 29 a5 e8 68 7a d0 ff 0f 0b 48 c7 c7 20 a0 29 a5 e8 5a 7a d0 ff 0f 0b 48 89 f2 48 89 fe 48 c7 c7 e0 9f 29 a5 e8 46 7a d0 ff <0f> 0b 48 89 fe 48 c7 c7 a8 9f 29 a5 e8 35 7a d0 ff 0f 0b 90 90 90
[  270.011117] RSP: 0018:ffffaddb83b5fdc0 EFLAGS: 00010246
[  270.016958] RAX: 0000000000000054 RBX: ffff9e136bd57418 RCX: 0000000000000000
[  270.024935] RDX: 0000000000000000 RSI: ffff9e136fa566b8 RDI: ffff9e136fa566b8
[  270.032912] RBP: ffff9e1364bd5070 R08: 00000000000005ce R09: 0000000000000004
[  270.040890] R10: 0000000000000766 R11: ffffffffa59f66ad R12: ffff9e136bd57400
[  270.048866] R13: ffffffffa52c6a12 R14: ffff9e1364bd30d0 R15: 0000000000000000
[  270.056844] FS:  00007f97227af740(0000) GS:ffff9e136fa40000(0000) knlGS:0000000000000000
[  270.065889] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  270.072312] CR2: 000055cfdb69e158 CR3: 00000004677f6001 CR4: 00000000003606e0
[  270.080289] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  270.088268] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
```
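The `list_del corruption` line at the top of the trace is printed by the kernel's list debugging code (`lib/list_debug.c` in the trace) when a node's neighbours no longer point back at it. A simplified userspace sketch of that kind of check follows; `list_node`, `node_add` and `node_del_is_valid` are hypothetical names, and the commit message does not say which write corrupted the list.

```c
#include <stddef.h>

/* Minimal doubly linked list node, modelled loosely on struct list_head. */
struct list_node {
    struct list_node *next;
    struct list_node *prev;
};

/* Insert entry right after head. */
static void node_add(struct list_node *entry, struct list_node *head)
{
    entry->next = head->next;
    entry->prev = head;
    head->next->prev = entry;
    head->next = entry;
}

/* Simplified validity check before deletion: both neighbours must
 * still point back at entry. Returns 1 if intact, 0 if corrupted. */
static int node_del_is_valid(const struct list_node *entry)
{
    return entry->prev->next == entry && entry->next->prev == entry;
}
```

If a stray write overwrites `prev->next` of a neighbouring node, the check returns 0, which corresponds to the "prev->next should be ..., but was ..." message in the log.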

As a temporary solution, disable this config option and work with the Linux community on a proper fix.
The proper fix requires a fan driver update, which is not trivial; it will take some time before it is available in net-next and can be backported to the SONiC linux-kernel.

It was tested on:
HwSKU: ACS-MSN2410
HwSKU: Mellanox-SN2700
allas-nvidia authored and daall committed Mar 10, 2021
1 parent 21b7aa0 commit 3e191df
Showing 1 changed file with 1 addition and 0 deletions.
1 change: 1 addition & 0 deletions patch/kconfig-exclusions
@@ -6,6 +6,7 @@ CONFIG_STRICT_DEVMEM
 # Unset X86_PAT according to Broadcom's requirement
 CONFIG_X86_PAT
 CONFIG_MLXSW_PCI
+CONFIG_THERMAL_STATISTICS

[arm64]
