Fix the pddf_custom_wdt driver rarely reports kernel dump issue while reboot in belgite platform #12322

Merged
merged 4 commits into from
Nov 4, 2022


jerseyang
Contributor

…r dump during reboot

Why I did it

SONiC occasionally reports a kernel dump during system reboot on the Belgite platform, as the following log shows:

2022-08-10 21:16:01,600 T0000: INFO SCMD: reboot
2022-08-10 21:18:01,849 T0000: INFO reboot in process .....
2022-08-10 21:18:01,849 T0000: INFO Waiting for the reboot operation to complete
2022-08-10 21:18:01,850 T0000: INFO [ 302.549829] kdump-tools[14989]: Stopping kdump-tools: unloaded kdump kernel.
2022-08-10 21:18:01,850 T0000: INFO [ 303.882742] watchdog: watchdog0: watchdog did not stop!
2022-08-10 21:18:01,850 T0000: INFO [ 304.781981] pddf_custom_wdt: CPLD Watchdog did not Stop!
2022-08-10 21:18:01,850 T0000: INFO [ 304.858387] BUG: unable to handle kernel paging request at 000000000002bf20
2022-08-10 21:18:01,850 T0000: INFO [ 304.865490] PGD 0 P4D 0
2022-08-10 21:18:01,850 T0000: INFO [ 304.868159] Oops: 0010 [#1] SMP NOPTI
2022-08-10 21:18:01,851 T0000: INFO [ 304.871954] CPU: 0 PID: 1 Comm: systemd-shutdow Tainted: GF OE 4.19.0-9-2-amd64 #1 Debian 4.19.118-2+deb10u1
2022-08-10 21:18:01,851 T0000: INFO [ 304.883044] Hardware name: Celestica Belgite/Belgite, BIOS COMe-Dnvt.2.02.00 11/02/2021
2022-08-10 21:18:01,851 T0000: INFO [ 304.891183] RIP: 0010:0x2bf20
2022-08-10 21:18:01,851 T0000: INFO [ 304.894286] Code: Bad RIP value.
2022-08-10 21:18:01,851 T0000: INFO [ 304.897647] RSP: 0018:ffffa65bc063fdc0 EFLAGS: 00010246
2022-08-10 21:18:01,851 T0000: INFO [ 304.903002] RAX: 000000000002bf20 RBX: ffff8c8df8cd0418 RCX: dead000000000200
2022-08-10 21:18:01,851 T0000: INFO [ 304.910268] RDX: 0000000000000001 RSI: ffff8c8df8cd0418 RDI: ffff8c8df8cd0400
2022-08-10 21:18:01,852 T0000: INFO [ 304.917534] RBP: 0000000000000000 R08: ffff8c8dfba21900 R09: ffff8c8dfb000000
2022-08-10 21:18:01,852 T0000: INFO [ 304.924799] R10: ffff8c8dfb000028 R11: 0000000000000000 R12: ffff8c8df8cd0400
2022-08-10 21:18:01,852 T0000: INFO [ 304.932066] R13: ffffffff838c660a R14: ffff8c8df8cd0460 R15: 0000000000000000
2022-08-10 21:18:01,852 T0000: INFO [ 304.939333] FS: 00007fdf112ea940(0000) GS:ffff8c8dfba00000(0000) knlGS:0000000000000000
2022-08-10 21:18:01,852 T0000: INFO [ 304.947556] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
2022-08-10 21:18:01,852 T0000: INFO [ 304.953432] CR2: 000000000002bef6 CR3: 0000000179908000 CR4: 00000000003406f0
2022-08-10 21:18:01,853 T0000: INFO [ 304.960698] Call Trace:
2022-08-10 21:18:01,853 T0000: INFO [ 304.963282] ? device_shutdown+0x13f/0x210
2022-08-10 21:18:01,853 T0000: INFO [ 304.967510] ? kernel_restart+0xe/0x30
2022-08-10 21:18:01,853 T0000: INFO [ 304.971388] ? __do_sys_reboot+0x1cf/0x210
2022-08-10 21:18:01,853 T0000: INFO [ 304.975618] ? vfs_writev+0xc5/0x100
2022-08-10 21:18:01,853 T0000: INFO [ 304.979328] ? do_writev+0x5f/0x100
2022-08-10 21:18:01,853 T0000: INFO [ 304.982949] ? do_syscall_64+0x53/0x110
2022-08-10 21:18:01,854 T0000: INFO [ 304.986919] ? entry_SYSCALL_64_after_hwframe+0x44/0xa9
2022-08-10 21:18:01,854 T0000: INFO [ 304.992275] Modules linked in: tcp_diag(E) udp_diag(E) raw_diag(E) inet_diag(E) unix_diag(E) af_packet_diag(E) netlink_diag(E) veth(E) dummy(E) nft_chain_nat_ipv6(E) nf_nat_ipv6(E) act_police(E) cls_u32(E) ip6t_REJECT(E) nf_reject_ipv6(E) ipt_REJECT(E) nf_reject_ipv4(E) xt_tcpudp(E) sch_ingress(E) lm75(E) ebt_vlan(E) pddf_custom_wdt(FOE) pddf_custom_psu(FOE) nft_compat(E) pddf_fan_module(OE) pddf_fan_driver_module(OE) pddf_led_module(OE) nft_counter(E) pddf_xcvr_driver_module(OE) pddf_xcvr_module(OE) pddf_gpio_module(OE) pddf_psu_module(OE) pddf_psu_driver_module(OE) pddf_fpgai2c_driver(OE) pddf_fpgai2c_module(OE) pddf_cpld_driver(OE) pddf_cpld_module(OE) pddf_mux_module(OE) pddf_client_module(OE) optoe(E) mc24lc64t(OE) gpio_pca953x(E) i2c_mux_pca954x(E) i2c_mux(E) i2c_dev(E) i2c_i801(E) i2c_ismt(E)
2022-08-10 21:18:01,854 T0000: INFO [ 305.063221] bridge(E) stp(E) llc(E) nf_conntrack_netlink(E) xfrm_user(E) xfrm_algo(E) intel_rapl(E) pnd2_edac(E) x86_pkg_temp_thermal(E) intel_powerclamp(E) coretemp(E) kvm_intel(E) bonding(E) kvm(E) ramdisk(OE) evdev(E) irqbypass(E) qat_c3xxx(E) crct10dif_pclmul(E) tpm_crb(E) crc32_pclmul(E) ghash_clmulni_intel(E) intel_cstate(E) intel_rapl_perf(E) intel_qat(E) wdat_wdt(E) efi_pstore(E) pcspkr(E) sg(E) efivars(E) authenc(E) tpm_tis(E) tpm_tis_core(E) tpm(E) rng_core(E) button(E) pcc_cpufreq(E) acpi_cpufreq(E) nft_chain_nat_ipv4(E) nf_nat_ipv4(E) nf_nat(E) nf_conntrack(E) nf_defrag_ipv6(E) nf_defrag_ipv4(E) libcrc32c(E) nf_tables(E) nfnetlink(E) efivarfs(E) ip_tables(E) x_tables(E) autofs4(E) loop(E) ext4(E) crc16(E) mbcache(E) jbd2(E) crc32c_generic(E) fscrypto(E) ecb(E) nvme(E) nvme_core(E) nls_utf8(E)
2022-08-10 21:18:01,854 T0000: INFO [ 305.134697] nls_cp437(E) nls_ascii(E) vfat(E) fat(E) overlay(E) squashfs(E) zstd_decompress(E) xxhash(E) sd_mod(E) uas(E) usb_storage(E) crc32c_intel(E) aesni_intel(E) aes_x86_64(E) crypto_simd(E) ixgbe(E) cryptd(E) glue_helper(E) mdio(E) ahci(E) igb(E) libahci(E) xhci_pci(E) i2c_algo_bit(E) dca(E) xhci_hcd(E) libata(E) usbcore(E) usb_common(E) scsi_mod(E) thermal(E) [last unloaded: linux_kernel_bde]
2022-08-10 21:18:01,854 T0000: INFO [ 305.170379] rb: panic handler for ramdisk sync
2022-08-10 21:18:01,855 T0000: INFO [ 305.174954] CR2: 000000000002bf20
2022-08-10 21:18:01,855 T0000: INFO [ 305.174992] rb: not syncing ramdisk, timeout: 1489
2022-08-10 21:18:01,855 T0000: INFO [ 305.178407] ---[ end trace a0c62d0743dfbdc7 ]---
2022-08-10 21:18:01,855 T0000: INFO [ 305.184980] RIP: 0010:0x2bf20
2022-08-10 21:18:01,855 T0000: INFO [ 305.191391] Code: Bad RIP value.
2022-08-10 21:18:01,855 T0000: INFO [ 305.194752] RSP: 0018:ffffa65bc063fdc0 EFLAGS: 00010246
2022-08-10 21:18:01,855 T0000: INFO [ 305.200108] RAX: 000000000002bf20 RBX: ffff8c8df8cd0418 RCX: dead000000000200
2022-08-10 21:18:01,856 T0000: INFO [ 305.207374] RDX: 0000000000000001 RSI: ffff8c8df8cd0418 RDI: ffff8c8df8cd0400
2022-08-10 21:18:01,856 T0000: INFO [ 305.214639] RBP: 0000000000000000 R08: ffff8c8dfba21900 R09: ffff8c8dfb000000
2022-08-10 21:18:01,856 T0000: INFO [ 305.221905] R10: ffff8c8dfb000028 R11: 0000000000000000 R12: ffff8c8df8cd0400
2022-08-10 21:18:01,856 T0000: INFO [ 305.229171] R13: ffffffff838c660a R14: ffff8c8df8cd0460 R15: 0000000000000000
2022-08-10 21:18:01,856 T0000: INFO [ 305.236438] FS: 00007fdf112ea940(0000) GS:ffff8c8dfba00000(0000) knlGS:0000000000000000
2022-08-10 21:18:01,856 T0000: INFO [ 305.244658] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
2022-08-10 21:18:01,856 T0000: INFO [ 305.250534] CR2: 000000000002bef6 CR3: 0000000179908000 CR4: 00000000003406f0
2022-08-10 21:18:01,857 T0000: INFO [ 305.257801] Kernel panic - not syncing: Fatal exception
2022-08-10 21:18:01,857 T0000: INFO [ 305.263208] Kernel Offset: 0x1a00000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
2022-08-10 21:18:01,857 T0000: INFO [ 305.274033] rb: panic handler for ramdisk sync
2022-08-10 21:18:01,857 T0000: INFO [ 305.285097] Rebooting in 10 seconds..
2022-08-10 21:18:01,857 T0000: INFO [ 315.210203] ACPI MEMORY or I/O RESET_REG.
2022-08-10 21:18:01,857 T0000: INFO
2022-08-10 21:18:01,858 T0000: INFO
2022-08-10 21:18:01,858 T0000: INFO
2022-08-10 21:18:01,858 T0000: INFO Primary BIOS boot in progress...
2022-08-10 21:18:01,858 T0000: INFO
2022-08-10 21:18:01,858 T0000: INFO CPLD_C version: 0.6
2022-08-10 21:18:01,858 T0000: INFO CPLD_B version: 2.4

How I did it

Cause:

  • An invalid cdev container pointer taken from the inode is accessed in the
    misc device open handler, which causes memory corruption in the slub.
  • Because of the slub corruption, random crashes are seen during reboot.

Fix: Instead of the cdev pointer from the inode, the mdev container pointer
is taken from the file->private_data member.

Action: Update the pddf_custom_wdt driver.
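
To illustrate the fix, here is a minimal userspace sketch of the pointer recovery, not the actual driver code. It assumes the driver embeds a `struct miscdevice` in its own state structure. The struct layouts are simplified mocks, and `struct wdt_dev`, `mock_misc_open()` and `wdt_from_file()` are hypothetical names. In the kernel, the misc core stores the `miscdevice` pointer in `file->private_data` before calling the driver's open handler, so `container_of()` on that pointer yields the driver state. The inode's `i_cdev` is not a member of the driver structure, so applying `container_of()` to it yields a pointer to unrelated memory.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace stand-in for the kernel's container_of() macro. */
#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/* Simplified mocks of the kernel types; the real ones have many more fields. */
struct miscdevice { int minor; const char *name; };
struct file { void *private_data; };

/* Hypothetical driver state that embeds the misc device (names are illustrative). */
struct wdt_dev {
    unsigned long timeout;
    struct miscdevice mdev;
};

/* Mimics the misc core, which stores the miscdevice pointer in
 * file->private_data before invoking the driver's open handler. */
void mock_misc_open(struct file *file, struct miscdevice *m)
{
    file->private_data = m;
}

/* Recovers the driver state from the file, as the fixed open handler would. */
struct wdt_dev *wdt_from_file(struct file *file)
{
    struct miscdevice *m = file->private_data;

    assert(m != NULL);
    return container_of(m, struct wdt_dev, mdev);
}
```

Because the recovered pointer depends only on `file->private_data`, the open handler no longer touches memory that the driver does not own.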

How to verify it

Run a reboot stress test and check that no kernel dump occurs during the reboot process.
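
When checking captured console logs from such a stress test, a small helper along these lines could flag the failure. This is only a sketch: the signature strings are taken from the crash log above, while `line_has_crash_signature()` and the idea of scanning the log line by line are assumptions, not part of this PR.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Substrings copied from the crash log in this PR description. */
static const char *const crash_signatures[] = {
    "BUG: unable to handle kernel",
    "Oops:",
    "Kernel panic - not syncing",
};

/* Returns 1 if the console log line contains a known crash signature, else 0. */
int line_has_crash_signature(const char *line)
{
    size_t count = sizeof(crash_signatures) / sizeof(crash_signatures[0]);

    assert(line != NULL);
    for (size_t i = 0; i < count; i++) {
        if (strstr(line, crash_signatures[i]) != NULL)
            return 1;
    }
    return 0;
}
```

Feeding each line of the captured console output through this function after every reboot cycle would be enough to stop the test on the first dump.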

Which release branch to backport (provide reason below if selected)

  • 201811
  • 201911
  • 202006
  • 202012
  • 202106
  • 202111
  • [x] 202205

@jerseyang jerseyang requested a review from lguohan as a code owner October 9, 2022 03:10
@jerseyang
Contributor Author

Supplement: watchdog test:

root@sonic:/home/admin# watchdogutil arm -s 30
Watchdog armed for 30 seconds
root@sonic:/home/admin# watchdogutil status
Status: Armed
Time remaining: 12 seconds
root@sonic:/home/admin# watchdogutil status
Status: Armed
Time remaining: 5 seconds
root@sonic:/home/admin#

Primary BIOS boot in progress...

CPLD_C version: 0.6
CPLD_B version: 2.5

Version 2.19.1266. Copyright (C) 2022 American Megatrends, Inc.
HBIOS Date: 02/10/2022 15:25:46 Ver: COMe-Dnvt.3.00.00_B

error: terminal `serial' isn't found.

error: terminal `serial' isn't found.

GNU GRUB version 2.02

 Use the ^ and v keys to select which entry is highlighted.
 Press enter to boot the selected OS, `e' to edit the commands
 before booting or `c' for a command-line.

  SONiC-OS-Belgite_Fix_WDT_Driver_Dump.0-2654abf4a
  ONIE

 The highlighted entry will be executed automatically in 5s.

  Booting `SONiC-OS-Belgite_Fix_WDT_Driver_Dump.0-2654abf4a'

Loading SONiC-OS OS kernel ...

Loading SONiC-OS OS initial ramdisk ...

tune2fs 1.46.2 (28-Feb-2021)
Setting reserved blocks percentage to 0% (0 blocks)
Setting reserved blocks count to 0
[ 4.690616] rc.local[447]: + grep build_version
[ 4.751777] rc.local[446]: + cat /etc/sonic/sonic_version.yml
[ 4.832195] rc.local[448]: + sed -e s/build_version: //g;s/'//g
[ 4.935234] rc.local[442]: + SONIC_VERSION=Belgite_Fix_WDT_Driver_Dump.0-2654abf4a
[ 5.047236] rc.local[442]: + FIRST_BOOT_FILE=/host/image-Belgite_Fix_WDT_Driver_Dump.0-2654abf4a/platform/firsttime
[ 5.175610] rc.local[442]: + SONIC_CONFIG_DIR=/host/image-Belgite_Fix_WDT_Driver_Dump.0-2654abf4a/sonic-config
[ 5.303639] rc.local[442]: + SONIC_ENV_FILE=/host/image-Belgite_Fix_WDT_Driver_Dump.0-2654abf4a/sonic-config/sonic-environment
[ 5.447604] rc.local[442]: + [ -d /host/image-Belgite_Fix_WDT_Driver_Dump.0-2654abf4a/sonic-config -a -f /host/image-Belgite_Fix_WDT_Driver_Dump.0-2654abf4a/sonic-config/sonic-environment ]
[ 5.659616] rc.local[442]: + logger SONiC version Belgite_Fix_WDT_Driver_Dump.0-2654abf4a starting up...
[ 5.775639] rc.local[442]: + grub_installation_needed=
[ 5.847633] rc.local[442]: + [ ! -e /host/machine.conf ]
[ 5.922284] rc.local[442]: + . /host/machine.conf
[ 5.983831] rc.local[442]: + onie_arch=x86_64
[ 6.043641] rc.local[442]: + onie_bin=
[ 6.099584] rc.local[442]: + onie_boot_reason=install
[ 6.167604] rc.local[442]: + onie_build_date=2022-02-11T12:38+08:00
[ 6.255494] rc.local[442]: + onie_build_machine=cel_belgite
[ 6.331979] rc.local[442]: + onie_build_platform=x86_64-cel_belgite-r0
[ 6.431587] rc.local[442]: + onie_cli_static_parms=
[ 6.503485] rc.local[442]: + onie_cli_static_url=sonic-broadcom-PR-1019.bin
[ 6.591486] rc.local[442]: + onie_config_version=1
[ 6.651490] rc.local[442]: + onie_dev=/dev/sda2
[ 6.711502] rc.local[442]: + onie_exec_url=sonic-broadcom-PR-1019.bin
[ 6.799486] rc.local[442]: + onie_firmware=auto
[ 6.859480] rc.local[442]: + onie_grub_image_name=grubx64.efi
[ 6.935485] rc.local[442]: + onie_initrd_tmp=/
[ 6.995638] rc.local[442]: + onie_installer=/var/tmp/installer
[ 7.071493] rc.local[442]: + onie_kernel_version=4.9.95
[ 7.143492] rc.local[442]: + onie_machine=cel_belgite
[ 7.215491] rc.local[442]: + onie_machine_rev=0
[ 7.275499] rc.local[442]: + onie_partition_type=gpt
[ 7.347489] rc.local[442]: + onie_platform=x86_64-cel_belgite-r0
[ 7.423493] rc.local[442]: + onie_root_dir=/mnt/onie-boot/onie
[ 7.499488] rc.local[442]: + onie_skip_ethmgmt_macs=no
[ 7.571488] rc.local[442]: + onie_switch_asic=bcm
[ 7.631492] rc.local[442]: + onie_uefi_arch=x64
[ 7.691489] rc.local[442]: + onie_uefi_boot_loader=grubx64.efi
[ 7.767491] rc.local[442]: + onie_vendor_id=12244
[ 7.827498] rc.local[442]: + onie_version=2019.02.01.3.0.0
[ 7.899488] rc.local[442]: + program_console_speed
[ 7.961465] rc.local[464]: + cat /proc/cmdline
[ 8.021197] rc.local[465]: + grep -Eo console=ttyS[0-9]+,[0-9]+
[ 8.100726] rc.local[475]: + cut -d , -f2
[ 8.160151] rc.local[442]: + speed=9600
[ 8.219494] rc.local[442]: + [ -z 9600 ]
[ 8.279495] rc.local[442]: + CONSOLE_SPEED=9600
[ 8.340492] rc.local[477]: + grep agetty /lib/systemd/system/serial-getty@.service
[ 8.445099] rc.local[478]: + grep keep-baud
[ 8.510415] rc.local[442]: + [ 1 = 0 ]
[ 8.566806] rc.local[442]: + sed -i s|u' .* %I|u' 9600 %I|g /lib/systemd/system/serial-getty@.service
[ 8.679595] rc.local[442]: + systemctl daemon-reload
[ 8.751621] rc.local[442]: + [ -f /host/image-Belgite_Fix_WDT_Driver_Dump.0-2654abf4a/platform/firsttime ]
[ 8.875643] rc.local[442]: + [ -f /var/log/fsck.log.gz ]
[ 8.954603] rc.local[539]: + gunzip -d -c /var/log/fsck.log.gz
[ 9.041189] rc.local[540]: + logger -t FSCK
[ 9.104658] rc.local[442]: + rm -f /var/log/fsck.log.gz
[ 9.183874] rc.local[442]: + exit 0
[ 9.616071] c3xxx 0000:01:00.0: Failed to send admin msg 0 to accelerator 2
[ 9.701169] c3xxx 0000:01:00.0: Failed to send init message

Debian GNU/Linux 11 sonic ttyS0

sonic login: admin

Password:
Linux sonic 5.10.0-12-2-amd64 #1 SMP Debian 5.10.103-1 (2022-03-07) x86_64
You are on
[SONiC ASCII-art logo]

-- Software for Open Networking in the Cloud --

Unauthorized access and/or use are prohibited.
All access and/or use are subject to monitoring.

Help: https://sonic-net.github.io/SONiC/

Last login: Fri Oct 21 02:25:45 UTC 2022 on ttyS0
admin@sonic:~$ sudo -s
root@sonic:/home/admin#
root@sonic:/home/admin# watchdogutil arm -s 30
Watchdog armed for 30 seconds
root@sonic:/home/admin# watchdogutil status
Status: Armed
Time remaining: 20 seconds
root@sonic:/home/admin#
root@sonic:/home/admin#
root@sonic:/home/admin# watchdogutil disarm
Watchdog disarmed successfully
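
For reference, the arm, status and disarm sequence above could be exercised directly against a watchdog device node using the standard Linux watchdog ioctl interface. This is a hedged sketch only: whether `watchdogutil` on this platform goes through these ioctls, and which device node it uses, are assumptions, and `wdt_arm_query_disarm()` is a hypothetical helper name.

```c
#include <assert.h>
#include <fcntl.h>
#include <linux/watchdog.h>
#include <sys/ioctl.h>
#include <unistd.h>

/* Arms the watchdog at dev_path for timeout_s seconds, reads back the
 * remaining time, then disarms it. Returns 0 on success, -1 on failure. */
int wdt_arm_query_disarm(const char *dev_path, int timeout_s, int *time_left_s)
{
    int rc = -1;
    int opts = WDIOS_DISABLECARD;
    int fd;

    assert(dev_path != NULL && time_left_s != NULL);

    fd = open(dev_path, O_WRONLY); /* opening the node typically starts the timer */
    if (fd < 0)
        return -1;

    if (ioctl(fd, WDIOC_SETTIMEOUT, &timeout_s) == 0 &&
        ioctl(fd, WDIOC_GETTIMELEFT, time_left_s) == 0)
        rc = 0;

    /* Always try to stop the timer so a failed query cannot reset the system. */
    if (ioctl(fd, WDIOC_SETOPTIONS, &opts) != 0)
        rc = -1;

    /* "Magic close": writing 'V' asks drivers that support it to stop on close. */
    if (write(fd, "V", 1) != 1)
        rc = -1;

    close(fd);
    return rc;
}
```

Run this only on a test unit, since a watchdog that fails to disarm will reset the system once the timeout expires.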

@jerseyang
Contributor Author

@lguohan Hi Guohan, can you assign a reviewer for this PR? Thanks!

@jerseyang
Contributor Author

@lguohan Hi guohan, any update about this? Thanks!

@jerseyang
Contributor Author

@lguohan Hi Guohan, Can you assign a reviewer for this PR? Thanks.

@jerseyang
Contributor Author

@Blueve Could you also help take a look at this? Thanks!

@jerseyang
Contributor Author

@lguohan Hi Guohan, could you assign a reviewer for this PR? Waiting for your reply. Thanks!

@Blueve Blueve merged commit 7fb8bf7 into sonic-net:master Nov 4, 2022
@jerseyang jerseyang deleted the Belgite_Fix_WDT_Driver_Dump branch November 28, 2022 05:20