Fix the pddf_custom_wdt driver rarely reports kernel dump issue while reboot in belgite platform #12322
…r dump during reboot
Supplement: watchdog test
root@sonic:/home/admin# watchdogutil arm -s 30
Primary BIOS boot in progress...
CPLD_C version: 0.6
Version 2.19.1266. Copyright (C) 2022 American Megatrends, Inc.
error: terminal `serial' isn't found.
error: terminal `serial' isn't found.
GNU GRUB version 2.02
Loading SONiC-OS OS kernel ...
Loading SONiC-OS OS initial ramdisk ...
tune2fs 1.46.2 (28-Feb-2021)
Debian GNU/Linux 11 sonic ttyS0
sonic login: admin
Password:
-- Software for Open Networking in the Cloud --
Unauthorized access and/or use are prohibited.
Help: https://sonic-net.github.io/SONiC/
Last login: Fri Oct 21 02:25:45 UTC 2022 on ttyS0
@lguohan Hi Guohan, can you assign a reviewer for this PR? Thanks!
@lguohan Hi Guohan, any update on this? Thanks!
@lguohan Hi Guohan, can you assign a reviewer for this PR? Thanks.
@Blueve Could you also help take a look at this? Thanks!
@lguohan Hi Guohan, could you assign a reviewer for this PR? Waiting for your reply. Thanks!
Why I did it
SONiC occasionally reports a kernel dump during system reboot on the Belgite platform, as the following log shows:
2022-08-10 21:16:01,600 T0000: INFO SCMD: reboot
2022-08-10 21:18:01,849 T0000: INFO reboot in process .....
2022-08-10 21:18:01,849 T0000: INFO Waiting for the reboot operation to complete
2022-08-10 21:18:01,850 T0000: INFO [ 302.549829] kdump-tools[14989]: Stopping kdump-tools: unloaded kdump kernel.
2022-08-10 21:18:01,850 T0000: INFO [ 303.882742] watchdog: watchdog0: watchdog did not stop!
2022-08-10 21:18:01,850 T0000: INFO [ 304.781981] pddf_custom_wdt: CPLD Watchdog did not Stop!
2022-08-10 21:18:01,850 T0000: INFO [ 304.858387] BUG: unable to handle kernel paging request at 000000000002bf20
2022-08-10 21:18:01,850 T0000: INFO [ 304.865490] PGD 0 P4D 0
2022-08-10 21:18:01,850 T0000: INFO [ 304.868159] Oops: 0010 [#1] SMP NOPTI
2022-08-10 21:18:01,851 T0000: INFO [ 304.871954] CPU: 0 PID: 1 Comm: systemd-shutdow Tainted: GF OE 4.19.0-9-2-amd64 #1 Debian 4.19.118-2+deb10u1
2022-08-10 21:18:01,851 T0000: INFO [ 304.883044] Hardware name: Celestica Belgite/Belgite, BIOS COMe-Dnvt.2.02.00 11/02/2021
2022-08-10 21:18:01,851 T0000: INFO [ 304.891183] RIP: 0010:0x2bf20
2022-08-10 21:18:01,851 T0000: INFO [ 304.894286] Code: Bad RIP value.
2022-08-10 21:18:01,851 T0000: INFO [ 304.897647] RSP: 0018:ffffa65bc063fdc0 EFLAGS: 00010246
2022-08-10 21:18:01,851 T0000: INFO [ 304.903002] RAX: 000000000002bf20 RBX: ffff8c8df8cd0418 RCX: dead000000000200
2022-08-10 21:18:01,851 T0000: INFO [ 304.910268] RDX: 0000000000000001 RSI: ffff8c8df8cd0418 RDI: ffff8c8df8cd0400
2022-08-10 21:18:01,852 T0000: INFO [ 304.917534] RBP: 0000000000000000 R08: ffff8c8dfba21900 R09: ffff8c8dfb000000
2022-08-10 21:18:01,852 T0000: INFO [ 304.924799] R10: ffff8c8dfb000028 R11: 0000000000000000 R12: ffff8c8df8cd0400
2022-08-10 21:18:01,852 T0000: INFO [ 304.932066] R13: ffffffff838c660a R14: ffff8c8df8cd0460 R15: 0000000000000000
2022-08-10 21:18:01,852 T0000: INFO [ 304.939333] FS: 00007fdf112ea940(0000) GS:ffff8c8dfba00000(0000) knlGS:0000000000000000
2022-08-10 21:18:01,852 T0000: INFO [ 304.947556] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
2022-08-10 21:18:01,852 T0000: INFO [ 304.953432] CR2: 000000000002bef6 CR3: 0000000179908000 CR4: 00000000003406f0
2022-08-10 21:18:01,853 T0000: INFO [ 304.960698] Call Trace:
2022-08-10 21:18:01,853 T0000: INFO [ 304.963282] ? device_shutdown+0x13f/0x210
2022-08-10 21:18:01,853 T0000: INFO [ 304.967510] ? kernel_restart+0xe/0x30
2022-08-10 21:18:01,853 T0000: INFO [ 304.971388] ? __do_sys_reboot+0x1cf/0x210
2022-08-10 21:18:01,853 T0000: INFO [ 304.975618] ? vfs_writev+0xc5/0x100
2022-08-10 21:18:01,853 T0000: INFO [ 304.979328] ? do_writev+0x5f/0x100
2022-08-10 21:18:01,853 T0000: INFO [ 304.982949] ? do_syscall_64+0x53/0x110
2022-08-10 21:18:01,854 T0000: INFO [ 304.986919] ? entry_SYSCALL_64_after_hwframe+0x44/0xa9
2022-08-10 21:18:01,854 T0000: INFO [ 304.992275] Modules linked in: tcp_diag(E) udp_diag(E) raw_diag(E) inet_diag(E) unix_diag(E) af_packet_diag(E) netlink_diag(E) veth(E) dummy(E) nft_chain_nat_ipv6(E) nf_nat_ipv6(E) act_police(E) cls_u32(E) ip6t_REJECT(E) nf_reject_ipv6(E) ipt_REJECT(E) nf_reject_ipv4(E) xt_tcpudp(E) sch_ingress(E) lm75(E) ebt_vlan(E) pddf_custom_wdt(FOE) pddf_custom_psu(FOE) nft_compat(E) pddf_fan_module(OE) pddf_fan_driver_module(OE) pddf_led_module(OE) nft_counter(E) pddf_xcvr_driver_module(OE) pddf_xcvr_module(OE) pddf_gpio_module(OE) pddf_psu_module(OE) pddf_psu_driver_module(OE) pddf_fpgai2c_driver(OE) pddf_fpgai2c_module(OE) pddf_cpld_driver(OE) pddf_cpld_module(OE) pddf_mux_module(OE) pddf_client_module(OE) optoe(E) mc24lc64t(OE) gpio_pca953x(E) i2c_mux_pca954x(E) i2c_mux(E) i2c_dev(E) i2c_i801(E) i2c_ismt(E)
2022-08-10 21:18:01,854 T0000: INFO [ 305.063221] bridge(E) stp(E) llc(E) nf_conntrack_netlink(E) xfrm_user(E) xfrm_algo(E) intel_rapl(E) pnd2_edac(E) x86_pkg_temp_thermal(E) intel_powerclamp(E) coretemp(E) kvm_intel(E) bonding(E) kvm(E) ramdisk(OE) evdev(E) irqbypass(E) qat_c3xxx(E) crct10dif_pclmul(E) tpm_crb(E) crc32_pclmul(E) ghash_clmulni_intel(E) intel_cstate(E) intel_rapl_perf(E) intel_qat(E) wdat_wdt(E) efi_pstore(E) pcspkr(E) sg(E) efivars(E) authenc(E) tpm_tis(E) tpm_tis_core(E) tpm(E) rng_core(E) button(E) pcc_cpufreq(E) acpi_cpufreq(E) nft_chain_nat_ipv4(E) nf_nat_ipv4(E) nf_nat(E) nf_conntrack(E) nf_defrag_ipv6(E) nf_defrag_ipv4(E) libcrc32c(E) nf_tables(E) nfnetlink(E) efivarfs(E) ip_tables(E) x_tables(E) autofs4(E) loop(E) ext4(E) crc16(E) mbcache(E) jbd2(E) crc32c_generic(E) fscrypto(E) ecb(E) nvme(E) nvme_core(E) nls_utf8(E)
2022-08-10 21:18:01,854 T0000: INFO [ 305.134697] nls_cp437(E) nls_ascii(E) vfat(E) fat(E) overlay(E) squashfs(E) zstd_decompress(E) xxhash(E) sd_mod(E) uas(E) usb_storage(E) crc32c_intel(E) aesni_intel(E) aes_x86_64(E) crypto_simd(E) ixgbe(E) cryptd(E) glue_helper(E) mdio(E) ahci(E) igb(E) libahci(E) xhci_pci(E) i2c_algo_bit(E) dca(E) xhci_hcd(E) libata(E) usbcore(E) usb_common(E) scsi_mod(E) thermal(E) [last unloaded: linux_kernel_bde]
2022-08-10 21:18:01,854 T0000: INFO [ 305.170379] rb: panic handler for ramdisk sync
2022-08-10 21:18:01,855 T0000: INFO [ 305.174954] CR2: 000000000002bf20
2022-08-10 21:18:01,855 T0000: INFO [ 305.174992] rb: not syncing ramdisk, timeout: 1489
2022-08-10 21:18:01,855 T0000: INFO [ 305.178407] ---[ end trace a0c62d0743dfbdc7 ]---
2022-08-10 21:18:01,855 T0000: INFO [ 305.184980] RIP: 0010:0x2bf20
2022-08-10 21:18:01,855 T0000: INFO [ 305.191391] Code: Bad RIP value.
2022-08-10 21:18:01,855 T0000: INFO [ 305.194752] RSP: 0018:ffffa65bc063fdc0 EFLAGS: 00010246
2022-08-10 21:18:01,855 T0000: INFO [ 305.200108] RAX: 000000000002bf20 RBX: ffff8c8df8cd0418 RCX: dead000000000200
2022-08-10 21:18:01,856 T0000: INFO [ 305.207374] RDX: 0000000000000001 RSI: ffff8c8df8cd0418 RDI: ffff8c8df8cd0400
2022-08-10 21:18:01,856 T0000: INFO [ 305.214639] RBP: 0000000000000000 R08: ffff8c8dfba21900 R09: ffff8c8dfb000000
2022-08-10 21:18:01,856 T0000: INFO [ 305.221905] R10: ffff8c8dfb000028 R11: 0000000000000000 R12: ffff8c8df8cd0400
2022-08-10 21:18:01,856 T0000: INFO [ 305.229171] R13: ffffffff838c660a R14: ffff8c8df8cd0460 R15: 0000000000000000
2022-08-10 21:18:01,856 T0000: INFO [ 305.236438] FS: 00007fdf112ea940(0000) GS:ffff8c8dfba00000(0000) knlGS:0000000000000000
2022-08-10 21:18:01,856 T0000: INFO [ 305.244658] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
2022-08-10 21:18:01,856 T0000: INFO [ 305.250534] CR2: 000000000002bef6 CR3: 0000000179908000 CR4: 00000000003406f0
2022-08-10 21:18:01,857 T0000: INFO [ 305.257801] Kernel panic - not syncing: Fatal exception
2022-08-10 21:18:01,857 T0000: INFO [ 305.263208] Kernel Offset: 0x1a00000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
2022-08-10 21:18:01,857 T0000: INFO [ 305.274033] rb: panic handler for ramdisk sync
2022-08-10 21:18:01,857 T0000: INFO [ 305.285097] Rebooting in 10 seconds..
2022-08-10 21:18:01,857 T0000: INFO [ 315.210203] ACPI MEMORY or I/O RESET_REG.
2022-08-10 21:18:01,857 T0000: INFO
2022-08-10 21:18:01,858 T0000: INFO
2022-08-10 21:18:01,858 T0000: INFO
2022-08-10 21:18:01,858 T0000: INFO Primary BIOS boot in progress...
2022-08-10 21:18:01,858 T0000: INFO
2022-08-10 21:18:01,858 T0000: INFO CPLD_C version: 0.6
2022-08-10 21:18:01,858 T0000: INFO CPLD_B version: 2.4
How I did it
Cause: the driver took the cdev pointer from the inode during
device open, which causes a memory corruption in the slub.
Fix: instead of the cdev pointer from the inode, the mdev container pointer is
taken from the file->private_data member.
Action: update the pddf_custom_wdt driver.
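
The pointer recovery described in the fix can be sketched in userspace C. This is not the driver's actual code: `struct wdt_data`, `mock_misc_open`, and `wdt_from_file` are hypothetical names, and `struct miscdevice` and `struct file` are reduced mocks of the kernel types. The sketch relies on one kernel behavior, namely that the misc core stores the `struct miscdevice` pointer in `file->private_data` before it calls the driver's open handler, so `container_of` on that pointer yields the enclosing driver structure.

```c
#include <stddef.h>

/* Userspace stand-in for the kernel's container_of(): recover the
 * enclosing struct from a pointer to one of its members. */
#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/* Reduced mock of the kernel's struct miscdevice (assumption). */
struct miscdevice {
    int minor;
    const char *name;
};

/* Hypothetical driver state that embeds the misc device. */
struct wdt_data {
    unsigned long opened;
    struct miscdevice mdev;
    int timeout;
};

/* Reduced mock of struct file: only the field this sketch needs. */
struct file {
    void *private_data;
};

/* Mimics what the misc core does before calling the driver's open():
 * it stores the miscdevice pointer in file->private_data. */
static void mock_misc_open(struct file *file, struct miscdevice *mdev)
{
    file->private_data = mdev;
}

/* Driver-side lookup: derive the driver state from file->private_data
 * rather than from any pointer held in the inode. */
static struct wdt_data *wdt_from_file(struct file *file)
{
    struct miscdevice *mdev = file->private_data;

    return container_of(mdev, struct wdt_data, mdev);
}
```

Because the pointer passed to `container_of` really is the `mdev` member embedded in `struct wdt_data`, the subtraction lands on the start of the driver structure. Applying the same macro to a pointer that is not embedded in that structure would yield an address outside the allocation, and writes through it would corrupt neighbouring slab objects.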
How to verify it
Run a reboot stress test and check whether any kernel dump occurs during the reboot process.
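
One way to check captured console output from such a stress run is to scan each line for the crash signatures seen in the log above. The helper below is a minimal sketch under that assumption; `line_has_crash` and the signature list are illustrative and not part of the PR or of any SONiC tool.

```c
#include <stddef.h>
#include <string.h>

/* Signatures copied from the crash log in this PR description. */
static const char *const crash_signatures[] = {
    "BUG: unable to handle kernel paging request",
    "Oops:",
    "Kernel panic - not syncing",
};

/* Return 1 if the console line contains a crash signature, else 0. */
static int line_has_crash(const char *line)
{
    size_t i;
    size_t count = sizeof(crash_signatures) / sizeof(crash_signatures[0]);

    for (i = 0; i < count; i++) {
        if (strstr(line, crash_signatures[i]) != NULL)
            return 1;
    }
    return 0;
}
```

A test harness could feed every line of the serial console capture through `line_has_crash` and fail the run on the first match.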