@sbonds
Last active May 2, 2020 00:27
OEL UEK4 Azure crash kernel failure reproduction steps

OEL UEK4 Azure crash kernel failure

Recreate 7.4 VM

Add a resource

Search marketplace for "oracle linux"

Use "Oracle Linux"

Choose Oracle Linux 7.4

This gets the UEK4 kernel series.

VM size: 2GiB RAM or more

Smaller VMs won't have enough RAM for crashkernel=auto to allocate anything for the crash kernel.

Run yum update

This updates the VM to OEL 7.8 and pulls in a newer UEK4 kernel (4.1.12-124.38.1.el7uek in the output below).

# yum update -y
... lots of updates ...
(138/201): kernel-uek-4.1.12-124.38.1.el7uek.x86_64.rpm    |  45 MB   00:03
...

Change the grub config

Make the /etc/default/grub file look like:

GRUB_TIMEOUT=30
GRUB_DISTRIBUTOR="$(sed 's, release .*$,,g' /etc/system-release)"
GRUB_DEFAULT=saved
GRUB_DISABLE_SUBMENU=true
GRUB_CMDLINE_LINUX_DEFAULT="crashkernel=auto console=tty0 console=ttyS0,115200n8"
GRUB_TERMINAL="console serial"
GRUB_SERIAL_COMMAND="serial --speed=115200 --unit=0 --word=8 --parity=no --stop=1"
GRUB_DISABLE_RECOVERY=true

Original contents, for reference:

GRUB_TIMEOUT=1
GRUB_DISTRIBUTOR="$(sed 's, release .*$,,g' /etc/system-release)"
GRUB_DEFAULT=saved
GRUB_DISABLE_SUBMENU=true
GRUB_TERMINAL_OUTPUT="console"
GRUB_CMDLINE_LINUX="console=tty1 console=ttyS0,115200n8 earlyprintk=ttyS0,115200 rootdelay=300 net.ifnames=0"
GRUB_DISABLE_RECOVERY="true"

Update grub

# grub2-mkconfig -o /boot/grub2/grub.cfg
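Before rebooting, it can help to confirm that the regenerated grub.cfg actually picked up the crashkernel parameter. The following is a minimal sketch; the function name count_crashkernel_lines and its optional path argument are illustrative and not part of the original steps.

```shell
# Count the lines in grub.cfg that carry a crashkernel= parameter.
# The optional first argument is a hypothetical override so the check
# can be pointed at a copy of the file; by default it reads the path
# written by grub2-mkconfig above.
count_crashkernel_lines() {
    grep -c 'crashkernel=' "${1:-/boot/grub2/grub.cfg}"
}
```

Running count_crashkernel_lines as root with no argument should print a non-zero count if the edit to /etc/default/grub took effect; a count of 0 suggests the file was not regenerated.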

Install or update kexec-tools and enable kdump

# yum install -y kexec-tools
# systemctl enable kdump

Reboot

Restart the VM using the "Restart" option on the VM overview page in the Azure Portal.
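After the restart, it may be worth confirming that the new kernel command line took effect and that memory was actually reserved, since crashkernel=auto reserves nothing on VMs that are too small. The sketch below reads /proc/cmdline and /sys/kernel/kexec_crash_size; the function names and the optional path arguments are illustrative only.

```shell
# Print the crashkernel= parameter from the running kernel's command line.
# Returns non-zero if the parameter is absent.
crashkernel_param() {
    tr ' ' '\n' < "${1:-/proc/cmdline}" | grep '^crashkernel='
}

# Print the number of bytes reserved for the crash kernel.
# A value of 0 means no memory was reserved.
crash_reservation_bytes() {
    cat "${1:-/sys/kernel/kexec_crash_size}"
}
```

Called with no arguments, crashkernel_param should print crashkernel=auto on this VM, and crash_reservation_bytes should print a non-zero number if the reservation succeeded.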

Ensure kdump started

# systemctl status -l kdump
● kdump.service - Crash recovery kernel arming
   Loaded: loaded (/usr/lib/systemd/system/kdump.service; enabled; vendor preset: enabled)
   Active: active (exited) since Fri 2020-05-01 23:28:49 UTC; 1min 11s ago
  Process: 735 ExecStart=/usr/bin/kdumpctl start (code=exited, status=0/SUCCESS)
 Main PID: 735 (code=exited, status=0/SUCCESS)

May 01 23:28:45 oel-74-uek4-kdump-test dracut[1047]: drwxr-xr-x   1 root     root            0 May  1 23:28 usr/share/zoneinfo
May 01 23:28:45 oel-74-uek4-kdump-test dracut[1047]: -rw-r--r--   1 root     root          118 Apr 30 14:55 usr/share/zoneinfo/UTC
May 01 23:28:45 oel-74-uek4-kdump-test dracut[1047]: drwxr-xr-x   1 root     root            0 May  1 23:28 var
May 01 23:28:45 oel-74-uek4-kdump-test dracut[1047]: lrwxrwxrwx   1 root     root           11 May  1 23:28 var/lock -> ../run/lock
May 01 23:28:45 oel-74-uek4-kdump-test dracut[1047]: lrwxrwxrwx   1 root     root            6 May  1 23:28 var/run -> ../run
May 01 23:28:45 oel-74-uek4-kdump-test dracut[1047]: ========================================================================
May 01 23:28:45 oel-74-uek4-kdump-test dracut[1047]: *** Creating initramfs image file '/boot/initramfs-4.1.12-124.38.1.el7uek.x86_64kdump.img' done ***
May 01 23:28:49 oel-74-uek4-kdump-test kdumpctl[735]: kexec: loaded kdump kernel
May 01 23:28:49 oel-74-uek4-kdump-test kdumpctl[735]: Starting kdump: [OK]
May 01 23:28:49 oel-74-uek4-kdump-test systemd[1]: Started Crash recovery kernel arming.
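As a cross-check on the systemctl output, the kernel itself reports whether a crash kernel is loaded through /sys/kernel/kexec_crash_loaded. A minimal sketch follows; the function name and the optional path argument are illustrative, not something the original steps used.

```shell
# Succeed (exit 0) if the kernel reports a loaded crash kernel.
# The sysfs file contains 1 when a kdump kernel is loaded, 0 otherwise.
kdump_kernel_loaded() {
    [ "$(cat "${1:-/sys/kernel/kexec_crash_loaded}")" = "1" ]
}
```

For example, `kdump_kernel_loaded && echo loaded` should print "loaded" when the state matches the "kexec: loaded kdump kernel" line in the log above.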

Record kernel version and kdump status to console

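Write the kernel version and kdump status to /dev/console so that they show up in the serial log. It also helps to watch the serial console while doing this.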

# uname -r > /dev/console
# systemctl status -l kdump > /dev/console

Trigger NMI and record serial output

[  381.383344] Uhhuh. NMI received for unknown reason 21 on CPU 0.
[  381.383344] Do you have a strange power saving mode enabled?
[  381.383344] Dazed and confused, but trying to continue

That makes sense: I never configured Linux to panic on an unknown NMI.

ADDITIONAL STEP: Configure Linux to panic on unknown NMI

Add these lines to /etc/sysctl.conf:

kernel.unknown_nmi_panic=1
kernel.panic_on_unrecovered_nmi=1
kernel.sysrq=1

Load the new settings:

# sysctl -p

Then reboot from the Azure portal again.
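Since the first attempt failed because of a missing setting, it may be worth checking after the reboot that the three values persisted. The sketch below reads them from /proc/sys/kernel; the function name nmi_panic_settings and the optional directory argument are illustrative only.

```shell
# Print the current values of the three sysctl keys set above,
# in the same "key = value" form that sysctl uses.
# The optional first argument is a hypothetical override for the
# directory to read from; it defaults to the live /proc/sys/kernel.
nmi_panic_settings() {
    dir="${1:-/proc/sys/kernel}"
    for key in unknown_nmi_panic panic_on_unrecovered_nmi sysrq; do
        printf 'kernel.%s = %s\n' "$key" "$(cat "$dir/$key")"
    done
}
```

All three lines should show a value of 1 before the second NMI is triggered.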

Second try: Record kernel version and kdump status to console

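As before, write the kernel version and kdump status to /dev/console so that they show up in the serial log, and watch the serial console while doing this.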

# uname -r > /dev/console
# systemctl status -l kdump > /dev/console

Second try: Trigger NMI and record serial output

[  189.646369] BUG: unable to handle kernel NULL pointer dereference at           (null)
[  189.646369] IP: [<          (null)>]           (null)
[  189.646369] PGD 0
[  189.646369] Oops: 0010 [#1] SMP
[  189.646369] Modules linked in: nf_conntrack_ipv4 nf_defrag_ipv4 xt_owner xt_conntrack nf_conntrack iptable_security ext4 jbd2 mbcache2 xfs crct10dif_pclmul crc32_pclmul libcrc32c ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd sg hv_balloon pcspkr i2c_piix4 acpi_cpufreq i2c_core ip_tables btrfs xor raid6_pq sd_mod hv_netvsc ata_generic pata_acpi hyperv_keyboard hv_utils hv_storvsc hid_hyperv hyperv_fb crc32c_intel ata_piix serio_raw libata hv_vmbus floppy
[  189.646369] CPU: 0 PID: 7020 Comm: sshd Tainted: G        W        4.1.12-124.38.1.el7uek.x86_64 #2
[  189.646369] Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS 090007  06/02/2017
[  189.646369] task: ffff880131542a00 ti: ffff88002140c000 task.ti: ffff88002140c000
[  189.646369] RIP: 0010:[<0000000000000000>]  [<          (null)>]           (null)
[  189.646369] RSP: 0018:ffff88013b603d70  EFLAGS: 00010046
[  189.646369] RAX: 0000000000000000 RBX: ffff88013b618140 RCX: 0000002c27cdf1be
[  189.646369] RDX: 0000000000000005 RSI: ffffffff81b4f480 RDI: ffff88013b618140
[  189.646369] RBP: ffff88013b603d98 R08: 0000000000000000 R09: 0000000000000101
[  189.646369] R10: 00000000006f8fe9 R11: 0000000000001f20 R12: ffffffff81b4f480
[  189.646369] R13: 0000000000000005 R14: 0000000000000046 R15: 0000000000000000
[  189.646369] FS:  00007f552c1e98c0(0000) GS:ffff88013b600000(0000) knlGS:0000000000000000
[  189.646369] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  189.646369] CR2: 0000000000000000 CR3: 0000000021402000 CR4: 0000000000360670
[  189.646369] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  189.646369] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  189.646369] Stack:
[  189.646369]  ffffffff810b4c64 ffff88013b603d98 ffffffff81b4f480 ffff88013b618140
[  189.646369]  ffffffff81b4ff34 ffff88013b603da8 ffffffff810ba0b3 ffff88013b603dc8
[  189.646369]  ffffffff810ba3f3 ffffffff81b4f480 ffff88013b618140 ffff88013b603e28
[  189.646369] Call Trace:
[  189.646369]  <IRQ>
[  189.646369]  [<ffffffff810b4c64>] ? enqueue_task+0x54/0x90
[  189.646369]  [<ffffffff810ba0b3>] activate_task+0x23/0x30
[  189.646369]  [<ffffffff810ba3f3>] ttwu_do_activate.constprop.90+0x33/0x70
[  189.646369]  [<ffffffff810bd227>] try_to_wake_up+0x1c7/0x390
[  189.646369]  [<ffffffff810bd472>] default_wake_function+0x12/0x20
[  189.646369]  [<ffffffff810d3deb>] __wake_up_common+0x5b/0x90
[  189.646369]  [<ffffffff810d3e33>] __wake_up_locked+0x13/0x20
[  189.646369]  [<ffffffff810d46db>] complete+0x3b/0x60
[  189.646369]  [<ffffffffc0021225>] vmbus_unload_response+0x15/0x20 [hv_vmbus]
[  189.646369]  [<ffffffffc001e07f>] vmbus_on_msg_dpc+0x17f/0x210 [hv_vmbus]
[  189.646369]  [<ffffffff81091020>] tasklet_action+0x130/0x140
[  189.646369]  [<ffffffff81091320>] __do_softirq+0x100/0x320
[  189.646369]  [<ffffffff8175f3bc>] do_softirq_own_stack+0x1c/0x30
[  189.646369]  <EOI>
[  189.646369]  [<ffffffff810915e5>] do_softirq+0x55/0x60
[  189.646369]  [<ffffffff8109167b>] __local_bh_enable_ip+0x8b/0xa0
[  189.646369]  [<ffffffff816153d7>] lock_sock_nested+0x47/0x60
[  189.646369]  [<ffffffff81683455>] tcp_sendmsg+0x35/0xb70
[  189.646369]  [<ffffffff812d3320>] ? sock_has_perm+0x70/0x90
[  189.646369]  [<ffffffff816b01aa>] inet_sendmsg+0x6a/0xb0
[  189.646369]  [<ffffffff812d3453>] ? selinux_socket_sendmsg+0x23/0x30
[  189.646369]  [<ffffffff81612323>] sock_sendmsg+0x43/0x50
[  189.646369]  [<ffffffff816123b5>] sock_write_iter+0x85/0xf0
[  189.646369]  [<ffffffff8121fcac>] __vfs_write+0xdc/0x130
[  189.646369]  [<ffffffff81220379>] vfs_write+0xa9/0x1b0
[  189.646369]  [<ffffffff81102c82>] ? ktime_get_with_offset+0x52/0xb0
[  189.646369]  [<ffffffff81221265>] SyS_write+0x55/0xd0
[  189.646369]  [<ffffffff810ff801>] ? SyS_clock_gettime+0x91/0xd0
[  189.646369]  [<ffffffff8175a7b6>] system_call_fastpath+0x18/0xee
[  189.646369] Code:  Bad RIP value.
[  189.646369] RIP  [<          (null)>]           (null)
[  189.646369]  RSP <ffff88013b603d70>
[  189.646369] CR2: 0000000000000000
[  189.646369] ---[ end trace 5aa01d0606c737ee ]---
[  189.646369] Kernel panic - not syncing: Fatal exception in interrupt
[  189.646369] Kernel Offset: disabled
[  189.646369] ---[ end Kernel panic - not syncing: Fatal exception in interrupt

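The crash kernel never starts. Note that the call trace above passes through vmbus_unload_response in the hv_vmbus module before the NULL pointer dereference.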
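One way to confirm that no dump was written is to look for vmcore files once the VM is back up. This sketch assumes the default kdump target directory /var/crash (these steps do not change /etc/kdump.conf); the function name and the optional directory argument are illustrative.

```shell
# List any vmcore files found under the kdump target directory.
# Prints nothing if the crash kernel never ran or never wrote a dump.
list_vmcores() {
    find "${1:-/var/crash}" -type f -name 'vmcore*' 2>/dev/null
}
```

With the failure described here, list_vmcores would be expected to print nothing.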

Collect sosreport

# sosreport
...