Skip to content

Instantly share code, notes, and snippets.

@Millnert
Last active February 28, 2017 13:18
Show Gist options
  • Save Millnert/ecc10d8cc79c81b55d7f to your computer and use it in GitHub Desktop.
Save Millnert/ecc10d8cc79c81b55d7f to your computer and use it in GitHub Desktop.
Stack trace and source code references from a RHEL 7.1 NFS lock bug
Apr 18 21:01:26 tsm1 kernel: [<ffffffffa07045a0>] ? nfs_pageio_doio+0x50/0x50 [nfs] http://lxr.free-electrons.com/source/fs/nfs/pagelist.c?v=3.10#L394
Apr 18 21:01:26 tsm1 kernel: [<ffffffff8160954d>] io_schedule+0x9d/0x130 http://lxr.free-electrons.com/source/kernel/sched/core.c?v=3.10#L4512
Apr 18 21:01:26 tsm1 kernel: [<ffffffffa07045ae>] nfs_wait_bit_uninterruptible+0xe/0x20 [nfs] http://lxr.free-electrons.com/source/fs/nfs/pagelist.c?v=3.10#L261
Apr 18 21:01:26 tsm1 kernel: [<ffffffff81607320>] __wait_on_bit+0x60/0x90 http://lxr.free-electrons.com/source/kernel/sched/wait.c#L387
Apr 18 21:01:26 tsm1 kernel: [<ffffffffa07045a0>] ? nfs_pageio_doio+0x50/0x50 [nfs] http://lxr.free-electrons.com/source/fs/nfs/pagelist.c?v=3.10#L394
Apr 18 21:01:26 tsm1 kernel: [<ffffffff816073d7>] out_of_line_wait_on_bit+0x87/0xb0 http://lxr.free-electrons.com/source/kernel/wait.c?v=3.10#L209
Apr 18 21:01:26 tsm1 kernel: [<ffffffff81098260>] ? autoremove_wake_function+0x40/0x40 http://lxr.free-electrons.com/source/kernel/wait.c?v=3.10#L163
Apr 18 21:01:26 tsm1 kernel: [<ffffffffa0705ad3>] nfs_wait_on_request+0x33/0x40 [nfs] http://lxr.free-electrons.com/source/fs/nfs/pagelist.c?v=3.10#L275
Apr 18 21:01:26 tsm1 kernel: [<ffffffffa070a351>] nfs_updatepage+0x121/0x8a0 [nfs] http://lxr.free-electrons.com/source/fs/nfs/write.c?v=3.10#L897
-> nfs_writepage_setup @ http://lxr.free-electrons.com/source/fs/nfs/write.c?v=3.10#L828
-> nfs_setup_write_request via http://lxr.free-electrons.com/source/fs/nfs/write.c?v=3.10#L833
func @ http://lxr.free-electrons.com/source/fs/nfs/write.c?v=3.10#L811
-> nfs_mark_uptodate via http://lxr.free-electrons.com/source/fs/nfs/write.c?v=3.10#L838
func @ http://lxr.free-electrons.com/source/fs/nfs/write.c?v=3.10#L215
-> nfs_mark_request_dirty via http://lxr.free-electrons.com/source/fs/nfs/write.c?v=3.10#L839
func @ http://lxr.free-electrons.com/source/fs/nfs/write.c?v=3.10#L466
-> nfs_unlock_and_release_request via http://lxr.free-electrons.com/source/fs/nfs/write.c?v=3.10#L840
func @ http://lxr.free-electrons.com/source/fs/nfs/pagelist.c?v=3.10#L206
Apr 18 21:01:26 tsm1 kernel: [<ffffffffa06fac11>] nfs_write_end+0x121/0x350 [nfs] http://lxr.free-electrons.com/source/fs/nfs/file.c?v=3.10#L404
Apr 18 21:01:26 tsm1 kernel: [<ffffffff811564e4>] generic_file_buffered_write+0x184/0x290 http://lxr.free-electrons.com/source/mm/filemap.c?v=3.10#L2393
via http://lxr.free-electrons.com/source/mm/filemap.c?v=3.10#L2301
via write_end: http://lxr.free-electrons.com/source/mm/filemap.c?v=3.10#L2357
mapped as http://lxr.free-electrons.com/source/fs/nfs/file.c?v=3.10#L537 to nfs_write_end
Apr 18 21:01:26 tsm1 kernel: [<ffffffff811585e5>] __generic_file_aio_write+0x1d5/0x3e0 http://lxr.free-electrons.com/source/mm/filemap.c?v=3.10#L2432
does not do O_DIRECT, but generic_file_buffered_write
Apr 18 21:01:26 tsm1 kernel: [<ffffffff8115884d>] generic_file_aio_write+0x5d/0xc0 http://lxr.free-electrons.com/source/mm/filemap.c?v=3.10#L2540
Apr 18 21:01:26 tsm1 kernel: [<ffffffffa06f9d1b>] nfs_file_write+0xbb/0x1d0 [nfs] via http://lxr.free-electrons.com/source/fs/nfs/file.c?v=3.10#L925 to
http://lxr.free-electrons.com/source/fs/nfs/file.c?v=3.10#L613
does *not* do O_DIRECT, O_APPEND write from here, but rather: generic_file_aio_write
Apr 18 21:01:26 tsm1 kernel: [<ffffffff811c5e2d>] do_sync_write+0x8d/0xd0 http://lxr.free-electrons.com/source/fs/read_write.c?v=3.10#L383
Apr 18 21:01:26 tsm1 kernel: [<ffffffff811c65cd>] vfs_write+0xbd/0x1e0 http://lxr.free-electrons.com/source/fs/read_write.c?v=3.10#L430
Apr 18 21:01:26 tsm1 kernel: [<ffffffff811c60b4>] ? generic_file_llseek+0x24/0x30 http://lxr.free-electrons.com/source/fs/read_write.c?v=3.10#L137
Apr 18 21:01:26 tsm1 kernel: [<ffffffff811c7018>] SyS_write+0x58/0xb0
Apr 18 21:01:26 tsm1 kernel: [<ffffffff81613da9>] system_call_fastpath+0x16/0x1b http://lxr.free-electrons.com/source/arch/x86/kernel/entry_64.S?v=3.10#L635
@dansomers
Copy link

Seeing this on RHEL7.1 in a scenario with a single client writing to NFS (bowtie2 alignments). Several alignments in a row will complete, but then we hit this bug and the bowtie2 process zombifies as a result. I see that last update here was September of last year. Anyone else still having this problem? If not, what was the fix?

@dansomers
Copy link

Issue remains on RHEL7.2 after NFS-utils update.

Feb 17 16:37:00 host kernel: INFO: task bowtie2-align-s:26812 blocked for more than 120 seconds.
Feb 17 16:37:00 host kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Feb 17 16:37:00 host kernel: bowtie2-align-s D ffff88081ac53b00 0 26812 26805 0x00000080
Feb 17 16:37:00 host kernel: ffff88081ac539a0 0000000000000086 ffff8808263d6780 ffff88081ac53fd8
Feb 17 16:37:00 host kernel: ffff88081ac53fd8 ffff88081ac53fd8 ffff8808263d6780 ffff88082f094780
Feb 17 16:37:00 host kernel: 0000000000000000 7fffffffffffffff ffffffffa057dbc0 ffff88081ac53b00
Feb 17 16:37:00 host kernel: Call Trace:
Feb 17 16:37:00 host kernel: [] ? nfs_pgio_header_free+0x20/0x20 [nfs]
Feb 17 16:37:00 host kernel: [] schedule+0x29/0x70
Feb 17 16:37:00 host kernel: [] schedule_timeout+0x209/0x2d0
Feb 17 16:37:00 host kernel: [] ? native_sched_clock+0x35/0x80
Feb 17 16:37:00 host kernel: [] ? read_tsc+0x9/0x10
Feb 17 16:37:00 host kernel: [] ? nfs_pgio_header_free+0x20/0x20 [nfs]
Feb 17 16:37:00 host kernel: [] io_schedule_timeout+0xae/0x130
Feb 17 16:37:00 host kernel: [] io_schedule+0x18/0x20
Feb 17 16:37:00 host kernel: [] nfs_wait_bit_uninterruptible+0xe/0x20 [nfs]
Feb 17 16:37:00 host kernel: [] __wait_on_bit+0x60/0x90
Feb 17 16:37:00 host kernel: [] ? nfs_pgio_header_free+0x20/0x20 [nfs]
Feb 17 16:37:00 host kernel: [] out_of_line_wait_on_bit+0x87/0xb0
Feb 17 16:37:00 host kernel: [] ? wake_atomic_t_function+0x40/0x40
Feb 17 16:37:00 host kernel: [] nfs_wait_on_request+0x33/0x40 [nfs]
Feb 17 16:37:00 host kernel: [] nfs_updatepage+0x121/0x8a0 [nfs]
Feb 17 16:37:00 host kernel: [] nfs_write_end+0x121/0x350 [nfs]
Feb 17 16:37:00 host kernel: [] generic_file_buffered_write+0x184/0x290
Feb 17 16:37:00 host kernel: [] __generic_file_aio_write+0x1d5/0x3e0
Feb 17 16:37:00 host kernel: [] generic_file_aio_write+0x5d/0xc0
Feb 17 16:37:00 host kernel: [] nfs_file_write+0xbb/0x1d0 [nfs]
Feb 17 16:37:00 host kernel: [] do_sync_write+0x8d/0xd0
Feb 17 16:37:00 host kernel: [] vfs_write+0xbd/0x1e0
Feb 17 16:37:00 host kernel: [] SyS_write+0x7f/0xe0
Feb 17 16:37:00 host kernel: [] symev_write+0x43/0x80 [symev_rh_ES_7_3_10_0_123_el7_x86_64]

@mmarcrr
Copy link

mmarcrr commented Jul 13, 2016

I have 10 machines that every day have this error

Jul 10 14:28:35 hpcnode004 kernel: INFO: task srun:20481 blocked for more than 120 seconds.
Jul 10 14:28:35 hpcnode004 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jul 10 14:28:35 hpcnode004 kernel: srun D ffff8810115bbb30 0 20481 20435 0x00000084
Jul 10 14:28:35 hpcnode004 kernel: ffff8810115bb9d0 0000000000000082 ffff881014a49700 ffff8810115bbfd8
Jul 10 14:28:35 hpcnode004 kernel: ffff8810115bbfd8 ffff8810115bbfd8 ffff881014a49700 ffff88103fc147c0
Jul 10 14:28:35 hpcnode004 kernel: 0000000000000000 7fffffffffffffff ffffffffa0630bb0 ffff8810115bbb30
Jul 10 14:28:35 hpcnode004 kernel: Call Trace:
Jul 10 14:28:35 hpcnode004 kernel: [] ? nfs_pgio_header_free+0x20/0x20 [nfs]
Jul 10 14:28:35 hpcnode004 kernel: [] schedule+0x29/0x70
Jul 10 14:28:35 hpcnode004 kernel: [] schedule_timeout+0x209/0x2d0
Jul 10 14:28:35 hpcnode004 kernel: [] ? dev_hard_start_xmit+0x171/0x3b0
Jul 10 14:28:35 hpcnode004 kernel: [] ? netif_skb_features+0xc7/0x210
Jul 10 14:28:35 hpcnode004 kernel: [] ? read_tsc+0x9/0x10
Jul 10 14:28:35 hpcnode004 kernel: [] ? nfs_pgio_header_free+0x20/0x20 [nfs]
Jul 10 14:28:35 hpcnode004 kernel: [] io_schedule_timeout+0xae/0x130
Jul 10 14:28:35 hpcnode004 kernel: [] io_schedule+0x18/0x20
Jul 10 14:28:35 hpcnode004 kernel: [] nfs_wait_bit_uninterruptible+0xe/0x20 [nfs]
Jul 10 14:28:35 hpcnode004 kernel: [] __wait_on_bit+0x60/0x90
Jul 10 14:28:35 hpcnode004 kernel: [] ? nfs_pgio_header_free+0x20/0x20 [nfs]
Jul 10 14:28:35 hpcnode004 kernel: [] out_of_line_wait_on_bit+0x87/0xb0
Jul 10 14:28:35 hpcnode004 kernel: [] ? wake_atomic_t_function+0x40/0x40
Jul 10 14:28:35 hpcnode004 kernel: [] nfs_wait_on_request+0x33/0x40 [nfs]
Jul 10 14:28:35 hpcnode004 kernel: [] nfs_updatepage+0x121/0x8a0 [nfs]
Jul 10 14:28:35 hpcnode004 kernel: [] nfs_write_end+0x121/0x350 [nfs]
Jul 10 14:28:35 hpcnode004 kernel: [] generic_file_buffered_write+0x184/0x290
Jul 10 14:28:35 hpcnode004 kernel: [] __generic_file_aio_write+0x1d5/0x3e0
Jul 10 14:28:35 hpcnode004 kernel: [] ? inet_recvmsg+0x7b/0xa0
Jul 10 14:28:35 hpcnode004 kernel: [] generic_file_aio_write+0x5d/0xc0
Jul 10 14:28:35 hpcnode004 kernel: [] nfs_file_write+0xbb/0x1d0 [nfs]
Jul 10 14:28:35 hpcnode004 kernel: [] do_sync_write+0x8d/0xd0
Jul 10 14:28:35 hpcnode004 kernel: [] vfs_write+0xbd/0x1e0
Jul 10 14:28:35 hpcnode004 kernel: [] SyS_write+0x7f/0xe0
Jul 10 14:28:35 hpcnode004 kernel: [] system_call_fastpath+0x16/0x1b
Jul 10 14:28:35 hpcnode004 kernel: INFO: task srun:20502 blocked for more than 120 seconds.
Jul 10 14:28:35 hpcnode004 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jul 10 14:28:35 hpcnode004 kernel: srun D ffff880e98787b30 0 20502 20436 0x00000084
Jul 10 14:28:35 hpcnode004 kernel: ffff880e987879d0 0000000000000086 ffff88102303c500 ffff880e98787fd8
Jul 10 14:28:35 hpcnode004 kernel: ffff880e98787fd8 ffff880e98787fd8 ffff88102303c500 ffff88103fc147c0
Jul 10 14:28:35 hpcnode004 kernel: 0000000000000000 7fffffffffffffff ffffffffa0630bb0 ffff880e98787b30
Jul 10 14:28:35 hpcnode004 kernel: Call Trace:
Jul 10 14:28:35 hpcnode004 kernel: [] ? nfs_pgio_header_free+0x20/0x20 [nfs]
Jul 10 14:28:35 hpcnode004 kernel: [] schedule+0x29/0x70
Jul 10 14:28:35 hpcnode004 kernel: [] schedule_timeout+0x209/0x2d0
Jul 10 14:28:35 hpcnode004 kernel: [] ? dev_hard_start_xmit+0x171/0x3b0
Jul 10 14:28:35 hpcnode004 kernel: [] ? netif_skb_features+0xc7/0x210
Jul 10 14:28:35 hpcnode004 kernel: [] ? read_tsc+0x9/0x10
Jul 10 14:28:35 hpcnode004 kernel: [] ? nfs_pgio_header_free+0x20/0x20 [nfs]
Jul 10 14:28:35 hpcnode004 kernel: [] io_schedule_timeout+0xae/0x130
Jul 10 14:28:35 hpcnode004 kernel: [] io_schedule+0x18/0x20
Jul 10 14:28:35 hpcnode004 kernel: [] nfs_wait_bit_uninterruptible+0xe/0x20 [nfs]
Jul 10 14:28:35 hpcnode004 kernel: [] __wait_on_bit+0x60/0x90
Jul 10 14:28:35 hpcnode004 kernel: [] ? nfs_pgio_header_free+0x20/0x20 [nfs]
Jul 10 14:28:35 hpcnode004 kernel: [] out_of_line_wait_on_bit+0x87/0xb0
Jul 10 14:28:35 hpcnode004 kernel: [] ? wake_atomic_t_function+0x40/0x40
Jul 10 14:28:35 hpcnode004 kernel: [] nfs_wait_on_request+0x33/0x40 [nfs]
Jul 10 14:28:35 hpcnode004 kernel: [] nfs_updatepage+0x121/0x8a0 [nfs]
Jul 10 14:28:35 hpcnode004 kernel: [] nfs_write_end+0x121/0x350 [nfs]
Jul 10 14:28:35 hpcnode004 kernel: [] generic_file_buffered_write+0x184/0x290
Jul 10 14:28:35 hpcnode004 kernel: [] __generic_file_aio_write+0x1d5/0x3e0
Jul 10 14:28:35 hpcnode004 kernel: [] ? inet_recvmsg+0x7b/0xa0
Jul 10 14:28:35 hpcnode004 kernel: [] generic_file_aio_write+0x5d/0xc0
Jul 10 14:28:35 hpcnode004 kernel: [] nfs_file_write+0xbb/0x1d0 [nfs]
Jul 10 14:28:35 hpcnode004 kernel: [] do_sync_write+0x8d/0xd0
Jul 10 14:28:35 hpcnode004 kernel: [] vfs_write+0xbd/0x1e0
Jul 10 14:28:35 hpcnode004 kernel: [] SyS_write+0x7f/0xe0

@gniuk
Copy link

gniuk commented Jan 12, 2017

I have a similar issue(and this issue has occured several times), a process hanging on NFS writing, the process came into D state(uninterruptable sleep state) and can't be killed.

  1. The hanging process's stack is as follow:
    [] nfs_wait_bit_uninterruptible+0xe/0x20 [nfs]
    [] nfs_wait_on_request+0x33/0x40 [nfs]
    [] nfs_updatepage+0x121/0x8a0 [nfs]
    [] nfs_write_end+0x121/0x350 [nfs]
    [] generic_file_buffered_write+0x184/0x290
    [] __generic_file_aio_write+0x1d5/0x3e0
    [] generic_file_aio_write+0x5d/0xc0
    [] nfs_file_write+0xbb/0x1d0 [nfs]
    [] do_sync_write+0x8d/0xd0
    [] vfs_write+0xbd/0x1e0
    [] SyS_write+0x7f/0xe0
    [] system_call_fastpath+0x16/0x1b
    [] 0xffffffffffffffff

  2. The process status is as follow:
    Name: python
    State: D (disk sleep)
    Tgid: 12585
    Ngid: 13108
    Pid: 12585
    PPid: 1
    TracerPid: 0
    Uid: 1001 1001 1001 1001
    Gid: 1001 1001 1001 1001
    FDSize: 64
    Groups: 10 1001
    VmPeak: 251223144 kB
    VmSize: 251051100 kB
    VmLck: 0 kB
    VmPin: 0 kB
    VmHWM: 23502332 kB
    VmRSS: 148884 kB
    VmData: 249587980 kB
    VmStk: 136 kB
    VmExe: 4 kB
    VmLib: 409248 kB
    VmPTE: 47444 kB
    VmSwap: 22780916 kB
    Threads: 1
    SigQ: 10/256532
    SigPnd: 0000000000040100
    ShdPnd: 0000000000004322
    SigBlk: 0000000000000000
    SigIgn: 0000000001381000
    SigCgt: 0000000180000202
    CapInh: 0000000000000000
    CapPrm: 0000000000000000
    CapEff: 0000000000000000
    CapBnd: 0000001fffffffff
    Seccomp: 0
    Cpus_allowed: ffffff,ffffffff
    Cpus_allowed_list: 0-55
    Mems_allowed: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000003
    Mems_allowed_list: 0-1
    voluntary_ctxt_switches: 2022575
    nonvoluntary_ctxt_switches: 134484

  3. The kernel version is RHEL7.2.1511 3.10.0-327.22.2.el7.x86_64

  4. Someone already created an issue on RedHat a week ago, but not yet solved officially.
    https://access.redhat.com/solutions/2245341

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment