Skip to content

Instantly share code, notes, and snippets.

@Millnert
Last active February 28, 2017 13:18
Show Gist options
  • Star 1 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save Millnert/ecc10d8cc79c81b55d7f to your computer and use it in GitHub Desktop.
Save Millnert/ecc10d8cc79c81b55d7f to your computer and use it in GitHub Desktop.
Stack trace and source code references from a RHEL 7.1 NFS lock bug
Apr 18 21:01:26 tsm1 kernel: [<ffffffffa07045a0>] ? nfs_pageio_doio+0x50/0x50 [nfs] http://lxr.free-electrons.com/source/fs/nfs/pagelist.c?v=3.10#L394
Apr 18 21:01:26 tsm1 kernel: [<ffffffff8160954d>] io_schedule+0x9d/0x130 http://lxr.free-electrons.com/source/kernel/sched/core.c?v=3.10#L4512
Apr 18 21:01:26 tsm1 kernel: [<ffffffffa07045ae>] nfs_wait_bit_uninterruptible+0xe/0x20 [nfs] http://lxr.free-electrons.com/source/fs/nfs/pagelist.c?v=3.10#L261
Apr 18 21:01:26 tsm1 kernel: [<ffffffff81607320>] __wait_on_bit+0x60/0x90 http://lxr.free-electrons.com/source/kernel/sched/wait.c#L387
Apr 18 21:01:26 tsm1 kernel: [<ffffffffa07045a0>] ? nfs_pageio_doio+0x50/0x50 [nfs] http://lxr.free-electrons.com/source/fs/nfs/pagelist.c?v=3.10#L394
Apr 18 21:01:26 tsm1 kernel: [<ffffffff816073d7>] out_of_line_wait_on_bit+0x87/0xb0 http://lxr.free-electrons.com/source/kernel/wait.c?v=3.10#L209
Apr 18 21:01:26 tsm1 kernel: [<ffffffff81098260>] ? autoremove_wake_function+0x40/0x40 http://lxr.free-electrons.com/source/kernel/wait.c?v=3.10#L163
Apr 18 21:01:26 tsm1 kernel: [<ffffffffa0705ad3>] nfs_wait_on_request+0x33/0x40 [nfs] http://lxr.free-electrons.com/source/fs/nfs/pagelist.c?v=3.10#L275
Apr 18 21:01:26 tsm1 kernel: [<ffffffffa070a351>] nfs_updatepage+0x121/0x8a0 [nfs] http://lxr.free-electrons.com/source/fs/nfs/write.c?v=3.10#L897
-> nfs_writepage_setup @ http://lxr.free-electrons.com/source/fs/nfs/write.c?v=3.10#L828
-> nfs_setup_write_request via http://lxr.free-electrons.com/source/fs/nfs/write.c?v=3.10#L833
func @ http://lxr.free-electrons.com/source/fs/nfs/write.c?v=3.10#L811
-> nfs_mark_uptodate via http://lxr.free-electrons.com/source/fs/nfs/write.c?v=3.10#L838
func @ http://lxr.free-electrons.com/source/fs/nfs/write.c?v=3.10#L215
-> nfs_mark_request_dirty via http://lxr.free-electrons.com/source/fs/nfs/write.c?v=3.10#L839
func @ http://lxr.free-electrons.com/source/fs/nfs/write.c?v=3.10#L466
-> nfs_unlock_and_release_request via http://lxr.free-electrons.com/source/fs/nfs/write.c?v=3.10#L840
func @ http://lxr.free-electrons.com/source/fs/nfs/pagelist.c?v=3.10#L206
Apr 18 21:01:26 tsm1 kernel: [<ffffffffa06fac11>] nfs_write_end+0x121/0x350 [nfs] http://lxr.free-electrons.com/source/fs/nfs/file.c?v=3.10#L404
Apr 18 21:01:26 tsm1 kernel: [<ffffffff811564e4>] generic_file_buffered_write+0x184/0x290 http://lxr.free-electrons.com/source/mm/filemap.c?v=3.10#L2393
via http://lxr.free-electrons.com/source/mm/filemap.c?v=3.10#L2301
via write_end: http://lxr.free-electrons.com/source/mm/filemap.c?v=3.10#L2357
mapped as http://lxr.free-electrons.com/source/fs/nfs/file.c?v=3.10#L537 to nfs_write_end
Apr 18 21:01:26 tsm1 kernel: [<ffffffff811585e5>] __generic_file_aio_write+0x1d5/0x3e0 http://lxr.free-electrons.com/source/mm/filemap.c?v=3.10#L2432
does not do O_DIRECT, but generic_file_buffered_write
Apr 18 21:01:26 tsm1 kernel: [<ffffffff8115884d>] generic_file_aio_write+0x5d/0xc0 http://lxr.free-electrons.com/source/mm/filemap.c?v=3.10#L2540
Apr 18 21:01:26 tsm1 kernel: [<ffffffffa06f9d1b>] nfs_file_write+0xbb/0x1d0 [nfs] via http://lxr.free-electrons.com/source/fs/nfs/file.c?v=3.10#L925 to
http://lxr.free-electrons.com/source/fs/nfs/file.c?v=3.10#L613
does *not* do O_DIRECT, O_APPEND write from here, but rather: generic_file_aio_write
Apr 18 21:01:26 tsm1 kernel: [<ffffffff811c5e2d>] do_sync_write+0x8d/0xd0 http://lxr.free-electrons.com/source/fs/read_write.c?v=3.10#L383
Apr 18 21:01:26 tsm1 kernel: [<ffffffff811c65cd>] vfs_write+0xbd/0x1e0 http://lxr.free-electrons.com/source/fs/read_write.c?v=3.10#L430
Apr 18 21:01:26 tsm1 kernel: [<ffffffff811c60b4>] ? generic_file_llseek+0x24/0x30 http://lxr.free-electrons.com/source/fs/read_write.c?v=3.10#L137
Apr 18 21:01:26 tsm1 kernel: [<ffffffff811c7018>] SyS_write+0x58/0xb0
Apr 18 21:01:26 tsm1 kernel: [<ffffffff81613da9>] system_call_fastpath+0x16/0x1b http://lxr.free-electrons.com/source/arch/x86/kernel/entry_64.S?v=3.10#L635
@avleen
Copy link

avleen commented Aug 13, 2015

Hi, did you ever find a resolution to this?
I'm seeing it too and haven't been able to find a cause yet!

@Millnert
Copy link
Author

Replied your email also.

No, no resolution - haven't even bugreported it yet for lack of time. Unless there are changes to the involved files above to 7.2 that deals with this bug, there is no fix for this in 7.2 either.

So it's a "novel" bug, that should get reported. Extremely hard to reproduce, in fact, we've been unable to reproduce it artificially in lab. Only saw it from production traffic.

@avleen
Copy link

avleen commented Aug 22, 2015

For anyone else who finds this gist, there is a bug report open with CentOS:
https://bugs.centos.org/view.php?id=9284

I've been able to reproduce this very reliably under high traffic load, so hopefully that might lead us somewhere!

@erickgar
Copy link

Having same signature (time out after 120s) when writing big files from multiple sources to the mount on Centos7.1.

@dansomers
Copy link

Seeing this on RHEL7.1 in a scenario with a single client writing to NFS (bowtie2 alignments). Several alignments in a row will complete, but then we hit this bug and the bowtie2 process zombifies as a result. I see that last update here was September of last year. Anyone else still having this problem? If not, what was the fix?

@dansomers
Copy link

Issue remains on RHEL7.2 after NFS-utils update.

Feb 17 16:37:00 host kernel: INFO: task bowtie2-align-s:26812 blocked for more than 120 seconds.
Feb 17 16:37:00 host kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Feb 17 16:37:00 host kernel: bowtie2-align-s D ffff88081ac53b00 0 26812 26805 0x00000080
Feb 17 16:37:00 host kernel: ffff88081ac539a0 0000000000000086 ffff8808263d6780 ffff88081ac53fd8
Feb 17 16:37:00 host kernel: ffff88081ac53fd8 ffff88081ac53fd8 ffff8808263d6780 ffff88082f094780
Feb 17 16:37:00 host kernel: 0000000000000000 7fffffffffffffff ffffffffa057dbc0 ffff88081ac53b00
Feb 17 16:37:00 host kernel: Call Trace:
Feb 17 16:37:00 host kernel: [] ? nfs_pgio_header_free+0x20/0x20 [nfs]
Feb 17 16:37:00 host kernel: [] schedule+0x29/0x70
Feb 17 16:37:00 host kernel: [] schedule_timeout+0x209/0x2d0
Feb 17 16:37:00 host kernel: [] ? native_sched_clock+0x35/0x80
Feb 17 16:37:00 host kernel: [] ? read_tsc+0x9/0x10
Feb 17 16:37:00 host kernel: [] ? nfs_pgio_header_free+0x20/0x20 [nfs]
Feb 17 16:37:00 host kernel: [] io_schedule_timeout+0xae/0x130
Feb 17 16:37:00 host kernel: [] io_schedule+0x18/0x20
Feb 17 16:37:00 host kernel: [] nfs_wait_bit_uninterruptible+0xe/0x20 [nfs]
Feb 17 16:37:00 host kernel: [] __wait_on_bit+0x60/0x90
Feb 17 16:37:00 host kernel: [] ? nfs_pgio_header_free+0x20/0x20 [nfs]
Feb 17 16:37:00 host kernel: [] out_of_line_wait_on_bit+0x87/0xb0
Feb 17 16:37:00 host kernel: [] ? wake_atomic_t_function+0x40/0x40
Feb 17 16:37:00 host kernel: [] nfs_wait_on_request+0x33/0x40 [nfs]
Feb 17 16:37:00 host kernel: [] nfs_updatepage+0x121/0x8a0 [nfs]
Feb 17 16:37:00 host kernel: [] nfs_write_end+0x121/0x350 [nfs]
Feb 17 16:37:00 host kernel: [] generic_file_buffered_write+0x184/0x290
Feb 17 16:37:00 host kernel: [] __generic_file_aio_write+0x1d5/0x3e0
Feb 17 16:37:00 host kernel: [] generic_file_aio_write+0x5d/0xc0
Feb 17 16:37:00 host kernel: [] nfs_file_write+0xbb/0x1d0 [nfs]
Feb 17 16:37:00 host kernel: [] do_sync_write+0x8d/0xd0
Feb 17 16:37:00 host kernel: [] vfs_write+0xbd/0x1e0
Feb 17 16:37:00 host kernel: [] SyS_write+0x7f/0xe0
Feb 17 16:37:00 host kernel: [] symev_write+0x43/0x80 [symev_rh_ES_7_3_10_0_123_el7_x86_64]

@mmarcrr
Copy link

mmarcrr commented Jul 13, 2016

I have 10 machines that every day have this error

Jul 10 14:28:35 hpcnode004 kernel: INFO: task srun:20481 blocked for more than 120 seconds.
Jul 10 14:28:35 hpcnode004 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jul 10 14:28:35 hpcnode004 kernel: srun D ffff8810115bbb30 0 20481 20435 0x00000084
Jul 10 14:28:35 hpcnode004 kernel: ffff8810115bb9d0 0000000000000082 ffff881014a49700 ffff8810115bbfd8
Jul 10 14:28:35 hpcnode004 kernel: ffff8810115bbfd8 ffff8810115bbfd8 ffff881014a49700 ffff88103fc147c0
Jul 10 14:28:35 hpcnode004 kernel: 0000000000000000 7fffffffffffffff ffffffffa0630bb0 ffff8810115bbb30
Jul 10 14:28:35 hpcnode004 kernel: Call Trace:
Jul 10 14:28:35 hpcnode004 kernel: [] ? nfs_pgio_header_free+0x20/0x20 [nfs]
Jul 10 14:28:35 hpcnode004 kernel: [] schedule+0x29/0x70
Jul 10 14:28:35 hpcnode004 kernel: [] schedule_timeout+0x209/0x2d0
Jul 10 14:28:35 hpcnode004 kernel: [] ? dev_hard_start_xmit+0x171/0x3b0
Jul 10 14:28:35 hpcnode004 kernel: [] ? netif_skb_features+0xc7/0x210
Jul 10 14:28:35 hpcnode004 kernel: [] ? read_tsc+0x9/0x10
Jul 10 14:28:35 hpcnode004 kernel: [] ? nfs_pgio_header_free+0x20/0x20 [nfs]
Jul 10 14:28:35 hpcnode004 kernel: [] io_schedule_timeout+0xae/0x130
Jul 10 14:28:35 hpcnode004 kernel: [] io_schedule+0x18/0x20
Jul 10 14:28:35 hpcnode004 kernel: [] nfs_wait_bit_uninterruptible+0xe/0x20 [nfs]
Jul 10 14:28:35 hpcnode004 kernel: [] __wait_on_bit+0x60/0x90
Jul 10 14:28:35 hpcnode004 kernel: [] ? nfs_pgio_header_free+0x20/0x20 [nfs]
Jul 10 14:28:35 hpcnode004 kernel: [] out_of_line_wait_on_bit+0x87/0xb0
Jul 10 14:28:35 hpcnode004 kernel: [] ? wake_atomic_t_function+0x40/0x40
Jul 10 14:28:35 hpcnode004 kernel: [] nfs_wait_on_request+0x33/0x40 [nfs]
Jul 10 14:28:35 hpcnode004 kernel: [] nfs_updatepage+0x121/0x8a0 [nfs]
Jul 10 14:28:35 hpcnode004 kernel: [] nfs_write_end+0x121/0x350 [nfs]
Jul 10 14:28:35 hpcnode004 kernel: [] generic_file_buffered_write+0x184/0x290
Jul 10 14:28:35 hpcnode004 kernel: [] __generic_file_aio_write+0x1d5/0x3e0
Jul 10 14:28:35 hpcnode004 kernel: [] ? inet_recvmsg+0x7b/0xa0
Jul 10 14:28:35 hpcnode004 kernel: [] generic_file_aio_write+0x5d/0xc0
Jul 10 14:28:35 hpcnode004 kernel: [] nfs_file_write+0xbb/0x1d0 [nfs]
Jul 10 14:28:35 hpcnode004 kernel: [] do_sync_write+0x8d/0xd0
Jul 10 14:28:35 hpcnode004 kernel: [] vfs_write+0xbd/0x1e0
Jul 10 14:28:35 hpcnode004 kernel: [] SyS_write+0x7f/0xe0
Jul 10 14:28:35 hpcnode004 kernel: [] system_call_fastpath+0x16/0x1b
Jul 10 14:28:35 hpcnode004 kernel: INFO: task srun:20502 blocked for more than 120 seconds.
Jul 10 14:28:35 hpcnode004 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jul 10 14:28:35 hpcnode004 kernel: srun D ffff880e98787b30 0 20502 20436 0x00000084
Jul 10 14:28:35 hpcnode004 kernel: ffff880e987879d0 0000000000000086 ffff88102303c500 ffff880e98787fd8
Jul 10 14:28:35 hpcnode004 kernel: ffff880e98787fd8 ffff880e98787fd8 ffff88102303c500 ffff88103fc147c0
Jul 10 14:28:35 hpcnode004 kernel: 0000000000000000 7fffffffffffffff ffffffffa0630bb0 ffff880e98787b30
Jul 10 14:28:35 hpcnode004 kernel: Call Trace:
Jul 10 14:28:35 hpcnode004 kernel: [] ? nfs_pgio_header_free+0x20/0x20 [nfs]
Jul 10 14:28:35 hpcnode004 kernel: [] schedule+0x29/0x70
Jul 10 14:28:35 hpcnode004 kernel: [] schedule_timeout+0x209/0x2d0
Jul 10 14:28:35 hpcnode004 kernel: [] ? dev_hard_start_xmit+0x171/0x3b0
Jul 10 14:28:35 hpcnode004 kernel: [] ? netif_skb_features+0xc7/0x210
Jul 10 14:28:35 hpcnode004 kernel: [] ? read_tsc+0x9/0x10
Jul 10 14:28:35 hpcnode004 kernel: [] ? nfs_pgio_header_free+0x20/0x20 [nfs]
Jul 10 14:28:35 hpcnode004 kernel: [] io_schedule_timeout+0xae/0x130
Jul 10 14:28:35 hpcnode004 kernel: [] io_schedule+0x18/0x20
Jul 10 14:28:35 hpcnode004 kernel: [] nfs_wait_bit_uninterruptible+0xe/0x20 [nfs]
Jul 10 14:28:35 hpcnode004 kernel: [] __wait_on_bit+0x60/0x90
Jul 10 14:28:35 hpcnode004 kernel: [] ? nfs_pgio_header_free+0x20/0x20 [nfs]
Jul 10 14:28:35 hpcnode004 kernel: [] out_of_line_wait_on_bit+0x87/0xb0
Jul 10 14:28:35 hpcnode004 kernel: [] ? wake_atomic_t_function+0x40/0x40
Jul 10 14:28:35 hpcnode004 kernel: [] nfs_wait_on_request+0x33/0x40 [nfs]
Jul 10 14:28:35 hpcnode004 kernel: [] nfs_updatepage+0x121/0x8a0 [nfs]
Jul 10 14:28:35 hpcnode004 kernel: [] nfs_write_end+0x121/0x350 [nfs]
Jul 10 14:28:35 hpcnode004 kernel: [] generic_file_buffered_write+0x184/0x290
Jul 10 14:28:35 hpcnode004 kernel: [] __generic_file_aio_write+0x1d5/0x3e0
Jul 10 14:28:35 hpcnode004 kernel: [] ? inet_recvmsg+0x7b/0xa0
Jul 10 14:28:35 hpcnode004 kernel: [] generic_file_aio_write+0x5d/0xc0
Jul 10 14:28:35 hpcnode004 kernel: [] nfs_file_write+0xbb/0x1d0 [nfs]
Jul 10 14:28:35 hpcnode004 kernel: [] do_sync_write+0x8d/0xd0
Jul 10 14:28:35 hpcnode004 kernel: [] vfs_write+0xbd/0x1e0
Jul 10 14:28:35 hpcnode004 kernel: [] SyS_write+0x7f/0xe0

@gniuk
Copy link

gniuk commented Jan 12, 2017

I have a similar issue(and this issue has occured several times), a process hanging on NFS writing, the process came into D state(uninterruptable sleep state) and can't be killed.

  1. The hanging process's stack is as follow:
    [] nfs_wait_bit_uninterruptible+0xe/0x20 [nfs]
    [] nfs_wait_on_request+0x33/0x40 [nfs]
    [] nfs_updatepage+0x121/0x8a0 [nfs]
    [] nfs_write_end+0x121/0x350 [nfs]
    [] generic_file_buffered_write+0x184/0x290
    [] __generic_file_aio_write+0x1d5/0x3e0
    [] generic_file_aio_write+0x5d/0xc0
    [] nfs_file_write+0xbb/0x1d0 [nfs]
    [] do_sync_write+0x8d/0xd0
    [] vfs_write+0xbd/0x1e0
    [] SyS_write+0x7f/0xe0
    [] system_call_fastpath+0x16/0x1b
    [] 0xffffffffffffffff

  2. The process status is as follow:
    Name: python
    State: D (disk sleep)
    Tgid: 12585
    Ngid: 13108
    Pid: 12585
    PPid: 1
    TracerPid: 0
    Uid: 1001 1001 1001 1001
    Gid: 1001 1001 1001 1001
    FDSize: 64
    Groups: 10 1001
    VmPeak: 251223144 kB
    VmSize: 251051100 kB
    VmLck: 0 kB
    VmPin: 0 kB
    VmHWM: 23502332 kB
    VmRSS: 148884 kB
    VmData: 249587980 kB
    VmStk: 136 kB
    VmExe: 4 kB
    VmLib: 409248 kB
    VmPTE: 47444 kB
    VmSwap: 22780916 kB
    Threads: 1
    SigQ: 10/256532
    SigPnd: 0000000000040100
    ShdPnd: 0000000000004322
    SigBlk: 0000000000000000
    SigIgn: 0000000001381000
    SigCgt: 0000000180000202
    CapInh: 0000000000000000
    CapPrm: 0000000000000000
    CapEff: 0000000000000000
    CapBnd: 0000001fffffffff
    Seccomp: 0
    Cpus_allowed: ffffff,ffffffff
    Cpus_allowed_list: 0-55
    Mems_allowed: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000003
    Mems_allowed_list: 0-1
    voluntary_ctxt_switches: 2022575
    nonvoluntary_ctxt_switches: 134484

  3. The kernel version is RHEL7.2.1511 3.10.0-327.22.2.el7.x86_64

  4. Someone already created an issue on RedHat a week ago, but not yet solved officially.
    https://access.redhat.com/solutions/2245341

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment