Skip to content

Instantly share code, notes, and snippets.

Show Gist options
  • Star 0 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save kvaps/b04893e194315527002afb2ec3bc5725 to your computer and use it in GitHub Desktop.
Save kvaps/b04893e194315527002afb2ec3bc5725 to your computer and use it in GitHub Desktop.
Kernel tainted on workqueue: do_submit [drbd]

Hi we have very strange issue with one volume, we are using lvm and drbd with diskless replica.

After we start workload on diskless node it it is working fine, but after some time all I/O is hung and we have the next messages in dmesg:

from m8c25 (diskless node):

[Sun Mar 10 05:42:28 2019] drbd csi-940049b5-d5aa-46f0-9a73-fe601c3fc696 m12c4: [drbd_s_csi-9400/6151] sending time expired, ko = 6
[Sun Mar 10 05:42:28 2019] drbd csi-940049b5-d5aa-46f0-9a73-fe601c3fc696 m7c14: [drbd_s_csi-9400/6154] sending time expired, ko = 6
[Sun Mar 10 05:42:34 2019] drbd csi-940049b5-d5aa-46f0-9a73-fe601c3fc696 m12c4: [drbd_s_csi-9400/6151] sending time expired, ko = 5
[Sun Mar 10 05:42:34 2019] drbd csi-940049b5-d5aa-46f0-9a73-fe601c3fc696 m7c14: [drbd_s_csi-9400/6154] sending time expired, ko = 6
[Sun Mar 10 05:42:40 2019] drbd csi-940049b5-d5aa-46f0-9a73-fe601c3fc696 m12c4: [drbd_s_csi-9400/6151] sending time expired, ko = 4
[Sun Mar 10 05:42:40 2019] drbd csi-940049b5-d5aa-46f0-9a73-fe601c3fc696 m7c14: [drbd_s_csi-9400/6154] sending time expired, ko = 5
[Sun Mar 10 05:42:46 2019] drbd csi-940049b5-d5aa-46f0-9a73-fe601c3fc696 m12c4: [drbd_s_csi-9400/6151] sending time expired, ko = 3
[Sun Mar 10 05:42:46 2019] drbd csi-940049b5-d5aa-46f0-9a73-fe601c3fc696 m7c14: [drbd_s_csi-9400/6154] sending time expired, ko = 4
[Sun Mar 10 05:42:52 2019] drbd csi-940049b5-d5aa-46f0-9a73-fe601c3fc696 m12c4: [drbd_s_csi-9400/6151] sending time expired, ko = 2
[Sun Mar 10 05:42:52 2019] drbd csi-940049b5-d5aa-46f0-9a73-fe601c3fc696 m7c14: [drbd_s_csi-9400/6154] sending time expired, ko = 3
[Sun Mar 10 05:42:59 2019] drbd csi-940049b5-d5aa-46f0-9a73-fe601c3fc696 m12c4: [drbd_s_csi-9400/6151] sending time expired, ko = 1
[Sun Mar 10 05:42:59 2019] drbd csi-940049b5-d5aa-46f0-9a73-fe601c3fc696 m7c14: [drbd_s_csi-9400/6154] sending time expired, ko = 2
[Sun Mar 10 05:43:05 2019] drbd csi-940049b5-d5aa-46f0-9a73-fe601c3fc696 m12c4: conn( Connected -> NetworkFailure ) peer( Secondary -> Unknown )
[Sun Mar 10 05:43:05 2019] drbd csi-940049b5-d5aa-46f0-9a73-fe601c3fc696/0 drbd1104 m12c4: pdsk( UpToDate -> DUnknown ) repl( Established -> Off )
[Sun Mar 10 05:43:05 2019] drbd csi-940049b5-d5aa-46f0-9a73-fe601c3fc696 m7c14: [drbd_s_csi-9400/6154] sending time expired, ko = 1
[Sun Mar 10 05:43:05 2019] drbd csi-940049b5-d5aa-46f0-9a73-fe601c3fc696 m12c4: ack_receiver terminated
[Sun Mar 10 05:43:05 2019] drbd csi-940049b5-d5aa-46f0-9a73-fe601c3fc696 m12c4: Terminating ack_recv thread
[Sun Mar 10 05:43:05 2019] drbd csi-940049b5-d5aa-46f0-9a73-fe601c3fc696/0 drbd1104: sending new current UUID: C07BEF398E0AD984
[Sun Mar 10 05:43:05 2019] drbd csi-940049b5-d5aa-46f0-9a73-fe601c3fc696 m7c14: Preparing remote state change 1414712872
[Sun Mar 10 05:43:05 2019] drbd csi-940049b5-d5aa-46f0-9a73-fe601c3fc696 m7c14: Committing remote state change 1414712872 (primary_nodes=8)
[Sun Mar 10 05:43:05 2019] drbd csi-940049b5-d5aa-46f0-9a73-fe601c3fc696/0 drbd1104 m12c4: pdsk( DUnknown -> Outdated )
[Sun Mar 10 05:43:11 2019] drbd csi-940049b5-d5aa-46f0-9a73-fe601c3fc696 m7c14: conn( Connected -> NetworkFailure ) peer( Secondary -> Unknown )
[Sun Mar 10 05:43:11 2019] drbd csi-940049b5-d5aa-46f0-9a73-fe601c3fc696/0 drbd1104 m7c14: pdsk( UpToDate -> DUnknown ) repl( Established -> Off )
[Sun Mar 10 05:43:11 2019] drbd csi-940049b5-d5aa-46f0-9a73-fe601c3fc696 m12c4: Connection closed
[Sun Mar 10 05:43:11 2019] drbd csi-940049b5-d5aa-46f0-9a73-fe601c3fc696 m12c4: conn( NetworkFailure -> Unconnected )
[Sun Mar 10 05:43:11 2019] drbd csi-940049b5-d5aa-46f0-9a73-fe601c3fc696 m12c4: Restarting receiver thread
[Sun Mar 10 05:43:11 2019] drbd csi-940049b5-d5aa-46f0-9a73-fe601c3fc696/0 drbd1104: IO ERROR: neither local nor remote data, sector 25237760+8
[Sun Mar 10 05:43:11 2019] drbd csi-940049b5-d5aa-46f0-9a73-fe601c3fc696 m12c4: conn( Unconnected -> Connecting )
[Sun Mar 10 05:43:11 2019] drbd csi-940049b5-d5aa-46f0-9a73-fe601c3fc696/0 drbd1104: IO ERROR: neither local nor remote data, sector 1001578856+8
[Sun Mar 10 05:43:11 2019] drbd csi-940049b5-d5aa-46f0-9a73-fe601c3fc696/0 drbd1104: IO ERROR: neither local nor remote data, sector 713186240+8

From m7c14 (first data node):

[Sun Mar 10 05:45:34 2019] INFO: task kworker/u16:3:3676 blocked for more than 120 seconds.
[Sun Mar 10 05:45:34 2019]       Tainted: P           O     4.15.18-7-pve #1
[Sun Mar 10 05:45:34 2019] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Sun Mar 10 05:45:34 2019] kworker/u16:3   D    0  3676      2 0x80000000
[Sun Mar 10 05:45:34 2019] Workqueue: drbd1104_submit do_submit [drbd]
[Sun Mar 10 05:45:34 2019] Call Trace:
[Sun Mar 10 05:45:34 2019]  __schedule+0x3e0/0x870
[Sun Mar 10 05:45:34 2019]  schedule+0x36/0x80
[Sun Mar 10 05:45:34 2019]  do_submit+0x325/0x5f0 [drbd]
[Sun Mar 10 05:45:34 2019]  ? __switch_to_asm+0x34/0x70
[Sun Mar 10 05:45:34 2019]  ? __switch_to_asm+0x40/0x70
[Sun Mar 10 05:45:34 2019]  ? __switch_to_asm+0x34/0x70
[Sun Mar 10 05:45:34 2019]  ? __switch_to_asm+0x40/0x70
[Sun Mar 10 05:45:34 2019]  ? wait_woken+0x80/0x80
[Sun Mar 10 05:45:34 2019]  ? __switch_to_asm+0x34/0x70
[Sun Mar 10 05:45:34 2019]  process_one_work+0x1e0/0x400
[Sun Mar 10 05:45:34 2019]  ? __drbd_make_request+0x4e0/0x4e0 [drbd]
[Sun Mar 10 05:45:34 2019]  ? process_one_work+0x1e0/0x400
[Sun Mar 10 05:45:34 2019]  worker_thread+0x4b/0x420
[Sun Mar 10 05:45:34 2019]  kthread+0x105/0x140
[Sun Mar 10 05:45:34 2019]  ? process_one_work+0x400/0x400
[Sun Mar 10 05:45:34 2019]  ? kthread_create_worker_on_cpu+0x70/0x70
[Sun Mar 10 05:45:34 2019]  ? do_syscall_64+0x73/0x130
[Sun Mar 10 05:45:34 2019]  ? SyS_exit_group+0x14/0x20
[Sun Mar 10 05:45:34 2019]  ret_from_fork+0x35/0x40
[Sun Mar 10 05:45:34 2019] INFO: task drbd_r_csi-9400:18969 blocked for more than 120 seconds.
[Sun Mar 10 05:45:34 2019]       Tainted: P           O     4.15.18-7-pve #1
[Sun Mar 10 05:45:34 2019] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Sun Mar 10 05:45:34 2019] drbd_r_csi-9400 D    0 18969      2 0x80000004
[Sun Mar 10 05:45:34 2019] Call Trace:
[Sun Mar 10 05:45:34 2019]  __schedule+0x3e0/0x870
[Sun Mar 10 05:45:34 2019]  schedule+0x36/0x80
[Sun Mar 10 05:45:34 2019]  conn_wait_ee_empty+0x72/0x9f [drbd]
[Sun Mar 10 05:45:34 2019]  ? wait_woken+0x80/0x80
[Sun Mar 10 05:45:34 2019]  conn_disconnect+0x1c0/0x780 [drbd]
[Sun Mar 10 05:45:34 2019]  ? w_flush+0x50/0x50 [drbd]
[Sun Mar 10 05:45:34 2019]  drbd_receiver+0x2a2/0x670 [drbd]
[Sun Mar 10 05:45:34 2019]  drbd_thread_setup+0x76/0x180 [drbd]
[Sun Mar 10 05:45:34 2019]  kthread+0x105/0x140
[Sun Mar 10 05:45:34 2019]  ? __drbd_next_peer_device_ref+0x150/0x150 [drbd]
[Sun Mar 10 05:45:34 2019]  ? kthread_create_worker_on_cpu+0x70/0x70
[Sun Mar 10 05:45:34 2019]  ? do_syscall_64+0x73/0x130
[Sun Mar 10 05:45:34 2019]  ? SyS_exit_group+0x14/0x20
[Sun Mar 10 05:45:34 2019]  ret_from_fork+0x35/0x40

From m7c14 (second data node):

[Sun Mar 10 05:46:55 2019] INFO: task drbd_r_csi-9400:21586 blocked for more than 120 seconds.
[Sun Mar 10 05:46:55 2019]       Tainted: P           O     4.15.18-7-pve #1
[Sun Mar 10 05:46:55 2019] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Sun Mar 10 05:46:55 2019] drbd_r_csi-9400 D    0 21586      2 0x80000000
[Sun Mar 10 05:46:55 2019] Call Trace:
[Sun Mar 10 05:46:55 2019]  __schedule+0x3e0/0x870
[Sun Mar 10 05:46:55 2019]  schedule+0x36/0x80
[Sun Mar 10 05:46:55 2019]  receive_peer_dagtag+0x162/0x2f0 [drbd]
[Sun Mar 10 05:46:55 2019]  ? wait_woken+0x80/0x80
[Sun Mar 10 05:46:55 2019]  ? got_twopc_reply+0x1f0/0x1f0 [drbd]
[Sun Mar 10 05:46:55 2019]  drbd_receiver+0x4ed/0x670 [drbd]
[Sun Mar 10 05:46:55 2019]  drbd_thread_setup+0x76/0x180 [drbd]
[Sun Mar 10 05:46:55 2019]  kthread+0x105/0x140
[Sun Mar 10 05:46:55 2019]  ? __drbd_next_peer_device_ref+0x150/0x150 [drbd]
[Sun Mar 10 05:46:55 2019]  ? kthread_create_worker_on_cpu+0x70/0x70
[Sun Mar 10 05:46:55 2019]  ? do_syscall_64+0x73/0x130
[Sun Mar 10 05:46:55 2019]  ? SyS_exit_group+0x14/0x20
[Sun Mar 10 05:46:55 2019]  ret_from_fork+0x35/0x40
[Sun Mar 10 05:46:55 2019] INFO: task drbd_r_csi-9400:18504 blocked for more than 120 seconds.
[Sun Mar 10 05:46:55 2019]       Tainted: P           O     4.15.18-7-pve #1
[Sun Mar 10 05:46:55 2019] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Sun Mar 10 05:46:55 2019] drbd_r_csi-9400 D    0 18504      2 0x80000004
[Sun Mar 10 05:46:55 2019] Call Trace:
[Sun Mar 10 05:46:55 2019]  __schedule+0x3e0/0x870
[Sun Mar 10 05:46:55 2019]  schedule+0x36/0x80
[Sun Mar 10 05:46:55 2019]  conn_wait_ee_empty+0x72/0x9f [drbd]
[Sun Mar 10 05:46:55 2019]  ? wait_woken+0x80/0x80
[Sun Mar 10 05:46:55 2019]  conn_disconnect+0x1c0/0x780 [drbd]
[Sun Mar 10 05:46:55 2019]  ? w_flush+0x50/0x50 [drbd]
[Sun Mar 10 05:46:55 2019]  drbd_receiver+0x2a2/0x670 [drbd]
[Sun Mar 10 05:46:55 2019]  drbd_thread_setup+0x76/0x180 [drbd]
[Sun Mar 10 05:46:55 2019]  kthread+0x105/0x140
[Sun Mar 10 05:46:55 2019]  ? __drbd_next_peer_device_ref+0x150/0x150 [drbd]
[Sun Mar 10 05:46:55 2019]  ? kthread_create_worker_on_cpu+0x70/0x70
[Sun Mar 10 05:46:55 2019]  ret_from_fork+0x35/0x40
[Sun Mar 10 05:46:55 2019] INFO: task kworker/u16:0:18946 blocked for more than 120 seconds.
[Sun Mar 10 05:46:55 2019]       Tainted: P           O     4.15.18-7-pve #1
[Sun Mar 10 05:46:55 2019] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Sun Mar 10 05:46:55 2019] kworker/u16:0   D    0 18946      2 0x80000000
[Sun Mar 10 05:46:55 2019] Workqueue: drbd1104_submit do_submit [drbd]
[Sun Mar 10 05:46:55 2019] Call Trace:
[Sun Mar 10 05:46:55 2019]  __schedule+0x3e0/0x870
[Sun Mar 10 05:46:55 2019]  schedule+0x36/0x80
[Sun Mar 10 05:46:55 2019]  do_submit+0x325/0x5f0 [drbd]
[Sun Mar 10 05:46:55 2019]  ? __switch_to_asm+0x34/0x70
[Sun Mar 10 05:46:55 2019]  ? __switch_to_asm+0x40/0x70
[Sun Mar 10 05:46:55 2019]  ? __switch_to_asm+0x34/0x70
[Sun Mar 10 05:46:55 2019]  ? __switch_to_asm+0x40/0x70
[Sun Mar 10 05:46:55 2019]  ? wait_woken+0x80/0x80
[Sun Mar 10 05:46:55 2019]  ? __switch_to_asm+0x34/0x70
[Sun Mar 10 05:46:55 2019]  process_one_work+0x1e0/0x400
[Sun Mar 10 05:46:55 2019]  ? __drbd_make_request+0x4e0/0x4e0 [drbd]
[Sun Mar 10 05:46:55 2019]  ? process_one_work+0x1e0/0x400
[Sun Mar 10 05:46:55 2019]  worker_thread+0x4b/0x420
[Sun Mar 10 05:46:55 2019]  kthread+0x105/0x140
[Sun Mar 10 05:46:55 2019]  ? process_one_work+0x400/0x400
[Sun Mar 10 05:46:55 2019]  ? kthread_create_worker_on_cpu+0x70/0x70
[Sun Mar 10 05:46:55 2019]  ? do_syscall_64+0x73/0x130
[Sun Mar 10 05:46:55 2019]  ? SyS_exit_group+0x14/0x20
[Sun Mar 10 05:46:55 2019]  ret_from_fork+0x35/0x40

drbdadm status shows NetworkFailure on the diskless node:

csi-940049b5-d5aa-46f0-9a73-fe601c3fc696 role:Secondary
  disk:Consistent
  m7c14 role:Secondary
    peer-disk:Consistent
  m8c25 connection:NetworkFailure
  m8c29 role:Secondary
    peer-disk:Diskless

linstor r l shows Consistent for the diskless node.

| csi-940049b5-d5aa-46f0-9a73-fe601c3fc696       | m12c4  | 7158 | Unused | Consistent |
| csi-940049b5-d5aa-46f0-9a73-fe601c3fc696       | m7c14  | 7158 | Unused |   UpToDate |
| csi-940049b5-d5aa-46f0-9a73-fe601c3fc696       | m8c25  | 7158 | Unused |   Diskless |
| csi-940049b5-d5aa-46f0-9a73-fe601c3fc696       | m8c29  | 7158 | Unused |   Diskless |

Any drbdadm down, drbdadm disconnect operations are stuck and not working on the data nodes. Only reboot is helping.

We have this issue with one volume, then we created another one and copied all the files from old one to new one and run workload on new. This problem was occurred with new volume too, although the different nodes with different hardware was used.

Environement:

  • backend: lvm

  • filesystem: ext4

  • linstor version: v0.7.5

  • kernel version: 4.15.18-7-pve

  • drbd version:

    version: 9.0.14-1 (api:2/proto:86-113)
    GIT-hash: 62f906cf44ef02a30ce0c148fec223b40c51c533 build by @gitlab-runner-docker1-0, 2018-10-09 16:33:22
    Transports (api:16): tcp (9.0.14-1)
    

Could you explain please where exactly can be problem and where to dig further?

Thanks in advance!

  • kvaps
@kvaps
Copy link
Author

kvaps commented Mar 11, 2019

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment