Max H. Gerlach maxhgerlach

  • DeepL SE
  • Cologne, Germany
maxhgerlach / gdb_thread_apply_all_backtrace_hanging_at_init.txt
Created September 5, 2019 09:17
hvd.init() hangs with error message help-opal-shmem-mmap.txt
(gdb) set pagination off
(gdb) thread apply all bt
Thread 4 (Thread 0x7efdfdfab700 (LWP 330906)):
#0 0x00007efe181faa13 in epoll_wait () at ../sysdeps/unix/syscall-template.S:84
#1 0x00007efd73f3b138 in epoll_dispatch (base=0x7efd280b78d0, tv=<optimized out>) at epoll.c:407
#2 0x00007efd73f3e4ff in opal_libevent2022_event_base_loop (base=0x7efd280b78d0, flags=1) at event.c:1630
#3 0x00007efd2dbdee9e in progress_engine () from /opt/openmpi/lib/openmpi/mca_pmix_pmix3x.so
#4 0x00007efe184c46ba in start_thread (arg=0x7efdfdfab700) at pthread_create.c:333
#5 0x00007efe181fa41d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
maxhgerlach / gdb_thread_apply_all_backtrace-232676.txt
Created May 29, 2019 14:54
hanging in MPI_Finalize, 2 threads
Thread 2 (Thread 0x7fa2c5ffb700 (LWP 232683)):
#0 0x00007fa4d186e4bd in write () at ../sysdeps/unix/syscall-template.S:84
#1 0x00007fa4ccad9ede in ?? () from /usr/lib/python2.7/lib-dynload/_multiprocessing.x86_64-linux-gnu.so
#2 0x00007fa4ccadb481 in ?? () from /usr/lib/python2.7/lib-dynload/_multiprocessing.x86_64-linux-gnu.so
#3 0x00000000004c182d in PyEval_EvalFrameEx ()
#4 0x00000000004b9b66 in PyEval_EvalCodeEx ()
#5 0x00000000004d57a3 in ?? ()
#6 0x00000000004a587e in PyObject_Call ()
#7 0x00000000004be51e in PyEval_EvalFrameEx ()
maxhgerlach / gdb_thread_apply_all_backtrace-231633.txt
Created May 29, 2019 14:52
hanging in MPI_Finalize, 75 threads
Thread 75 (Thread 0x7f9f7103a700 (LWP 232884)):
#0 0x00007fa4d186d827 in futex_abstimed_wait_cancelable (private=0, abstime=0x0, expected=0, futex_word=0x10f69ba8) at ../sysdeps/unix/sysv/linux/futex-internal.h:205
#1 do_futex_wait (sem=sem@entry=0x10f69ba8, abstime=0x0) at sem_waitcommon.c:111
#2 0x00007fa4d186d8d4 in __new_sem_wait_slow (sem=0x10f69ba8, abstime=0x0) at sem_waitcommon.c:181
#3 0x00007fa4d186d97a in __new_sem_wait (sem=<optimized out>) at sem_wait.c:29
#4 0x00007fa45e538957 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#5 0x00007fa45e523881 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#6 0x00007fa45e53a988 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#7 0x00007fa4d18656ba in start_thread (arg=0x7f9f7103a700) at pthread_create.c:333
maxhgerlach / gist:2c7030198b86da5093c56272ede114b6
Created May 28, 2019 14:27
Log of hanging Horovod shutdown
Tue May 28 15:56:20 2019[1,6]<stdout>:[2019-05-28 15:56:20.361660: I horovod/common/operations.cc:1106] [6]: MaxG Shutdown A
Tue May 28 15:56:20 2019[1,4]<stdout>:[2019-05-28 15:56:20.361653: I horovod/common/operations.cc:1106] [4]: MaxG Shutdown A
Tue May 28 15:56:20 2019[1,7]<stdout>:[2019-05-28 15:56:20.361644: I horovod/common/operations.cc:1106] [7]: MaxG Shutdown A
Tue May 28 15:56:20 2019[1,3]<stdout>:[2019-05-28 15:56:20.361649: I horovod/common/operations.cc:1106] [3]: MaxG Shutdown A
Tue May 28 15:56:20 2019[1,1]<stdout>:[2019-05-28 15:56:20.361646: I horovod/common/operations.cc:1106] [1]: MaxG Shutdown A
Tue May 28 15:56:20 2019[1,2]<stdout>:[2019-05-28 15:56:20.361652: I horovod/common/operations.cc:1106] [2]: MaxG Shutdown A
Tue May 28 15:56:20 2019[1,0]<stdout>:[2019-05-28 15:56:20.361660: I horovod/common/operations.cc:1106] [0]: MaxG Shutdown A
Tue May 28 15:56:20 2019[1,5]<stdout>:[2019-05-28 15:56:20.361700: I horovod/common/operations.cc:1106] [5]: MaxG Shutdown A
Tue May 28 15:56:20 2019
maxhgerlach / gdb_thread_apply_all_backtrace.txt
Created November 9, 2018 10:50
Backtrace for Segmentation Fault (11) after about 20 hours of training, no. 2
Thread 88 (Thread 0x7f92af7fe700 (LWP 14136)):
#0 0x00007f94ff2d8593 in select () at ../sysdeps/unix/syscall-template.S:84
#1 0x0000000000597569 in ?? ()
#2 0x00000000004c45fa in PyEval_EvalFrameEx ()
#3 0x00000000004c2705 in PyEval_EvalCodeEx ()
#4 0x00000000004de858 in ?? ()
#5 0x00000000004b0c93 in PyObject_Call ()
#6 0x00000000004c6ef6 in PyEval_EvalFrameEx ()
#7 0x00000000004c9d7f in PyEval_EvalFrameEx ()
maxhgerlach / thread-apply-all-bt.txt
Created October 24, 2018 12:30
(gdb) thread apply all backtrace
(gdb) thread apply all backtrace
Thread 88 (Thread 0x7f2ae5afa700 (LWP 22174)):
#0 pthread_cond_wait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1 0x00007f2c14f1691c in std::condition_variable::wait(std::unique_lock<std::mutex>&) () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#2 0x00007f2c1e2e7fb7 in Eigen::NonBlockingThreadPoolTempl<tensorflow::thread::EigenEnvironment>::WaitForWork(Eigen::EventCount::Waiter*, tensorflow::thread::EigenEnvironment::Task*) () from /usr/local/lib/python2.7/dist-packages/tensorflow/python/../libtensorflow_framework.so
#3 0x00007f2c1e2e8a24 in Eigen::NonBlockingThreadPoolTempl<tensorflow::thread::EigenEnvironment>::WorkerLoop(int) () from /usr/local/lib/python2.7/dist-packages/tensorflow/python/../libtensorflow_framework.so
#4 0x00007f2c1e2e7752 in std::_Function_handler<void (), tensorflow::thread::EigenEnvironment::CreateThread(std::function<void ()>)::{lambda()#1}>::_M_invoke(std::_Any_data const&) () from /usr/local/lib/python2.7/dist