@manish-kumar-garg
Created December 24, 2019 08:49
Distributed Error
rnn/rnn.py returnn-distributed.config
[ip-10-1-21-241:21504] Warning: could not find environment variable "HOROVOD_TIMELINE"
[ip-10-1-21-241:21504] Warning: could not find environment variable "DEBUG"
--------------------------------------------------------------------------
WARNING: Linux kernel CMA support was requested via the
btl_vader_single_copy_mechanism MCA variable, but CMA support is
not available due to restrictive ptrace settings.
The vader shared memory BTL will fall back on another single-copy
mechanism if one is available. This may result in lower performance.
Local host: ip-10-1-21-241
--------------------------------------------------------------------------
Horovod initialized. Hostname ip-10-1-21-241, pid 21509, rank 0 / size 2, local rank 0 / local size 2.
Horovod initialized. Hostname ip-10-1-21-241, pid 21510, rank 1 / size 2, local rank 1 / local size 2.
warning: unable to access '/home/ubuntu/.gitconfig': Is a directory
warning: unable to access '/home/ubuntu/.gitconfig': Is a directory
fatal: unknown error occurred while reading the configuration files
warning: unable to access '/home/ubuntu/.gitconfig': Is a directory
warning: unable to access '/home/ubuntu/.gitconfig': Is a directory
fatal: unknown error occurred while reading the configuration files
RETURNN starting up, version unknown(git exception: CalledProcessError(128, ('git', 'show', '-s', '--format=%ci', 'HEAD'))), date/time 2019-12-24-08-42-32 (UTC+0000), pid 21509, cwd /home/ubuntu/rwth-i6/returnn-experiments/2018-asr-attention/librispeech/full-setup-attention, Python /home/ubuntu/horovod_2/bin/python3
RETURNN command line options: ['returnn-distributed.config']
RETURNN starting up, version unknown(git exception: CalledProcessError(128, ('git', 'show', '-s', '--format=%ci', 'HEAD'))), date/time 2019-12-24-08-42-32 (UTC+0000), pid 21510, cwd /home/ubuntu/rwth-i6/returnn-experiments/2018-asr-attention/librispeech/full-setup-attention, Python /home/ubuntu/horovod_2/bin/python3
Hostname: ip-10-1-21-241
RETURNN command line options: ['returnn-distributed.config']
Hostname: ip-10-1-21-241
/home/ubuntu/horovod_2/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:526: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_qint8 = np.dtype([("qint8", np.int8, 1)])
/home/ubuntu/horovod_2/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:527: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_quint8 = np.dtype([("quint8", np.uint8, 1)])
/home/ubuntu/horovod_2/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:528: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_qint16 = np.dtype([("qint16", np.int16, 1)])
/home/ubuntu/horovod_2/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:529: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_quint16 = np.dtype([("quint16", np.uint16, 1)])
/home/ubuntu/horovod_2/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:530: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_qint32 = np.dtype([("qint32", np.int32, 1)])
/home/ubuntu/horovod_2/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:535: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
np_resource = np.dtype([("resource", np.ubyte, 1)])
/home/ubuntu/horovod_2/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:526: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_qint8 = np.dtype([("qint8", np.int8, 1)])
/home/ubuntu/horovod_2/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:527: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_quint8 = np.dtype([("quint8", np.uint8, 1)])
/home/ubuntu/horovod_2/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:528: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_qint16 = np.dtype([("qint16", np.int16, 1)])
/home/ubuntu/horovod_2/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:529: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_quint16 = np.dtype([("quint16", np.uint16, 1)])
/home/ubuntu/horovod_2/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:530: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_qint32 = np.dtype([("qint32", np.int32, 1)])
/home/ubuntu/horovod_2/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:535: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
np_resource = np.dtype([("resource", np.ubyte, 1)])
TensorFlow: 1.13.1 (b'v1.13.1-0-g6612da8951') (<site-package> in /home/ubuntu/horovod_2/lib/python3.6/site-packages/tensorflow)
TensorFlow: 1.13.1 (b'v1.13.1-0-g6612da8951') (<site-package> in /home/ubuntu/horovod_2/lib/python3.6/site-packages/tensorflow)
Horovod: Hostname ip-10-1-21-241, pid 21509, using GPU 0.
Horovod: Reduce type: grad
Horovod: Hostname ip-10-1-21-241, pid 21510, using GPU 1.
Setup TF inter and intra global thread pools, num_threads None, session opts {'gpu_options': {'visible_device_list': '0'}, 'log_device_placement': False, 'device_count': {'GPU': 0}}.
Setup TF inter and intra global thread pools, num_threads None, session opts {'gpu_options': {'visible_device_list': '1'}, 'log_device_placement': False, 'device_count': {'GPU': 0}}.
2019-12-24 08:42:32.121618: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2019-12-24 08:42:32.121734: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2019-12-24 08:42:32.599274: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:998] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-12-24 08:42:32.600216: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:998] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-12-24 08:42:32.608409: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:998] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-12-24 08:42:32.609131: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:998] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-12-24 08:42:32.612055: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x564500322210 executing computations on platform CUDA. Devices:
2019-12-24 08:42:32.612089: I tensorflow/compiler/xla/service/service.cc:158] StreamExecutor device (0): Tesla K80, Compute Capability 3.7
2019-12-24 08:42:32.612098: I tensorflow/compiler/xla/service/service.cc:158] StreamExecutor device (1): Tesla K80, Compute Capability 3.7
2019-12-24 08:42:32.612595: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x55eb17f9b210 executing computations on platform CUDA. Devices:
2019-12-24 08:42:32.612625: I tensorflow/compiler/xla/service/service.cc:158] StreamExecutor device (0): Tesla K80, Compute Capability 3.7
2019-12-24 08:42:32.612634: I tensorflow/compiler/xla/service/service.cc:158] StreamExecutor device (1): Tesla K80, Compute Capability 3.7
2019-12-24 08:42:32.614674: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2300090000 Hz
2019-12-24 08:42:32.614793: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2300090000 Hz
2019-12-24 08:42:32.616644: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x5645003481d0 executing computations on platform Host. Devices:
2019-12-24 08:42:32.616672: I tensorflow/compiler/xla/service/service.cc:158] StreamExecutor device (0): <undefined>, <undefined>
2019-12-24 08:42:32.616672: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x55eb17fc11d0 executing computations on platform Host. Devices:
2019-12-24 08:42:32.616695: I tensorflow/compiler/xla/service/service.cc:158] StreamExecutor device (0): <undefined>, <undefined>
2019-12-24 08:42:32.616742: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-12-24 08:42:32.616761: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990]
2019-12-24 08:42:32.616772: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-12-24 08:42:32.616784: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990]
CUDA_VISIBLE_DEVICES is set to '1,2'.
TF session gpu_options.visible_device_list is set to '1'.
Collecting TensorFlow device list...
CUDA_VISIBLE_DEVICES is set to '1,2'.
TF session gpu_options.visible_device_list is set to '0'.
Collecting TensorFlow device list...
2019-12-24 08:42:32.620117: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 0 with properties:
name: Tesla K80 major: 3 minor: 7 memoryClockRate(GHz): 0.8235
pciBusID: 0000:00:19.0
totalMemory: 11.17GiB freeMemory: 11.05GiB
2019-12-24 08:42:32.620152: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1512] Adding visible gpu devices: 1
2019-12-24 08:42:32.620300: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 0 with properties:
name: Tesla K80 major: 3 minor: 7 memoryClockRate(GHz): 0.8235
pciBusID: 0000:00:18.0
totalMemory: 11.17GiB freeMemory: 11.05GiB
2019-12-24 08:42:32.620334: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1512] Adding visible gpu devices: 0
2019-12-24 08:42:32.628371: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-12-24 08:42:32.628397: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990] 1
2019-12-24 08:42:32.628406: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 1: N
2019-12-24 08:42:32.628610: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10749 MB memory) -> physical GPU (device: 1, name: Tesla K80, pci bus id: 0000:00:19.0, compute capability: 3.7)
2019-12-24 08:42:32.629000: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-12-24 08:42:32.629025: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990] 0
2019-12-24 08:42:32.629034: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 0: N
2019-12-24 08:42:32.629191: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10749 MB memory) -> physical GPU (device: 0, name: Tesla K80, pci bus id: 0000:00:18.0, compute capability: 3.7)
Unhandled exception <class 'AssertionError'> in thread <_MainThread(MainThread, started 140139725219648)>, proc 21510.
Thread current, main, <_MainThread(MainThread, started 140139725219648)>:
(Excluded thread.)
That were all threads.
EXCEPTION
Unhandled exception <class 'AssertionError'> in thread <_MainThread(MainThread, started 140476757673792)>, proc 21509.
Thread current, main, <_MainThread(MainThread, started 140476757673792)>:
(Excluded thread.)
That were all threads.
EXCEPTION
Traceback (most recent call last):
File "returnn/rnn.py", line 654, in <module>
line: main(sys.argv)
locals:
main = <local> <function main at 0x7f74c8c1bea0>
sys = <local> <module 'sys' (built-in)>
sys.argv = <local> ['returnn/rnn.py', 'returnn-distributed.config'], _[0]: {len = 14}
File "returnn/rnn.py", line 641, in main
line: init(command_line_options=argv[1:])
locals:
init = <global> <function init at 0x7f74c8c1bbf8>
command_line_options = <not found>
argv = <local> ['returnn/rnn.py', 'returnn-distributed.config'], _[0]: {len = 14}
File "returnn/rnn.py", line 390, in init
line: init_backend_engine()
locals:
init_backend_engine = <global> <function init_backend_engine at 0x7f74c8c1bb70>
File "returnn/rnn.py", line 366, in init_backend_engine
line: print_available_devices(tf_session_opts=tf_session_opts, file=log.v2)
locals:
print_available_devices = <local> <function print_available_devices at 0x7f74b44747b8>
tf_session_opts = <local> {'gpu_options': {'visible_device_list': '1'}}
file = <not found>
log = <global> <Log.Log object at 0x7f747c4072b0>
log.v2 = <global> <Log.Stream object at 0x7f74b4488198>
File "/home/ubuntu/rwth-i6/returnn-experiments/2018-asr-attention/librispeech/full-setup-attention/returnn/TFUtil.py", line 3454, in print_available_devices
line: devs = get_tf_list_local_devices(tf_session_opts=tf_session_opts)
locals:
devs = <not found>
get_tf_list_local_devices = <global> <function get_tf_list_local_devices at 0x7f74b4474488>
tf_session_opts = <local> {'gpu_options': {'visible_device_list': '1'}}
File "/home/ubuntu/rwth-i6/returnn-experiments/2018-asr-attention/librispeech/full-setup-attention/returnn/TFUtil.py", line 3411, in get_tf_list_local_devices
line: dev.set_physical_device_desc(session=session)
locals:
dev = <local> <_DeviceAttributes name: '/job:localhost/replica:0/task:0/device:CPU:0', device_type: 'CPU', memory_limit_bytes: 268435456, physical_device_desc: None>
dev.set_physical_device_desc = <local> <bound method _DeviceAttributes.set_physical_device_desc of <_DeviceAttributes name: '/job:localhost/replica:0/task:0/device:CPU:0', device_type: 'CPU', memory_limit_bytes: 268435456, physical_device_desc: None>>
session = <local> <tensorflow.python.client.session.Session object at 0x7f74b44e1160>
File "/home/ubuntu/rwth-i6/returnn-experiments/2018-asr-attention/librispeech/full-setup-attention/returnn/TFUtil.py", line 3359, in set_physical_device_desc
line: physical_device_desc = session.run(get_device_attr(self.name))
locals:
physical_device_desc = <not found>
session = <local> <tensorflow.python.client.session.Session object at 0x7f74b44e1160>
session.run = <local> <bound method BaseSession.run of <tensorflow.python.client.session.Session object at 0x7f74b44e1160>>
get_device_attr = <global> <function get_device_attr at 0x7f74b4486620>
self = <local> <_DeviceAttributes name: '/job:localhost/replica:0/task:0/device:CPU:0', device_type: 'CPU', memory_limit_bytes: 268435456, physical_device_desc: None>
self.name = <local> '/job:localhost/replica:0/task:0/device:CPU:0', len = 44
File "/home/ubuntu/rwth-i6/returnn-experiments/2018-asr-attention/librispeech/full-setup-attention/returnn/TFUtil.py", line 8582, in get_device_attr
line: return _DeviceAttrMod.get_device_attr()
locals:
_DeviceAttrMod = <global> <class 'TFUtil._DeviceAttrMod'>
_DeviceAttrMod.get_device_attr = <global> <bound method _DeviceAttrMod.get_device_attr of <class 'TFUtil._DeviceAttrMod'>>
File "/home/ubuntu/rwth-i6/returnn-experiments/2018-asr-attention/librispeech/full-setup-attention/returnn/TFUtil.py", line 8570, in get_device_attr
line: return cls.get_mod().get_device_attr()
locals:
cls = <local> <class 'TFUtil._DeviceAttrMod'>
cls.get_mod = <local> <bound method _DeviceAttrMod.get_mod of <class 'TFUtil._DeviceAttrMod'>>
get_device_attr = <global> <function get_device_attr at 0x7f74b4486620>
File "/home/ubuntu/rwth-i6/returnn-experiments/2018-asr-attention/librispeech/full-setup-attention/returnn/TFUtil.py", line 8558, in get_mod
Traceback (most recent call last):
File "returnn/rnn.py", line 654, in <module>
line: main(sys.argv)
locals:
line: compiler = OpCodeCompiler(
base_name="GetDeviceAttr", code_version=1, code=src_code,
is_cpp=True, use_cuda_if_available=True,
# This would lead to a get_tf_list_local_devices call, which we might not want at this point.
cuda_auto_min_compute_capability=False,
verbose=verbose)
locals:
compiler = <not found>
OpCodeCompiler = <global> <class 'TFUtil.OpCodeCompiler'>
base_name = <not found>
code_version = <not found>
code = <not found>
src_code = <local> '\n #include "tensorflow/core/framework/common_shape_fns.h"\n #include "tensorflow/core/framework/op.h"\n #include "tensorflow/core/framework/op_kernel.h"\n #include "tensorflow/core/framework/device_attributes.pb.h"\n\n using namespace tensorflow;\n\n REGISTER_OP("GetDeviceAttr..., len = 1084
main = <local> <function main at 0x7fc34174dea0>
sys = <local> <module 'sys' (built-in)>
sys.argv = <local> ['returnn/rnn.py', 'returnn-distributed.config'], _[0]: {len = 14}
File "returnn/rnn.py", line 641, in main
line: init(command_line_options=argv[1:])
locals:
init = <global> <function init at 0x7fc34174dbf8>
is_cpp = <not found>
use_cuda_if_available = <not found>
File "/home/ubuntu/rwth-i6/returnn-experiments/2018-asr-attention/librispeech/full-setup-attention/returnn/TFUtil.py", line 4772, in __init__
line: self._cuda_env = use_cuda_if_available and CudaEnv.get_instance()
locals:
self = <local> !AttributeError: 'OpCodeCompiler' object has no attribute 'base_name'
self._cuda_env = <local> !AttributeError: 'OpCodeCompiler' object has no attribute '_cuda_env'
use_cuda_if_available = <local> True
CudaEnv = <global> <class 'TFUtil.CudaEnv'>
CudaEnv.get_instance = <global> <bound method CudaEnv.get_instance of <class 'TFUtil.CudaEnv'>>
command_line_options = <not found>
argv = <local> ['returnn/rnn.py', 'returnn-distributed.config'], _[0]: {len = 14}
File "returnn/rnn.py", line 390, in init
line: init_backend_engine()
locals:
init_backend_engine = <global> <function init_backend_engine at 0x7fc34174db70>
File "returnn/rnn.py", line 366, in init_backend_engine
line: print_available_devices(tf_session_opts=tf_session_opts, file=log.v2)
locals:
print_available_devices = <local> <function print_available_devices at 0x7fc323ff77b8>
File "/home/ubuntu/rwth-i6/returnn-experiments/2018-asr-attention/librispeech/full-setup-attention/returnn/TFUtil.py", line 4757, in get_instance
line: cls._instance = cls()
locals:
cls = <local> <class 'TFUtil.CudaEnv'>
cls._instance = <local> None
File "/home/ubuntu/rwth-i6/returnn-experiments/2018-asr-attention/librispeech/full-setup-attention/returnn/TFUtil.py", line 4624, in __init__
line: self.cuda_path = self._find_cuda_path()
locals:
self = <local> <TFUtil.CudaEnv object at 0x7f74b44888d0>
self.cuda_path = <local> !AttributeError: 'CudaEnv' object has no attribute 'cuda_path'
self._find_cuda_path = <local> <bound method CudaEnv._find_cuda_path of <class 'TFUtil.CudaEnv'>>
File "/home/ubuntu/rwth-i6/returnn-experiments/2018-asr-attention/librispeech/full-setup-attention/returnn/TFUtil.py", line 4721, in _find_cuda_path
line: for p in cls._cuda_path_candidates():
locals:
tf_session_opts = <local> {'gpu_options': {'visible_device_list': '0'}}
file = <not found>
log = <global> <Log.Log object at 0x7fc2f4f392b0>
log.v2 = <global> <Log.Stream object at 0x7fc32400b198>
File "/home/ubuntu/rwth-i6/returnn-experiments/2018-asr-attention/librispeech/full-setup-attention/returnn/TFUtil.py", line 3454, in print_available_devices
line: devs = get_tf_list_local_devices(tf_session_opts=tf_session_opts)
locals:
devs = <not found>
get_tf_list_local_devices = <global> <function get_tf_list_local_devices at 0x7fc323ff7488>
tf_session_opts = <local> {'gpu_options': {'visible_device_list': '0'}}
p = <not found>
cls = <local> <class 'TFUtil.CudaEnv'>
cls._cuda_path_candidates = <local> <bound method CudaEnv._cuda_path_candidates of <class 'TFUtil.CudaEnv'>>
File "/home/ubuntu/rwth-i6/returnn-experiments/2018-asr-attention/librispeech/full-setup-attention/returnn/TFUtil.py", line 4678, in _cuda_path_candidates
line: p = cls._cuda_path_candidate_via_proc_map_libcudart()
locals:
p = <not found>
cls = <local> <class 'TFUtil.CudaEnv'>
cls._cuda_path_candidate_via_proc_map_libcudart = <local> <bound method CudaEnv._cuda_path_candidate_via_proc_map_libcudart of <class 'TFUtil.CudaEnv'>>
File "/home/ubuntu/rwth-i6/returnn-experiments/2018-asr-attention/librispeech/full-setup-attention/returnn/TFUtil.py", line 4672, in _cuda_path_candidate_via_proc_map_libcudart
line: assert p not in ["", "/"], "No parent dir of %r is a valid CUDA path." % fn
locals:
p = <local> '/'
File "/home/ubuntu/rwth-i6/returnn-experiments/2018-asr-attention/librispeech/full-setup-attention/returnn/TFUtil.py", line 3411, in get_tf_list_local_devices
line: dev.set_physical_device_desc(session=session)
locals:
dev = <local> <_DeviceAttributes name: '/job:localhost/replica:0/task:0/device:CPU:0', device_type: 'CPU', memory_limit_bytes: 268435456, physical_device_desc: None>
dev.set_physical_device_desc = <local> <bound method _DeviceAttributes.set_physical_device_desc of <_DeviceAttributes name: '/job:localhost/replica:0/task:0/device:CPU:0', device_type: 'CPU', memory_limit_bytes: 268435456, physical_device_desc: None>>
session = <local> <tensorflow.python.client.session.Session object at 0x7fc3247f9160>
File "/home/ubuntu/rwth-i6/returnn-experiments/2018-asr-attention/librispeech/full-setup-attention/returnn/TFUtil.py", line 3359, in set_physical_device_desc
line: physical_device_desc = session.run(get_device_attr(self.name))
locals:
physical_device_desc = <not found>
session = <local> <tensorflow.python.client.session.Session object at 0x7fc3247f9160>
fn = <local> '/home/ubuntu/anaconda3/lib/libcudart.so.10.1.243', len = 48
AssertionError: No parent dir of '/home/ubuntu/anaconda3/lib/libcudart.so.10.1.243' is a valid CUDA path.
session.run = <local> <bound method BaseSession.run of <tensorflow.python.client.session.Session object at 0x7fc3247f9160>>
get_device_attr = <global> <function get_device_attr at 0x7fc324009620>
self = <local> <_DeviceAttributes name: '/job:localhost/replica:0/task:0/device:CPU:0', device_type: 'CPU', memory_limit_bytes: 268435456, physical_device_desc: None>
self.name = <local> '/job:localhost/replica:0/task:0/device:CPU:0', len = 44
File "/home/ubuntu/rwth-i6/returnn-experiments/2018-asr-attention/librispeech/full-setup-attention/returnn/TFUtil.py", line 8582, in get_device_attr
line: return _DeviceAttrMod.get_device_attr()
locals:
_DeviceAttrMod = <global> <class 'TFUtil._DeviceAttrMod'>
_DeviceAttrMod.get_device_attr = <global> <bound method _DeviceAttrMod.get_device_attr of <class 'TFUtil._DeviceAttrMod'>>
File "/home/ubuntu/rwth-i6/returnn-experiments/2018-asr-attention/librispeech/full-setup-attention/returnn/TFUtil.py", line 8570, in get_device_attr
line: return cls.get_mod().get_device_attr()
locals:
cls = <local> <class 'TFUtil._DeviceAttrMod'>
cls.get_mod = <local> <bound method _DeviceAttrMod.get_mod of <class 'TFUtil._DeviceAttrMod'>>
get_device_attr = <global> <function get_device_attr at 0x7fc324009620>
File "/home/ubuntu/rwth-i6/returnn-experiments/2018-asr-attention/librispeech/full-setup-attention/returnn/TFUtil.py", line 8558, in get_mod
line: compiler = OpCodeCompiler(
base_name="GetDeviceAttr", code_version=1, code=src_code,
is_cpp=True, use_cuda_if_available=True,
# This would lead to a get_tf_list_local_devices call, which we might not want at this point.
cuda_auto_min_compute_capability=False,
verbose=verbose)
locals:
compiler = <not found>
OpCodeCompiler = <global> <class 'TFUtil.OpCodeCompiler'>
base_name = <not found>
code_version = <not found>
code = <not found>
src_code = <local> '\n #include "tensorflow/core/framework/common_shape_fns.h"\n #include "tensorflow/core/framework/op.h"\n #include "tensorflow/core/framework/op_kernel.h"\n #include "tensorflow/core/framework/device_attributes.pb.h"\n\n using namespace tensorflow;\n\n REGISTER_OP("GetDeviceAttr..., len = 1084
is_cpp = <not found>
use_cuda_if_available = <not found>
File "/home/ubuntu/rwth-i6/returnn-experiments/2018-asr-attention/librispeech/full-setup-attention/returnn/TFUtil.py", line 4772, in __init__
line: self._cuda_env = use_cuda_if_available and CudaEnv.get_instance()
locals:
self = <local> !AttributeError: 'OpCodeCompiler' object has no attribute 'base_name'
self._cuda_env = <local> !AttributeError: 'OpCodeCompiler' object has no attribute '_cuda_env'
use_cuda_if_available = <local> True
CudaEnv = <global> <class 'TFUtil.CudaEnv'>
CudaEnv.get_instance = <global> <bound method CudaEnv.get_instance of <class 'TFUtil.CudaEnv'>>
File "/home/ubuntu/rwth-i6/returnn-experiments/2018-asr-attention/librispeech/full-setup-attention/returnn/TFUtil.py", line 4757, in get_instance
line: cls._instance = cls()
locals:
cls = <local> <class 'TFUtil.CudaEnv'>
cls._instance = <local> None
File "/home/ubuntu/rwth-i6/returnn-experiments/2018-asr-attention/librispeech/full-setup-attention/returnn/TFUtil.py", line 4624, in __init__
line: self.cuda_path = self._find_cuda_path()
locals:
self = <local> <TFUtil.CudaEnv object at 0x7fc32400b8d0>
self.cuda_path = <local> !AttributeError: 'CudaEnv' object has no attribute 'cuda_path'
self._find_cuda_path = <local> <bound method CudaEnv._find_cuda_path of <class 'TFUtil.CudaEnv'>>
File "/home/ubuntu/rwth-i6/returnn-experiments/2018-asr-attention/librispeech/full-setup-attention/returnn/TFUtil.py", line 4721, in _find_cuda_path
line: for p in cls._cuda_path_candidates():
locals:
p = <not found>
cls = <local> <class 'TFUtil.CudaEnv'>
cls._cuda_path_candidates = <local> <bound method CudaEnv._cuda_path_candidates of <class 'TFUtil.CudaEnv'>>
File "/home/ubuntu/rwth-i6/returnn-experiments/2018-asr-attention/librispeech/full-setup-attention/returnn/TFUtil.py", line 4678, in _cuda_path_candidates
line: p = cls._cuda_path_candidate_via_proc_map_libcudart()
locals:
p = <not found>
cls = <local> <class 'TFUtil.CudaEnv'>
cls._cuda_path_candidate_via_proc_map_libcudart = <local> <bound method CudaEnv._cuda_path_candidate_via_proc_map_libcudart of <class 'TFUtil.CudaEnv'>>
File "/home/ubuntu/rwth-i6/returnn-experiments/2018-asr-attention/librispeech/full-setup-attention/returnn/TFUtil.py", line 4672, in _cuda_path_candidate_via_proc_map_libcudart
line: assert p not in ["", "/"], "No parent dir of %r is a valid CUDA path." % fn
locals:
p = <local> '/'
fn = <local> '/home/ubuntu/anaconda3/lib/libcudart.so.10.1.243', len = 48
AssertionError: No parent dir of '/home/ubuntu/anaconda3/lib/libcudart.so.10.1.243' is a valid CUDA path.
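For context, the assertion at the bottom of both tracebacks comes from a parent-directory walk over whichever libcudart the process has loaded. A minimal sketch of that logic follows; the function name and the `bin/nvcc` heuristic are assumptions for illustration, not RETURNN's exact code:

```python
import os

def find_cuda_path_from_libcudart(fn):
    """Walk up the parent directories of the loaded libcudart and return
    the first one that looks like a CUDA install root (contains bin/nvcc).
    Hypothetical sketch of the path-candidate logic in TFUtil."""
    p = os.path.dirname(fn)
    while p not in ("", "/"):
        if os.path.exists(os.path.join(p, "bin", "nvcc")):
            return p
        p = os.path.dirname(p)
    # Mirrors the assertion seen in the log: if no parent is a full
    # CUDA toolkit install, the walk terminates at "/" and fails.
    assert p not in ["", "/"], "No parent dir of %r is a valid CUDA path." % fn
```

Here the process picked up `/home/ubuntu/anaconda3/lib/libcudart.so.10.1.243`, and none of its parents (`anaconda3/lib`, `anaconda3`, `/home/ubuntu`, ...) is a full toolkit install, so the walk reaches `/` and the assertion fires.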
-------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:
Process name: [[19355,1],0]
Exit code: 1
--------------------------------------------------------------------------
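A plausible workaround (an assumption based on the assertion above, not something this log confirms): make a complete CUDA toolkit install, one that actually contains `bin/nvcc`, visible to the training processes instead of the runtime-only Anaconda copy of libcudart, for example:

```shell
# Hypothetical fix: put a full CUDA toolkit (with bin/nvcc) ahead of the
# Anaconda libcudart on the search paths. /usr/local/cuda is an assumed
# location; adjust to the actual install.
export PATH=/usr/local/cuda/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
```

With a toolkit root reachable from the loaded libcudart (or `nvcc` on `PATH`), the CUDA path detection should succeed and the `OpCodeCompiler` op compilation can proceed.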