@manish-kumar-garg
Created December 24, 2019 08:49
Distributed Error
rnn/rnn.py returnn-distributed.config
[ip-10-1-21-241:21504] Warning: could not find environment variable "HOROVOD_TIMELINE"
[ip-10-1-21-241:21504] Warning: could not find environment variable "DEBUG"
--------------------------------------------------------------------------
WARNING: Linux kernel CMA support was requested via the
btl_vader_single_copy_mechanism MCA variable, but CMA support is
not available due to restrictive ptrace settings.
The vader shared memory BTL will fall back on another single-copy
mechanism if one is available. This may result in lower performance.
Local host: ip-10-1-21-241
--------------------------------------------------------------------------
Horovod initialized. Hostname ip-10-1-21-241, pid 21509, rank 0 / size 2, local rank 0 / local size 2.
Horovod initialized. Hostname ip-10-1-21-241, pid 21510, rank 1 / size 2, local rank 1 / local size 2.
warning: unable to access '/home/ubuntu/.gitconfig': Is a directory
warning: unable to access '/home/ubuntu/.gitconfig': Is a directory
fatal: unknown error occurred while reading the configuration files
warning: unable to access '/home/ubuntu/.gitconfig': Is a directory
warning: unable to access '/home/ubuntu/.gitconfig': Is a directory
fatal: unknown error occurred while reading the configuration files
RETURNN starting up, version unknown(git exception: CalledProcessError(128, ('git', 'show', '-s', '--format=%ci', 'HEAD'))), date/time 2019-12-24-08-42-32 (UTC+0000), pid 21509, cwd /home/ubuntu/rwth-i6/returnn-experiments/2018-asr-attention/librispeech/full-setup-attention, Python /home/ubuntu/horovod_2/bin/python3
RETURNN command line options: ['returnn-distributed.config']
RETURNN starting up, version unknown(git exception: CalledProcessError(128, ('git', 'show', '-s', '--format=%ci', 'HEAD'))), date/time 2019-12-24-08-42-32 (UTC+0000), pid 21510, cwd /home/ubuntu/rwth-i6/returnn-experiments/2018-asr-attention/librispeech/full-setup-attention, Python /home/ubuntu/horovod_2/bin/python3
Hostname: ip-10-1-21-241
RETURNN command line options: ['returnn-distributed.config']
Hostname: ip-10-1-21-241
/home/ubuntu/horovod_2/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:526: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_qint8 = np.dtype([("qint8", np.int8, 1)])
/home/ubuntu/horovod_2/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:527: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_quint8 = np.dtype([("quint8", np.uint8, 1)])
/home/ubuntu/horovod_2/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:528: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_qint16 = np.dtype([("qint16", np.int16, 1)])
/home/ubuntu/horovod_2/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:529: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_quint16 = np.dtype([("quint16", np.uint16, 1)])
/home/ubuntu/horovod_2/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:530: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_qint32 = np.dtype([("qint32", np.int32, 1)])
/home/ubuntu/horovod_2/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:535: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
np_resource = np.dtype([("resource", np.ubyte, 1)])
/home/ubuntu/horovod_2/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:526: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_qint8 = np.dtype([("qint8", np.int8, 1)])
/home/ubuntu/horovod_2/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:527: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_quint8 = np.dtype([("quint8", np.uint8, 1)])
/home/ubuntu/horovod_2/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:528: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_qint16 = np.dtype([("qint16", np.int16, 1)])
/home/ubuntu/horovod_2/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:529: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_quint16 = np.dtype([("quint16", np.uint16, 1)])
/home/ubuntu/horovod_2/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:530: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_qint32 = np.dtype([("qint32", np.int32, 1)])
/home/ubuntu/horovod_2/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:535: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
np_resource = np.dtype([("resource", np.ubyte, 1)])
TensorFlow: 1.13.1 (b'v1.13.1-0-g6612da8951') (<site-package> in /home/ubuntu/horovod_2/lib/python3.6/site-packages/tensorflow)
TensorFlow: 1.13.1 (b'v1.13.1-0-g6612da8951') (<site-package> in /home/ubuntu/horovod_2/lib/python3.6/site-packages/tensorflow)
Horovod: Hostname ip-10-1-21-241, pid 21509, using GPU 0.
Horovod: Reduce type: grad
Horovod: Hostname ip-10-1-21-241, pid 21510, using GPU 1.
Setup TF inter and intra global thread pools, num_threads None, session opts {'gpu_options': {'visible_device_list': '0'}, 'log_device_placement': False, 'device_count': {'GPU': 0}}.
Setup TF inter and intra global thread pools, num_threads None, session opts {'gpu_options': {'visible_device_list': '1'}, 'log_device_placement': False, 'device_count': {'GPU': 0}}.
2019-12-24 08:42:32.121618: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2019-12-24 08:42:32.121734: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2019-12-24 08:42:32.599274: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:998] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-12-24 08:42:32.600216: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:998] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-12-24 08:42:32.608409: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:998] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-12-24 08:42:32.609131: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:998] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-12-24 08:42:32.612055: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x564500322210 executing computations on platform CUDA. Devices:
2019-12-24 08:42:32.612089: I tensorflow/compiler/xla/service/service.cc:158] StreamExecutor device (0): Tesla K80, Compute Capability 3.7
2019-12-24 08:42:32.612098: I tensorflow/compiler/xla/service/service.cc:158] StreamExecutor device (1): Tesla K80, Compute Capability 3.7
2019-12-24 08:42:32.612595: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x55eb17f9b210 executing computations on platform CUDA. Devices:
2019-12-24 08:42:32.612625: I tensorflow/compiler/xla/service/service.cc:158] StreamExecutor device (0): Tesla K80, Compute Capability 3.7
2019-12-24 08:42:32.612634: I tensorflow/compiler/xla/service/service.cc:158] StreamExecutor device (1): Tesla K80, Compute Capability 3.7
2019-12-24 08:42:32.614674: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2300090000 Hz
2019-12-24 08:42:32.614793: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2300090000 Hz
2019-12-24 08:42:32.616644: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x5645003481d0 executing computations on platform Host. Devices:
2019-12-24 08:42:32.616672: I tensorflow/compiler/xla/service/service.cc:158] StreamExecutor device (0): <undefined>, <undefined>
2019-12-24 08:42:32.616672: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x55eb17fc11d0 executing computations on platform Host. Devices:
2019-12-24 08:42:32.616695: I tensorflow/compiler/xla/service/service.cc:158] StreamExecutor device (0): <undefined>, <undefined>
2019-12-24 08:42:32.616742: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-12-24 08:42:32.616761: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990]
2019-12-24 08:42:32.616772: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-12-24 08:42:32.616784: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990]
CUDA_VISIBLE_DEVICES is set to '1,2'.
TF session gpu_options.visible_device_list is set to '1'.
Collecting TensorFlow device list...
CUDA_VISIBLE_DEVICES is set to '1,2'.
TF session gpu_options.visible_device_list is set to '0'.
Collecting TensorFlow device list...
2019-12-24 08:42:32.620117: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 0 with properties:
name: Tesla K80 major: 3 minor: 7 memoryClockRate(GHz): 0.8235
pciBusID: 0000:00:19.0
totalMemory: 11.17GiB freeMemory: 11.05GiB
2019-12-24 08:42:32.620152: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1512] Adding visible gpu devices: 1
2019-12-24 08:42:32.620300: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 0 with properties:
name: Tesla K80 major: 3 minor: 7 memoryClockRate(GHz): 0.8235
pciBusID: 0000:00:18.0
totalMemory: 11.17GiB freeMemory: 11.05GiB
2019-12-24 08:42:32.620334: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1512] Adding visible gpu devices: 0
2019-12-24 08:42:32.628371: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-12-24 08:42:32.628397: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990] 1
2019-12-24 08:42:32.628406: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 1: N
2019-12-24 08:42:32.628610: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10749 MB memory) -> physical GPU (device: 1, name: Tesla K80, pci bus id: 0000:00:19.0, compute capability: 3.7)
2019-12-24 08:42:32.629000: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-12-24 08:42:32.629025: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990] 0
2019-12-24 08:42:32.629034: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 0: N
2019-12-24 08:42:32.629191: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10749 MB memory) -> physical GPU (device: 0, name: Tesla K80, pci bus id: 0000:00:18.0, compute capability: 3.7)
Unhandled exception <class 'AssertionError'> in thread <_MainThread(MainThread, started 140139725219648)>, proc 21510.
Thread current, main, <_MainThread(MainThread, started 140139725219648)>:
(Excluded thread.)
That were all threads.
EXCEPTION
Unhandled exception <class 'AssertionError'> in thread <_MainThread(MainThread, started 140476757673792)>, proc 21509.
Thread current, main, <_MainThread(MainThread, started 140476757673792)>:
(Excluded thread.)
That were all threads.
EXCEPTION
Traceback (most recent call last):
File "returnn/rnn.py", line 654, in <module>
line: main(sys.argv)
locals:
main = <local> <function main at 0x7f74c8c1bea0>
sys = <local> <module 'sys' (built-in)>
sys.argv = <local> ['returnn/rnn.py', 'returnn-distributed.config'], _[0]: {len = 14}
File "returnn/rnn.py", line 641, in main
line: init(command_line_options=argv[1:])
locals:
init = <global> <function init at 0x7f74c8c1bbf8>
command_line_options = <not found>
argv = <local> ['returnn/rnn.py', 'returnn-distributed.config'], _[0]: {len = 14}
File "returnn/rnn.py", line 390, in init
line: init_backend_engine()
locals:
init_backend_engine = <global> <function init_backend_engine at 0x7f74c8c1bb70>
File "returnn/rnn.py", line 366, in init_backend_engine
line: print_available_devices(tf_session_opts=tf_session_opts, file=log.v2)
locals:
print_available_devices = <local> <function print_available_devices at 0x7f74b44747b8>
tf_session_opts = <local> {'gpu_options': {'visible_device_list': '1'}}
file = <not found>
log = <global> <Log.Log object at 0x7f747c4072b0>
log.v2 = <global> <Log.Stream object at 0x7f74b4488198>
File "/home/ubuntu/rwth-i6/returnn-experiments/2018-asr-attention/librispeech/full-setup-attention/returnn/TFUtil.py", line 3454, in print_available_devices
line: devs = get_tf_list_local_devices(tf_session_opts=tf_session_opts)
locals:
devs = <not found>
get_tf_list_local_devices = <global> <function get_tf_list_local_devices at 0x7f74b4474488>
tf_session_opts = <local> {'gpu_options': {'visible_device_list': '1'}}
File "/home/ubuntu/rwth-i6/returnn-experiments/2018-asr-attention/librispeech/full-setup-attention/returnn/TFUtil.py", line 3411, in get_tf_list_local_devices
line: dev.set_physical_device_desc(session=session)
locals:
dev = <local> <_DeviceAttributes name: '/job:localhost/replica:0/task:0/device:CPU:0', device_type: 'CPU', memory_limit_bytes: 268435456, physical_device_desc: None>
dev.set_physical_device_desc = <local> <bound method _DeviceAttributes.set_physical_device_desc of <_DeviceAttributes name: '/job:localhost/replica:0/task:0/device:CPU:0', device_type: 'CPU', memory_limit_bytes: 268435456, physical_device_desc: None>>
session = <local> <tensorflow.python.client.session.Session object at 0x7f74b44e1160>
File "/home/ubuntu/rwth-i6/returnn-experiments/2018-asr-attention/librispeech/full-setup-attention/returnn/TFUtil.py", line 3359, in set_physical_device_desc
line: physical_device_desc = session.run(get_device_attr(self.name))
locals:
physical_device_desc = <not found>
session = <local> <tensorflow.python.client.session.Session object at 0x7f74b44e1160>
session.run = <local> <bound method BaseSession.run of <tensorflow.python.client.session.Session object at 0x7f74b44e1160>>
get_device_attr = <global> <function get_device_attr at 0x7f74b4486620>
self = <local> <_DeviceAttributes name: '/job:localhost/replica:0/task:0/device:CPU:0', device_type: 'CPU', memory_limit_bytes: 268435456, physical_device_desc: None>
self.name = <local> '/job:localhost/replica:0/task:0/device:CPU:0', len = 44
File "/home/ubuntu/rwth-i6/returnn-experiments/2018-asr-attention/librispeech/full-setup-attention/returnn/TFUtil.py", line 8582, in get_device_attr
line: return _DeviceAttrMod.get_device_attr()
locals:
_DeviceAttrMod = <global> <class 'TFUtil._DeviceAttrMod'>
_DeviceAttrMod.get_device_attr = <global> <bound method _DeviceAttrMod.get_device_attr of <class 'TFUtil._DeviceAttrMod'>>
File "/home/ubuntu/rwth-i6/returnn-experiments/2018-asr-attention/librispeech/full-setup-attention/returnn/TFUtil.py", line 8570, in get_device_attr
line: return cls.get_mod().get_device_attr()
locals:
cls = <local> <class 'TFUtil._DeviceAttrMod'>
cls.get_mod = <local> <bound method _DeviceAttrMod.get_mod of <class 'TFUtil._DeviceAttrMod'>>
get_device_attr = <global> <function get_device_attr at 0x7f74b4486620>
File "/home/ubuntu/rwth-i6/returnn-experiments/2018-asr-attention/librispeech/full-setup-attention/returnn/TFUtil.py", line 8558, in get_mod
Traceback (most recent call last):
File "returnn/rnn.py", line 654, in <module>
line: main(sys.argv)
locals:
line: compiler = OpCodeCompiler(
base_name="GetDeviceAttr", code_version=1, code=src_code,
is_cpp=True, use_cuda_if_available=True,
# This would lead to a get_tf_list_local_devices call, which we might not want at this point.
cuda_auto_min_compute_capability=False,
verbose=verbose)
locals:
compiler = <not found>
OpCodeCompiler = <global> <class 'TFUtil.OpCodeCompiler'>
base_name = <not found>
code_version = <not found>
code = <not found>
src_code = <local> '\n #include "tensorflow/core/framework/common_shape_fns.h"\n #include "tensorflow/core/framework/op.h"\n #include "tensorflow/core/framework/op_kernel.h"\n #include "tensorflow/core/framework/device_attributes.pb.h"\n\n using namespace tensorflow;\n\n REGISTER_OP("GetDeviceAttr..., len = 1084
main = <local> <function main at 0x7fc34174dea0>
sys = <local> <module 'sys' (built-in)>
sys.argv = <local> ['returnn/rnn.py', 'returnn-distributed.config'], _[0]: {len = 14}
File "returnn/rnn.py", line 641, in main
line: init(command_line_options=argv[1:])
locals:
init = <global> <function init at 0x7fc34174dbf8>
is_cpp = <not found>
use_cuda_if_available = <not found>
File "/home/ubuntu/rwth-i6/returnn-experiments/2018-asr-attention/librispeech/full-setup-attention/returnn/TFUtil.py", line 4772, in __init__
line: self._cuda_env = use_cuda_if_available and CudaEnv.get_instance()
locals:
self = <local> !AttributeError: 'OpCodeCompiler' object has no attribute 'base_name'
self._cuda_env = <local> !AttributeError: 'OpCodeCompiler' object has no attribute '_cuda_env'
use_cuda_if_available = <local> True
CudaEnv = <global> <class 'TFUtil.CudaEnv'>
CudaEnv.get_instance = <global> <bound method CudaEnv.get_instance of <class 'TFUtil.CudaEnv'>>
command_line_options = <not found>
argv = <local> ['returnn/rnn.py', 'returnn-distributed.config'], _[0]: {len = 14}
File "returnn/rnn.py", line 390, in init
line: init_backend_engine()
locals:
init_backend_engine = <global> <function init_backend_engine at 0x7fc34174db70>
File "returnn/rnn.py", line 366, in init_backend_engine
line: print_available_devices(tf_session_opts=tf_session_opts, file=log.v2)
locals:
print_available_devices = <local> <function print_available_devices at 0x7fc323ff77b8>
File "/home/ubuntu/rwth-i6/returnn-experiments/2018-asr-attention/librispeech/full-setup-attention/returnn/TFUtil.py", line 4757, in get_instance
line: cls._instance = cls()
locals:
cls = <local> <class 'TFUtil.CudaEnv'>
cls._instance = <local> None
File "/home/ubuntu/rwth-i6/returnn-experiments/2018-asr-attention/librispeech/full-setup-attention/returnn/TFUtil.py", line 4624, in __init__
line: self.cuda_path = self._find_cuda_path()
locals:
self = <local> <TFUtil.CudaEnv object at 0x7f74b44888d0>
self.cuda_path = <local> !AttributeError: 'CudaEnv' object has no attribute 'cuda_path'
self._find_cuda_path = <local> <bound method CudaEnv._find_cuda_path of <class 'TFUtil.CudaEnv'>>
File "/home/ubuntu/rwth-i6/returnn-experiments/2018-asr-attention/librispeech/full-setup-attention/returnn/TFUtil.py", line 4721, in _find_cuda_path
line: for p in cls._cuda_path_candidates():
locals:
tf_session_opts = <local> {'gpu_options': {'visible_device_list': '0'}}
file = <not found>
log = <global> <Log.Log object at 0x7fc2f4f392b0>
log.v2 = <global> <Log.Stream object at 0x7fc32400b198>
File "/home/ubuntu/rwth-i6/returnn-experiments/2018-asr-attention/librispeech/full-setup-attention/returnn/TFUtil.py", line 3454, in print_available_devices
line: devs = get_tf_list_local_devices(tf_session_opts=tf_session_opts)
locals:
devs = <not found>
get_tf_list_local_devices = <global> <function get_tf_list_local_devices at 0x7fc323ff7488>
tf_session_opts = <local> {'gpu_options': {'visible_device_list': '0'}}
p = <not found>
cls = <local> <class 'TFUtil.CudaEnv'>
cls._cuda_path_candidates = <local> <bound method CudaEnv._cuda_path_candidates of <class 'TFUtil.CudaEnv'>>
File "/home/ubuntu/rwth-i6/returnn-experiments/2018-asr-attention/librispeech/full-setup-attention/returnn/TFUtil.py", line 4678, in _cuda_path_candidates
line: p = cls._cuda_path_candidate_via_proc_map_libcudart()
locals:
p = <not found>
cls = <local> <class 'TFUtil.CudaEnv'>
cls._cuda_path_candidate_via_proc_map_libcudart = <local> <bound method CudaEnv._cuda_path_candidate_via_proc_map_libcudart of <class 'TFUtil.CudaEnv'>>
File "/home/ubuntu/rwth-i6/returnn-experiments/2018-asr-attention/librispeech/full-setup-attention/returnn/TFUtil.py", line 4672, in _cuda_path_candidate_via_proc_map_libcudart
line: assert p not in ["", "/"], "No parent dir of %r is a valid CUDA path." % fn
locals:
p = <local> '/'
File "/home/ubuntu/rwth-i6/returnn-experiments/2018-asr-attention/librispeech/full-setup-attention/returnn/TFUtil.py", line 3411, in get_tf_list_local_devices
line: dev.set_physical_device_desc(session=session)
locals:
dev = <local> <_DeviceAttributes name: '/job:localhost/replica:0/task:0/device:CPU:0', device_type: 'CPU', memory_limit_bytes: 268435456, physical_device_desc: None>
dev.set_physical_device_desc = <local> <bound method _DeviceAttributes.set_physical_device_desc of <_DeviceAttributes name: '/job:localhost/replica:0/task:0/device:CPU:0', device_type: 'CPU', memory_limit_bytes: 268435456, physical_device_desc: None>>
session = <local> <tensorflow.python.client.session.Session object at 0x7fc3247f9160>
File "/home/ubuntu/rwth-i6/returnn-experiments/2018-asr-attention/librispeech/full-setup-attention/returnn/TFUtil.py", line 3359, in set_physical_device_desc
line: physical_device_desc = session.run(get_device_attr(self.name))
locals:
physical_device_desc = <not found>
session = <local> <tensorflow.python.client.session.Session object at 0x7fc3247f9160>
fn = <local> '/home/ubuntu/anaconda3/lib/libcudart.so.10.1.243', len = 48
AssertionError: No parent dir of '/home/ubuntu/anaconda3/lib/libcudart.so.10.1.243' is a valid CUDA path.
session.run = <local> <bound method BaseSession.run of <tensorflow.python.client.session.Session object at 0x7fc3247f9160>>
get_device_attr = <global> <function get_device_attr at 0x7fc324009620>
self = <local> <_DeviceAttributes name: '/job:localhost/replica:0/task:0/device:CPU:0', device_type: 'CPU', memory_limit_bytes: 268435456, physical_device_desc: None>
self.name = <local> '/job:localhost/replica:0/task:0/device:CPU:0', len = 44
File "/home/ubuntu/rwth-i6/returnn-experiments/2018-asr-attention/librispeech/full-setup-attention/returnn/TFUtil.py", line 8582, in get_device_attr
line: return _DeviceAttrMod.get_device_attr()
locals:
_DeviceAttrMod = <global> <class 'TFUtil._DeviceAttrMod'>
_DeviceAttrMod.get_device_attr = <global> <bound method _DeviceAttrMod.get_device_attr of <class 'TFUtil._DeviceAttrMod'>>
File "/home/ubuntu/rwth-i6/returnn-experiments/2018-asr-attention/librispeech/full-setup-attention/returnn/TFUtil.py", line 8570, in get_device_attr
line: return cls.get_mod().get_device_attr()
locals:
cls = <local> <class 'TFUtil._DeviceAttrMod'>
cls.get_mod = <local> <bound method _DeviceAttrMod.get_mod of <class 'TFUtil._DeviceAttrMod'>>
get_device_attr = <global> <function get_device_attr at 0x7fc324009620>
File "/home/ubuntu/rwth-i6/returnn-experiments/2018-asr-attention/librispeech/full-setup-attention/returnn/TFUtil.py", line 8558, in get_mod
line: compiler = OpCodeCompiler(
base_name="GetDeviceAttr", code_version=1, code=src_code,
is_cpp=True, use_cuda_if_available=True,
# This would lead to a get_tf_list_local_devices call, which we might not want at this point.
cuda_auto_min_compute_capability=False,
verbose=verbose)
locals:
compiler = <not found>
OpCodeCompiler = <global> <class 'TFUtil.OpCodeCompiler'>
base_name = <not found>
code_version = <not found>
code = <not found>
src_code = <local> '\n #include "tensorflow/core/framework/common_shape_fns.h"\n #include "tensorflow/core/framework/op.h"\n #include "tensorflow/core/framework/op_kernel.h"\n #include "tensorflow/core/framework/device_attributes.pb.h"\n\n using namespace tensorflow;\n\n REGISTER_OP("GetDeviceAttr..., len = 1084
is_cpp = <not found>
use_cuda_if_available = <not found>
File "/home/ubuntu/rwth-i6/returnn-experiments/2018-asr-attention/librispeech/full-setup-attention/returnn/TFUtil.py", line 4772, in __init__
line: self._cuda_env = use_cuda_if_available and CudaEnv.get_instance()
locals:
self = <local> !AttributeError: 'OpCodeCompiler' object has no attribute 'base_name'
self._cuda_env = <local> !AttributeError: 'OpCodeCompiler' object has no attribute '_cuda_env'
use_cuda_if_available = <local> True
CudaEnv = <global> <class 'TFUtil.CudaEnv'>
CudaEnv.get_instance = <global> <bound method CudaEnv.get_instance of <class 'TFUtil.CudaEnv'>>
File "/home/ubuntu/rwth-i6/returnn-experiments/2018-asr-attention/librispeech/full-setup-attention/returnn/TFUtil.py", line 4757, in get_instance
line: cls._instance = cls()
locals:
cls = <local> <class 'TFUtil.CudaEnv'>
cls._instance = <local> None
File "/home/ubuntu/rwth-i6/returnn-experiments/2018-asr-attention/librispeech/full-setup-attention/returnn/TFUtil.py", line 4624, in __init__
line: self.cuda_path = self._find_cuda_path()
locals:
self = <local> <TFUtil.CudaEnv object at 0x7fc32400b8d0>
self.cuda_path = <local> !AttributeError: 'CudaEnv' object has no attribute 'cuda_path'
self._find_cuda_path = <local> <bound method CudaEnv._find_cuda_path of <class 'TFUtil.CudaEnv'>>
File "/home/ubuntu/rwth-i6/returnn-experiments/2018-asr-attention/librispeech/full-setup-attention/returnn/TFUtil.py", line 4721, in _find_cuda_path
line: for p in cls._cuda_path_candidates():
locals:
p = <not found>
cls = <local> <class 'TFUtil.CudaEnv'>
cls._cuda_path_candidates = <local> <bound method CudaEnv._cuda_path_candidates of <class 'TFUtil.CudaEnv'>>
File "/home/ubuntu/rwth-i6/returnn-experiments/2018-asr-attention/librispeech/full-setup-attention/returnn/TFUtil.py", line 4678, in _cuda_path_candidates
line: p = cls._cuda_path_candidate_via_proc_map_libcudart()
locals:
p = <not found>
cls = <local> <class 'TFUtil.CudaEnv'>
cls._cuda_path_candidate_via_proc_map_libcudart = <local> <bound method CudaEnv._cuda_path_candidate_via_proc_map_libcudart of <class 'TFUtil.CudaEnv'>>
File "/home/ubuntu/rwth-i6/returnn-experiments/2018-asr-attention/librispeech/full-setup-attention/returnn/TFUtil.py", line 4672, in _cuda_path_candidate_via_proc_map_libcudart
line: assert p not in ["", "/"], "No parent dir of %r is a valid CUDA path." % fn
locals:
p = <local> '/'
fn = <local> '/home/ubuntu/anaconda3/lib/libcudart.so.10.1.243', len = 48
AssertionError: No parent dir of '/home/ubuntu/anaconda3/lib/libcudart.so.10.1.243' is a valid CUDA path.
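For context, the assertion at the bottom of both tracebacks comes from a parent-directory walk over whichever libcudart the process has loaded. A minimal sketch of that logic follows; the function name and the `bin/nvcc` heuristic are assumptions for illustration, not RETURNN's exact code:

```python
import os

def find_cuda_path_from_libcudart(fn):
    """Walk up the parent directories of the loaded libcudart and return
    the first one that looks like a CUDA install root (contains bin/nvcc).
    Hypothetical sketch of the path-candidate logic in TFUtil."""
    p = os.path.dirname(fn)
    while p not in ("", "/"):
        if os.path.exists(os.path.join(p, "bin", "nvcc")):
            return p
        p = os.path.dirname(p)
    # Mirrors the assertion seen in the log: if no parent is a full
    # CUDA toolkit install, the walk terminates at "/" and fails.
    assert p not in ["", "/"], "No parent dir of %r is a valid CUDA path." % fn
```

Here the process picked up `/home/ubuntu/anaconda3/lib/libcudart.so.10.1.243`, and none of its parents (`anaconda3/lib`, `anaconda3`, `/home/ubuntu`, ...) is a full toolkit install, so the walk reaches `/` and the assertion fires.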
-------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:
Process name: [[19355,1],0]
Exit code: 1
--------------------------------------------------------------------------
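A plausible workaround (an assumption based on the assertion above, not something this log confirms): make a complete CUDA toolkit install, one that actually contains `bin/nvcc`, visible to the training processes instead of the runtime-only Anaconda copy of libcudart, for example:

```shell
# Hypothetical fix: put a full CUDA toolkit (with bin/nvcc) ahead of the
# Anaconda libcudart on the search paths. /usr/local/cuda is an assumed
# location; adjust to the actual install.
export PATH=/usr/local/cuda/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
```

With a toolkit root reachable from the loaded libcudart (or `nvcc` on `PATH`), the CUDA path detection should succeed and the `OpCodeCompiler` op compilation can proceed.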