Created
December 24, 2019 08:49
Distributed Error
rnn/rnn.py returnn-distributed.config
[ip-10-1-21-241:21504] Warning: could not find environment variable "HOROVOD_TIMELINE"
[ip-10-1-21-241:21504] Warning: could not find environment variable "DEBUG"
--------------------------------------------------------------------------
WARNING: Linux kernel CMA support was requested via the
btl_vader_single_copy_mechanism MCA variable, but CMA support is
not available due to restrictive ptrace settings.
The vader shared memory BTL will fall back on another single-copy
mechanism if one is available. This may result in lower performance.
Local host: ip-10-1-21-241
--------------------------------------------------------------------------
Horovod initialized. Hostname ip-10-1-21-241, pid 21509, rank 0 / size 2, local rank 0 / local size 2.
Horovod initialized. Hostname ip-10-1-21-241, pid 21510, rank 1 / size 2, local rank 1 / local size 2.
warning: unable to access '/home/ubuntu/.gitconfig': Is a directory
warning: unable to access '/home/ubuntu/.gitconfig': Is a directory
fatal: unknown error occurred while reading the configuration files
warning: unable to access '/home/ubuntu/.gitconfig': Is a directory
warning: unable to access '/home/ubuntu/.gitconfig': Is a directory
fatal: unknown error occurred while reading the configuration files
RETURNN starting up, version unknown(git exception: CalledProcessError(128, ('git', 'show', '-s', '--format=%ci', 'HEAD'))), date/time 2019-12-24-08-42-32 (UTC+0000), pid 21509, cwd /home/ubuntu/rwth-i6/returnn-experiments/2018-asr-attention/librispeech/full-setup-attention, Python /home/ubuntu/horovod_2/bin/python3
RETURNN command line options: ['returnn-distributed.config']
RETURNN starting up, version unknown(git exception: CalledProcessError(128, ('git', 'show', '-s', '--format=%ci', 'HEAD'))), date/time 2019-12-24-08-42-32 (UTC+0000), pid 21510, cwd /home/ubuntu/rwth-i6/returnn-experiments/2018-asr-attention/librispeech/full-setup-attention, Python /home/ubuntu/horovod_2/bin/python3
Hostname: ip-10-1-21-241
RETURNN command line options: ['returnn-distributed.config']
Hostname: ip-10-1-21-241
/home/ubuntu/horovod_2/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:526: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_qint8 = np.dtype([("qint8", np.int8, 1)])
/home/ubuntu/horovod_2/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:527: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_quint8 = np.dtype([("quint8", np.uint8, 1)])
/home/ubuntu/horovod_2/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:528: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_qint16 = np.dtype([("qint16", np.int16, 1)])
/home/ubuntu/horovod_2/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:529: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_quint16 = np.dtype([("quint16", np.uint16, 1)])
/home/ubuntu/horovod_2/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:530: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_qint32 = np.dtype([("qint32", np.int32, 1)])
/home/ubuntu/horovod_2/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:535: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
np_resource = np.dtype([("resource", np.ubyte, 1)])
/home/ubuntu/horovod_2/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:526: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_qint8 = np.dtype([("qint8", np.int8, 1)])
/home/ubuntu/horovod_2/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:527: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_quint8 = np.dtype([("quint8", np.uint8, 1)])
/home/ubuntu/horovod_2/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:528: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_qint16 = np.dtype([("qint16", np.int16, 1)])
/home/ubuntu/horovod_2/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:529: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_quint16 = np.dtype([("quint16", np.uint16, 1)])
/home/ubuntu/horovod_2/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:530: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_qint32 = np.dtype([("qint32", np.int32, 1)])
/home/ubuntu/horovod_2/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:535: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
np_resource = np.dtype([("resource", np.ubyte, 1)])
TensorFlow: 1.13.1 (b'v1.13.1-0-g6612da8951') (<site-package> in /home/ubuntu/horovod_2/lib/python3.6/site-packages/tensorflow)
TensorFlow: 1.13.1 (b'v1.13.1-0-g6612da8951') (<site-package> in /home/ubuntu/horovod_2/lib/python3.6/site-packages/tensorflow)
Horovod: Hostname ip-10-1-21-241, pid 21509, using GPU 0.
Horovod: Reduce type: grad
Horovod: Hostname ip-10-1-21-241, pid 21510, using GPU 1.
Setup TF inter and intra global thread pools, num_threads None, session opts {'gpu_options': {'visible_device_list': '0'}, 'log_device_placement': False, 'device_count': {'GPU': 0}}.
Setup TF inter and intra global thread pools, num_threads None, session opts {'gpu_options': {'visible_device_list': '1'}, 'log_device_placement': False, 'device_count': {'GPU': 0}}.
2019-12-24 08:42:32.121618: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2019-12-24 08:42:32.121734: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2019-12-24 08:42:32.599274: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:998] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-12-24 08:42:32.600216: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:998] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-12-24 08:42:32.608409: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:998] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-12-24 08:42:32.609131: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:998] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-12-24 08:42:32.612055: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x564500322210 executing computations on platform CUDA. Devices:
2019-12-24 08:42:32.612089: I tensorflow/compiler/xla/service/service.cc:158] StreamExecutor device (0): Tesla K80, Compute Capability 3.7
2019-12-24 08:42:32.612098: I tensorflow/compiler/xla/service/service.cc:158] StreamExecutor device (1): Tesla K80, Compute Capability 3.7
2019-12-24 08:42:32.612595: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x55eb17f9b210 executing computations on platform CUDA. Devices:
2019-12-24 08:42:32.612625: I tensorflow/compiler/xla/service/service.cc:158] StreamExecutor device (0): Tesla K80, Compute Capability 3.7
2019-12-24 08:42:32.612634: I tensorflow/compiler/xla/service/service.cc:158] StreamExecutor device (1): Tesla K80, Compute Capability 3.7
2019-12-24 08:42:32.614674: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2300090000 Hz
2019-12-24 08:42:32.614793: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2300090000 Hz
2019-12-24 08:42:32.616644: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x5645003481d0 executing computations on platform Host. Devices:
2019-12-24 08:42:32.616672: I tensorflow/compiler/xla/service/service.cc:158] StreamExecutor device (0): <undefined>, <undefined>
2019-12-24 08:42:32.616672: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x55eb17fc11d0 executing computations on platform Host. Devices:
2019-12-24 08:42:32.616695: I tensorflow/compiler/xla/service/service.cc:158] StreamExecutor device (0): <undefined>, <undefined>
2019-12-24 08:42:32.616742: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-12-24 08:42:32.616761: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990]
2019-12-24 08:42:32.616772: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-12-24 08:42:32.616784: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990]
CUDA_VISIBLE_DEVICES is set to '1,2'.
TF session gpu_options.visible_device_list is set to '1'.
Collecting TensorFlow device list...
CUDA_VISIBLE_DEVICES is set to '1,2'.
TF session gpu_options.visible_device_list is set to '0'.
Collecting TensorFlow device list...
2019-12-24 08:42:32.620117: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 0 with properties:
name: Tesla K80 major: 3 minor: 7 memoryClockRate(GHz): 0.8235
pciBusID: 0000:00:19.0
totalMemory: 11.17GiB freeMemory: 11.05GiB
2019-12-24 08:42:32.620152: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1512] Adding visible gpu devices: 1
2019-12-24 08:42:32.620300: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 0 with properties:
name: Tesla K80 major: 3 minor: 7 memoryClockRate(GHz): 0.8235
pciBusID: 0000:00:18.0
totalMemory: 11.17GiB freeMemory: 11.05GiB
2019-12-24 08:42:32.620334: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1512] Adding visible gpu devices: 0
2019-12-24 08:42:32.628371: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-12-24 08:42:32.628397: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990] 1
2019-12-24 08:42:32.628406: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 1: N
2019-12-24 08:42:32.628610: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10749 MB memory) -> physical GPU (device: 1, name: Tesla K80, pci bus id: 0000:00:19.0, compute capability: 3.7)
2019-12-24 08:42:32.629000: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-12-24 08:42:32.629025: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990] 0
2019-12-24 08:42:32.629034: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 0: N
2019-12-24 08:42:32.629191: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10749 MB memory) -> physical GPU (device: 0, name: Tesla K80, pci bus id: 0000:00:18.0, compute capability: 3.7)
Unhandled exception <class 'AssertionError'> in thread <_MainThread(MainThread, started 140139725219648)>, proc 21510.
Thread current, main, <_MainThread(MainThread, started 140139725219648)>:
(Excluded thread.)
That were all threads.
EXCEPTION
Unhandled exception <class 'AssertionError'> in thread <_MainThread(MainThread, started 140476757673792)>, proc 21509.
Thread current, main, <_MainThread(MainThread, started 140476757673792)>:
(Excluded thread.)
That were all threads.
EXCEPTION
Traceback (most recent call last):
File "returnn/rnn.py", line 654, in <module>
line: main(sys.argv)
locals:
main = <local> <function main at 0x7f74c8c1bea0>
sys = <local> <module 'sys' (built-in)>
sys.argv = <local> ['returnn/rnn.py', 'returnn-distributed.config'], _[0]: {len = 14}
File "returnn/rnn.py", line 641, in main
line: init(command_line_options=argv[1:])
locals:
init = <global> <function init at 0x7f74c8c1bbf8>
command_line_options = <not found>
argv = <local> ['returnn/rnn.py', 'returnn-distributed.config'], _[0]: {len = 14}
File "returnn/rnn.py", line 390, in init
line: init_backend_engine()
locals:
init_backend_engine = <global> <function init_backend_engine at 0x7f74c8c1bb70>
File "returnn/rnn.py", line 366, in init_backend_engine
line: print_available_devices(tf_session_opts=tf_session_opts, file=log.v2)
locals:
print_available_devices = <local> <function print_available_devices at 0x7f74b44747b8>
tf_session_opts = <local> {'gpu_options': {'visible_device_list': '1'}}
file = <not found>
log = <global> <Log.Log object at 0x7f747c4072b0>
log.v2 = <global> <Log.Stream object at 0x7f74b4488198>
File "/home/ubuntu/rwth-i6/returnn-experiments/2018-asr-attention/librispeech/full-setup-attention/returnn/TFUtil.py", line 3454, in print_available_devices
line: devs = get_tf_list_local_devices(tf_session_opts=tf_session_opts)
locals:
devs = <not found>
get_tf_list_local_devices = <global> <function get_tf_list_local_devices at 0x7f74b4474488>
tf_session_opts = <local> {'gpu_options': {'visible_device_list': '1'}}
File "/home/ubuntu/rwth-i6/returnn-experiments/2018-asr-attention/librispeech/full-setup-attention/returnn/TFUtil.py", line 3411, in get_tf_list_local_devices
line: dev.set_physical_device_desc(session=session)
locals:
dev = <local> <_DeviceAttributes name: '/job:localhost/replica:0/task:0/device:CPU:0', device_type: 'CPU', memory_limit_bytes: 268435456, physical_device_desc: None>
dev.set_physical_device_desc = <local> <bound method _DeviceAttributes.set_physical_device_desc of <_DeviceAttributes name: '/job:localhost/replica:0/task:0/device:CPU:0', device_type: 'CPU', memory_limit_bytes: 268435456, physical_device_desc: None>>
session = <local> <tensorflow.python.client.session.Session object at 0x7f74b44e1160>
File "/home/ubuntu/rwth-i6/returnn-experiments/2018-asr-attention/librispeech/full-setup-attention/returnn/TFUtil.py", line 3359, in set_physical_device_desc
line: physical_device_desc = session.run(get_device_attr(self.name))
locals:
physical_device_desc = <not found>
session = <local> <tensorflow.python.client.session.Session object at 0x7f74b44e1160>
session.run = <local> <bound method BaseSession.run of <tensorflow.python.client.session.Session object at 0x7f74b44e1160>>
get_device_attr = <global> <function get_device_attr at 0x7f74b4486620>
self = <local> <_DeviceAttributes name: '/job:localhost/replica:0/task:0/device:CPU:0', device_type: 'CPU', memory_limit_bytes: 268435456, physical_device_desc: None>
self.name = <local> '/job:localhost/replica:0/task:0/device:CPU:0', len = 44
File "/home/ubuntu/rwth-i6/returnn-experiments/2018-asr-attention/librispeech/full-setup-attention/returnn/TFUtil.py", line 8582, in get_device_attr
line: return _DeviceAttrMod.get_device_attr()
locals:
_DeviceAttrMod = <global> <class 'TFUtil._DeviceAttrMod'>
_DeviceAttrMod.get_device_attr = <global> <bound method _DeviceAttrMod.get_device_attr of <class 'TFUtil._DeviceAttrMod'>>
File "/home/ubuntu/rwth-i6/returnn-experiments/2018-asr-attention/librispeech/full-setup-attention/returnn/TFUtil.py", line 8570, in get_device_attr
line: return cls.get_mod().get_device_attr()
locals:
cls = <local> <class 'TFUtil._DeviceAttrMod'>
cls.get_mod = <local> <bound method _DeviceAttrMod.get_mod of <class 'TFUtil._DeviceAttrMod'>>
get_device_attr = <global> <function get_device_attr at 0x7f74b4486620>
File "/home/ubuntu/rwth-i6/returnn-experiments/2018-asr-attention/librispeech/full-setup-attention/returnn/TFUtil.py", line 8558, in get_mod
Traceback (most recent call last):
File "returnn/rnn.py", line 654, in <module>
line: main(sys.argv)
locals:
line: compiler = OpCodeCompiler(
base_name="GetDeviceAttr", code_version=1, code=src_code,
is_cpp=True, use_cuda_if_available=True,
# This would lead to a get_tf_list_local_devices call, which we might not want at this point.
cuda_auto_min_compute_capability=False,
verbose=verbose)
locals:
compiler = <not found>
OpCodeCompiler = <global> <class 'TFUtil.OpCodeCompiler'>
base_name = <not found>
code_version = <not found>
code = <not found>
src_code = <local> '\n    #include "tensorflow/core/framework/common_shape_fns.h"\n    #include "tensorflow/core/framework/op.h"\n    #include "tensorflow/core/framework/op_kernel.h"\n    #include "tensorflow/core/framework/device_attributes.pb.h"\n\n    using namespace tensorflow;\n\n    REGISTER_OP("GetDeviceAttr..., len = 1084
main = <local> <function main at 0x7fc34174dea0>
sys = <local> <module 'sys' (built-in)>
sys.argv = <local> ['returnn/rnn.py', 'returnn-distributed.config'], _[0]: {len = 14}
File "returnn/rnn.py", line 641, in main
line: init(command_line_options=argv[1:])
locals:
init = <global> <function init at 0x7fc34174dbf8>
is_cpp = <not found>
use_cuda_if_available = <not found>
File "/home/ubuntu/rwth-i6/returnn-experiments/2018-asr-attention/librispeech/full-setup-attention/returnn/TFUtil.py", line 4772, in __init__
line: self._cuda_env = use_cuda_if_available and CudaEnv.get_instance()
locals:
self = <local> !AttributeError: 'OpCodeCompiler' object has no attribute 'base_name'
self._cuda_env = <local> !AttributeError: 'OpCodeCompiler' object has no attribute '_cuda_env'
use_cuda_if_available = <local> True
CudaEnv = <global> <class 'TFUtil.CudaEnv'>
CudaEnv.get_instance = <global> <bound method CudaEnv.get_instance of <class 'TFUtil.CudaEnv'>>
command_line_options = <not found>
argv = <local> ['returnn/rnn.py', 'returnn-distributed.config'], _[0]: {len = 14}
File "returnn/rnn.py", line 390, in init
line: init_backend_engine()
locals:
init_backend_engine = <global> <function init_backend_engine at 0x7fc34174db70>
File "returnn/rnn.py", line 366, in init_backend_engine
line: print_available_devices(tf_session_opts=tf_session_opts, file=log.v2)
locals:
print_available_devices = <local> <function print_available_devices at 0x7fc323ff77b8>
File "/home/ubuntu/rwth-i6/returnn-experiments/2018-asr-attention/librispeech/full-setup-attention/returnn/TFUtil.py", line 4757, in get_instance
line: cls._instance = cls()
locals:
cls = <local> <class 'TFUtil.CudaEnv'>
cls._instance = <local> None
File "/home/ubuntu/rwth-i6/returnn-experiments/2018-asr-attention/librispeech/full-setup-attention/returnn/TFUtil.py", line 4624, in __init__
line: self.cuda_path = self._find_cuda_path()
locals:
self = <local> <TFUtil.CudaEnv object at 0x7f74b44888d0>
self.cuda_path = <local> !AttributeError: 'CudaEnv' object has no attribute 'cuda_path'
self._find_cuda_path = <local> <bound method CudaEnv._find_cuda_path of <class 'TFUtil.CudaEnv'>>
File "/home/ubuntu/rwth-i6/returnn-experiments/2018-asr-attention/librispeech/full-setup-attention/returnn/TFUtil.py", line 4721, in _find_cuda_path
line: for p in cls._cuda_path_candidates():
locals:
tf_session_opts = <local> {'gpu_options': {'visible_device_list': '0'}}
file = <not found>
log = <global> <Log.Log object at 0x7fc2f4f392b0>
log.v2 = <global> <Log.Stream object at 0x7fc32400b198>
File "/home/ubuntu/rwth-i6/returnn-experiments/2018-asr-attention/librispeech/full-setup-attention/returnn/TFUtil.py", line 3454, in print_available_devices
line: devs = get_tf_list_local_devices(tf_session_opts=tf_session_opts)
locals:
devs = <not found>
get_tf_list_local_devices = <global> <function get_tf_list_local_devices at 0x7fc323ff7488>
tf_session_opts = <local> {'gpu_options': {'visible_device_list': '0'}}
p = <not found>
cls = <local> <class 'TFUtil.CudaEnv'>
cls._cuda_path_candidates = <local> <bound method CudaEnv._cuda_path_candidates of <class 'TFUtil.CudaEnv'>>
File "/home/ubuntu/rwth-i6/returnn-experiments/2018-asr-attention/librispeech/full-setup-attention/returnn/TFUtil.py", line 4678, in _cuda_path_candidates
line: p = cls._cuda_path_candidate_via_proc_map_libcudart()
locals:
p = <not found>
cls = <local> <class 'TFUtil.CudaEnv'>
cls._cuda_path_candidate_via_proc_map_libcudart = <local> <bound method CudaEnv._cuda_path_candidate_via_proc_map_libcudart of <class 'TFUtil.CudaEnv'>>
File "/home/ubuntu/rwth-i6/returnn-experiments/2018-asr-attention/librispeech/full-setup-attention/returnn/TFUtil.py", line 4672, in _cuda_path_candidate_via_proc_map_libcudart
line: assert p not in ["", "/"], "No parent dir of %r is a valid CUDA path." % fn
locals:
p = <local> '/'
File "/home/ubuntu/rwth-i6/returnn-experiments/2018-asr-attention/librispeech/full-setup-attention/returnn/TFUtil.py", line 3411, in get_tf_list_local_devices
line: dev.set_physical_device_desc(session=session)
locals:
dev = <local> <_DeviceAttributes name: '/job:localhost/replica:0/task:0/device:CPU:0', device_type: 'CPU', memory_limit_bytes: 268435456, physical_device_desc: None>
dev.set_physical_device_desc = <local> <bound method _DeviceAttributes.set_physical_device_desc of <_DeviceAttributes name: '/job:localhost/replica:0/task:0/device:CPU:0', device_type: 'CPU', memory_limit_bytes: 268435456, physical_device_desc: None>>
session = <local> <tensorflow.python.client.session.Session object at 0x7fc3247f9160>
line: dev.set_physical_device_desc(session=session) | |
locals: | |
dev = <local> <_DeviceAttributes name: '/job:localhost/replica:0/task:0/device:CPU:0', device_type: 'CPU', memory_limit_bytes: 268435456, physical_device_desc: None> | |
dev.set_physical_device_desc = <local> <bound method _DeviceAttributes.set_physical_device_desc of <_DeviceAttributes name: '/job:localhost/replica:0/task:0/device:CPU:0', device_type: 'CPU', memory_limit_bytes: 268435456, physical_device_desc: None>> | |
session = <local> <tensorflow.python.client.session.Session object at 0x7fc3247f9160> | |
File "/home/ubuntu/rwth-i6/returnn-experiments/2018-asr-attention/librispeech/full-setup-attention/returnn/TFUtil.py", line 3359, in set_physical_device_desc | |
line: physical_device_desc = session.run(get_device_attr(self.name)) | |
locals: | |
physical_device_desc = <not found> | |
session = <local> <tensorflow.python.client.session.Session object at 0x7fc3247f9160> | |
fn = <local> '/home/ubuntu/anaconda3/lib/libcudart.so.10.1.243', len = 48 | |
AssertionError: No parent dir of '/home/ubuntu/anaconda3/lib/libcudart.so.10.1.243' is a valid CUDA path. | |
session.run = <local> <bound method BaseSession.run of <tensorflow.python.client.session.Session object at 0x7fc3247f9160>> | |
get_device_attr = <global> <function get_device_attr at 0x7fc324009620> | |
self = <local> <_DeviceAttributes name: '/job:localhost/replica:0/task:0/device:CPU:0', device_type: 'CPU', memory_limit_bytes: 268435456, physical_device_desc: None> | |
self.name = <local> '/job:localhost/replica:0/task:0/device:CPU:0', len = 44 | |
File "/home/ubuntu/rwth-i6/returnn-experiments/2018-asr-attention/librispeech/full-setup-attention/returnn/TFUtil.py", line 8582, in get_device_attr | |
line: return _DeviceAttrMod.get_device_attr() | |
locals: | |
_DeviceAttrMod = <global> <class 'TFUtil._DeviceAttrMod'> | |
_DeviceAttrMod.get_device_attr = <global> <bound method _DeviceAttrMod.get_device_attr of <class 'TFUtil._DeviceAttrMod'>> | |
File "/home/ubuntu/rwth-i6/returnn-experiments/2018-asr-attention/librispeech/full-setup-attention/returnn/TFUtil.py", line 8570, in get_device_attr | |
line: return cls.get_mod().get_device_attr() | |
locals: | |
cls = <local> <class 'TFUtil._DeviceAttrMod'> | |
cls.get_mod = <local> <bound method _DeviceAttrMod.get_mod of <class 'TFUtil._DeviceAttrMod'>> | |
get_device_attr = <global> <function get_device_attr at 0x7fc324009620> | |
File "/home/ubuntu/rwth-i6/returnn-experiments/2018-asr-attention/librispeech/full-setup-attention/returnn/TFUtil.py", line 8558, in get_mod | |
line: compiler = OpCodeCompiler( | |
base_name="GetDeviceAttr", code_version=1, code=src_code, | |
is_cpp=True, use_cuda_if_available=True, | |
# This would lead to a get_tf_list_local_devices call, which we might not want at this point. | |
cuda_auto_min_compute_capability=False, | |
verbose=verbose) | |
locals: | |
compiler = <not found> | |
OpCodeCompiler = <global> <class 'TFUtil.OpCodeCompiler'> | |
base_name = <not found> | |
code_version = <not found> | |
code = <not found> | |
src_code = <local> '\n #include "tensorflow/core/framework/common_shape_fns.h"\n #include "tensorflow/core/framework/op.h"\n #include "tensorflow/core/framework/op_kernel.h"\n #include "tensorflow/core/framework/device_attributes.pb.h"\n\n using namespace tensorflow;\n\n REGISTER_OP("GetDeviceAttr..., len = 1084 | |
is_cpp = <not found> | |
use_cuda_if_available = <not found> | |
File "/home/ubuntu/rwth-i6/returnn-experiments/2018-asr-attention/librispeech/full-setup-attention/returnn/TFUtil.py", line 4772, in __init__ | |
line: self._cuda_env = use_cuda_if_available and CudaEnv.get_instance() | |
locals: | |
self = <local> !AttributeError: 'OpCodeCompiler' object has no attribute 'base_name' | |
self._cuda_env = <local> !AttributeError: 'OpCodeCompiler' object has no attribute '_cuda_env' | |
use_cuda_if_available = <local> True | |
CudaEnv = <global> <class 'TFUtil.CudaEnv'> | |
CudaEnv.get_instance = <global> <bound method CudaEnv.get_instance of <class 'TFUtil.CudaEnv'>> | |
File "/home/ubuntu/rwth-i6/returnn-experiments/2018-asr-attention/librispeech/full-setup-attention/returnn/TFUtil.py", line 4757, in get_instance | |
line: cls._instance = cls() | |
locals: | |
cls = <local> <class 'TFUtil.CudaEnv'> | |
cls._instance = <local> None | |
File "/home/ubuntu/rwth-i6/returnn-experiments/2018-asr-attention/librispeech/full-setup-attention/returnn/TFUtil.py", line 4624, in __init__ | |
line: self.cuda_path = self._find_cuda_path() | |
locals: | |
self = <local> <TFUtil.CudaEnv object at 0x7fc32400b8d0> | |
self.cuda_path = <local> !AttributeError: 'CudaEnv' object has no attribute 'cuda_path' | |
self._find_cuda_path = <local> <bound method CudaEnv._find_cuda_path of <class 'TFUtil.CudaEnv'>> | |
File "/home/ubuntu/rwth-i6/returnn-experiments/2018-asr-attention/librispeech/full-setup-attention/returnn/TFUtil.py", line 4721, in _find_cuda_path | |
line: for p in cls._cuda_path_candidates(): | |
locals: | |
p = <not found> | |
cls = <local> <class 'TFUtil.CudaEnv'> | |
cls._cuda_path_candidates = <local> <bound method CudaEnv._cuda_path_candidates of <class 'TFUtil.CudaEnv'>> | |
File "/home/ubuntu/rwth-i6/returnn-experiments/2018-asr-attention/librispeech/full-setup-attention/returnn/TFUtil.py", line 4678, in _cuda_path_candidates | |
line: p = cls._cuda_path_candidate_via_proc_map_libcudart() | |
locals: | |
p = <not found> | |
cls = <local> <class 'TFUtil.CudaEnv'> | |
cls._cuda_path_candidate_via_proc_map_libcudart = <local> <bound method CudaEnv._cuda_path_candidate_via_proc_map_libcudart of <class 'TFUtil.CudaEnv'>> | |
File "/home/ubuntu/rwth-i6/returnn-experiments/2018-asr-attention/librispeech/full-setup-attention/returnn/TFUtil.py", line 4672, in _cuda_path_candidate_via_proc_map_libcudart | |
line: assert p not in ["", "/"], "No parent dir of %r is a valid CUDA path." % fn | |
locals: | |
p = <local> '/' | |
fn = <local> '/home/ubuntu/anaconda3/lib/libcudart.so.10.1.243', len = 48 | |
AssertionError: No parent dir of '/home/ubuntu/anaconda3/lib/libcudart.so.10.1.243' is a valid CUDA path. | |
-------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:
Process name: [[19355,1],0]
Exit code: 1
--------------------------------------------------------------------------