undefined context (mismatch in C++ ABI?) for 'lstm0_fw/rec/NativeLstm2' (op: 'NativeLstm2') with input shapes: [?,?,2048], [512,2048], [?,512], [?,512], [?,?], [], [] #387

Closed
yanghongjiazheng opened this issue Nov 10, 2020 · 5 comments

yanghongjiazheng commented Nov 10, 2020

When I run 22_train.sh, I get the following error:

./returnn/rnn.py returnn.config
+ ./returnn/rnn.py returnn.config
RETURNN starting up, version 1.20201030.173802+git.4f9d197, date/time 2020-11-10-17-39-51 (UTC+0800), pid 539710, cwd /home/conan/jiayang/returnn_chinese_char, Python /home/conan/anaconda3/envs/tf14/bin/python3
RETURNN command line options: ['returnn.config']
Hostname: qcbje-solar-gpu135-vm
TensorFlow: 1.14.0 (v1.14.0-rc1-22-gaf24dc91b5) (<site-package> in /home/conan/anaconda3/envs/tf14/lib/python3.6/site-packages/tensorflow)
Setup TF inter and intra global thread pools, num_threads None, session opts {'log_device_placement': False, 'device_count': {'GPU': 0}}.
2020-11-10 17:39:55.101179: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2020-11-10 17:39:55.114874: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcuda.so.1
2020-11-10 17:39:55.397685: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-11-10 17:39:55.412379: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x55c01b981c60 executing computations on platform CUDA. Devices:
2020-11-10 17:39:55.412404: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (0): Tesla V100-SXM2-32GB, Compute Capability 7.0
2020-11-10 17:39:55.416132: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2499935000 Hz
2020-11-10 17:39:55.431244: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x55c01baccb20 executing computations on platform Host. Devices:
2020-11-10 17:39:55.431274: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (0): <undefined>, <undefined>
2020-11-10 17:39:55.431393: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1181] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-11-10 17:39:55.431416: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1187]
CUDA_VISIBLE_DEVICES is set to '3'.
Collecting TensorFlow device list...
2020-11-10 17:39:55.435690: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-11-10 17:39:55.437725: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 0 with properties:
name: Tesla V100-SXM2-32GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0000:00:0c.0
2020-11-10 17:39:55.438084: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudart.so.10.0
2020-11-10 17:39:55.445534: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcublas.so.10.0
2020-11-10 17:39:55.447664: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcufft.so.10.0
2020-11-10 17:39:55.448254: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcurand.so.10.0
2020-11-10 17:39:55.451139: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcusolver.so.10.0
2020-11-10 17:39:55.453355: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcusparse.so.10.0
2020-11-10 17:39:55.459892: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudnn.so.7
2020-11-10 17:39:55.460092: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-11-10 17:39:55.462210: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-11-10 17:39:55.464183: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1763] Adding visible gpu devices: 0
2020-11-10 17:39:55.464233: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudart.so.10.0
2020-11-10 17:39:55.467180: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1181] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-11-10 17:39:55.467196: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1187]      0
2020-11-10 17:39:55.467207: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1200] 0:   N
2020-11-10 17:39:55.467391: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-11-10 17:39:55.470503: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-11-10 17:39:55.474442: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1326] Created TensorFlow device (/device:GPU:0 with 30555 MB memory) -> physical GPU (device: 0, name: Tesla V100-SXM2-32GB, pci bus id: 0000:00:0c.0, compute capability: 7.0)
Local devices available to TensorFlow:
  1/4: name: "/device:CPU:0"
       device_type: "CPU"
       memory_limit: 268435456
       locality {
       }
       incarnation: 15347392412389737261
  2/4: name: "/device:XLA_GPU:0"
       device_type: "XLA_GPU"
       memory_limit: 17179869184
       locality {
       }
       incarnation: 1659417248176014554
       physical_device_desc: "device: XLA_GPU device"
  3/4: name: "/device:XLA_CPU:0"
       device_type: "XLA_CPU"
       memory_limit: 17179869184
       locality {
       }
       incarnation: 16160866742038758926
       physical_device_desc: "device: XLA_CPU device"
  4/4: name: "/device:GPU:0"
       device_type: "GPU"
       memory_limit: 32040203060
       locality {
         bus_id: 1
         links {
         }
       }
       incarnation: 14743396709713712718
       physical_device_desc: "device: 0, name: Tesla V100-SXM2-32GB, pci bus id: 0000:00:0c.0, compute capability: 7.0"
Using gpu device 3: Tesla V100-SXM2-32GB
<LibriSpeechCorpus 'train' epoch=1>, epoch 1. Old mean seq len (transcription) is 19.036096, new is 11.850397, requested max is 75.000000. Old num seqs is 61669, new num seqs is 30835.
<LibriSpeechCorpus 'train' epoch=1>, epoch 1. Old num seqs 61669, new num seqs 30835.
<LibriSpeechCorpus 'train' epoch=1>, epoch 1. Old mean seq len (transcription) is 19.036096, new is 11.850397, requested max is 75.000000. Old num seqs is 61669, new num seqs is 30835.
<LibriSpeechCorpus 'train' epoch=1>, epoch 1. Old num seqs 61669, new num seqs 30835.
Train data:
  input: 40 x 1
  output: {'raw': {'dtype': 'string', 'shape': ()}, 'classes': [6848, 1], 'data': [40, 2]}
  LibriSpeechCorpus, sequences: 30835, frames: unknown
Dev data:
  LibriSpeechCorpus, sequences: 20000, frames: unknown
Learning-rate-control: file data/exp-returnn/train-scores.data does not exist yet
Update config key 'max_seq_length' for epoch 1: {'classes': 25} -> {'classes': 60}
Setup TF session with options {'log_device_placement': False, 'device_count': {'GPU': 1}} ...
2020-11-10 17:40:55.637157: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-11-10 17:40:55.652671: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 0 with properties:
name: Tesla V100-SXM2-32GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0000:00:0c.0
2020-11-10 17:40:55.652790: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudart.so.10.0
2020-11-10 17:40:55.652823: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcublas.so.10.0
2020-11-10 17:40:55.652839: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcufft.so.10.0
2020-11-10 17:40:55.652856: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcurand.so.10.0
2020-11-10 17:40:55.652872: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcusolver.so.10.0
2020-11-10 17:40:55.652887: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcusparse.so.10.0
2020-11-10 17:40:55.652904: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudnn.so.7
2020-11-10 17:40:55.653075: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-11-10 17:40:55.655160: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-11-10 17:40:55.657112: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1763] Adding visible gpu devices: 0
2020-11-10 17:40:55.657144: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1181] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-11-10 17:40:55.657154: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1187]      0
2020-11-10 17:40:55.657161: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1200] 0:   N
2020-11-10 17:40:55.657328: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-11-10 17:40:55.659370: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-11-10 17:40:55.661356: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1326] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 30555 MB memory) -> physical GPU (device: 0, name: Tesla V100-SXM2-32GB, pci bus id: 0000:00:0c.0, compute capability: 7.0)
layer root/'data' output: Data(name='data', shape=(None, 40), batch_shape_meta=[B,T|'time:var:extern_data:data',F|40])
layer root/'source' output: Data(name='source_output', shape=(None, 40), batch_shape_meta=[B,T|'time:var:extern_data:data',F|40])
layer root/'lstm0_fw' output: Data(name='lstm0_fw_output', shape=(None, 512), batch_dim_axis=1, batch_shape_meta=[T|'time:var:extern_data:data',B,F|512])
Exception creating layer root/'lstm0_fw' of class RecLayer with opts:
{'direction': 1,
 'n_out': 512,
 'name': 'lstm0_fw',
 'network': <TFNetwork 'root' train=<tf.Tensor 'globals/train_flag:0' shape=() dtype=bool>>,
 'output': Data(name='lstm0_fw_output', shape=(None, 512), batch_dim_axis=1, batch_shape_meta=[T|'time:var:extern_data:data',B,F|512]),
 'sources': [<EvalLayer 'source' out_type=Data(shape=(None, 40), batch_shape_meta=[B,T|'time:var:extern_data:data',F|40])>],
 'unit': 'nativelstm2'}
Unhandled exception <class 'ValueError'> in thread <_MainThread(MainThread, started 140376601720640)>, proc 539710.

Thread current, main, <_MainThread(MainThread, started 140376601720640)>:
(Excluded thread.)

That were all threads.
EXCEPTION
Traceback (most recent call last):
  File "./returnn/rnn.py", line 11, in <module>
    line: main()
    locals:
      main = <local> <function main at 0x7fabebd7be18>
  File "/home/conan/jiayang/returnn_chinese_char/returnn/returnn/__main__.py", line 645, in main
    line: execute_main_task()
    locals:
      execute_main_task = <global> <function execute_main_task at 0x7fabebd7bd08>
  File "/home/conan/jiayang/returnn_chinese_char/returnn/returnn/__main__.py", line 451, in execute_main_task
    line: engine.init_train_from_config(config, train_data, dev_data, eval_data)
    locals:
      engine = <global> <returnn.tf.engine.Engine object at 0x7fa8642d9c88>
      engine.init_train_from_config = <global> <bound method Engine.init_train_from_config of <returnn.tf.engine.Engine object at 0x7fa8642d9c88>>
      config = <global> <returnn.config.Config object at 0x7fabf9661fd0>
      train_data = <global> <LibriSpeechCorpus 'train' epoch=1>
      dev_data = <global> <LibriSpeechCorpus 'dev' epoch=1>
      eval_data = <global> None
  File "/home/conan/jiayang/returnn_chinese_char/returnn/returnn/tf/engine.py", line 1042, in init_train_from_config
    line: self.init_network_from_config(config)
    locals:
      self = <local> <returnn.tf.engine.Engine object at 0x7fa8642d9c88>
      self.init_network_from_config = <local> <bound method Engine.init_network_from_config of <returnn.tf.engine.Engine object at 0x7fa8642d9c88>>
      config = <local> <returnn.config.Config object at 0x7fabf9661fd0>
  File "/home/conan/jiayang/returnn_chinese_char/returnn/returnn/tf/engine.py", line 1107, in init_network_from_config
    line: self._init_network(net_desc=net_dict, epoch=self.epoch)
    locals:
      self = <local> <returnn.tf.engine.Engine object at 0x7fa8642d9c88>
      self._init_network = <local> <bound method Engine._init_network of <returnn.tf.engine.Engine object at 0x7fa8642d9c88>>
      net_desc = <not found>
      net_dict = <local> {'source': {'class': 'eval', 'eval': 'tf.clip_by_value(source(0), -3.0, 3.0)'}, 'lstm0_fw': {'class': 'rec', 'unit': 'nativelstm2', 'n_out': 512, 'direction': 1, 'from': ['source']}, 'lstm0_bw': {'class': 'rec', 'unit': 'nativelstm2', 'n_out': 512, 'direction': -1, 'from': ['source']}, 'lstm0_poo..., len = 14
      epoch = <local> None
      self.epoch = <local> 1
  File "/home/conan/jiayang/returnn_chinese_char/returnn/returnn/tf/engine.py", line 1292, in _init_network
    line: self.network, self.updater = self.create_network(
            config=self.config,
            extern_data=extern_data,
            rnd_seed=net_random_seed,
            train_flag=train_flag, eval_flag=self.use_eval_flag, search_flag=self.use_search_flag,
            initial_learning_rate=getattr(self, "initial_learning_rate", None),
            net_dict=net_desc)
    locals:
      self = <local> <returnn.tf.engine.Engine object at 0x7fa8642d9c88>
      self.network = <local> None
      self.updater = <local> None
      self.create_network = <local> <bound method Engine.create_network of <class 'returnn.tf.engine.Engine'>>
      config = <not found>
      self.config = <local> <returnn.config.Config object at 0x7fabf9661fd0>
      extern_data = <local> <ExternData data={'classes': Data(name='classes', shape=(None,), dtype='int32', sparse=True, dim=6848, available_for_inference=False, batch_shape_meta=[B,T|'time:var:extern_data:classes']), 'data': Data(name='data', shape=(None, 40), batch_shape_meta=[B,T|'time:var:extern_data:data',F|40])}>
      rnd_seed = <not found>
      net_random_seed = <local> 1
      train_flag = <local> <tf.Tensor 'globals/train_flag:0' shape=() dtype=bool>
      eval_flag = <not found>
      self.use_eval_flag = <local> True
      search_flag = <not found>
      self.use_search_flag = <local> False
      initial_learning_rate = <not found>
      getattr = <builtin> <built-in function getattr>
      net_dict = <not found>
      net_desc = <local> {'source': {'class': 'eval', 'eval': 'tf.clip_by_value(source(0), -3.0, 3.0)'}, 'lstm0_fw': {'class': 'rec', 'unit': 'nativelstm2', 'n_out': 512, 'direction': 1, 'from': ['source']}, 'lstm0_bw': {'class': 'rec', 'unit': 'nativelstm2', 'n_out': 512, 'direction': -1, 'from': ['source']}, 'lstm0_poo..., len = 14
  File "/home/conan/jiayang/returnn_chinese_char/returnn/returnn/tf/engine.py", line 1327, in create_network
    line: network.construct_from_dict(net_dict)
    locals:
      network = <local> <TFNetwork 'root' train=<tf.Tensor 'globals/train_flag:0' shape=() dtype=bool>>
      network.construct_from_dict = <local> <bound method TFNetwork.construct_from_dict of <TFNetwork 'root' train=<tf.Tensor 'globals/train_flag:0' shape=() dtype=bool>>>
      net_dict = <local> {'source': {'class': 'eval', 'eval': 'tf.clip_by_value(source(0), -3.0, 3.0)'}, 'lstm0_fw': {'class': 'rec', 'unit': 'nativelstm2', 'n_out': 512, 'direction': 1, 'from': ['source']}, 'lstm0_bw': {'class': 'rec', 'unit': 'nativelstm2', 'n_out': 512, 'direction': -1, 'from': ['source']}, 'lstm0_poo..., len = 14
  File "/home/conan/jiayang/returnn_chinese_char/returnn/returnn/tf/network.py", line 447, in construct_from_dict
    line: self.construct_layer(net_dict, name)
    locals:
      self = <local> <TFNetwork 'root' train=<tf.Tensor 'globals/train_flag:0' shape=() dtype=bool>>
      self.construct_layer = <local> <bound method TFNetwork.construct_layer of <TFNetwork 'root' train=<tf.Tensor 'globals/train_flag:0' shape=() dtype=bool>>>
      net_dict = <local> {'source': {'class': 'eval', 'eval': 'tf.clip_by_value(source(0), -3.0, 3.0)'}, 'lstm0_fw': {'class': 'rec', 'unit': 'nativelstm2', 'n_out': 512, 'direction': 1, 'from': ['source']}, 'lstm0_bw': {'class': 'rec', 'unit': 'nativelstm2', 'n_out': 512, 'direction': -1, 'from': ['source']}, 'lstm0_poo..., len = 14
      name = <local> 'ctc'
  File "/home/conan/jiayang/returnn_chinese_char/returnn/returnn/tf/network.py", line 639, in construct_layer
    line: layer_class.transform_config_dict(layer_desc, network=self, get_layer=get_layer)
    locals:
      layer_class = <local> <class 'returnn.tf.layers.basic.SoftmaxLayer'>
      layer_class.transform_config_dict = <local> <bound method LayerBase.transform_config_dict of <class 'returnn.tf.layers.basic.SoftmaxLayer'>>
      layer_desc = <local> {'loss': 'ctc', 'target': 'classes', 'loss_opts': {'beam_width': 1, 'ctc_opts': {'ignore_longer_outputs_than_inputs': True}}}
      network = <not found>
      self = <local> <TFNetwork 'root' train=<tf.Tensor 'globals/train_flag:0' shape=() dtype=bool>>
      get_layer = <local> <function TFNetwork.construct_layer.<locals>.get_layer at 0x7fabe7c52ea0>
  File "/home/conan/jiayang/returnn_chinese_char/returnn/returnn/tf/layers/base.py", line 447, in transform_config_dict
    line: for src_name in src_names
    locals:
      src_name = <not found>
      src_names = <local> ['encoder'], _[0]: {len = 7}
  File "/home/conan/jiayang/returnn_chinese_char/returnn/returnn/tf/layers/base.py", line 448, in <listcomp>
    line: d["sources"] = [
            get_layer(src_name)
            for src_name in src_names
            if not src_name == "none"]
    locals:
      d = <not found>
      get_layer = <local> <function TFNetwork.construct_layer.<locals>.get_layer at 0x7fabe7c52ea0>
      src_name = <local> 'encoder', len = 7
      src_names = <not found>
  File "/home/conan/jiayang/returnn_chinese_char/returnn/returnn/tf/network.py", line 594, in get_layer
    line: return self.construct_layer(net_dict=net_dict, name=src_name)  # set get_layer to wrap construct_layer
    locals:
      self = <local> <TFNetwork 'root' train=<tf.Tensor 'globals/train_flag:0' shape=() dtype=bool>>
      self.construct_layer = <local> <bound method TFNetwork.construct_layer of <TFNetwork 'root' train=<tf.Tensor 'globals/train_flag:0' shape=() dtype=bool>>>
      net_dict = <local> {'source': {'class': 'eval', 'eval': 'tf.clip_by_value(source(0), -3.0, 3.0)'}, 'lstm0_fw': {'class': 'rec', 'unit': 'nativelstm2', 'n_out': 512, 'direction': 1, 'from': ['source']}, 'lstm0_bw': {'class': 'rec', 'unit': 'nativelstm2', 'n_out': 512, 'direction': -1, 'from': ['source']}, 'lstm0_poo..., len = 14
      name = <not found>
      src_name = <local> 'encoder', len = 7
  File "/home/conan/jiayang/returnn_chinese_char/returnn/returnn/tf/network.py", line 639, in construct_layer
    line: layer_class.transform_config_dict(layer_desc, network=self, get_layer=get_layer)
    locals:
      layer_class = <local> <class 'returnn.tf.layers.basic.CopyLayer'>
      layer_class.transform_config_dict = <local> <bound method CopyLayer.transform_config_dict of <class 'returnn.tf.layers.basic.CopyLayer'>>
      layer_desc = <local> {}
      network = <not found>
      self = <local> <TFNetwork 'root' train=<tf.Tensor 'globals/train_flag:0' shape=() dtype=bool>>
      get_layer = <local> <function TFNetwork.construct_layer.<locals>.get_layer at 0x7fa8631c0598>
  File "/home/conan/jiayang/returnn_chinese_char/returnn/returnn/tf/layers/basic.py", line 313, in transform_config_dict
    line: super(CopyLayer, cls).transform_config_dict(d, network=network, get_layer=get_layer)
    locals:
      super = <builtin> <class 'super'>
      CopyLayer = <global> <class 'returnn.tf.layers.basic.CopyLayer'>
      cls = <local> <class 'returnn.tf.layers.basic.CopyLayer'>
      transform_config_dict = <not found>
      d = <local> {}
      network = <local> <TFNetwork 'root' train=<tf.Tensor 'globals/train_flag:0' shape=() dtype=bool>>
      get_layer = <local> <function TFNetwork.construct_layer.<locals>.get_layer at 0x7fa8631c0598>
  File "/home/conan/jiayang/returnn_chinese_char/returnn/returnn/tf/layers/base.py", line 447, in transform_config_dict
    line: for src_name in src_names
    locals:
      src_name = <not found>
      src_names = <local> ['lstm5_fw', 'lstm5_bw'], _[0]: {len = 8}
  File "/home/conan/jiayang/returnn_chinese_char/returnn/returnn/tf/layers/base.py", line 448, in <listcomp>
    line: d["sources"] = [
            get_layer(src_name)
            for src_name in src_names
            if not src_name == "none"]
    locals:
      d = <not found>
      get_layer = <local> <function TFNetwork.construct_layer.<locals>.get_layer at 0x7fa8631c0598>
      src_name = <local> 'lstm5_fw', len = 8
      src_names = <not found>
  File "/home/conan/jiayang/returnn_chinese_char/returnn/returnn/tf/network.py", line 594, in get_layer
    line: return self.construct_layer(net_dict=net_dict, name=src_name)  # set get_layer to wrap construct_layer
    locals:
      self = <local> <TFNetwork 'root' train=<tf.Tensor 'globals/train_flag:0' shape=() dtype=bool>>
      self.construct_layer = <local> <bound method TFNetwork.construct_layer of <TFNetwork 'root' train=<tf.Tensor 'globals/train_flag:0' shape=() dtype=bool>>>
      net_dict = <local> {'source': {'class': 'eval', 'eval': 'tf.clip_by_value(source(0), -3.0, 3.0)'}, 'lstm0_fw': {'class': 'rec', 'unit': 'nativelstm2', 'n_out': 512, 'direction': 1, 'from': ['source']}, 'lstm0_bw': {'class': 'rec', 'unit': 'nativelstm2', 'n_out': 512, 'direction': -1, 'from': ['source']}, 'lstm0_poo..., len = 14
      name = <not found>
      src_name = <local> 'lstm5_fw', len = 8
  File "/home/conan/jiayang/returnn_chinese_char/returnn/returnn/tf/network.py", line 639, in construct_layer
    line: layer_class.transform_config_dict(layer_desc, network=self, get_layer=get_layer)
    locals:
      layer_class = <local> <class 'returnn.tf.layers.rec.RecLayer'>
      layer_class.transform_config_dict = <local> <bound method RecLayer.transform_config_dict of <class 'returnn.tf.layers.rec.RecLayer'>>
      layer_desc = <local> {'unit': 'nativelstm2', 'n_out': 512, 'direction': 1, 'dropout': 0}
      network = <not found>
      self = <local> <TFNetwork 'root' train=<tf.Tensor 'globals/train_flag:0' shape=() dtype=bool>>
      get_layer = <local> <function TFNetwork.construct_layer.<locals>.get_layer at 0x7fa863268158>
  File "/home/conan/jiayang/returnn_chinese_char/returnn/returnn/tf/layers/rec.py", line 297, in transform_config_dict
    line: super(RecLayer, cls).transform_config_dict(d, network=network, get_layer=get_layer)  # everything except "unit"
    locals:
      super = <builtin> <class 'super'>
      RecLayer = <global> <class 'returnn.tf.layers.rec.RecLayer'>
      cls = <local> <class 'returnn.tf.layers.rec.RecLayer'>
      transform_config_dict = <not found>
      d = <local> {'unit': 'nativelstm2', 'n_out': 512, 'direction': 1, 'dropout': 0}
      network = <local> <TFNetwork 'root' train=<tf.Tensor 'globals/train_flag:0' shape=() dtype=bool>>
      get_layer = <local> <function TFNetwork.construct_layer.<locals>.get_layer at 0x7fa863268158>
  File "/home/conan/jiayang/returnn_chinese_char/returnn/returnn/tf/layers/base.py", line 447, in transform_config_dict
    line: for src_name in src_names
    locals:
      src_name = <not found>
      src_names = <local> ['lstm0_pool'], _[0]: {len = 10}
  File "/home/conan/jiayang/returnn_chinese_char/returnn/returnn/tf/layers/base.py", line 448, in <listcomp>
    line: d["sources"] = [
            get_layer(src_name)
            for src_name in src_names
            if not src_name == "none"]
    locals:
      d = <not found>
      get_layer = <local> <function TFNetwork.construct_layer.<locals>.get_layer at 0x7fa863268158>
      src_name = <local> 'lstm0_pool', len = 10
      src_names = <not found>
  File "/home/conan/jiayang/returnn_chinese_char/returnn/returnn/tf/network.py", line 594, in get_layer
    line: return self.construct_layer(net_dict=net_dict, name=src_name)  # set get_layer to wrap construct_layer
    locals:
      self = <local> <TFNetwork 'root' train=<tf.Tensor 'globals/train_flag:0' shape=() dtype=bool>>
      self.construct_layer = <local> <bound method TFNetwork.construct_layer of <TFNetwork 'root' train=<tf.Tensor 'globals/train_flag:0' shape=() dtype=bool>>>
      net_dict = <local> {'source': {'class': 'eval', 'eval': 'tf.clip_by_value(source(0), -3.0, 3.0)'}, 'lstm0_fw': {'class': 'rec', 'unit': 'nativelstm2', 'n_out': 512, 'direction': 1, 'from': ['source']}, 'lstm0_bw': {'class': 'rec', 'unit': 'nativelstm2', 'n_out': 512, 'direction': -1, 'from': ['source']}, 'lstm0_poo..., len = 14
      name = <not found>
      src_name = <local> 'lstm0_pool', len = 10
  File "/home/conan/jiayang/returnn_chinese_char/returnn/returnn/tf/network.py", line 639, in construct_layer
    line: layer_class.transform_config_dict(layer_desc, network=self, get_layer=get_layer)
    locals:
      layer_class = <local> <class 'returnn.tf.layers.basic.PoolLayer'>
      layer_class.transform_config_dict = <local> <bound method LayerBase.transform_config_dict of <class 'returnn.tf.layers.basic.PoolLayer'>>
      layer_desc = <local> {'mode': 'max', 'padding': 'same', 'pool_size': (32,), 'trainable': False}
      network = <not found>
      self = <local> <TFNetwork 'root' train=<tf.Tensor 'globals/train_flag:0' shape=() dtype=bool>>
      get_layer = <local> <function TFNetwork.construct_layer.<locals>.get_layer at 0x7fa863268268>
  File "/home/conan/jiayang/returnn_chinese_char/returnn/returnn/tf/layers/base.py", line 447, in transform_config_dict
    line: for src_name in src_names
    locals:
      src_name = <not found>
      src_names = <local> ['lstm0_fw', 'lstm0_bw'], _[0]: {len = 8}
  File "/home/conan/jiayang/returnn_chinese_char/returnn/returnn/tf/layers/base.py", line 448, in <listcomp>
    line: d["sources"] = [
            get_layer(src_name)
            for src_name in src_names
            if not src_name == "none"]
    locals:
      d = <not found>
      get_layer = <local> <function TFNetwork.construct_layer.<locals>.get_layer at 0x7fa863268268>
      src_name = <local> 'lstm0_fw', len = 8
      src_names = <not found>
  File "/home/conan/jiayang/returnn_chinese_char/returnn/returnn/tf/network.py", line 594, in get_layer
    line: return self.construct_layer(net_dict=net_dict, name=src_name)  # set get_layer to wrap construct_layer
    locals:
      self = <local> <TFNetwork 'root' train=<tf.Tensor 'globals/train_flag:0' shape=() dtype=bool>>
      self.construct_layer = <local> <bound method TFNetwork.construct_layer of <TFNetwork 'root' train=<tf.Tensor 'globals/train_flag:0' shape=() dtype=bool>>>
      net_dict = <local> {'source': {'class': 'eval', 'eval': 'tf.clip_by_value(source(0), -3.0, 3.0)'}, 'lstm0_fw': {'class': 'rec', 'unit': 'nativelstm2', 'n_out': 512, 'direction': 1, 'from': ['source']}, 'lstm0_bw': {'class': 'rec', 'unit': 'nativelstm2', 'n_out': 512, 'direction': -1, 'from': ['source']}, 'lstm0_poo..., len = 14
      name = <not found>
      src_name = <local> 'lstm0_fw', len = 8
  File "/home/conan/jiayang/returnn_chinese_char/returnn/returnn/tf/network.py", line 642, in construct_layer
    line: return add_layer(name=name, layer_class=layer_class, **layer_desc)
    locals:
      add_layer = <local> <bound method TFNetwork.add_layer of <TFNetwork 'root' train=<tf.Tensor 'globals/train_flag:0' shape=() dtype=bool>>>
      name = <local> 'lstm0_fw', len = 8
      layer_class = <local> <class 'returnn.tf.layers.rec.RecLayer'>
      layer_desc = <local> {'unit': 'nativelstm2', 'n_out': 512, 'direction': 1, 'sources': [<EvalLayer 'source' out_type=Data(shape=(None, 40), batch_shape_meta=[B,T|'time:var:extern_data:data',F|40])>]}
  File "/home/conan/jiayang/returnn_chinese_char/returnn/returnn/tf/network.py", line 752, in add_layer
    line: layer = self._create_layer(name=name, layer_class=layer_class, **layer_desc)
    locals:
      layer = <not found>
      self = <local> <TFNetwork 'root' train=<tf.Tensor 'globals/train_flag:0' shape=() dtype=bool>>
      self._create_layer = <local> <bound method TFNetwork._create_layer of <TFNetwork 'root' train=<tf.Tensor 'globals/train_flag:0' shape=() dtype=bool>>>
      name = <local> 'lstm0_fw', len = 8
      layer_class = <local> <class 'returnn.tf.layers.rec.RecLayer'>
      layer_desc = <local> {'unit': 'nativelstm2', 'n_out': 512, 'direction': 1, 'sources': [<EvalLayer 'source' out_type=Data(shape=(None, 40), batch_shape_meta=[B,T|'time:var:extern_data:data',F|40])>]}
  File "/home/conan/jiayang/returnn_chinese_char/returnn/returnn/tf/network.py", line 701, in _create_layer
    line: layer = layer_class(**layer_desc)
    locals:
      layer = <not found>
      layer_class = <local> <class 'returnn.tf.layers.rec.RecLayer'>
      layer_desc = <local> {'unit': 'nativelstm2', 'n_out': 512, 'direction': 1, 'sources': [<EvalLayer 'source' out_type=Data(shape=(None, 40), batch_shape_meta=[B,T|'time:var:extern_data:data',F|40])>], 'name': 'lstm0_fw', 'network': <TFNetwork 'root' train=<tf.Tensor 'globals/train_flag:0' shape=() dtype=bool>>, 'output..., len = 7
  File "/home/conan/jiayang/returnn_chinese_char/returnn/returnn/tf/layers/rec.py", line 228, in __init__
    line: y = self._get_output_native_rec_op(self.cell)
    locals:
      y = <not found>
      self = <local> <RecLayer 'lstm0_fw' out_type=Data(shape=(None, 512), batch_dim_axis=1, batch_shape_meta=[T|'time:var:extern_data:data',B,F|512])>
      self._get_output_native_rec_op = <local> <bound method RecLayer._get_output_native_rec_op of <RecLayer 'lstm0_fw' out_type=Data(shape=(None, 512), batch_dim_axis=1, batch_shape_meta=[T|'time:var:extern_data:data',B,F|512])>>
      self.cell = <local> <returnn.tf.native_op.NativeLstm2 object at 0x7fa860c316a0>
  File "/home/conan/jiayang/returnn_chinese_char/returnn/returnn/tf/layers/rec.py", line 886, in _get_output_native_rec_op
    line: y, final_state = cell(
            inputs=x, index=index,
            initial_state=self._initial_state,
            recurrent_weights_initializer=self._rec_weights_initializer)
    locals:
      y = <not found>
      final_state = <not found>
      cell = <local> <returnn.tf.native_op.NativeLstm2 object at 0x7fa860c316a0>
      inputs = <not found>
      x = <local> <tf.Tensor 'lstm0_fw/rec/add:0' shape=(?, ?, 2048) dtype=float32>
      index = <local> <tf.Tensor 'extern_data/placeholders/data/sequence_mask_time_major/transpose:0' shape=(?, ?) dtype=bool>
      initial_state = <not found>
      self = <local> <RecLayer 'lstm0_fw' out_type=Data(shape=(None, 512), batch_dim_axis=1, batch_shape_meta=[T|'time:var:extern_data:data',B,F|512])>
      self._initial_state = <local> None
      recurrent_weights_initializer = <not found>
      self._rec_weights_initializer = <local> None
  File "/home/conan/jiayang/returnn_chinese_char/returnn/returnn/tf/native_op.py", line 905, in __call__
    line: out, _, _, final_cell_state = self.op(inputs, weights, y0, c0, index, start, step)  # noqa
    locals:
      out = <not found>
      _ = <not found>
      final_cell_state = <not found>
      self = <local> <returnn.tf.native_op.NativeLstm2 object at 0x7fa860c316a0>
      self.op = <local> <function native_lstm2 at 0x7fa863268a60>
      inputs = <local> <tf.Tensor 'lstm0_fw/rec/add:0' shape=(?, ?, 2048) dtype=float32>
      weights = <local> <tf.Variable 'lstm0_fw/rec/W_re:0' shape=(512, 2048) dtype=float32_ref>
      y0 = <local> <tf.Tensor 'lstm0_fw/rec/initial_h:0' shape=(?, 512) dtype=float32>
      c0 = <local> <tf.Tensor 'lstm0_fw/rec/initial_c:0' shape=(?, 512) dtype=float32>
      index = <local> <tf.Tensor 'extern_data/placeholders/data/sequence_mask_time_major/cast_float32:0' shape=(?, ?) dtype=float32>
      start = <local> <tf.Tensor 'lstm0_fw/rec/start:0' shape=() dtype=int32>
      step = <local> <tf.Tensor 'lstm0_fw/rec/step:0' shape=() dtype=int32>
  File "<string>", line 90, in native_lstm2
    -- code not available --
  File "/home/conan/anaconda3/envs/tf14/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 788, in _apply_op_helper
    line: op = g.create_op(op_type_name, inputs, dtypes=None, name=scope,
                           input_types=input_types, attrs=attr_protos,
                           op_def=op_def)
    locals:
      op = <not found>
      g = <local> <tensorflow.python.framework.ops.Graph object at 0x7faaff824048>
      g.create_op = <local> <bound method Graph.create_op of <tensorflow.python.framework.ops.Graph object at 0x7faaff824048>>
      op_type_name = <local> 'NativeLstm2', len = 11
      inputs = <local> [<tf.Tensor 'lstm0_fw/rec/add:0' shape=(?, ?, 2048) dtype=float32>, <tf.Tensor 'lstm0_fw/rec/W_re/read:0' shape=(512, 2048) dtype=float32>, <tf.Tensor 'lstm0_fw/rec/initial_h:0' shape=(?, 512) dtype=float32>, <tf.Tensor 'lstm0_fw/rec/initial_c:0' shape=(?, 512) dtype=float32>, <tf.Tensor 'extern_..., len = 7
      dtypes = <global> <module 'tensorflow.python.framework.dtypes' from '/home/conan/anaconda3/envs/tf14/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py'>
      name = <local> 'NativeLstm2', len = 11
      scope = <local> 'lstm0_fw/rec/NativeLstm2/', len = 25
      input_types = <local> [tf.float32, tf.float32, tf.float32, tf.float32, tf.float32, tf.int32, tf.int32], len = 7
      attrs = <not found>
      attr_protos = <local> {}
      op_def = <local> name: "NativeLstm2"
                       input_arg {
                         name: "x"
                         type: DT_FLOAT
                       }
                       input_arg {
                         name: "w"
                         type: DT_FLOAT
                       }
                       input_arg {
                         name: "y0"
                         type: DT_FLOAT
                       }
                       input_arg {
                         name: "c0"
                         type: DT_FLOAT
                       }
                       input_arg {
                         name: "i"
                         type: DT_FLOAT
                       }
                       input_arg {
                         name: "start"
                         type: DT_INT32
                       }
                       input_arg {
                        ...
  File "/home/conan/anaconda3/envs/tf14/lib/python3.6/site-packages/tensorflow/python/util/deprecation.py", line 507, in new_func
    line: return func(*args, **kwargs)
    locals:
      func = <local> <function Graph.create_op at 0x7fabe9112b70>
      args = <local> (<tensorflow.python.framework.ops.Graph object at 0x7faaff824048>, 'NativeLstm2', [<tf.Tensor 'lstm0_fw/rec/add:0' shape=(?, ?, 2048) dtype=float32>, <tf.Tensor 'lstm0_fw/rec/W_re/read:0' shape=(512, 2048) dtype=float32>, <tf.Tensor 'lstm0_fw/rec/initial_h:0' shape=(?, 512) dtype=float32>, <tf.Te...
      kwargs = <local> {'dtypes': None, 'name': 'lstm0_fw/rec/NativeLstm2/', 'input_types': [tf.float32, tf.float32, tf.float32, tf.float32, tf.float32, tf.int32, tf.int32], 'attrs': {}, 'op_def': name: "NativeLstm2"
                       input_arg {
                         name: "x"
                         type: DT_FLOAT
                       }
                       input_arg {
                         name: "w"
                         type: DT_FLOAT
                       }
                       input_arg {
                         nam...
  File "/home/conan/anaconda3/envs/tf14/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 3616, in create_op
    line: ret = Operation(
              node_def,
              self,
              inputs=inputs,
              output_types=dtypes,
              control_inputs=control_inputs,
              input_types=input_types,
              original_op=self._default_original_op,
              op_def=op_def)
    locals:
      ret = <not found>
      Operation = <global> <class 'tensorflow.python.framework.ops.Operation'>
      node_def = <local> name: "lstm0_fw/rec/NativeLstm2"
                         op: "NativeLstm2"

      self = <local> <tensorflow.python.framework.ops.Graph object at 0x7faaff824048>
      inputs = <local> [<tf.Tensor 'lstm0_fw/rec/add:0' shape=(?, ?, 2048) dtype=float32>, <tf.Tensor 'lstm0_fw/rec/W_re/read:0' shape=(512, 2048) dtype=float32>, <tf.Tensor 'lstm0_fw/rec/initial_h:0' shape=(?, 512) dtype=float32>, <tf.Tensor 'lstm0_fw/rec/initial_c:0' shape=(?, 512) dtype=float32>, <tf.Tensor 'extern_..., len = 7
      output_types = <not found>
      dtypes = <local> None
      control_inputs = <local> []
      input_types = <local> [tf.float32, tf.float32, tf.float32, tf.float32, tf.float32, tf.int32, tf.int32], len = 7
      original_op = <not found>
      self._default_original_op = <local> None
      op_def = <local> name: "NativeLstm2"
                       input_arg {
                         name: "x"
                         type: DT_FLOAT
                       }
                       input_arg {
                         name: "w"
                         type: DT_FLOAT
                       }
                       input_arg {
                         name: "y0"
                         type: DT_FLOAT
                       }
                       input_arg {
                         name: "c0"
                         type: DT_FLOAT
                       }
                       input_arg {
                         name: "i"
                         type: DT_FLOAT
                       }
                       input_arg {
                         name: "start"
                         type: DT_INT32
                       }
                       input_arg {
                        ...
  File "/home/conan/anaconda3/envs/tf14/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 2027, in __init__
    line: self._c_op = _create_c_op(self._graph, node_def, grouped_inputs,
                                    control_input_ops)
    locals:
      self = <local> !AttributeError: 'Operation' object has no attribute '_c_op'
      self._c_op = <local> !AttributeError: 'Operation' object has no attribute '_c_op'
      _create_c_op = <global> <function _create_c_op at 0x7fabe910e488>
      self._graph = <local> <tensorflow.python.framework.ops.Graph object at 0x7faaff824048>
      node_def = <local> name: "lstm0_fw/rec/NativeLstm2"
                         op: "NativeLstm2"

      grouped_inputs = <local> [<tf.Tensor 'lstm0_fw/rec/add:0' shape=(?, ?, 2048) dtype=float32>, <tf.Tensor 'lstm0_fw/rec/W_re/read:0' shape=(512, 2048) dtype=float32>, <tf.Tensor 'lstm0_fw/rec/initial_h:0' shape=(?, 512) dtype=float32>, <tf.Tensor 'lstm0_fw/rec/initial_c:0' shape=(?, 512) dtype=float32>, <tf.Tensor 'extern_..., len = 7
      control_input_ops = <local> []
  File "/home/conan/anaconda3/envs/tf14/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1867, in _create_c_op
    line: raise ValueError(str(e))
    locals:
      ValueError = <builtin> <class 'ValueError'>
      str = <builtin> <class 'str'>
      e = <not found>
ValueError: undefined context (mismatch in C++ ABI?) for 'lstm0_fw/rec/NativeLstm2' (op: 'NativeLstm2') with input shapes: [?,?,2048], [512,2048], [?,512], [?,512], [?,?], [], [].
@yanghongjiazheng
Author

My TF version is 1.14 and my GCC version is 7.3. However, I didn't encounter this problem with TF 1.13.1.

@albertz
Member

albertz commented Nov 10, 2020

Related (maybe duplicate): #281 and #262

Can you check and report tf.__compiler_version__?
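A minimal sketch of that check, assuming `gcc` on the `PATH` is the compiler that would be used for the native ops (this thread does not confirm that). It prints `tf.__version__` and `tf.__compiler_version__` next to the output of `gcc -dumpversion`. The helpers `major_version`, `local_gcc_version` and `same_major` are hypothetical names for this illustration, and comparing only the major version is a rough signal, not a real ABI test.

```python
import re
import shutil
import subprocess


def major_version(version):
    """Return the leading integer of a version string such as '4.8.5', or None."""
    match = re.match(r"\s*(\d+)", version or "")
    return int(match.group(1)) if match else None


def local_gcc_version(gcc="gcc"):
    """Return the output of `gcc -dumpversion`, or None if gcc is not on PATH."""
    if shutil.which(gcc) is None:
        return None
    proc = subprocess.run(
        [gcc, "-dumpversion"], stdout=subprocess.PIPE, universal_newlines=True)
    return proc.stdout.strip() or None


def same_major(tf_compiler_version, gcc_version):
    """Rough heuristic (an assumption): both versions share the same major number."""
    a = major_version(tf_compiler_version)
    b = major_version(gcc_version)
    return a is not None and a == b


if __name__ == "__main__":
    try:
        import tensorflow as tf
        tf_compiler = tf.__compiler_version__
        print("TensorFlow:", tf.__version__, "built with compiler", tf_compiler)
    except ImportError:
        tf_compiler = None
        print("TensorFlow is not importable in this environment")
    gcc = local_gcc_version()
    print("Local gcc:", gcc)
    if tf_compiler and gcc:
        print("Same major version:", same_major(tf_compiler, gcc))
```

With the versions reported in this thread (TF built with 4.8.5, local GCC 7.3), `same_major` would return `False`.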

In /tmp/$USER/returnn_tf_cache, you should find some compile logs. Maybe clean up first, then run again, and then copy that log to a Gist, and report here. Specifically, I wonder whether it really picked that GCC version, or maybe some other version.
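To locate those logs, something like the following sketch could help. It only assumes the directory named above and makes no assumption about how the log files are named, so it lists every file under the cache directory, newest first. The function name `list_files_newest_first` is made up for this example.

```python
import os


def list_files_newest_first(root):
    """Return all file paths below root, sorted by modification time (newest first)."""
    entries = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in filenames:
            path = os.path.join(dirpath, filename)
            entries.append((os.path.getmtime(path), path))
    return [path for _mtime, path in sorted(entries, reverse=True)]


if __name__ == "__main__":
    # Path taken from the comment above; $USER is read from the environment.
    cache_dir = os.path.join("/tmp", os.environ.get("USER", ""), "returnn_tf_cache")
    for path in list_files_newest_first(cache_dir):
        print(path)
```

If the directory does not exist, `os.walk` yields nothing and the script prints nothing.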

You could also try to run python3 returnn/tests/test_TFNativeOp.py test_NativeLstm2_run, put the output to some Gist, and report here.

@yanghongjiazheng
Author

For both TF 1.13.1 and TF 1.14.0, tf.__compiler_version__ == 4.8.5. When I changed my GCC to 4.8.5, the problem was fixed.

@albertz
Member

albertz commented Nov 11, 2020

For both TF 1.13.1 and TF 1.14.0, tf.__compiler_version__ == 4.8.5. When I changed my GCC to 4.8.5, the problem was fixed.

You mean, you needed to install GCC 4.8.5 and then it worked? Because RETURNN should automatically select the right compiler (if available).
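For illustration only, here is a sketch of what picking a matching compiler could look like. This is not RETURNN's actual selection logic, which is not shown in this thread. The candidate executable names (`gcc-4.8`, `gcc-4`, `gcc`) and the helpers `candidate_names`, `dump_version` and `pick_compiler` are assumptions for this example.

```python
import shutil
import subprocess


def candidate_names(wanted):
    """Hypothetical executable names to try for a wanted version like '4.8.5'."""
    parts = wanted.split(".")
    names = []
    if len(parts) >= 2:
        names.append("gcc-%s.%s" % (parts[0], parts[1]))
    names.append("gcc-%s" % parts[0])
    names.append("gcc")
    return names


def dump_version(exe):
    """Return `<exe> -dumpversion` output, or None if the executable is not on PATH."""
    if shutil.which(exe) is None:
        return None
    proc = subprocess.run(
        [exe, "-dumpversion"], stdout=subprocess.PIPE, universal_newlines=True)
    return proc.stdout.strip() or None


def pick_compiler(wanted, probe=dump_version):
    """Return the first candidate whose major version matches `wanted`, else None."""
    wanted_major = wanted.split(".")[0]
    for name in candidate_names(wanted):
        found = probe(name)
        if found and found.split(".")[0] == wanted_major:
            return name
    return None


if __name__ == "__main__":
    # 4.8.5 is the tf.__compiler_version__ reported in this thread.
    print(pick_compiler("4.8.5"))
```

The `probe` parameter exists so the lookup can be exercised without any compiler installed, for example by passing the `get` method of a dict that maps executable names to version strings.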

So I guess there was nothing wrong on the RETURNN side then.

albertz closed this as completed Nov 11, 2020
@albertz
Member

albertz commented Nov 11, 2020

Btw, you might want to update to some newer TF (2.3 or so), which also is compiled using a newer GCC version.
