
Migrate to Tensorflow Keras to support TF2.1 #165

Merged: 21 commits into minimaxir:TF-2.0 on Feb 2, 2020

Conversation

ZerxXxes

This migrates the library to native TensorFlow Keras (tf.keras) and supports TensorFlow 2.1.
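
For reference, the core of the change is swapping standalone Keras imports for their tf.keras equivalents. A representative before/after (the layer list here is illustrative, not the full set textgenrnn imports):

```python
# Before (standalone Keras on TF 1.x):
# from keras.layers import LSTM, Embedding, Dense
# from keras.models import Model

# After (native tf.keras on TensorFlow 2.1):
from tensorflow.keras.layers import LSTM, Embedding, Dense
from tensorflow.keras.models import Model
```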

yslai and others added 15 commits September 9, 2018 10:56
…del, missing the ellipsis in the list of punctuations
Fixes generate_to_file using unicode characters by adding encoding="utf-8" in the open command.
Fixes generate_to_file using unicode characters
…uation, added double quotes to targeted punctuation
We really want equality here.

$ python
```python
>>> hi = 'h'
>>> hi += 'i'
>>> hi
'hi'
>>> hi == 'hi'
True
>>> hi is 'hi'
False
>>> 0 == 0.0
True
>>> 0 is 0.0
False
```
allow_growth option for the GeForce RTX "could not create cuDNN" issue (see the sketch after this commit list)
Identity is not the same thing as equality in Python
enabled removal of spaces around punctuation
unicode characters using python 3
Spaces for word-level training should be applied regardless of new_mo…
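
The allow_growth commit above targets the common cuDNN initialization failure on RTX cards. A minimal TF2 sketch of the equivalent setting, assuming the standard tf.config API (not necessarily how the commit itself implements it):

```python
import tensorflow as tf

# TF2 equivalent of the TF1 allow_growth session option: allocate GPU
# memory on demand instead of all at once, which avoids the
# "could not create cuDNN handle" failures seen on GeForce RTX cards.
for gpu in tf.config.experimental.list_physical_devices('GPU'):
    tf.config.experimental.set_memory_growth(gpu, True)
```
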
@ZerxXxes
Author

This does not work for GPU and multi-GPU yet. Hold on, I will try to migrate to TensorFlow's distribute.MirroredStrategy().
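
A minimal sketch of the MirroredStrategy pattern I am aiming for (layer sizes and the dummy data are placeholders, not textgenrnn's actual network or config):

```python
import numpy as np
import tensorflow as tf

vocab_size, maxlen = 100, 40          # placeholder sizes, not textgenrnn's config

strategy = tf.distribute.MirroredStrategy()
print('Replicas in sync:', strategy.num_replicas_in_sync)

# Model construction and compile() must happen inside the strategy scope;
# fit() itself can be called outside it.
with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Embedding(vocab_size, 100, input_length=maxlen),
        tf.keras.layers.LSTM(128, return_sequences=True),
        tf.keras.layers.LSTM(128),
        tf.keras.layers.Dense(vocab_size, activation='softmax'),
    ])
    model.compile(loss='sparse_categorical_crossentropy', optimizer='adam')

# Dummy character-sequence data, just enough to exercise the distributed fit loop.
x = np.random.randint(0, vocab_size, size=(512, maxlen))
y = np.random.randint(0, vocab_size, size=(512,))
model.fit(x, y, batch_size=128, epochs=1)
```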

@ZerxXxes
Author

Alright, so now it works for CPU and single-GPU, but when using multi-GPU there is an error when generating samples after training.

```
>>> textgen.train_from_largetext_file(fulltext_path, new_model=True, num_epochs=1, train_size=0.8, multi_gpu=True)
Training new model w/ 2-layer, 128-cell LSTMs                                               
Training on 66,910 character sequences.
INFO:tensorflow:Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:GPU:0', '/job:localhost/replica:0/task:0/device:GPU:1', '/job:localhost/replica:0/task:0/device:GPU:2')
Training on 3 GPUs.                                                                                                                                                                      
WARNING:tensorflow:sample_weight modes were coerced from
  ...                                                                                       
    to                                 
  ['...']                                                                                                                                                                                
WARNING:tensorflow:sample_weight modes were coerced from               
  ...                                                                                       
    to     
  ['...']                                                                                   
Train for 174 steps, validate for 43 steps                                                  
INFO:tensorflow:batch_all_reduce: 9 all-reduces with algorithm = nccl, num_packs = 1, agg_small_grads_max_bytes = 0 and agg_small_grads_max_group = 10
WARNING:tensorflow:Efficient allreduce is not supported for 1 IndexedSlices
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:GPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:GPU:0', '/job:localhost/replica:0/task:0/device:GPU:1', '/job:localhost/replica:0/task:0/device:GPU:2').
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).  
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:batch_all_reduce: 9 all-reduces with algorithm = nccl, num_packs = 1, agg_small_grads_max_bytes = 0 and agg_small_grads_max_group = 10
WARNING:tensorflow:Efficient allreduce is not supported for 1 IndexedSlices
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:GPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:GPU:0', '/job:localhost/replica:0/task:0/device:GPU:1', '/job:localhost/replica:0/task:0/device:GPU:2').
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
2020-01-15 11:56:03.082815: E tensorflow/core/grappler/optimizers/meta_optimizer.cc:561] function_optimizer failed: Invalid argument: Node 'replica_1/model_2/rnn_1/StatefulPartitionedCall_replica_1/StatefulPartitionedCall_1_27': Connecting to invalid output 33 of source node replica_1/model_2/rnn_1/StatefulPartitionedCall which has 33 outputs.
2020-01-15 11:56:03.669908: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2020-01-15 11:56:04.117159: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
173/174 [============================>.] - ETA: 0s - loss: 3.4812
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
####################
Temperature: 0.2                                                                                                                                                                         
####################                                                                        
2020-01-15 11:56:16.331083: W tensorflow/core/framework/op_kernel.cc:1655] OP_REQUIRES failed at cudnn_rnn_ops.cc:1517 : Unknown: CUDNN_STATUS_BAD_PARAM
in tensorflow/stream_executor/cuda/cuda_dnn.cc(1393): 'cudnnSetTensorNdDescriptor( tensor_desc.get(), data_type, sizeof(dims) / sizeof(dims[0]), dims, strides)'
2020-01-15 11:56:16.331135: W tensorflow/core/common_runtime/base_collective_executor.cc:217] BaseCollectiveExecutor::StartAbort Unknown: CUDNN_STATUS_BAD_PARAM
in tensorflow/stream_executor/cuda/cuda_dnn.cc(1393): 'cudnnSetTensorNdDescriptor( tensor_desc.get(), data_type, sizeof(dims) / sizeof(dims[0]), dims, strides)'
         [[{{node CudnnRNN}}]]                                                              
2020-01-15 11:56:16.331182: W tensorflow/core/common_runtime/base_collective_executor.cc:217] BaseCollectiveExecutor::StartAbort Unknown: {{function_node __inference_cudnn_lstm_with_fallback_23945_specialized_for_replica_1_model_1_rnn_1_StatefulPartitionedCall_at___inference_distributed_function_25546}} {{function_node __inference_cudnn_lstm_with_fallback_23945_specialized_for_replica_1_model_1_rnn_1_StatefulPartitionedCall_at___inference_distributed_function_25546}} CUDNN_STATUS_BAD_PARAM
in tensorflow/stream_executor/cuda/cuda_dnn.cc(1393): 'cudnnSetTensorNdDescriptor( tensor_desc.get(), data_type, sizeof(dims) / sizeof(dims[0]), dims, strides)'
         [[{{node CudnnRNN}}]]
         [[replica_1/model_1/rnn_1/StatefulPartitionedCall]]
         [[replica_1/model_1/output/Softmax/_130]]
2020-01-15 11:56:16.331229: W tensorflow/core/common_runtime/base_collective_executor.cc:217] BaseCollectiveExecutor::StartAbort Unknown: {{function_node __inference_cudnn_lstm_with_fallback_23945_specialized_for_replica_1_model_1_rnn_1_StatefulPartitionedCall_at___inference_distributed_function_25546}} {{function_node __inference_cudnn_lstm_with_fallback_23945_specialized_for_replica_1_model_1_rnn_1_StatefulPartitionedCall_at___inference_distributed_function_25546}} CUDNN_STATUS_BAD_PARAM
in tensorflow/stream_executor/cuda/cuda_dnn.cc(1393): 'cudnnSetTensorNdDescriptor( tensor_desc.get(), data_type, sizeof(dims) / sizeof(dims[0]), dims, strides)'
         [[{{node CudnnRNN}}]]
         [[replica_1/model_1/rnn_1/StatefulPartitionedCall]]
         [[replica_1/model_1/rnn_2/StatefulPartitionedCall/_111]]
2020-01-15 11:56:16.331268: W tensorflow/core/common_runtime/base_collective_executor.cc:217] BaseCollectiveExecutor::StartAbort Unknown: {{function_node __inference_cudnn_lstm_with_fallback_23945_specialized_for_replica_1_model_1_rnn_1_StatefulPartitionedCall_at___inference_distributed_function_25546}} {{function_node __inference_cudnn_lstm_with_fallback_23945_specialized_for_replica_1_model_1_rnn_1_StatefulPartitionedCall_at___inference_distributed_function_25546}} CUDNN_STATUS_BAD_PARAM
in tensorflow/stream_executor/cuda/cuda_dnn.cc(1393): 'cudnnSetTensorNdDescriptor( tensor_desc.get(), data_type, sizeof(dims) / sizeof(dims[0]), dims, strides)'
         [[{{node CudnnRNN}}]]
         [[replica_1/model_1/rnn_1/StatefulPartitionedCall]]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/markus.klock/textgenrnn/textgenrnn/textgenrnn.py", line 356, in train_from_largetext_file
    texts, single_text=True, **kwargs)
  File "/home/markus.klock/textgenrnn/textgenrnn/textgenrnn.py", line 317, in train_new_model
    **kwargs)
  File "/home/markus.klock/textgenrnn/textgenrnn/textgenrnn.py", line 243, in train_on_texts
    validation_steps=val_steps
  File "/home/markus.klock/tf2/lib/python3.6/site-packages/tensorflow_core/python/keras/engine/training.py", line 819, in fit
    use_multiprocessing=use_multiprocessing)
  File "/home/markus.klock/tf2/lib/python3.6/site-packages/tensorflow_core/python/keras/engine/training_v2.py", line 397, in fit
    prefix='val_')
  File "/usr/lib/python3.6/contextlib.py", line 88, in __exit__
    next(self.gen)
  File "/home/markus.klock/tf2/lib/python3.6/site-packages/tensorflow_core/python/keras/engine/training_v2.py", line 771, in on_epoch
    self.callbacks.on_epoch_end(epoch, epoch_logs)
  File "/home/markus.klock/tf2/lib/python3.6/site-packages/tensorflow_core/python/keras/callbacks.py", line 302, in on_epoch_end
    callback.on_epoch_end(epoch, logs)
  File "/home/markus.klock/textgenrnn/textgenrnn/utils.py", line 283, in on_epoch_end
    max_gen_length=self.max_gen_length)
  File "/home/markus.klock/textgenrnn/textgenrnn/textgenrnn.py", line 100, in generate_samples
    self.generate(n, temperature=temperature, progress=False, **kwargs)
  File "/home/markus.klock/textgenrnn/textgenrnn/textgenrnn.py", line 89, in generate
    prefix)
  File "/home/markus.klock/textgenrnn/textgenrnn/utils.py", line 97, in textgenrnn_generate
    model.predict(encoded_text, batch_size=1)[0],
  File "/home/markus.klock/tf2/lib/python3.6/site-packages/tensorflow_core/python/keras/engine/training.py", line 1013, in predict
    use_multiprocessing=use_multiprocessing)
  File "/home/markus.klock/tf2/lib/python3.6/site-packages/tensorflow_core/python/keras/engine/training_v2.py", line 498, in predict
    workers=workers, use_multiprocessing=use_multiprocessing, **kwargs)
  File "/home/markus.klock/tf2/lib/python3.6/site-packages/tensorflow_core/python/keras/engine/training_v2.py", line 475, in _model_iteration
    total_epochs=1)
  File "/home/markus.klock/tf2/lib/python3.6/site-packages/tensorflow_core/python/keras/engine/training_v2.py", line 128, in run_one_epoch
    batch_outs = execution_function(iterator)
  File "/home/markus.klock/tf2/lib/python3.6/site-packages/tensorflow_core/python/keras/engine/training_v2_utils.py", line 98, in execution_function
    distributed_function(input_fn))
  File "/home/markus.klock/tf2/lib/python3.6/site-packages/tensorflow_core/python/eager/def_function.py", line 568, in __call__
    result = self._call(*args, **kwds)
  File "/home/markus.klock/tf2/lib/python3.6/site-packages/tensorflow_core/python/eager/def_function.py", line 638, in _call
    return self._concrete_stateful_fn._filtered_call(canon_args, canon_kwds)  # pylint: disable=protected-access
  File "/home/markus.klock/tf2/lib/python3.6/site-packages/tensorflow_core/python/eager/function.py", line 1611, in _filtered_call
    self.captured_inputs)
  File "/home/markus.klock/tf2/lib/python3.6/site-packages/tensorflow_core/python/eager/function.py", line 1692, in _call_flat
    ctx, args, cancellation_manager=cancellation_manager))
  File "/home/markus.klock/tf2/lib/python3.6/site-packages/tensorflow_core/python/eager/function.py", line 545, in call
    ctx=ctx)
  File "/home/markus.klock/tf2/lib/python3.6/site-packages/tensorflow_core/python/eager/execute.py", line 67, in quick_execute
    six.raise_from(core._status_to_exception(e.code, message), None)
  File "<string>", line 3, in raise_from
tensorflow.python.framework.errors_impl.UnknownError:  [_Derived_]  CUDNN_STATUS_BAD_PARAM
in tensorflow/stream_executor/cuda/cuda_dnn.cc(1393): 'cudnnSetTensorNdDescriptor( tensor_desc.get(), data_type, sizeof(dims) / sizeof(dims[0]), dims, strides)'
         [[{{node CudnnRNN}}]]
         [[replica_1/model_1/rnn_1/StatefulPartitionedCall]]
         [[replica_1/model_1/output/Softmax/_130]] [Op:__inference_distributed_function_25546]

Function call stack:
distributed_function -> distributed_function -> distributed_function
```

@minimaxir Any clues as to what causes this?
The root cause is probably that I do not fully understand how distribute.MirroredStrategy() actually works.

@minimaxir
Owner

This looks great! Thanks! :) [I apologize for not being productive with my approach.]

I am OK with deprecating multi-GPU if necessary, because I don't think it actually worked in the first place; I added it initially just for experimentation.

Have you verified that models created with earlier versions import correctly?

@ZerxXxes
Author

Thank you for writing this awesome application :)

No, I have not verified that; I have basically just tried textgen.generate() and textgen.train_from_largetext_file() on both GPU and CPU-only setups.
This code probably needs a lot more testing, as there is a high risk of things breaking.
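
A rough outline of that smoke test ('data.txt' is a placeholder path to a large training text):

```python
from textgenrnn import textgenrnn

# Run on both a GPU and a CPU-only install.
textgen = textgenrnn()                 # bundled pretrained weights
textgen.generate(5)                    # sampling from the default model works
textgen.train_from_largetext_file('data.txt', new_model=True, num_epochs=1)
textgen.generate_samples()             # sampling from the freshly trained model
```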

@ZerxXxes
Author

I did a backward-compatibility test now:

  1. Used TF1.15 and loaded textgenrnn
  2. Trained a new model, generated samples, printed model.summary and saved it to disk.
  3. Switched to TF2.1 and checked out this PR
  4. Loaded the model from disk, generated samples and printed model.summary

Worked without errors :)
Log: http://p.ip.fi/9w8K
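
For reference, roughly what that round trip looked like in code (file and model names are placeholders; this assumes the usual weights/vocab/config files textgenrnn writes when training a new named model):

```python
from textgenrnn import textgenrnn

# Step 2, under TF 1.15: train a new model and let it write its files to disk.
textgen = textgenrnn(name='compat_test')        # hypothetical model name
textgen.train_from_largetext_file('data.txt', new_model=True, num_epochs=1)
textgen.model.summary()

# Steps 3-4, under TF 2.1 with this PR checked out: reload and sample.
textgen = textgenrnn(weights_path='compat_test_weights.hdf5',
                     vocab_path='compat_test_vocab.json',
                     config_path='compat_test_config.json')
textgen.generate_samples()
textgen.model.summary()
```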

@minimaxir
Owner

(sorry for lack of response, will look at this PR this week!)

@ZerxXxes
Author

No problem :) I just merged in master so the latest commits are also migrated to TF 2.1, and switched to the Adam optimizer, as it should generally perform better than RMSprop (see the sketch below).
Multi-GPU still does not work: training works, but it crashes during generation. I'm trying to figure it out, but it might be beyond my knowledge of TensorFlow.
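
The optimizer swap itself is just the compile() call; a toy sketch (the model and learning rate here are placeholders, not textgenrnn's actual network or setting):

```python
from tensorflow.keras.layers import Dense, Input
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import Adam

# Toy stand-in model; the actual change is only the optimizer handed to compile().
inputs = Input(shape=(40,))
outputs = Dense(100, activation='softmax')(inputs)
model = Model(inputs, outputs)

# Previously compiled with RMSprop; now using Adam.
model.compile(loss='categorical_crossentropy', optimizer=Adam(learning_rate=4e-3))
```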

@minimaxir
Owner

Tested and LGTM, so I will merge into the branch (and make a few changes before merging to master).

Two issues I see:

  1. External Keras is still necessary for some imports, so it will need to be added as a requirement; I am not sure if there is a tf.keras analogue for those.
  2. There may be a performance regression: one of the scripts in the demo notebook took twice as long to run. It may be a TF 2.1 issue, though.

@minimaxir minimaxir merged commit 2067c35 into minimaxir:TF-2.0 Feb 2, 2020