
Migrate to Tensorflow Keras to support TF2.1 #165

Merged: 21 commits into minimaxir:TF-2.0 on Feb 2, 2020

Conversation

ZerxXxes

This migrates the library to native TensorFlow Keras (tf.keras) and supports TensorFlow 2.1.
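
For reference, the core of the change is swapping standalone Keras imports for their tf.keras equivalents. A representative before/after (the layer list here is illustrative, not the full set textgenrnn imports):

```python
# Before (standalone Keras on TF 1.x):
# from keras.layers import LSTM, Embedding, Dense
# from keras.models import Model

# After (native tf.keras on TensorFlow 2.1):
from tensorflow.keras.layers import LSTM, Embedding, Dense
from tensorflow.keras.models import Model
```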

yslai and others added 15 commits September 9, 2018 10:56
…del, missing the ellipsis in the list of punctuations
Fixes generate_to_file using unicode characters by adding encoding="utf-8" in the open command.
Fixes generate_to_file using unicode characters
…uation, added double quotes to targeted punctuation
We really want equality here.

$ python
```python
>>> hi = 'h'
>>> hi += 'i'
>>> hi
'hi'
>>> hi == 'hi'
True
>>> hi is 'hi'
False
>>> 0 == 0.0
True
>>> 0 is 0.0
False
```
allow_growth option for the GeForce RTX "could not create cuDNN" issue (see the sketch after this commit list)
Identity is not the same thing as equality in Python
enabled removal of spaces around punctuation
unicode characters using python 3
Spaces for word-level training should be applied regardless of new_mo…
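
The allow_growth commit above targets the common cuDNN initialization failure on RTX cards. A minimal TF2 sketch of the equivalent setting, assuming the standard tf.config API (not necessarily how the commit itself implements it):

```python
import tensorflow as tf

# TF2 equivalent of the TF1 allow_growth session option: allocate GPU
# memory on demand instead of all at once, which avoids the
# "could not create cuDNN handle" failures seen on GeForce RTX cards.
for gpu in tf.config.experimental.list_physical_devices('GPU'):
    tf.config.experimental.set_memory_growth(gpu, True)
```
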
@ZerxXxes
Author

This does not work for GPU and multi-GPU yet. Hold on, I will try to migrate to TensorFlow's distribute.MirroredStrategy().
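
A minimal sketch of the MirroredStrategy pattern I am aiming for (layer sizes and the dummy data are placeholders, not textgenrnn's actual network or config):

```python
import numpy as np
import tensorflow as tf

vocab_size, maxlen = 100, 40          # placeholder sizes, not textgenrnn's config

strategy = tf.distribute.MirroredStrategy()
print('Replicas in sync:', strategy.num_replicas_in_sync)

# Model construction and compile() must happen inside the strategy scope;
# fit() itself can be called outside it.
with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Embedding(vocab_size, 100, input_length=maxlen),
        tf.keras.layers.LSTM(128, return_sequences=True),
        tf.keras.layers.LSTM(128),
        tf.keras.layers.Dense(vocab_size, activation='softmax'),
    ])
    model.compile(loss='sparse_categorical_crossentropy', optimizer='adam')

# Dummy character-sequence data, just enough to exercise the distributed fit loop.
x = np.random.randint(0, vocab_size, size=(512, maxlen))
y = np.random.randint(0, vocab_size, size=(512,))
model.fit(x, y, batch_size=128, epochs=1)
```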

@ZerxXxes
Author

Alright, so now it works for CPU and single-GPU, but when using multi-GPU there is an error when generating samples after training.

```
>>> textgen.train_from_largetext_file(fulltext_path, new_model=True, num_epochs=1, train_size=0.8, multi_gpu=True)
Training new model w/ 2-layer, 128-cell LSTMs                                               
Training on 66,910 character sequences.
INFO:tensorflow:Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:GPU:0', '/job:localhost/replica:0/task:0/device:GPU:1', '/job:localhost/replica:0/task:0/device:GPU:2')
Training on 3 GPUs.                                                                                                                                                                      
WARNING:tensorflow:sample_weight modes were coerced from
  ...                                                                                       
    to                                 
  ['...']                                                                                                                                                                                
WARNING:tensorflow:sample_weight modes were coerced from               
  ...                                                                                       
    to     
  ['...']                                                                                   
Train for 174 steps, validate for 43 steps                                                  
INFO:tensorflow:batch_all_reduce: 9 all-reduces with algorithm = nccl, num_packs = 1, agg_small_grads_max_bytes = 0 and agg_small_grads_max_group = 10
WARNING:tensorflow:Efficient allreduce is not supported for 1 IndexedSlices
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:GPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:GPU:0', '/job:localhost/replica:0/task:0/device:GPU:1', '/job:localhost/replica:0/task:0/device:GPU:2').
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).  
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:batch_all_reduce: 9 all-reduces with algorithm = nccl, num_packs = 1, agg_small_grads_max_bytes = 0 and agg_small_grads_max_group = 10
WARNING:tensorflow:Efficient allreduce is not supported for 1 IndexedSlices
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:GPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:GPU:0', '/job:localhost/replica:0/task:0/device:GPU:1', '/job:localhost/replica:0/task:0/device:GPU:2').
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
2020-01-15 11:56:03.082815: E tensorflow/core/grappler/optimizers/meta_optimizer.cc:561] function_optimizer failed: Invalid argument: Node 'replica_1/model_2/rnn_1/StatefulPartitionedCall_replica_1/StatefulPartitionedCall_1_27': Connecting to invalid output 33 of source node replica_1/model_2/rnn_1/StatefulPartitionedCall which has 33 outputs.
2020-01-15 11:56:03.669908: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2020-01-15 11:56:04.117159: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
173/174 [============================>.] - ETA: 0s - loss: 3.4812
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
####################
Temperature: 0.2                                                                                                                                                                         
####################                                                                        
2020-01-15 11:56:16.331083: W tensorflow/core/framework/op_kernel.cc:1655] OP_REQUIRES failed at cudnn_rnn_ops.cc:1517 : Unknown: CUDNN_STATUS_BAD_PARAM
in tensorflow/stream_executor/cuda/cuda_dnn.cc(1393): 'cudnnSetTensorNdDescriptor( tensor_desc.get(), data_type, sizeof(dims) / sizeof(dims[0]), dims, strides)'
2020-01-15 11:56:16.331135: W tensorflow/core/common_runtime/base_collective_executor.cc:217] BaseCollectiveExecutor::StartAbort Unknown: CUDNN_STATUS_BAD_PARAM
in tensorflow/stream_executor/cuda/cuda_dnn.cc(1393): 'cudnnSetTensorNdDescriptor( tensor_desc.get(), data_type, sizeof(dims) / sizeof(dims[0]), dims, strides)'
         [[{{node CudnnRNN}}]]                                                              
2020-01-15 11:56:16.331182: W tensorflow/core/common_runtime/base_collective_executor.cc:217] BaseCollectiveExecutor::StartAbort Unknown: {{function_node __inference_cudnn_lstm_with_fallback_23945_specialized_for_replica_1_model_1_rnn_1_StatefulPartitionedCall_at___inference_distributed_function_25546}} {{function_node __inference_cudnn_lstm_with_fallback_23945_specialized_for_replica_1_model_1_rnn_1_StatefulPartitionedCall_at___inference_distributed_function_25546}} CUDNN_STATUS_BAD_PARAM
in tensorflow/stream_executor/cuda/cuda_dnn.cc(1393): 'cudnnSetTensorNdDescriptor( tensor_desc.get(), data_type, sizeof(dims) / sizeof(dims[0]), dims, strides)'
         [[{{node CudnnRNN}}]]
         [[replica_1/model_1/rnn_1/StatefulPartitionedCall]]
         [[replica_1/model_1/output/Softmax/_130]]
2020-01-15 11:56:16.331229: W tensorflow/core/common_runtime/base_collective_executor.cc:217] BaseCollectiveExecutor::StartAbort Unknown: {{function_node __inference_cudnn_lstm_with_fallback_23945_specialized_for_replica_1_model_1_rnn_1_StatefulPartitionedCall_at___inference_distributed_function_25546}} {{function_node __inference_cudnn_lstm_with_fallback_23945_specialized_for_replica_1_model_1_rnn_1_StatefulPartitionedCall_at___inference_distributed_function_25546}} CUDNN_STATUS_BAD_PARAM
in tensorflow/stream_executor/cuda/cuda_dnn.cc(1393): 'cudnnSetTensorNdDescriptor( tensor_desc.get(), data_type, sizeof(dims) / sizeof(dims[0]), dims, strides)'
         [[{{node CudnnRNN}}]]
         [[replica_1/model_1/rnn_1/StatefulPartitionedCall]]
         [[replica_1/model_1/rnn_2/StatefulPartitionedCall/_111]]
2020-01-15 11:56:16.331268: W tensorflow/core/common_runtime/base_collective_executor.cc:217] BaseCollectiveExecutor::StartAbort Unknown: {{function_node __inference_cudnn_lstm_with_fallback_23945_specialized_for_replica_1_model_1_rnn_1_StatefulPartitionedCall_at___inference_distributed_function_25546}} {{function_node __inference_cudnn_lstm_with_fallback_23945_specialized_for_replica_1_model_1_rnn_1_StatefulPartitionedCall_at___inference_distributed_function_25546}} CUDNN_STATUS_BAD_PARAM
in tensorflow/stream_executor/cuda/cuda_dnn.cc(1393): 'cudnnSetTensorNdDescriptor( tensor_desc.get(), data_type, sizeof(dims) / sizeof(dims[0]), dims, strides)'
         [[{{node CudnnRNN}}]]
         [[replica_1/model_1/rnn_1/StatefulPartitionedCall]]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/markus.klock/textgenrnn/textgenrnn/textgenrnn.py", line 356, in train_from_largetext_file
    texts, single_text=True, **kwargs)
  File "/home/markus.klock/textgenrnn/textgenrnn/textgenrnn.py", line 317, in train_new_model
    **kwargs)
  File "/home/markus.klock/textgenrnn/textgenrnn/textgenrnn.py", line 243, in train_on_texts
    validation_steps=val_steps
  File "/home/markus.klock/tf2/lib/python3.6/site-packages/tensorflow_core/python/keras/engine/training.py", line 819, in fit
    use_multiprocessing=use_multiprocessing)
  File "/home/markus.klock/tf2/lib/python3.6/site-packages/tensorflow_core/python/keras/engine/training_v2.py", line 397, in fit
    prefix='val_')
  File "/usr/lib/python3.6/contextlib.py", line 88, in __exit__
    next(self.gen)
  File "/home/markus.klock/tf2/lib/python3.6/site-packages/tensorflow_core/python/keras/engine/training_v2.py", line 771, in on_epoch
    self.callbacks.on_epoch_end(epoch, epoch_logs)
  File "/home/markus.klock/tf2/lib/python3.6/site-packages/tensorflow_core/python/keras/callbacks.py", line 302, in on_epoch_end
    callback.on_epoch_end(epoch, logs)
  File "/home/markus.klock/textgenrnn/textgenrnn/utils.py", line 283, in on_epoch_end
    max_gen_length=self.max_gen_length)
  File "/home/markus.klock/textgenrnn/textgenrnn/textgenrnn.py", line 100, in generate_samples
    self.generate(n, temperature=temperature, progress=False, **kwargs)
  File "/home/markus.klock/textgenrnn/textgenrnn/textgenrnn.py", line 89, in generate
    prefix)
  File "/home/markus.klock/textgenrnn/textgenrnn/utils.py", line 97, in textgenrnn_generate
    model.predict(encoded_text, batch_size=1)[0],
  File "/home/markus.klock/tf2/lib/python3.6/site-packages/tensorflow_core/python/keras/engine/training.py", line 1013, in predict
    use_multiprocessing=use_multiprocessing)
  File "/home/markus.klock/tf2/lib/python3.6/site-packages/tensorflow_core/python/keras/engine/training_v2.py", line 498, in predict
    workers=workers, use_multiprocessing=use_multiprocessing, **kwargs)
  File "/home/markus.klock/tf2/lib/python3.6/site-packages/tensorflow_core/python/keras/engine/training_v2.py", line 475, in _model_iteration
    total_epochs=1)
  File "/home/markus.klock/tf2/lib/python3.6/site-packages/tensorflow_core/python/keras/engine/training_v2.py", line 128, in run_one_epoch
    batch_outs = execution_function(iterator)
  File "/home/markus.klock/tf2/lib/python3.6/site-packages/tensorflow_core/python/keras/engine/training_v2_utils.py", line 98, in execution_function
    distributed_function(input_fn))
  File "/home/markus.klock/tf2/lib/python3.6/site-packages/tensorflow_core/python/eager/def_function.py", line 568, in __call__
    result = self._call(*args, **kwds)
  File "/home/markus.klock/tf2/lib/python3.6/site-packages/tensorflow_core/python/eager/def_function.py", line 638, in _call
    return self._concrete_stateful_fn._filtered_call(canon_args, canon_kwds)  # pylint: disable=protected-access
  File "/home/markus.klock/tf2/lib/python3.6/site-packages/tensorflow_core/python/eager/function.py", line 1611, in _filtered_call
    self.captured_inputs)
  File "/home/markus.klock/tf2/lib/python3.6/site-packages/tensorflow_core/python/eager/function.py", line 1692, in _call_flat
    ctx, args, cancellation_manager=cancellation_manager))
  File "/home/markus.klock/tf2/lib/python3.6/site-packages/tensorflow_core/python/eager/function.py", line 545, in call
    ctx=ctx)
  File "/home/markus.klock/tf2/lib/python3.6/site-packages/tensorflow_core/python/eager/execute.py", line 67, in quick_execute
    six.raise_from(core._status_to_exception(e.code, message), None)
  File "<string>", line 3, in raise_from
tensorflow.python.framework.errors_impl.UnknownError:  [_Derived_]  CUDNN_STATUS_BAD_PARAM
in tensorflow/stream_executor/cuda/cuda_dnn.cc(1393): 'cudnnSetTensorNdDescriptor( tensor_desc.get(), data_type, sizeof(dims) / sizeof(dims[0]), dims, strides)'
         [[{{node CudnnRNN}}]]
         [[replica_1/model_1/rnn_1/StatefulPartitionedCall]]
         [[replica_1/model_1/output/Softmax/_130]] [Op:__inference_distributed_function_25546]

Function call stack:
distributed_function -> distributed_function -> distributed_function
```

@minimaxir Any clues as to what causes this?
The root cause is probably that I do not fully understand how distribute.MirroredStrategy() actually works.

@minimaxir
Owner

This looks great! Thanks! :) [I apologize for not being productive with my approach.]

I am OK with deprecating multi-GPU if necessary, because I don't think it actually worked in the first place; I added it initially just for experimentation.

Have you verified that models created with earlier versions import correctly?

@ZerxXxes
Author

Thank you for writing this awesome application :)

No, I have not verified that; I have basically just tried textgen.generate() and textgen.train_from_largetext_file() on both GPU and CPU-only setups.
This code probably needs a lot more testing, as there is a high risk of things breaking.
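
A rough outline of that smoke test ('data.txt' is a placeholder path to a large training text):

```python
from textgenrnn import textgenrnn

# Run on both a GPU and a CPU-only install.
textgen = textgenrnn()                 # bundled pretrained weights
textgen.generate(5)                    # sampling from the default model works
textgen.train_from_largetext_file('data.txt', new_model=True, num_epochs=1)
textgen.generate_samples()             # sampling from the freshly trained model
```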

@ZerxXxes
Author

I did a backward-compatibility test now:

  1. Used TF1.15 and loaded textgenrnn
  2. Trained a new model, generated samples, printed model.summary and saved it to disk.
  3. Switched to TF2.1 and checked out this PR
  4. Loaded the model from disk, generated samples and printed model.summary

Worked without errors :)
Log: http://p.ip.fi/9w8K
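
For reference, roughly what that round trip looked like in code (file and model names are placeholders; this assumes the usual weights/vocab/config files textgenrnn writes when training a new named model):

```python
from textgenrnn import textgenrnn

# Step 2, under TF 1.15: train a new model and let it write its files to disk.
textgen = textgenrnn(name='compat_test')        # hypothetical model name
textgen.train_from_largetext_file('data.txt', new_model=True, num_epochs=1)
textgen.model.summary()

# Steps 3-4, under TF 2.1 with this PR checked out: reload and sample.
textgen = textgenrnn(weights_path='compat_test_weights.hdf5',
                     vocab_path='compat_test_vocab.json',
                     config_path='compat_test_config.json')
textgen.generate_samples()
textgen.model.summary()
```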

@minimaxir
Owner

(sorry for lack of response, will look at this PR this week!)

@ZerxXxes
Author

No problem :) I just merged in master so the latest commits are also migrated to TF 2.1, and switched to the Adam optimizer, as it should generally perform better than RMSprop (see the sketch below).
Multi-GPU still does not work: training works, but it crashes during generation. I'm trying to figure it out, but it might be beyond my knowledge of TensorFlow.
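
The optimizer swap itself is just the compile() call; a toy sketch (the model and learning rate here are placeholders, not textgenrnn's actual network or setting):

```python
from tensorflow.keras.layers import Dense, Input
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import Adam

# Toy stand-in model; the actual change is only the optimizer handed to compile().
inputs = Input(shape=(40,))
outputs = Dense(100, activation='softmax')(inputs)
model = Model(inputs, outputs)

# Previously compiled with RMSprop; now using Adam.
model.compile(loss='categorical_crossentropy', optimizer=Adam(learning_rate=4e-3))
```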

@minimaxir
Owner

Tested and LGTM, so I will merge into the branch (and make a few changes before merging to master).

Two issues I see:

  1. External Keras is still necessary for some imports, so it will need to be added as a requirement; I am not sure if there is a tf.keras analogue for those.
  2. There may be a performance regression: one of the scripts in the demo notebook took twice as long to run. It may be a TF 2.1 issue, though.

@minimaxir minimaxir merged commit 2067c35 into minimaxir:TF-2.0 Feb 2, 2020