Error: Error reading from file '.' #480

Closed
wangxw1023 opened this issue Aug 14, 2019 · 27 comments

@wangxw1023
Hi, I had no problems training with Marian before (Chinese-English), but recently I switched to a larger training corpus (train.bpe.zh: 18G, train.bpe.en: 20G, 38G total), and training now fails irregularly with this error. Why does this happen, and what should I do to train on this corpus normally? Thank you very much.
During training, free -h:
              total        used        free      shared  buff/cache   available
Mem:           125G         53G        2.6G         48M         69G         71G
Swap:          3.8G        2.9G        925M

nvidia-smi:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 396.54 Driver Version: 396.54 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 108... Off | 00000000:02:00.0 Off | N/A |
| 59% 80C P2 268W / 250W | 7205MiB / 11178MiB | 78% Default |
+-------------------------------+----------------------+----------------------+
| 1 GeForce GTX 108... Off | 00000000:03:00.0 Off | N/A |
| 64% 83C P2 227W / 250W | 7205MiB / 11178MiB | 70% Default |
+-------------------------------+----------------------+----------------------+
| 2 GeForce GTX 108... Off | 00000000:82:00.0 Off | N/A |
| 60% 81C P2 217W / 250W | 7205MiB / 11178MiB | 97% Default |
+-------------------------------+----------------------+----------------------+
| 3 GeForce GTX 108... Off | 00000000:83:00.0 Off | N/A |
| 66% 83C P2 294W / 250W | 7205MiB / 11178MiB | 79% Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 10283 C /media/wangxiuwan/marian/build/marian 7195MiB |
| 1 10283 C /media/wangxiuwan/marian/build/marian 7195MiB |
| 2 10283 C /media/wangxiuwan/marian/build/marian 7195MiB |
| 3 10283 C /media/wangxiuwan/marian/build/marian 7195MiB |
+-----------------------------------------------------------------------------+

train.log:
[2019-08-14 10:55:34] [marian] Marian v1.7.6 02f4af4 2018-12-12 18:51:10 -0800
[2019-08-14 10:55:34] [marian] Running on dbcloud-Super-Server as process 31002 with command line:
[2019-08-14 10:55:34] [marian] /media/wangxiuwan/marian/build/marian --model /media/wangxiuwan/marian/examples/transformer/back_dataset/model_zhen/model.npz --type transformer --pretrained-model /media/wangxiuwan/marian/examples/transformer/back_dataset/model_zhen/model.npz --train-sets /media/tmxmall/marian_nmt/general.gen.back.0807/middle/train.bpe.zh /media/tmxmall/marian_nmt/general.gen.back.0807/middle/train.bpe.en --max-length 100 --vocabs /media/wangxiuwan/marian/examples/transformer/back_dataset/model_vocab_big/vocab.zh.yml /media/wangxiuwan/marian/examples/transformer/back_dataset/model_vocab_big/vocab.en.yml --mini-batch-fit -w 6000 --maxi-batch 1000 --early-stopping 40 --cost-type=ce-mean-words --valid-freq 5000 --save-freq 5000 --disp-freq 1000 --valid-metrics ce-mean-words perplexity translation --valid-sets /media/tmxmall/marian_nmt/general.gen.back.0807/middle/valid.bpe.zh /media/tmxmall/marian_nmt/general.gen.back.0807/middle/valid.bpe.en --valid-script-path 'bash /media/wangxiuwan/marian/examples/transformer/back_dataset/scripts/validate_zhen.sh' --valid-translation-output /media/wangxiuwan/marian/examples/transformer/back_dataset/tmxmall_valid_data/valid.en.output --quiet-translation --valid-mini-batch 16 --beam-size 6 --normalize 0.6 --overwrite --keep-best --log /media/wangxiuwan/marian/examples/transformer/back_dataset/model_zhen/train.log --valid-log /media/wangxiuwan/marian/examples/transformer/back_dataset/model_zhen/valid.log --enc-depth 6 --dec-depth 6 --transformer-heads 8 --transformer-postprocess-emb d --transformer-postprocess dan --transformer-dropout 0.1 --label-smoothing 0.1 --learn-rate 0.0003 --lr-warmup 16000 --lr-decay-inv-sqrt 16000 --lr-report --optimizer-params 0.9 0.98 1e-09 --clip-norm 5 --tied-embeddings --devices 0 1 2 3 --sync-sgd --seed 1111 --exponential-smoothing
[2019-08-14 10:55:34] [config] after-batches: 0
[2019-08-14 10:55:34] [config] after-epochs: 0
[2019-08-14 10:55:34] [config] allow-unk: false
[2019-08-14 10:55:34] [config] beam-size: 6
[2019-08-14 10:55:34] [config] best-deep: false
[2019-08-14 10:55:34] [config] clip-gemm: 0
[2019-08-14 10:55:34] [config] clip-norm: 5
[2019-08-14 10:55:34] [config] cost-type: ce-mean-words
[2019-08-14 10:55:34] [config] cpu-threads: 0
[2019-08-14 10:55:34] [config] data-weighting-type: sentence
[2019-08-14 10:55:34] [config] dec-cell: gru
[2019-08-14 10:55:34] [config] dec-cell-base-depth: 2
[2019-08-14 10:55:34] [config] dec-cell-high-depth: 1
[2019-08-14 10:55:34] [config] dec-depth: 6
[2019-08-14 10:55:34] [config] devices:
[2019-08-14 10:55:34] [config] - 0
[2019-08-14 10:55:34] [config] - 1
[2019-08-14 10:55:34] [config] - 2
[2019-08-14 10:55:34] [config] - 3
[2019-08-14 10:55:34] [config] dim-emb: 512
[2019-08-14 10:55:34] [config] dim-rnn: 1024
[2019-08-14 10:55:34] [config] dim-vocabs:
[2019-08-14 10:55:34] [config] - 36000
[2019-08-14 10:55:34] [config] - 34366
[2019-08-14 10:55:34] [config] disp-first: 0
[2019-08-14 10:55:34] [config] disp-freq: 1000
[2019-08-14 10:55:34] [config] disp-label-counts: false
[2019-08-14 10:55:34] [config] dropout-rnn: 0
[2019-08-14 10:55:34] [config] dropout-src: 0
[2019-08-14 10:55:34] [config] dropout-trg: 0
[2019-08-14 10:55:34] [config] early-stopping: 40
[2019-08-14 10:55:34] [config] embedding-fix-src: false
[2019-08-14 10:55:34] [config] embedding-fix-trg: false
[2019-08-14 10:55:34] [config] embedding-normalization: false
[2019-08-14 10:55:34] [config] enc-cell: gru
[2019-08-14 10:55:34] [config] enc-cell-depth: 1
[2019-08-14 10:55:34] [config] enc-depth: 6
[2019-08-14 10:55:34] [config] enc-type: bidirectional
[2019-08-14 10:55:34] [config] exponential-smoothing: 0.0001
[2019-08-14 10:55:34] [config] grad-dropping-momentum: 0
[2019-08-14 10:55:34] [config] grad-dropping-rate: 0
[2019-08-14 10:55:34] [config] grad-dropping-warmup: 100
[2019-08-14 10:55:34] [config] guided-alignment: none
[2019-08-14 10:55:34] [config] guided-alignment-cost: mse
[2019-08-14 10:55:34] [config] guided-alignment-weight: 0.1
[2019-08-14 10:55:34] [config] ignore-model-config: false
[2019-08-14 10:55:34] [config] interpolate-env-vars: false
[2019-08-14 10:55:34] [config] keep-best: true
[2019-08-14 10:55:34] [config] label-smoothing: 0.1
[2019-08-14 10:55:34] [config] layer-normalization: false
[2019-08-14 10:55:34] [config] learn-rate: 0.0003
[2019-08-14 10:55:34] [config] log: /media/wangxiuwan/marian/examples/transformer/back_dataset/model_zhen/train.log
[2019-08-14 10:55:34] [config] log-level: info
[2019-08-14 10:55:34] [config] lr-decay: 0
[2019-08-14 10:55:34] [config] lr-decay-freq: 50000
[2019-08-14 10:55:34] [config] lr-decay-inv-sqrt: 16000
[2019-08-14 10:55:34] [config] lr-decay-repeat-warmup: false
[2019-08-14 10:55:34] [config] lr-decay-reset-optimizer: false
[2019-08-14 10:55:34] [config] lr-decay-start:
[2019-08-14 10:55:34] [config] - 10
[2019-08-14 10:55:34] [config] - 1
[2019-08-14 10:55:34] [config] lr-decay-strategy: epoch+stalled
[2019-08-14 10:55:34] [config] lr-report: true
[2019-08-14 10:55:34] [config] lr-warmup: 16000
[2019-08-14 10:55:34] [config] lr-warmup-at-reload: false
[2019-08-14 10:55:34] [config] lr-warmup-cycle: false
[2019-08-14 10:55:34] [config] lr-warmup-start-rate: 0
[2019-08-14 10:55:34] [config] max-length: 100
[2019-08-14 10:55:34] [config] max-length-crop: false
[2019-08-14 10:55:34] [config] max-length-factor: 3
[2019-08-14 10:55:34] [config] maxi-batch: 1000
[2019-08-14 10:55:34] [config] maxi-batch-sort: trg
[2019-08-14 10:55:34] [config] mini-batch: 64
[2019-08-14 10:55:34] [config] mini-batch-fit: true
[2019-08-14 10:55:34] [config] mini-batch-fit-step: 10
[2019-08-14 10:55:34] [config] mini-batch-words: 0
[2019-08-14 10:55:34] [config] model: /media/wangxiuwan/marian/examples/transformer/back_dataset/model_zhen/model.npz
[2019-08-14 10:55:34] [config] multi-node: false
[2019-08-14 10:55:34] [config] multi-node-overlap: true
[2019-08-14 10:55:34] [config] n-best: false
[2019-08-14 10:55:34] [config] no-nccl: false
[2019-08-14 10:55:34] [config] no-reload: false
[2019-08-14 10:55:34] [config] no-restore-corpus: false
[2019-08-14 10:55:34] [config] no-shuffle: false
[2019-08-14 10:55:34] [config] normalize: 0.6
[2019-08-14 10:55:34] [config] optimizer: adam
[2019-08-14 10:55:34] [config] optimizer-delay: 1
[2019-08-14 10:55:34] [config] optimizer-params:
[2019-08-14 10:55:34] [config] - 0.9
[2019-08-14 10:55:34] [config] - 0.98
[2019-08-14 10:55:34] [config] - 1e-09
[2019-08-14 10:55:34] [config] overwrite: true
[2019-08-14 10:55:34] [config] pretrained-model: /media/wangxiuwan/marian/examples/transformer/back_dataset/model_zhen/model.npz
[2019-08-14 10:55:34] [config] quiet: false
[2019-08-14 10:55:34] [config] quiet-translation: true
[2019-08-14 10:55:34] [config] relative-paths: false
[2019-08-14 10:55:34] [config] right-left: false
[2019-08-14 10:55:34] [config] save-freq: 5000
[2019-08-14 10:55:34] [config] seed: 1111
[2019-08-14 10:55:34] [config] shuffle-in-ram: false
[2019-08-14 10:55:34] [config] skip: false
[2019-08-14 10:55:34] [config] sqlite: ""
[2019-08-14 10:55:34] [config] sqlite-drop: false
[2019-08-14 10:55:34] [config] sync-sgd: true
[2019-08-14 10:55:34] [config] tempdir: /tmp
[2019-08-14 10:55:34] [config] tied-embeddings: true
[2019-08-14 10:55:34] [config] tied-embeddings-all: false
[2019-08-14 10:55:34] [config] tied-embeddings-src: false
[2019-08-14 10:55:34] [config] train-sets:
[2019-08-14 10:55:34] [config] - /media/tmxmall/marian_nmt/general.gen.back.0807/middle/train.bpe.zh
[2019-08-14 10:55:34] [config] - /media/tmxmall/marian_nmt/general.gen.back.0807/middle/train.bpe.en
[2019-08-14 10:55:34] [config] transformer-aan-activation: swish
[2019-08-14 10:55:34] [config] transformer-aan-depth: 2
[2019-08-14 10:55:34] [config] transformer-aan-nogate: false
[2019-08-14 10:55:34] [config] transformer-decoder-autoreg: self-attention
[2019-08-14 10:55:34] [config] transformer-dim-aan: 2048
[2019-08-14 10:55:34] [config] transformer-dim-ffn: 2048
[2019-08-14 10:55:34] [config] transformer-dropout: 0.1
[2019-08-14 10:55:34] [config] transformer-dropout-attention: 0
[2019-08-14 10:55:34] [config] transformer-dropout-ffn: 0
[2019-08-14 10:55:34] [config] transformer-ffn-activation: swish
[2019-08-14 10:55:34] [config] transformer-ffn-depth: 2
[2019-08-14 10:55:34] [config] transformer-guided-alignment-layer: last
[2019-08-14 10:55:34] [config] transformer-heads: 8
[2019-08-14 10:55:34] [config] transformer-no-projection: false
[2019-08-14 10:55:34] [config] transformer-postprocess: dan
[2019-08-14 10:55:34] [config] transformer-postprocess-emb: d
[2019-08-14 10:55:34] [config] transformer-preprocess: ""
[2019-08-14 10:55:34] [config] transformer-tied-layers:
[2019-08-14 10:55:34] [config] []
[2019-08-14 10:55:34] [config] type: transformer
[2019-08-14 10:55:34] [config] ulr: false
[2019-08-14 10:55:34] [config] ulr-dim-emb: 0
[2019-08-14 10:55:34] [config] ulr-dropout: 0
[2019-08-14 10:55:34] [config] ulr-keys-vectors: ""
[2019-08-14 10:55:34] [config] ulr-query-vectors: ""
[2019-08-14 10:55:34] [config] ulr-softmax-temperature: 1
[2019-08-14 10:55:34] [config] ulr-trainable-transformation: false
[2019-08-14 10:55:34] [config] valid-freq: 5000
[2019-08-14 10:55:34] [config] valid-log: /media/wangxiuwan/marian/examples/transformer/back_dataset/model_zhen/valid.log
[2019-08-14 10:55:34] [config] valid-max-length: 1000
[2019-08-14 10:55:34] [config] valid-metrics:
[2019-08-14 10:55:34] [config] - ce-mean-words
[2019-08-14 10:55:34] [config] - perplexity
[2019-08-14 10:55:34] [config] - translation
[2019-08-14 10:55:34] [config] valid-mini-batch: 16
[2019-08-14 10:55:34] [config] valid-script-path: bash /media/wangxiuwan/marian/examples/transformer/back_dataset/scripts/validate_zhen.sh
[2019-08-14 10:55:34] [config] valid-sets:
[2019-08-14 10:55:34] [config] - /media/tmxmall/marian_nmt/general.gen.back.0807/middle/valid.bpe.zh
[2019-08-14 10:55:34] [config] - /media/tmxmall/marian_nmt/general.gen.back.0807/middle/valid.bpe.en
[2019-08-14 10:55:34] [config] valid-translation-output: /media/wangxiuwan/marian/examples/transformer/back_dataset/tmxmall_valid_data/valid.en.output
[2019-08-14 10:55:34] [config] version: v1.7.6 02f4af4 2018-12-12 18:51:10 -0800
[2019-08-14 10:55:34] [config] vocabs:
[2019-08-14 10:55:34] [config] - /media/wangxiuwan/marian/examples/transformer/back_dataset/model_vocab_big/vocab.zh.yml
[2019-08-14 10:55:34] [config] - /media/wangxiuwan/marian/examples/transformer/back_dataset/model_vocab_big/vocab.en.yml
[2019-08-14 10:55:34] [config] word-penalty: 0
[2019-08-14 10:55:34] [config] workspace: 6000
[2019-08-14 10:55:34] [config] Loaded model has been created with Marian v1.7.6 02f4af4 2018-12-12 18:51:10 -0800
[2019-08-14 10:55:34] Using synchronous training
[2019-08-14 10:55:34] [data] Loading vocabulary from JSON/Yaml file /media/wangxiuwan/marian/examples/transformer/back_dataset/model_vocab_big/vocab.zh.yml
[2019-08-14 10:55:34] [data] Setting vocabulary size for input 0 to 36000
[2019-08-14 10:55:34] [data] Loading vocabulary from JSON/Yaml file /media/wangxiuwan/marian/examples/transformer/back_dataset/model_vocab_big/vocab.en.yml
[2019-08-14 10:55:34] [data] Setting vocabulary size for input 1 to 34366
[2019-08-14 10:55:34] [batching] Collecting statistics for batch fitting with step size 10
[2019-08-14 10:55:34] Compiled without MPI support. Falling back to FakeMPIWrapper
[2019-08-14 10:55:36] [memory] Extending reserved space to 6016 MB (device gpu0)
[2019-08-14 10:55:36] [memory] Extending reserved space to 6016 MB (device gpu1)
[2019-08-14 10:55:37] [memory] Extending reserved space to 6016 MB (device gpu2)
[2019-08-14 10:55:37] [memory] Extending reserved space to 6016 MB (device gpu3)
[2019-08-14 10:55:37] [comm] Using NCCL 2.3.7 for GPU communication
[2019-08-14 10:55:37] [memory] Reserving 305 MB, device gpu0
[2019-08-14 10:55:38] [memory] Reserving 305 MB, device gpu0
[2019-08-14 10:55:46] [batching] Done
[2019-08-14 10:55:47] [memory] Extending reserved space to 6016 MB (device gpu0)
[2019-08-14 10:55:47] [memory] Extending reserved space to 6016 MB (device gpu1)
[2019-08-14 10:55:47] [memory] Extending reserved space to 6016 MB (device gpu2)
[2019-08-14 10:55:47] [memory] Extending reserved space to 6016 MB (device gpu3)
[2019-08-14 10:55:47] [comm] Using NCCL 2.3.7 for GPU communication
[2019-08-14 10:55:47] Loading model from /media/wangxiuwan/marian/examples/transformer/back_dataset/model_zhen/model.npz.orig.npz
[2019-08-14 10:55:47] Loading model from /media/wangxiuwan/marian/examples/transformer/back_dataset/model_zhen/model.npz.orig.npz
[2019-08-14 10:55:48] Loading model from /media/wangxiuwan/marian/examples/transformer/back_dataset/model_zhen/model.npz.orig.npz
[2019-08-14 10:55:48] Loading model from /media/wangxiuwan/marian/examples/transformer/back_dataset/model_zhen/model.npz.orig.npz
[2019-08-14 10:55:49] Loading Adam parameters from /media/wangxiuwan/marian/examples/transformer/back_dataset/model_zhen/model.npz.optimizer.npz
[2019-08-14 10:55:50] [memory] Reserving 152 MB, device gpu0
[2019-08-14 10:55:50] [memory] Reserving 152 MB, device gpu1
[2019-08-14 10:55:50] [memory] Reserving 152 MB, device gpu2
[2019-08-14 10:55:50] [memory] Reserving 152 MB, device gpu3
[2019-08-14 10:55:50] [data] Restoring the corpus state to epoch 1, batch 65000
[2019-08-14 10:55:50] [data] Shuffling files
[2019-08-14 11:00:34] [data] Done reading 183177554 sentences
[2019-08-14 11:13:07] [data] Done shuffling 183177554 sentences to temp files
[2019-08-14 11:22:06] Training started
[2019-08-14 11:22:06] [memory] Reserving 305 MB, device gpu0
[2019-08-14 11:22:07] [memory] Reserving 305 MB, device gpu2
[2019-08-14 11:22:07] [memory] Reserving 305 MB, device gpu1
[2019-08-14 11:22:07] [memory] Reserving 305 MB, device gpu3
[2019-08-14 11:22:07] Loading model from /media/wangxiuwan/marian/examples/transformer/back_dataset/model_zhen/model.npz
[2019-08-14 11:22:10] [memory] Reserving 305 MB, device cpu0
[2019-08-14 11:22:10] [memory] Reserving 76 MB, device gpu0
[2019-08-14 11:22:10] [memory] Reserving 76 MB, device gpu1
[2019-08-14 11:22:10] [memory] Reserving 76 MB, device gpu2
[2019-08-14 11:22:10] [memory] Reserving 76 MB, device gpu3
[2019-08-14 11:22:10] [memory] Reserving 305 MB, device gpu3
[2019-08-14 11:22:10] [memory] Reserving 305 MB, device gpu2
[2019-08-14 11:22:10] [memory] Reserving 305 MB, device gpu1
[2019-08-14 11:22:10] [memory] Reserving 305 MB, device gpu0
[2019-08-14 11:27:43] Ep. 1 : Up. 66000 : Sen. 16,738,027 : Cost 3.13637638 : Time 1916.10s : 5001.61 words/s : L.r. 1.4771e-04
[2019-08-14 11:33:18] Ep. 1 : Up. 67000 : Sen. 17,173,219 : Cost 3.11160016 : Time 335.53s : 28835.85 words/s : L.r. 1.4660e-04
[2019-08-14 11:38:55] Ep. 1 : Up. 68000 : Sen. 17,606,919 : Cost 3.10447025 : Time 337.14s : 28981.41 words/s : L.r. 1.4552e-04
[2019-08-14 11:44:31] Ep. 1 : Up. 69000 : Sen. 18,040,301 : Cost 3.09355903 : Time 336.26s : 28664.33 words/s : L.r. 1.4446e-04
[2019-08-14 11:50:07] Ep. 1 : Up. 70000 : Sen. 18,465,808 : Cost 3.09279132 : Time 335.46s : 28678.63 words/s : L.r. 1.4343e-04
[2019-08-14 11:50:07] Saving model weights and runtime parameters to /media/wangxiuwan/marian/examples/transformer/back_dataset/model_zhen/model.npz.orig.npz
[2019-08-14 11:50:11] Saving model weights and runtime parameters to /media/wangxiuwan/marian/examples/transformer/back_dataset/model_zhen/model.npz
[2019-08-14 11:50:14] Saving Adam parameters to /media/wangxiuwan/marian/examples/transformer/back_dataset/model_zhen/model.npz.optimizer.npz
[2019-08-14 11:50:28] Saving model weights and runtime parameters to /media/wangxiuwan/marian/examples/transformer/back_dataset/model_zhen/model.npz.best-ce-mean-words.npz
[2019-08-14 11:50:31] [valid] Ep. 1 : Up. 70000 : ce-mean-words : 1.93185 : new best
[2019-08-14 11:50:37] Saving model weights and runtime parameters to /media/wangxiuwan/marian/examples/transformer/back_dataset/model_zhen/model.npz.best-perplexity.npz
[2019-08-14 11:50:40] [valid] Ep. 1 : Up. 70000 : perplexity : 6.90226 : new best
[2019-08-14 11:53:29] Saving model weights and runtime parameters to /media/wangxiuwan/marian/examples/transformer/back_dataset/model_zhen/model.npz.best-translation.npz
[2019-08-14 11:53:32] [valid] Ep. 1 : Up. 70000 : translation : 26.8 : new best
[2019-08-14 11:59:09] Ep. 1 : Up. 71000 : Sen. 18,895,607 : Cost 3.08209920 : Time 542.52s : 17812.32 words/s : L.r. 1.4241e-04
[2019-08-14 12:04:48] Ep. 1 : Up. 72000 : Sen. 19,330,863 : Cost 3.07790542 : Time 338.44s : 28752.45 words/s : L.r. 1.4142e-04
[2019-08-14 12:08:34] Error: Error reading from file '.'
[2019-08-14 12:08:34] Error: Aborted from marian::io::InputFileStream& marian::io::getline(marian::io::InputFileStream&, std::__cxx11::string&) in /media/wangxiuwan/marian/src/common/file_stream.h:218

[CALL STACK]
[0x5b3f82]
[0x5b49f5]
[0x5a58cf]
[0x51638d]
[0x5171cb]
[0x517bae]
[0x43fab9]
[0x7f57f98d9a99] + 0xea99
[0x439142]
[0x440ee1]
[0x468d04]
[0x7f57f93f9c80] + 0xb8c80
[0x7f57f98d26ba] + 0x76ba
[0x7f57f8b5f41d] clone + 0x6d

@wangxw1023
Author

And another question: when training, I set --tempdir /media/wangxiuwan/tmp, and I see the log line:
[2019-08-14 15:00:36] [data] Done shuffling 183177554 sentences to temp files

But when I check the directory /media/wangxiuwan/tmp, there is nothing in it. Why? I had assumed that tempdir is where the temp files are saved, and that the space required equals the corpus size.

It is clear that I have misunderstood something. Can you explain it to me?
Thank you very much.

@emjotde
Member

emjotde commented Aug 14, 2019

Hi,
The temporary files are invisible because they are deleted (unlinked) as soon as they are opened. That way they still exist while in use, but the OS removes them for good once the process finishes. This is a relatively fail-safe way to make sure temporary files are not left behind after irregular process termination. The directory is still being used.
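
For reference, this is the classic POSIX create-then-unlink idiom. A minimal sketch, assuming a POSIX system (illustrative only, not Marian's actual implementation):

```cpp
#include <cstdlib>    // mkstemp
#include <stdexcept>
#include <string>
#include <unistd.h>   // unlink, close

int main() {
  std::string path = "/tmp/marian.XXXXXX";  // mkstemp fills in the XXXXXX
  int fd = mkstemp(&path[0]);               // create and open the temp file
  if(fd == -1)
    throw std::runtime_error("cannot create temporary file");
  unlink(path.c_str());  // drop the directory entry right away: the file no
                         // longer shows up in ls, but stays alive through fd
  // ... write and read the shuffled corpus through fd ...
  close(fd);             // space is reclaimed here, or when the process dies
  return 0;
}
```

While the process is running, such unlinked files are still visible with lsof +L1 or under /proc/<pid>/fd, which is also how you can confirm the directory is being used.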

I would indeed guess that the error is connected to temporary space, so changing to a different folder would be my suggestion. Also, maybe update to the current master of marian-dev (that should be version 1.7.8); it may have better error reporting.

@emjotde emjotde self-assigned this Aug 14, 2019
@wangxw1023
Author

@emjotde Thank you very much. I have updated my Marian version to marian-dev 1.7.8. Training started normally and has been running for 6 hours. If the error "Error reading from file '.'" is thrown again, I will contact you again.

@emjotde
Member

emjotde commented Aug 14, 2019

Great. I am closing this issue then. Feel free to re-open if you still have problems.

@emjotde emjotde closed this as completed Aug 14, 2019
@frankseide
Contributor

frankseide commented Aug 14, 2019

But do we know where this comes from? Are we failing to detect errors while writing to the temp file? If so, that's a bug.
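
For context: with C++ streams, a failed write (for example when the tempdir's filesystem fills up) only sets the stream's error bits, and unless those bits are checked or exceptions are enabled the failure stays silent until a later read. A minimal sketch of the kind of check involved (illustrative, not Marian's actual code):

```cpp
#include <fstream>

int main() {
  std::ofstream out("/tmp/shuffled.tmp", std::ios::binary);
  // Throw std::ios_base::failure as soon as a write or flush fails,
  // e.g. on ENOSPC when the temp filesystem is full.
  out.exceptions(std::ofstream::failbit | std::ofstream::badbit);
  out << "some shuffled sentence\n";
  out.flush();  // push buffered data out now so errors surface here, not at close
  return 0;
}
```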

@frankseide frankseide reopened this Aug 14, 2019
@emjotde
Member

emjotde commented Aug 14, 2019

We changed quite a lot around error reporting, error bits, and stream handling between the version that was used and the current master. I would suspect a mix of user error (like too little temp space) and the older Marian version's poor reporting behavior in that case. I would consider this closed unless we get information that there is an actual bug. The version used was from December last year: v1.7.6 02f4af4 2018-12-12

@wangxw1023
Author

wangxw1023 commented Aug 15, 2019

@emjotde Hi, after 6 hours the error was thrown again.
tempdir: /tmp
I think the tempdir has enough space, so I am confused. Error reading from file? Is my server's cache insufficient? If my corpus is 38G, how can I tell from "free -h" and "df -h" whether the machine has enough memory and disk space to complete the training? Thank you very much! If you need more information, please feel free to contact me.
df -h:
(base) work@dbcloud-Super-Server:/tmp$ df -h
Filesystem      Size  Used  Avail  Use%  Mounted on
udev             63G     0    63G    0%  /dev
tmpfs            13G   91M    13G    1%  /run
/dev/sda3       3.5T  2.5T   822G   76%  /
tmpfs            63G  216K    63G    1%  /dev/shm
tmpfs           5.0M  4.0K   5.0M    1%  /run/lock
tmpfs            63G     0    63G    0%  /sys/fs/cgroup
/dev/sda2       512M  3.7M   509M    1%  /boot/efi
tmpfs            13G   28K    13G    1%  /run/user/108
tmpfs            13G     0    13G    0%  /run/user/1001
tmpfs            13G     0    13G    0%  /run/user/0

(base) work@dbcloud-Super-Server:/tmp$ free -h
              total        used        free      shared  buff/cache   available
Mem:           125G        3.5G         87G         15M         35G        120G
Swap:          3.8G        2.9G        913M

train.log:
[2019-08-14 15:08:05] [marian] Marian v1.7.8 c65c26d 2019-08-11 18:27:00 +0100
[2019-08-14 15:08:05] [marian] Running on dbcloud-Super-Server as process 18137 with command line:
[2019-08-14 15:08:05] [marian] /media/wangxiuwan/marian-dev/build/marian --model /media/wangxiuwan/marian-dev/examples/transformer/back_dataset/model_zhen/model.npz --type transformer --train-sets /media/tmxmall/marian_nmt/general.gen.back.0807/middle/train.bpe.zh /media/tmxmall/marian_nmt/general.gen.back.0807/middle/train.bpe.en --max-length 100 --vocabs /media/wangxiuwan/marian-dev/examples/transformer/back_dataset/model_vocab_big/vocab.zh.yml /media/wangxiuwan/marian-dev/examples/transformer/back_dataset/model_vocab_big/vocab.en.yml --mini-batch-fit -w 6000 --maxi-batch 1000 --early-stopping 40 --cost-type=ce-mean-words --valid-freq 5000 --save-freq 5000 --disp-freq 1000 --valid-metrics ce-mean-words perplexity translation --valid-sets /media/tmxmall/marian_nmt/general.gen.back.0807/middle/valid.bpe.zh /media/tmxmall/marian_nmt/general.gen.back.0807/middle/valid.bpe.en --valid-script-path 'bash /media/wangxiuwan/marian-dev/examples/transformer/back_dataset/scripts/validate_zhen.sh' --valid-translation-output /media/wangxiuwan/marian-dev/examples/transformer/back_dataset/tmxmall_valid_data/valid.en.output --quiet-translation --valid-mini-batch 16 --beam-size 6 --normalize 0.6 --overwrite --keep-best --log /media/wangxiuwan/marian-dev/examples/transformer/back_dataset/model_zhen/train.log --valid-log /media/wangxiuwan/marian-dev/examples/transformer/back_dataset/model_zhen/valid.log --enc-depth 6 --dec-depth 6 --transformer-heads 8 --transformer-postprocess-emb d --transformer-postprocess dan --transformer-dropout 0.1 --label-smoothing 0.1 --learn-rate 0.0003 --lr-warmup 16000 --lr-decay-inv-sqrt 16000 --lr-report --optimizer-params 0.9 0.98 1e-09 --clip-norm 5 --tied-embeddings --devices 0 1 2 3 --sync-sgd --seed 1111 --exponential-smoothing
[2019-08-14 15:08:05] [config] after-batches: 0
[2019-08-14 15:08:05] [config] after-epochs: 0
[2019-08-14 15:08:05] [config] allow-unk: false
[2019-08-14 15:08:05] [config] beam-size: 6
[2019-08-14 15:08:05] [config] bert-class-symbol: "[CLS]"
[2019-08-14 15:08:05] [config] bert-mask-symbol: "[MASK]"
[2019-08-14 15:08:05] [config] bert-masking-fraction: 0.15
[2019-08-14 15:08:05] [config] bert-sep-symbol: "[SEP]"
[2019-08-14 15:08:05] [config] bert-train-type-embeddings: true
[2019-08-14 15:08:05] [config] bert-type-vocab-size: 2
[2019-08-14 15:08:05] [config] clip-gemm: 0
[2019-08-14 15:08:05] [config] clip-norm: 5
[2019-08-14 15:08:05] [config] cost-type: ce-mean-words
[2019-08-14 15:08:05] [config] cpu-threads: 0
[2019-08-14 15:08:05] [config] data-weighting: ""
[2019-08-14 15:08:05] [config] data-weighting-type: sentence
[2019-08-14 15:08:05] [config] dec-cell: gru
[2019-08-14 15:08:05] [config] dec-cell-base-depth: 2
[2019-08-14 15:08:05] [config] dec-cell-high-depth: 1
[2019-08-14 15:08:05] [config] dec-depth: 6
[2019-08-14 15:08:05] [config] devices:
[2019-08-14 15:08:05] [config] - 0
[2019-08-14 15:08:05] [config] - 1
[2019-08-14 15:08:05] [config] - 2
[2019-08-14 15:08:05] [config] - 3
[2019-08-14 15:08:05] [config] dim-emb: 512
[2019-08-14 15:08:05] [config] dim-rnn: 1024
[2019-08-14 15:08:05] [config] dim-vocabs:
[2019-08-14 15:08:05] [config] - 0
[2019-08-14 15:08:05] [config] - 0
[2019-08-14 15:08:05] [config] disp-first: 0
[2019-08-14 15:08:05] [config] disp-freq: 1000
[2019-08-14 15:08:05] [config] disp-label-counts: false
[2019-08-14 15:08:05] [config] dropout-rnn: 0
[2019-08-14 15:08:05] [config] dropout-src: 0
[2019-08-14 15:08:05] [config] dropout-trg: 0
[2019-08-14 15:08:05] [config] dump-config: ""
[2019-08-14 15:08:05] [config] early-stopping: 40
[2019-08-14 15:08:05] [config] embedding-fix-src: false
[2019-08-14 15:08:05] [config] embedding-fix-trg: false
[2019-08-14 15:08:05] [config] embedding-normalization: false
[2019-08-14 15:08:05] [config] embedding-vectors:
[2019-08-14 15:08:05] [config] []
[2019-08-14 15:08:05] [config] enc-cell: gru
[2019-08-14 15:08:05] [config] enc-cell-depth: 1
[2019-08-14 15:08:05] [config] enc-depth: 6
[2019-08-14 15:08:05] [config] enc-type: bidirectional
[2019-08-14 15:08:05] [config] exponential-smoothing: 0.0001
[2019-08-14 15:08:05] [config] grad-dropping-momentum: 0
[2019-08-14 15:08:05] [config] grad-dropping-rate: 0
[2019-08-14 15:08:05] [config] grad-dropping-warmup: 100
[2019-08-14 15:08:05] [config] guided-alignment: none
[2019-08-14 15:08:05] [config] guided-alignment-cost: mse
[2019-08-14 15:08:05] [config] guided-alignment-weight: 0.1
[2019-08-14 15:08:05] [config] ignore-model-config: false
[2019-08-14 15:08:05] [config] input-types:
[2019-08-14 15:08:05] [config] []
[2019-08-14 15:08:05] [config] interpolate-env-vars: false
[2019-08-14 15:08:05] [config] keep-best: true
[2019-08-14 15:08:05] [config] label-smoothing: 0.1
[2019-08-14 15:08:05] [config] layer-normalization: false
[2019-08-14 15:08:05] [config] learn-rate: 0.0003
[2019-08-14 15:08:05] [config] log: /media/wangxiuwan/marian-dev/examples/transformer/back_dataset/model_zhen/train.log
[2019-08-14 15:08:05] [config] log-level: info
[2019-08-14 15:08:05] [config] log-time-zone: ""
[2019-08-14 15:08:05] [config] lr-decay: 0
[2019-08-14 15:08:05] [config] lr-decay-freq: 50000
[2019-08-14 15:08:05] [config] lr-decay-inv-sqrt:
[2019-08-14 15:08:05] [config] - 16000
[2019-08-14 15:08:05] [config] lr-decay-repeat-warmup: false
[2019-08-14 15:08:05] [config] lr-decay-reset-optimizer: false
[2019-08-14 15:08:05] [config] lr-decay-start:
[2019-08-14 15:08:05] [config] - 10
[2019-08-14 15:08:05] [config] - 1
[2019-08-14 15:08:05] [config] lr-decay-strategy: epoch+stalled
[2019-08-14 15:08:05] [config] lr-report: true
[2019-08-14 15:08:05] [config] lr-warmup: 16000
[2019-08-14 15:08:05] [config] lr-warmup-at-reload: false
[2019-08-14 15:08:05] [config] lr-warmup-cycle: false
[2019-08-14 15:08:05] [config] lr-warmup-start-rate: 0
[2019-08-14 15:08:05] [config] max-length: 100
[2019-08-14 15:08:05] [config] max-length-crop: false
[2019-08-14 15:08:05] [config] max-length-factor: 3
[2019-08-14 15:08:05] [config] maxi-batch: 1000
[2019-08-14 15:08:05] [config] maxi-batch-sort: trg
[2019-08-14 15:08:05] [config] mini-batch: 64
[2019-08-14 15:08:05] [config] mini-batch-fit: true
[2019-08-14 15:08:05] [config] mini-batch-fit-step: 10
[2019-08-14 15:08:05] [config] mini-batch-overstuff: 1
[2019-08-14 15:08:05] [config] mini-batch-track-lr: false
[2019-08-14 15:08:05] [config] mini-batch-understuff: 1
[2019-08-14 15:08:05] [config] mini-batch-warmup: 0
[2019-08-14 15:08:05] [config] mini-batch-words: 0
[2019-08-14 15:08:05] [config] mini-batch-words-ref: 0
[2019-08-14 15:08:05] [config] model: /media/wangxiuwan/marian-dev/examples/transformer/back_dataset/model_zhen/model.npz
[2019-08-14 15:08:05] [config] multi-loss-type: sum
[2019-08-14 15:08:05] [config] multi-node: false
[2019-08-14 15:08:05] [config] multi-node-overlap: true
[2019-08-14 15:08:05] [config] n-best: false
[2019-08-14 15:08:05] [config] no-nccl: false
[2019-08-14 15:08:05] [config] no-reload: false
[2019-08-14 15:08:05] [config] no-restore-corpus: false
[2019-08-14 15:08:05] [config] no-shuffle: false
[2019-08-14 15:08:05] [config] normalize: 0.6
[2019-08-14 15:08:05] [config] num-devices: 0
[2019-08-14 15:08:05] [config] optimizer: adam
[2019-08-14 15:08:05] [config] optimizer-delay: 1
[2019-08-14 15:08:05] [config] optimizer-params:
[2019-08-14 15:08:05] [config] - 0.9
[2019-08-14 15:08:05] [config] - 0.98
[2019-08-14 15:08:05] [config] - 1e-09
[2019-08-14 15:08:05] [config] overwrite: true
[2019-08-14 15:08:05] [config] pretrained-model: ""
[2019-08-14 15:08:05] [config] quiet: false
[2019-08-14 15:08:05] [config] quiet-translation: true
[2019-08-14 15:08:05] [config] relative-paths: false
[2019-08-14 15:08:05] [config] right-left: false
[2019-08-14 15:08:05] [config] save-freq: 5000
[2019-08-14 15:08:05] [config] seed: 1111
[2019-08-14 15:08:05] [config] shuffle-in-ram: false
[2019-08-14 15:08:05] [config] skip: false
[2019-08-14 15:08:05] [config] sqlite: ""
[2019-08-14 15:08:05] [config] sqlite-drop: false
[2019-08-14 15:08:05] [config] sync-sgd: true
[2019-08-14 15:08:05] [config] tempdir: /tmp
[2019-08-14 15:08:05] [config] tied-embeddings: true
[2019-08-14 15:08:05] [config] tied-embeddings-all: false
[2019-08-14 15:08:05] [config] tied-embeddings-src: false
[2019-08-14 15:08:05] [config] train-sets:
[2019-08-14 15:08:05] [config] - /media/tmxmall/marian_nmt/general.gen.back.0807/middle/train.bpe.zh
[2019-08-14 15:08:05] [config] - /media/tmxmall/marian_nmt/general.gen.back.0807/middle/train.bpe.en
[2019-08-14 15:08:05] [config] transformer-aan-activation: swish
[2019-08-14 15:08:05] [config] transformer-aan-depth: 2
[2019-08-14 15:08:05] [config] transformer-aan-nogate: false
[2019-08-14 15:08:05] [config] transformer-decoder-autoreg: self-attention
[2019-08-14 15:08:05] [config] transformer-dim-aan: 2048
[2019-08-14 15:08:05] [config] transformer-dim-ffn: 2048
[2019-08-14 15:08:05] [config] transformer-dropout: 0.1
[2019-08-14 15:08:05] [config] transformer-dropout-attention: 0
[2019-08-14 15:08:05] [config] transformer-dropout-ffn: 0
[2019-08-14 15:08:05] [config] transformer-ffn-activation: swish
[2019-08-14 15:08:05] [config] transformer-ffn-depth: 2
[2019-08-14 15:08:05] [config] transformer-guided-alignment-layer: last
[2019-08-14 15:08:05] [config] transformer-heads: 8
[2019-08-14 15:08:05] [config] transformer-no-projection: false
[2019-08-14 15:08:05] [config] transformer-postprocess: dan
[2019-08-14 15:08:05] [config] transformer-postprocess-emb: d
[2019-08-14 15:08:05] [config] transformer-preprocess: ""
[2019-08-14 15:08:05] [config] transformer-tied-layers:
[2019-08-14 15:08:05] [config] []
[2019-08-14 15:08:05] [config] transformer-train-position-embeddings: false
[2019-08-14 15:08:05] [config] type: transformer
[2019-08-14 15:08:05] [config] ulr: false
[2019-08-14 15:08:05] [config] ulr-dim-emb: 0
[2019-08-14 15:08:05] [config] ulr-dropout: 0
[2019-08-14 15:08:05] [config] ulr-keys-vectors: ""
[2019-08-14 15:08:05] [config] ulr-query-vectors: ""
[2019-08-14 15:08:05] [config] ulr-softmax-temperature: 1
[2019-08-14 15:08:05] [config] ulr-trainable-transformation: false
[2019-08-14 15:08:05] [config] valid-freq: 5000
[2019-08-14 15:08:05] [config] valid-log: /media/wangxiuwan/marian-dev/examples/transformer/back_dataset/model_zhen/valid.log
[2019-08-14 15:08:05] [config] valid-max-length: 1000
[2019-08-14 15:08:05] [config] valid-metrics:
[2019-08-14 15:08:05] [config] - ce-mean-words
[2019-08-14 15:08:05] [config] - perplexity
[2019-08-14 15:08:05] [config] - translation
[2019-08-14 15:08:05] [config] valid-mini-batch: 16
[2019-08-14 15:08:05] [config] valid-script-path: bash /media/wangxiuwan/marian-dev/examples/transformer/back_dataset/scripts/validate_zhen.sh
[2019-08-14 15:08:05] [config] valid-sets:
[2019-08-14 15:08:05] [config] - /media/tmxmall/marian_nmt/general.gen.back.0807/middle/valid.bpe.zh
[2019-08-14 15:08:05] [config] - /media/tmxmall/marian_nmt/general.gen.back.0807/middle/valid.bpe.en
[2019-08-14 15:08:05] [config] valid-translation-output: /media/wangxiuwan/marian-dev/examples/transformer/back_dataset/tmxmall_valid_data/valid.en.output
[2019-08-14 15:08:05] [config] vocabs:
[2019-08-14 15:08:05] [config] - /media/wangxiuwan/marian-dev/examples/transformer/back_dataset/model_vocab_big/vocab.zh.yml
[2019-08-14 15:08:05] [config] - /media/wangxiuwan/marian-dev/examples/transformer/back_dataset/model_vocab_big/vocab.en.yml
[2019-08-14 15:08:05] [config] word-penalty: 0
[2019-08-14 15:08:05] [config] workspace: 6000
[2019-08-14 15:08:05] [config] Model is being created with Marian v1.7.8 c65c26d 2019-08-11 18:27:00 +0100
[2019-08-14 15:08:05] Using synchronous training
[2019-08-14 15:08:05] [data] Loading vocabulary from JSON/Yaml file /media/wangxiuwan/marian-dev/examples/transformer/back_dataset/model_vocab_big/vocab.zh.yml
[2019-08-14 15:08:06] [data] Setting vocabulary size for input 0 to 36000
[2019-08-14 15:08:06] [data] Loading vocabulary from JSON/Yaml file /media/wangxiuwan/marian-dev/examples/transformer/back_dataset/model_vocab_big/vocab.en.yml
[2019-08-14 15:08:06] [data] Setting vocabulary size for input 1 to 34366
[2019-08-14 15:08:06] Compiled without MPI support. Falling back to FakeMPIWrapper
[2019-08-14 15:08:06] [batching] Collecting statistics for batch fitting with step size 10
[2019-08-14 15:08:08] [memory] Extending reserved space to 6016 MB (device gpu0)
[2019-08-14 15:08:09] [memory] Extending reserved space to 6016 MB (device gpu1)
[2019-08-14 15:08:10] [memory] Extending reserved space to 6016 MB (device gpu2)
[2019-08-14 15:08:10] [memory] Extending reserved space to 6016 MB (device gpu3)
[2019-08-14 15:08:10] [comm] Using NCCL 2.4.2 for GPU communication
[2019-08-14 15:08:10] [comm] NCCLCommunicator constructed successfully.
[2019-08-14 15:08:10] [training] Using 4 GPUs
[2019-08-14 15:08:10] [memory] Reserving 305 MB, device gpu0
[2019-08-14 15:08:10] [gpu] 16-bit TensorCores enabled for float32 matrix operations
[2019-08-14 15:08:11] [memory] Reserving 305 MB, device gpu0
[2019-08-14 15:08:19] [batching] Done. Typical MB size is 16312 target words
[2019-08-14 15:08:20] [memory] Extending reserved space to 6016 MB (device gpu0)
[2019-08-14 15:08:20] [memory] Extending reserved space to 6016 MB (device gpu1)
[2019-08-14 15:08:20] [memory] Extending reserved space to 6016 MB (device gpu2)
[2019-08-14 15:08:20] [memory] Extending reserved space to 6016 MB (device gpu3)
[2019-08-14 15:08:20] [comm] Using NCCL 2.4.2 for GPU communication
[2019-08-14 15:08:20] [comm] NCCLCommunicator constructed successfully.
[2019-08-14 15:08:20] [training] Using 4 GPUs
[2019-08-14 15:08:20] Training started
[2019-08-14 15:08:20] [data] Shuffling data
[2019-08-14 15:10:29] [data] Done reading 183177554 sentences
[2019-08-14 15:23:10] [data] Done shuffling 183177554 sentences to temp files
[2019-08-14 15:23:59] [training] Batches are processed as 1 process(es) x 4 devices/process
[2019-08-14 15:23:59] [memory] Reserving 305 MB, device gpu2
[2019-08-14 15:23:59] [memory] Reserving 305 MB, device gpu1
[2019-08-14 15:23:59] [memory] Reserving 305 MB, device gpu0
[2019-08-14 15:23:59] [memory] Reserving 305 MB, device gpu3
[2019-08-14 15:23:59] [memory] Reserving 305 MB, device gpu3
[2019-08-14 15:23:59] [memory] Reserving 305 MB, device gpu2
[2019-08-14 15:23:59] [memory] Reserving 305 MB, device gpu1
[2019-08-14 15:23:59] [memory] Reserving 305 MB, device gpu0
[2019-08-14 15:23:59] [memory] Reserving 76 MB, device gpu0
[2019-08-14 15:23:59] [memory] Reserving 76 MB, device gpu1
[2019-08-14 15:23:59] [memory] Reserving 76 MB, device gpu2
[2019-08-14 15:23:59] [memory] Reserving 76 MB, device gpu3
[2019-08-14 15:23:59] [memory] Reserving 152 MB, device gpu2
[2019-08-14 15:23:59] [memory] Reserving 152 MB, device gpu1
[2019-08-14 15:23:59] [memory] Reserving 152 MB, device gpu0
[2019-08-14 15:23:59] [memory] Reserving 152 MB, device gpu3
[2019-08-14 15:29:25] Ep. 1 : Up. 1000 : Sen. 430,572 : Cost 8.71003532 : Time 1279.01s : 7530.88 words/s : L.r. 1.8750e-05
[2019-08-14 15:34:51] Ep. 1 : Up. 2000 : Sen. 860,866 : Cost 7.37228441 : Time 326.23s : 29710.28 words/s : L.r. 3.7500e-05
[2019-08-14 15:40:17] Ep. 1 : Up. 3000 : Sen. 1,294,446 : Cost 6.85132647 : Time 325.73s : 29540.07 words/s : L.r. 5.6250e-05
[2019-08-14 15:45:45] Ep. 1 : Up. 4000 : Sen. 1,725,695 : Cost 6.48512745 : Time 327.94s : 29417.02 words/s : L.r. 7.5000e-05
[2019-08-14 15:51:12] Ep. 1 : Up. 5000 : Sen. 2,154,290 : Cost 6.19205332 : Time 326.97s : 29413.35 words/s : L.r. 9.3750e-05
[2019-08-14 15:51:12] Saving model weights and runtime parameters to /media/wangxiuwan/marian-dev/examples/transformer/back_dataset/model_zhen/model.npz.orig.npz
[2019-08-14 15:51:14] Saving model weights and runtime parameters to /media/wangxiuwan/marian-dev/examples/transformer/back_dataset/model_zhen/model.npz
[2019-08-14 15:51:15] Saving Adam parameters to /media/wangxiuwan/marian-dev/examples/transformer/back_dataset/model_zhen/model.npz.optimizer.npz
[2019-08-14 15:51:25] Saving model weights and runtime parameters to /media/wangxiuwan/marian-dev/examples/transformer/back_dataset/model_zhen/model.npz.best-ce-mean-words.npz
[2019-08-14 15:51:26] [valid] Ep. 1 : Up. 5000 : ce-mean-words : 5.18464 : new best
[2019-08-14 15:51:32] Saving model weights and runtime parameters to /media/wangxiuwan/marian-dev/examples/transformer/back_dataset/model_zhen/model.npz.best-perplexity.npz
[2019-08-14 15:51:33] [valid] Ep. 1 : Up. 5000 : perplexity : 178.508 : new best
[2019-08-14 16:01:44] Saving model weights and runtime parameters to /media/wangxiuwan/marian-dev/examples/transformer/back_dataset/model_zhen/model.npz.best-translation.npz
[2019-08-14 16:01:45] [valid] Ep. 1 : Up. 5000 : translation : 2.82 : new best
[2019-08-14 16:07:14] Ep. 1 : Up. 6000 : Sen. 2,587,084 : Cost 5.97804403 : Time 962.52s : 10096.13 words/s : L.r. 1.1250e-04
[2019-08-14 16:12:44] Ep. 1 : Up. 7000 : Sen. 3,021,933 : Cost 5.73995399 : Time 329.30s : 29539.79 words/s : L.r. 1.3125e-04
[2019-08-14 16:18:10] Ep. 1 : Up. 8000 : Sen. 3,453,915 : Cost 5.47970676 : Time 326.02s : 29519.30 words/s : L.r. 1.5000e-04
[2019-08-14 16:23:38] Ep. 1 : Up. 9000 : Sen. 3,885,778 : Cost 5.15550852 : Time 328.07s : 29405.33 words/s : L.r. 1.6875e-04
[2019-08-14 16:29:05] Ep. 1 : Up. 10000 : Sen. 4,312,859 : Cost 4.82228327 : Time 327.86s : 29455.02 words/s : L.r. 1.8750e-04
[2019-08-14 16:29:05] Saving model weights and runtime parameters to /media/wangxiuwan/marian-dev/examples/transformer/back_dataset/model_zhen/model.npz.orig.npz
[2019-08-14 16:29:09] Saving model weights and runtime parameters to /media/wangxiuwan/marian-dev/examples/transformer/back_dataset/model_zhen/model.npz
[2019-08-14 16:29:13] Saving Adam parameters to /media/wangxiuwan/marian-dev/examples/transformer/back_dataset/model_zhen/model.npz.optimizer.npz
[2019-08-14 16:29:27] Saving model weights and runtime parameters to /media/wangxiuwan/marian-dev/examples/transformer/back_dataset/model_zhen/model.npz.best-ce-mean-words.npz
[2019-08-14 16:29:30] [valid] Ep. 1 : Up. 10000 : ce-mean-words : 3.72119 : new best
[2019-08-14 16:29:36] Saving model weights and runtime parameters to /media/wangxiuwan/marian-dev/examples/transformer/back_dataset/model_zhen/model.npz.best-perplexity.npz
[2019-08-14 16:29:38] [valid] Ep. 1 : Up. 10000 : perplexity : 41.3133 : new best
[2019-08-14 16:37:04] Saving model weights and runtime parameters to /media/wangxiuwan/marian-dev/examples/transformer/back_dataset/model_zhen/model.npz.best-translation.npz
[2019-08-14 16:37:07] [valid] Ep. 1 : Up. 10000 : translation : 11.21 : new best
[2019-08-14 16:42:39] Ep. 1 : Up. 11000 : Sen. 4,753,050 : Cost 4.46836233 : Time 813.71s : 11942.40 words/s : L.r. 2.0625e-04
[2019-08-14 16:48:12] Ep. 1 : Up. 12000 : Sen. 5,187,010 : Cost 4.21248579 : Time 332.70s : 29257.87 words/s : L.r. 2.2500e-04
[2019-08-14 16:53:44] Ep. 1 : Up. 13000 : Sen. 5,619,950 : Cost 4.04143047 : Time 332.22s : 29239.11 words/s : L.r. 2.4375e-04
[2019-08-14 16:59:17] Ep. 1 : Up. 14000 : Sen. 6,052,294 : Cost 3.91244173 : Time 332.71s : 29292.93 words/s : L.r. 2.6250e-04
[2019-08-14 17:04:46] Ep. 1 : Up. 15000 : Sen. 6,487,970 : Cost 3.82174230 : Time 329.43s : 29337.17 words/s : L.r. 2.8125e-04
[2019-08-14 17:04:46] Saving model weights and runtime parameters to /media/wangxiuwan/marian-dev/examples/transformer/back_dataset/model_zhen/model.npz.orig.npz
[2019-08-14 17:04:49] Saving model weights and runtime parameters to /media/wangxiuwan/marian-dev/examples/transformer/back_dataset/model_zhen/model.npz
[2019-08-14 17:04:52] Saving Adam parameters to /media/wangxiuwan/marian-dev/examples/transformer/back_dataset/model_zhen/model.npz.optimizer.npz
[2019-08-14 17:05:05] Saving model weights and runtime parameters to /media/wangxiuwan/marian-dev/examples/transformer/back_dataset/model_zhen/model.npz.best-ce-mean-words.npz
[2019-08-14 17:05:08] [valid] Ep. 1 : Up. 15000 : ce-mean-words : 2.63367 : new best
[2019-08-14 17:05:14] Saving model weights and runtime parameters to /media/wangxiuwan/marian-dev/examples/transformer/back_dataset/model_zhen/model.npz.best-perplexity.npz
[2019-08-14 17:05:17] [valid] Ep. 1 : Up. 15000 : perplexity : 13.9248 : new best
[2019-08-14 17:09:30] Saving model weights and runtime parameters to /media/wangxiuwan/marian-dev/examples/transformer/back_dataset/model_zhen/model.npz.best-translation.npz
[2019-08-14 17:09:33] [valid] Ep. 1 : Up. 15000 : translation : 20.84 : new best
[2019-08-14 17:15:10] Ep. 1 : Up. 16000 : Sen. 6,925,326 : Cost 3.74980569 : Time 623.42s : 15682.17 words/s : L.r. 3.0000e-04
[2019-08-14 17:20:38] Ep. 1 : Up. 17000 : Sen. 7,357,788 : Cost 3.68259025 : Time 328.55s : 29373.51 words/s : L.r. 2.9104e-04
[2019-08-14 17:26:05] Ep. 1 : Up. 18000 : Sen. 7,788,773 : Cost 3.61811399 : Time 326.90s : 29333.35 words/s : L.r. 2.8284e-04
[2019-08-14 17:31:32] Ep. 1 : Up. 19000 : Sen. 8,215,891 : Cost 3.56085038 : Time 327.35s : 29362.99 words/s : L.r. 2.7530e-04
[2019-08-14 17:37:01] Ep. 1 : Up. 20000 : Sen. 8,641,960 : Cost 3.51233172 : Time 328.51s : 29077.22 words/s : L.r. 2.6833e-04
[2019-08-14 17:37:01] Saving model weights and runtime parameters to /media/wangxiuwan/marian-dev/examples/transformer/back_dataset/model_zhen/model.npz.orig.npz
[2019-08-14 17:37:04] Saving model weights and runtime parameters to /media/wangxiuwan/marian-dev/examples/transformer/back_dataset/model_zhen/model.npz
[2019-08-14 17:37:07] Saving Adam parameters to /media/wangxiuwan/marian-dev/examples/transformer/back_dataset/model_zhen/model.npz.optimizer.npz
[2019-08-14 17:37:21] Saving model weights and runtime parameters to /media/wangxiuwan/marian-dev/examples/transformer/back_dataset/model_zhen/model.npz.best-ce-mean-words.npz
[2019-08-14 17:37:23] [valid] Ep. 1 : Up. 20000 : ce-mean-words : 2.31458 : new best
[2019-08-14 17:37:29] Saving model weights and runtime parameters to /media/wangxiuwan/marian-dev/examples/transformer/back_dataset/model_zhen/model.npz.best-perplexity.npz
[2019-08-14 17:37:32] [valid] Ep. 1 : Up. 20000 : perplexity : 10.1206 : new best
[2019-08-14 17:41:41] Saving model weights and runtime parameters to /media/wangxiuwan/marian-dev/examples/transformer/back_dataset/model_zhen/model.npz.best-translation.npz
[2019-08-14 17:41:44] [valid] Ep. 1 : Up. 20000 : translation : 23.69 : new best
[2019-08-14 17:47:14] Ep. 1 : Up. 21000 : Sen. 9,068,108 : Cost 3.47109151 : Time 613.25s : 15683.43 words/s : L.r. 2.6186e-04
[2019-08-14 17:52:46] Ep. 1 : Up. 22000 : Sen. 9,508,053 : Cost 3.43533826 : Time 332.24s : 29343.18 words/s : L.r. 2.5584e-04
[2019-08-14 17:58:16] Ep. 1 : Up. 23000 : Sen. 9,945,064 : Cost 3.40507197 : Time 329.95s : 29401.35 words/s : L.r. 2.5022e-04
[2019-08-14 18:03:45] Ep. 1 : Up. 24000 : Sen. 10,368,000 : Cost 3.38107204 : Time 328.61s : 29104.86 words/s : L.r. 2.4495e-04
[2019-08-14 18:09:16] Ep. 1 : Up. 25000 : Sen. 10,800,209 : Cost 3.35002422 : Time 330.99s : 29296.81 words/s : L.r. 2.4000e-04
[2019-08-14 18:09:16] Saving model weights and runtime parameters to /media/wangxiuwan/marian-dev/examples/transformer/back_dataset/model_zhen/model.npz.orig.npz
[2019-08-14 18:09:20] Saving model weights and runtime parameters to /media/wangxiuwan/marian-dev/examples/transformer/back_dataset/model_zhen/model.npz
[2019-08-14 18:09:22] Saving Adam parameters to /media/wangxiuwan/marian-dev/examples/transformer/back_dataset/model_zhen/model.npz.optimizer.npz
[2019-08-14 18:09:36] Saving model weights and runtime parameters to /media/wangxiuwan/marian-dev/examples/transformer/back_dataset/model_zhen/model.npz.best-ce-mean-words.npz
[2019-08-14 18:09:38] [valid] Ep. 1 : Up. 25000 : ce-mean-words : 2.14763 : new best
[2019-08-14 18:09:44] Saving model weights and runtime parameters to /media/wangxiuwan/marian-dev/examples/transformer/back_dataset/model_zhen/model.npz.best-perplexity.npz
[2019-08-14 18:09:47] [valid] Ep. 1 : Up. 25000 : perplexity : 8.5645 : new best
[2019-08-14 18:13:37] Saving model weights and runtime parameters to /media/wangxiuwan/marian-dev/examples/transformer/back_dataset/model_zhen/model.npz.best-translation.npz
[2019-08-14 18:13:40] [valid] Ep. 1 : Up. 25000 : translation : 25.34 : new best
[2019-08-14 18:19:13] Ep. 1 : Up. 26000 : Sen. 11,236,544 : Cost 3.32940030 : Time 596.65s : 16186.05 words/s : L.r. 2.3534e-04
[2019-08-14 18:24:45] Ep. 1 : Up. 27000 : Sen. 11,663,576 : Cost 3.30576849 : Time 332.28s : 29219.47 words/s : L.r. 2.3094e-04
[2019-08-14 18:30:17] Ep. 1 : Up. 28000 : Sen. 12,099,503 : Cost 3.28457904 : Time 332.27s : 29206.43 words/s : L.r. 2.2678e-04
[2019-08-14 18:35:50] Ep. 1 : Up. 29000 : Sen. 12,535,422 : Cost 3.26705837 : Time 333.12s : 29155.51 words/s : L.r. 2.2283e-04
[2019-08-14 18:41:24] Ep. 1 : Up. 30000 : Sen. 12,970,064 : Cost 3.25107145 : Time 333.26s : 29033.30 words/s : L.r. 2.1909e-04
[2019-08-14 18:41:24] Saving model weights and runtime parameters to /media/wangxiuwan/marian-dev/examples/transformer/back_dataset/model_zhen/model.npz.orig.npz
[2019-08-14 18:41:27] Saving model weights and runtime parameters to /media/wangxiuwan/marian-dev/examples/transformer/back_dataset/model_zhen/model.npz
[2019-08-14 18:41:30] Saving Adam parameters to /media/wangxiuwan/marian-dev/examples/transformer/back_dataset/model_zhen/model.npz.optimizer.npz
[2019-08-14 18:41:43] Saving model weights and runtime parameters to /media/wangxiuwan/marian-dev/examples/transformer/back_dataset/model_zhen/model.npz.best-ce-mean-words.npz
[2019-08-14 18:41:46] [valid] Ep. 1 : Up. 30000 : ce-mean-words : 2.04728 : new best
[2019-08-14 18:41:53] Saving model weights and runtime parameters to /media/wangxiuwan/marian-dev/examples/transformer/back_dataset/model_zhen/model.npz.best-perplexity.npz
[2019-08-14 18:41:55] [valid] Ep. 1 : Up. 30000 : perplexity : 7.74683 : new best
[2019-08-14 18:45:39] Saving model weights and runtime parameters to /media/wangxiuwan/marian-dev/examples/transformer/back_dataset/model_zhen/model.npz.best-translation.npz
[2019-08-14 18:45:42] [valid] Ep. 1 : Up. 30000 : translation : 26.25 : new best
[2019-08-14 18:51:14] Ep. 1 : Up. 31000 : Sen. 13,403,037 : Cost 3.23336005 : Time 590.54s : 16521.07 words/s : L.r. 2.1553e-04
[2019-08-14 18:56:47] Ep. 1 : Up. 32000 : Sen. 13,837,498 : Cost 3.22317600 : Time 332.36s : 29167.47 words/s : L.r. 2.1213e-04
[2019-08-14 19:02:20] Ep. 1 : Up. 33000 : Sen. 14,272,712 : Cost 3.21346664 : Time 333.69s : 29228.31 words/s : L.r. 2.0889e-04
[2019-08-14 19:07:52] Ep. 1 : Up. 34000 : Sen. 14,702,637 : Cost 3.20045567 : Time 332.20s : 29039.14 words/s : L.r. 2.0580e-04
[2019-08-14 19:13:26] Ep. 1 : Up. 35000 : Sen. 15,141,594 : Cost 3.18303347 : Time 333.22s : 29171.15 words/s : L.r. 2.0284e-04
[2019-08-14 19:13:26] Saving model weights and runtime parameters to /media/wangxiuwan/marian-dev/examples/transformer/back_dataset/model_zhen/model.npz.orig.npz
[2019-08-14 19:13:29] Saving model weights and runtime parameters to /media/wangxiuwan/marian-dev/examples/transformer/back_dataset/model_zhen/model.npz
[2019-08-14 19:13:32] Saving Adam parameters to /media/wangxiuwan/marian-dev/examples/transformer/back_dataset/model_zhen/model.npz.optimizer.npz
[2019-08-14 19:13:46] Saving model weights and runtime parameters to /media/wangxiuwan/marian-dev/examples/transformer/back_dataset/model_zhen/model.npz.best-ce-mean-words.npz
[2019-08-14 19:13:48] [valid] Ep. 1 : Up. 35000 : ce-mean-words : 1.97774 : new best
[2019-08-14 19:13:54] Saving model weights and runtime parameters to /media/wangxiuwan/marian-dev/examples/transformer/back_dataset/model_zhen/model.npz.best-perplexity.npz
[2019-08-14 19:13:57] [valid] Ep. 1 : Up. 35000 : perplexity : 7.22638 : new best
[2019-08-14 19:17:51] Saving model weights and runtime parameters to /media/wangxiuwan/marian-dev/examples/transformer/back_dataset/model_zhen/model.npz.best-translation.npz
[2019-08-14 19:17:54] [valid] Ep. 1 : Up. 35000 : translation : 26.98 : new best
[2019-08-14 19:23:26] Ep. 1 : Up. 36000 : Sen. 15,567,649 : Cost 3.17901897 : Time 600.46s : 16035.94 words/s : L.r. 2.0000e-04
[2019-08-14 19:28:59] Ep. 1 : Up. 37000 : Sen. 16,005,163 : Cost 3.16255951 : Time 333.06s : 29264.98 words/s : L.r. 1.9728e-04
[2019-08-14 19:34:33] Ep. 1 : Up. 38000 : Sen. 16,434,421 : Cost 3.15619040 : Time 333.69s : 28981.12 words/s : L.r. 1.9467e-04
[2019-08-14 19:40:06] Ep. 1 : Up. 39000 : Sen. 16,869,501 : Cost 3.14585876 : Time 333.45s : 28927.82 words/s : L.r. 1.9215e-04
[2019-08-14 19:45:39] Ep. 1 : Up. 40000 : Sen. 17,299,517 : Cost 3.14065385 : Time 333.17s : 28857.12 words/s : L.r. 1.8974e-04
[2019-08-14 19:45:39] Saving model weights and runtime parameters to /media/wangxiuwan/marian-dev/examples/transformer/back_dataset/model_zhen/model.npz.orig.npz
[2019-08-14 19:45:43] Saving model weights and runtime parameters to /media/wangxiuwan/marian-dev/examples/transformer/back_dataset/model_zhen/model.npz
[2019-08-14 19:45:47] Saving Adam parameters to /media/wangxiuwan/marian-dev/examples/transformer/back_dataset/model_zhen/model.npz.optimizer.npz
[2019-08-14 19:46:00] Saving model weights and runtime parameters to /media/wangxiuwan/marian-dev/examples/transformer/back_dataset/model_zhen/model.npz.best-ce-mean-words.npz
[2019-08-14 19:46:03] [valid] Ep. 1 : Up. 40000 : ce-mean-words : 1.92773 : new best
[2019-08-14 19:46:09] Saving model weights and runtime parameters to /media/wangxiuwan/marian-dev/examples/transformer/back_dataset/model_zhen/model.npz.best-perplexity.npz
[2019-08-14 19:46:12] [valid] Ep. 1 : Up. 40000 : perplexity : 6.87392 : new best
[2019-08-14 19:50:01] Saving model weights and runtime parameters to /media/wangxiuwan/marian-dev/examples/transformer/back_dataset/model_zhen/model.npz.best-translation.npz
[2019-08-14 19:50:04] [valid] Ep. 1 : Up. 40000 : translation : 27.47 : new best
[2019-08-14 19:55:38] Ep. 1 : Up. 41000 : Sen. 17,735,126 : Cost 3.12336731 : Time 598.70s : 16245.60 words/s : L.r. 1.8741e-04
[2019-08-14 20:01:12] Ep. 1 : Up. 42000 : Sen. 18,167,700 : Cost 3.12053299 : Time 334.24s : 28995.10 words/s : L.r. 1.8516e-04
[2019-08-14 20:06:46] Ep. 1 : Up. 43000 : Sen. 18,595,935 : Cost 3.11462998 : Time 333.61s : 29069.59 words/s : L.r. 1.8300e-04
[2019-08-14 20:12:18] Ep. 1 : Up. 44000 : Sen. 19,030,271 : Cost 3.10356069 : Time 332.21s : 29121.34 words/s : L.r. 1.8091e-04
[2019-08-14 20:17:47] Ep. 1 : Up. 45000 : Sen. 19,464,326 : Cost 3.10118794 : Time 328.89s : 29393.89 words/s : L.r. 1.7889e-04
[2019-08-14 20:17:47] Saving model weights and runtime parameters to /media/wangxiuwan/marian-dev/examples/transformer/back_dataset/model_zhen/model.npz.orig.npz
[2019-08-14 20:17:51] Saving model weights and runtime parameters to /media/wangxiuwan/marian-dev/examples/transformer/back_dataset/model_zhen/model.npz
[2019-08-14 20:17:54] Saving Adam parameters to /media/wangxiuwan/marian-dev/examples/transformer/back_dataset/model_zhen/model.npz.optimizer.npz
[2019-08-14 20:18:08] Saving model weights and runtime parameters to /media/wangxiuwan/marian-dev/examples/transformer/back_dataset/model_zhen/model.npz.best-ce-mean-words.npz
[2019-08-14 20:18:11] [valid] Ep. 1 : Up. 45000 : ce-mean-words : 1.88996 : new best
[2019-08-14 20:18:17] Saving model weights and runtime parameters to /media/wangxiuwan/marian-dev/examples/transformer/back_dataset/model_zhen/model.npz.best-perplexity.npz
[2019-08-14 20:18:21] [valid] Ep. 1 : Up. 45000 : perplexity : 6.61913 : new best
[2019-08-14 20:22:03] Saving model weights and runtime parameters to /media/wangxiuwan/marian-dev/examples/transformer/back_dataset/model_zhen/model.npz.best-translation.npz
[2019-08-14 20:22:07] [valid] Ep. 1 : Up. 45000 : translation : 27.86 : new best
[2019-08-14 20:27:38] Ep. 1 : Up. 46000 : Sen. 19,896,865 : Cost 3.09044385 : Time 590.47s : 16412.72 words/s : L.r. 1.7693e-04
[2019-08-14 20:33:07] Ep. 1 : Up. 47000 : Sen. 20,326,121 : Cost 3.08517575 : Time 329.21s : 29284.41 words/s : L.r. 1.7504e-04
[2019-08-14 20:38:39] Ep. 1 : Up. 48000 : Sen. 20,757,964 : Cost 3.08068657 : Time 332.40s : 28818.26 words/s : L.r. 1.7321e-04
[2019-08-14 20:44:12] Ep. 1 : Up. 49000 : Sen. 21,189,484 : Cost 3.07408929 : Time 333.21s : 29174.55 words/s : L.r. 1.7143e-04
[2019-08-14 20:49:45] Ep. 1 : Up. 50000 : Sen. 21,623,640 : Cost 3.07062435 : Time 332.32s : 29025.05 words/s : L.r. 1.6971e-04
[2019-08-14 20:49:45] Saving model weights and runtime parameters to /media/wangxiuwan/marian-dev/examples/transformer/back_dataset/model_zhen/model.npz.orig.npz
[2019-08-14 20:49:49] Saving model weights and runtime parameters to /media/wangxiuwan/marian-dev/examples/transformer/back_dataset/model_zhen/model.npz
[2019-08-14 20:49:52] Saving Adam parameters to /media/wangxiuwan/marian-dev/examples/transformer/back_dataset/model_zhen/model.npz.optimizer.npz
[2019-08-14 20:50:06] Saving model weights and runtime parameters to /media/wangxiuwan/marian-dev/examples/transformer/back_dataset/model_zhen/model.npz.best-ce-mean-words.npz
[2019-08-14 20:50:09] [valid] Ep. 1 : Up. 50000 : ce-mean-words : 1.85912 : new best
[2019-08-14 20:50:15] Saving model weights and runtime parameters to /media/wangxiuwan/marian-dev/examples/transformer/back_dataset/model_zhen/model.npz.best-perplexity.npz
[2019-08-14 20:50:18] [valid] Ep. 1 : Up. 50000 : perplexity : 6.41809 : new best
[2019-08-14 20:53:59] Saving model weights and runtime parameters to /media/wangxiuwan/marian-dev/examples/transformer/back_dataset/model_zhen/model.npz.best-translation.npz
[2019-08-14 20:54:02] [valid] Ep. 1 : Up. 50000 : translation : 28.23 : new best
[2019-08-14 20:59:35] Ep. 1 : Up. 51000 : Sen. 22,055,283 : Cost 3.05950332 : Time 590.78s : 16367.92 words/s : L.r. 1.6803e-04
[2019-08-14 21:05:09] Ep. 1 : Up. 52000 : Sen. 22,487,320 : Cost 3.05860353 : Time 333.43s : 29001.64 words/s : L.r. 1.6641e-04
[2019-08-14 21:10:40] Ep. 1 : Up. 53000 : Sen. 22,916,589 : Cost 3.05387068 : Time 331.30s : 28959.10 words/s : L.r. 1.6483e-04
[2019-08-14 21:16:13] Ep. 1 : Up. 54000 : Sen. 23,348,763 : Cost 3.04580259 : Time 332.33s : 29114.50 words/s : L.r. 1.6330e-04
[2019-08-14 21:21:44] Ep. 1 : Up. 55000 : Sen. 23,778,911 : Cost 3.04406929 : Time 331.78s : 29113.74 words/s : L.r. 1.6181e-04
[2019-08-14 21:21:44] Saving model weights and runtime parameters to /media/wangxiuwan/marian-dev/examples/transformer/back_dataset/model_zhen/model.npz.orig.npz
[2019-08-14 21:21:48] Saving model weights and runtime parameters to /media/wangxiuwan/marian-dev/examples/transformer/back_dataset/model_zhen/model.npz
[2019-08-14 21:21:51] Saving Adam parameters to /media/wangxiuwan/marian-dev/examples/transformer/back_dataset/model_zhen/model.npz.optimizer.npz
[2019-08-14 21:22:05] Saving model weights and runtime parameters to /media/wangxiuwan/marian-dev/examples/transformer/back_dataset/model_zhen/model.npz.best-ce-mean-words.npz
[2019-08-14 21:22:08] [valid] Ep. 1 : Up. 55000 : ce-mean-words : 1.83448 : new best
[2019-08-14 21:22:14] Saving model weights and runtime parameters to /media/wangxiuwan/marian-dev/examples/transformer/back_dataset/model_zhen/model.npz.best-perplexity.npz
[2019-08-14 21:22:17] [valid] Ep. 1 : Up. 55000 : perplexity : 6.26187 : new best
[2019-08-14 21:26:05] Saving model weights and runtime parameters to /media/wangxiuwan/marian-dev/examples/transformer/back_dataset/model_zhen/model.npz.best-translation.npz
[2019-08-14 21:26:09] [valid] Ep. 1 : Up. 55000 : translation : 28.47 : new best
[2019-08-14 21:29:53] Error: Error reading from file '.'
[2019-08-14 21:29:53] Error: Aborted from marian::io::InputFileStream& marian::io::getline(marian::io::InputFileStream&, std::__cxx11::string&) in /media/wangxiuwan/marian-dev/src/common/file_stream.h:216

[CALL STACK]
[0x5cd0e2]
[0x5ce1c8]
[0x5bc0cf]
[0x51e62d]
[0x51f67b]
[0x52005e]
[0x441459]
[0x7f5dff4c1a99] + 0xea99
[0x43ad72]
[0x442881]
[0x46b684]
[0x7f5e0c686678] + 0xb8678
[0x7f5dff4ba6ba] + 0x76ba
[0x7f5dfecdf41d] clone + 0x6d


@frankseide
Contributor

frankseide commented Aug 15, 2019 via email

@wangxw1023
Author

@frankseide I think what you said is very reasonable; let me try it. Can you tell me where the code that deletes the temp files is, and what change I need to make? I have searched globally in Marian's code, but I am not sure where to make the changes.

@wangxw1023
Author

@frankseide Could you tell me what the rule for detecting the line terminator is, i.e., how to check whether the last line ends in a newline character? Then I can check whether my corpus contains a sentence pair without a terminator.
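
For reference, a quick standalone check on both sides of the corpus could look like this (a throwaway C++ helper, not part of Marian; wc -l should also show the same line count for both files):

#include <fstream>
#include <iostream>

// Returns true if the file is non-empty and its last byte is '\n'.
static bool endsWithNewline(const char* path) {
  std::ifstream in(path, std::ios::binary);
  if(!in)
    return false;
  in.seekg(0, std::ios::end);
  if(in.tellg() == std::streampos(0))
    return false;               // empty file
  in.seekg(-1, std::ios::end);  // seek to the last byte
  char last = 0;
  in.get(last);
  return last == '\n';
}

int main(int argc, char** argv) {
  // e.g. ./check_newline train.bpe.zh train.bpe.en
  for(int i = 1; i < argc; ++i)
    std::cout << argv[i]
              << (endsWithNewline(argv[i]) ? ": ends" : ": does NOT end")
              << " with a newline" << std::endl;
  return 0;
}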

@frankseide
Contributor

frankseide commented Aug 15, 2019 via email

@wangxw1023
Author

@frankseide Thank you very much. You said that we may need to temporarily change the code so that it does not delete the tmp file, allowing us to inspect it. Could you tell me where to make that change? Or is it a matter of setting the training parameter:
--shuffle-in-ram Keep shuffled corpus in RAM, do not write to temp file

emjotde reopened this Aug 15, 2019
@emjotde
Member

emjotde commented Aug 15, 2019

Hi again. Interesting.
I have a couple of questions:

  • As you said, can you try using --shuffle-in-ram? That will not use the temporary file, and if the error does not occur again, we have a hint that it is indeed the temporary file.
  • Can you post your full compilation command? I cannot understand why your stack trace does not contain function names etc. That should be compiled in by default and would make error diagnosis a bit easier.
  • Can you run df -h /tmp before and during training and post the results?
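
If it is easier to log the numbers during training, the same figure can also be read programmatically; a minimal Linux-only sketch using statvfs (not part of Marian):

#include <sys/statvfs.h>
#include <cstdio>

// Print the space available to non-root users on /tmp, roughly the
// "Avail" column of df -h /tmp.
int main() {
  struct statvfs s;
  if(statvfs("/tmp", &s) != 0) {
    std::perror("statvfs");
    return 1;
  }
  double availGiB = (double)s.f_bavail * s.f_frsize / (1024.0 * 1024.0 * 1024.0);
  std::printf("/tmp available: %.1f GiB\n", availGiB);
  return 0;
}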

@snukky
Member

snukky commented Aug 15, 2019

A comment on the first question: the --sqlite option should also skip using temporary files, and use a SQLite DB file for storing and shuffling the data.

@emjotde
Member

emjotde commented Aug 15, 2019

Let's ignore --sqlite for the moment. By default the SQLite database is also created in the temporary folder, and it is larger than the raw text, so if there are some weird hidden space problems, that might not really help.

@wangxw1023
Author

wangxw1023 commented Aug 15, 2019

@emjotde @snukky Thank you both for your replies.
I have restarted the training with --shuffle-in-ram. It has been running normally for three hours; I'm not sure whether the error will appear again.
About emjotde's two other questions:

  • compilation command:

mkdir build
cd build
cmake .. -DCUDA_TOOLKIT_ROOT_DIR=/usr/local/cuda-9.2 \
  -DOPENSSL_ROOT_DIR=/usr/local/ssl -DOPENSSL_LIBRARIES=/usr/local/ssl/lib \
  -DBOOST_ROOT=/media/wangxiuwan/boost_1_65_1
make -j

  • df -h /tmp: the output before and during training is the same.

(base) work@dbcloud-Super-Server:~$ df -h /tmp
Filesystem Size Used Avail Use% Mounted on
/dev/sda3 3.5T 2.5T 822G 76% /

@emjotde
Member

emjotde commented Aug 15, 2019

OK, this is a lot of space. So it should not be a space problem.
Can you please add -DCMAKE_BUILD_TYPE=Release to your cmake command like this:

cmake .. -DCMAKE_BUILD_TYPE=Release -DCUDA_TOOLKIT_ROOT_DIR=/usr/local/cuda-9.2 \
  -DOPENSSL_ROOT_DIR=/usr/local/ssl -DOPENSSL_LIBRARIES=/usr/local/ssl/lib \
  -DBOOST_ROOT=/media/wangxiuwan/boost_1_65_1

We made this the default yesterday, but you might not have that version yet. This will compile with function names and the stack trace should be more informative.

@kpu
Member

kpu commented Aug 15, 2019

In any case, running out of space should have triggered ENOSPC on write, assuming proper error checking.
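
For illustration, that kind of checking with C stdio would look roughly like this (a generic sketch, not the actual marian-dev code):

#include <cerrno>
#include <cstdio>
#include <cstring>

// Write a buffer and surface I/O errors instead of ignoring them; on a
// full disk, fwrite/fflush fail and errno is set to ENOSPC.
static bool writeAll(std::FILE* f, const char* buf, std::size_t n) {
  if(std::fwrite(buf, 1, n, f) != n || std::fflush(f) != 0) {
    std::fprintf(stderr, "write failed: %s\n", std::strerror(errno));
    return false;
  }
  return true;
}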

@emjotde
Member

emjotde commented Aug 15, 2019

"assuming proper error checking", well, that's a strong assumption :)

@emjotde
Member

emjotde commented Aug 15, 2019

One bug is that we do not set the file path in the input stream when handing in a temporary file. That at least explains why the error message says '.' (the default Pathie path) instead of a proper temporary filename from tempnam. So this seems to make Frank's ideas more likely.

@emjotde
Member

emjotde commented Aug 15, 2019

I'll add an option later today to keep temporary files and fix the name issue. Will let you know when it's ready to try.

@emjotde
Member

emjotde commented Aug 15, 2019

Branch tempfile now has an option --keep-temp, which keeps the temporary files inside the folder instead of unlinking them. I also fixed the name handling, so the error should now tell you which of the temporary files failed. We should probably also add a line counter and include it in the error message.
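
For the line counter, roughly something like this would do; the names are hypothetical, not the actual marian-dev API:

#include <cstddef>
#include <istream>
#include <stdexcept>
#include <string>

// Sketch of a getline wrapper that carries the file name and a line
// counter, so a read failure can report which file broke and where.
struct CountingLineReader {
  std::istream& in;
  std::string path;  // the real temp file name, instead of the default '.'
  std::size_t line = 0;

  bool getline(std::string& out) {
    if(std::getline(in, out)) {
      ++line;
      return true;
    }
    if(in.bad())  // genuine I/O error, as opposed to a clean EOF
      throw std::runtime_error("Error reading from file '" + path
                               + "' near line " + std::to_string(line));
    return false;  // end of file
  }
};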

@wangxw1023
Author

@emjotde Hi, so far the training is still going on and no error has been reported. It seems that the error really is related to the temp files. When the training has completed, I will try the cmake change (adding -DCMAKE_BUILD_TYPE=Release) and the tempfile branch, and I will post the results.

@wangxw1023
Author

Hi, the training is still going on, but I find it a bit strange: the training has only reached epoch 2, and the BLEU score has stopped rising. I planned to train for 7 days, but it has only been two days. Can you give me some advice on possible reasons?
[screenshot: validation BLEU curve]

train.log

@emjotde
Member

emjotde commented Sep 4, 2019

Looks quite normal to me. That's a lot of iterations; I would not expect the score to improve. Do you have reason to believe that the results are bad?

wangxw1023 reopened this Sep 7, 2019
@wangxw1023
Author

wangxw1023 commented Sep 7, 2019

@emjotde
Hi emjotde,
I'm sorry to tell you that our server hard drive broke some time ago and we lost a lot of corpora, including the corpus related to this issue. So the work with the tempfile branch can only start once we have regenerated the corpus, which may take a long time. When there is a result, I will upload it.

About the training log: we trained Transformer models with both Marian and tensor2tensor, using the same corpus. However, the maximum BLEU with Marian is 32, while the maximum BLEU with tensor2tensor is 41, so I believe the Marian training results are bad.

The following is the training curve of tensor2tensor.

[screenshot: tensor2tensor training curve]

@emjotde
Member

emjotde commented Jan 6, 2020

Closing this now due to inactivity (ours). Feel free to reopen. Usually we have no problems matching T2T performance; no idea where that gap would come from. On the other hand, there were a few bugs in the marian-dev code around that time. Maybe that has resolved itself.
