First of all, thank you very much for your work.
When I train x4 super-resolution with your code, training runs for a while and then crashes with "Bus error (core dumped)". When I run "python -X faulthandler train.py -opt ./confs/SRFlow_DF2K_4X.yml",
it outputs:
"
21-02-15 21:58:11.131 - INFO: Model [SRFlowModel] is created.
21-02-15 21:58:11.131 - INFO: Resuming training from epoch: 2, iter: 51000.
21-02-15 21:58:11.450 - INFO: Start training from epoch: 2, iter: 51000
/run/media/root/7de46b27-ca07-4d98-8955-0d77387c5764/test/SRFlow/myenv/lib/python3.7/site-packages/torch/nn/functional.py:3103: UserWarning: The default behavior for interpolate/upsample with float scale_factor changed in 1.6.0 to align with other frameworks/libraries, and now uses scale_factor directly, instead of relying on the computed output size. If you wish to restore the old behavior, please set recompute_scale_factor=True. See the documentation of nn.Upsample for details.
warnings.warn("The default behavior for interpolate/upsample with float scale_factor changed "
<epoch: 2, iter: 51,001, lr:2.500e-04, t:-1.00e+00, td:9.45e-01, eta:-4.14e+01, nll:-1.566e+01>
<epoch: 2, iter: 51,002, lr:2.500e-04, t:-1.00e+00, td:8.32e-04, eta:-4.14e+01, nll:-1.597e+01>
<epoch: 2, iter: 51,003, lr:2.500e-04, t:1.94e+00, td:2.45e-03, eta:8.01e+01, nll:-1.660e+01>
<epoch: 2, iter: 51,004, lr:2.500e-04, t:1.78e+00, td:3.97e-03, eta:7.37e+01, nll:-1.757e+01>
<epoch: 2, iter: 51,005, lr:2.500e-04, t:1.77e+00, td:8.54e-04, eta:7.32e+01, nll:-1.686e+01>
<epoch: 2, iter: 51,006, lr:2.500e-04, t:2.06e+00, td:6.81e-04, eta:8.52e+01, nll:-1.774e+01>
<epoch: 2, iter: 51,007, lr:2.500e-04, t:1.71e+00, td:1.89e-03, eta:7.06e+01, nll:-1.683e+01>
<epoch: 2, iter: 51,008, lr:2.500e-04, t:1.93e+00, td:2.01e-03, eta:7.98e+01, nll:-1.652e+01>
<epoch: 2, iter: 51,009, lr:2.500e-04, t:1.97e+00, td:2.18e-03, eta:8.16e+01, nll:-1.687e+01>
<epoch: 2, iter: 51,010, lr:2.500e-04, t:1.87e+00, td:2.10e-03, eta:7.72e+01, nll:-1.748e+01>
<epoch: 2, iter: 51,011, lr:2.500e-04, t:1.78e+00, td:3.10e-03, eta:7.36e+01, nll:-1.672e+01>
<epoch: 2, iter: 51,012, lr:2.500e-04, t:2.06e+00, td:3.12e-03, eta:8.51e+01, nll:-1.859e+01>
<epoch: 2, iter: 51,013, lr:2.500e-04, t:1.83e+00, td:2.23e-03, eta:7.57e+01, nll:-1.672e+01>
<epoch: 2, iter: 51,014, lr:2.500e-04, t:1.81e+00, td:2.39e-03, eta:7.50e+01, nll:-1.772e+01>
<epoch: 2, iter: 51,015, lr:2.500e-04, t:1.84e+00, td:1.94e-03, eta:7.60e+01, nll:-1.877e+01>
<epoch: 2, iter: 51,016, lr:2.500e-04, t:1.73e+00, td:3.45e-03, eta:7.17e+01, nll:-1.696e+01>
<epoch: 2, iter: 51,017, lr:2.500e-04, t:1.84e+00, td:2.32e-03, eta:7.62e+01, nll:-1.874e+01>
<epoch: 2, iter: 51,018, lr:2.500e-04, t:2.22e+00, td:2.27e-03, eta:9.18e+01, nll:-1.709e+01>
<epoch: 2, iter: 51,019, lr:2.500e-04, t:1.90e+00, td:1.72e-03, eta:7.87e+01, nll:-1.638e+01>
<epoch: 2, iter: 51,020, lr:2.500e-04, t:1.77e+00, td:2.30e-03, eta:7.31e+01, nll:-1.529e+01>
<epoch: 2, iter: 51,021, lr:2.500e-04, t:1.86e+00, td:3.02e-03, eta:7.70e+01, nll:-1.642e+01>
<epoch: 2, iter: 51,022, lr:2.500e-04, t:1.81e+00, td:2.15e-03, eta:7.48e+01, nll:-1.789e+01>
<epoch: 2, iter: 51,023, lr:2.500e-04, t:1.85e+00, td:2.35e-03, eta:7.65e+01, nll:-1.866e+01>
<epoch: 2, iter: 51,024, lr:2.500e-04, t:1.83e+00, td:2.18e-03, eta:7.57e+01, nll:-1.676e+01>
<epoch: 2, iter: 51,100, lr:2.500e-04, t:1.88e+00, td:2.37e-03, eta:7.78e+01, nll:-1.536e+01>
<epoch: 2, iter: 51,200, lr:2.500e-04, t:1.90e+00, td:2.51e-03, eta:7.86e+01, nll:-1.572e+01>
<epoch: 2, iter: 51,300, lr:2.500e-04, t:1.88e+00, td:2.45e-03, eta:7.75e+01, nll:-1.708e+01>
<epoch: 2, iter: 51,400, lr:2.500e-04, t:1.86e+00, td:2.42e-03, eta:7.68e+01, nll:-1.943e+01>
<epoch: 2, iter: 51,500, lr:2.500e-04, t:1.88e+00, td:2.47e-03, eta:7.76e+01, nll:-1.640e+01>
<epoch: 2, iter: 51,600, lr:2.500e-04, t:1.87e+00, td:2.39e-03, eta:7.71e+01, nll:-1.571e+01>
<epoch: 2, iter: 51,700, lr:2.500e-04, t:1.88e+00, td:2.43e-03, eta:7.74e+01, nll:-1.633e+01>
<epoch: 2, iter: 51,800, lr:2.500e-04, t:1.88e+00, td:2.47e-03, eta:7.73e+01, nll:-1.499e+01>
<epoch: 2, iter: 51,900, lr:2.500e-04, t:1.87e+00, td:2.43e-03, eta:7.71e+01, nll:-1.538e+01>
<epoch: 2, iter: 52,000, lr:2.500e-04, t:1.87e+00, td:2.40e-03, eta:7.70e+01, nll:-1.629e+01>
21-02-15 22:29:40.137 - INFO: Saving models and training states.
<epoch: 2, iter: 52,100, lr:2.500e-04, t:1.90e+00, td:2.42e-03, eta:7.79e+01, nll:-1.673e+01>
<epoch: 2, iter: 52,200, lr:2.500e-04, t:1.89e+00, td:2.46e-03, eta:7.77e+01, nll:-1.898e+01>
<epoch: 2, iter: 52,300, lr:2.500e-04, t:1.89e+00, td:2.43e-03, eta:7.77e+01, nll:-1.815e+01>
<epoch: 2, iter: 52,400, lr:2.500e-04, t:1.86e+00, td:2.44e-03, eta:7.64e+01, nll:-1.801e+01>
<epoch: 2, iter: 52,500, lr:2.500e-04, t:1.89e+00, td:2.48e-03, eta:7.76e+01, nll:-1.746e+01>
<epoch: 2, iter: 52,600, lr:2.500e-04, t:1.88e+00, td:2.53e-03, eta:7.70e+01, nll:-1.614e+01>
<epoch: 2, iter: 52,700, lr:2.500e-04, t:1.87e+00, td:2.44e-03, eta:7.66e+01, nll:-1.496e+01>
<epoch: 2, iter: 52,800, lr:2.500e-04, t:1.89e+00, td:2.44e-03, eta:7.71e+01, nll:-1.682e+01>
<epoch: 2, iter: 52,900, lr:2.500e-04, t:1.87e+00, td:2.48e-03, eta:7.66e+01, nll:-1.676e+01>
<epoch: 2, iter: 53,000, lr:2.500e-04, t:1.87e+00, td:2.42e-03, eta:7.62e+01, nll:-1.719e+01>
21-02-15 23:01:01.845 - INFO: Saving models and training states.
<epoch: 2, iter: 53,100, lr:2.500e-04, t:1.87e+00, td:2.37e-03, eta:7.62e+01, nll:-1.640e+01>
<epoch: 2, iter: 53,200, lr:2.500e-04, t:1.87e+00, td:2.41e-03, eta:7.64e+01, nll:-1.765e+01>
<epoch: 2, iter: 53,300, lr:2.500e-04, t:1.89e+00, td:2.45e-03, eta:7.69e+01, nll:-1.725e+01>
<epoch: 2, iter: 53,400, lr:2.500e-04, t:1.89e+00, td:2.45e-03, eta:7.70e+01, nll:-1.702e+01>
<epoch: 2, iter: 53,500, lr:2.500e-04, t:1.88e+00, td:2.43e-03, eta:7.64e+01, nll:-1.803e+01>
<epoch: 2, iter: 53,600, lr:2.500e-04, t:1.88e+00, td:2.42e-03, eta:7.65e+01, nll:-1.760e+01>
<epoch: 2, iter: 53,700, lr:2.500e-04, t:1.86e+00, td:2.40e-03, eta:7.57e+01, nll:-1.747e+01>
<epoch: 2, iter: 53,800, lr:2.500e-04, t:1.89e+00, td:2.43e-03, eta:7.66e+01, nll:-2.144e+01>
<epoch: 2, iter: 53,900, lr:2.500e-04, t:1.90e+00, td:2.43e-03, eta:7.72e+01, nll:-1.826e+01>
<epoch: 2, iter: 54,000, lr:2.500e-04, t:1.88e+00, td:2.40e-03, eta:7.64e+01, nll:-1.700e+01>
21-02-15 23:32:23.089 - INFO: Saving models and training states.
<epoch: 2, iter: 54,100, lr:2.500e-04, t:1.90e+00, td:2.55e-03, eta:7.70e+01, nll:-1.809e+01>
<epoch: 2, iter: 54,200, lr:2.500e-04, t:1.89e+00, td:2.46e-03, eta:7.64e+01, nll:-1.832e+01>
<epoch: 2, iter: 54,300, lr:2.500e-04, t:1.90e+00, td:2.47e-03, eta:7.67e+01, nll:-1.641e+01>
<epoch: 2, iter: 54,400, lr:2.500e-04, t:1.89e+00, td:2.47e-03, eta:7.63e+01, nll:-1.669e+01>
<epoch: 2, iter: 54,500, lr:2.500e-04, t:1.88e+00, td:2.43e-03, eta:7.60e+01, nll:-1.491e+01>
<epoch: 2, iter: 54,600, lr:2.500e-04, t:1.88e+00, td:2.46e-03, eta:7.58e+01, nll:-1.798e+01>
<epoch: 2, iter: 54,700, lr:2.500e-04, t:1.90e+00, td:2.41e-03, eta:7.65e+01, nll:-1.596e+01>
<epoch: 2, iter: 54,800, lr:2.500e-04, t:1.88e+00, td:2.53e-03, eta:7.59e+01, nll:-1.580e+01>
<epoch: 2, iter: 54,900, lr:2.500e-04, t:1.88e+00, td:2.40e-03, eta:7.58e+01, nll:-1.713e+01>
<epoch: 2, iter: 55,000, lr:2.500e-04, t:1.89e+00, td:2.43e-03, eta:7.62e+01, nll:-1.874e+01>
21-02-16 00:03:50.708 - INFO: Saving models and training states.
<epoch: 2, iter: 55,100, lr:2.500e-04, t:1.89e+00, td:2.44e-03, eta:7.62e+01, nll:-1.506e+01>
<epoch: 2, iter: 55,200, lr:2.500e-04, t:1.88e+00, td:2.39e-03, eta:7.55e+01, nll:-1.786e+01>
<epoch: 2, iter: 55,300, lr:2.500e-04, t:1.88e+00, td:2.47e-03, eta:7.58e+01, nll:-1.834e+01>
<epoch: 2, iter: 55,400, lr:2.500e-04, t:1.88e+00, td:2.39e-03, eta:7.54e+01, nll:-1.841e+01>
<epoch: 2, iter: 55,500, lr:2.500e-04, t:1.89e+00, td:2.41e-03, eta:7.58e+01, nll:-1.820e+01>
<epoch: 2, iter: 55,600, lr:2.500e-04, t:1.87e+00, td:2.46e-03, eta:7.51e+01, nll:-1.633e+01>
<epoch: 2, iter: 55,700, lr:2.500e-04, t:1.89e+00, td:2.44e-03, eta:7.57e+01, nll:-1.660e+01>
<epoch: 2, iter: 55,800, lr:2.500e-04, t:1.89e+00, td:2.44e-03, eta:7.59e+01, nll:-1.856e+01>
<epoch: 2, iter: 55,900, lr:2.500e-04, t:1.90e+00, td:2.45e-03, eta:7.59e+01, nll:-1.613e+01>
CUBLAS error: out of memory (3) in magma_sgetrf_gpu_expert at /opt/conda/conda-bld/magma-cuda102_1583546904148/work/src/sgetrf_gpu.cpp:126
CUBLAS error: not initialized (1) in magma_sgetrf_gpu_expert at /opt/conda/conda-bld/magma-cuda102_1583546904148/work/src/sgetrf_gpu.cpp:126
Skipping ERROR caught in nll = model.optimize_parameters(current_step):
Caught RuntimeError in replica 1 on device 1.
Original Traceback (most recent call last):
File "/run/media/root/7de46b27-ca07-4d98-8955-0d77387c5764/test/SRFlow/myenv/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 61, in _worker
output = module(*input, **kwargs)
File "/run/media/root/7de46b27-ca07-4d98-8955-0d77387c5764/test/SRFlow/myenv/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/run/media/root/7de46b27-ca07-4d98-8955-0d77387c5764/test/SRFlow/code/models/modules/SRFlowNet_arch.py", line 65, in forward
y_onehot=y_label)
File "/run/media/root/7de46b27-ca07-4d98-8955-0d77387c5764/test/SRFlow/code/models/modules/SRFlowNet_arch.py", line 101, in normal_flow
y_onehot=y_onehot)
File "/run/media/root/7de46b27-ca07-4d98-8955-0d77387c5764/test/SRFlow/myenv/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/run/media/root/7de46b27-ca07-4d98-8955-0d77387c5764/test/SRFlow/code/models/modules/FlowUpsamplerNet.py", line 213, in forward
z, logdet = self.encode(gt, rrdbResults, logdet=logdet, epses=epses, y_onehot=y_onehot)
File "/run/media/root/7de46b27-ca07-4d98-8955-0d77387c5764/test/SRFlow/code/models/modules/FlowUpsamplerNet.py", line 238, in encode
fl_fea, logdet = layer(fl_fea, logdet, reverse=reverse, rrdbResults=level_conditionals[level])
File "/run/media/root/7de46b27-ca07-4d98-8955-0d77387c5764/test/SRFlow/myenv/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/run/media/root/7de46b27-ca07-4d98-8955-0d77387c5764/test/SRFlow/code/models/modules/FlowStep.py", line 84, in forward
return self.normal_flow(input, logdet, rrdbResults)
File "/run/media/root/7de46b27-ca07-4d98-8955-0d77387c5764/test/SRFlow/code/models/modules/FlowStep.py", line 103, in normal_flow
self, z, logdet, False)
File "/run/media/root/7de46b27-ca07-4d98-8955-0d77387c5764/test/SRFlow/code/models/modules/FlowStep.py", line 35, in
"invconv": lambda obj, z, logdet, rev: obj.invconv(z, logdet, rev),
File "/run/media/root/7de46b27-ca07-4d98-8955-0d77387c5764/test/SRFlow/myenv/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/run/media/root/7de46b27-ca07-4d98-8955-0d77387c5764/test/SRFlow/code/models/modules/Permutations.py", line 48, in forward
weight, dlogdet = self.get_weight(input, reverse)
File "/run/media/root/7de46b27-ca07-4d98-8955-0d77387c5764/test/SRFlow/code/models/modules/Permutations.py", line 37, in get_weight
dlogdet = torch.slogdet(self.weight)[1] * pixels
RuntimeError: CUDA error: resource already mapped
<epoch: 2, iter: 56,000, lr:2.500e-04, t:1.86e+00, td:2.41e-03, eta:7.44e+01, nll:-1.589e+01>
21-02-16 00:35:13.687 - INFO: Saving models and training states.
<epoch: 2, iter: 56,100, lr:2.500e-04, t:1.88e+00, td:2.41e-03, eta:7.51e+01, nll:-1.545e+01>
<epoch: 2, iter: 56,200, lr:2.500e-04, t:1.87e+00, td:2.44e-03, eta:7.48e+01, nll:-1.524e+01>
<epoch: 2, iter: 56,300, lr:2.500e-04, t:1.88e+00, td:2.50e-03, eta:7.49e+01, nll:-1.727e+01>
<epoch: 2, iter: 56,400, lr:2.500e-04, t:1.85e+00, td:2.40e-03, eta:7.40e+01, nll:-1.717e+01>
<epoch: 2, iter: 56,500, lr:2.500e-04, t:1.88e+00, td:2.48e-03, eta:7.48e+01, nll:-1.548e+01>
<epoch: 2, iter: 56,600, lr:2.500e-04, t:1.86e+00, td:2.48e-03, eta:7.42e+01, nll:-1.752e+01>
<epoch: 2, iter: 56,700, lr:2.500e-04, t:1.88e+00, td:2.48e-03, eta:7.47e+01, nll:-1.669e+01>
<epoch: 2, iter: 56,800, lr:2.500e-04, t:1.86e+00, td:2.43e-03, eta:7.40e+01, nll:-1.632e+01>
<epoch: 2, iter: 56,900, lr:2.500e-04, t:1.86e+00, td:2.44e-03, eta:7.40e+01, nll:-1.778e+01>
<epoch: 2, iter: 57,000, lr:2.500e-04, t:1.88e+00, td:2.45e-03, eta:7.47e+01, nll:-1.696e+01>
21-02-16 01:06:23.673 - INFO: Saving models and training states.
<epoch: 2, iter: 57,100, lr:2.500e-04, t:1.90e+00, td:2.48e-03, eta:7.54e+01, nll:-1.575e+01>
<epoch: 2, iter: 57,200, lr:2.500e-04, t:1.86e+00, td:2.44e-03, eta:7.39e+01, nll:-1.667e+01>
<epoch: 2, iter: 57,300, lr:2.500e-04, t:1.88e+00, td:2.47e-03, eta:7.47e+01, nll:-1.871e+01>
<epoch: 2, iter: 57,400, lr:2.500e-04, t:1.88e+00, td:2.50e-03, eta:7.44e+01, nll:-1.781e+01>
<epoch: 2, iter: 57,500, lr:2.500e-04, t:1.86e+00, td:2.43e-03, eta:7.36e+01, nll:-1.881e+01>
<epoch: 2, iter: 57,600, lr:2.500e-04, t:1.86e+00, td:2.46e-03, eta:7.38e+01, nll:-1.742e+01>
<epoch: 2, iter: 57,700, lr:2.500e-04, t:1.87e+00, td:2.43e-03, eta:7.38e+01, nll:-1.726e+01>
<epoch: 2, iter: 57,800, lr:2.500e-04, t:1.86e+00, td:2.42e-03, eta:7.34e+01, nll:-1.844e+01>
<epoch: 2, iter: 57,900, lr:2.500e-04, t:1.87e+00, td:2.44e-03, eta:7.36e+01, nll:-1.622e+01>
<epoch: 2, iter: 58,000, lr:2.500e-04, t:1.87e+00, td:2.46e-03, eta:7.38e+01, nll:-1.635e+01>
21-02-16 01:37:34.238 - INFO: Saving models and training states.
<epoch: 2, iter: 58,100, lr:2.500e-04, t:1.89e+00, td:2.43e-03, eta:7.44e+01, nll:-1.692e+01>
<epoch: 2, iter: 58,200, lr:2.500e-04, t:1.84e+00, td:2.40e-03, eta:7.25e+01, nll:-1.594e+01>
<epoch: 2, iter: 58,300, lr:2.500e-04, t:1.87e+00, td:2.43e-03, eta:7.36e+01, nll:-1.747e+01>
<epoch: 2, iter: 58,400, lr:2.500e-04, t:1.87e+00, td:2.49e-03, eta:7.34e+01, nll:-1.949e+01>
<epoch: 2, iter: 58,500, lr:2.500e-04, t:1.86e+00, td:2.46e-03, eta:7.32e+01, nll:-1.595e+01>
<epoch: 2, iter: 58,600, lr:2.500e-04, t:1.86e+00, td:2.44e-03, eta:7.32e+01, nll:-1.600e+01>
<epoch: 2, iter: 58,700, lr:2.500e-04, t:1.88e+00, td:2.47e-03, eta:7.39e+01, nll:-1.668e+01>
<epoch: 2, iter: 58,800, lr:2.500e-04, t:1.87e+00, td:2.46e-03, eta:7.33e+01, nll:-1.868e+01>
<epoch: 2, iter: 58,900, lr:2.500e-04, t:1.86e+00, td:2.46e-03, eta:7.28e+01, nll:-1.802e+01>
<epoch: 2, iter: 59,000, lr:2.500e-04, t:1.86e+00, td:2.47e-03, eta:7.27e+01, nll:-1.569e+01>
21-02-16 02:08:39.673 - INFO: Saving models and training states.
<epoch: 2, iter: 59,100, lr:2.500e-04, t:1.87e+00, td:2.42e-03, eta:7.34e+01, nll:-1.721e+01>
<epoch: 2, iter: 59,200, lr:2.500e-04, t:1.84e+00, td:2.39e-03, eta:7.21e+01, nll:-1.866e+01>
<epoch: 2, iter: 59,300, lr:2.500e-04, t:1.85e+00, td:2.47e-03, eta:7.22e+01, nll:-1.685e+01>
<epoch: 2, iter: 59,400, lr:2.500e-04, t:1.88e+00, td:2.45e-03, eta:7.35e+01, nll:-1.809e+01>
<epoch: 2, iter: 59,500, lr:2.500e-04, t:1.88e+00, td:2.42e-03, eta:7.35e+01, nll:-1.618e+01>
Fatal Python error: Bus error
Thread 0x00007f2e50c69700 (most recent call first):
File "/run/media/root/7de46b27-ca07-4d98-8955-0d77387c5764/test/SRFlow/code/models/modules/Permutations.py", line 37 in get_weight
File "/run/media/root/7de46b27-ca07-4d98-8955-0d77387c5764/test/SRFlow/code/models/modules/Permutations.py", line 48 in fFatal Python error: orwarSegmentation faultd
File "/run/meSegmentation fault (core dumped)"
(myenv) (python37) [root@master code]#
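A side note: the UserWarning about interpolate/upsample near the top of the log seems unrelated to the crash. If I read the PyTorch 1.6 change correctly, it can be silenced by passing recompute_scale_factor=True explicitly (a minimal sketch of my own, not the repo's code):

```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 3, 40, 40)  # dummy LR batch (N, C, H, W)
# recompute_scale_factor=True restores the pre-1.6 behavior (the scale
# is recomputed from the computed output size) and silences the warning.
y = F.interpolate(x, scale_factor=4.0, mode='bicubic',
                  align_corners=False, recompute_scale_factor=True)
print(y.shape)  # torch.Size([1, 3, 160, 160])
```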
What should I do?
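In case it helps to narrow this down: the earlier CUBLAS/MAGMA errors and the final faulthandler frames both point at the torch.slogdet(self.weight) call in Permutations.py (the invertible 1x1 convolution). Since the weight is only a small C x C matrix, one workaround I am considering (an untested sketch of my own, not the repo's code) is to compute the log-determinant on the CPU so the MAGMA sgetrf GPU path is never invoked:

```python
import torch

def cpu_slogdet_logabs(weight: torch.Tensor) -> torch.Tensor:
    # .cpu() and .double() are differentiable, so autograd still flows
    # through the log-determinant term of the NLL; only the tiny CxC
    # factorization moves off the GPU.
    logabsdet = torch.slogdet(weight.cpu().double())[1]
    return logabsdet.to(device=weight.device, dtype=weight.dtype)

# Hypothetical call-site change in get_weight():
#   dlogdet = cpu_slogdet_logabs(self.weight) * pixels
```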