Bus error (core dumped) #19

Open
18813185122 opened this issue Feb 16, 2021 · 0 comments
First of all, thank you very much for your work.
When I train x4 super-resolution with your code, it crashes with "Bus error (core dumped)" after a period of training. When I run

python -X faulthandler train.py -opt ./confs/SRFlow_DF2K_4X.yml

it outputs:
"
21-02-15 21:58:11.131 - INFO: Model [SRFlowModel] is created.
21-02-15 21:58:11.131 - INFO: Resuming training from epoch: 2, iter: 51000.
21-02-15 21:58:11.450 - INFO: Start training from epoch: 2, iter: 51000
/run/media/root/7de46b27-ca07-4d98-8955-0d77387c5764/test/SRFlow/myenv/lib/python3.7/site-packages/torch/nn/functional.py:3103: UserWarning: The default behavior for interpolate/upsample with float scale_factor changed in 1.6.0 to align with other frameworks/libraries, and now uses scale_factor directly, instead of relying on the computed output size. If you wish to restore the old behavior, please set recompute_scale_factor=True. See the documentation of nn.Upsample for details.
warnings.warn("The default behavior for interpolate/upsample with float scale_factor changed "
<epoch: 2, iter: 51,001, lr:2.500e-04, t:-1.00e+00, td:9.45e-01, eta:-4.14e+01, nll:-1.566e+01>
<epoch: 2, iter: 51,002, lr:2.500e-04, t:-1.00e+00, td:8.32e-04, eta:-4.14e+01, nll:-1.597e+01>
<epoch: 2, iter: 51,003, lr:2.500e-04, t:1.94e+00, td:2.45e-03, eta:8.01e+01, nll:-1.660e+01>
<epoch: 2, iter: 51,004, lr:2.500e-04, t:1.78e+00, td:3.97e-03, eta:7.37e+01, nll:-1.757e+01>
<epoch: 2, iter: 51,005, lr:2.500e-04, t:1.77e+00, td:8.54e-04, eta:7.32e+01, nll:-1.686e+01>
<epoch: 2, iter: 51,006, lr:2.500e-04, t:2.06e+00, td:6.81e-04, eta:8.52e+01, nll:-1.774e+01>
<epoch: 2, iter: 51,007, lr:2.500e-04, t:1.71e+00, td:1.89e-03, eta:7.06e+01, nll:-1.683e+01>
<epoch: 2, iter: 51,008, lr:2.500e-04, t:1.93e+00, td:2.01e-03, eta:7.98e+01, nll:-1.652e+01>
<epoch: 2, iter: 51,009, lr:2.500e-04, t:1.97e+00, td:2.18e-03, eta:8.16e+01, nll:-1.687e+01>
<epoch: 2, iter: 51,010, lr:2.500e-04, t:1.87e+00, td:2.10e-03, eta:7.72e+01, nll:-1.748e+01>
<epoch: 2, iter: 51,011, lr:2.500e-04, t:1.78e+00, td:3.10e-03, eta:7.36e+01, nll:-1.672e+01>
<epoch: 2, iter: 51,012, lr:2.500e-04, t:2.06e+00, td:3.12e-03, eta:8.51e+01, nll:-1.859e+01>
<epoch: 2, iter: 51,013, lr:2.500e-04, t:1.83e+00, td:2.23e-03, eta:7.57e+01, nll:-1.672e+01>
<epoch: 2, iter: 51,014, lr:2.500e-04, t:1.81e+00, td:2.39e-03, eta:7.50e+01, nll:-1.772e+01>
<epoch: 2, iter: 51,015, lr:2.500e-04, t:1.84e+00, td:1.94e-03, eta:7.60e+01, nll:-1.877e+01>
<epoch: 2, iter: 51,016, lr:2.500e-04, t:1.73e+00, td:3.45e-03, eta:7.17e+01, nll:-1.696e+01>
<epoch: 2, iter: 51,017, lr:2.500e-04, t:1.84e+00, td:2.32e-03, eta:7.62e+01, nll:-1.874e+01>
<epoch: 2, iter: 51,018, lr:2.500e-04, t:2.22e+00, td:2.27e-03, eta:9.18e+01, nll:-1.709e+01>
<epoch: 2, iter: 51,019, lr:2.500e-04, t:1.90e+00, td:1.72e-03, eta:7.87e+01, nll:-1.638e+01>
<epoch: 2, iter: 51,020, lr:2.500e-04, t:1.77e+00, td:2.30e-03, eta:7.31e+01, nll:-1.529e+01>
<epoch: 2, iter: 51,021, lr:2.500e-04, t:1.86e+00, td:3.02e-03, eta:7.70e+01, nll:-1.642e+01>
<epoch: 2, iter: 51,022, lr:2.500e-04, t:1.81e+00, td:2.15e-03, eta:7.48e+01, nll:-1.789e+01>
<epoch: 2, iter: 51,023, lr:2.500e-04, t:1.85e+00, td:2.35e-03, eta:7.65e+01, nll:-1.866e+01>
<epoch: 2, iter: 51,024, lr:2.500e-04, t:1.83e+00, td:2.18e-03, eta:7.57e+01, nll:-1.676e+01>
<epoch: 2, iter: 51,100, lr:2.500e-04, t:1.88e+00, td:2.37e-03, eta:7.78e+01, nll:-1.536e+01>
<epoch: 2, iter: 51,200, lr:2.500e-04, t:1.90e+00, td:2.51e-03, eta:7.86e+01, nll:-1.572e+01>
<epoch: 2, iter: 51,300, lr:2.500e-04, t:1.88e+00, td:2.45e-03, eta:7.75e+01, nll:-1.708e+01>
<epoch: 2, iter: 51,400, lr:2.500e-04, t:1.86e+00, td:2.42e-03, eta:7.68e+01, nll:-1.943e+01>
<epoch: 2, iter: 51,500, lr:2.500e-04, t:1.88e+00, td:2.47e-03, eta:7.76e+01, nll:-1.640e+01>
<epoch: 2, iter: 51,600, lr:2.500e-04, t:1.87e+00, td:2.39e-03, eta:7.71e+01, nll:-1.571e+01>
<epoch: 2, iter: 51,700, lr:2.500e-04, t:1.88e+00, td:2.43e-03, eta:7.74e+01, nll:-1.633e+01>
<epoch: 2, iter: 51,800, lr:2.500e-04, t:1.88e+00, td:2.47e-03, eta:7.73e+01, nll:-1.499e+01>
<epoch: 2, iter: 51,900, lr:2.500e-04, t:1.87e+00, td:2.43e-03, eta:7.71e+01, nll:-1.538e+01>
<epoch: 2, iter: 52,000, lr:2.500e-04, t:1.87e+00, td:2.40e-03, eta:7.70e+01, nll:-1.629e+01>
21-02-15 22:29:40.137 - INFO: Saving models and training states.
<epoch: 2, iter: 52,100, lr:2.500e-04, t:1.90e+00, td:2.42e-03, eta:7.79e+01, nll:-1.673e+01>
<epoch: 2, iter: 52,200, lr:2.500e-04, t:1.89e+00, td:2.46e-03, eta:7.77e+01, nll:-1.898e+01>
<epoch: 2, iter: 52,300, lr:2.500e-04, t:1.89e+00, td:2.43e-03, eta:7.77e+01, nll:-1.815e+01>
<epoch: 2, iter: 52,400, lr:2.500e-04, t:1.86e+00, td:2.44e-03, eta:7.64e+01, nll:-1.801e+01>
<epoch: 2, iter: 52,500, lr:2.500e-04, t:1.89e+00, td:2.48e-03, eta:7.76e+01, nll:-1.746e+01>
<epoch: 2, iter: 52,600, lr:2.500e-04, t:1.88e+00, td:2.53e-03, eta:7.70e+01, nll:-1.614e+01>
<epoch: 2, iter: 52,700, lr:2.500e-04, t:1.87e+00, td:2.44e-03, eta:7.66e+01, nll:-1.496e+01>
<epoch: 2, iter: 52,800, lr:2.500e-04, t:1.89e+00, td:2.44e-03, eta:7.71e+01, nll:-1.682e+01>
<epoch: 2, iter: 52,900, lr:2.500e-04, t:1.87e+00, td:2.48e-03, eta:7.66e+01, nll:-1.676e+01>
<epoch: 2, iter: 53,000, lr:2.500e-04, t:1.87e+00, td:2.42e-03, eta:7.62e+01, nll:-1.719e+01>
21-02-15 23:01:01.845 - INFO: Saving models and training states.
<epoch: 2, iter: 53,100, lr:2.500e-04, t:1.87e+00, td:2.37e-03, eta:7.62e+01, nll:-1.640e+01>
<epoch: 2, iter: 53,200, lr:2.500e-04, t:1.87e+00, td:2.41e-03, eta:7.64e+01, nll:-1.765e+01>
<epoch: 2, iter: 53,300, lr:2.500e-04, t:1.89e+00, td:2.45e-03, eta:7.69e+01, nll:-1.725e+01>
<epoch: 2, iter: 53,400, lr:2.500e-04, t:1.89e+00, td:2.45e-03, eta:7.70e+01, nll:-1.702e+01>
<epoch: 2, iter: 53,500, lr:2.500e-04, t:1.88e+00, td:2.43e-03, eta:7.64e+01, nll:-1.803e+01>
<epoch: 2, iter: 53,600, lr:2.500e-04, t:1.88e+00, td:2.42e-03, eta:7.65e+01, nll:-1.760e+01>
<epoch: 2, iter: 53,700, lr:2.500e-04, t:1.86e+00, td:2.40e-03, eta:7.57e+01, nll:-1.747e+01>
<epoch: 2, iter: 53,800, lr:2.500e-04, t:1.89e+00, td:2.43e-03, eta:7.66e+01, nll:-2.144e+01>
<epoch: 2, iter: 53,900, lr:2.500e-04, t:1.90e+00, td:2.43e-03, eta:7.72e+01, nll:-1.826e+01>
<epoch: 2, iter: 54,000, lr:2.500e-04, t:1.88e+00, td:2.40e-03, eta:7.64e+01, nll:-1.700e+01>
21-02-15 23:32:23.089 - INFO: Saving models and training states.
<epoch: 2, iter: 54,100, lr:2.500e-04, t:1.90e+00, td:2.55e-03, eta:7.70e+01, nll:-1.809e+01>
<epoch: 2, iter: 54,200, lr:2.500e-04, t:1.89e+00, td:2.46e-03, eta:7.64e+01, nll:-1.832e+01>
<epoch: 2, iter: 54,300, lr:2.500e-04, t:1.90e+00, td:2.47e-03, eta:7.67e+01, nll:-1.641e+01>
<epoch: 2, iter: 54,400, lr:2.500e-04, t:1.89e+00, td:2.47e-03, eta:7.63e+01, nll:-1.669e+01>
<epoch: 2, iter: 54,500, lr:2.500e-04, t:1.88e+00, td:2.43e-03, eta:7.60e+01, nll:-1.491e+01>
<epoch: 2, iter: 54,600, lr:2.500e-04, t:1.88e+00, td:2.46e-03, eta:7.58e+01, nll:-1.798e+01>
<epoch: 2, iter: 54,700, lr:2.500e-04, t:1.90e+00, td:2.41e-03, eta:7.65e+01, nll:-1.596e+01>
<epoch: 2, iter: 54,800, lr:2.500e-04, t:1.88e+00, td:2.53e-03, eta:7.59e+01, nll:-1.580e+01>
<epoch: 2, iter: 54,900, lr:2.500e-04, t:1.88e+00, td:2.40e-03, eta:7.58e+01, nll:-1.713e+01>
<epoch: 2, iter: 55,000, lr:2.500e-04, t:1.89e+00, td:2.43e-03, eta:7.62e+01, nll:-1.874e+01>
21-02-16 00:03:50.708 - INFO: Saving models and training states.
<epoch: 2, iter: 55,100, lr:2.500e-04, t:1.89e+00, td:2.44e-03, eta:7.62e+01, nll:-1.506e+01>
<epoch: 2, iter: 55,200, lr:2.500e-04, t:1.88e+00, td:2.39e-03, eta:7.55e+01, nll:-1.786e+01>
<epoch: 2, iter: 55,300, lr:2.500e-04, t:1.88e+00, td:2.47e-03, eta:7.58e+01, nll:-1.834e+01>
<epoch: 2, iter: 55,400, lr:2.500e-04, t:1.88e+00, td:2.39e-03, eta:7.54e+01, nll:-1.841e+01>
<epoch: 2, iter: 55,500, lr:2.500e-04, t:1.89e+00, td:2.41e-03, eta:7.58e+01, nll:-1.820e+01>
<epoch: 2, iter: 55,600, lr:2.500e-04, t:1.87e+00, td:2.46e-03, eta:7.51e+01, nll:-1.633e+01>
<epoch: 2, iter: 55,700, lr:2.500e-04, t:1.89e+00, td:2.44e-03, eta:7.57e+01, nll:-1.660e+01>
<epoch: 2, iter: 55,800, lr:2.500e-04, t:1.89e+00, td:2.44e-03, eta:7.59e+01, nll:-1.856e+01>
<epoch: 2, iter: 55,900, lr:2.500e-04, t:1.90e+00, td:2.45e-03, eta:7.59e+01, nll:-1.613e+01>
CUBLAS error: out of memory (3) in magma_sgetrf_gpu_expert at /opt/conda/conda-bld/magma-cuda102_1583546904148/work/src/sgetrf_gpu.cpp:126
CUBLAS error: not initialized (1) in magma_sgetrf_gpu_expert at /opt/conda/conda-bld/magma-cuda102_1583546904148/work/src/sgetrf_gpu.cpp:126
Skipping ERROR caught in nll = model.optimize_parameters(current_step):
Caught RuntimeError in replica 1 on device 1.
Original Traceback (most recent call last):
File "/run/media/root/7de46b27-ca07-4d98-8955-0d77387c5764/test/SRFlow/myenv/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 61, in _worker
output = module(*input, **kwargs)
File "/run/media/root/7de46b27-ca07-4d98-8955-0d77387c5764/test/SRFlow/myenv/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/run/media/root/7de46b27-ca07-4d98-8955-0d77387c5764/test/SRFlow/code/models/modules/SRFlowNet_arch.py", line 65, in forward
y_onehot=y_label)
File "/run/media/root/7de46b27-ca07-4d98-8955-0d77387c5764/test/SRFlow/code/models/modules/SRFlowNet_arch.py", line 101, in normal_flow
y_onehot=y_onehot)
File "/run/media/root/7de46b27-ca07-4d98-8955-0d77387c5764/test/SRFlow/myenv/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/run/media/root/7de46b27-ca07-4d98-8955-0d77387c5764/test/SRFlow/code/models/modules/FlowUpsamplerNet.py", line 213, in forward
z, logdet = self.encode(gt, rrdbResults, logdet=logdet, epses=epses, y_onehot=y_onehot)
File "/run/media/root/7de46b27-ca07-4d98-8955-0d77387c5764/test/SRFlow/code/models/modules/FlowUpsamplerNet.py", line 238, in encode
fl_fea, logdet = layer(fl_fea, logdet, reverse=reverse, rrdbResults=level_conditionals[level])
File "/run/media/root/7de46b27-ca07-4d98-8955-0d77387c5764/test/SRFlow/myenv/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/run/media/root/7de46b27-ca07-4d98-8955-0d77387c5764/test/SRFlow/code/models/modules/FlowStep.py", line 84, in forward
return self.normal_flow(input, logdet, rrdbResults)
File "/run/media/root/7de46b27-ca07-4d98-8955-0d77387c5764/test/SRFlow/code/models/modules/FlowStep.py", line 103, in normal_flow
self, z, logdet, False)
File "/run/media/root/7de46b27-ca07-4d98-8955-0d77387c5764/test/SRFlow/code/models/modules/FlowStep.py", line 35, in
"invconv": lambda obj, z, logdet, rev: obj.invconv(z, logdet, rev),
File "/run/media/root/7de46b27-ca07-4d98-8955-0d77387c5764/test/SRFlow/myenv/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/run/media/root/7de46b27-ca07-4d98-8955-0d77387c5764/test/SRFlow/code/models/modules/Permutations.py", line 48, in forward
weight, dlogdet = self.get_weight(input, reverse)
File "/run/media/root/7de46b27-ca07-4d98-8955-0d77387c5764/test/SRFlow/code/models/modules/Permutations.py", line 37, in get_weight
dlogdet = torch.slogdet(self.weight)[1] * pixels
RuntimeError: CUDA error: resource already mapped

<epoch: 2, iter: 56,000, lr:2.500e-04, t:1.86e+00, td:2.41e-03, eta:7.44e+01, nll:-1.589e+01>
21-02-16 00:35:13.687 - INFO: Saving models and training states.
<epoch: 2, iter: 56,100, lr:2.500e-04, t:1.88e+00, td:2.41e-03, eta:7.51e+01, nll:-1.545e+01>
<epoch: 2, iter: 56,200, lr:2.500e-04, t:1.87e+00, td:2.44e-03, eta:7.48e+01, nll:-1.524e+01>
<epoch: 2, iter: 56,300, lr:2.500e-04, t:1.88e+00, td:2.50e-03, eta:7.49e+01, nll:-1.727e+01>
<epoch: 2, iter: 56,400, lr:2.500e-04, t:1.85e+00, td:2.40e-03, eta:7.40e+01, nll:-1.717e+01>
<epoch: 2, iter: 56,500, lr:2.500e-04, t:1.88e+00, td:2.48e-03, eta:7.48e+01, nll:-1.548e+01>
<epoch: 2, iter: 56,600, lr:2.500e-04, t:1.86e+00, td:2.48e-03, eta:7.42e+01, nll:-1.752e+01>
<epoch: 2, iter: 56,700, lr:2.500e-04, t:1.88e+00, td:2.48e-03, eta:7.47e+01, nll:-1.669e+01>
<epoch: 2, iter: 56,800, lr:2.500e-04, t:1.86e+00, td:2.43e-03, eta:7.40e+01, nll:-1.632e+01>
<epoch: 2, iter: 56,900, lr:2.500e-04, t:1.86e+00, td:2.44e-03, eta:7.40e+01, nll:-1.778e+01>
<epoch: 2, iter: 57,000, lr:2.500e-04, t:1.88e+00, td:2.45e-03, eta:7.47e+01, nll:-1.696e+01>
21-02-16 01:06:23.673 - INFO: Saving models and training states.
<epoch: 2, iter: 57,100, lr:2.500e-04, t:1.90e+00, td:2.48e-03, eta:7.54e+01, nll:-1.575e+01>
<epoch: 2, iter: 57,200, lr:2.500e-04, t:1.86e+00, td:2.44e-03, eta:7.39e+01, nll:-1.667e+01>
<epoch: 2, iter: 57,300, lr:2.500e-04, t:1.88e+00, td:2.47e-03, eta:7.47e+01, nll:-1.871e+01>
<epoch: 2, iter: 57,400, lr:2.500e-04, t:1.88e+00, td:2.50e-03, eta:7.44e+01, nll:-1.781e+01>
<epoch: 2, iter: 57,500, lr:2.500e-04, t:1.86e+00, td:2.43e-03, eta:7.36e+01, nll:-1.881e+01>
<epoch: 2, iter: 57,600, lr:2.500e-04, t:1.86e+00, td:2.46e-03, eta:7.38e+01, nll:-1.742e+01>
<epoch: 2, iter: 57,700, lr:2.500e-04, t:1.87e+00, td:2.43e-03, eta:7.38e+01, nll:-1.726e+01>
<epoch: 2, iter: 57,800, lr:2.500e-04, t:1.86e+00, td:2.42e-03, eta:7.34e+01, nll:-1.844e+01>
<epoch: 2, iter: 57,900, lr:2.500e-04, t:1.87e+00, td:2.44e-03, eta:7.36e+01, nll:-1.622e+01>
<epoch: 2, iter: 58,000, lr:2.500e-04, t:1.87e+00, td:2.46e-03, eta:7.38e+01, nll:-1.635e+01>
21-02-16 01:37:34.238 - INFO: Saving models and training states.
<epoch: 2, iter: 58,100, lr:2.500e-04, t:1.89e+00, td:2.43e-03, eta:7.44e+01, nll:-1.692e+01>
<epoch: 2, iter: 58,200, lr:2.500e-04, t:1.84e+00, td:2.40e-03, eta:7.25e+01, nll:-1.594e+01>
<epoch: 2, iter: 58,300, lr:2.500e-04, t:1.87e+00, td:2.43e-03, eta:7.36e+01, nll:-1.747e+01>
<epoch: 2, iter: 58,400, lr:2.500e-04, t:1.87e+00, td:2.49e-03, eta:7.34e+01, nll:-1.949e+01>
<epoch: 2, iter: 58,500, lr:2.500e-04, t:1.86e+00, td:2.46e-03, eta:7.32e+01, nll:-1.595e+01>
<epoch: 2, iter: 58,600, lr:2.500e-04, t:1.86e+00, td:2.44e-03, eta:7.32e+01, nll:-1.600e+01>
<epoch: 2, iter: 58,700, lr:2.500e-04, t:1.88e+00, td:2.47e-03, eta:7.39e+01, nll:-1.668e+01>
<epoch: 2, iter: 58,800, lr:2.500e-04, t:1.87e+00, td:2.46e-03, eta:7.33e+01, nll:-1.868e+01>
<epoch: 2, iter: 58,900, lr:2.500e-04, t:1.86e+00, td:2.46e-03, eta:7.28e+01, nll:-1.802e+01>
<epoch: 2, iter: 59,000, lr:2.500e-04, t:1.86e+00, td:2.47e-03, eta:7.27e+01, nll:-1.569e+01>
21-02-16 02:08:39.673 - INFO: Saving models and training states.
<epoch: 2, iter: 59,100, lr:2.500e-04, t:1.87e+00, td:2.42e-03, eta:7.34e+01, nll:-1.721e+01>
<epoch: 2, iter: 59,200, lr:2.500e-04, t:1.84e+00, td:2.39e-03, eta:7.21e+01, nll:-1.866e+01>
<epoch: 2, iter: 59,300, lr:2.500e-04, t:1.85e+00, td:2.47e-03, eta:7.22e+01, nll:-1.685e+01>
<epoch: 2, iter: 59,400, lr:2.500e-04, t:1.88e+00, td:2.45e-03, eta:7.35e+01, nll:-1.809e+01>
<epoch: 2, iter: 59,500, lr:2.500e-04, t:1.88e+00, td:2.42e-03, eta:7.35e+01, nll:-1.618e+01>
Fatal Python error: Bus error

Thread 0x00007f2e50c69700 (most recent call first):
File "/run/media/root/7de46b27-ca07-4d98-8955-0d77387c5764/test/SRFlow/code/models/modules/Permutations.py", line 37 in get_weight
File "/run/media/root/7de46b27-ca07-4d98-8955-0d77387c5764/test/SRFlow/code/models/modules/Permutations.py", line 48 in fFatal Python error: orwarSegmentation faultd

File "/run/meSegmentation fault (core dumped)"
(myenv) (python37) [root@master code]#
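In case it helps with reproducing this: I understand a "Bus error" during PyTorch training is often a sign that /dev/shm filled up while DataLoader workers were passing tensors through shared memory. A minimal diagnostic sketch I am using to watch this (my own script, not part of SRFlow):

import shutil

# A full /dev/shm is a common cause of "Bus error" when PyTorch DataLoader
# workers exchange tensors through shared memory, so report how much space
# is left there while training runs.
usage = shutil.disk_usage("/dev/shm")
print(f"/dev/shm: total={usage.total >> 20} MiB, free={usage.free >> 20} MiB")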

What should I do?
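For reference, the earlier CUBLAS failure points at torch.slogdet(self.weight) in models/modules/Permutations.py (get_weight, line 37 of the traceback). One workaround I am considering (a sketch under my own assumptions, not a verified fix) is to route that single log-determinant through the CPU so it avoids the MAGMA magma_sgetrf_gpu_expert path:

import torch

# Hedged workaround sketch (my assumption, not the repository's fix): compute
# slogdet on the CPU to bypass the MAGMA GPU LU factorization that raised
# "CUBLAS error: out of memory" in magma_sgetrf_gpu_expert. The device round
# trip stays differentiable, so autograd still sees the log-determinant.
def slogdet_on_cpu(weight: torch.Tensor) -> torch.Tensor:
    """Return log|det(weight)| computed on CPU, moved back to weight's device."""
    _sign, logabsdet = torch.slogdet(weight.cpu())
    return logabsdet.to(weight.device)

# In get_weight (Permutations.py, line 37 of the traceback), the call
#     dlogdet = torch.slogdet(self.weight)[1] * pixels
# would become
#     dlogdet = slogdet_on_cpu(self.weight) * pixels

I have not confirmed whether this helps, since the final crash is the bus error rather than the CUBLAS error.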
