
Segmentation fault when differencing cubes #2063

Closed
PAGWatson opened this issue Jun 28, 2016 · 17 comments
@PAGWatson

Hello,

I get a seg fault when I take the difference between two cubes and then try to access the data. I've attached the files containing the data at https://groups.google.com/forum/#!topic/scitools-iris/OgFbHKtNGqU .

import iris

cube1 = iris.load_cube('get_autocorr_data_lab0_tp_2.5x2.5_lat_-90.0-90.0_daily_mean_1998-2007_lag1.nc')
cube2 = iris.load_cube('get_autocorr_data_gpcp_tp_2.5x2.5_lat_-90.0-90.0_daily_mean_1998-2007_lag1.nc')
diff = cube1 - cube2
diff_data = diff.data  # The seg fault happens at this line

However, replacing the last two lines with the line below works fine.

diff_data = cube1.data - cube2.data

(I'm using Iris 1.9.2 in IPython 4.0.3.)

Andrew Dawson reproduced the issue and posted the full traceback at the above link.

@shoyer (Contributor) commented Jun 30, 2016

As @ajdawson notes, "Interestingly, it is sufficient to load only one of the cubes data payloads for this bug to disappear... This should be reported as a bug on the iris issue tracker."

I'm pretty sure the solution is that Iris needs to use a thread lock when accessing data from netCDF4, because the HDF5 library is not thread-safe. We use such a thread lock in xarray/dask.
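
For illustration, a minimal sketch of that pattern, assuming a hypothetical read_slice helper rather than any actual Iris or xarray code: a single module-level lock shared by all readers serialises access to the HDF5 layer.

import threading

import netCDF4
import numpy as np

# One lock shared by every reader, because the HDF5 library is not thread-safe.
HDF5_LOCK = threading.Lock()


def read_slice(path, variable_name, keys):
    # Hypothetical helper: serialise all netCDF4/HDF5 access through the lock.
    with HDF5_LOCK:
        dataset = netCDF4.Dataset(path)
        try:
            data = dataset.variables[variable_name][keys]
        finally:
            dataset.close()
    return np.asanyarray(data)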

@ajdawson (Member) commented Jul 8, 2016

Thanks @shoyer, this is useful information. I did a quick test by adding a global lock to data reads within a biggus.ProducerNode, which does seem to prevent the error described by @PAGWatson.

@marqh (Member) commented Sep 26, 2016

https://github.com/SciTools/iris/blob/master/INSTALL#L89

The Iris INSTALL notes also suggest that there may be an issue here if the HDF5 build is not thread-safe. I have tripped over this in the past.

@PAGWatson, is this still an open issue from your point of view?

@PAGWatson (Author)

Well, I can work around this, so it's not a huge problem for me for now. I've not checked whether the bug is still present in the latest version of Iris, though.

@shoyer (Contributor) commented Sep 26, 2016

There's no harm in adding a defensive threading.Lock() around HDF5/netCDF4 calls, given that HDF5 doesn't support multi-threading anyway.

@mheikenfeld

I've run into the same issue over the last couple of days (Iris 1.10.0). Any updates on this? I can work around it as well for now, but it seems like something people are very likely to run into, and it is hard to make sense of because all you get is a segmentation fault.

@DPeterK (Member) commented Nov 23, 2016

Hi @mheikenfeld, thanks for letting us know you're also encountering this issue. Our advice remains to ensure that you're using a thread-safe install of hdf5 (see the Iris install note). For more information, see §4.3.11 of the HDF5 INSTALL document.

@ajdawson (Member)

We should work towards actually solving this (in biggus) because many people are using a pre-built binary HDF5 and have no control over whether it is thread-safe or not.

@ajdawson (Member)

See SciTools/biggus#194.

@cpelley commented Mar 29, 2017

conda-forge now has a thread-safe version of HDF5 available.

@marqh (Member) commented Apr 7, 2017

@ajdawson

We should work towards actually solving this (in biggus) because many people are using a pre-built binary HDF5 and have no control over whether it is thread-safe or not.

I think this would be a neat thing to deliver.

However, there is an ongoing activity to replace biggus with dask for Iris v2; the code for accessing netCDF files is here:
https://github.com/SciTools/iris/blob/dask/lib/iris/fileformats/netcdf.py#L374

The __getitem__ method does the work.

Do you think that:

There's no harm in adding a defensive threading.Lock() around HDF5/netCDF4 calls, given that HDF5 doesn't support multi-threading anyway.

could fit into that structure?

@bjlittle @dkillick @pp-mo @lbdreyer

  • do you think that this might have performance implications?
  • do you think this might cause other issues?
  • do you think we can test such behaviour, given that the failure mode is an outright SegFault? (see the sketch below)
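
On the testing question, one purely illustrative idea (not an existing Iris test): a smoke test that issues many concurrent reads of the same file. On a non-thread-safe HDF5 build with no lock this tends to crash the interpreter, so it would probably have to run in a subprocess.

import concurrent.futures

import netCDF4
import numpy as np


def _read(path, varname):
    # Each worker opens the file and realises the whole variable.
    dataset = netCDF4.Dataset(path)
    try:
        data = dataset.variables[varname][:]
    finally:
        dataset.close()
    return np.asanyarray(data).sum()


def hammer(path, varname, n_workers=8, n_reads=100):
    # Hypothetical stress test: issue many concurrent reads of the same variable.
    with concurrent.futures.ThreadPoolExecutor(n_workers) as pool:
        futures = [pool.submit(_read, path, varname) for _ in range(n_reads)]
        return [f.result() for f in futures]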

@bouweandela (Member) commented May 19, 2020

Is this issue still current? It looks like it has not been solved in Unidata/netcdf4-python#844. I can see that the proposed lock has not been implemented yet:

def __getitem__(self, keys):
    dataset = netCDF4.Dataset(self.path)
    try:
        variable = dataset.variables[self.variable_name]
        # Get the NetCDF variable data and slice.
        var = variable[keys]
    finally:
        dataset.close()
    return np.asanyarray(var)
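
For comparison, a sketch of what the locked variant might look like; this is illustrative only, not the actual Iris implementation, and reuses the module-level-lock idea discussed above.

import threading

# Module-level lock shared by all data proxies, so only one thread touches HDF5 at a time.
_NETCDF_LOCK = threading.Lock()


def __getitem__(self, keys):
    with _NETCDF_LOCK:
        dataset = netCDF4.Dataset(self.path)
        try:
            variable = dataset.variables[self.variable_name]
            # Get the NetCDF variable data and slice while holding the lock.
            var = variable[keys]
        finally:
            dataset.close()
    return np.asanyarray(var)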

We seem to be running into this with Iris 2.4, see ESMValGroup/ESMValCore#644. An example of the crash happening: https://app.circleci.com/pipelines/github/ESMValGroup/ESMValCore/2482/workflows/f8e73729-c4cf-408c-bdae-beec24238ac1/jobs/10300/steps

Using these libraries installed from conda:

iris                      2.4.0                    py38_0    conda-forge
libnetcdf                 4.7.4           mpi_mpich_h755db7c_1    conda-forge
netcdf4                   1.5.3           mpi_mpich_py38h894258e_3    conda-forge
hdf5                      1.10.5          mpi_mpich_ha7d0aea_1004    conda-forge

@bjlittle

@rcomer (Member) commented May 19, 2020

Is this issue still current?

I just downloaded the OP's sample files and could not reproduce the problem at Iris 2.2. (The files won't load in the Iris 2.4 environment I have access to, as they contain an invalid variable name, "pearson's_r".)
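
(A possible workaround for the loading problem, untested here: rename the offending variable in place with netCDF4 before loading. The replacement name below is just an example.)

import netCDF4

# Rename the variable so its name no longer contains an apostrophe (modifies the file in place).
with netCDF4.Dataset('get_autocorr_data_lab0_tp_2.5x2.5_lat_-90.0-90.0_daily_mean_1998-2007_lag1.nc', 'a') as ds:
    ds.renameVariable("pearson's_r", 'pearsons_r')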

@valeriupredoi

It might be useful for you guys to know the rates at which a SegFault occurs at the point of realising the data (have a look at this comment and this comment): a segfault happens in roughly 1-2% of calls that realise a cube's data. That may seem like a low rate, but it is the rate for a single call, and since these segfaults are effectively independent, a script with 100 such calls will very likely segfault on any given run. Have you guys looked into this for Iris 3 by any chance?
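
(To make the arithmetic explicit, a quick back-of-the-envelope check, assuming the per-call rates quoted above and independent calls.)

# Probability of at least one segfault in a run with n independent data-realisation calls.
def p_any_segfault(per_call_rate, n_calls=100):
    return 1 - (1 - per_call_rate) ** n_calls

print(p_any_segfault(0.01))  # ~0.63 at a 1% per-call rate
print(p_any_segfault(0.02))  # ~0.87 at a 2% per-call rate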

@github-actions (bot)

In order to maintain a backlog of relevant issues, we automatically label them as stale after 500 days of inactivity.

If this issue is still important to you, then please comment on this issue and the stale label will be removed.

Otherwise this issue will be automatically closed in 28 days' time.

github-actions bot added the "Stale" (stale issue/pull-request) label on May 30, 2022
@bouweandela (Member)

@valeriupredoi Have you seen any segfaults in our CI recently?

github-actions bot removed the "Stale" (stale issue/pull-request) label on May 31, 2022
@rcomer (Member) commented Feb 21, 2023

The OP's example here has not been reproducible for some time (#2063 (comment)), and significant recent work has gone into thread safety (#5095), so I reckon this issue could be closed. I'll leave the decision to @trexfeathers as he is the assignee.
