
segfault with a particular netcdf4 file #8289

Open

hmaarrfk opened this issue Oct 9, 2023 · 13 comments

@hmaarrfk
Contributor

hmaarrfk commented Oct 9, 2023

What happened?

The following code yields a segfault on my machine (and many other machines with a similar environment)

import xarray
filename = 'tiny.nc.txt'
engine = "netcdf4"

dataset = xarray.open_dataset(filename, engine=engine)

i = 0
for i in range(60):
    xarray.open_dataset(filename, engine=engine)

tiny.nc.txt
mrc.nc.txt

What did you expect to happen?

Not to segfault.

Minimal Complete Verifiable Example

  1. Generate a netcdf4 file with my application.
  2. Trim the netcdf4 file down (load it, and drop all the vars I can while still reproducing this bug).
  3. Try to read it.

import xarray
from tqdm import tqdm
filename = 'mrc.nc.txt'
engine = "h5netcdf"
dataset = xarray.open_dataset(filename, engine=engine)

for i in tqdm(range(60), desc=f"filename={filename}, engine={engine}"):
    xarray.open_dataset(filename, engine=engine)


engine = "netcdf4"

dataset = xarray.open_dataset(filename, engine=engine)
for i in tqdm(range(60), desc=f"filename={filename}, engine={engine}"):
    xarray.open_dataset(filename, engine=engine)

filename = 'tiny.nc.txt'

engine = "h5netcdf"
dataset = xarray.open_dataset(filename, engine=engine)
for i in tqdm(range(60), desc=f"filename={filename}, engine={engine}"):
    xarray.open_dataset(filename, engine=engine)


engine = "netcdf4"

dataset = xarray.open_dataset(filename, engine=engine)
for i in tqdm(range(60), desc=f"filename={filename}, engine={engine}"):
    xarray.open_dataset(filename, engine=engine)

Hand-crafting the file from start to finish seems not to segfault:

import xarray
import numpy as np
engine = 'netcdf4'

dataset = xarray.Dataset()

coords = {}
coords['image_x'] = np.arange(1, dtype='int')
dataset = dataset.assign_coords(coords)

dataset['image'] = xarray.DataArray(
    np.zeros((1,), dtype='uint8'),
    dims=('image_x',)
)

# %%
dataset.to_netcdf('mrc.nc.txt')
# %%
dataset = xarray.open_dataset('mrc.nc.txt', engine=engine)


for i in range(10):
    xarray.open_dataset('mrc.nc.txt', engine=engine)

MVCE confirmation

  • Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
  • Complete example — the example is self-contained, including all data and the text of any traceback.
  • Verifiable example — the example copy & pastes into an IPython prompt or Binder notebook, returning the result.
  • New issue — a search of GitHub Issues suggests this is not a duplicate.
  • Recent environment — the issue occurs with the latest version of xarray and its dependencies.

Relevant log output

i=0 passes
i=1 mostly segfaults, but sometimes it can take more than 1 iteration

Anything else we need to know?

At first I thought it was deep in hdf5, but I am less convinced now.

xref: HDFGroup/hdf5#3649

Environment

INSTALLED VERSIONS
------------------
commit: None
python: 3.10.12 | packaged by Ramona Optics | (main, Jun 27 2023, 02:59:09) [GCC 12.3.0]
python-bits: 64
OS: Linux
OS-release: 6.5.1-060501-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: ('en_US', 'UTF-8')
libhdf5: 1.14.2
libnetcdf: 4.9.2

xarray: 2023.9.1.dev25+g46643bb1.d20231009
pandas: 2.1.1
numpy: 1.24.4
scipy: 1.11.3
netCDF4: 1.6.4
pydap: None
h5netcdf: 1.2.0
h5py: 3.9.0
Nio: None
zarr: 2.16.1
cftime: 1.6.2
nc_time_axis: None
PseudoNetCDF: None
iris: None
bottleneck: None
dask: 2023.3.0
distributed: 2023.3.0
matplotlib: 3.8.0
cartopy: None
seaborn: None
numbagg: None
fsspec: 2023.9.2
cupy: None
pint: 0.22
sparse: None
flox: None
numpy_groupies: None
setuptools: 68.2.2
pip: 23.2.1
conda: 23.7.4
pytest: 7.4.2
mypy: None
IPython: 8.16.1
sphinx: 7.2.6
@hmaarrfk added the bug and needs triage (Issue that has not been reviewed by xarray team member) labels Oct 9, 2023
@hmaarrfk
Contributor Author

hmaarrfk commented Oct 9, 2023

OK, this was bugging me enough; here is a reproducer that just "runs" without any data:

import xarray
engine = 'netcdf4'

dataset = xarray.Dataset()

dataset.coords['x'] = ['a']
dataset.to_netcdf('mrc.nc')
dataset = xarray.open_dataset('mrc.nc', engine=engine)

for i in range(10):
    print(f"i={i}")
    xarray.open_dataset('mrc.nc', engine=engine)

The key was making the coordinate an H5T_STRING type.

string_mrc.nc.txt
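As a cross-check (not from the thread): h5py can report whether a variable is stored as an HDF5 string type. This sketch writes a tiny file containing a variable-length string dataset, mimicking what a string coordinate becomes on disk, then inspects its dtype; the filename `string_check.nc` is made up for illustration, and this assumes h5py is installed.

```python
# Sketch: detect an H5T_STRING-backed variable with h5py.
import h5py

# Write a variable-length string dataset, analogous to a string coordinate.
with h5py.File("string_check.nc", "w") as f:
    f.create_dataset("x", data=["a"], dtype=h5py.string_dtype())

# Read it back and check whether the dtype carries HDF5 string metadata.
with h5py.File("string_check.nc", "r") as f:
    info = h5py.check_string_dtype(f["x"].dtype)
    print(info is not None)  # True for H5T_STRING-backed data
```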

@hmaarrfk
Contributor Author

hmaarrfk commented Oct 10, 2023

Sorry to rapid-fire post, but the following "hack" seems to resolve the issues I am observing:

diff --git a/xarray/backends/netCDF4_.py b/xarray/backends/netCDF4_.py
index f21f15bf..8f1243da 100644
--- a/xarray/backends/netCDF4_.py
+++ b/xarray/backends/netCDF4_.py
@@ -394,8 +394,8 @@ class NetCDF4DataStore(WritableCFDataStore):
         kwargs = dict(
             clobber=clobber, diskless=diskless, persist=persist, format=format
         )
-        manager = CachingFileManager(
-            netCDF4.Dataset, filename, mode=mode, kwargs=kwargs
+        manager = DummyFileManager(
+            netCDF4.Dataset(filename, mode=mode, **kwargs)
         )
         return cls(manager, group=group, mode=mode, lock=lock, autoclose=autoclose)

I have a feeling some reference isn't being kept, and the file is being freed somehow during garbage collection.

While this "hack" somewhat "works", if I try to open the same file with two different backends, it really likes to complain.

It may be that libnetcdf4 just expects to be in control of the file at all times.
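The suspected failure mode above can be illustrated with a pure-Python sketch (a hypothetical model, not xarray's actual code): a manager object that closes its file on deallocation will close a handle that another, still-live manager shares.

```python
# Hypothetical model of the suspected bug: a shared file handle is
# closed during garbage collection while another consumer still uses it.
import gc


class FakeFile:
    """Stands in for a netCDF4.Dataset handle."""
    def __init__(self):
        self.closed = False

    def close(self):
        self.closed = True


class Manager:
    """Stands in for a file manager that closes its file on deallocation."""
    def __init__(self, f):
        self.file = f

    def __del__(self):
        # Analogous to the "deallocating ..., but file is not already
        # closed" path: the manager closes the file when collected.
        self.file.close()


shared = FakeFile()
m1 = Manager(shared)  # first open holds this manager
m2 = Manager(shared)  # second open wraps the *same* underlying file

m2 = None             # second handle goes out of scope...
gc.collect()
print(shared.closed)  # True: the file was closed out from under m1
```

The later access through `m1` then touches a closed handle, which in the C library manifests as a segfault rather than a Python exception.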

@hmaarrfk
Contributor Author

Running a similar segfaulting benchmark on xarray's main branch:

import xarray
import numpy as np

write_engine = 'h5netcdf'
hold_engine = 'h5netcdf'
read_engine = 'netcdf4'

filename = f'{write_engine}_mrc.nc'

# %%
dataset = xarray.Dataset()

dataset.coords['x'] = ['a']
dataset.coords['my_version'] = '1.2.3.4.5.6'
dataset['images'] = (('x', ), np.zeros((1,)))

dataset.to_netcdf(filename, engine=write_engine)
# %%
dataset = xarray.open_dataset(filename, engine=hold_engine)
for i in range(100):
    print(f"i={i}")
    xarray.open_dataset(filename, engine=read_engine)
                      write engine
hold/read engine      h5netcdf   netcdf4
netcdf4/netcdf4       pass       segfault
netcdf4/h5netcdf      pass       segfault
h5netcdf/h5netcdf     pass       pass
h5netcdf/netcdf4      pass       pass

@TomNicholas added the topic-backends label and removed the needs triage (Issue that has not been reviewed by xarray team member) label Oct 10, 2023
@hmaarrfk
Contributor Author

hmaarrfk commented Oct 30, 2023

While I know these issues are hard, can anybody else confirm that this happens on their system as well? Maybe my machine is really weird....

As a final reproducer:

import xarray
xarray.set_options(warn_for_unclosed_files=True)
# Also needs a small patch....
"""
diff --git a/xarray/backends/file_manager.py b/xarray/backends/file_manager.py
index df901f9a..a2e8af03 100644
--- a/xarray/backends/file_manager.py
+++ b/xarray/backends/file_manager.py
@@ -252,11 +252,10 @@ class CachingFileManager(FileManager):
                     self._lock.release()
 
             if OPTIONS["warn_for_unclosed_files"]:
-                warnings.warn(
+                print(
                     f"deallocating {self}, but file is not already closed. "
                     "This may indicate a bug.",
-                    RuntimeWarning,
-                    stacklevel=2,
+                    flush=True
                 )
 
     def __getstate__(self):

"""

dataset = xarray.Dataset()

dataset.coords['x'] = ['a']
dataset.to_netcdf('mrc.nc', engine='netcdf4')

dataset = xarray.open_dataset('mrc.nc', engine='netcdf4')
for i in range(100):
    print(f"i={i}")
    xarray.open_dataset('mrc.nc', engine='netcdf4')

Gives the output:

i=0
deallocating CachingFileManager(<class 'netCDF4._netCDF4.Dataset'>, '/home/mark/git/wgpu/mrc.nc', mode='r', kwargs={'clobber': True, 'diskless': False, 'persist': False, 'format': 'NETCDF4'}, manager_id='120349e5-9287-4535-a724-588aa78cf9d0'), but file is not already closed. This may indicate a bug.

@jhamman
Member

jhamman commented Oct 30, 2023

I can confirm that your original reproducer segfaults on my system (Linux/x86_64). I also agree with your diagnosis that this seems to be an issue with the caching file manager.

FWIW, adding dataset.close() after the first open_dataset does seem to solve things. That said, I don't think this should be required, so for some reason we're not using the cached file object.

@hmaarrfk
Contributor Author

Thanks for confirming. It has been puzzling me to no end.

@heikoklein

heikoklein commented May 3, 2024

I'm also struggling with this problem, and I have simplified the code a bit more:

import xarray
import numpy as np
import os
import sys

filename = 'test_mrc.nc'

if not os.path.exists(filename):
    dataset_w = xarray.Dataset()
    dataset_w['x'] = ['a']
    dataset_w.to_netcdf(filename)

print("try open 1", file=sys.stderr)
dataset = xarray.open_dataset(filename)
print("try open 2", file=sys.stderr)
dataset2 = xarray.open_dataset(filename)
dataset2 = None
print("try open 3", file=sys.stderr)
dataset3 = xarray.open_dataset(filename)
print("success")

The problem only occurs if certain features of netcdf4 are used in the file (e.g. superblock 2, strings), but those are common.
The cache manager fails to handle the file if it was opened twice and one of these two handles goes out of scope (here dataset2). The next open (dataset3) triggers a segmentation fault.

I've tested with v2023.6.0 (latest version in conda-forge / python 3.11.7, linux)

@thorbjoernl

thorbjoernl commented May 3, 2024

The above example succeeds in my case (after removing the extra .):

(venv) $ pip freeze | grep xarray
xarray==2024.3.0
(venv) $ python3 --version
Python 3.11.3
(venv) $ python3 test.py 
try open 1
try open 2
try open 3
success

No conda; just a regular python virtual environment

@heikoklein

I just managed to upgrade my xarray to 2024.03.0 (pinning the version) and still get the error, though it works sometimes?

$ conda install xarray=2024.3.0
...
$ conda list | grep xarray
xarray                    2024.3.0           pyhd8ed1ab_0    conda-forge
$ python3 xarray_segfault.py 
try open 1
try open 2
try open 3
success
$ python3 xarray_segfault.py 
try open 1
try open 2
try open 3
Segmentation fault (core dumped)

(this was independent of whether the test_mrc.nc file existed or not...)

@keewis
Collaborator

keewis commented May 3, 2024

From a quick bisect, it appears this particular issue was introduced by #4879. Not sure what to do to fix that.

We can observe, though, that the file manager id in the least-recently used cache changes every time we open it, but that the underlying netCDF4.Dataset object always stays the same. So that might be a hint?
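The observation above — a fresh manager id on every open, wrapping the same underlying dataset — can be modeled in a few lines of plain Python (a hypothetical sketch, not xarray's actual cache code): if a per-instance id is part of the cache key, every open misses the cache and creates a new manager, defeating the sharing the cache is supposed to provide.

```python
# Hypothetical sketch: a per-call id in the cache key makes every
# "open" a cache miss, so managers are never shared across opens.
import uuid

cache = {}


def open_managed(filename):
    # The id changes on every call, mirroring the observed behaviour in
    # the least-recently used cache.
    key = (filename, uuid.uuid4())
    manager = cache.setdefault(key, object())
    return key, manager


k1, m1 = open_managed("mrc.nc")
k2, m2 = open_managed("mrc.nc")
print(k1 == k2)    # False: the lookup can never hit a previous entry
print(len(cache))  # 2: one orphaned manager per open
```

Each orphaned manager then closes "its" file on collection, which matches the deallocation warning seen earlier in the thread.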

@hmaarrfk
Contributor Author

hmaarrfk commented May 3, 2024

I think it is related to #7359 (comment)

@sjsmith757

I'm curious whether there has been any recent progress on this issue? I'm running into this problem even with xarray 2024.9.0, and it's really disrupting my current workflow. I have yet to find a workaround and would appreciate it if anyone has one. Simply adding a dataset.close() (or wrapping the open in a context manager) was not sufficient for me. Ideally, I would not have to change all my coordinate values to non-strings, but if that's the only option...

Anyways, my code is currently long and complex, but the minimal example of @heikoklein does segfault in my current environment. Running my code with gdb shows the segfault is coming from HDF5_addr_decode as suggested by this issue. My code more or less follows the pseudo-code outlined there, and it follows all the necessary ingredients described in this issue thread, so I'm fairly confident the problem I'm having is consistent with this issue.

I can provide more detail if helpful, but I don't really know how to tackle these kinds of library errors and would be grateful for any assistance. It's a bit unfortunate because the last time I ran my code using xarray 0.18.2 I did not have this issue, and downgrading that far at the moment is also not an ideal solution. Thank you!

@derhintze

We hit the same bug independently on this setup:

>>> xr.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.9.10 (main, Mar 15 2022, 15:56:56) 
[GCC 7.5.0]
python-bits: 64
OS: Linux
OS-release: 3.10.0-1160.49.1.el7.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: ('en_US', 'UTF-8')
libhdf5: 1.14.2
libnetcdf: 4.9.4-development

xarray: 2024.7.0
pandas: 2.2.2
numpy: 2.0.2
scipy: 1.13.1
netCDF4: 1.7.2
pydap: None
h5netcdf: None
h5py: None
zarr: None
cftime: 1.6.4
nc_time_axis: None
iris: None
bottleneck: None
dask: None
distributed: None
matplotlib: 3.9.1
cartopy: None
seaborn: None
numbagg: None
fsspec: None
cupy: None
pint: None
sparse: None
flox: None
numpy_groupies: None
setuptools: 58.1.0
pip: 24.3.1
conda: None
pytest: 8.3.3
mypy: 1.13.0
IPython: 8.18.1
sphinx: None
