
Error when dimension shares name with variable in netcdf-4 classic file #1772

Open
ellenjohnson opened this issue Jun 30, 2020 · 17 comments

@ellenjohnson

Hello everyone!

Short question: in a netCDF-4 classic file, can a fixed-size dimension share a name with a previously defined variable?

A customer reported a MATLAB crash with the following steps using our nccreate function. To me, it doesn't make sense to name the second variable's dimension after the first variable's name, but I'd like to understand the failure. I'm wondering if this is related to issue #1528, though in our case the dimensions are fixed-size. We're using netcdf 4.7.3, and the problem is not OS-specific.

nccreate('test.nc','var1','Dimensions',{'dim1',3}); % one-dimensional, length-3 vector
nccreate('test.nc','var2','Dimensions',{'var1',2}); % one-dimensional, length-2 vector

This errors during nc_close() with -101 (NC_EHDFERR), and then we crash (oops). Note: nccreate defaults to netCDF-4 classic. Calling it twice in succession like that means we first create the new file test.nc, write var1 with dim1 to it, and close; then we reopen for append, create var2 with a dimension named var1, and close.

I wrote a netcdf-c program (attached, netcdfRepro_VarDimError) to reproduce it (using netcdf-c v4.7.3) and still get the same NC_EHDFERR error, but no segfault or similar crash. (My repro is a .c file, but it looks like GitHub won't let me attach one, so I renamed it to .txt.)
netcdfRepro_VarDimError.txt
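
In outline, the repro does roughly the following (a condensed sketch; the CHECK macro and the variable types here are illustrative):

/* Pass 1: create test.nc with var1(dim1); pass 2: reopen and add var2
 * whose dimension is named after var1. Link with -lnetcdf. */
#include <stdio.h>
#include <netcdf.h>

#define CHECK(e) do { int s = (e); if (s != NC_NOERR) \
    printf("line %d: %s\n", __LINE__, nc_strerror(s)); } while (0)

int main(void) {
    int ncid, dimid, varid;

    /* Pass 1: netCDF-4 classic model, var1 with dimension dim1. */
    CHECK(nc_create("test.nc", NC_NETCDF4 | NC_CLASSIC_MODEL, &ncid));
    CHECK(nc_def_dim(ncid, "dim1", 3, &dimid));
    CHECK(nc_def_var(ncid, "var1", NC_DOUBLE, 1, &dimid, &varid));
    CHECK(nc_close(ncid));

    /* Pass 2: reopen and add var2 whose dimension is named "var1". */
    CHECK(nc_open("test.nc", NC_WRITE, &ncid));
    CHECK(nc_redef(ncid));
    CHECK(nc_def_dim(ncid, "var1", 2, &dimid));   /* name clashes with var1 */
    CHECK(nc_def_var(ncid, "var2", NC_DOUBLE, 1, &dimid, &varid));
    CHECK(nc_enddef(ncid));
    CHECK(nc_close(ncid));                        /* reports NC_EHDFERR here */
    return 0;
}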

@DennisHeimbigner
Collaborator

It looks like this is driven by that bane of our existence: HDF5 dim scales. My speculation is that this particular combination ends up trying to create two datasets (one a hidden dimscale) with the same name. But Ed Hartnett might know for sure.

@DennisHeimbigner
Collaborator

Short answer: you probably can't do this using the current HDF5+netcdf-c. It would require us to avoid using dimension scales, and that is not likely any time soon.

@ellenjohnson
Author

Thanks, Dennis. So to clarify: is the NC_EHDFERR from nc_close() expected netCDF behavior (i.e., a limitation), or unexpected (i.e., a bug) that just won't be fixed anytime soon? We'll fix our MATLAB code so we don't crash on this error, but it would be great to know whether this is a netCDF limitation (which we can then document for our customers) or a netCDF bug.

In our case, the customer has an easy workaround: don't name the dimension after the previously defined variable. I'm not sure why they were doing that in the first place, unless it has something to do with netCDF coordinate variables. I know a variable should only be given the same name as a dimension if it's to be used as a coordinate variable, but does that apply in reverse? Meaning, are there restrictions (other than the error) on naming a dimension after a predefined variable? And does that have anything to do with coordinate variables?

@DennisHeimbigner
Collaborator

Technically, it is an error, since AFAIK the spec says nothing about this being illegal. But calling it a limitation of the netCDF-4-over-HDF5 implementation seems proper. I will see if I can add a description of this somewhere in our documentation. And no, it is unlikely to be fixed any time soon.
You can get an insight into this by setting this environment variable:

export NETCDF_LOG_LEVEL=0

and running your test program. In the HDF5 error log output you will see a complaint about a duplicate name.

As for this question:

"are there restrictions (other than the error) on naming a dimension the same name as a predefined variable?"

I do not know for sure.

@edwardhartnett
Contributor

The best way to get a fix would be to submit a C test which demonstrates the problem.

There are netCDF-4 limitations with respect to the naming of dimensions and variables. This is due to the lack of support for shared dimensions in HDF5: creating a netCDF dimension actually creates an HDF5 dataset (i.e., a variable) with some special metadata.

If you then create a var with the same name, it needs to delete the dataset created for the dim, and create a new one, which is a coordinate variable, representing both the variable and the dimension in one HDF5 dataset.

However, apparently the code does not handle the case where you want a var of the same name, but not a coordinate var, when you do things in a certain order. This will be order dependent, so try doing things in a different order and that could well work. (For example, I think if you create the var first, and then a dim with the same name, that will work correctly.)
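
For concreteness, here is a minimal sketch of the standard coordinate-variable pattern (dimension first, then a variable of the same name), which does work; the name "time" is illustrative and error checking is omitted for brevity:

#include <netcdf.h>

int main(void) {
    int ncid, dimid, varid;
    nc_create("coord.nc", NC_NETCDF4 | NC_CLASSIC_MODEL, &ncid);
    nc_def_dim(ncid, "time", 10, &dimid);                 /* dimension first */
    nc_def_var(ncid, "time", NC_INT, 1, &dimid, &varid);  /* coordinate variable */
    return nc_close(ncid);  /* one HDF5 dataset represents both dim and var */
}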

@DennisHeimbigner
Collaborator

I stand corrected. Ed's answer is the one you should accept.

@ellenjohnson
Author

Thanks Ed and Dennis.
@edhartnett: I did create a C repro that gives the error; I attached it to my initial question (see netcdfRepro_VarDimError). It's a C file, but I had to rename it to .txt because GitHub wouldn't let me attach a .c file. The NC_EHDFERR happens at the call to nc_close(). I linked with netCDF 4.7.3 and HDF5 1.8.12.

I mimicked what our nccreate function does by calling it twice in a row, like the customer did; you'll see it in the repro code. Can you please take a look at the C repro and let me know if it helps explain the problem?

Let me know if you'd rather have me copy and paste the C repro directly in the comment.

@nvishnoiMW

nvishnoiMW commented Jul 17, 2020

@edwardhartnett and @DennisHeimbigner: I was working on the crash that Ellen mentioned happened in MATLAB (after we got the NC_EHDFERR error). I found that the crash is triggered when nc_close(ncid) is called again after the first nc_close reported the error. I have attached the modified repro code to this post. Here is sample output from running the code with netcdf 4.7.3:

status after close = 0
error code after close = -33
reopening file to write a second variable...
status = 0
error code = 0
error code after close = -101
main.o: nc4hdf.c:1438: attach_dimscales: Assertion `dsid > 0' failed.
Abort (core dumped)

Interestingly, I could only reproduce the crash with netcdf 4.6.1 and later versions. Calling nc_close(ncid) twice with 4.3.3.1 did not crash, but gave the following output:

status after close = 0
error code after close = -33
reopening file to write a second variable...
status = 0
error code = 0
error code after close = -101
error code after close = -33
End of test.

which is expected: as you can see, in the "no error" case calling nc_close twice returned -33 (NC_EBADID). I would expect the same in the error case as well. Can you please confirm whether this is a bug in later versions of netcdf? Something probably changed between 4.3.3.1 and 4.6.1 that causes this crash.
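
For reference, a minimal sketch of the double-close behavior in isolation (file name illustrative): after a successful nc_close the ncid is stale, so the second call should simply return NC_EBADID (-33) rather than crash.

#include <stdio.h>
#include <netcdf.h>

int main(void) {
    int ncid;
    nc_create("dup_close.nc", NC_NETCDF4 | NC_CLASSIC_MODEL, &ncid);
    printf("first close:  %d\n", nc_close(ncid));  /* 0 (NC_NOERR) */
    printf("second close: %d\n", nc_close(ncid));  /* -33 (NC_EBADID) */
    return 0;
}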

Thanks,
Nalini

netcdfRepro_VarDimErrorAndCrash.txt


@nvishnoiMW

Hi @edwardhartnett and @DennisHeimbigner,

Any comments on the crash I saw?

Thanks,
Nalini

@edwardhartnett
Contributor

Firstly, thanks for all the hard work tracking this down!

I'm sorry that this issue, like so many, does not get the attention it deserves. ;-) I wish that there was a giant team of programmers working on netCDF, but unfortunately there is only a small team of (in one case) giant programmers.

The question of what happens when you call nc_close() again, after calling it once and getting an error, is undefined. Basically, don't do that.

I agree it should not crash, however. So this is a second bug.

In terms of the first bug, the dimension scales issue is a difficult one. I will take a look when I get a chance.

@MT2017090

Hi @edwardhartnett and @DennisHeimbigner ,

I work closely with Nalini and Ellen on providing NetCDF functionality in MATLAB. I wanted to check whether there is any update on the crash issue Nalini reported. Please let me know if you have any questions about our workflow.

Thanks,
Rajat

@edwardhartnett
Contributor

Sorry, no update, nor is there likely to be any progress for some time.

@pp-mo

pp-mo commented Nov 10, 2021

Just noting that this is still an issue. Our experience is mediated via Python, though.

Minimal:

import netCDF4 as nc
ds = nc.Dataset('tmp.nc', mode='w', format='NETCDF4')
ds.createVariable(varname='same_name', datatype=int, dimensions=())  # scalar variable first
ds.createDimension(dimname='same_name', size=4)  # then a dimension with the same name
ds.close()

Resulting in:

Traceback (most recent call last):
  File "minimal_hfd_error_case.py", line 5, in <module>
    ds.close()
  File "src/netCDF4/_netCDF4.pyx", line 2465, in netCDF4._netCDF4.Dataset.close
  File "src/netCDF4/_netCDF4.pyx", line 2429, in netCDF4._netCDF4.Dataset._close
  File "src/netCDF4/_netCDF4.pyx", line 1927, in netCDF4._netCDF4._ensure_nc_success
RuntimeError: NetCDF: HDF error

Alternatives:

  • creating the dimension before the variable does not hit the error.
  • creating the file with format "NETCDF3_CLASSIC" has no problem.

We are writing a generic library which simply uses this as its standard storage format, and I must say this seems to cause a nasty lack of generality (!!). I guess we will need to worry about keeping these names different in such cases.

@edwardhartnett
Contributor

Sorry, this is a long-known issue, deeply embedded in the core netCDF-4 code and in HDF5's way of dealing with dimensions.

The work-around is to define the dimension first.
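
For example, applying the work-around to the case above in C (a sketch; error checking omitted, assuming the NETCDF4 format as in the Python example):

#include <netcdf.h>

int main(void) {
    int ncid, dimid, varid;
    nc_create("tmp.nc", NC_NETCDF4, &ncid);
    nc_def_dim(ncid, "same_name", 4, &dimid);                  /* dimension first */
    nc_def_var(ncid, "same_name", NC_INT64, 0, NULL, &varid);  /* then the scalar var */
    return nc_close(ncid);  /* closes without the HDF error */
}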

@pp-mo

pp-mo commented Nov 12, 2021

Just to expand on the "like": many thanks @edwardhartnett for a definite response. This is really helpful!
Now looking to engineer workarounds in Iris for this.

@Alexander-Barth
Contributor

Alexander-Barth commented Jul 31, 2024

I think we are also hitting this bug in Julia's NCDatasets.jl. In addition to the HDF5 error (code -101), this can also lead to an assertion failure (Assertion failed: (dsid > 0), function attach_dimscales, file nc4hdf.c, line 1417.) when the file is closed, followed by an abort.
As in Python, where a with-block and context manager automatically close the opened file, the equivalent constructs in Julia are frequently used. So the call to nc_close is implicit, and all the user sees in this case is an aborted Julia session.

For what it is worth, here are the steps in Julia leading to an assertion error when an existing variable's name is reused as a dimension name:

using NCDatasets
ou = NCDataset("test.nc", "c") # nc_create
defDim(ou, "TIME1_10", 10) # nc_def_dim
nctax = defVar(ou, "tax", Int32,("TIME1_10",)) # nc_def_var
nctax[:] = 1:10 # nc_put_var of the data 1,2,3...10
defDim(ou, "tax", 10)  # should netCDF-c rather raise an error already here?
ncmyvar = defVar(ou, "myvar", Int32, ("tax",))
ncmyvar[:] = 1:10 # HDF error
close(ou) # assertion error

Given that this is a long-standing issue, I am wondering if it would be possible:

  • to not abort when the file is closed
  • to return a more specific error code than the generic HDF5 error, with a corresponding error message; maybe reusing one of these:
NC_EDIMMETA      (-106)    // Problem with dimension metadata.
NC_EDIMSCALE     (-124)    // Problem with HDF5 dimscales.
  • to raise the error at definition time rather than at nc_put_var or nc_close (which would prevent this dimension from being used at all); until then, a caller-side pre-check like the sketch below can catch the clash early
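
A minimal sketch of such a pre-check using only the existing API (the helper name def_dim_checked is hypothetical):

#include <netcdf.h>

/* Refuse to define a dimension whose name already belongs to a variable
 * in the same group (the case that currently fails at nc_close). */
static int def_dim_checked(int ncid, const char *name, size_t len, int *dimidp) {
    int varid;
    if (nc_inq_varid(ncid, name, &varid) == NC_NOERR)
        return NC_ENAMEINUSE;  /* name already in use by a variable */
    return nc_def_dim(ncid, name, len, dimidp);
}
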
Full stack trace on nc_close:
julia> NCDatasets.nc_close(ou.ncid)
julia: nc4hdf.c:1469: attach_dimscales: Assertion `dsid > 0' failed.

[1519179] signal (6.-6): Aborted
in expression starting at REPL[2]:1
pthread_kill at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
raise at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
abort at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
unknown function (ip: 0x7f85a6b5c71a)
__assert_fail at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
attach_dimscales at /home/abarth/.julia/artifacts/0e6efa1634bd5e3defc1a7be42133121471ac346/lib/libnetcdf.so (unknown line)
nc4_rec_write_metadata at /home/abarth/.julia/artifacts/0e6efa1634bd5e3defc1a7be42133121471ac346/lib/libnetcdf.so (unknown line)
sync_netcdf4_file at /home/abarth/.julia/artifacts/0e6efa1634bd5e3defc1a7be42133121471ac346/lib/libnetcdf.so (unknown line)
nc4_close_hdf5_file at /home/abarth/.julia/artifacts/0e6efa1634bd5e3defc1a7be42133121471ac346/lib/libnetcdf.so (unknown line)
NC4_close at /home/abarth/.julia/artifacts/0e6efa1634bd5e3defc1a7be42133121471ac346/lib/libnetcdf.so (unknown line)
nc_close at /home/abarth/.julia/artifacts/0e6efa1634bd5e3defc1a7be42133121471ac346/lib/libnetcdf.so (unknown line)
nc_close at /home/abarth/.julia/dev/NCDatasets/src/netcdf_c.jl:1224
unknown function (ip: 0x7f858f7ebc47)
_jl_invoke at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/gf.c:2894 [inlined]
ijl_apply_generic at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/gf.c:3076
jl_apply at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/julia.h:1982 [inlined]
do_call at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/interpreter.c:126
eval_value at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/interpreter.c:223
eval_stmt_value at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/interpreter.c:174 [inlined]
eval_body at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/interpreter.c:617
jl_interpret_toplevel_thunk at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/interpreter.c:775
jl_toplevel_eval_flex at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/toplevel.c:934
jl_toplevel_eval_flex at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/toplevel.c:877
jl_toplevel_eval_flex at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/toplevel.c:877
jl_toplevel_eval_flex at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/toplevel.c:877
ijl_toplevel_eval_in at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/toplevel.c:985
eval at ./boot.jl:385 [inlined]
eval_user_input at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/usr/share/julia/stdlib/v1.10/REPL/src/REPL.jl:150
repl_backend_loop at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/usr/share/julia/stdlib/v1.10/REPL/src/REPL.jl:246
#start_repl_backend#46 at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/usr/share/julia/stdlib/v1.10/REPL/src/REPL.jl:231
start_repl_backend at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/usr/share/julia/stdlib/v1.10/REPL/src/REPL.jl:228
_jl_invoke at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/gf.c:2894 [inlined]
ijl_apply_generic at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/gf.c:3076
#run_repl#59 at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/usr/share/julia/stdlib/v1.10/REPL/src/REPL.jl:389
run_repl at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/usr/share/julia/stdlib/v1.10/REPL/src/REPL.jl:375
jfptr_run_repl_91689.1 at /mnt/data1/abarth/opt/julia-1.10.0/lib/julia/sys.so (unknown line)
_jl_invoke at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/gf.c:2894 [inlined]
ijl_apply_generic at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/gf.c:3076
#1013 at ./client.jl:432
jfptr_YY.1013_82677.1 at /mnt/data1/abarth/opt/julia-1.10.0/lib/julia/sys.so (unknown line)
_jl_invoke at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/gf.c:2894 [inlined]
ijl_apply_generic at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/gf.c:3076
jl_apply at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/julia.h:1982 [inlined]
jl_f__call_latest at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/builtins.c:812
#invokelatest#2 at ./essentials.jl:887 [inlined]
invokelatest at ./essentials.jl:884 [inlined]
run_main_repl at ./client.jl:416
exec_options at ./client.jl:333
_start at ./client.jl:552
jfptr__start_82703.1 at /mnt/data1/abarth/opt/julia-1.10.0/lib/julia/sys.so (unknown line)
_jl_invoke at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/gf.c:2894 [inlined]
ijl_apply_generic at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/gf.c:3076
jl_apply at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/julia.h:1982 [inlined]
true_main at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/jlapi.c:582
jl_repl_entrypoint at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/jlapi.c:731
main at julia (unknown line)
unknown function (ip: 0x7f85a6b5dd8f)
__libc_start_main at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
unknown function (ip: 0x4010b8)
Allocations: 5495319 (Pool: 5490859; Big: 4460); GC: 7
Aborted (core dumped)

@DennisHeimbigner
Collaborator

Is PR #2161 relevant to this?
