ncgen crashes on particular cdl file #323

WardF · 2016-10-28T19:10:00Z

The following CDL crashes when ncgen -k nc4 [filename] is executed on it:

netcdf stations {
types:
  compound obs_t {
    float lat ;
    float lon ;
    string stid ;
    double time ;
    float temperature ;
  }; // obs_t
dimensions:
  n = UNLIMITED ;
variables:
  obs_t obs(n) ;

data:

 obs = {39.9, -104.9, "KDEN", 7776000, 15},
       {40, -105, "KBOU", 7776000, 16} ;
}

Moving temperature before time resolves the crash. Changing the type from float to double fixes the crash. Running ncgen -k nc4 -l c test.cdl > test.c, compiling and running test.c generates a proper binary netcdf file as well.

Log files were generated for each run, and captured below. The issue appears to be that, in the failure case, the size of the compound type is incorrectly computed as 28 instead of 32.

ncgen_temp_moved_below_time.out.txt
ncgen.out.txt
test.out.txt

The text was updated successfully, but these errors were encountered:

WardF · 2016-10-28T19:20:12Z

Starting diagnosis in init_types.

WardF · 2016-10-28T19:28:49Z

Size is being computed in semantics.c:computesize(), ~line 640.

dopplershift · 2016-10-28T20:08:57Z

This reeks of some kind of padding issue. Like where the double field needs a 64-bit aligned address or something.

WardF · 2016-10-28T21:01:25Z

I had to switch gears after identifying the line in semantics.c, but I suspect you are correct.

DennisHeimbigner · 2016-10-28T22:26:39Z

On what os are you running this? Under cygwin on x86, it does not fail for me.
I can easily imagine that the compound type size is not being computed correctly.
What happens is that I have to analyze the compound type to figure out the alignment.
of the fields in the compound type. This has always been a source of potential bugs
since it is potentially dependent on os, compiler, and hardware.

dopplershift · 2016-10-28T22:27:50Z

Running on macOS, 64-bit. Do you really mean x86, or is it x86-64?

WardF · 2016-10-28T22:28:08Z

I've tested on Linux and OSX, both 64-bit

DennisHeimbigner · 2016-10-28T22:33:58Z

The issue is probably compiler related and involves
32bit vs 64 bit. I will need access to a true 64 bit linux
installation to debug it.
=Dennis

On 10/28/2016 4:27 PM, Ryan May wrote:

Running on macOS, 64-bit. Do you really mean x86, or is it x86-64?

—
You are receiving this because you commented.
Reply to this email directly, view it on GitHub
#323 (comment),
or mute the thread
https://github.com/notifications/unsubscribe-auth/AA3P23JlNInaD1q1anzG14eKMqz4jY41ks5q4nbngaJpZM4KjwtR.

WardF · 2016-10-31T16:53:13Z

Created a branch, gh323, and incorporated the cdl provided by Ryan into the ncgen netcdf4 tests.

WardF · 2016-11-14T18:22:30Z

@dmh were you able to get access to a 64-bit environment?

DennisHeimbigner · 2016-11-14T19:33:09Z

Not yet. If you want to take a stab at it, the probable
source of the error is ncgen/offset.c. This code attempts
to figure out the struct field alignments. The code
was taken from HDF5, and it may be time to revisit
that original HDF5 code to look for discrepancies.
=Dennis

On 11/14/2016 11:22 AM, Ward Fisher wrote:

@dmh https://github.com/dmh were you able to get access to a 64-bit
environment?

—
You are receiving this because you commented.
Reply to this email directly, view it on GitHub
#323 (comment),
or mute the thread
https://github.com/notifications/unsubscribe-auth/AA3P24mn_d4UGodqmO4DinBn3n8GM8THks5q-KbmgaJpZM4KjwtR.

WardF · 2016-11-14T21:19:51Z

note-to-self: semantics.c:640.

WardF · 2016-11-14T21:28:12Z

Read: start, += getpadding(), += field->typesize

Loop	offset (float-double)	offset (double-float)
0	0,0,4 (float)	0,0,4 (float)
1	4,4,8 (float)	4,4,8 (float)
2	8,8,16 (string)	8,8,16 (string)
3	16,16,20 (float)	16,16,24 (double)
4	20,24,32 (double)	24,24,28 (float)

Clearly, order matters when computing padding/offset, and the latter order crashes hdf5 for some reason. Should it? Probably not.

WardF · 2016-11-14T22:24:03Z

Looking at the C code generated by ncgen, it uses the NC_COMPOUND_OFFSET macro which in turn invokes the offsetof C macro, to compute offsets. @DennisHeimbigner Is there any reason not to use the same/similar code for ncgen? I don't know the historic reason for using the hdf5-inspired offsets.c file.

DennisHeimbigner · 2016-11-14T23:08:36Z

The problem is compile-time vs run time. The c-code can use
the NC_COMPOUND_OFFSET because it gets compiled. Generating
the nc file directly (-lb) requires the run-time equivalent.
If memory serves, there is no run-time equivalent of NC_COMPOUND_OFFSET.
=Dennis

On 11/14/2016 3:24 PM, Ward Fisher wrote:

Looking at the C code generated by ncgen, it uses the
|NC_COMPOUND_OFFSET| macro which in turn invokes the |offsetof| C macro,
to compute offsets. @DennisHeimbigner
https://github.com/DennisHeimbigner Is there any reason not to use the
same/similar code for ncgen? I don't know the historic reason for using
the hdf5-inspired offsets.c file.

—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
#323 (comment),
or mute the thread
https://github.com/notifications/unsubscribe-auth/AA3P29rP5OI1XtNHlgJtZD0siK37huO6ks5q-N-DgaJpZM4KjwtR.

DennisHeimbigner · 2016-11-14T23:29:47Z

BTW, order sometimes does matter (though rarely).
As a rule, the starting offset of a field is rounded up
to the nearest multiple of the size of the field. so given
struct { char x[3]; int y;} , the offset of x will be 0, but the offset of y will be 4, with a 1 char padding
between x and y.
On the other hand,
struct { char x[3]; double y;} , the offset of x will be 0, but the offset of y will be 8.
Technically, this is compiler dependent, but in practice, all of them adhere to the gcc rules.
Note also that using a packed structure changes things completely. I suspect that none of this is
documented anywhere outside the code.

The reason all of this is important is that netcdf-4 tried to make it possible to define a
C struct type with various fields, and then allow direct read/write of data between that
struct and the underlying HDF5 file. In HDF5 (and netcdf-4), defining a field in a compound
type requires specifying its offset from the beginning of the structure. So it was necessary for
the ncgen to properly determine the offsets at run-time.

I should note that there is undocumented code (ncaux.[ch]) that allows a user to
build a compound type without having to specify the offsets: they are computed using
essentially the same code as used by ncgen.

WardF · 2016-11-15T17:21:12Z

Breadcrumbs: Section 4.3.2.1

https://support.hdfgroup.org/HDF5/doc1.6/UG/11_Datatypes.html

WardF · 2016-11-15T17:23:08Z

Technically, this is compiler dependent, but in practice, all of them adhere to the gcc rules.
Note also that using a packed structure changes things completely. I suspect that none of this is
documented anywhere outside the code.

I need to check to see what Visual Studio does; I hope it also adheres to gcc rules but who knows, considering all the other weird stuff they've done.

WardF · 2016-11-15T20:39:09Z

Misc. notes:

gh323.c = C code generated by ncgen -l c -k nc4 gh323.cdl > gh323.c

Sample CDL where a compound data type including a double followed by a smaller data type such as int or char results in either a crash, such as seen in the CDL above, _or silently fails after writing garbage to the binary netcdf file_.
gh323.c (temporarily stored in ncgen/ while debugging this) contains diagnostic info, and confirms the offsets computed are the same as those in the table above, but the final struct size is 32 not 28.
Trying to nail down behavior and why expected behavior isn't manifesting lead me to this page on wikipedia re: Data structure alignment, where the following statement is made:

It is important to note that the last member is padded with the number of bytes required so that the total size of the structure should be a multiple of the largest alignment of any structure member (alignment(int) in this case, which = 4 on linux-32bit/gcc)[citation needed].

That squares with what my understanding but I'll have to turn to other sources for a definitive source. This also explains why size = 28 is failing. ~~~However, attempting to correct this in semantics.c:~654 with some computations fixes the test case for this issue, but causes a bunch of new failures in ncdump/tst_ncgen4.sh that have yet to be diagnosed/investigated. Before going down that rabbit hole I'll need to convince myself that the approach is correct.~~~

UPDATE: I corrected my fix (I was using the wrong variable to compute the "largest alignment multiplier"). Correcting this corrected the failures previously reported re: tst_ncgen4.sh. I still need to convince myself this fix is correct, but I'm optimistic.

WardF · 2016-11-15T20:51:20Z

Cleanup todo list

Scratchpad for temporary files in netcdf-c/ that need to be addressed before merging into master.

ncgen/gh323_ncgen.c -> Remove
RELEASE_NOTES.md -> Update
gdb.txt -> Remove
align_test.c -> Remove
ncgen/gh323-c.cdl -> Remove
ncgen/gh323-b.cdl -> Remove
ncgen/gh323.cdl -> Rename
ncgen/run_nc4_tests.sh -> Update w/ gh323.cdl name, message.

WardF · 2016-11-15T22:12:27Z

Testing on ARM, Windows (with visual studio) appear to be working so far. Assuming these work, I will see about issuing a pull request to make code review easier.

DennisHeimbigner · 2016-11-15T22:20:41Z

I pushed a branch called issue323.dmh.
can you please try the following.

extract ncgen/offsets.[ch] from that branch
compile on a 64-bit machine as: ${CC} -DOFFSETTEST offsets.c -o offsets.exe
execute offsets.exe and send me the output.

WardF · 2016-11-15T22:24:33Z

Hi Dennis,

The output is below. I've also gotten something that appears to be fixing
the issue in, in branch gh323; I've outlined the fix I found on the github
page.

Here's the output:

null : size= 0 alignment= 0
char : size= 2 alignment= 1
unsigned char : size= 2 alignment= 1
short : size= 4 alignment= 2
unsigned short : size= 4 alignment= 2
int : size= 8 alignment= 4
unsigned int : size= 8 alignment= 4
long long : size=16 alignment= 8
unsigned long long : size=16 alignment= 8
float : size= 8 alignment= 4
double : size=16 alignment= 8
void* : size=16 alignment= 8
nc_vlen_t : size=24 alignment= 8
short vs null : size= 0 alignment= 0
short vs char : size= 4 alignment= 2
short vs unsigned char : size= 4 alignment= 2
short vs short : size= 4 alignment= 2
short vs unsigned short : size= 4 alignment= 2
short vs int : size= 8 alignment= 4
short vs unsigned int : size= 8 alignment= 4
short vs long long : size=16 alignment= 0
short vs unsigned long long : size=16 alignment= 0
short vs float : size= 8 alignment= 4
short vs double : size=16 alignment= 8
short vs void* : size=16 alignment= 8
short vs nc_vlen_t : size=16 alignment= 8
int vs null : size= 0 alignment= 0
int vs char : size= 8 alignment= 2
int vs unsigned char : size= 8 alignment= 2
int vs short : size= 8 alignment= 2
int vs unsigned short : size= 8 alignment= 2
int vs int : size= 8 alignment= 4
int vs unsigned int : size= 8 alignment= 4
int vs long long : size=16 alignment= 8
int vs unsigned long long : size=16 alignment= 8
int vs float : size= 8 alignment= 4
int vs double : size=16 alignment= 8
int vs void* : size=16 alignment= 8
int vs nc_vlen_t : size=16 alignment= 8

On Tue, Nov 15, 2016 at 3:20 PM, DennisHeimbigner notifications@github.com
wrote:

—
You are receiving this because you were assigned.
Reply to this email directly, view it on GitHub
#323 (comment),
or mute the thread
https://github.com/notifications/unsubscribe-auth/AEH-UoxrHKcLC01bwdb2P46lOPr6Tnm9ks5q-jA5gaJpZM4KjwtR
.

DennisHeimbigner · 2016-11-15T22:37:10Z

I think the situation is a lot worse than I realized.
The tests that need to run are:

write file on 32 bit machine, read on a 64 bit machine
write on a 64 bit machine, read on a 32 bit machine.
The think I am afraid of is this: the offsets encoded into
the hdf5/netcdf-4 file are determined by the machine on which
the file was written. If the machine doing the reading is 64-bit
and has different offsets as set by the compiler, then the cross-
machine reads will fail.
=Dennis

On 11/15/2016 3:24 PM, Ward Fisher wrote:

Hi Dennis,

The output is below. I've also gotten something that appears to be fixing
the issue in, in branch gh323; I've outlined the fix I found on the github
page.

Here's the output:

null : size= 0 alignment= 0
char : size= 2 alignment= 1
unsigned char : size= 2 alignment= 1
short : size= 4 alignment= 2
unsigned short : size= 4 alignment= 2
int : size= 8 alignment= 4
unsigned int : size= 8 alignment= 4
long long : size=16 alignment= 8
unsigned long long : size=16 alignment= 8
float : size= 8 alignment= 4
double : size=16 alignment= 8
void* : size=16 alignment= 8
nc_vlen_t : size=24 alignment= 8
short vs null : size= 0 alignment= 0
short vs char : size= 4 alignment= 2
short vs unsigned char : size= 4 alignment= 2
short vs short : size= 4 alignment= 2
short vs unsigned short : size= 4 alignment= 2
short vs int : size= 8 alignment= 4
short vs unsigned int : size= 8 alignment= 4
short vs long long : size=16 alignment= 0
short vs unsigned long long : size=16 alignment= 0
short vs float : size= 8 alignment= 4
short vs double : size=16 alignment= 8
short vs void* : size=16 alignment= 8
short vs nc_vlen_t : size=16 alignment= 8
int vs null : size= 0 alignment= 0
int vs char : size= 8 alignment= 2
int vs unsigned char : size= 8 alignment= 2
int vs short : size= 8 alignment= 2
int vs unsigned short : size= 8 alignment= 2
int vs int : size= 8 alignment= 4
int vs unsigned int : size= 8 alignment= 4
int vs long long : size=16 alignment= 8
int vs unsigned long long : size=16 alignment= 8
int vs float : size= 8 alignment= 4
int vs double : size=16 alignment= 8
int vs void* : size=16 alignment= 8
int vs nc_vlen_t : size=16 alignment= 8

On Tue, Nov 15, 2016 at 3:20 PM, DennisHeimbigner notifications@github.com
wrote:

—
You are receiving this because you were assigned.
Reply to this email directly, view it on GitHub
#323 (comment),
or mute the thread

https://github.com/notifications/unsubscribe-auth/AEH-UoxrHKcLC01bwdb2P46lOPr6Tnm9ks5q-jA5gaJpZM4KjwtR
.

—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
#323 (comment),
or mute the thread
https://github.com/notifications/unsubscribe-auth/AA3P2-uRYJJJtpP_vdjXEC6w5y3gfb_sks5q-jEigaJpZM4KjwtR.

WardF · 2016-11-16T01:42:02Z

I've tested 1 & 2 and haven't had a problem reading either; I tested branch gh323 with the fix described above and there was no problem reading the files generated on 32/64-machines by their opposite. I've attached the files here, the filename indicates which platform the file was written on.

I thought the file offsets were handled by the I/O layer; otherwise, wouldn't netCDF4 files have never been machine-independent in the first place? I'll test these on Windows as well, and ARM when I'm in the office tomorrow.

ncfile.zip

WardF · 2016-11-16T02:12:37Z

Update: The files are reading just fine between platforms, but are being written differently. Specifically, it is being written incorrectly on the 64-bit platform. So, more work is needed.

Written on 32-bit platform:

netcdf compound_datasize_test_32 {
types:
  compound obs_t {
    float lat ;
    float lon ;
    string stid ;
    double time ;
    float temperature ;
  }; // obs_t
dimensions:
    n = UNLIMITED ; // (2 currently)
variables:
    obs_t obs(n) ;
data:

 obs = {39.9, -104.9, "KDEN", 7776000, 15}, {40, -105, "KBOU", 7776000, 16} ;

Written on 64-bit platform

netcdf compound_datasize_test_64 {
types:
  compound obs_t {
    float lat ;
    float lon ;
    string stid ;
    double time ;
    float temperature ;
  }; // obs_t
dimensions:
    n = UNLIMITED ; // (2 currently)
variables:
    obs_t obs(n) ;
data:

 obs = {39.9, -104.9, "KDEN", 7776000, 15}, {-105, 0, "KBOU", 7776000, 16} ;

The results remain the same whether I read from a 32-bit or 64-bit platform.

DennisHeimbigner · 2016-11-16T03:05:19Z

Note this netcdf API function:
EXTERNL int
nc_insert_compound(int ncid, nc_type xtype, const char *name,
size_t offset, nc_type field_typeid);
Note that in the C code, the offset is determined by the offset of the C struct definition.
Someone (the HDF5 layer) is translating the offsets, apparently.

WardF · 2016-11-16T17:49:10Z

Hi Dennis; I had noticed the nc_insert_compound() function; when I looked into what the HDF5 function was doing, they were also using the offsetof C macro for runtime scripts. Still not sure what's going on with ncgen but I've just sat down to try to debug this issue with data being written incorrectly on 64-bit systems. Do you have any insight into a fix?

WardF · 2016-11-16T17:50:42Z

Additionally, I notice the tests in run_nc4_tests.sh only run ncgen to see if it exits with 0; no check to ensure the data is correct is made. I'm updating these tests to make sure they generate what is expected.

WardF · 2016-11-16T18:11:00Z

Ok, current status; the crash is gone but, on 64-bit systems, data is getting scrambled in the CDL file as represented above. The fix is the same as it was for the crash; moving the float below the double solves the issue.

DennisHeimbigner · 2016-11-16T23:32:04Z

This is all beginning to come back. I did an H5Dump on a netcdf-4 file
containing a compound type and it looks like the offsets are not stored in the file.
Assuming that, then the HDF5 library translates between the file layout and the in-memory
layout on the fly. So the data in the file is machine and compiler independent but
the in-memory layout is not.
Since I took the offsets.c code from HDF5, I hope that what we are seeing is not also
a problem with the hdf5 library.

DennisHeimbigner · 2016-11-16T23:35:46Z

Also, let me make sure I understand. You seem to be saying that even when you round up
the size of the whole structure, 64-bit still fails, correct? does it matter if the variable
is dimensioned?

DennisHeimbigner · 2016-11-17T00:07:05Z

So I was looking at the code in semantics.c and it seems to do what you
want about bumping the size of the structure. I think the variable largsize
is unneeded, but whatever.

WardF · 2016-11-17T17:56:46Z

Note from meeting w/ Dennis: probably something in the byte buffer construction code. It looks like the first 4 bytes of the data array (starting at the second element) are being discarded.

DennisHeimbigner · 2016-11-17T22:35:20Z

Ok, I am testing a fix now. Sadly, I was not only mishandling padding between
struct instances, but also within in some case. I have added another test to
try out nested compounds. Seems to work. Let you know when the code passes
distcheck.

WardF · 2016-11-17T22:41:53Z

Thanks Dennis, I did a spot test on my end and, under cmake at least, everything appears to be fixed. I'll wait for travis to confirm the fix, and assuming everything passes I'll flesh out the release notes as well as making sure that all the test scripts are checking the output when ncgen is run.

DennisHeimbigner · 2016-11-17T23:19:02Z

distcheck passed, so I guess we are just waiting on travis.
BTW, we will need to consider adding more tests with various
kinds of compound nesting and intermixing with e.g. vlens.

WardF · 2016-11-17T23:49:55Z

Travis has passed, as has 32-bit ARM. Testing on Windows now with Visual Studio (32, 64 bit). Will continue to flesh out tests and such, but assuming I don't find any failures I will merge into master and create a maintenance release ASAP.

dopplershift · 2016-11-18T00:19:33Z

As an aside, are you interested in AppVeyor as a Travis-like thing for windows? I can help get you set up--I'm just no good at windows building.

dopplershift · 2016-11-18T00:19:48Z

(I have a Unidata org for it)

WardF · 2016-11-18T02:14:34Z

@dopplershift possibly! I'll investigate it and follow up with you; the required dependencies are the only thing I wonder about.

WardF · 2017-03-24T21:55:44Z

Apparently I never closed this; reviewing to ensure it's solved and then closing.

WardF added type/bug status/fix required area/netcdf4 platform/any status/confirmed labels Oct 28, 2016

WardF added this to the 4.4.1.1 milestone Oct 28, 2016

WardF self-assigned this Oct 28, 2016

WardF added a commit that referenced this issue Oct 31, 2016

Added a test file for #323

0c83a59

WardF mentioned this issue Nov 15, 2016

nc_test fails for 4.4.1 on mpich build - needs to work to support HPC and parallel I/O #326

Closed

12 tasks

WardF added a commit that referenced this issue Nov 15, 2016

*possible* fix for #323 but further investigation required.

6c6bccf

WardF added a commit that referenced this issue Nov 18, 2016

Updated release notes for #323

d14227e

WardF modified the milestones: 4.4.2, 4.4.1.1 Dec 28, 2016

WardF closed this as completed May 24, 2017

ncgen crashes on particular cdl file #323

ncgen crashes on particular cdl file #323

Comments

WardF commented Oct 28, 2016

WardF commented Oct 28, 2016

WardF commented Oct 28, 2016

dopplershift commented Oct 28, 2016

WardF commented Oct 28, 2016

DennisHeimbigner commented Oct 28, 2016

dopplershift commented Oct 28, 2016

WardF commented Oct 28, 2016

DennisHeimbigner commented Oct 28, 2016

WardF commented Oct 31, 2016

WardF commented Nov 14, 2016

DennisHeimbigner commented Nov 14, 2016

WardF commented Nov 14, 2016

WardF commented Nov 14, 2016 • edited Loading

WardF commented Nov 14, 2016

DennisHeimbigner commented Nov 14, 2016

DennisHeimbigner commented Nov 14, 2016

WardF commented Nov 15, 2016

WardF commented Nov 15, 2016

WardF commented Nov 15, 2016 • edited Loading

WardF commented Nov 15, 2016 • edited Loading

Cleanup todo list

WardF commented Nov 15, 2016

DennisHeimbigner commented Nov 15, 2016

WardF commented Nov 15, 2016

DennisHeimbigner commented Nov 15, 2016

WardF commented Nov 16, 2016

WardF commented Nov 16, 2016

Written on 32-bit platform:

Written on 64-bit platform

DennisHeimbigner commented Nov 16, 2016

WardF commented Nov 16, 2016

WardF commented Nov 16, 2016

WardF commented Nov 16, 2016

DennisHeimbigner commented Nov 16, 2016

DennisHeimbigner commented Nov 16, 2016

DennisHeimbigner commented Nov 17, 2016

WardF commented Nov 17, 2016

DennisHeimbigner commented Nov 17, 2016

WardF commented Nov 17, 2016

DennisHeimbigner commented Nov 17, 2016

WardF commented Nov 17, 2016

dopplershift commented Nov 18, 2016

dopplershift commented Nov 18, 2016

WardF commented Nov 18, 2016

WardF commented Mar 24, 2017

WardF commented Nov 14, 2016 •

edited

Loading

WardF commented Nov 15, 2016 •

edited

Loading

WardF commented Nov 15, 2016 •

edited

Loading