
Need a unified source of standard netCDF attributes #525

Open · mhidas opened this issue Jul 11, 2016 · 27 comments
@mhidas (Contributor) commented Jul 11, 2016

Many of our processes that create netCDF files require templates to set global (and variable) attributes. Currently this is being done in a variety of different ways, from multiple sources, often setting the same basic attributes (project, acknowledgements, etc.):

  • the Matlab Toolbox has text files like this for global attributes, and imosParameters.txt for variable attributes
  • IMOSnetCDF.py uses text files like this for both global and variable attributes. The format used is consistent with CDL, though the files are not complete CDL files. (It also reads a copy of the imosParameters.txt file from the Toolbox, which is out of date now...)
  • Loz has a generate_netcdf_att.py script which reads global and variable attributes from a config file
  • the ACORN current_generator code simply sets attributes in an acorn_constants.py file.

There may be others too...

There are two issues here:

  1. We have redundant code doing the same thing in different ways.
  2. We have redundant versions of the same standard global attributes in several locations.

It would be helpful to come up with a solution that removes, or at least minimises, both of these issues.

@mhidas (Contributor, Author) commented Jul 11, 2016

In my opinion, CDL would be the most appropriate format to use for these templates. These can very easily be converted into a netCDF file with ncgen and then opened as a netCDF4 Dataset object.

As for the location, something like data-services/lib/netcdf/templates would make sense. The tricky bit will be how to make this available to the Matlab Toolbox as well...
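
For concreteness, here's a minimal sketch of that workflow, assuming ncgen (from the netCDF utilities) is on the PATH; dataset_from_cdl is a hypothetical helper name, not existing code:

import subprocess
import netCDF4

def dataset_from_cdl(cdl_path, nc_path):
    # ncgen writes a netCDF file from the CDL text representation
    subprocess.check_call(['ncgen', '-o', nc_path, cdl_path])
    # reopen in append mode so data and attributes can still be added
    return netCDF4.Dataset(nc_path, 'a')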

@lbesnard (Contributor) commented:

> converted into a netCDF file with ncgen

The only issue with this is that we would have to do a subprocess call from Python, which is never an ideal solution.

@mhidas (Contributor, Author) commented Jul 26, 2018

Format options for storing netCDF attributes

There are many ways fixed attribute values (such as the license, disclaimer and acknowledgement strings required by the IMOS conventions) could be stored. The idea is to keep these separate from the code that creates netCDF files. Here are some options:

CDL (Common Data Language)

Currently, IMOSnetCDF.py in data-services uses text files like this; they are not complete CDL files, but they store global and variable attributes in the same way as CDL.

Pros

  • CDL is the standard text representation of a NetCDF file. Some existing tools (e.g. nco) can already read/write it.
  • Besides attributes, it can also represent the file structure (dimensions, variable types) and include actual data values.
  • Already have code in data-services to parse attribute definitions (https://github.com/aodn/data-services/blob/master/lib/python/IMOSnetCDF.py)

Cons

  • No standard Python package to read in CDL directly (cdlparser could be a starting point, but doesn't seem to be actively maintained)

conf (Windows INI syntax)

e.g. https://github.com/aodn/data-services/blob/master/SOOP/SOOP_XBT/DELAYED/generate_nc_file_att

Pros

Cons

json

Pros

  • Standard Python package json to read files
  • Handles attribute types (can distinguish between string, numeric and even lists).
  • Hierarchical structure could be very similar to Python NetCDF object

Cons

  • Need to re-write existing attribute files.
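
For illustration, a JSON attribute template could look something like this (the layout is an assumption, not an agreed schema), and reading it is a single call:

template.json

{
    "global_attributes": {"project": "IMOS"},
    "variables": {"TEMP": {"units": "Celsius"}}
}

import json
with open('template.json') as f:
    template = json.load(f)   # nested dicts, with string/numeric/list types preserved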

IMOS Toolbox

e.g.
global attributes: https://github.com/aodn/imos-toolbox/blob/master/NetCDF/template/global_attributes_timeSeries.txt
variable attributes: https://github.com/aodn/imos-toolbox/blob/master/IMOS/imosParameters.txt

Pros

  • Already maintained & up to date (in imos-toolbox repo).
  • Already have code in IMOSnetCDF.py to parse imosParameters.txt (variable attributes only)
  • Same attributes used in all ANMN data.

Cons

  • Toolbox-specific functionality (e.g. getting values from the deployment db), which would need to be ignored for non-Toolbox use.
  • Need to write new Python code to parse global attributes file.
  • Stored in a separate repo.

Python

Attribute values can be specified directly in a Python module and simply imported, e.g. https://github.com/aodn/data-services/blob/master/ACORN/current_generator/acorn_constants.py

Pros

  • Handles data types, i.e. can specify string, numeric or list attributes.
  • Flexible, can include calculated values.
  • No parsing required.

Cons

  • (Using code to store configuration?)

Other possibilities

  • NcML - like CDL but in XML format.
  • YAML
  • TOML

@mhidas (Contributor, Author) commented Jul 26, 2018

@smancini @lbesnard @ggalibert @bpasquer To be discussed.

@ocehugo (Contributor) commented Jul 26, 2018

@mhidas,

I have some code that reads CDL from Python.

It needs some refactoring of the regexes (it's pretty primitive). It doesn't support groups, but there is a battery of tests based on the CDLs from the Unidata website, and it is pure Python using only the re module.

It works for my needs, but I would like to extend it with more regexes and group support.

Let me know if you want to have a go.

PS: there is also TOML.

@lbesnard (Contributor) commented Jul 27, 2018

After doing a bit of testing, I'm of the opinion that the https://github.com/rockdoc/cdlparser code could be what we are after.
It parses CDL files and either generates a netCDF file or keeps the netCDF object open in Python.

example:

  1. in bash
ncdump http://thredds.aodn.org.au/thredds/dodsC/IMOS/SOOP/SOOP-SST/9HA2479_Pacific-Sun/2010/IMOS_SOOP-SST_MT_20101212T120000Z_9HA2479_FV01_C-20120528T071954Z.nc > /tmp/soop.cdl
  2. in python
from cdlparser import CDL3Parser

myparser = CDL3Parser(close_on_completion=False, file_format='NETCDF4_CLASSIC')
ncdataset = myparser.parse_file("/tmp/soop.cdl", ncfile="/tmp/soop.nc")
ncdataset.close()

However, the generated netCDF file and the original CDL end up missing some attributes (and maybe other things). The code would have to be forked and improved.

@bpasquer (Contributor) commented:

When I was working on generating files with the netcdfgenerator 2 years ago, I remember having issues generating NetCDF4 files using CDL. It's probably been improved since then, so if cdlparser meets our needs as @lbesnard is suggesting (though further tests are needed), using CDL is the best option in my opinion.

@ocehugo (Contributor) commented Jul 30, 2018

If you are going to use https://github.com/rockdoc/cdlparser, check if they have implemented Python 3 support.

I did a PR to support Python 3 some time ago. It didn't get into upstream because my PR was disorganised. My code already worked fine for my case, so I didn't push forward. I can't remember exactly why I wrote new code... maybe cdlparser was failing on something, or I wasn't aware of it at the time, or I was too optimistic about finishing my code with group support and all the bells.

Anyway, I don't see a problem with raw Python code either, but JSON is a clear win if you plan to read these things over the wire.

@lbesnard (Contributor) commented:

Re https://github.com/rockdoc/cdlparser, I was having issues with the generated file missing many attributes and data. This was because I used file_format='NETCDF4_CLASSIC' instead of file_format='NETCDF4'. Apparently a bug in the netcdf library.

So doing the following (see below) will create a correct netCDF file from the CDL format. I will keep on testing with other datasets:

  1. in bash
ncdump http://thredds.aodn.org.au/thredds/dodsC/IMOS/SOOP/SOOP-SST/9HA2479_Pacific-Sun/2010/IMOS_SOOP-SST_MT_20101212T120000Z_9HA2479_FV01_C-20120528T071954Z.nc > /tmp/soop.cdl
  2. in python
from cdlparser import CDL3Parser

myparser = CDL3Parser(close_on_completion=False, file_format='NETCDF4')
ncdataset = myparser.parse_file("/tmp/soop.cdl", ncfile="/tmp/soop.nc")
ncdataset.close()
  3. in bash, to check the diff between the new netCDF file and the CDL file
diff /tmp/soop.cdl <(ncdump /tmp/soop.nc)

@lbesnard (Contributor) commented:

> check if they have implemented Python 3 support.

No, they didn't, but we use Python 2.7 anyway.

@mhidas (Contributor, Author) commented Jul 30, 2018

We do want to eventually move to Python 3 though!
(Support for Python 2 ends in a year and a half, and many packages we use will stop supporting it before then - see e.g. https://python3statement.org/)

@ggalibert (Contributor) commented:

My preference would go for CDL, especially if we're only hard coding the content of the attributes.

But if we want to be a bit more flexible and allow for external resources to document some attributes (like when the toolbox tries to inject information from a deployment database or a Matlab expression) then JSON might be neater. @jonescc correct me if I'm wrong but I think this is what you are already doing in gogoduck and netcdf generator?

@jonescc (Contributor) commented Jul 30, 2018

The netcdf generator and gogoduck use their own XML format. GeoServer uses XML for its configuration, and this is where they were located at the time. We had to write support for translating those formats into netCDF attributes, which you would also have to do if you don't use an existing mechanism such as CDL or NcML.

@mhidas (Contributor, Author) commented Jul 31, 2018

After chatting to @ocehugo and playing around in Python, I am now only half-convinced that CDL is the way to go.

For just global attributes there's no problem: you can have a valid CDL (and equivalent netCDF) file containing only global attributes. However, we also want to store variable attributes, and to put them in a valid CDL file we need to define the variables themselves, complete with data type and dimensions. We can do that too, though this is a bit of overkill when we just want to specify a few attributes.

More importantly, when creating these templates we don't actually know what the exact structure of the final file will be, or at least the size of the dimensions. So we can't just read a CDL template straight into a netCDF4.Dataset object (as cdlparser does), add data, then save to netCDF. Before we can add data, we have to get the dimensions right, but once a dimension or variable is defined in a Dataset object, you can't change its structure.

Instead, the workflow needs to be something like this:

[template file] => [intermediate Python object]
                => set correct dimension sizes and variable dimensions
                => convert to netCDF4 object
                => add data arrays
                => write file

The obvious [intermediate Python object] would be a dictionary (with nested dicts for each variable), which could either be defined in Python code, or in json.
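
To make that concrete, here is a minimal sketch using a plain dict as the intermediate object (the template layout and key names are illustrative only, not a proposed schema):

import numpy as np
import netCDF4

template = {
    'dimensions': {'TIME': None},    # size not known until we have data
    'variables': {
        'TIME': {'type': 'f8', 'dimensions': ('TIME',),
                 'attributes': {'standard_name': 'time'}},
    },
    'global_attributes': {'project': 'IMOS'},
}

time_values = np.arange(100.)
template['dimensions']['TIME'] = len(time_values)   # set correct dimension size

with netCDF4.Dataset('out.nc', 'w') as ds:          # convert to netCDF4 object
    for name, size in template['dimensions'].items():
        ds.createDimension(name, size)
    for name, spec in template['variables'].items():
        var = ds.createVariable(name, spec['type'], spec['dimensions'])
        var.setncatts(spec['attributes'])
    ds.setncatts(template['global_attributes'])
    ds['TIME'][:] = time_values                     # add data arrays
# the file is written when the Dataset is closed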

@ghost commented Aug 1, 2018

If you need to manipulate it from Python, then it's kind of a moot point with JSON vs. dict... at that point JSON is effectively a serialised dict and a dict is a deserialised JSON object.

> we need to defile the variables themselves

We definitely need to avoid defiling things if possible, but that Freudian slip kind of leads onto the next thought I had... you can complement a JSON structure with http://json-schema.org/, as it says on the page it "Describes your existing data format(s)."

Having totally unconstrained JSON is just a big bag of keys and values, which is a recipe for bugs. You can't get particularly rich types in JSON, but you can at least avoid totally arbitrary data structures.
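
A small sketch of that idea using the third-party jsonschema package (the schema shown is illustrative, not a proposed standard):

from jsonschema import validate

schema = {
    "type": "object",
    "properties": {
        "dimensions": {"type": "object"},
        "variables": {"type": "object"},
        "global_attributes": {"type": "object"},
    },
    "required": ["global_attributes"],
}

template = {"global_attributes": {"project": "IMOS"}}
validate(template, schema)   # raises ValidationError if the structure is wrong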

@ggalibert (Contributor) commented:

The other advantage of using JSON or XML as opposed to CDL is that you could define the file format (NETCDF3, NETCDF4, etc.), chunking and compression level per variable.
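
For reference, those settings map directly onto existing keyword arguments of netCDF4-python's createVariable, so a template format only needs somewhere to record them:

import netCDF4

ds = netCDF4.Dataset('out.nc', 'w', format='NETCDF4')
ds.createDimension('TIME', None)
# zlib, complevel and chunksizes are existing netCDF4-python options
ds.createVariable('TEMP', 'f4', ('TIME',),
                  zlib=True, complevel=5, chunksizes=(1024,))
ds.close()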

@ocehugo (Contributor) commented Aug 1, 2018

@ggalibert, this is the exact reason why I wrote a class wrapper for writing netCDF4.Datasets (it sits below the cdlreader).

The class, DictDataset, is initialised with 3 different dictionaries: dimensions, variables and global attributes. It postpones the actual netCDF Dataset creation to its "create" method.
You can define variable dimensions as keys in the variables dictionary, as well as chunking/compression per variable.
It's useful because everything is within a single dictionary structure that can be reused or even summed with other DictDatasets (the class has an "add" method).

@mhidas saw it in action; it requires only some small fixes/changes depending on how the template is to be defined:

atemplate.py => x = DictDataset(from_file=atemplate.py)
             => set correct dimension sizes and variable dimensions (x.dimensions['x'] = X)  # if 'x' in the template differs from X
             => convert to netCDF4 object (x.set_file(outputfile); x.create())
             => add data arrays (x.ncobj[varname][:] = var)
             => write file (x.ncobj.sync())

@mhidas (Contributor, Author) commented Aug 7, 2018

Basic functionality of new NetCDF writer module

To support the workflow proposed above, we will create a new Python package for writing netCDF files using templates. If @ocehugo is happy to contribute his code, it could be based on his DictDataset class. At the minimum, it will need to implement the basic functionality described below.

Note that reading in or creating the data values is outside the scope of this package. The most convenient way to provide the data would be in numpy arrays or a Pandas dataframe.

Read template

Read one or more template files and return a dictionary-like template object (e.g. DictDataset) representing the file structure and attributes. The template file format will be JSON (optionally, readers for other formats could be implemented).
e.g.

template = DictDataset(from_file='template.json')

Update template

Update dimensions, variables and attributes in a template object. This should be as simple as adding or updating entries in a dictionary.
e.g.

template.dimensions['TIME'] = 100
template.variables['PRES']['units'] = 'dbar'
template.title = 'Test dataset'

Create netCDF object

Create a netCDF4.Dataset object and add the dimensions, variables and attributes specified in the template.
e.g.

template.create(filename)   # user-specified file name
template.create()           # auto-generated file name using IMOS conventions

This could actually be called automatically at the start of the add_data method, if it hasn't been explicitly called yet.

Add data

Add values from numpy arrays or a Pandas dataframe into the variables defined in the template. This is already done in several existing bits of data-services/aodndata code, but they could be slightly streamlined by offering a single function to do it. If the column names in a dataframe match the variables in the template, the code can match them up automatically. Otherwise each individual data array will need to be specified separately.
e.g.

template.add_data(dataframe)
template.add_data(TIME=time_values, TEMP=temp_values, PRES=pres_values)

Write file

Close the netCDF object to finish writing the file.
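
A rough sketch of how the column matching in add_data could work (a hypothetical method on the template class, not existing code):

def add_data(self, dataframe=None, **arrays):
    # match dataframe columns (or keyword arguments) to template
    # variables by name and copy the values in
    if dataframe is not None:
        arrays.update({name: dataframe[name].values
                       for name in dataframe.columns
                       if name in self.variables})
    for name, values in arrays.items():
        self.ncobj[name][:] = values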

@ghost commented Aug 7, 2018

The classmethod alternate constructor pattern would be a good fit to decouple the object from the source format, e.g.

template1 = DictDataset(source_dict)
template2 = DictDataset.from_json(path='template.json')
template3 = DictDataset.from_ini(path='template.ini')

@ggalibert (Contributor) commented:

Could we still add data by just doing:

template.variables['PRES'] = pres_values

?

@ocehugo (Contributor) commented Aug 8, 2018

I'm happy to change the DictDataset code I have.

I think the first thing to raise is some test cases (json/ini/cdl).

@mhidas, can you pull out some (or all) of the AODN templates already in use? I assume we would like to avoid rewriting them from the start (to avoid breakage) and instead move them slowly to JSON.

The DictDataset at the moment accepts 3 dict inputs (dims, vars, global_att). It is this way because that was how my cdlreader output things from a valid CDL.

Some things in my mind now:

  1. Do you guys think it's better to have a single input dict to rule them all? This would match JSON...
template1 = DictDataset(source_dict)

or

template1 = DictDataset(dims=d_dims, vars=d_vars, gattr=d_gattr)
  2. Is "append mode" to an already created netCDF file required in the short term? This would force us to provide a template from a netCDF4.Dataset object:
template = DictDataset.from_dataset(path='file.nc')
  3. I think delaying the creation of the Dataset until after the add_data method is a better strategy.
     This would allow dimension specifications to be evaluated later, plus other things. Just store the data as a "value" key (validation would happen at create() anyway).

  4. Backends: netCDF4 is assumed from the start. I can't see it going anytime soon, but could h5netcdf be an option?

@ghost commented Aug 8, 2018

> Do you guys think it's better to have a single input dict to rule them all? This would match JSON...

Doesn't really matter in terms of the design pattern; both would work, depending on what makes sense to you guys. Ideally the regular __init__ could just take native Python dicts to construct the instance, and the from_dict/from_dataset etc. methods can then basically be wrappers which retrieve/transform things into dicts to feed into __init__, e.g.:

import json

class DictDataset(object):
    def __init__(self, d_dims, vars, gattr):
        pass

    @classmethod
    def from_json(cls, path):
        # load the JSON file into a dict
        with open(path) as f:
            template = json.load(f)

        # e.g. this could call out to jsonschema to make sure the JSON has the
        # expected high-level structure, and could refuse to create the object
        # right here if it wasn't correct (validate_template is a placeholder)
        validate_template(template)

        # instantiate using the regular __init__ method
        return cls(d_dims=template['d_dims'], vars=template['vars'], gattr=template['gattr'])

So if you want to instantiate from a Python context where you have dicts already, you just use the regular __init__, e.g.

d_dims = dict()
vars = dict()
gattr = dict()

my_dataset = DictDataset(d_dims, vars, gattr)

Or if you want to start from a JSON file you use the classmethod constructor and ultimately end up with the same object:

template.json

{
    "d_dims": {},
    "vars": {},
    "gattr": {}
}
my_dataset = DictDataset.from_json(path='template.json')

In any case, this pattern totally decouples the source format from the Python object, because after initialisation it makes zero difference where the original parameters came from.

It's great for creating different flavours of pizza too: https://realpython.com/instance-class-and-static-methods-demystified/#delicious-pizza-factories-with-classmethod

@mhidas (Contributor, Author) commented Aug 8, 2018

> template.variables['PRES'] = pres_values

Not quite. template.variables['PRES'] needs to be a dictionary, so it can store the variable's type, dimensions, and attribute values. We could replicate the netCDF4 interface, so you can do this:

template.variables['PRES'][:] = pres_values
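
As a toy sketch (illustrative only), a dict subclass could support that slice-assignment interface while still holding the variable's metadata:

class VariableTemplate(dict):
    """Variable metadata dict that also accepts var[:] = data."""
    def __setitem__(self, key, value):
        if isinstance(key, slice):   # template.variables['PRES'][:] = values
            dict.__setitem__(self, '_data', value)
        else:
            dict.__setitem__(self, key, value)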

@mhidas (Contributor, Author) commented Aug 8, 2018

> Do you guys think it's better to have a single input dict to rule them all?

That's what I was thinking, but it probably doesn't really matter. I think most of the code would be reading the template from a file, so it wouldn't actually need to construct the One Dict (or call the constructor with three dicts in the other case). I guess having __init__() accept three separate dicts makes it easier to create a simple template that specifies e.g. only global attributes:

template = DictDataset(gattr={'title': 'test file', 'author': 'me'})

By the way, I think we should use ordered dicts, so the template can specify the order they are written to file.

"append mode" to an already created NetCDF file is required in the the short term?

I don't think that's needed for our main use case. Could be something to add later.

> I think delaying the creation of the Dataset until after the add_data method is a better strategy.

Yeah that could work. I guess it could then automatically set the data type and dimensions of the variable based on the array provided.

> Backends

netCDF4 is all we need at this point.

@mhidas (Contributor, Author) commented Aug 28, 2018

The new package will be developed here: https://github.com/aodn/aodn-netcdf-tools

@ocehugo (Contributor) commented Sep 4, 2018

Just a final follow-up before discussions move over to aodn-netcdf-tools:

I just had a discussion with @mhidas regarding @lwgordonimos' suggestion:

The actual code uses a more implicit/inheritance style with parent/child classes. Given the scope is reduced and not much is to be done by the "from_*" functions, I think the suggestion is good and will be simpler than creating a class (say JsonDataset) that inherits from others.

> By the way, I think we should use ordered dicts, so the template can specify the order they are written to file.

I don't foresee any problems with that if we stick with Python 3. AFAIK dicts in Python 3 preserve insertion order (it's guaranteed from 3.7). Anyway, this would be an easy change.

> Yeah that could work. I guess it could then automatically set the data type and dimensions of the variable based on the array provided.

This is already implemented in a crude way: you set up the output file with A.set_output('/tmp/abc.nc') and then call A.create(). Some consistency checks are done at init time, however. At the time I was thinking not of the template itself, but of writing everything with a smaller set of calls.

Also, filling the ncobj variables is not handled, but you can do it after the create step with:

A.ncobj['PRES'][:] = pres_values
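
On the earlier point about automatically setting data type and dimensions from the supplied array, a hypothetical helper could look like:

import numpy as np

def variable_from_array(values, dim_names):
    # infer the netCDF type from the numpy dtype; dimension names must be given
    values = np.asarray(values)
    return {'type': values.dtype.str,
            'dimensions': tuple(dim_names),
            'attributes': {},
            '_data': values}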
