Need a unified source of standard netCDF attributes #525
In my opinion CDL would be the most appropriate format to use for these templates. These can be very easily converted into a netCDF file with ncgen and then opened as a NetCDF4 Dataset object. As for the location, something like data-services/lib/netcdf/templates would make sense. The tricky bit will be how to make this available to the Matlab Toolbox as well... |
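Concretely, the conversion is a one-liner with ncgen, after which the file can be opened from Python:

ncgen -o template.nc template.cdl

from netCDF4 import Dataset
ds = Dataset('template.nc', mode='a')  # open for appending data/attributes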
The only issue with this is that we would have to do a subprocess call from Python, which is never an ideal solution. |
Format options for storing netCDF attributes

There are many ways fixed attribute values (such as the license, disclaimer and acknowledgement strings required by the IMOS conventions) could be stored. The idea is to keep these separate from the code that creates netCDF files. Here are some options.

CDL (Common Data Language)

Currently, IMOSnetCDF.py in data-services uses text files like this, which are not complete CDL files, but they store global and variable attributes the same way as CDL.

Pros
Cons
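For illustration, a minimal CDL template might look like this (a sketch; the attribute values are placeholders rather than the official IMOS strings):

netcdf template {
dimensions:
    TIME = UNLIMITED ;
variables:
    double TIME(TIME) ;
        TIME:standard_name = "time" ;
        TIME:units = "days since 1950-01-01 00:00:00 UTC" ;

// global attributes:
    :project = "Integrated Marine Observing System (IMOS)" ;
    :license = "http://creativecommons.org/licenses/by/4.0/" ;
}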
conf (Windows INI syntax)

e.g. https://github.com/aodn/data-services/blob/master/SOOP/SOOP_XBT/DELAYED/generate_nc_file_att

Pros
Cons
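As an illustration (a hypothetical layout, not necessarily that of the linked file), an INI-style template could look like:

[global_attributes]
project = Integrated Marine Observing System (IMOS)
license = http://creativecommons.org/licenses/by/4.0/

[TEMP]
standard_name = sea_water_temperature
units = degrees_Celsius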
json

Pros
Cons
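For comparison, the same attributes in JSON might look like this (a sketch):

{
    "global_attributes": {
        "project": "Integrated Marine Observing System (IMOS)"
    },
    "TEMP": {
        "standard_name": "sea_water_temperature",
        "units": "degrees_Celsius"
    }
}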
IMOS Toolbox

e.g.

Pros
Cons
Python

Attribute values can be specified directly in a Python module and simply imported, e.g. https://github.com/aodn/data-services/blob/master/ACORN/current_generator/acorn_constants.py

Pros
Cons
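In the style of acorn_constants.py, such a module might simply define (values illustrative):

# imos_attributes.py -- importable by any file-creation script
PROJECT = 'Integrated Marine Observing System (IMOS)'
GLOBAL_ATTRIBUTES = {
    'project': PROJECT,
    'license': 'http://creativecommons.org/licenses/by/4.0/',
}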
Other possibilities |
@smancini @lbesnard @ggalibert @bpasquer To be discussed. |
I have some code that reads CDL from Python. It needs some refactoring of the regexes (it's pretty primitive). It doesn't support groups, but there is a battery of tests based on the CDLs from the Unidata website, and it is pure Python using only the re module. It works for my needs but I would like to extend it with more regexes and group support. Let me know if you wanna have a go. PS: there is also TOML. |
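Not @ocehugo's code, but to give a rough idea of the approach, a toy attribute extractor using only the re module could look like:

import re

# match CDL attribute lines: 'VAR:name = value ;' or global ':name = value ;'
ATTR_RE = re.compile(r'^\s*(\w+)?:(\w+)\s*=\s*(.+?)\s*;\s*$')

def parse_attrs(cdl_text):
    attrs = {}
    for line in cdl_text.splitlines():
        m = ATTR_RE.match(line)
        if m:
            var, name, value = m.groups()
            attrs.setdefault(var or '_global', {})[name] = value.strip('"')
    return attrs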
After doing a bit of testing, I'm of the opinion that the https://github.com/rockdoc/cdlparser code could be what we are after. Example:
ncdump http://thredds.aodn.org.au/thredds/dodsC/IMOS/SOOP/SOOP-SST/9HA2479_Pacific-Sun/2010/IMOS_SOOP-SST_MT_20101212T120000Z_9HA2479_FV01_C-20120528T071954Z.nc > /tmp/soop.cdl
from cdlparser import CDL3Parser
myparser = CDL3Parser(close_on_completion=False, file_format='NETCDF4_CLASSIC')
ncdataset = myparser.parse_file("/tmp/soop.cdl",ncfile="/tmp/soop.nc")
ncdataset.close()

However, the generated netCDF file ends up missing some attributes (and maybe other things) compared with the original CDL. The code would have to be forked and improved. |
When I was working on generating files with the netcdfgenerator 2 years ago, I remember having issues generating NetCDF4 files using CDL. It's probably been improved since then, so if the cdlparser meets our needs as @lbesnard is suggesting (but further tests are needed), using CDL is the best option in my opinion. |
If you are going to use https://github.com/rockdoc/cdlparser, check if they implemented Python 3 support. I did a PR to support Python 3 some time ago. It didn't get into upstream because my PR was disorganised. My code already worked fine for my case, so I didn't push forward. I can't remember exactly why I wrote new code... maybe cdlparser was failing on something, or I was not aware of it at the time, or I was too optimistic about finishing my code with group support and all the bells. Anyway, I don't see a problem with raw Python code either, but JSON is a clear win if you plan to read those things over the wire. |
Re https://github.com/rockdoc/cdlparser, I was having issues with the generated file missing many attributes and data. This is because I used file_format='NETCDF4_CLASSIC'. So doing the following (see below) will create a correct NetCDF file from CDL. Will keep on testing with other datasets:
ncdump http://thredds.aodn.org.au/thredds/dodsC/IMOS/SOOP/SOOP-SST/9HA2479_Pacific-Sun/2010/IMOS_SOOP-SST_MT_20101212T120000Z_9HA2479_FV01_C-20120528T071954Z.nc > /tmp/soop.cdl
from cdlparser import CDL3Parser
myparser = CDL3Parser(close_on_completion=False, file_format='NETCDF4')
ncdataset = myparser.parse_file("/tmp/soop.cdl", ncfile="/tmp/soop.nc")
ncdataset.close()
diff /tmp/soop.cdl <(ncdump /tmp/soop.nc) |
No, they didn't, but we use Python 2.7 anyway. |
We do want to eventually move to Python 3 though! |
My preference would go for CDL, especially if we're only hard coding the content of the attributes. But if we want to be a bit more flexible and allow for external resources to document some attributes (like when the toolbox tries to inject information from a deployment database or a Matlab expression) then JSON might be neater. @jonescc correct me if I'm wrong but I think this is what you are already doing in gogoduck and netcdf generator? |
The netcdf generator and gogoduck use their own xml format. Geoserver uses xml for its configuration and this is where they were located at the time. We had to write support for translating those formats into netCDF attributes, which you would also have to do if you don't use an existing mechanism such as CDL or NcML. |
After chatting to @ocehugo and playing around in Python, I am now only half-convinced that CDL is the way to go. For just global attributes there's no problem: you can have a valid CDL (and equivalent netCDF) file containing just global attributes. However, we also want to store variable attributes, and to put them in a valid CDL file we need to define the variables themselves, complete with data type and dimensions. We can do that too, though this is a bit of overkill when we just want to specify a few attributes. More importantly, when creating these templates, we don't actually know what the exact structure of the final file will be, or at least the size of the dimensions. So we can't just read a CDL template straight into a netCDF4.Dataset object (as cdlparser does), add data, then save to netCDF. Before we can add data, we have to get the dimensions right, but once a dimension or variable is defined in a Dataset object, you can't change its structure. Instead, the workflow needs to be something like this:
The obvious [intermediate Python object] would be a dictionary (with nested dicts for each variable), which could be defined either in Python code or in JSON. |
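For illustration, such a template dict might look like this (names and layout hypothetical):

template = {
    'dimensions': {'TIME': None},  # size left undefined until the data is known
    'variables': {
        'TEMP': {
            'dimensions': ['TIME'],
            'datatype': 'float32',
            'attributes': {
                'standard_name': 'sea_water_temperature',
                'units': 'degrees_Celsius',
            },
        },
    },
    'global_attributes': {
        'project': 'Integrated Marine Observing System (IMOS)',
    },
}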
If you need to manipulate it from Python, then it's kind of a moot point with JSON vs. dict... at that point JSON is effectively a serialised dict and a dict is a deserialised JSON object.
We definitely need to avoid defiling things if possible, but that Freudian slip kind of leads onto the next thought I had... you can complement a JSON structure with http://json-schema.org/, as it says on the page it "Describes your existing data format(s)." Having totally unconstrained JSON is just a big bag of keys and values which is a recipe for bugs. You can't get particularly rich types in JSON, but you can at least avoid totally arbitrary data structures. |
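For instance, using the jsonschema package (the section names here are hypothetical):

import json
import jsonschema

# require the template to be an object whose sections are themselves objects
TEMPLATE_SCHEMA = {
    "type": "object",
    "properties": {
        "dimensions": {"type": "object"},
        "variables": {"type": "object"},
        "global_attributes": {"type": "object"},
    },
    "additionalProperties": False,
}

with open('template.json') as f:
    template = json.load(f)
jsonschema.validate(instance=template, schema=TEMPLATE_SCHEMA)  # raises ValidationError on bad structure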
The other advantage of using JSON or XML as opposed to CDL is that you could define the file format (NETCDF3, NETCDF4, etc.), chunking and compression level per variable.
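e.g. a JSON template could carry entries that map straight onto netCDF4-python's createVariable keyword arguments (zlib, complevel, chunksizes); the layout below is hypothetical:

{
    "file_format": "NETCDF4",
    "variables": {
        "TEMP": {
            "zlib": true,
            "complevel": 5,
            "chunksizes": [1000]
        }
    }
}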
@ggalibert, this is the exact reason why I wrote a class wrapper for writing netCDF4.Datasets (which sits below the cdlreader). The class, DictDataset, is initialized with 3 different dictionaries: dimensions, variables and global attributes. It postpones the actual netCDF Dataset creation until its "create" method is called. @mhidas saw it in action, and it requires only some small fixes/changes depending on how the template is to be defined:
|
Basic functionality of new NetCDF writer module

To support the workflow proposed above, we will create a new Python package for writing netCDF files using templates. If @ocehugo is happy to contribute his code, it could be based on his DictDataset. Note that reading in or creating the data values is outside the scope of this package. The most convenient way to provide the data would be in numpy arrays or a Pandas dataframe.

Read template

Read one or more template files and return a dictionary-like template object, e.g.

template = DictDataset(from_file='template.json')

Update template

Update dimensions, variables and attributes in a template object. This should be as simple as adding or updating entries in a dictionary.

template.dimensions['TIME'] = 100
template.variables['PRES']['units'] = 'dbar'
template.title = 'Test dataset'

Create netCDF object

Create a netCDF4.Dataset from the template.

template.create(filename)  # user-specified file name
template.create()  # auto-generated file name using IMOS conventions

This could actually be called automatically at the start of the add-data step.

Add data

Add values from numpy arrays or a Pandas dataframe into the variables defined in the template (see the sketch after this comment). This is already done in several existing bits of data-services/aodndata code, but they could be slightly streamlined by offering a single function to do it. If the column names in a dataframe match the variables in the template, the code can match them up automatically. Otherwise each individual data array will need to be specified separately.

template.add_data(dataframe)
template.add_data(TIME=time_values, TEMP=temp_values, PRES=pres_values)

Write file

Close the netCDF object to finish writing the file. |
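The automatic matching in add_data could be as simple as this sketch (hypothetical, not the final API):

def add_data(self, dataframe=None, **arrays):
    """Fill template variables from a dataframe and/or named numpy arrays."""
    if dataframe is not None:
        # columns whose names match template variables are used directly
        arrays.update({col: dataframe[col].values for col in dataframe.columns})
    for name, values in arrays.items():
        if name not in self.variables:
            raise ValueError("no variable '{}' defined in template".format(name))
        self.variables[name][:] = values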
The classmethod alternate constructor pattern would be a good fit to decouple the object from the source format, e.g.

template1 = DictDataset(source_dict)
template2 = DictDataset.from_json(path='template.json')
template3 = DictDataset.from_ini(path='template.ini') |
Could we still add data by just doing: template.variables['PRES'] = pres_values ? |
I'm happy to change the DictDataset code I have. I think the first thing to raise is some test cases (json/ini/cdl). @mhidas, can you pull out some (or all) of the AODN templates already in use? I assume we would like to avoid rewriting some of them, to avoid breakage from the start, and move them to JSON slowly. The DictDataset at the moment accepts 3 dict inputs (dims, vars, global_att). It is this way because that was the way my cdlreader output things from a valid CDL. Some things in my mind now:
or
|
Doesn't really matter in terms of the design pattern; both would work depending on what makes sense to you guys. Ideally the regular init could just take native Python dicts to construct the instance, and the from_dict/from_dataset etc. methods can then basically be wrappers which retrieve/transform things into dicts to feed into init, e.g.:

import json

class DictDataset(object):

    def __init__(self, d_dims, vars, gattr):
        pass

    @classmethod
    def from_json(cls, path):
        # load the JSON file into a dict
        with open(path) as f:
            template = json.load(f)

        # e.g. this could call out to JSON Schema to make sure the JSON has the
        # expected high-level structure, and could refuse to create the object
        # right here if it wasn't correct
        validate_template(template)

        # instantiate using the regular __init__ method
        return cls(d_dims=template['d_dims'], vars=template['vars'], gattr=template['gattr'])

So if you want to instantiate from a Python context where you have dicts already, you just use the regular init, e.g.

d_dims = dict()
vars = dict()
gattr = dict()
my_dataset = DictDataset(d_dims, vars, gattr)

Or if you want to start from a JSON file, you use the classmethod constructor and ultimately end up with the same object:

template.json

{
    "d_dims": {},
    "vars": {},
    "gattr": {}
}

my_dataset = DictDataset.from_json(path='template.json')

In any case, this pattern totally decouples the source format from the Python object, because after initialisation it makes zero difference where the original parameters came from. It's great for creating different flavours of pizza too: https://realpython.com/instance-class-and-static-methods-demystified/#delicious-pizza-factories-with-classmethod |
Not quite:

template.variables['PRES'][:] = pres_values |
That's what I was thinking, but it probably doesn't really matter. I think most of the code would be reading the template from a file, so it wouldn't actually need to construct the one dict (or call the constructor with three dicts in the other case). I guess having the init() accept three separate dicts makes it easier to create a simple template that specifies e.g. only global attributes:

template = DictDataset(gattr={'title': 'test file', 'author': 'me'})

By the way, I think we should use ordered dicts, so the template can specify the order they are written to file (see the sketch after this comment).
I don't think that's needed for our main use case. Could be something to add later.
Yeah that could work. I guess it could then automatically set the data type and dimensions of the variable based on the array provided.
netCDF4 is all we need at this point. |
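On the ordered-dict point, under Python 2.7 that would presumably mean something like (values illustrative):

from collections import OrderedDict

# attributes are written to the file in this order
gattr = OrderedDict([
    ('project', 'Integrated Marine Observing System (IMOS)'),
    ('title', 'Test dataset'),
])
template = DictDataset(gattr=gattr)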
The new package will be developed here: https://github.com/aodn/aodn-netcdf-tools |
Just a final follow-up before discussions over aodn-netcdf-tools: I just had a discussion with @mhidas regarding @lwgordonimos' suggestion. The actual code uses a more implicit inheritance style, with parent/child classes. Given the scope is reduced and not much is to be done by the "from_*" functions, I think the suggestion is good and will be simpler than creating a class (say JsonDataset) that inherits from others.
I don't foresee any problems with that if we stick with Python 3. AFAIK dicts in Python 3 preserve insertion order. Anyway, this would be an easy change.
This is already implemented in a crude way: you set up the output file with A.set_output('/tmp/abc.nc') and then call A.create(). Some consistency checks are done at init time, however. At the time I was thinking not of the template itself, but of writing everything from a smaller set of calls. Also, filling the ncobj variables is not handled, but you can do it after the create step by:
|
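Presumably something along these lines, assuming ncobj exposes the underlying netCDF4.Dataset (a sketch; variable name and values are illustrative):

A.set_output('/tmp/abc.nc')
A.create()
A.ncobj.variables['TEMP'][:] = temp_values  # fill data after creation
A.ncobj.close()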
Many of our processes that create netCDF files require templates to set global (and variable) attributes. Currently this is being done in a variety of different ways, from multiple sources, often setting the same basic attributes (project, acknowledgements, etc.):
There may be others too...
There are two issues here:
It would be helpful to come up with a solution that removes, or at least minimises, both of these issues.