-
Notifications
You must be signed in to change notification settings - Fork 120
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add support for the scientific extension. #199
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking good. Couple comments, and also can you add this extension to the API documentation?
27fc83a
to
43ca90a
Compare
Change suggested by @lossyrob For testing, it was easier to add __eq__ to Publication.
Codecov Report
@@ Coverage Diff @@
## develop #199 +/- ##
===========================================
+ Coverage 93.80% 93.93% +0.13%
===========================================
Files 31 32 +1
Lines 3807 3960 +153
===========================================
+ Hits 3571 3720 +149
- Misses 236 240 +4
Continue to review full report at Codecov.
|
Checked type hints with pytype, but mypy does not agree.
Fixes flake8 error: pystac/extensions/scientific.py:110:12: F402 import 'link' from line 19 shadowed by loop variable
|
||
@property | ||
def publications(self) -> List[Publication]: | ||
return [Publication.from_dict(pub) for pub in self.item.properties.get(PUBLICATIONS, [])] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Didn't catch this in my last review, but:
The way that Publications are handled here breaks a mutability pattern in PySTAC.
For instance, when I do
item.ext.scientific.doi = "something"
print(item.ext.scientific.doi)
I would get the expected result, "something". This is true even though the extension wrapper object is created twice - once for each call of item.ext.scientific
.
As it is currently written, the same wouldn't hold true for publications. For example, say I had an item with a single publication that had a citation of "everyone et al". The following:
item.ext.scientific.publications[0].citation = "none et al"
print(item.ext.scientific.publications[0].citation
would print "everyone et al" - since the Publication instance that was returned was a temporary object, setting it's citation didn't have consequence.
This could be worked around by requiring the user always get and set publications, e.g.
pubs = item.ext.scientific.publications
pubs[0].citation = "none et al"
item.ext.scientific.publications = pubs
but this is burdensome, and surely users will forget to do this. Also, it doesn't match the expectation of mutability that's present elsewhere in the library, including the Ext classes that wrap an item rather than contain data of their own.
The way around this is to keep the wrapper train rolling - take LabelOverview for instance. It takes as a constructor the dict from the parent Item's properties and uses getters and setters to manipulate the data directly on the Item rather than hold its own state.
I think we should follow that pattern here and provide similar wrappers around Item state with the Publications and DOI objects.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll take a look at this today. Now that you say this, I realize that version is likely broken. Created #210.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm having trouble thinking through how best to do this. I'd like Publication and Doi instances to be able to live outside of an Item or Collection. What should a Publication track? I see a couple possibilities with it being optional if these are attached:
- The Item or Collection
- The properties/extra_fields and links
- The specific entries in properties/extra_fields and the link
Having a reference to the specific dict in sci:publications
and in links could make changing the doi or citation fairly easy to do, but removing publications from an Item/Collection becomes trouble, yes?
I do worry about things like people taking a publication for one item and then adding that publication to all the other items in a collection and trying to permute them to match that item. If there are references inside the publication, then all the copies will end up changing together. And now I'm thinking which things should have deepcopies and which shouldn't getting all turned around.
I don't have an understand of what the overall strategy is for mutations.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd suggest wrapping the dictionary of properties that make up the Publication, i.e. the init takes a JSON serialized version of the Publication. In other cases, I have a create
classmethod which acts as an overloaded constructor that takes the same params as apply
and allows the objects to live outside of an Item (e.g. how it's done with Bands)
I'm not sure that removing publications is too onerous, the logic in the removal would have to just account for the fact it's removing a dict from a list rather than an object from the list.
The scenario you mention is problematic, and a good flag to raise. If you were to add the publication from one item to another and change the publication instance, that change would appear in both places. This could lead to unexpected behavior. The way I can see coding around this case is to have these objects track their owners, much like a link or asset tracks their owners. That way, when a publication owned by one item is attached to another, either the ownership can switch or a deepcopy made.
It does seem to me that this scenario is less likely to cause confusion than if the publications become detached and are un-editable unlike the rest of the library. I think if we're clear that things are highly mutable, users should be careful about reusing instances set on many different objects and then editing those instances, and use a copy themselves if they want a fresh copy that won't be mutated by the original instance. It seems similar to the scenario of a user creating a single dict
and assigning it to many Item instance's properties
via the constructor, and then mutating the properties. So, the scenario you outline is problematic, but I think is something we can live with.
I'm not sure what the overall strategy for mutations is, but I'll try to comment on my perspective around mutability in PySTAC. Mutability always comes with risk of state modification when you don't want it, and ideally all things would be strictly immutable. Any change would give you back a new copy of the thing you changed, with the change you requested. However, this gets complex pretty quickly in STAC. Imaging you've read in a large catalog. Any change in that catalog would produce a new catalog with the relevant edits. If I modify a single property of an Item, then I need a new Item. I then need all of the links to that Item changed, and so on - what I really would produce is an entirely new STAC. Things get more complicated when you consider that PySTAC uses caching to enable links to the same STAC object on the filesystem to refer to the same STAC object in memory.
Immutability is possible, and there are patterns like Lenses to make complex hierarchical immutable data structures easier to transform. When I was setting things up in PySTAC, I certainly considered this. However, Python being the dynamic, mutation-friendly language that it is, that felt like it would be swimming against the current of what users would want. I think of STACs as graphs, so I looked at the most popular graph library - networkx - to take direction from, which has mutable data structures (as an aside, I found this interesting conversation about mutable graphs that is related).
So to me, if we're not trying to be immutable, then we can lean into mutability and use it to our advantage, like how we allow multiple extension objects act as views into the STAC object properties and mutate it, so that the extension objects are essentially stateless. We can also ask the user to assume mutability and understand that there's risk in assigning the same instance across multiple STAC objects.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking at the code, I realized that Doi
is not an object, but a wrapper around a string. This would be fine to keep as an object, though I think it may get awkward to do things like
item.ext.scientific.doi.doi
to get at the string. Perhaps consider moving the logic into the scientific extensions? E.g. I could imagine validation logic and the URL being directly on ScientificItemExt, like a doi_url
property that is only a getter, and a validate_doi
method that is called on the doi setter, which would take a string.
For Publication
, the above comments hold - this is a object that would be represented as a dict inside the item properties.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should the publication class go away and just use a dict? I thought python3 had a dotted dict type thing, but ? If we wanted to ditch mutability of the fields inside the pub, we could use collections.namedtuple
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After taking another look at this and pushing some code around, I think this is a tricker case than other cases, for instance LabelClasses
, because this is a list of objects and not an dict. The 'proper' way to do what I'm talking about is to generate a collection class that would define it's own sequence methods, that would translate between the Item properties and Publication objects as items were retrieved, appended etc.
However, this adds a level of complexity such that the "worth it" factor is starting to tip in the direction of "no" for me. The way you have implemented works great, and the situation I mention may come up and confuse users but perhaps won't be too much of head-scratcher to work around.
That's to say, let's go forward with this solution and maybe re-evaluate if someone complains. Thanks for bearing with me on this!
def publications(self, v: List[Publication]) -> None: | ||
self.item.properties[PUBLICATIONS] = [pub.to_dict() for pub in v] | ||
for pub in v: | ||
self.item.add_link(pub.get_link()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I may have been mistaken if I commented on publication links - the spec does not say that publications should be in the links
of an item, only the item DOI link. This is added above in the doi setter. With this logic we have links that are marked as 'cite_as' for the doi of the data, and also any publications that reference it. I think the correct approach is to not have publications modify links, but have doi
modify the single cite_as link.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rereading your comments above and this one, I will start cleaning up what I have. I will remove the Doi class. It's just a convenience wrapper that can easily be a function added later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rereading the spec, I see this near the bottom defining a Publication:
doi | string | DOI of a publication referencing the data.
And then after that is:
For all DOI names respective DOI links SHOULD be added to the links section (see chapter "Relation types").
To me this implies that the publication DOI should be a link. If it had said sci:doi
should be added to the links, then I would have agreed that the pubs should not be in the links.
Does that make sense?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, good call - it seems like the publication DOIs should end up as links. And based on the language for the relation type, the cite-as
rel type seems like the right way to go. This is confusing to me - to have several links with the rel type cite-as
that point to either DOIs for the data DOIs for publications that point to the data - but that is how the spec reads now that you've clarified it for me. Thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cite-as
does seem to have too many uses. I guess we need to assume that the DOIs will be different and the code should search for cite-as links with the quoted doi string in the link. Created radiantearth/stac-spec#915 to discuss this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
radiantearth/stac-spec#915 (comment) said:
The links were just intended as simplification for clients and users, mostly for the ones not supporting the scientific extension so that user could still follow links. It was just meant as a list of related dois to follow, not to transport any additional information or so.
If two links would be the same, there should be just one link to it. The thing is that the DOI links can easily be generated if the client understands the sci extentsion/dois anyway so those clients don't really need the links.
It's just a wrapper around str that isn't critical.
- Use pystac.Thing rather than from pystac import thing - Drop the regex checks of DOI - Remove TODO about links that was in the wrong place
|
||
@property | ||
def publications(self) -> List[Publication]: | ||
return [Publication.from_dict(pub) for pub in self.item.properties.get(PUBLICATIONS, [])] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After taking another look at this and pushing some code around, I think this is a tricker case than other cases, for instance LabelClasses
, because this is a list of objects and not an dict. The 'proper' way to do what I'm talking about is to generate a collection class that would define it's own sequence methods, that would translate between the Item properties and Publication objects as items were retrieved, appended etc.
However, this adds a level of complexity such that the "worth it" factor is starting to tip in the direction of "no" for me. The way you have implemented works great, and the situation I mention may come up and confuse users but perhaps won't be too much of head-scratcher to work around.
That's to say, let's go forward with this solution and maybe re-evaluate if someone complains. Thanks for bearing with me on this!
https://github.com/radiantearth/stac-spec/tree/dev/extensions/scientific
This is for issue #194. I will be tracking the changes to PR #193 to improve this extension code.