Add support for the scientific extension. #199

schwehr · 2020-10-14T21:53:34Z

https://github.com/radiantearth/stac-spec/tree/dev/extensions/scientific

This is for issue #194. I will be tracking the changes to PR #193 to improve this extension code.

lossyrob

Looking good. Couple comments, and also can you add this extension to the API documentation?

pystac/extensions/scientific.py

@lossyrob

Change suggested by @lossyrob For testing, it was easier to add __eq__ to Publication.

codecov-io · 2020-10-16T21:21:14Z

Codecov Report

Merging #199 (9d72587) into develop (37895cd) will increase coverage by 0.13%.
The diff coverage is 97.38%.

@@             Coverage Diff             @@
##           develop     #199      +/-   ##
===========================================
+ Coverage    93.80%   93.93%   +0.13%     
===========================================
  Files           31       32       +1     
  Lines         3807     3960     +153     
===========================================
+ Hits          3571     3720     +149     
- Misses         236      240       +4

Impacted Files	Coverage Δ
pystac/extensions/scientific.py	`97.36% <97.36%> (ø)`
pystac/__init__.py	`100.00% <100.00%> (ø)`

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 37895cd...9d72587. Read the comment docs.

Checked type hints with pytype, but mypy does not agree.

Fixes flake8 error: pystac/extensions/scientific.py:110:12: F402 import 'link' from line 19 shadowed by loop variable

pystac/extensions/scientific.py

lossyrob · 2020-10-22T04:09:36Z

pystac/extensions/scientific.py

+
+    @property
+    def publications(self) -> List[Publication]:
+        return [Publication.from_dict(pub) for pub in self.item.properties.get(PUBLICATIONS, [])]


Didn't catch this in my last review, but:

The way that Publications are handled here breaks a mutability pattern in PySTAC.

For instance, when I do

item.ext.scientific.doi = "something" print(item.ext.scientific.doi)

I would get the expected result, "something". This is true even though the extension wrapper object is created twice - once for each call of item.ext.scientific.

As it is currently written, the same wouldn't hold true for publications. For example, say I had an item with a single publication that had a citation of "everyone et al". The following:

item.ext.scientific.publications[0].citation = "none et al" print(item.ext.scientific.publications[0].citation

would print "everyone et al" - since the Publication instance that was returned was a temporary object, setting it's citation didn't have consequence.

This could be worked around by requiring the user always get and set publications, e.g.

pubs = item.ext.scientific.publications pubs[0].citation = "none et al" item.ext.scientific.publications = pubs

but this is burdensome, and surely users will forget to do this. Also, it doesn't match the expectation of mutability that's present elsewhere in the library, including the Ext classes that wrap an item rather than contain data of their own.

The way around this is to keep the wrapper train rolling - take LabelOverview for instance. It takes as a constructor the dict from the parent Item's properties and uses getters and setters to manipulate the data directly on the Item rather than hold its own state.

I think we should follow that pattern here and provide similar wrappers around Item state with the Publications and DOI objects.

I'll take a look at this today. Now that you say this, I realize that version is likely broken. Created #210.

I'm having trouble thinking through how best to do this. I'd like Publication and Doi instances to be able to live outside of an Item or Collection. What should a Publication track? I see a couple possibilities with it being optional if these are attached:

The Item or Collection

The properties/extra_fields and links

The specific entries in properties/extra_fields and the link

Having a reference to the specific dict in sci:publications and in links could make changing the doi or citation fairly easy to do, but removing publications from an Item/Collection becomes trouble, yes?

I do worry about things like people taking a publication for one item and then adding that publication to all the other items in a collection and trying to permute them to match that item. If there are references inside the publication, then all the copies will end up changing together. And now I'm thinking which things should have deepcopies and which shouldn't getting all turned around.

I don't have an understand of what the overall strategy is for mutations.

I'd suggest wrapping the dictionary of properties that make up the Publication, i.e. the init takes a JSON serialized version of the Publication. In other cases, I have a create classmethod which acts as an overloaded constructor that takes the same params as apply and allows the objects to live outside of an Item (e.g. how it's done with Bands)

I'm not sure that removing publications is too onerous, the logic in the removal would have to just account for the fact it's removing a dict from a list rather than an object from the list.

The scenario you mention is problematic, and a good flag to raise. If you were to add the publication from one item to another and change the publication instance, that change would appear in both places. This could lead to unexpected behavior. The way I can see coding around this case is to have these objects track their owners, much like a link or asset tracks their owners. That way, when a publication owned by one item is attached to another, either the ownership can switch or a deepcopy made.

It does seem to me that this scenario is less likely to cause confusion than if the publications become detached and are un-editable unlike the rest of the library. I think if we're clear that things are highly mutable, users should be careful about reusing instances set on many different objects and then editing those instances, and use a copy themselves if they want a fresh copy that won't be mutated by the original instance. It seems similar to the scenario of a user creating a single dict and assigning it to many Item instance's properties via the constructor, and then mutating the properties. So, the scenario you outline is problematic, but I think is something we can live with.

I'm not sure what the overall strategy for mutations is, but I'll try to comment on my perspective around mutability in PySTAC. Mutability always comes with risk of state modification when you don't want it, and ideally all things would be strictly immutable. Any change would give you back a new copy of the thing you changed, with the change you requested. However, this gets complex pretty quickly in STAC. Imaging you've read in a large catalog. Any change in that catalog would produce a new catalog with the relevant edits. If I modify a single property of an Item, then I need a new Item. I then need all of the links to that Item changed, and so on - what I really would produce is an entirely new STAC. Things get more complicated when you consider that PySTAC uses caching to enable links to the same STAC object on the filesystem to refer to the same STAC object in memory.

Immutability is possible, and there are patterns like Lenses to make complex hierarchical immutable data structures easier to transform. When I was setting things up in PySTAC, I certainly considered this. However, Python being the dynamic, mutation-friendly language that it is, that felt like it would be swimming against the current of what users would want. I think of STACs as graphs, so I looked at the most popular graph library - networkx - to take direction from, which has mutable data structures (as an aside, I found this interesting conversation about mutable graphs that is related).

So to me, if we're not trying to be immutable, then we can lean into mutability and use it to our advantage, like how we allow multiple extension objects act as views into the STAC object properties and mutate it, so that the extension objects are essentially stateless. We can also ask the user to assume mutability and understand that there's risk in assigning the same instance across multiple STAC objects.

Looking at the code, I realized that Doi is not an object, but a wrapper around a string. This would be fine to keep as an object, though I think it may get awkward to do things like

item.ext.scientific.doi.doi

to get at the string. Perhaps consider moving the logic into the scientific extensions? E.g. I could imagine validation logic and the URL being directly on ScientificItemExt, like a doi_url property that is only a getter, and a validate_doi method that is called on the doi setter, which would take a string.

For Publication, the above comments hold - this is a object that would be represented as a dict inside the item properties.

Should the publication class go away and just use a dict? I thought python3 had a dotted dict type thing, but ? If we wanted to ditch mutability of the fields inside the pub, we could use collections.namedtuple

After taking another look at this and pushing some code around, I think this is a tricker case than other cases, for instance LabelClasses, because this is a list of objects and not an dict. The 'proper' way to do what I'm talking about is to generate a collection class that would define it's own sequence methods, that would translate between the Item properties and Publication objects as items were retrieved, appended etc.

However, this adds a level of complexity such that the "worth it" factor is starting to tip in the direction of "no" for me. The way you have implemented works great, and the situation I mention may come up and confuse users but perhaps won't be too much of head-scratcher to work around.

That's to say, let's go forward with this solution and maybe re-evaluate if someone complains. Thanks for bearing with me on this!

lossyrob · 2020-10-30T16:00:59Z

pystac/extensions/scientific.py

+    def publications(self, v: List[Publication]) -> None:
+        self.item.properties[PUBLICATIONS] = [pub.to_dict() for pub in v]
+        for pub in v:
+            self.item.add_link(pub.get_link())


I think I may have been mistaken if I commented on publication links - the spec does not say that publications should be in the links of an item, only the item DOI link. This is added above in the doi setter. With this logic we have links that are marked as 'cite_as' for the doi of the data, and also any publications that reference it. I think the correct approach is to not have publications modify links, but have doi modify the single cite_as link.

Rereading your comments above and this one, I will start cleaning up what I have. I will remove the Doi class. It's just a convenience wrapper that can easily be a function added later.

Rereading the spec, I see this near the bottom defining a Publication:

doi | string | DOI of a publication referencing the data.

And then after that is:

For all DOI names respective DOI links SHOULD be added to the links section (see chapter "Relation types").

To me this implies that the publication DOI should be a link. If it had said sci:doi should be added to the links, then I would have agreed that the pubs should not be in the links.

Does that make sense?

Ah, good call - it seems like the publication DOIs should end up as links. And based on the language for the relation type, the cite-as rel type seems like the right way to go. This is confusing to me - to have several links with the rel type cite-as that point to either DOIs for the data DOIs for publications that point to the data - but that is how the spec reads now that you've clarified it for me. Thanks!

cite-as does seem to have too many uses. I guess we need to assume that the DOIs will be different and the code should search for cite-as links with the quoted doi string in the link. Created radiantearth/stac-spec#915 to discuss this.

radiantearth/stac-spec#915 (comment) said:

The links were just intended as simplification for clients and users, mostly for the ones not supporting the scientific extension so that user could still follow links. It was just meant as a list of related dois to follow, not to transport any additional information or so.
If two links would be the same, there should be just one link to it. The thing is that the DOI links can easily be generated if the client understands the sci extentsion/dois anyway so those clients don't really need the links.

pystac/extensions/scientific.py

It's just a wrapper around str that isn't critical.

- Use pystac.Thing rather than from pystac import thing - Drop the regex checks of DOI - Remove TODO about links that was in the wrong place

lossyrob · 2020-11-19T04:18:58Z

pystac/extensions/scientific.py

+
+    @property
+    def publications(self) -> List[Publication]:
+        return [Publication.from_dict(pub) for pub in self.item.properties.get(PUBLICATIONS, [])]


After taking another look at this and pushing some code around, I think this is a tricker case than other cases, for instance LabelClasses, because this is a list of objects and not an dict. The 'proper' way to do what I'm talking about is to generate a collection class that would define it's own sequence methods, that would translate between the Item properties and Publication objects as items were retrieved, appended etc.

However, this adds a level of complexity such that the "worth it" factor is starting to tip in the direction of "no" for me. The way you have implemented works great, and the situation I mention may come up and confuse users but perhaps won't be too much of head-scratcher to work around.

That's to say, let's go forward with this solution and maybe re-evaluate if someone complains. Thanks for bearing with me on this!

lossyrob requested changes Oct 16, 2020

View reviewed changes

pystac/extensions/scientific.py Outdated Show resolved Hide resolved

pystac/extensions/scientific.py Show resolved Hide resolved

schwehr added 2 commits October 16, 2020 09:37

Rebase after version extension committed

8582ea1

__init__.py: Add missing import of pystac.extensions.scientific

43ca90a

schwehr force-pushed the scientific-extension branch from 27fc83a to 43ca90a Compare October 16, 2020 16:40

schwehr added 2 commits October 16, 2020 14:12

science: Switch to getter returning a list of Publication instances

814c5f5

Change suggested by @lossyrob For testing, it was easier to add __eq__ to Publication.

scientific.py: pub -> p for flake8 line length check

f5e5364

schwehr mentioned this pull request Oct 20, 2020

doi registrant code is too restrictive in the schema radiantearth/stac-spec#910

Closed

schwehr added 2 commits October 21, 2020 13:19

scientific ext: Add Doi class, remove_publication, and type hints

9fdba7c

Checked type hints with pytype, but mypy does not agree.

scientific.py: link -> a_link in remove_link()

205ee5d

Fixes flake8 error: pystac/extensions/scientific.py:110:12: F402 import 'link' from line 19 shadowed by loop variable

lossyrob requested changes Oct 22, 2020

View reviewed changes

schwehr mentioned this pull request Oct 23, 2020

version extension link setters do not check for existing links #210

Closed

lossyrob reviewed Oct 30, 2020

View reviewed changes

pystac/extensions/scientific.py Outdated Show resolved Hide resolved

schwehr added 4 commits November 3, 2020 10:57

scientific.py: Remove class Doi.

9c1e23a

It's just a wrapper around str that isn't critical.

scientific.py: Add docs and cleanup.

27b4e06

- Use pystac.Thing rather than from pystac import thing - Drop the regex checks of DOI - Remove TODO about links that was in the wrong place

Merge develop

a329e5b

scientific.py: Remove TypeVar type hints.

e9afcab

schwehr mentioned this pull request Nov 4, 2020

scientific extension: rel of cite-as links for sci:doi and for pub dois radiantearth/stac-spec#915

Closed

schwehr and others added 3 commits November 8, 2020 14:27

scientific: Clear an existing doi link is a new one is set

af38dc1

Merge branch 'develop' into scientific-extension

6688d07

Fix formatting issue from previous merge

9d72587

lossyrob approved these changes Nov 19, 2020

View reviewed changes

lossyrob merged commit 4116383 into stac-utils:develop Nov 19, 2020

schwehr deleted the scientific-extension branch November 20, 2020 19:54

lossyrob mentioned this pull request Jan 22, 2021

Support the scientific extension #194

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add support for the scientific extension. #199

Add support for the scientific extension. #199

schwehr commented Oct 14, 2020 •

edited

Loading

lossyrob left a comment

codecov-io commented Oct 16, 2020 •

edited

Loading

lossyrob Oct 22, 2020

schwehr Oct 23, 2020

schwehr Oct 23, 2020

lossyrob Oct 27, 2020

lossyrob Oct 30, 2020

schwehr Nov 9, 2020

lossyrob Nov 19, 2020

lossyrob Oct 30, 2020

schwehr Nov 3, 2020

schwehr Nov 3, 2020

lossyrob Nov 3, 2020

schwehr Nov 4, 2020

schwehr Nov 10, 2020

lossyrob Nov 19, 2020

Add support for the scientific extension. #199

Add support for the scientific extension. #199

Conversation

schwehr commented Oct 14, 2020 • edited Loading

lossyrob left a comment

Choose a reason for hiding this comment

codecov-io commented Oct 16, 2020 • edited Loading

Codecov Report

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

schwehr commented Oct 14, 2020 •

edited

Loading

codecov-io commented Oct 16, 2020 •

edited

Loading