Duplicated "items" link when trying to update an existing collection #505

Closed
remicres opened this issue Dec 1, 2022 · 3 comments

Contributor

remicres commented Dec 1, 2022

Hi,

I am trying to update a remote collection using pystac, pystac_client and requests.
The collection is read and updated using the STAC FastAPI.

  • First I use pystac_client to grab the collection
  • Then I update the collection with pystac (I just add or replace a single item)
  • After that, I update the modified collection using requests

The problem is that the "items" link of the collection is duplicated every time the collection is updated.
After N updates, there are N "items" links in the collection's links!

I don't know if it's a bug, a limitation, or incorrect usage of pystac with pystac_client.

I have added a minimal example below to reproduce the issue.

Code snippet to reproduce the error

import datetime
import pystac
from pystac_client import Client, exceptions
import requests
from urllib.parse import urljoin


def post_or_put(url: str, data: dict):
    """Post or put data to url."""
    r = requests.post(url, json=data)
    if r.status_code == 409:
        new_url = url if data["type"] == "Collection" else url + f"/{data['id']}"
        # Exists, so update
        r = requests.put(new_url, json=data)
        # An unchanged resource may return a 404, so only raise on other errors
        if r.status_code != 404:
            r.raise_for_status()
    else:
        r.raise_for_status()

# New stac item
new_item = pystac.Item(
    id="my_item",
    bbox=[0.28, 43.20, 1.03, 43.76],
    geometry={'type': 'Polygon', 'coordinates': [[[0.28, 43.74], [1.01, 43.76], [1.03, 43.21], [0.30, 43.20], [0.28, 43.74]]]},
    datetime=datetime.datetime(year=2022, month=1, day=1),
    properties={'platform': 'something', 'instruments': ['something'], 'datetime': '2022-01-01T00:00:00Z'}
)
new_item.validate()

collection_id = "my_collection"
stacapi_url = "http://some-stac-fastapi.org"
api = Client.open(stacapi_url)

try:
    existing_collection = api.get_collection(collection_id)
except exceptions.APIError:
    existing_collection = None

if not existing_collection:
    print("Collection does not exist")
    spat_extent = pystac.SpatialExtent(bboxes=[new_item.bbox])
    temp_extent = pystac.TemporalExtent(intervals=[(new_item.datetime, new_item.datetime)])
    extent = pystac.Extent(spat_extent, temp_extent)
    collection = pystac.Collection(id=collection_id,
                                   description="some description",
                                   extent=extent,
                                   title="my collection",
                                   providers=[pystac.Provider("Some provider")])
else:
    print("Collection already exist")
    collection = existing_collection

collection.add_item(new_item)
collection.normalize_hrefs(stacapi_url)
collection.make_all_asset_hrefs_relative()
collection.validate()
post_or_put(urljoin(stacapi_url, "/collections"), collection.to_dict())
for link in collection.links:
    if link.rel == "item":
        post_or_put(urljoin(stacapi_url, f"collections/{collection_id}/items"), new_item.to_dict())

First run

Output:

Collection does not exist

Resulting collection:

{
  "id": "my_collection",
  "type": "Collection",
  "links": [
    {
      "rel": "items",
      "type": "application/geo+json",
      "href": "http://some-stac-fastapi.org/collections/my_collection/items"
    },
    {
      "rel": "parent",
      "type": "application/json",
      "href": "http://some-stac-fastapi.org/"
    },
    {
      "rel": "root",
      "type": "application/json",
      "href": "http://some-stac-fastapi.org/"
    },
    {
      "rel": "self",
      "type": "application/json",
      "href": "http://some-stac-fastapi.org/collections/my_collection"
    }
  ],
  "title": "my collection",
  "extent": {
    "spatial": {
      "bbox": [
        [
          0.28,
          43.2,
          1.03,
          43.76
        ]
      ]
    },
    "temporal": {
      "interval": [
        [
          "2022-01-01T00:00:00Z",
          "2022-01-01T00:00:00Z"
        ]
      ]
    }
  },
  "license": "proprietary",
  "providers": [
    {
      "name": "Some provider"
    }
  ],
  "description": "some description",
  "stac_version": "1.0.0",
  "stac_extensions": []
}

Nothing really exciting here. The collection is created.

Second run

Output:

Collection already exist

Resulting collection:

{
  "id": "my_collection",
  "type": "Collection",
  "links": [
    {
      "rel": "items",
      "type": "application/geo+json",
      "href": "http://some-stac-fastapi.org/collections/my_collection/items"
    },
    {
      "rel": "parent",
      "type": "application/json",
      "href": "http://some-stac-fastapi.org/"
    },
    {
      "rel": "root",
      "type": "application/json",
      "href": "http://some-stac-fastapi.org/"
    },
    {
      "rel": "self",
      "type": "application/json",
      "href": "http://some-stac-fastapi.org/collections/my_collection"
    },
    {
      "rel": "items",
      "href": "http://some-stac-fastapi.org/collections/my_collection/items",
      "type": "application/geo+json"
    }
  ],
  "title": "my collection",
  "extent": {
    "spatial": {
      "bbox": [
        [
          0.28,
          43.2,
          1.03,
          43.76
        ]
      ]
    },
    "temporal": {
      "interval": [
        [
          "2022-01-01T00:00:00Z",
          "2022-01-01T00:00:00Z"
        ]
      ]
    }
  },
  "license": "proprietary",
  "providers": [
    {
      "name": "Some provider"
    }
  ],
  "description": "some description",
  "stac_version": "1.0.0",
  "stac_extensions": []
}

Here, you can see that the "items" link is duplicated!

Third run

Collection already exist

Resulting collection:

{
  "id": "my_collection",
  "type": "Collection",
  "links": [
    {
      "rel": "items",
      "type": "application/geo+json",
      "href": "http://some-stac-fastapi.org/collections/my_collection/items"
    },
    {
      "rel": "parent",
      "type": "application/json",
      "href": "http://some-stac-fastapi.org/"
    },
    {
      "rel": "root",
      "type": "application/json",
      "href": "http://some-stac-fastapi.org/"
    },
    {
      "rel": "self",
      "type": "application/json",
      "href": "http://some-stac-fastapi.org/collections/my_collection"
    },
    {
      "rel": "items",
      "href": "http://some-stac-fastapi.org/collections/my_collection/items",
      "type": "application/geo+json"
    },
    {
      "rel": "items",
      "href": "http://some-stac-fastapi.org/collections/my_collection/items",
      "type": "application/geo+json"
    }
  ],
  "title": "my collection",
  "extent": {
    "spatial": {
      "bbox": [
        [
          0.28,
          43.2,
          1.03,
          43.76
        ]
      ]
    },
    "temporal": {
      "interval": [
        [
          "2022-01-01T00:00:00Z",
          "2022-01-01T00:00:00Z"
        ]
      ]
    }
  },
  "license": "proprietary",
  "providers": [
    {
      "name": "Some provider"
    }
  ],
  "description": "some description",
  "stac_version": "1.0.0",
  "stac_extensions": []
}

Here, the "items" link has been duplicated once more. There are now 3 "items" entries.

Is this behavior expected?
If so, how can I avoid the duplicated links?

Thanks

Rémi

Member

gadomski commented Dec 1, 2022

The PUT/POST /collections endpoint in stac-fastapi doesn't do anything magical with the links array on your Collection, and neither does the pgstac backend (I haven't checked sqlalchemy but I assume it's the same). So, when you update your collection that second time, the links array already includes a rel=items link. On the way back out, another rel=items link is added.
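The round trip described above can be illustrated with a toy model. This is only a sketch of the mechanism, not stac-fastapi or pgstac code: `ToyCollectionStore` and `count_items_links` are hypothetical names, and the store simply persists links as sent and appends one rel=items link on every read.

```python
from copy import deepcopy

ITEMS_REL = "items"


class ToyCollectionStore:
    """Hypothetical stand-in: stores links verbatim, adds rel=items on read."""

    def __init__(self, base_url):
        self.base_url = base_url.rstrip("/")
        self._stored = {}

    def upsert(self, collection):
        # Links sent by the client are persisted untouched.
        self._stored[collection["id"]] = deepcopy(collection)

    def read(self, collection_id):
        collection = deepcopy(self._stored[collection_id])
        # On the way out, a rel=items link is appended unconditionally.
        collection.setdefault("links", []).append({
            "rel": ITEMS_REL,
            "type": "application/geo+json",
            "href": f"{self.base_url}/collections/{collection_id}/items",
        })
        return collection


def count_items_links(collection):
    """Count the rel=items links in a collection dictionary."""
    return sum(1 for link in collection["links"] if link["rel"] == ITEMS_REL)


store = ToyCollectionStore("http://some-stac-fastapi.org")
store.upsert({"id": "my_collection", "type": "Collection", "links": []})

counts = []
for _ in range(3):
    fetched = store.read("my_collection")
    counts.append(count_items_links(fetched))
    store.upsert(fetched)  # send the fetched collection back, links included

print(counts)  # [1, 2, 3]
```

Each read-then-upsert cycle persists the link that was generated on the previous read, which matches the growth seen across the three runs in the report.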

I think it'd be reasonable for either stac-fastapi or the backend to strip off any existing rel=items links on the way in. I'm moving this issue to stac-fastapi and asking @bitner for his opinion.

For now, I would suggest removing the rel=items link before upserting your collection, e.g.

collection.clear_links("items")
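If you are already working with the dictionary returned by collection.to_dict(), the same idea can be applied to the plain dict before sending it. A minimal sketch, where `without_links` is a hypothetical helper and not part of pystac:

```python
def without_links(collection_dict, rel="items"):
    """Return a shallow copy of the collection dict without links of the given rel."""
    cleaned = dict(collection_dict)
    cleaned["links"] = [
        link for link in collection_dict.get("links", [])
        if link.get("rel") != rel
    ]
    return cleaned


fetched = {
    "id": "my_collection",
    "type": "Collection",
    "links": [
        {"rel": "items", "href": "http://some-stac-fastapi.org/collections/my_collection/items"},
        {"rel": "self", "href": "http://some-stac-fastapi.org/collections/my_collection"},
    ],
}
payload = without_links(fetched)
print([link["rel"] for link in payload["links"]])  # ['self']
```

The input dictionary is left unchanged, so the fetched collection can still be inspected after the payload is built.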

gadomski transferred this issue from stac-utils/pystac Dec 1, 2022

Collaborator

geospatial-jeff commented Dec 1, 2022

@remicres If you'd like to update an item in a collection, you do not need to update the collection. Instead, you can use the transactions extension (PUT /collections/{collection_id}/items) to update the item directly. I think this will fix the underlying issue of the items link being duplicated.

That being said, you and @gadomski have touched on an interesting subject, which is how stac-fastapi currently deals with inferred links. Inferred links are any links that can be derived from information in the request; the API generates them automatically to ensure that links resolve properly. Inferred links generated by the API should always take precedence over those passed by a user, and IMO should not be persisted in the database. The types package contains a helper function, filter_links, which is designed to filter out inferred links to prevent duplication in API responses.

from typing import Dict, List

INFERRED_LINK_RELS = ["self", "item", "parent", "collection", "root"]

def filter_links(links: List[Dict]) -> List[Dict]:
    """Remove inferred links."""
    return [link for link in links if link["rel"] not in INFERRED_LINK_RELS]

This function is used by the sqlalchemy backend but I don't believe it is called in pgstac. We could also be more aggressive with which links are included in INFERRED_LINK_RELS (as you can see it doesn't include items) or at least provide a way for users to specify what links are inferred. A lot of this ties into #472.
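As a rough illustration of the "let users specify which links are inferred" idea, the filter could take the rel list as a parameter. This is a sketch only: `filter_links_with` and `DEFAULT_INFERRED_LINK_RELS` are hypothetical names, not the existing stac-fastapi API.

```python
from typing import Dict, Iterable, List

# Same rel values as the list quoted above; "items" is deliberately not included.
DEFAULT_INFERRED_LINK_RELS = ("self", "item", "parent", "collection", "root")


def filter_links_with(
    links: List[Dict],
    inferred_rels: Iterable[str] = DEFAULT_INFERRED_LINK_RELS,
) -> List[Dict]:
    """Remove links whose rel is in the caller-supplied set of inferred rels."""
    inferred = set(inferred_rels)
    return [link for link in links if link["rel"] not in inferred]


links = [
    {"rel": "items", "href": "http://some-stac-fastapi.org/collections/my_collection/items"},
    {"rel": "self", "href": "http://some-stac-fastapi.org/collections/my_collection"},
    {"rel": "license", "href": "http://example.com/license"},
]

# With the default list, the rel=items link survives the filter.
default_kept = [link["rel"] for link in filter_links_with(links)]

# A caller that also treats "items" as inferred gets it stripped.
extended = (*DEFAULT_INFERRED_LINK_RELS, "items")
extended_kept = [link["rel"] for link in filter_links_with(links, extended)]

print(default_kept)   # ['items', 'license']
print(extended_kept)  # ['license']
```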

TLDR:
There is existing code to prevent duplication of these inferred links but it isn't used consistently and could probably be improved.

Contributor Author

remicres commented Dec 2, 2022

Hi @geospatial-jeff and @gadomski ,
I ended up doing the PUT /collections/{collection_id}/items (when the collection already exists).
When the collection does not exist, I just have to create it with POST /collections.
More straightforward than my previous approach 🥲
Thanks a lot!
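For readers landing here, a sketch of that workflow might look like the following. It rests on several assumptions: `upsert_item` is a hypothetical helper, `session` is any object with requests-style `post` and `put` methods (for example a `requests.Session`), and the POST-then-PUT-on-409 fallback and the item URL for PUT are borrowed from the `post_or_put` helper in the original snippet. The exact PUT path may differ between stac-fastapi versions.

```python
def upsert_item(session, base_url, collection, item, collection_exists):
    """Create the collection only if it is missing, then create or update the item."""
    base = base_url.rstrip("/")
    items_url = f"{base}/collections/{collection['id']}/items"

    if not collection_exists:
        # The collection is POSTed once and never PUT again afterwards,
        # so its links are not round-tripped through the API.
        session.post(f"{base}/collections", json=collection).raise_for_status()

    # Try to create the item first.
    response = session.post(items_url, json=item)
    if response.status_code == 409:
        # The item already exists, so update it instead.
        # Assumption: the PUT path mirrors the original post_or_put helper.
        response = session.put(f"{items_url}/{item['id']}", json=item)
    response.raise_for_status()
    return response
```

With the variables from the original snippet, a call could look like `upsert_item(requests.Session(), stacapi_url, collection.to_dict(), new_item.to_dict(), existing_collection is not None)`.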
