Define Geospatial Accepted Formats for DCAT-US #5010

Open
1 task
jbrown-xentity opened this issue Dec 9, 2024 · 10 comments

@jbrown-xentity
Contributor

User Story

In order to support data providers and questions around DCAT-US accepted spatial field values, data.gov admins want a detailed list and test/example use cases for what should be valid spatial values for DCAT-US.

Acceptance Criteria

[ACs should be clearly demoable/verifiable whenever possible. Try specifying them using BDD.]

  • GIVEN a DCAT-US dataset object with a spatial field filled out
    WHEN that field is examined
    THEN it is clear whether that format is supported or not.

Background

Current examples: https://github.com/GSA/ckanext-datajson/tree/main/ckanext/datajson/tests/datajson-samples

Security Considerations (required)

None

Sketch

We need to fully define the list of acceptable sources. The current logic of support is here: https://github.com/GSA/ckanext-geodatagov/blob/main/ckanext/geodatagov/logic.py#L445-L515
Need to start the list of use cases, and decide if the envelope use case (see here) is an acceptable format and should be included.
Make sure every test case is defined.

@rshewitt rshewitt moved this to 🏗 In Progress [8] in data.gov team board Dec 12, 2024
@rshewitt rshewitt self-assigned this Dec 12, 2024
@rshewitt
Contributor

the schema error occurs here in the datajson extension

@rshewitt
Contributor

rshewitt commented Dec 12, 2024

What does DCATUS define as valid "spatial" values? Of those, what do we support? spec

the "spatial" field is optional in DCATUS 1.1. if it exists it must be a string with at least 1 character.

  • a bounding coordinate box for the dataset represented in latitude / longitude pairs where the coordinates are specified in decimal degrees and in the order of: minimum longitude, minimum latitude, maximum longitude, maximum latitude
    • 1.0,2.0,3.5,5.5
  • a latitude / longitude pair (in decimal degrees) representing a point where the dataset is relevant
    • we support this as geojson.
  • a geographic feature expressed in Geography Markup Language using the Simple Features Profile
  • a geographic feature from the GeoNames database
    • United States
    • California
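For the bounding-box form, a small parsing sketch may help pin down what counts as well-formed. `parse_bbox` is a hypothetical name, not a function from ckanext-geodatagov, and the range checks are an assumption about what should be rejected rather than something the spec spells out.

```python
def parse_bbox(value: str) -> tuple:
    """Parse "minLon,minLat,maxLon,maxLat" into four floats.

    Hypothetical helper for illustration only.
    """
    parts = value.split(",")
    if len(parts) != 4:
        raise ValueError(f"expected 4 comma-separated numbers, got {len(parts)}")
    # float() tolerates surrounding whitespace, so "1.0, 2.0" also parses.
    min_lon, min_lat, max_lon, max_lat = (float(part) for part in parts)
    if not (-180 <= min_lon <= 180 and -180 <= max_lon <= 180):
        raise ValueError("longitude outside [-180, 180]")
    if not (-90 <= min_lat <= 90 and -90 <= max_lat <= 90):
        raise ValueError("latitude outside [-90, 90]")
    if min_lat > max_lat:
        raise ValueError("minimum latitude is greater than maximum latitude")
    # min_lon > max_lon is deliberately not rejected here: it may describe
    # a box that crosses the anti-meridian.
    return (min_lon, min_lat, max_lon, max_lat)


print(parse_bbox("1.0,2.0,3.5,5.5"))
```

Keeping the anti-meridian case out of the error path is a design choice for this sketch; a stricter parser could reject it instead.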

Other cases

  • If the input can be JSON deserialized and it's a list of 2 points (e.g. [[3,4],[5,6]]), it is converted into a geojson polygon; otherwise the string is returned as-is. for example,
two_points = "[[3,4],[5,6]]"
translate_spatial(two_points) # returns => '{"type": "Polygon", "coordinates": [[[3, 4], [3, 6], [5, 6], [5, 4], [3, 4]]]}'

geojson = '{"type":"Polygon","coordinates":[[[-124.3926,32.5358],[-124.3926,42.0022],[-114.1252,42.0022],[-114.1252,32.5358],[-124.3926,32.5358]]]}'
translate_spatial(geojson) # returns => same as input

just because the input can be JSON deserialized doesn't mean it's compatible with solr

we could check whether the input is valid geojson up front instead of letting solr complain when something is incompatible (assuming that is what happens; the point is that some downstream process complains)

import geojson

data = '{"type":"Polygon","coordinates":[[[-124.3926,32.5358],[-124.3926,42.0022],[-114.1252,42.0022],[-114.1252,32.5358],[-124.3926,32.5358]]]}'

geojson.loads(data) # => no exception, but loads() does not validate by default; check .is_valid on the returned geometry as well
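As a rough illustration of what an up-front check could look like without extra dependencies, here is a standard-library sketch. `geojson_problems` is a hypothetical name, and it only covers a few structural rules (known geometry type, closed polygon rings); it is not a full GeoJSON validator.

```python
import json

# The seven geometry types defined by the GeoJSON specification.
GEOMETRY_TYPES = {
    "Point", "MultiPoint", "LineString", "MultiLineString",
    "Polygon", "MultiPolygon", "GeometryCollection",
}


def geojson_problems(value: str) -> list:
    """Return a list of problems found; an empty list means the checks passed."""
    try:
        geom = json.loads(value)
    except json.JSONDecodeError as exc:
        return [f"not JSON: {exc.msg}"]
    if not isinstance(geom, dict):
        return ["top-level value is not a JSON object"]
    if geom.get("type") not in GEOMETRY_TYPES:
        return [f"unsupported type: {geom.get('type')!r}"]
    problems = []
    if geom["type"] == "Polygon":
        rings = geom.get("coordinates")
        if not isinstance(rings, list) or not rings:
            return ["Polygon has no rings"]
        for index, ring in enumerate(rings):
            if not isinstance(ring, list) or len(ring) < 4:
                problems.append(f"ring {index} has fewer than 4 positions")
            elif ring[0] != ring[-1]:
                problems.append(f"ring {index} is not closed")
    return problems


print(geojson_problems('{"type":"envelope","coordinates":[[1,2],[3,4]]}'))
```

Returning a list of messages rather than raising makes it easier to surface every problem to a data provider at once.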

Conclusion

  • we support the following string formats
    • "minX, minY, maxX, maxY"
    • "[ [ minX, minY ], [ maxX, maxY ] ]"
    • a geonames value
    • anything that can be JSON deserialized
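To make "is this format supported?" answerable at a glance, the four cases above could be told apart roughly like this. `classify_spatial` and its labels are hypothetical, the number pattern is deliberately narrow (no exponents), and a non-JSON string is only assumed to be a place name; nothing here looks it up in GeoNames.

```python
import json
import re

# A plain decimal number, e.g. -124.39 or 42.
_NUMBER = r"[-+]?\d+(?:\.\d+)?"
# Four such numbers separated by commas, with optional whitespace.
_BBOX_RE = re.compile(rf"^\s*{_NUMBER}(?:\s*,\s*{_NUMBER}){{3}}\s*$")


def classify_spatial(value: str) -> str:
    """Label a spatial string with the (assumed) format it falls under."""
    if _BBOX_RE.match(value):
        return "bbox"
    try:
        parsed = json.loads(value)
    except json.JSONDecodeError:
        return "place-name"  # e.g. "California"; not verified against GeoNames
    if (
        isinstance(parsed, list)
        and len(parsed) == 2
        and all(isinstance(point, list) and len(point) == 2 for point in parsed)
    ):
        return "two-points"
    return "json"


for sample in ["1.0,2.0,3.5,5.5", "[[3,4],[5,6]]", "California"]:
    print(sample, "->", classify_spatial(sample))
```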

@tdlowden
Member

I don't know how to interpret what "using the Simple Features Profile" means. The GML docs mention both simplePolygon and gridEnvelope, but is envelope not valid because it's not... simple?

@tdlowden
Member

tdlowden commented Dec 12, 2024

simple features profile: https://portal.ogc.org/files/?artifact_id=39853


@rshewitt
Contributor

rshewitt commented Dec 12, 2024

"spatial" is optional in all dcatus schemas, but if present it needs to be a string with at least 1 character in it. if the spatial data is an object, like the 3rd example in this source ( control+f "spatial" and navigate to it ), then validation will fail. dcatus specifies a JSON object as an acceptable value in some circumstances, which is different from a string. basically, the root of a common problem we see ( e.g. "ERROR #2: 'spatial':{'coordinates': [[-78.9823, 35.5216], [-78.2607, 36.0742]], 'type': 'envelope'} is not valid under any of the given schemas" ) isn't deep: they're not providing the correct data type.
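The string rule described here can be written out as a plain check. This is only a sketch of the rule as stated in this comment, with a hypothetical function name; the real validation is done by the schema in the datajson extension.

```python
import json


def spatial_schema_error(dataset: dict):
    """Return an error message if "spatial" breaks the string rule, else None."""
    if "spatial" not in dataset:
        return None  # the field is optional
    spatial = dataset["spatial"]
    if not isinstance(spatial, str):
        return f"'spatial' must be a string, got {type(spatial).__name__}"
    if len(spatial) < 1:
        return "'spatial' must contain at least 1 character"
    return None


envelope = {"coordinates": [[-78.9823, 35.5216], [-78.2607, 36.0742]], "type": "envelope"}

# An object fails the type rule...
print(spatial_schema_error({"spatial": envelope}))
# ...while the same data serialized to a string passes it.
print(spatial_schema_error({"spatial": json.dumps(envelope)}))
```

Note that serializing the object only satisfies the type rule; whether the resulting string is something search can index is a separate question.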

@tdlowden
Member

Understood. The issue here is that the source you cited is from ArcGIS, and its export option specifically claims that it DOES abide by the DCAT-US format


So regarding this envelope type.... do we need to ask ESRI to adapt?

@rshewitt
Contributor

as long as we're using solr for search we have to conform to what it supports. "spatial" data expressed as geojson (e.g. {"type": "Polygon", "coordinates": [[[10.0, 0.0], [10.0, 5.0], [15.0, 5.0], [15.0, 0.0], [10.0, 0.0]]]}) must be one of these types

  • "Point"
  • "MultiPoint"
  • "LineString"
  • "MultiLineString"
  • "Polygon"
  • "MultiPolygon"
  • "GeometryCollection"

We can add support for translating a geojson "envelope" in [ southwestPnt, northeastPnt ] format into a polygon compatible with solr. Until that happens, the data provider needs to update the value to something we support, so they would have to convert

# from this
{'coordinates': [[-78.9823, 35.5216], [-78.2607, 36.0742]], 'type': 'envelope'}

# into this
"""{
    "type": "Polygon",
    "coordinates": [
      [[-78.9823, 35.5216], [-78.2607, 35.5216], [-78.2607, 36.0742], [-78.9823, 36.0742], [-78.9823, 35.5216]]
    ]
}"""
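The conversion above is mechanical enough to sketch. `envelope_to_polygon` is a hypothetical helper, not existing code in any data.gov extension, and it assumes the envelope is always given as [southwest, northeast] with [lon, lat] ordering.

```python
import json


def envelope_to_polygon(envelope: dict) -> str:
    """Convert an "envelope" object into a GeoJSON Polygon string.

    Assumes coordinates are [[west, south], [east, north]].
    """
    if str(envelope.get("type", "")).lower() != "envelope":
        raise ValueError("expected an object with type 'envelope'")
    (west, south), (east, north) = envelope["coordinates"]
    ring = [
        [west, south],
        [east, south],
        [east, north],
        [west, north],
        [west, south],  # repeat the first corner to close the ring
    ]
    # Serialize to a string, since "spatial" must be a string to pass validation.
    return json.dumps({"type": "Polygon", "coordinates": [ring]})


print(envelope_to_polygon(
    {"coordinates": [[-78.9823, 35.5216], [-78.2607, 36.0742]], "type": "envelope"}
))
```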

@rshewitt
Contributor

rshewitt commented Dec 16, 2024

facet query of "old-spatial" on catalog. interestingly, there's 247 instances of this xml value ( simple features profile? )

<?xml version=\"1.0\" encoding=\"UTF-8\"?>
  <gml:Polygon xmlns:gml=\"http://www.opengis.net/gml/3.2\" srsName=\"EPSG:9825\">
    <gml:outerBoundaryIs>
      <gml:LinearRing>
        <gml:posList>-90.0 -180.0 -90.0 180.0 90.0 180.0 90.0 -180.0 -90.0 -180.0</gml:posList>
      </gml:LinearRing>
    </gml:outerBoundaryIs>
  <gml:innerBoundaryIs>
  </gml:innerBoundaryIs>
</gml:Polygon>

looks like we have 971 datasets with an old-spatial value containing the word xml

i validated this xml against a simple feature profile level-2 schematron
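If these GML values ever needed to be read, pulling the ring out of `gml:posList` is straightforward with the standard library. This is a sketch only: `gml_poslist_to_ring` is a hypothetical name, and it assumes the pairs are in latitude-then-longitude order (which the -180/180 values in the sample suggest) and swaps them into the lon/lat order geojson uses.

```python
import xml.etree.ElementTree as ET

GML_NS = "http://www.opengis.net/gml/3.2"


def gml_poslist_to_ring(gml: str) -> list:
    """Extract the first gml:posList as a list of [lon, lat] positions."""
    # Parsing bytes avoids trouble with an XML declaration that names an encoding.
    root = ET.fromstring(gml.encode("utf-8"))
    pos_list = root.find(f".//{{{GML_NS}}}posList")
    if pos_list is None or not pos_list.text:
        raise ValueError("no gml:posList found")
    values = [float(v) for v in pos_list.text.split()]
    if len(values) % 2:
        raise ValueError("posList has an odd number of values")
    # Assumption: input pairs are "lat lon"; output positions are [lon, lat].
    return [[lon, lat] for lat, lon in zip(values[0::2], values[1::2])]


sample = (
    '<gml:Polygon xmlns:gml="http://www.opengis.net/gml/3.2">'
    "<gml:outerBoundaryIs><gml:LinearRing>"
    "<gml:posList>-90.0 -180.0 -90.0 180.0 90.0 180.0 90.0 -180.0 -90.0 -180.0</gml:posList>"
    "</gml:LinearRing></gml:outerBoundaryIs>"
    "</gml:Polygon>"
)
print(gml_poslist_to_ring(sample))
```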

@rshewitt
Contributor

rshewitt commented Dec 16, 2024

  • Are there any gaps in what is defined by the spec and what we support? And if so, do we have any use cases for supporting them?

    • yes, there are gaps. we don't support (1) a single point as a comma-delimited string ( e.g. -1.1, 2.2 ), though there don't appear to be any on catalog, or (2) gml simple features profile values, of which we only have around 1k on catalog ( see comment above for query )
  • Are there any formats that we support that aren't in the documentation? If so, what and why?

    • we support anything that is JSON deserializable which can include geojson
  • Provide clear, reusable examples of each type we support, preferably from real-world examples

    • real-world examples are provided throughout this ticket.
  • Do we need to write tests for these use cases (see current tests here)?

    • overall we're in solid shape. something that could be beneficial is validating whether the deserialized string is valid geojson.
  • Should we update the DCAT-US documentation or examples? Are there potential gaps/problems in the code that should be addressed (like your note about trying to load the final import with geojson)?

    • updating the documentation with more examples would be good.
    • add support for geojson validation
  • Should we allow Envelope terminology (picking up this ticket)?

    • "envelope" isn't valid geojson but we can support it if we want.
  • Should we fix the bug with geospatial bounds?

@jbrown-xentity
Contributor Author

Quick summary:

  • ckanext-spatial supports geojson, a widely used and widely known open source geospatial data format, for spatial (and uses leaflet for display).
  • We currently translate various formats into geojson for spatial search usage.
  • There is no clear definition of what should be in spatial, whether in DCAT-US, DCAT, or DCAT's various reference sources.
  • We currently require for validation that the spatial field be a string with at least one character. Some providers (see recent "Envelope" discussion) are providing a JSON structured object.

Recommendations:

  • Promote and allow for raw geojson input for spatial field (either as JSON, or as string)
  • Continue support for "custom" minx,miny... formats, but formalize what is allowed and what is not (spaces after commas? etc.)
  • Evaluate and/or fix data objects that go around/over the anti-meridian
  • Have better/clear warnings when data is not spatially indexed properly, and don't fail dataset ingestion on load due to spatial transform failure. Spatial documentation is so unclear that datasets should not be rejected due to "bad" spatial value.
  • Don't support the Envelope type for spatial search, but allow it to pass validation and be imported (force usage of better geospatial representations for spatial search, but still allow the dataset to be harvested).
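The "warn, don't reject" recommendation could look something like the following. Everything here is a hypothetical sketch: `harvest_dataset`, `strict_transform`, and the `spatial_geojson` / `spatial_warning` keys are illustrative names, not part of the existing harvester.

```python
import json
import logging

log = logging.getLogger("spatial_sketch")


def strict_transform(spatial):
    """Stand-in transform: accept only a string holding a JSON object with a type."""
    geom = json.loads(spatial)  # TypeError for non-strings, ValueError for bad JSON
    if not isinstance(geom, dict) or "type" not in geom:
        raise ValueError("not a GeoJSON object")
    return geom


def harvest_dataset(dataset: dict, transform=strict_transform) -> dict:
    """Keep the dataset even when its spatial value cannot be transformed."""
    record = dict(dataset)
    if "spatial" not in dataset:
        return record
    try:
        record["spatial_geojson"] = transform(dataset["spatial"])
    except (TypeError, ValueError) as exc:
        # Warn and carry on: the dataset is harvested, just not spatially indexed.
        log.warning("dataset %r kept without spatial index: %s", dataset.get("title"), exc)
        record["spatial_warning"] = str(exc)
    return record


envelope = {"coordinates": [[-78.9823, 35.5216], [-78.2607, 36.0742]], "type": "envelope"}
print(harvest_dataset({"title": "example", "spatial": envelope}))
```

Recording the warning on the record, and not only in the log, is one way to make the problem visible to the data provider later.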

Projects
Status: 🏗 In Progress [8]