Summary statistics [RSS] #84
Comments
DQV should be able to meet these requirements. |
Do we also need statistics on distributions? This requirement is suggested by the comment submitted by Daniel Pop [1]. Of course it also depends on how we resolve the matter of 'information equivalence' of different distributions. [1] https://lists.w3.org/Archives/Public/public-dxwg-comments/2019Jan/0013.html |
I think we shouldn't prevent this - as done for other information. The question is how to put this option in the spec. I'm a bit reluctant to explicitly add properties in class definitions where we don't have real-world use cases and/or implementation evidence. So, this could be included in the "guidance" part of the spec. About "information equivalence", (again) -1 to it. This ends up being a matter of the "granularity" of the notion of dataset, which is mainly a data provider choice (possibly also based on the requirements of the intended users). |
@dr-shorthair I am not quite sure how you derive a requirement for statistics on datasets? If there is a need for it, maybe we could refer to DQV or Data Cube? |
The statistic that Daniel mentioned is the frequency or spacing of members in a time series, where various distributions might have fixed spacing that is different (usually coarser) than what is available from the underlying dataset. I was on the point of creating an explicit issue for this aspect alone, but since this is an aspect of dataset statistics I thought it would be best to open the discussion here first. |
@makxdekkers I did not derive a new requirement for dataset statistics - this was one of the original requirements taken from UCR. However, I do wonder if time-series are such a common case that they might deserve special treatment. i.e. complement |
If I understand correctly, the concept @dr-shorthair is looking for is named temporalResolution in ISO19115-1, and is important for evaluating datasets that have temporal coverage. There is a corresponding spatialResolution property that is equally important if you're evaluating spatial data. |
@smrgeoinfo yes - I think we need to pair
And, while I'm a little wary of treading too far down a path that should be managed through a geospatial profile, since we already have
(and stop there). |
For spatial / temporal resolution, see UC15, which describes the general context and provides the relevant references. These topics were discussed by the SDW WG, and then with the DWBP WG (in particular, with @aisaac and @riccardoAlbertoni ), which led to a proposal on how to specify it by using DQV. The proposal is included as an example (focussing on spatial resolution only) in DQV, §6.13 (Express dataset precision and accuracy), which was in turn re-used in SDW's Best Practice 14 (Describe the positional accuracy of spatial data). We should therefore re-use and consolidate that approach. About consolidation, I summarised what I see as issues to be addressed in the context of the possible revisions to GeoDCAT-AP (see SEMICeu/GeoDCAT-AP#3). For our convenience, I copy-paste below the relevant text from SEMICeu/GeoDCAT-AP#3:
|
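For readers who do not have the DQV document at hand, the DQV-based pattern referred to above looks roughly like the sketch below. This is only an approximation of that approach, not the text of the proposal: the `ex:` namespace, the resource names, the value of 1000 and the choice of unit IRI are all assumptions made for illustration, and the metric is deliberately not tied to a quality dimension.

```turtle
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix dqv:  <http://www.w3.org/ns/dqv#> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .
@prefix sdmx-attribute: <http://purl.org/linked-data/sdmx/2009/attribute#> .
@prefix ex:   <http://example.org/> .   # hypothetical namespace

# Hypothetical metric: spatial resolution expressed as a distance.
ex:spatialResolutionAsDistance a dqv:Metric ;
    skos:definition "Spatial resolution of a dataset expressed as distance"@en ;
    dqv:expectedDataType xsd:decimal .

# Hypothetical dataset carrying one measurement of that metric.
ex:myDataset a dcat:Dataset ;
    dqv:hasQualityMeasurement ex:myDatasetResolution .

ex:myDatasetResolution a dqv:QualityMeasurement ;
    dqv:isMeasurementOf ex:spatialResolutionAsDistance ;
    dqv:value "1000"^^xsd:decimal ;
    # The unit IRI is an assumption; any agreed unit vocabulary could be used.
    sdmx-attribute:unitMeasure <http://qudt.org/vocab/unit/M> .
```

Note that value and unit travel together on the measurement, which is what makes this pattern general but also more verbose than a single property on the dataset.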
I agree that DQV is competent to satisfy the requirement, as shown in the examples. For example, the examples and the summary above present multiple kinds of 'spatial resolution', which may be important for sophisticated users. Access to a single summary statistic for each would help a lot in the initial discovery phase. My proposition (above) is that for DCAT to work better for a large number of datasets, two statistics might be worth 'promoting' to be first-class properties for datasets, i.e. corresponding to:
|
@dr-shorthair It would indeed be good if there was a simple way to expose resolutions. There is in any case a need to express both value and unit, so for spatial resolution the range would be (something like) |
@dr-shorthair , I also agree that we need to address the simplest use cases first, and actually the reasoning in SEMICeu/GeoDCAT-AP#3 was along those lines (the first example was about the two typical ways of expressing spatial resolution: distance and equivalent scale). As @makxdekkers says, I see more of an issue in the fact that we need to express both value and unit of measurement. However we do it, it is unlikely that we end up with something simpler than the DQV approach, unless we fold all these semantics into one single term and allow the use of just one unit of measurement. E.g., by using properties like:
or
(or something along those lines). |
One issue with dqv is that in some engineering situations, resolution and precision are different. Is there a problem with using schema:Distance as the value for SpatialResolutionAsDistance, and schema:Duration for TemporalResolutionAsDuration? |
@smrgeoinfo wrote:
Yes, the wording of the relevant section in DQV does not make this distinction, but the formal definition of the resolution in the examples does not bind the notion of resolution with the one of precision.
Maybe Besides this, IMO, re-using Schema.org properties may lead to the issues mentioned in #85 (comment) (in that case in relation to |
@andrea-perego yes this is a bit of a perma-issue. There are too many representations of 'measure' or 'quantity' already, but none have achieved universal acceptance. Furthermore, most come with a lot of baggage (or at least are just one tiny part of some huge vocabulary, the rest of which we have little interest in in this context). That is the problem with your original DQV proposal: it makes the simple case hard. So, taking a leaf out of Randall Munroe's book, I suggest crashing through and specifying this as the range of both
Which would mean that an instance would look like
|
@dr-shorthair While I do like the approach of providing a 'simple' solution for 'simple' cases, I do feel a bit uneasy about replicating something that is already there, i.e. the more 'fundamental' solution in DQV. If we promote this 'simple' solution, 'simple' cases -- using the DCAT-specific solution -- are not going to be interoperable with more 'complex' cases using a DQV-based solution. One could argue that by promoting a DCAT-specific approach, we are discouraging people from using a DQV-based approach and thus only catering for 'simple' cases to be handled by DCAT. |
Yeah. On the one hand, I'm usually one of the first to advocate strongly for re-use of existing solutions, particularly if they are from the W3C stable and have clearly been designed to integrate. Of course, all of these spatial and temporal properties (including the classic DCT ones) have non-simple values, so the complexity just re-appears a layer down anyway. However, I think the mappings to DQV can almost certainly be formally expressed using OWL Restrictions and property-chain axioms (e.g. see mappings from DCT to PROV here: https://github.com/w3c/dxwg/blob/gh-pages/dcat/rdf/dcat-prov.ttl#L63 ) so I'm not sure the interoperability argument made by @makxdekkers is strictly true. |
@dr-shorthair , working towards a simple solution: Following up from @makxdekkers 's and @smrgeoinfo 's comment on Re-using your example, this would be something like:
<> a dcat:Dataset ;
    ...
    dcat:temporalResolution "PT15M"^^xsd:duration ;
    ...
.
Unfortunately, the same cannot be done for spatial resolution. |
Good point. Temporal resolution was the thing that triggered this discussion, and it is more mainstream - one dimension is so much easier than two or three. Spatial resolution (as distance) is still relatively simple conceptually but does need an explicit UOM. If only XSD had a 'measure' type (and every other programming language for that matter ... computer-science fail IMHO) |
@dr-shorthair wrote:
I am not very convinced about the need to mint a new property for dcat:unitOfMeasure.
I see pros and cons in having both approaches: the DQV/RDF Data Cube style and the DCAT properties. |
Mind you, |
(shame there isn't an ISO standard for 'Length' complementing what ISO 8601 did for 'Time') |
See revised proposal for |
+1 from me. |
I wonder whether we could consider adding properties for spatial resolutions not expressed as distance, namely, as equivalent scale - which is the other most common way of specifying spatial resolution. |
I'm reluctant to provide a second option at this level. As soon as you have more than one alternative, you begin to lose interoperability. I understand that '1:50,000' etc is the cartographic tradition, and geographers routinely infer resolution from this ('what is the distance on the ground of the thickness of a pencil line on the map?'). But a length measure is more direct and less ambiguous, and also applies to gridded data. While more detail and options can be given using the DQV structures shown above, I really think we should add only one option in the DCAT namespace. |
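To make the 'pencil line' inference above concrete, here is a worked example. The line thickness of 0.5 mm is purely an assumed figure for illustration (it is not from the thread), and N is the scale denominator of a 1:N map:

```latex
d_{\text{ground}} = w_{\text{line}} \times N
                  = 0.5\,\text{mm} \times 50\,000
                  = 25\,000\,\text{mm}
                  = 25\,\text{m}
```

So under that assumption a 1:50,000 map would correspond to a resolution of roughly 25 m on the ground. The result depends entirely on the assumed line thickness, which is one reason a scale is a less direct statement of resolution than a length.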
@riccardoAlbertoni Your contributions in Chapter 8 show some patterns for use of DQV for quality information. Are you aware of a 'standard' way to provide basic dataset statistics using DQV or any other RDF vocabulary? e.g. minimum/maximum(/average) values for specified dimensions? I'm not seeing anything obvious in DQV or QB :-( I guess it might be a dqv:Metric but I wonder if you could provide guidance on how this might look? |
I was looking for the same thing and the relevant bit that I found is this DQV section on statistics that relies on an extension of VoID and thus too oriented to RDF datasets. |
I am not aware of anything except the examples mentioned by @agbeltran for statistics oriented to RDF datasets; perhaps @makxdekkers knows more? Anyway, I guess there is more than one way to do it. For example, using RDF Data Cube you can define your own qb:DataStructureDefinition. If you want to describe statistics of datasets such as Average, Max, Min for the "fields" in the dataset, you might define a qb:DataStructureDefinition whose dimensions/components include
If you provide statistics as quality indicators, you can think of using a DQV dqv:QualityMeasurement, for example defining a new dqv:Dimension for each pair of field and operator. |
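A rough sketch of how the two options just described might look in Turtle. Everything in the `ex:` namespace (the `ex:field` and `ex:operator` dimensions, the `ex:statValue` measure, the air-temperature field and the value 31.4) is invented for illustration and is not taken from any spec. For the DQV option a dqv:Metric is added under the per-pair dimension, since a measurement points at a metric; that modelling choice is an assumption.

```turtle
@prefix qb:   <http://purl.org/linked-data/cube#> .
@prefix dqv:  <http://www.w3.org/ns/dqv#> .
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .
@prefix ex:   <http://example.org/> .   # hypothetical namespace

# --- Option 1: RDF Data Cube ---------------------------------------
# Hypothetical dimensions (which field, which operator) and one measure.
ex:field     a qb:DimensionProperty .
ex:operator  a qb:DimensionProperty .
ex:statValue a qb:MeasureProperty .

ex:statsStructure a qb:DataStructureDefinition ;
    qb:component [ qb:dimension ex:field ] ,
                 [ qb:dimension ex:operator ] ,
                 [ qb:measure   ex:statValue ] .

ex:statsCube a qb:DataSet ;
    qb:structure ex:statsStructure .

# One observation: the maximum of the (made-up) air temperature field.
ex:obs1 a qb:Observation ;
    qb:dataSet   ex:statsCube ;
    ex:field     ex:airTemperature ;
    ex:operator  ex:max ;
    ex:statValue "31.4"^^xsd:decimal .

# --- Option 2: DQV -------------------------------------------------
# One dimension (and metric) per field/operator pair.
ex:airTemperatureMax a dqv:Dimension .

ex:airTemperatureMaxMetric a dqv:Metric ;
    dqv:inDimension ex:airTemperatureMax ;
    dqv:expectedDataType xsd:decimal .

ex:myDataset a dcat:Dataset ;
    dqv:hasQualityMeasurement ex:airTemperatureMaxMeasurement .

ex:airTemperatureMaxMeasurement a dqv:QualityMeasurement ;
    dqv:isMeasurementOf ex:airTemperatureMaxMetric ;
    dqv:value "31.4"^^xsd:decimal .
```

The Data Cube variant keeps the set of terms fixed as fields are added, but the sketch does not show how the cube is linked back to the dataset it describes. The DQV variant attaches directly to the dcat:Dataset but needs a new dimension and metric per field/operator pair.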
@dr-shorthair wrote:
I would also prefer to have one solution that fits all use cases, but we should also recognise that these two ways of expressing spatial resolution (i.e., distance and equivalent scale) are not comparable or convertible. So, IMO, the use of two different properties is more than acceptable. BTW, my request is also based on an explicit requirement from GeoDCAT-AP - which is defining mappings from ISO 19115:2003, where spatial resolution is expressed either as distance or equivalent scale. |
Rethinking this, we should probably consider the option of specifying spatial resolution in 2 steps (which was one of the options discussed earlier):
a:Dataset a dcat:Dataset ;
    dcat:spatialResolution [
        dcat:distanceInMeters "15"^^xsd:decimal
    ] .
One of the advantages is that it would be easier for people to reuse the main pattern |
@andrea-perego - do you see this issue as critical or can this be moved to the backlog? |
Partially critical (for the reasons I explained) but it can be moved to the backlog, provided that it will be possible to come back to this after DCAT v1.1 is out and possibly address it in the v1.2 release. |
I created a new issue to work on the discussion points still open: Closing this one. |
Summary statistics [RSS]
Express summary statistics and descriptive metrics to characterize a Dataset.
Related use cases: Summarization/Characterization of datasets [ID33]