
Summary statistics [RSS] #84

Closed
jpullmann opened this issue Jan 18, 2018 · 39 comments

@jpullmann

Summary statistics [RSS]

Express summary statistics and descriptive metrics to characterize a Dataset.


Related use cases: Summarization/Characterization of datasets [ID33] 
@makxdekkers
Contributor

DQV should be able to meet these requirements.

@dr-shorthair
Contributor

Do we also need statistics on distributions? This requirement is suggested by the comment submitted by Daniel Pop [1]. Of course it also depends on how we resolve the matter of 'information equivalence' of different distributions.

[1] https://lists.w3.org/Archives/Public/public-dxwg-comments/2019Jan/0013.html

@andrea-perego
Contributor

I think we shouldn't prevent this - as done for other information. The question is how to put this option in the spec.

I'm a bit reluctant to explicitly add properties in class definitions where we don't have real-world use cases and/or implementation evidence. So, this could be included in the "guidance" part of the spec.

About "information equivalence": (again) -1 to it. This ends up being a matter of the "granularity" of the notion of dataset, which is mainly a data provider's choice (possibly also based on the requirements of the intended users).

@makxdekkers
Contributor

@dr-shorthair I am not quite sure how you derive a requirement for statistics on datasets. If there is a need for it, maybe we could refer to DQV or Data Cube?
In my mind, Daniel's point (1) could be resolved by modelling the real-time data stream as a dcat:DataService and modelling the CSVs as separate datasets.
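For illustration, that split might look something like this (URIs and titles are invented; the dcat:servesDataset link is the one from the current draft):

```turtle
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix dct:  <http://purl.org/dc/terms/> .

# The real-time stream, modelled as a service rather than a dataset
<https://example.org/stream> a dcat:DataService ;
    dct:title "Real-time sensor stream"@en ;
    dcat:servesDataset <https://example.org/daily-csv> .

# The CSV exports, modelled as a separate dataset
<https://example.org/daily-csv> a dcat:Dataset ;
    dct:title "Daily CSV snapshots of the stream"@en .
```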

@dr-shorthair
Contributor

The statistic that Daniel mentioned is the frequency or spacing of members in a time series, where various distributions might have fixed spacing that is different (usually coarser) than what is available from the underlying dataset. I was on the point of creating an explicit issue for this aspect alone, but since this is an aspect of dataset statistics I thought it would be best to open the discussion here first.

@dr-shorthair
Contributor

@makxdekkers I did not derive a new requirement for dataset statistics - this was one of the original requirements taken from UCR.

However, I do wonder if time-series are such a common case that they might deserve special treatment. i.e. complement dct:temporal (coverage) with one more number - the item-accrual-periodicity. And since dct:accrualPeriodicity has been hijacked (in the DCAT context) to describe the publication period, it might have to be a new property? See #728

@smrgeoinfo
Contributor

If I understand correctly, the concept @dr-shorthair is looking for is named temporalResolution in ISO19115-1, and is important for evaluating datasets that have temporal coverage. There is a corresponding spatialResolution property that is equally important if you're evaluating spatial data.

@dr-shorthair
Contributor

dr-shorthair commented Feb 17, 2019

@smrgeoinfo yes - I think we need to pair

  • dct:temporal - Temporal Coverage - i.e. the temporal extent of the dataset, the time interval that this dataset describes
    with
  • dcat:temporalResolution - smallest time period resolvable in the data; e.g. temporal spacing of a regular time series

And, while I'm a little wary of treading too far down a path that should be managed through a geospatial profile, since we already have

  • dct:spatial - Geospatial coverage - i.e. the spatial extent of the dataset
    it is not much of a stretch to match this with
  • dcat:spatialResolution - smallest distance separating items in the data

(and stop there).
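For illustration, a dataset description using these proposed pairings might look like the sketch below. Note that the two dcat: properties do not exist yet, and the value forms (an xsd:duration, and a bare decimal in metres) are just one possibility:

```turtle
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix dct:  <http://purl.org/dc/terms/> .
@prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .

<https://example.org/sea-level-series> a dcat:Dataset ;
    # coverage: the extent that the dataset describes
    dct:temporal [ a dct:PeriodOfTime ] ;
    dct:spatial  <http://sws.geonames.org/2077456/> ;
    # resolution: the smallest spacing resolvable within that extent
    dcat:temporalResolution "PT15M"^^xsd:duration ;
    dcat:spatialResolution  "30.0"^^xsd:decimal .
```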

@andrea-perego
Contributor

For spatial / temporal resolution, see UC15, which describes the general context and provides the relevant references.

These topics were discussed by the SDW WG, and then with the DWBP WG (in particular, with @aisaac and @riccardoAlbertoni ), which led to a proposal on how to specify it by using DQV.

The proposal is included as an example (focussing on spatial resolution only) in DQV, §6.13 (Express dataset precision and accuracy), which was in turn re-used in SDW's Best Practice 14 (Describe the positional accuracy of spatial data).

We should therefore re-use and consolidate that approach.

About consolidation, I summarised what I see as issues to be addressed in the context of the possible revisions to GeoDCAT-AP ( see SEMICeu/GeoDCAT-AP#3).

For our convenience, I copy-paste below the relevant text from SEMICeu/GeoDCAT-AP#3:

Basically, DQV models this information as observations / measurements of a given quality metric (which corresponds to a given type of resolution).

[...]

[Adopting] This [solution] would however require the definition of two groups of individuals:

  1. Those corresponding to the different types of resolution (denoting a quality metric).
  2. Those corresponding to each of the different levels of resolution (denoting the measurement of a specific quality metric).

As far as the first group is concerned (i.e., the different types of resolution), these individuals can be defined in DQV as follows:

:SpatialResolutionAsEquivalentScale a dqv:Metric;
  skos:definition "Spatial resolution of a dataset expressed as equivalent scale,
	  by using a representative fraction (e.g., 1:1,000, 1:1,000,000)."@en ;
  dqv:expectedDataType xsd:decimal ;
  dqv:inDimension dqv:precision .
    
:SpatialResolutionAsDistance a dqv:Metric;
  skos:definition "Spatial resolution of a dataset expressed as distance"@en ;
  dqv:expectedDataType xsd:decimal ;
  dqv:inDimension dqv:precision .

This initial list can be further extended. E.g.:

:SpatialResolutionAsHorizontalGroundDistance a dqv:Metric;
  skos:definition "Spatial resolution of a dataset expressed as horizontal ground distance"@en ;
  dqv:expectedDataType xsd:decimal ;
  dqv:inDimension dqv:precision .
    
:SpatialResolutionAsVerticalDistance a dqv:Metric;
  skos:definition "Spatial resolution of a dataset expressed as vertical distance"@en ;
  dqv:expectedDataType xsd:decimal ;
  dqv:inDimension dqv:precision .
    
:SpatialResolutionAsAngularDistance a dqv:Metric;
  skos:definition "Spatial resolution of a dataset expressed as angular distance"@en ;
  dqv:expectedDataType xsd:decimal ;
  dqv:inDimension dqv:precision .    

The question is in which space such individuals should be defined [...].

The definition of individuals in the second group is however more problematic, since the level of resolution and unit of measurement are arbitrary (1:1000, 1:100, 1m, 1km, 100m, 10 decimal degrees, etc.).

Possible options include the following ones:

  1. Define only the individuals corresponding to the types of spatial / temporal resolution, whereas the individuals expressing the actual resolution will be defined at the data level. This solution is not optimal, since it will result in multiple definitions of the same individuals.
  2. Define individuals only for some levels of resolution and units of measurements - e.g., the most common ones. This solution may address the majority of (but not all) the cases.
  3. Set up a URI space supporting arbitrary levels of resolution and units of measurements. This register will dynamically generate the corresponding individuals based on information included in their URI.

An example of the last option, including also a proposal for how these individuals could be defined, is available at:

http://geodcat-ap.semic.eu/id/resolution/
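Following the DQV pattern, a measurement of one of these metrics would then be attached to the dataset roughly as follows (the dataset URI is invented; the unit URI follows the one used in the DQV §6.13 example):

```turtle
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix dqv:  <http://www.w3.org/ns/dqv#> .
@prefix sdmx-attribute: <http://purl.org/linked-data/sdmx/2009/attribute#> .
@prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .

<https://example.org/ortho-imagery> a dcat:Dataset ;
    dqv:hasQualityMeasurement [
        a dqv:QualityMeasurement ;
        # the metric individual defined above
        dqv:isMeasurementOf :SpatialResolutionAsDistance ;
        dqv:value "30.0"^^xsd:decimal ;
        sdmx-attribute:unitMeasure <http://www.wurvoc.org/vocabularies/om-1.8/metre>
    ] .
```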

@dr-shorthair
Contributor

dr-shorthair commented Feb 18, 2019

I agree that DQV is competent to satisfy the requirement, as shown in the examples.
However, I'm not sure it is optimal for meeting it in the DCAT context.

For example, the examples and the summary above present multiple kinds of 'spatial resolution', which may be important for sophisticated users.
But pushing the basic case into this structure, and then depending on a subsidiary vocabulary for labels like 'SpatialResolutionAsDistance', adds two additional layers for concepts that are widely relevant and can be easily explained (and also note the dependency on SDMX as well ...).

Access to a single summary statistic for each would help a lot in the initial discovery phase.
Interoperability is almost always helped by limiting the options.

My proposition (above) is that for DCAT to work better for a large number of datasets, two statistics might be worth 'promoting' to be first-class properties for datasets, i.e. corresponding to:

  • SpatialResolutionAsDistance
  • TemporalResolutionAsDuration

@makxdekkers
Contributor

@dr-shorthair It would indeed be good if there was a simple way to expose resolutions. There is in any case a need to express both value and unit, so for spatial resolution the range would be (something like) schema:Distance, and for temporal resolution (something like) schema:Duration.
Unfortunately, DCMI only has a class dct:SizeOrDuration, but not separate classes for Size and Duration. Should we define classes dcat:Distance and dcat:Duration?

@andrea-perego
Contributor

@dr-shorthair , I also agree that we need to address first the simplest use cases - and actually the reasoning in SEMICeu/GeoDCAT-AP#3 was along those lines (the first example was about the two typical ways of expressing spatial resolution: distance and equivalent scale).

As @makxdekkers says, I see it more as an issue that we need to express both value and unit of measurement, and however we do it, it is unlikely we will end up with something simpler than the DQV approach, unless we fold all these semantics into a single term and allow the use of just one unit of measurement. E.g., by using properties like:

  • dcat:spatialResolutionAsDistanceInMeters
  • dcat:temporalResolutionAsDurationInSeconds

or

  • dcat:resolution / dcat:SpatialResolutionAsDistance / dcat:distanceInMeters
  • dcat:resolution / dcat:TemporalResolutionAsDuration / dcat:durationInSeconds

(or something along those lines).

@smrgeoinfo
Contributor

One issue with dqv is that in some engineering situations, resolution and precision are different. Is there a problem with using schema:Distance as the value for SpatialResolutionAsDistance, and schema:Duration for TemporalResolutionAsDuration?

@andrea-perego
Contributor

@smrgeoinfo wrote:

One issue with dqv is that in some engineering situations, resolution and precision are different.

Yes, the wording of the relevant section in DQV does not make this distinction, but the formal definition of resolution in the examples does not bind the notion of resolution to that of precision.

Is there a problem with using schema:Distance as the value for SpatialResolutionAsDistance, and schema:Duration for TemporalResolutionAsDuration?

Maybe schema:Duration can work, as it uses a standard syntax encoding, but schema:Distance uses a literal in which the value and a code for the unit of measurement are separated by a space. Besides the problem of ensuring that codes for units of measurement are used consistently, this value is not machine-actionable. E.g., I won't be able to write a query to get the datasets with a spatial resolution of less than 100 m.

Besides this, IMO, re-using Schema.org properties may lead to the issues mentioned in #85 (comment) (in that case in relation to schema:startDate and schema:endDate).

@dr-shorthair
Contributor

dr-shorthair commented Feb 19, 2019

@andrea-perego yes, this is a bit of a perma-issue. There are too many representations of 'measure' or 'quantity' already, but none have achieved universal acceptance. Furthermore, most come with a lot of baggage (or at least are just one tiny part of some huge vocabulary, the rest of which is of little interest in this context). That is the problem with your original DQV proposal: it makes the simple case hard.

So, taking a leaf out of Randall Munroe's book, I suggest crashing through and specifying this as the range of both dcat:temporalResolution and dcat:spatialResolution:

dcat:Measure a owl:Class . 
dcat:unitOfMeasure a rdf:Property ;
    rdfs:domain dcat:Measure .
dcat:amount a owl:DatatypeProperty ;
    rdfs:domain dcat:Measure ;
    rdfs:range xsd:decimal .

Which would mean that an instance would look like

<> a dcat:Dataset ;
    ...
    dcat:temporalResolution [
        a dcat:Measure ;
        dcat:amount 15.0 ;
        dcat:unitOfMeasure <http://www.w3.org/2006/time#unitMinute> ;
    ] ;
    dcat:spatialResolution [
        a dcat:Measure ;
        dcat:amount 30.0 ;
        dcat:unitOfMeasure <http://qudt.org/vocab/unit/M> ;
    ] ;
    ...
.

@makxdekkers
Contributor

@dr-shorthair While I do like the approach to provide a 'simple' solution for 'simple' cases, I do feel a bit uneasy to replicate something that is already there, i.e. the more 'fundamental' solution in DQV. If we promote this 'simple' solution, 'simple' cases -- using the DCAT-specific solution -- are not going to be interoperable with more 'complex' cases using a DQV-based solution. One could argue that by promoting a DCAT-specific approach, we are discouraging people to use a DQV-based approach and thus only cater for 'simple' cases to be handled by DCAT.

@dr-shorthair
Contributor

Yeah. On the one hand, I'm usually one of the first to advocate strongly for re-use of existing solutions, particularly if they are from the W3C stable and have clearly been designed to integrate.
On the other hand, I was somewhat put off by the complexity that is introduced, as a further controlled vocabulary is required for the property semantics. I understand why DQV does it that way, to remain scalable and general. But we need to be sure that we want this to be reflected into DCAT. Furthermore, as has been noted before, DQV is not a Rec, so officially it cannot be cited normatively ;-(

Of course, all of these spatial and temporal properties (including the classic DCT ones) have non-simple values, so the complexity just re-appears a layer down anyway.

However, I think the mappings to DQV can almost certainly be formally expressed using OWL Restrictions and property-chain-axioms (e.g. see mappings from DCT to PROV here: https://github.com/w3c/dxwg/blob/gh-pages/dcat/rdf/dcat-prov.ttl#L63 ) so I'm not sure the interoperability argument made by @makxdekkers is strictly true.

@andrea-perego
Contributor

@dr-shorthair , working towards a simple solution:

Following up on @makxdekkers's and @smrgeoinfo's comments about schema:Duration, couldn't we make dcat:temporalResolution a datatype property with range xsd:duration?

Re-using your example, this would be something like:

<> a dcat:Dataset ;
    ...
    dcat:temporalResolution "PT15M"^^xsd:duration ;
    ...
.

Unfortunately, the same cannot be done for spatial resolution.

@dr-shorthair
Contributor

Good point. Temporal resolution was the thing that triggered this discussion, and it is more mainstream - one dimension is so much easier than two or three.

Spatial resolution (as distance) is still relatively simple conceptually, but it does need an explicit UOM. If only XSD had a 'measure' type (and every other programming language for that matter ... computer-science fail IMHO).

@riccardoAlbertoni
Contributor

@dr-shorthair wrote:

...
dcat:spatialResolution [
a dcat:Measure ;
dcat:amount 30.0 ;
dcat:unitOfMeasure <http://qudt.org/vocab/unit/M> ;
] ;

I am not very convinced about the need to mint a new property for dcat:unitOfMeasure.

sdmx-attribute:unitMeasure is widely used, W3C recommendations such as RDF Data Cube use it, and I am concerned about introducing new patterns when there is one that is more or less well accepted.

I see pros and cons in having both approaches : DQV/RDF DATA CUBE style and the DCAT properties.
If we go for defining new dcat properties, I guess that we should anyway explicitly refer to SDW best practice which reuses DQV/RDF DATA CUBE for the more general cases.
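For illustration, substituting sdmx-attribute:unitMeasure into the earlier sketch would give something like this (dcat:spatialResolution, dcat:Measure and dcat:amount are still only proposed terms):

```turtle
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix sdmx-attribute: <http://purl.org/linked-data/sdmx/2009/attribute#> .

<> a dcat:Dataset ;
    dcat:spatialResolution [
        a dcat:Measure ;
        dcat:amount 30.0 ;
        # widely-used SDMX property instead of a new dcat:unitOfMeasure
        sdmx-attribute:unitMeasure <http://qudt.org/vocab/unit/M>
    ] .
```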

@dr-shorthair
Contributor

dr-shorthair commented Feb 20, 2019

Mind you, xsd:duration is not an OWL built-in https://www.w3.org/TR/owl2-quick-reference/#Built-in_Datatypes . So I'm thinking perhaps to leave the range open, but recommend use of xsd:duration?

@dr-shorthair
Contributor

(shame there isn't an ISO standard for 'Length' complementing what ISO 8601 did for 'Time')

@dr-shorthair
Contributor

See the revised proposal for dcat:spatialResolutionM in branch https://github.com/w3c/dxwg/tree/dcat-issue84-sres-simon - simplified, with the unit of measure fixed to metres.
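Usage would then reduce to a single literal-valued property, along these lines (a sketch; at this point the property exists only in that branch):

```turtle
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .

<> a dcat:Dataset ;
    # value is a plain decimal; the unit (metres) is baked into the property
    dcat:spatialResolutionM "30.0"^^xsd:decimal .
```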

@andrea-perego
Contributor

+1 from me.

@andrea-perego
Contributor

I wonder whether we could consider adding properties for spatial resolutions not expressed as distance, namely, as equivalent scale - which is the other most common way of specifying spatial resolution.

@dr-shorthair
Contributor

dr-shorthair commented Feb 25, 2019

I'm reluctant to provide a second option at this level. As soon as you have more than one alternative, you begin to lose interoperability. I understand that '1:50,000' etc is the cartographic tradition, and geographers routinely infer resolution from this ('what is the distance on the ground of the thickness of a pencil line on the map?'). But a length measure is more direct and less ambiguous, and also applies to gridded data.

While more detail and options can be given using the DQV structures shown above, I really think we should add only one option in the DCAT namespace.

@dr-shorthair
Contributor

dr-shorthair commented Mar 3, 2019

@riccardoAlbertoni Your contributions in Chapter 8 show some patterns for use of DQV for quality information.

Are you aware of a 'standard' way to provide basic dataset statistics using DQV or any other RDF vocabulary? e.g. minimum/maximum(/average) values for specified dimensions? I'm not seeing anything obvious in DQV or QB :-( I guess it might be a dqv:Metric but I wonder if you could provide guidance on how this might look?

@agbeltran
Member

I was looking for the same thing, and the relevant bit that I found is this DQV section on statistics, which relies on an extension of VoID and is thus too oriented towards RDF datasets.

@riccardoAlbertoni
Contributor

Are you aware of a 'standard' way to provide basic dataset statistics using DQV or any other RDF vocabulary? e.g. minimum/maximum(/average) values for specified dimensions? I'm not seeing anything obvious in DQV or QB :-( I guess it might be a dqv:Metric but I wonder if you could provide guidance on how this might look?

I am not aware of anything except the examples mentioned by @agbeltran for statistics oriented towards RDF datasets; perhaps @makxdekkers knows more?

Anyway, I guess there is more than one way to do it. For example, using RDF data cube you can define your own qb:DataStructureDefinition.

If you want to describe statistics of datasets, such as Average, Max, and Min for the "fields" in the dataset, you might define a qb:DataStructureDefinition whose dimensions/components include

  • the considered dataset
  • the considered field
  • the considered operator ( i.e. Average, Max, Min.. etc)
  • the actual measures

If you provide statistics as quality indicators, you can think of using a DQV dqv:QualityMeasurement, for example defining a new dqv:Dimension for each pair of field and operator.
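A minimal sketch of such a Data Cube structure might look like this (all component names are hypothetical, invented for illustration):

```turtle
@prefix qb: <http://purl.org/linked-data/cube#> .

# Hypothetical component properties for per-field summary statistics
:dataset  a qb:DimensionProperty .
:field    a qb:DimensionProperty .
:operator a qb:DimensionProperty .   # e.g. Average, Max, Min
:statisticValue a qb:MeasureProperty .

:summaryStatisticsDSD a qb:DataStructureDefinition ;
    qb:component [ qb:dimension :dataset ] ,
                 [ qb:dimension :field ] ,
                 [ qb:dimension :operator ] ,
                 [ qb:measure   :statisticValue ] .
```

Each qb:Observation would then fix a dataset, a field and an operator, and carry the actual value in :statisticValue.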

@andrea-perego
Contributor

@dr-shorthair wrote:

I'm reluctant to provide a second option at this level. As soon as you have more than one alternative, you begin to lose interoperability. I understand that '1:50,000' etc is the cartographic tradition, and geographers routinely infer resolution from this ('what is the distance on the ground of the thickness of a pencil line on the map?'). But a length measure is more direct and less ambiguous, and also applies to gridded data.

While more detail and options can be given using the DQV structures shown above, I really think we should add only one option in the DCAT namespace.

I would also prefer to have one solution that fits all use cases, but we should also recognise that these two ways of expressing spatial resolution (i.e., distance and equivalent scale) are not comparable or convertible. So, IMO, the use of two different properties is more than acceptable.

BTW, my request is based also on an explicit requirement from GeoDCAT-AP - which is defining mappings from ISO 19115:2003, where spatial resolution is expressed either as distance or equivalent scale.

@andrea-perego
Contributor

Re-thinking about this, probably we should consider the option of specifying spatial resolution in 2 steps (which was one of the options discussed earlier):

a:Dataset a dcat:Dataset ;
  dcat:spatialResolution [
    dcat:distanceInMeters "15"^^xsd:decimal
  ] .

One of the advantages is that it would be easier for people to reuse the main pattern dcat:spatialResolution / "specific property" in case they need to express this information in other ways (e.g., as per ISO 19115-1:2014, which includes also resolution as horizontal ground distance, vertical distance and angular distance).
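For example, the other ISO 19115-1:2014 resolution types could then reuse the same outer property (the inner property names below are hypothetical, shown only to illustrate the pattern):

```turtle
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .

a:Dataset a dcat:Dataset ;
  dcat:spatialResolution [ dcat:distanceInMeters "15"^^xsd:decimal ] ,
                         [ dcat:verticalDistanceInMeters "2"^^xsd:decimal ] ,
                         [ dcat:angularDistanceInDegrees "0.05"^^xsd:decimal ] .
```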

@davebrowning
Contributor

@andrea-perego - do you see this issue as critical or can this be moved to the backlog?

@andrea-perego
Contributor

Partially critical (for the reasons I explained) but it can be moved to the backlog, provided that it will be possible to come back to this after DCAT v1.1 is out and possibly address it in the v1.2 release.

@andrea-perego
Contributor

I created a new issue to work on the discussion points still open:

#1266

Closing this one.
