Skip to content

Commit

Permalink
[jts,#353,#359,!][s]: missingValues work - fixes #353, fixes #359 (ag…
Browse files Browse the repository at this point in the history
…ain).

* missingValue -> missingValues (plural)
* require it to be an array
* move it to table schema level from field level - fixes #359
* no change to defaults - still just "" (as per #359 discussion)

Also do a fair bit of tidying up / refactoring on layout of spec e.g.

* get rid of specification heading and just have descriptor
* optional "Other Properties" section
* ...
  • Loading branch information
rufuspollock committed May 21, 2017
1 parent 44e3ba1 commit 778439f
Showing 1 changed file with 57 additions and 83 deletions.
140 changes: 57 additions & 83 deletions content/table-schema/contents.lr
Original file line number Diff line number Diff line change
Expand Up @@ -48,21 +48,15 @@ In JSON a table would be:
...
]

# Specification
# Descriptor

A Table Schema consists of:
A Table Schema consists of a descriptor. The descriptor MUST be a JSON `object` (JSON is defined in [RFC 4627][]).

* a required list of **field descriptors**
* optionally, a **primary key** description
* optionally, a **foreign key** description
It MUST contain a property `fields`. `fields` `MUST` be an array where each entry in the array is a field descriptor (as defined below). The descriptor `MAY` have the additional properties set out below and `MAY` contain any number of other properties (not defined in this spec).

A schema is described using JSON. This might exist as a standalone document
or may be embedded within another JSON structure, e.g. as part of a
data package description.

## Schema
[RFC 4627]: http://www.ietf.org/rfc/rfc4627.txt

A schema has the following structure:
The following is an illustration of this structure:

```javascript
{
Expand All @@ -80,43 +74,30 @@ A schema has the following structure:
},
... more field descriptors
],
// (optional) specification of missing values
"missingValues": [ ... ],
// (optional) specification of the primary key
"primaryKey": ...
// (optional) specification of the foreign keys
"foreignKeys": ...

}
```

That is, a Table Schema MUST be a JSON `object` (JSON is defined in [RFC 4627][]). The `object` MUST have the following structure:

[RFC 4627]: http://www.ietf.org/rfc/rfc4627.txt

- It MUST contain a property `fields`. `fields` MUST be an array where each entry in the array is a field
descriptor. (Structure and usage described below)
- It `MAY` contain a property `primaryKey` (structure and usage
specified below)
- It `MAY` contain a property `foreignKeys` (structure and usage
specified below)
- It `MAY` contain any number of other properties (not defined in this
spec)

## Field Descriptors
# Field Descriptors

A field descriptor is a simple JSON `object` that describes a single field. The
A field descriptor is a JSON `object` that describes a single field. The
descriptor provides additional human-readable documentation for a field, as
well as additional information that may be used to validate the field or create
a user interface for data entry.

At a minimum a field descriptor will contain at least a `name` key, but MAY
have additional keys as described below:
At a minimum a field descriptor will contain at least a `name` property, but MAY
have additional properties as described below:

{
"name": "name of field (e.g. column name)",
"title": "A nicer human readable label or title for the field",
"type": "A string specifying the type",
"format": "A string specifying a format",
"missingValue": "",
"description": "A description for the field",
"constraints": {
# a constraints-descriptor
Expand Down Expand Up @@ -148,7 +129,7 @@ have additional keys as described below:
- `constraints`: A constraints descriptor that can be used by consumers
to validate field values

### Field Constraints
## Constraints

A set of constraints can be associated with a field. These constraints can be used
to validate data against a Table Schema. The constraints might be used by consumers
Expand Down Expand Up @@ -184,7 +165,7 @@ all the constraints when determining if a field value is valid.
A data file, e.g. an entry in a data package, is considered to be valid if all of its fields are valid
according to their declared `type` and `constraints`.

### Field Types and Formats
## Types and Formats

A field's `type` property is a string indicating the type of this field.

Expand All @@ -205,7 +186,7 @@ types](http://www.elasticsearch.org/guide/reference/mapping/)).
The type list with associated formats and other related properties is as
follows.

#### string
### string

The field contains strings, that is, sequences of characters.

Expand All @@ -217,7 +198,7 @@ The field contains strings, that is, sequences of characters.
* **binary**: A base64 encoded string representing binary data.
* **uuid**: A string that is a uuid.

#### number
### number

The field contains numbers of any kind including decimals.

Expand Down Expand Up @@ -256,15 +237,15 @@ This lexical formatting may be modified using these additional properties:

[xsd-decimal]: https://www.w3.org/TR/xmlschema-2/#decimal

#### integer
### integer

The field contains integers - that is whole numbers.

Integer values are indicated in the standard way for any valid integer.

`format`: no options (other than the default).

#### boolean
### boolean

The field contains boolean (true/false) data.

Expand All @@ -276,19 +257,19 @@ so, for example, "Y" and "y" are both acceptable):

`format`: no options (other than the default).

#### object
### object

The field contains data which is valid JSON.

`format`: no options (other than the default).

#### array
### array

The field contains data that is a valid JSON format arrays.

`format`: no options (other than the default).

#### date
### date

**datetime**; **date**; **time**

Expand All @@ -310,15 +291,15 @@ Tthe field contains temporal values such as dates, times and date-times.
by Python / C standard `strptime` using `PATTERN`). Example: `%d %b %y`
would correspond to dates like: `30 Nov 14`

#### gyear
### gyear

A calendar year as per [XMLSchema `gYear`][xsd-gyear].

Usual lexical reprentation is `YYYY`. There are no format options.

[xsd-gyear]: https://www.w3.org/TR/xmlschema-2/#gYear

#### gyearmonth
### gyearmonth

A specific month in a specific year as per [XMLSchema
`gYearMonth`][xsd-gyearmonth].
Expand All @@ -327,7 +308,7 @@ Usual lexical representation is: `YYYY-MM`. There are no format options.

[xsd-gyearmonth]: https://www.w3.org/TR/xmlschema-2/#gYearMonth

#### duration
### duration

A duration of time.

Expand All @@ -343,7 +324,7 @@ and that definition is implicitly inlined here.

`format`: no options (other than the default).

#### geopoint
### geopoint

The field contains data describing a geographic point.

Expand All @@ -356,7 +337,7 @@ The field contains data describing a geographic point.
item is `lat`.
* **object**: A JSON object with exactly two keys, `lat` and `lon`

#### geojson
### geojson

The field contains a JSON object according to GeoJSON or TopoJSON spec.

Expand All @@ -365,51 +346,15 @@ The field contains a JSON object according to GeoJSON or TopoJSON spec.
* **default**: A geojson object as per the [GeoJSON spec](http://geojson.org/).
* **topojson**: A topojson object as per the [TopoJSON spec](https://github.com/topojson/topojson-specification/blob/master/README.md)

#### any
### any

Any `type` or `format` is accepted.

[strptime]: https://docs.python.org/2/library/datetime.html#strftime-strptime-behavior
[iso8601-duration]: https://en.wikipedia.org/wiki/ISO_8601#Durations
[xsd-duration]: http://www.w3.org/TR/xmlschema-2/#duration

### Missing Values

By "missing" we simply mean null or "not present for whatever reason". Many
datasets arrive with missing data values, either because a value was not
collected or it never existed.

Missing values may be indicated simply by the value being empty in other cases
a special value may have been used e.g. `-`, `NaN`, `0`, `-9999` etc.

The `missingValue` property provides a way to indicate that these values should
be interpreted as equivalent to null.

Strings that indicate missing (null) data values for a field `MAY` be provided
for a field using the the `missingValue` property.

If present, the `missingValue` MUST be a single string or an array of strings,
for example:

"missingValue": ""
"missingValue": "-"
"missingValue": ["Nan", "-"]

**Note**: `missingValue` are strings rather than being the data type of
the particular field. This allows for comparison prior to casting and for
fields to have missing value which are not of their type, for example a
`number` field to have missing values indicated by `-`.

The default value of `missingValue` for a non-string type field is the empty
string `""`. For string type fields there is no default for `missingValue` (for
string fields the empty string `""` is a valid value and need not indicate
null).

**Processing**: if a missing value is encountered it SHOULD be converted into
the NULL or equivalent value.


### Rich Field Types
## Rich Types

A richer, "semantic", description of the "type" of data in a given column MAY
be provided using a `rdfType` property on a field descriptor.
Expand All @@ -418,12 +363,15 @@ The value of of the `rdfType` property MUST be the URI of a RDF Class, that is a

Here is an example using the Schema.org RDF Class `http://schema.org/Country`:

```
| Country | Year Date | Value |
| ------- | --------- | ----- |
| US | 2010 | ... |
```

The corresponding Table Schema is:

# Table Schema
```javascript
{
fields: [
{
Expand All @@ -434,10 +382,36 @@ Here is an example using the Schema.org RDF Class `http://schema.org/Country`:
...
}
}
```

[rdfs-class]: https://www.w3.org/TR/rdf-schema/#ch_class


# Other Properties

## Missing Values

Many datasets arrive with missing data values, either because a value was not collected or it never existed. Missing values may be indicated simply by the value being empty in other cases a special value may have been used e.g. `-`, `NaN`, `0`, `-9999` etc.

The `missingValues` property is optional. It `MAY` be provided and when it exists it indicates Values that when encountered in the data, should be considered as `null`, 'not present', or 'blank' values.

`missingValues` MUST be an `array` where each entry is a `string`.

**Why strings**: `missingValues` are strings rather than being the data type of the particular field. This allows for comparison prior to casting and for fields to have missing value which are not of their type, for example a `number` field to have missing values indicated by `-`.

**Defaults:** The default value of `missingValue` for a non-string type field is the empty string `""`. For string type fields there is no default for `missingValue` (for string fields the empty string `""` is a valid value and need not indicate null).

Examples:

```javascript
"missingValues": [""]
"missingValues": ["-"]
"missingValue": ["Nan", "-"]
```

**Processing**: if a missing value is encountered it SHOULD be converted into
the NULL or equivalent value.

## Primary Key

A primary key is a field or set of fields that uniquely identifies each row in
Expand Down

0 comments on commit 778439f

Please sign in to comment.