Add pattern supporting use of value labels, categoricals and factors #844

Merged
merged 17 commits into from
Nov 30, 2023
368 changes: 368 additions & 0 deletions patterns/README.md
@@ -1030,3 +1030,371 @@ A field MAY have a `missingValues` property that MUST be an `array` where each e
### Implementations

None known.

## Facilitate use of value labels (Stata, SAS and SPSS), categoricals (Python) and factors (R) in software that supports them

### Overview

Many software packages for manipulating and analyzing tabular data have special
features for working with categorical variables. These include:

- Value labels or formats ([Stata](https://www.stata.com/manuals13/dlabel.pdf),
[SAS](https://documentation.sas.com/doc/en/pgmsascdc/9.4_3.5/proc/p1upn25lbfo6mkn1wncu4dyh9q91.htm)
and [SPSS](https://www.ibm.com/docs/en/spss-statistics/beta?topic=data-adding-value-labels))
- [Categoricals (Pandas)](https://pandas.pydata.org/docs/user_guide/categorical.html)
- [Factors (R)](https://www.stat.berkeley.edu/~s133/factors.html)
- [CategoricalVectors (Julia)](https://dataframes.juliadata.org/stable/man/categorical/)

These features can result in more efficient storage and faster runtime
performance, but more importantly, facilitate analysis by indicating that a
variable should be treated as categorical and by permitting the logical order
of the categories to differ from their lexical order. And in the case of value
labels, they permit the analyst to work with variables in numeric form (e.g.,
in expressions, when fitting models) while generating output (e.g., tables,
plots) that is labeled with informative strings.

While these features are of limited use in some disciplines, others rely
heavily on them (e.g., social sciences, epidemiology, clinical research,
etc.). Thus, before these disciplines can begin to use Frictionless in a
meaningful way, both the standards and the software tools need to support
these features. This pattern addresses the necessary extensions to the
[table schema](https://specs.frictionlessdata.io//table-schema/).

### Principles

Before describing the proposed extensions, here are the principles on which
they are based:

1. Extensions should be software agnostic (i.e., no additions to the official
schema targeted toward a specific piece of software). While the extensions
are intended to support the use of features not available in all software,
the resulting data package should continue to work as well as possible with
software that does not have those features.
2. Related to (1), extensions should only include metadata that describe the
data themselves—not instructions for what a specific software package should
do with the data. Users who want to include the latter may do so within
a sub-namespace such as `custom` (e.g., see Issues [#103](https://github.com/frictionlessdata/specs/issues/103)
and [#663](https://github.com/frictionlessdata/specs/issues/663)).
3. Extensions should be feature-complete (i.e., they should permit full
support of value labels, categoricals and factors by software tools).
4. Extensions must be backward compatible (i.e., not break existing tools,
workflows, etc. for working with Frictionless packages).

It is worth emphasizing that the scope of the proposed extensions is strictly
limited to the information necessary to make full use of the features for
working with categorical data provided by the software packages listed above.
Previous discussions of this issue have occasionally included references to
additional variable-level metadata (e.g., multiple sets of category labels
such as both "short labels" and longer "descriptions", or links to common data
elements, controlled vocabularies or rdfTypes). While these additional
metadata are undoubtedly useful, we speculate that the large majority of users
who would benefit from the extensions proposed here would not have and/or
utilize such information, and therefore argue that these should be considered
under a separate proposal.

### Implementations

We note that our proposal regarding field-specific missing values has been
discussed frequently in numerous contexts, and is nearly identical to the pattern
[missing values per field](https://specs.frictionlessdata.io/patterns/#missing-values-per-field)
appearing in this document above.

Our proposal to add a field-specific `ordered` property has been raised
[here](https://github.com/frictionlessdata/specs/issues/739) and
[here](https://github.com/frictionlessdata/specs/issues/156).

Discussions regarding supporting software providing features for working with
categorical variables appear in the following GitHub issues:

- [https://github.com/frictionlessdata/specs/issues/156](https://github.com/frictionlessdata/specs/issues/156)
- [https://github.com/frictionlessdata/specs/issues/739](https://github.com/frictionlessdata/specs/issues/739)

and in the Frictionless Data forum:

- [https://discuss.okfn.org/t/can-you-add-code-descriptions-to-a-data-package/](https://discuss.okfn.org/t/can-you-add-code-descriptions-to-a-data-package/)
- [https://discuss.okfn.org/t/something-like-rs-ordered-factors-or-enums-as-column-type/](https://discuss.okfn.org/t/something-like-rs-ordered-factors-or-enums-as-column-type/)

Finally, while we are unaware of any existing implementations intended for
general use, it is likely that many users are already exploiting the fact that
arbitrary fields may be added to the
[table schema](https://specs.frictionlessdata.io//table-schema/)
to support internal implementations (e.g., our group is doing so).

### Proposed extensions

We propose three extensions:

1. Add an optional field-specific `missingValues` property. This is necessary
so that such values can be included in the definition of a categorical
(e.g., `["Yes", "No", "Don't know", "Refused"]`) or a value label, but
still ignored by software without such features. Note that unlike the
[missing values per field](https://specs.frictionlessdata.io/patterns/#missing-values-per-field)
pattern above, we propose that field-specific missing values be *added* to
the values appearing in the `missingValues` property at the resource level,
rather than replacing them. This is so that software can distinguish
between so-called *system missing values* (e.g., "Not applicable") and
other values that you may wish to include in certain tabulations/analyses
but exclude from others (e.g., "Don't know" or "Refused").
2. Add an optional field-specific `ordered` property, which can be used when
constructing a categorical (or factor) to indicate that the variable is
ordinal.
3. Add an optional field-specific `encoding` property for use when data are
stored using integer or other codes rather than using the category labels.
This contains an object mapping the codes appearing in the data (keys) to
what they mean (values), and can be used by software to construct
corresponding value labels or categoricals (when supported) or to translate
the values when reading the data.

As none of the three proposed properties is part of the current
[table schema](https://specs.frictionlessdata.io//table-schema/), the proposed
extensions are fully backward compatible.

Here is an example using extensions (1) and (2):

```
{
  "fields": [
    {
      "name": "physical_health",
      "type": "string",
      "constraints": {
        "enum": [
          "Poor",
          "Fair",
          "Good",
          "Very good",
          "Excellent"
        ]
      },
      "ordered": true,
      "missingValues": ["Don't know","Refused"]
    }
  ],
  "missingValues": ["Not applicable","No answer"]
}
```

This is our preferred strategy, as it provides all of the information
necessary to support fully the categorical functionality of the software
packages listed above, while still yielding a useable result for software
without such capability. As described below, value labels or categoricals can
be created automatically based on the ordering of the values in the `enum`
array, and the field level `missingValues` can be incorporated into the value
labels or categoricals if desired. In those cases where it is desired to have
more control over how the value labels are constructed, this information can
be stored in a separate encodings file in JSON format or as part of a custom
extension to the table schema. Since such instructions do not describe the
data themselves (but only how a specific software package should handle them),
and since they are often software- and/or user-specific, we argue that they
should not be included in the official table schema.

Alternatively, those who wish to store their data in encoded form (e.g., this
is the default for data exports from [REDCap](https://projectredcap.org), a
commonly-used platform for collecting data for clinical studies) may use
extension (3) to do so:

```
{
  "fields": [
    {
      "name": "physical_health",
      "type": "integer",
      "constraints": {
        "enum": [1,2,3,4,5]
      },
      "ordered": true,
      "missingValues": ["Don't know","Refused"],
      "encoding": {
        "1": "Poor",
        "2": "Fair",
        "3": "Good",
        "4": "Very good",
        "5": "Excellent"
      }
    }
  ],
  "missingValues": ["Not applicable","No answer"]
}
```

Note that although the field type is `integer`, the keys in the encoding
object must be enclosed in double quotes because this is required by the JSON
specification.

A second variant of the example above is the following:

```
{
  "fields": [
    {
      "name": "physical_health",
      "type": "integer",
      "constraints": {
        "enum": [1,2,3,4,5]
      },
      "ordered": true,
      "missingValues": [".a",".b"],
      "encoding": {
        "1": "Poor",
        "2": "Fair",
        "3": "Good",
        "4": "Very good",
        "5": "Excellent",
        ".a": "Don't know",
        ".b": "Refused"
      }
    }
  ],
  "missingValues": ["."]
}
```

This represents encoded data exported from software with support for value
labels. The values `.a`, `.b`, etc. are known as *extended missing values*
(Stata and SAS only) and provide 26 unique missing values for numeric fields
(both integer and float) in addition to the system missing value ("`.`"); in
SPSS these would be replaced with designated numbers (e.g., -97, -98 and -99).

Note that one might argue that the encoding property should instead be
specified as:

```
{
  "encoding": {
    "Poor": 1,
    "Fair": 2,
    "Good": 3,
    "Very good": 4,
    "Excellent": 5
  }
}
```

since that represents the encoding that has been applied to the data, and the
table in the example is what is now necessary to *decode* the data. However,
there are at least three arguments in favor of the proposed specification.
First, it is the way value labels are uniformly written (e.g., in Stata, SAS
and SPSS). Second, it automatically imposes the necessary constraint that the
codes are unique (since a JSON object's keys must be unique). Third, it
simplifies working with the encoding programmatically, since it can be read as
an associative array and then applied directly to decode the data (e.g.,
using `DataFrame.replace()` in Pandas).
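
As a rough illustration of the third argument, the sketch below reads an `encoding` object with Python's standard `json` module and applies it as a plain dictionary. The field descriptor mirrors the second encoded example; the raw cell values and the `decode` helper are hypothetical and not part of the proposal. Because cells read from a CSV file arrive as strings, the quoted JSON keys can be matched against them without any conversion.

```python
import json

# Field descriptor mirroring the second encoded example above.
field = json.loads("""
{
  "name": "physical_health",
  "type": "integer",
  "missingValues": [".a", ".b"],
  "encoding": {
    "1": "Poor", "2": "Fair", "3": "Good", "4": "Very good",
    "5": "Excellent", ".a": "Don't know", ".b": "Refused"
  }
}
""")


def decode(raw_values, encoding):
    """Replace each raw cell with its label; leave unmapped cells unchanged."""
    return [encoding.get(value, value) for value in raw_values]


# Hypothetical cells as they would be read from a CSV file (always strings).
raw_column = ["3", "5", ".a", "1", "7"]
decoded = decode(raw_column, field["encoding"])
print(decoded)
```

The code `"7"` has no entry in the encoding and is passed through untouched here; an implementation might instead choose to report it.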

### Specification

1. A field MAY have a `missingValues` property that MUST be an `array` where
each entry is a `string`. If not specified, each field shall inherit the
entries in the `missingValues` property at the level of the tabular data
resource. If present at both the field and resource levels, the field's
effective missing values are the *union* of the two arrays,
with the values specified at the resource level appearing in the same order
*after* those specified at the field level.

2. A field with an `enum` constraint or an `encoding` property MAY have an
`ordered` property that MUST be a boolean. A value of `true` indicates that
the field should be treated as having an ordinal scale of measurement, with
the ordering given by the order of the field's `enum` array or by the
lexical order of the `encoding` object's keys, with the latter taking
precedence. Fields without an `enum` constraint or an `encoding` property
or for which the encoding object's keys do not include all values observed
in the data (excluding any values specified in either the field level or
resource level `missingValues` property) SHOULD NOT have an `ordered`
property since in that case the correct ordering of the data is ambiguous.
The absence of an `ordered` property MUST NOT be taken to imply
`ordered: false`.

3. A field MAY have an `encoding` property that MUST be an object. This
property SHOULD be used to indicate how the values in the data (represented
by the object's keys) are to be labeled or translated (represented by the
corresponding value). The object's keys MAY include values that do not
appear in the data and MAY omit some values that do appear in the data. For
clarity and to avoid unintentional loss of information, the object's values
SHOULD be unique.
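
One possible reading of the three rules above is sketched below using only the Python standard library. The function names are hypothetical, the `enum` array is assumed to sit under `constraints` as in Table Schema, and the fallback of `[""]` for an absent resource-level `missingValues` is Table Schema's documented default rather than something this pattern defines.

```python
from collections import Counter


def effective_missing_values(field, resource):
    """Item 1: field-level values first, then resource-level ones, no repeats."""
    resource_level = resource.get("missingValues", [""])
    if "missingValues" not in field:
        return list(resource_level)
    combined = list(field["missingValues"]) + list(resource_level)
    return list(dict.fromkeys(combined))  # de-duplicate, keeping first-seen order


def category_order(field):
    """Item 2: lexical order of the encoding keys, else the order of `enum`."""
    if "encoding" in field:
        # Plain string sorting; note that it places "10" before "2".
        return sorted(field["encoding"])
    enum = field.get("constraints", {}).get("enum")
    return list(enum) if enum is not None else None


def check_encoding(field, observed_values, missing_values):
    """Item 3: report repeated labels and observed codes that have no label."""
    encoding = field.get("encoding", {})
    counts = Counter(encoding.values())
    unlabeled = set(observed_values) - set(encoding) - set(missing_values)
    return {
        "duplicate_labels": sorted(label for label, n in counts.items() if n > 1),
        "unlabeled_values": sorted(unlabeled),
    }


resource = {"missingValues": ["Not applicable", "No answer"]}
field = {
    "name": "physical_health",
    "type": "integer",
    "ordered": True,
    "missingValues": ["Don't know", "Refused"],
    "encoding": {"1": "Poor", "2": "Fair", "3": "Good",
                 "4": "Very good", "5": "Excellent"},
}

missing = effective_missing_values(field, resource)
order = category_order(field)
# The observed values are made up for illustration.
report = check_encoding(field, ["1", "3", "7", "Refused"], missing)
```

A non-empty `unlabeled_values` list corresponds to the situation in item 2 where the encoding does not cover every observed value, so the `ordered` property would be ambiguous.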

### Suggested implementations

Note: The use cases below address only *reading data* from a Frictionless data
package; it is assumed that implementations will also provide the ability to
write Frictionless data packages using the schema extensions proposed above.
We suggest two types of implementations:

1. Additions to the official Python Frictionless Framework to generate
software-specific scripts that may be executed by a specific software
package to read data from a Frictionless data package and create the
appropriate value labels or categoricals, as described below. These
scripts can then be included along with the data in the package itself.

2. Software-specific extension packages that may be installed to permit users
of that software to read data from a Frictionless data package directly,
automatically creating the appropriate value labels or categoricals as
described below.

The advantage of (1) is that it doesn't require users to install a package,
which may in some cases be difficult or impossible. The advantage of (2) is
that it provides native support for working with Frictionless data packages,
and may be both easier and faster once the package is installed. We are in the
process of implementing both approaches for Stata; implementations for the
other software listed above are straightforward.

#### Software that supports value labels (Stata, SAS or SPSS)

1. In cases where a field has an `enum` constraint but no `encoding` property,
automatically generate a value label mapping the integers 1, 2, 3, ... to
the `enum` values in order, use this to encode the field (thereby changing
its type from `string` to `integer`), and attach the value label to the
field. Provide option to skip automatically dropping field level
`missingValues` and instead add them in order to the end of the value label,
encoded using extended missing values if supported.

2. In cases where the data are stored in encoded form (e.g., as integers) and
a corresponding `encoding` property is present, and assuming that the keys
in the encoding object are limited to integers and extended missing values
(if supported), use the `encoding` object to generate a value label and
attach it to the field. As with (1), provide option to skip automatically
dropping field level `missingValues` and instead add them in order to the
end of the value label, encoded using extended missing values if supported.

3. Although none of Stata, SAS or SPSS currently permit designating a specific
variable as ordered, Stata permits attaching arbitrary metadata to
individual variables. Thus, in cases where the `ordered` property is
present, this information can be stored in Stata to inform the analyst and
to prevent loss of information when generating Frictionless data packages
from within Stata.
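
As a minimal sketch of case (1), the hypothetical helper below turns an `enum` constraint into a Stata `label define` command, optionally appending the field-level `missingValues` as extended missing values (`.a`, `.b`, ...). It assumes the `enum` array sits under `constraints`, assumes no label contains a double quote, and leaves out the steps of encoding the string variable and attaching the label.

```python
import string


def stata_label_define(field, keep_missing=False):
    """Build a Stata `label define` command from a field's enum constraint."""
    # Map 1, 2, 3, ... to the enum values in their listed order.
    pairs = [(str(i), label)
             for i, label in enumerate(field["constraints"]["enum"], start=1)]
    if keep_missing:
        # Append field-level missing values as .a, .b, ... (at most 26).
        letters = string.ascii_lowercase
        for letter, label in zip(letters, field.get("missingValues", [])):
            pairs.append(("." + letter, label))
    body = " ".join(f'{code} "{label}"' for code, label in pairs)
    return f"label define {field['name']} {body}"


field = {
    "name": "physical_health",
    "type": "string",
    "constraints": {"enum": ["Poor", "Fair", "Good", "Very good", "Excellent"]},
    "ordered": True,
    "missingValues": ["Don't know", "Refused"],
}

command = stata_label_define(field, keep_missing=True)
print(command)
```

Generating the command as text matches the first implementation type suggested above, in which a script is shipped alongside the data rather than requiring an installed extension.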

#### Software that supports categoricals or factors (Pandas, R, Julia)

1. In cases where a field has an `enum` constraint but no `encoding` property,
automatically define a categorical or factor using the `enum` values in
order, and convert the variable to categorical or factor type using this
definition. Provide option to skip automatically dropping field level
`missingValues` and instead add them in order to the end of the `enum`
values when defining the categorical or factor.

2. In cases where the data are stored in encoded form (e.g., as integers) and
a corresponding `encoding` property is present, translate the data using
the `encoding` object, define a categorical or factor using the values of
the `encoding` object in lexical order of the keys, and convert the
variable to categorical or factor type using this definition. Provide
option to skip automatically dropping field level `missingValues` and
instead add them to the end of the `encoding` values when defining the
categorical or factor.

3. In cases where a field has an `ordered` property, use that when defining
the categorical or factor.
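
The library-free sketch below illustrates case (1) together with the `ordered` property. The `build_categorical` helper and its return structure are hypothetical; the resulting category list and flag are the kind of information one would hand to, for example, `pandas.Categorical(values, categories=..., ordered=...)`. The `enum` array is assumed to sit under `constraints`, and any value outside the categories is treated as missing.

```python
def build_categorical(raw_values, field, keep_field_missing=False):
    """Derive categories, integer codes and the ordered flag for one field."""
    categories = list(field["constraints"]["enum"])
    if keep_field_missing:
        # Optionally keep field-level missing values as trailing categories.
        categories += list(field.get("missingValues", []))
    position = {category: i for i, category in enumerate(categories)}
    # Values that are not categories (e.g. missing values) become None.
    codes = [position.get(value) for value in raw_values]
    # A missing `ordered` property stays None instead of defaulting to False.
    return {"categories": categories, "codes": codes,
            "ordered": field.get("ordered")}


field = {
    "name": "physical_health",
    "type": "string",
    "constraints": {"enum": ["Poor", "Fair", "Good", "Very good", "Excellent"]},
    "ordered": True,
    "missingValues": ["Don't know", "Refused"],
}
# Hypothetical raw cells, including one resource-level missing value.
raw = ["Good", "Refused", "Not applicable", "Poor"]

dropped = build_categorical(raw, field)
kept = build_categorical(raw, field, keep_field_missing=True)
```

With the default, "Refused" is dropped to missing; with `keep_field_missing=True` it becomes the last category, while the resource-level "Not applicable" is missing in both cases.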

#### All software

Although the extensions proposed here are intended primarily to support the
use of value labels and categoricals in software that supports them, they also
provide additional functionality when reading data into any software that can
handle tabular data. Specifically:

1. Field-specific `missingValues`, especially when combined with
`missingValues` at the tabular resource level, provide considerably more
flexibility in specifying missing values that can benefit reading
Frictionless data into any software.

2. The `encoding` property may be used to support any type of encoding, even
in cases where value labels or categoricals are not being used. For example,
it is standard practice in software for analyzing genetic data to code sex
as 0, 1 and 2 (corresponding to "Unknown", "Male" and "Female") and
affection status as 0, 1 and 2 (corresponding to "Unknown", "Unaffected"
and "Affected"). In such cases, the `encoding` property may be used to
confirm that the data follow the standard convention or to indicate that
they deviate from it; it may also be used to translate those codes into
human-readable values, if desired.
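
For the second point, a small sketch of how a declared `encoding` might be compared against a conventional one is given below. The convention dictionary restates the sex coding described above; the `deviations` helper and the swapped example encoding are hypothetical.

```python
# Conventional coding of sex as described in the text above.
SEX_CONVENTION = {"0": "Unknown", "1": "Male", "2": "Female"}


def deviations(encoding, convention):
    """Return {code: (declared, conventional)} for codes whose labels differ."""
    codes = sorted(set(encoding) | set(convention))
    return {
        code: (encoding.get(code), convention.get(code))
        for code in codes
        if encoding.get(code) != convention.get(code)
    }


# A hypothetical data package that swaps the two codes.
declared = {"0": "Unknown", "1": "Female", "2": "Male"}

conforming = deviations(SEX_CONVENTION, SEX_CONVENTION)
swapped = deviations(declared, SEX_CONVENTION)
```

An empty result confirms that the data follow the convention; a non-empty one lists each code whose declared label differs.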