frictionlessdata · roll · Nov 30, 2023 · Aug 6, 2023 · Aug 12, 2023 · Aug 12, 2023
diff --git a/patterns/README.md b/patterns/README.md
@@ -1030,3 +1030,371 @@ A field MAY have a `missingValues` property that MUST be an `array` where each e
 ### Implementations
 
 None known.
+
+## Facilitate use of value labels (Stata, SAS and SPSS), categoricals (Python) and factors (R) in software that supports them
+
+### Overview
+
+Many software packages for manipulating and analyzing tabular data have special
+features for working with categorical variables. These include:
+
+- Value labels or formats ([Stata](https://www.stata.com/manuals13/dlabel.pdf),
+  [SAS](https://documentation.sas.com/doc/en/pgmsascdc/9.4_3.5/proc/p1upn25lbfo6mkn1wncu4dyh9q91.htm)
+  and [SPSS](https://www.ibm.com/docs/en/spss-statistics/beta?topic=data-adding-value-labels))
+- [Categoricals (Pandas)](https://pandas.pydata.org/docs/user_guide/categorical.html)
+- [Factors (R)](https://www.stat.berkeley.edu/~s133/factors.html)
+- [CategoricalVectors (Julia)](https://dataframes.juliadata.org/stable/man/categorical/)
+
+These features can result in more efficient storage and faster runtime
+performance, but more importantly, facilitate analysis by indicating that a
+variable should be treated as categorical and by permitting the logical order
+of the categories to differ from their lexical order. And in the case of value
+labels, they permit the analyst to work with variables in numeric form (e.g.,
+in expressions, when fitting models) while generating output (e.g., tables,
+plots) that is labeled with informative strings.
+
+While these features are of limited use in some disciplines, others rely
+heavily on them (e.g., social sciences, epidemiology and clinical research).
+Thus, before these disciplines can begin to use Frictionless in a
+meaningful way, both the standards and the software tools need to support
+these features. This pattern addresses the necessary extensions to the
+[table schema](https://specs.frictionlessdata.io//table-schema/).
+
+### Principles
+
+Before describing the proposed extensions, here are the principles on which
+they are based:
+
+1. Extensions should be software agnostic (i.e., no additions to the official
+   schema targeted toward a specific piece of software). While the extensions
+   are intended to support the use of features not available in all software,
+   the resulting data package should continue to work as well as possible with
+   software that does not have those features.
+2. Related to (1), extensions should only include metadata that describe the
+   data themselves—not instructions for what a specific software package should
+   do with the data. Users who want to include the latter may do so within
+   a sub-namespace such as `custom` (e.g., see Issues [#103](https://github.com/frictionlessdata/specs/issues/103)
+   and [#663](https://github.com/frictionlessdata/specs/issues/663)).
+3. Extensions should be feature-complete (i.e., they should permit full
+   support of value labels, categoricals and factors by software tools).
+4. Extensions must be backward compatible (i.e., not break existing tools,
+   workflows, etc. for working with Frictionless packages).
+
+It is worth emphasizing that the scope of the proposed extensions is strictly
+limited to the information necessary to make full use of the features for
+working with categorical data provided by the software packages listed above.
+Previous discussions of this issue have occasionally included references to
+additional variable-level metadata (e.g., multiple sets of category labels
+such as both "short labels" and longer "descriptions", or links to common data
+elements, controlled vocabularies or rdfTypes). While these additional
+metadata are undoubtedly useful, we speculate that the large majority of users
+who would benefit from the extensions proposed here would not have and/or
+utilize such information, and therefore argue that these should be considered
+under a separate proposal.
+
+### Implementations
+
+We note that our proposal regarding field-specific missing values has been
+discussed frequently in numerous contexts, and is nearly identical to the pattern
+[missing values per field](https://specs.frictionlessdata.io/patterns/#missing-values-per-field)
+appearing in this document above.
+
+Our proposal to add a field-specific `ordered` property has been raised
+[here](https://github.com/frictionlessdata/specs/issues/739) and
+[here](https://github.com/frictionlessdata/specs/issues/156).
+
+Discussions regarding supporting software providing features for working with
+categorical variables appear in the following GitHub issues:
+
+- [https://github.com/frictionlessdata/specs/issues/156](https://github.com/frictionlessdata/specs/issues/156)
+- [https://github.com/frictionlessdata/specs/issues/739](https://github.com/frictionlessdata/specs/issues/739)
+
+and in the Frictionless Data forum:
+
+- [https://discuss.okfn.org/t/can-you-add-code-descriptions-to-a-data-package/](https://discuss.okfn.org/t/can-you-add-code-descriptions-to-a-data-package/)
+- [https://discuss.okfn.org/t/something-like-rs-ordered-factors-or-enums-as-column-type/](https://discuss.okfn.org/t/something-like-rs-ordered-factors-or-enums-as-column-type/)
+
+Finally, while we are unaware of any existing implementations intended for
+general use, it is likely that many users are already exploiting the fact that
+arbitrary fields may be added to the
+[table schema](https://specs.frictionlessdata.io//table-schema/)
+to support internal implementations (e.g., our group is doing so).
+
+### Proposed extensions
+
+We propose three extensions:
+
+1. Add an optional field-specific `missingValues` property. This is necessary
+   so that such values can be included in the definition of a categorical
+   (e.g., `["Yes", "No", "Don't know", "Refused"]`) or a value label, but
+   still ignored by software without such features. Note that unlike the
+   [missing values per field](https://specs.frictionlessdata.io/patterns/#missing-values-per-field)
+   pattern above, we propose that field-specific missing values be *added* to
+   the values appearing in the `missingValues` property at the resource level,
+   rather than replacing them. This is so that software can distinguish
+   between so-called *system missing values* (e.g., "Not applicable") and
+   other values that you may wish to include in certain tabulations/analyses
+   but exclude from others (e.g., "Don't know" or "Refused").
+2. Add an optional field-specific `ordered` property, which can be used when
+   constructing a categorical (or factor) to indicate that the variable is
+   ordinal.
+3. Add an optional field-specific `encoding` property for use when data are
+   stored using integer or other codes rather than using the category labels.
+   This contains an object mapping the codes appearing in the data (keys) to
+   what they mean (values), and can be used by software to construct
+   corresponding value labels or categoricals (when supported) or to translate
+   the values when reading the data.
+
+As none of the three proposed properties is part of the current
+[table schema](https://specs.frictionlessdata.io//table-schema/), the proposed
+extensions are fully backward compatible.
+
+Here is an example using extensions (1) and (2):
+
+```
+{
+  "fields": [
+    {
+      "name": "physical_health",
+      "type": "string",
+      "constraints": {
+        "enum": [
+          "Poor",
+          "Fair",
+          "Good",
+          "Very good",
+          "Excellent"
+        ]
+      },
+      "ordered": true,
+      "missingValues": ["Don't know","Refused"]
+    }
+  ],
+  "missingValues": ["Not applicable","No answer"]
+}
+```
+
+This is our preferred strategy, as it provides all of the information
+necessary to support fully the categorical functionality of the software
+packages listed above, while still yielding a useable result for software
+without such capability. As described below, value labels or categoricals can
+be created automatically based on the ordering of the values in the `enum`
+array, and the field level `missingValues` can be incorporated into the value
+labels or categoricals if desired. In those cases where it is desired to have
+more control over how the value labels are constructed, this information can
+be stored in a separate encodings file in JSON format or as part of a custom
+extension to the table schema. Since such instructions do not describe the
+data themselves (but only how a specific software package should handle them),
+and since they are often software- and/or user-specific, we argue that they
+should not be included in the official table schema.
+
+Alternatively, those who wish to store their data in encoded form (e.g., this
+is the default for data exports from [REDCap](https://projectredcap.org), a
+commonly-used platform for collecting data for clinical studies) may use
+extension (3) to do so:
+
+```
+{
+  "fields": [
+    {
+      "name": "physical_health",
+      "type": "integer",
+      "constraints": {
+        "enum": [1,2,3,4,5]
+      },
+      "ordered": true,
+      "missingValues": ["Don't know","Refused"],
+      "encoding": {
+        "1": "Poor",
+        "2": "Fair",
+        "3": "Good",
+        "4": "Very good",
+        "5": "Excellent"
+      }
+    }
+  ],
+  "missingValues": ["Not applicable","No answer"]
+}
+```
+
+Note that although the field type is `integer`, the keys in the encoding
+object must be enclosed in double quotes because this is required by the JSON
+specification.
+
+A second variant of the example above is the following:
+
+```
+{
+  "fields": [
+    {
+      "name": "physical_health",
+      "type": "integer",
+      "constraints": {
+        "enum": [1,2,3,4,5]
+      },
+      "ordered": true,
+      "missingValues": [".a",".b"],
+      "encoding": {
+        "1": "Poor",
+        "2": "Fair",
+        "3": "Good",
+        "4": "Very good",
+        "5": "Excellent",
+        ".a": "Don't know",
+        ".b": "Refused"
+      }
+    }
+  ],
+  "missingValues": ["."]
+}
+```
+
+This represents encoded data exported from software with support for value
+labels. The values `.a`, `.b`, etc. are known as *extended missing values*
+(Stata and SAS only) and provide 26 unique missing values for numeric fields
+(both integer and float) in addition to the system missing value ("`.`"); in
+SPSS these would be replaced with designated numbers (e.g., -97, -98 and -99).
+
+Note that one might argue that the encoding property should instead be
+specified as:
+
+```
+{
+  "encoding": {
+    "Poor": 1,
+    "Fair": 2,
+    "Good": 3,
+    "Very good": 4,
+    "Excellent": 5
+  }
+}
+```
+
+since that represents the encoding that has been applied to the data, and the
+mapping in the example above is what is needed to *decode* the data. However,
+there are at least three arguments in favor of the proposed specification.
+First, it is the way value labels are uniformly written (e.g., in Stata, SAS
+and SPSS). Second, it automatically imposes the necessary constraint that the
+codes are unique (since a JSON object's keys must be unique). Third, it
+simplifies working with the encoding programmatically, since it can be read as
+an associative array and then applied directly to decode the data (e.g.,
+using `DataFrame.replace()` in Pandas).
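+
+As a minimal sketch of the third point, assuming the field descriptor has been
+parsed with Python's standard `json` module, the `encoding` object can serve
+directly as a lookup table. The helper name `decode_values` is hypothetical,
+and passing unlisted codes through unchanged is an assumption rather than part
+of the proposal:
+
+```python
+import json
+
+# Field descriptor fragment based on the encoded example above.
+field = json.loads("""
+{
+  "name": "physical_health",
+  "type": "integer",
+  "encoding": {"1": "Poor", "2": "Fair", "3": "Good",
+               "4": "Very good", "5": "Excellent"}
+}
+""")
+
+
+def decode_values(raw_values, encoding):
+    """Translate raw codes to labels (hypothetical helper).
+
+    Assumption: codes with no entry in `encoding` pass through unchanged.
+    """
+    # JSON object keys are always strings, so look up the string form.
+    return [encoding.get(str(value), value) for value in raw_values]
+
+
+print(decode_values([3, 1, 5, 9], field["encoding"]))
+```
+
+Because the keys are the codes, no inversion of the mapping is needed before
+the lookup.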
+
+### Specification
+
+1. A field MAY have a `missingValues` property that MUST be an `array` where
+   each entry is a `string`. If not specified, each field shall inherit the
+   entries in the `missingValues` property at the level of the tabular data
+   resource. If present at both the field and resource levels, the field's
+   effective missing values are the *union* of the two arrays, with the
+   resource level values appearing, in their original order, *after* those
+   specified at the field level.
+
+2. A field with an `enum` constraint or an `encoding` property MAY have an
+   `ordered` property that MUST be a boolean. A value of `true` indicates that
+   the field should be treated as having an ordinal scale of measurement, with
+   the ordering given by the order of the field's `enum` array or by the
+   lexical order of the `encoding` object's keys, with the latter taking
+   precedence. Fields without an `enum` constraint or an `encoding` property
+   or for which the encoding object's keys do not include all values observed
+   in the data (excluding any values specified in either the field level or
+   resource level `missingValues` property) SHOULD NOT have an `ordered`
+   property since in that case the correct ordering of the data is ambiguous.
+   The absence of an `ordered` property MUST NOT be taken to imply
+   `ordered: false`.
+
+3. A field MAY have an `encoding` property that MUST be an object. This
+   property SHOULD be used to indicate how the values in the data (represented
+   by the object's keys) are to be labeled or translated (represented by the
+   corresponding value). The object's keys MAY include values that do not
+   appear in the data and MAY omit some values that do appear in the data. For
+   clarity and to avoid unintentional loss of information, the object's values
+   SHOULD be unique.
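+
+Rules 1 and 2 can be sketched in a few lines of Python. This is an
+illustration only: the function names `effective_missing_values` and
+`category_order` are hypothetical, field descriptors are assumed to be plain
+dictionaries, and the `enum` array is assumed to live under `constraints` as
+in the current table schema.
+
+```python
+def effective_missing_values(field, resource_missing):
+    """Rule 1: field-level entries first, then resource-level entries."""
+    merged = list(field.get("missingValues", []))
+    for value in resource_missing:
+        # Take the union, keeping the resource-level order for new entries.
+        if value not in merged:
+            merged.append(value)
+    return merged
+
+
+def category_order(field):
+    """Rule 2: return the ordered categories, or None if not ordered."""
+    if field.get("ordered") is not True:
+        # An absent `ordered` property is not read as `ordered: false`;
+        # the sketch simply reports that no ordering was declared.
+        return None
+    encoding = field.get("encoding")
+    if encoding is not None:
+        # Lexical order of the encoding keys takes precedence over `enum`.
+        return [encoding[key] for key in sorted(encoding)]
+    return list(field.get("constraints", {}).get("enum", []))
+
+
+field = {
+    "name": "physical_health",
+    "type": "string",
+    "constraints": {"enum": ["Poor", "Fair", "Good", "Very good", "Excellent"]},
+    "ordered": True,
+    "missingValues": ["Don't know", "Refused"],
+}
+print(effective_missing_values(field, ["Not applicable", "No answer"]))
+print(category_order(field))
+```
+
+Note that the sketch applies the lexical ordering of keys literally, so any
+missing-value codes listed among the `encoding` keys would need to be filtered
+out by the caller.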
+
+### Suggested implementations
+
+Note: The use cases below address only *reading data* from a Frictionless data
+package; it is assumed that implementations will also provide the ability to
+write Frictionless data packages using the schema extensions proposed above.
+We suggest two types of implementations:
+
+1. Additions to the official Python Frictionless Framework to generate
+   software-specific scripts that may be executed by a specific software
+   package to read data from a Frictionless data package and create the
+   appropriate value labels or categoricals, as described below. These
+   scripts can then be included along with the data in the package itself.
+
+2. Software-specific extension packages that may be installed to permit users
+   of that software to read data from a Frictionless data package directly,
+   automatically creating the appropriate value labels or categoricals as
+   described below.
+
+The advantage of (1) is that it doesn't require users to install a package,
+which may in some cases be difficult or impossible. The advantage of (2) is
+that it provides native support for working with Frictionless data packages,
+and may be both easier and faster once the package is installed. We are in the
+process of implementing both approaches for Stata; implementations for the
+other software listed above are straightforward.
+
+#### Software that supports value labels (Stata, SAS or SPSS)
+
+1. In cases where a field has an `enum` constraint but no `encoding` property,
+   automatically generate a value label mapping the integers 1, 2, 3, ... to
+   the `enum` values in order, use this to encode the field (thereby changing
+   its type from `string` to `integer`), and attach the value label to the
+   field. Provide an option to skip automatically dropping field level
+   `missingValues` and instead add them in order to the end of the value label,
+   encoded using extended missing values if supported.
+
+2. In cases where the data are stored in encoded form (e.g., as integers) and
+   a corresponding `encoding` property is present, and assuming that the keys
+   in the encoding object are limited to integers and extended missing values
+   (if supported), use the `encoding` object to generate a value label and
+   attach it to the field. As with (1), provide an option to skip automatically
+   dropping field level `missingValues` and instead add them in order to the
+   end of the value label, encoded using extended missing values if supported.
+
+3. Although none of Stata, SAS or SPSS currently permit designating a specific
+   variable as ordered, Stata permits attaching arbitrary metadata to
+   individual variables. Thus, in cases where the `ordered` property is
+   present, this information can be stored in Stata to inform the analyst and
+   to prevent loss of information when generating Frictionless data packages
+   from within Stata.
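+
+As a rough sketch of case (1) under the script-generation approach, the
+following Python function emits Stata `label define` and `label values`
+commands for a field with an `enum` constraint. The function name
+`stata_value_label` and the `<name>_lbl` naming scheme are hypothetical. The
+sketch assumes labels contain no double quotes, and it does not emit the
+commands needed to convert the string variable itself to integers.
+
+```python
+import string
+
+
+def stata_value_label(field, include_missing=False):
+    """Return Stata commands defining a value label for an `enum` field."""
+    name = field["name"]
+    labels = field["constraints"]["enum"]
+    # Map the integers 1, 2, 3, ... to the enum values in order.
+    pairs = [(str(i), label) for i, label in enumerate(labels, start=1)]
+    if include_missing:
+        # Optionally append field-level missing values as .a, .b, ...
+        missing = field.get("missingValues", [])
+        for letter, label in zip(string.ascii_lowercase, missing):
+            pairs.append(("." + letter, label))
+    body = " ".join(f'{code} "{label}"' for code, label in pairs)
+    label_name = f"{name}_lbl"  # hypothetical naming scheme
+    return [
+        f"label define {label_name} {body}",
+        f"label values {name} {label_name}",
+    ]
+
+
+field = {
+    "name": "physical_health",
+    "type": "string",
+    "constraints": {"enum": ["Poor", "Fair", "Good", "Very good", "Excellent"]},
+    "missingValues": ["Don't know", "Refused"],
+}
+for command in stata_value_label(field, include_missing=True):
+    print(command)
+```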
+
+#### Software that supports categoricals or factors (Pandas, R, Julia)
+
+1. In cases where a field has an `enum` constraint but no `encoding` property,
+   automatically define a categorical or factor using the `enum` values in
+   order, and convert the variable to categorical or factor type using this
+   definition. Provide an option to skip automatically dropping field level
+   `missingValues` and instead add them in order to the end of the `enum`
+   values when defining the categorical or factor.
+
+2. In cases where the data are stored in encoded form (e.g., as integers) and
+   a corresponding `encoding` property is present, translate the data using
+   the `encoding` object, define a categorical or factor using the values of
+   the `encoding` object in lexical order of the keys, and convert the
+   variable to categorical or factor type using this definition. Provide
+   an option to skip automatically dropping field level `missingValues` and
+   instead add them to the end of the `encoding` values when defining the
+   categorical or factor.
+
+3. In cases where a field has an `ordered` property, use that when defining
+   the categorical or factor.
+
+#### All software
+
+Although the extensions proposed here are intended primarily to support the
+use of value labels and categoricals in software that supports them, they also
+provide additional functionality when reading data into any software that can
+handle tabular data. Specifically:
+
+1. Field-specific `missingValues`, especially when combined with
+   `missingValues` at the tabular resource level, provide considerably more
+   flexibility in specifying missing values that can benefit reading
+   Frictionless data into any software.
+
+2. The `encoding` property may be used to support any type of encoding, even
+   in cases where value labels or categoricals are not being used. For example,
+   it is standard practice in software for analyzing genetic data to code sex
+   as 0, 1 and 2 (corresponding to "Unknown", "Male" and "Female") and
+   affection status as 0, 1 and 2 (corresponding to "Unknown", "Unaffected"
+   and "Affected"). In such cases, the `encoding` property may be used to
+   confirm that the data follow the standard convention or to indicate that
+   they deviate from it; it may also be used to translate those codes into
+   human-readable values, if desired.