
Supporting optionally ordered factors/enums #156

Closed
houshuang opened this issue Dec 17, 2014 · 23 comments

@houshuang

In both R and Julia, there is a concept of factors or enums as an alternative to character strings. The idea is to specify a fixed set of possible values, and possibly an ordering of these. An example could be a questionnaire, which might have a field with Male/Female, Student/Working/Unemployed/Retired, or a Likert-style question with Not at all/Little/Neutral/Quite a bit/Very much.

The two main advantages are validation (you can ensure that no values occur other than the ones enumerated, plus NA) and storage efficiency (internally these are often implemented by storing ints plus a lookup table; compare storing the 65,000 Likert-style responses above as character strings with storing 65,000 ints). The ability to order the levels (for example Not at all > Little > Neutral > Quite a bit > Very much) means that the levels will always be presented in the same order when graphing or showing summary statistics, rather than alphabetically or by order of first appearance.

I typically spend quite a bit of time cleaning up incoming questionnaire data by converting them to ordered factors, and would like to be able to preserve this information when sharing the datasets using dataprotocols.

In terms of implementation, there are two options. The first would be to keep the actual data exactly as it is, but note in the metadata that it is a factor with certain levels, and an optional ordering. This would presumably make the file "backwards-compatible", and still enable anyone to simply import the CSV without bothering with the metadata. We could still use the metadata to validate the CSV.

An alternative, which would in many cases drastically reduce the size of the CSV file, is to store ints in the CSV file and provide a lookup table in the metadata (1=Not at all, 2=Somewhat, etc.). It's possible that the difference between the two approaches becomes very small once the files are compressed, which would make this less backwards-compatible approach less attractive.
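For illustration only, the first (metadata-only) option might look something like the sketch below in a field descriptor; the levels and ordered property names are hypothetical, not part of any existing spec:

"schema": {
  "fields": [
    {
      "name": "satisfaction",
      "type": "string",
      "levels": ["Not at all", "Little", "Neutral", "Quite a bit", "Very much"],
      "ordered": true
    }
  ]
}

The second option would keep the same metadata but store integer codes (1-5) in the CSV instead of the strings.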

Discussion and support for this suggestion here: http://discuss.okfn.org/t/something-like-rs-ordered-factors-or-enums-as-column-type/122/4

@djvanderlaan

The solution we have implemented is to add a categories field to the field description. This can be an array with the name, title and description of each of the categories (following the same fields as in the datapackage and resource descriptions), as in the example below:

"schema" : {
  "fields" : [
    "name" : "age",
    "title" : "Age",
    "type" : "string",
    "categories" : [
      {
        "name" : "0_50",
        "title" : "0 to 50 years",
        "description" : "..."
      }, {
        "name" : "50_",
        "title" : "50 years and older",
        "description" : "..."
      }
    ]
  ]
}

The name field should be of the same type as the type of the field: in this case the field is of type string and the names of the categories are also strings. This implementation has the advantage that it is backwards compatible with previous implementations. It also allows for the possibility mentioned by @houshuang of writing integers to the file and storing the labels in the datapackage.
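A rough sketch of that integer-coded variant under this proposal (values are illustrative): the field is typed integer, so the category names are integers while the titles carry the labels.

"schema" : {
  "fields" : [
    {
      "name" : "satisfaction",
      "type" : "integer",
      "categories" : [
        { "name" : 1, "title" : "Not at all" },
        { "name" : 2, "title" : "Somewhat" },
        { "name" : 3, "title" : "Very much" }
      ]
    }
  ]
}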

Another option requested by our users was the ability to store the category descriptions in a separate CSV file. In some cases the number of categories can be very large; for regions, for example, the number is in the thousands. The data file then only stores the unique region codes, while the CSV file with the category descriptions stores the (non-unique) titles, which is what we need when presenting the data to end users. We solved this by also allowing the categories field to be a URL to a CSV file. However, it would probably be better if this was a URL to another resource (perhaps in the same datapackage).

@houshuang
Author

This seems like it would work well. The only thing missing would be a way to indicate whether the categories are ordered or not?


@rufuspollock
Contributor

@djvanderlaan this seems a great proposal - in a sense we almost reuse the structure of the fields themselves.

One thought is whether categories is the right name for the attribute. Alternatives could be:

  • categories - quite descriptive, but has other real-world usages (and could be confusing in that it could be read as designating categories/tags for the field itself)
  • enums (bit geeky)
  • factors (R) - not super-meaningful but maybe that is good
  • ...

From my perspective this seems like a highly useful addition to the spec, and one that is backwards compatible, so the moment we have something solid I think we can get this in.

PS: @houshuang @djvanderlaan would love to have your input in this new discussion thread about how people are using Data Packages

@djvanderlaan

Sorry for the delay in the reply. I've been a bit busy.

@rgrp It is not completely clear to me why categories would be confusing. What other 'real world uses' are you referring to that might cause confusion?

But to add to the options:

  • levels - used in R; factor is the type; the categories are called levels.
  • labels - SPSS uses 'Value Labels'; I personally feel that 'label' refers more to the name/title of a category.

What also remains open is how to add support for externally stored categories. That is something that our users really need. I can think of two methods.

One is to add a field categories_reference that refers to a resource in a datapackage (as with foreignKeys):

"categories_reference": {
    "datapackage": "http://data.okfn.org/data/mydatapackage/",
    "resource": "agecategories",
}

Another possibility is to use foreignKeys. The only thing necessary is to indicate the kind of relation that exists between this field and the foreign resource, perhaps by adding a relation field:

"foreignKeys": [
    {
      "fields": "age",
      "relation" : "categories",
      "reference": {
        "datapackage": "http://data.okfn.org/data/mydatapackage/",
        "resource": "agecategories",
      }
    }
]

@rufuspollock
Contributor

Just to flag we're probably about to make some progress on this and to highlight the pandas approach:

http://pandas.pydata.org/pandas-docs/stable/categorical.html

@danfowler
Contributor

@rgrp I'd like to flag that support for remote "controlled vocabularies" comes up pretty often.

I have two thoughts:

  1. This issue should probably be tagged spec-json-table-schema.
  2. The foreignKeys approach suggested by @djvanderlaan is actually recommended by W3C CSVW

@rufuspollock
Contributor

Options:

  • Introduce categorical type
  • Use "enum" in constraint
    • categorical is both less and more than a constraint. Yes, it does require that the data be in the list, but it can have labels as well as codes and may have an order.
  • Foreign Key references to another table.
    • Seems a bit painful (you need a separate table) and requires the FK stuff.
  • Ignore

My sense is that we do option one, though perhaps as a pattern before having it in the spec.
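For context, option two already exists in Table Schema as the enum constraint; a minimal sketch (field name illustrative), which validates membership but carries no labels, codes or explicit order:

{
  "name": "satisfaction",
  "type": "string",
  "constraints": {
    "enum": ["Not at all", "Little", "Neutral", "Quite a bit", "Very much"]
  }
}

A categorical type (option one) would additionally need somewhere to put codes, labels and an ordered flag; no such type exists in the spec yet, so any concrete shape for it is speculative.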

Excerpts

Pandas

http://pandas.pydata.org/pandas-docs/stable/categorical.html

This is an introduction to pandas categorical data type, including a short comparison with R’s factor.

Categoricals are a pandas data type, which correspond to categorical variables in statistics: a variable, which can take on only a limited, and usually fixed, number of possible values (categories; levels in R). Examples are gender, social class, blood types, country affiliations, observation time or ratings via Likert scales.

In contrast to statistical categorical variables, categorical data might have an order (e.g. ‘strongly agree’ vs ‘agree’ or ‘first observation’ vs. ‘second observation’), but numerical operations (additions, divisions, ...) are not possible.

All values of categorical data are either in categories or np.nan. Order is defined by the order of categories, not lexical order of the values. Internally, the data structure consists of a categories array and an integer array of codes which point to the real value in the categories array.

The categorical data type is useful in the following cases:

  • A string variable consisting of only a few different values. Converting such a string variable to a categorical variable will save some memory, see here.
  • The lexical order of a variable is not the same as the logical order (“one”, “two”, “three”). By converting to a categorical and specifying an order on the categories, sorting and min/max will use the logical order instead of the lexical order, see here.
  • As a signal to other python libraries that this column should be treated as a categorical variable (e.g. to use suitable statistical methods or plot types).

R

https://stat.ethz.ch/R-manual/R-devel/library/base/html/factor.html

factor(x = character(), levels, labels = levels,
     exclude = NA, ordered = is.ordered(x), nmax = NA)

Tell R that a variable is nominal by making it a factor. The factor stores the nominal values as a vector of integers in the range [1...k] (where k is the number of unique values in the nominal variable), and an internal vector of character strings (the original values) mapped to these integers.

@djvanderlaan

@rgrp What do you mean by option 1? Is that the introduction of a categorical type? If so, are you planning to store the categories in the datapackage?

I don't think it is necessary to introduce a separate type for categorical variables. In the CSV file the data is stored as a string/integer, so applications that aren't able to handle categorical variables can still read and use the variable as a string/integer.

@roll roll added the backlog label Aug 8, 2016
@roll roll removed the backlog label Aug 29, 2016
@rufuspollock rufuspollock added this to the Current milestone Sep 27, 2016
@danfowler
Contributor

Linking to most recent request for this feature here: https://discuss.okfn.org/t/can-you-add-code-descriptions-to-a-data-package/3532

@rufuspollock rufuspollock self-assigned this Nov 17, 2016
@rufuspollock
Contributor

Will implement with categorical type.

@pwalsh pwalsh modified the milestones: v1.1, v1.0 Feb 5, 2017
@pwalsh
Member

pwalsh commented Feb 5, 2017

@rufuspollock I'm moving this to v1.1 milestone.

@rufuspollock
Copy link
Contributor

@krassowski we definitely want to do this - we are waiting on someone to craft a full pattern and then it could become part of a spec.

@dtufood-kihen

Can we expect this to be implemented in the near future?

@abers

abers commented Nov 11, 2021

@krassowski we definitely want to do this - we are waiting on someone to craft a full pattern and then it could become part of a spec.

Is this something open for anyone to do? If so, what's required for a full pattern?

@pschumm
Contributor

pschumm commented Feb 8, 2022

We are starting to use frictionless for several projects and to recommend it to others, and this issue is a key one for us. Given that, we're willing to help put together a "full pattern" as noted in the comment above. @rufuspollock, could you point me to an example of exactly what is needed here?

Based on the discussion above, I'd be keen to hear people's thoughts on the following:

  1. As noted above, storing data as integers (e.g., 1 for "Yes", 0 for "No") is more efficient, though that is less important today than it was years ago, and especially so when the data are compressed. There are, however, a few significant disadvantages:

    • The data are no longer very useful for software that can't handle this (e.g., if I want to open the file simply as a delimited data file).
    • There can be an arbitrariness to the mapping between integers and categories, and this introduces that arbitrariness into the data file.
    • The risk for errors, especially when using software that can't handle the encoding/mapping automatically, is increased.

    Given this, I would advocate for storing the data as type string, or at least for providing that as an option.

  2. I think it is important to distinguish between three things (all related):

    • Indicating that a field has the ordered property. This is an inherent property of the data, and useful in any context (e.g., even if not analyzing the data with software capable of representing this). Thus, one might argue that something like ordered: true would make sense for any string field.
    • Indicating that the field should be interpreted as a categorical when working in software that can represent this (e.g., Python, R, Julia). By itself, this may not require anything but a single, additional property (e.g., categorical: true). Of course, unlike the ordered property, this is not an inherent property of the data but rather an instruction for how the data should be handled in specific contexts.
    • Providing a mapping between categories and integers that may be used by software to create what is called a value label (e.g., Stata, SPSS) or a format (SAS). This is important if you are using the data with one of those packages (quite common in many fields), but as with the item above, it is not really an inherent property of the data but rather an instruction. Given that, one might argue that these last two items shouldn't be an official part of the standard (but could be an extension).
  3. Finally, I believe this issue is related to the issue of missing values, and in particular, the designation of field-specific missing values. This is because for a lot of fields that would be treated as categorical, certain values can be either legitimate or missing depending on context (e.g., "Don't know" or "Refused" may be relevant for certain types of models, but not for others). Moreover, software such as Stata or SAS includes what are called "extended missing values" which can be included in value labels or formats. Thus, for example, you can map values such as "Don't know" and "Refused" to entities displayed as .a, .b, etc. (stored as very large or very small integers), and thereby include them for some calculations but exclude them for others.

    It's not clear to me at the moment what, if anything, we would need/want to do about this, but for completeness I wanted to mention it.

I'd be keen to hear others' thoughts about any of this. If you'd like, I'd be glad to share how we are handling it at the moment.
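For concreteness, one purely hypothetical way the three pieces above (an ordered flag, a categorical hint, and a value-label mapping) might sit together in a field descriptor; none of these property names are part of the spec, and only constraints.enum exists today:

{
  "name": "satisfaction",
  "type": "string",
  "constraints": {
    "enum": ["Not at all", "Little", "Neutral", "Quite a bit", "Very much"]
  },
  "ordered": true,
  "categorical": true,
  "valueLabels": {"1": "Not at all", "2": "Little", "3": "Neutral", "4": "Quite a bit", "5": "Very much"}
}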

@rufuspollock
Contributor

@pschumm great to hear you are up for taking this forward - very welcome!

We are starting to use frictionless for several projects and to recommend it to others, and this issue is a key one for us. Given that, we're willing to help put together a "full pattern" as noted in the comment above. @rufuspollock, could you point me to an example of exactly what is needed here?

See examples at https://specs.frictionlessdata.io/patterns/ - code at https://github.com/frictionlessdata/specs/tree/master/patterns

Given this, I would advocate for storing the data as type string, or at least for providing that as an option.

Yes, absolutely. Let's take (option of) simplicity for users over (enforced) storage efficiency for sure.

  1. I think it is important to distinguish between three things (all related):

I think your analysis is spot on and worth including in the introduction of your pattern.

Right now I'd focus on proposing a specific approach (it can be in the PR) and would suggest modelling it on whichever of pandas, R, or Julia handles this best.

@krassowski

I just wanted to highlight that pandas went ahead and added ordered: true for Categorical type in their schema (https://pandas.pydata.org/docs/user_guide/io.html#table-schema)
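For reference, the field entry pandas emits for an ordered categorical column in its Table Schema output looks roughly like this (per the pandas docs; exact shape may vary by version):

{
  "name": "rating",
  "type": "any",
  "constraints": {
    "enum": ["disagree", "neutral", "agree"]
  },
  "ordered": true
}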

@ipimpat

ipimpat commented Apr 30, 2022

@pschumm I'm in the same situation, I would also be happy to participate and help in where I can.

The lack of a full pattern for storing categorical data prevents my research group from adopting Frictionless.

All the scientists in my group primarily use SPSS, and we have a lot of SAS datasets too; therefore a full pattern for mapping between those tools/data formats and Frictionless would definitely be a must for our research group.

@pschumm
Contributor

pschumm commented May 1, 2022

Thanks @ipimpat and @krassowski for the nudge. FWIW, we're in a similar situation, needing to support workflows that use both software packages with "value labels" (e.g., Stata, SAS and SPSS) and packages with support for categoricals or similar (e.g., Pandas and R). Thus, our plan is not only to propose an addition to the Frictionless standard, but also to create the necessary packages/plugins to make use of this addition from within these analytic packages as seamlessly as possible (e.g., we have already started with Stata). So I think there is much opportunity for us to collaborate.

I'll try to get a full pattern drafted (as requested by @rufuspollock above) within the next week for comments.

@rjgladish

rjgladish commented May 2, 2022 via email

@ipimpat

ipimpat commented May 2, 2022

Thus, our plan is not only to propose an addition to the frictionless standard, but to create the necessary packages/plugins to make use of this addition from within these analytic packages as seamlessly as possible (e.g., we have already started with Stata). So I think there is much opportunity for us to collaborate.

I'm very open to collaborate.

@pschumm
Contributor

pschumm commented Aug 6, 2023

Finally getting back to this—sorry it's been so long. I have just submitted a PR (#844) for a pattern we are now using to support the use of value labels, categoricals and factors with Frictionless data packages. We are currently working on a full implementation for Stata, but are also keen to expand this to Pandas, R, SAS and SPSS. Would appreciate any comments or feedback, as well as volunteers to help with various implementations.

@roll
Member

roll commented Jan 3, 2024

Thanks a lot @pschumm! I hope this PR #844 closes this issue 🎉
