
Supporting optionally ordered factors/enums #156

Closed
houshuang opened this issue Dec 17, 2014 · 23 comments

@houshuang

In both R and Julia, there is a concept of factors or enums as an alternative to character strings. The idea is to specify a fixed set of possible values, and possibly an ordering of these. An example could be a questionnaire, which might have a field with Male/Female, Student/Working/Unemployed/Retired, or a Likert-style question with Not at all/Little/Neutral/Quite a bit/Very much.

The two main advantages are validation (you can ensure that no values occur other than the ones enumerated, plus NA) and storage efficiency (internally these are often implemented by storing ints plus a lookup table; compare storing the 65,000 Likert-style responses above as character strings with storing 65,000 ints). The ability to order the levels (for example Not at all > Little > Neutral > Quite a bit > Very much) means that the levels will always be presented in the same order when graphing or showing summary statistics, rather than alphabetically or by order of first appearance.

I typically spend quite a bit of time cleaning up incoming questionnaire data by converting them to ordered factors, and would like to be able to preserve this information when sharing the datasets using dataprotocols.

In terms of implementation, there are two options. The first would be to keep the actual data exactly as it is, but note in the metadata that it is a factor with certain levels, and an optional ordering. This would presumably make the file "backwards-compatible", and still enable anyone to simply import the CSV without bothering with the metadata. We could still use the metadata to validate the CSV.

An alternative, which would in many cases drastically reduce the size of the CSV file, is to store ints in the CSV file and provide a lookup table in the metadata (1=Not at all, 2=Somewhat, etc.). It's possible that the difference between the two approaches becomes very small once the files are compressed, which would make this less backwards-compatible approach less attractive.
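For illustration only, the first (metadata-only) option might look something like the sketch below in a field descriptor; the levels and ordered property names are hypothetical, not part of any existing spec:

"schema": {
  "fields": [
    {
      "name": "satisfaction",
      "type": "string",
      "levels": ["Not at all", "Little", "Neutral", "Quite a bit", "Very much"],
      "ordered": true
    }
  ]
}

The second option would keep the same metadata but store integer codes (1-5) in the CSV instead of the strings.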

Discussion and support for this suggestion here: http://discuss.okfn.org/t/something-like-rs-ordered-factors-or-enums-as-column-type/122/4

@djvanderlaan

The solution we have implemented is to add a categories field to the field description. This can be an array with the name, title and description of each of the categories (following the same fields as in the datapackage and resource descriptions), as in the example below:

"schema" : {
  "fields" : [
    "name" : "age",
    "title" : "Age",
    "type" : "string",
    "categories" : [
      {
        "name" : "0_50",
        "title" : "0 to 50 years",
        "description" : "..."
      }, {
        "name" : "50_",
        "title" : "50 years and older",
        "description" : "..."
      }
    ]
  ]
}

The name field should be of the same type as the type of the field: in this case the field is of type string and the names of the categories are also strings. This implementation has the advantage that it is backwards compatible with previous implementations. It also allows for the possibility mentioned by @houshuang of writing integers to the file and storing the labels in the datapackage.
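A rough sketch of that integer-coded variant under this proposal (values are illustrative): the field is typed integer, so the category names are integers while the titles carry the labels.

"schema" : {
  "fields" : [
    {
      "name" : "satisfaction",
      "type" : "integer",
      "categories" : [
        { "name" : 1, "title" : "Not at all" },
        { "name" : 2, "title" : "Somewhat" },
        { "name" : 3, "title" : "Very much" }
      ]
    }
  ]
}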

Another option requested by our users was the ability to store the category descriptions in a separate CSV file. In some cases the number of categories can be very large; for regions, for example, the number is in the thousands. The data file then only stores the unique region codes, while the CSV file with the category descriptions stores the (non-unique) titles, which is what we need when presenting the data to end users. We solved this by also allowing the categories field to be a URL to a CSV file. However, it would probably be better if this was a URL to another resource (perhaps in the same datapackage).

@houshuang
Author

This seems like it would work well. The only thing missing would be a way to indicate whether the categories are ordered or not?


@rufuspollock
Contributor

@djvanderlaan this seems a great proposal - in a sense we almost reuse the structure of the fields themselves.

One thought is whether categories is the right name for the attribute. Alternatives could be:

  • categories - quite descriptive, but has other real-world usages (and could be confusing in that it could be read as designating categories/tags for the field itself)
  • enums (bit geeky)
  • factors (R) - not super-meaningful but maybe that is good
  • ...

From my perspective this seems like a highly useful addition to the spec, and one that is backwards compatible, so the moment we have something solid I think we can get this in.

PS: @houshuang @djvanderlaan would love to have your input in this new discussion thread about how people are using Data Packages

@djvanderlaan

Sorry for the delay in the reply. I've been a bit busy.

@rgrp It is not completely clear to me why categories would be confusing. What other 'real world uses' are you referring to that might cause confusion?

But to add to the options:

  • levels - used in R; factor is the type; the categories are called levels.
  • labels - SPSS uses 'Value Labels'; I personally feel that 'label' refers more to the name/title of a category.

What also remains open is how to add support for externally stored categories. That is something that our users really need. I can think of two methods.

One is to add a field categories_reference that refers to a resource in a datapackage (as with foreignKeys):

"categories_reference": {
    "datapackage": "http://data.okfn.org/data/mydatapackage/",
    "resource": "agecategories",
}

Another possibility is to use foreignKeys. The only thing necessary is to indicate the kind of relation that exists between this field and the foreign resource, perhaps by adding a relation field:

"foreignKeys": [
    {
      "fields": "age",
      "relation" : "categories",
      "reference": {
        "datapackage": "http://data.okfn.org/data/mydatapackage/",
        "resource": "agecategories",
      }
    }
]

@rufuspollock
Contributor

Just to flag we're probably about to make some progress on this and to highlight the pandas approach:

http://pandas.pydata.org/pandas-docs/stable/categorical.html

@danfowler
Contributor

@rgrp I'd like to flag that support for remote "controlled vocabularies" comes up pretty often.

I have two thoughts:

  1. This issue should probably be tagged spec-json-table-schema.
  2. The foreignKeys approach suggested by @djvanderlaan is actually recommended by W3C CSVW

@rufuspollock
Contributor

Options:

  • Introduce categorical type
  • Use "enum" in constraint
    • categorical is both less and more than a constraint. Yes, it does require that the data be in the list, but it can have labels as well as codes and may have an order.
  • Foreign Key references to another table.
    • Seems a bit painful (you need a separate table) and requires the FK stuff.
  • Ignore

My sense is that we do option one, though perhaps as a pattern before having it in the spec.
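For context, option two already exists in Table Schema as the enum constraint; a minimal sketch (field name illustrative), which validates membership but carries no labels, codes or explicit order:

{
  "name": "satisfaction",
  "type": "string",
  "constraints": {
    "enum": ["Not at all", "Little", "Neutral", "Quite a bit", "Very much"]
  }
}

A categorical type (option one) would additionally need somewhere to put codes, labels and an ordered flag; no such type exists in the spec yet, so any concrete shape for it is speculative.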

Excerpts

Pandas

http://pandas.pydata.org/pandas-docs/stable/categorical.html

This is an introduction to pandas categorical data type, including a short comparison with R’s factor.

Categoricals are a pandas data type, which correspond to categorical variables in statistics: a variable, which can take on only a limited, and usually fixed, number of possible values (categories; levels in R). Examples are gender, social class, blood types, country affiliations, observation time or ratings via Likert scales.

In contrast to statistical categorical variables, categorical data might have an order (e.g. ‘strongly agree’ vs ‘agree’ or ‘first observation’ vs. ‘second observation’), but numerical operations (additions, divisions, ...) are not possible.

All values of categorical data are either in categories or np.nan. Order is defined by the order of categories, not lexical order of the values. Internally, the data structure consists of a categories array and an integer array of codes which point to the real value in the categories array.

The categorical data type is useful in the following cases:

  • A string variable consisting of only a few different values. Converting such a string variable to a categorical variable will save some memory, see here.
  • The lexical order of a variable is not the same as the logical order (“one”, “two”, “three”). By converting to a categorical and specifying an order on the categories, sorting and min/max will use the logical order instead of the lexical order, see here.
  • As a signal to other python libraries that this column should be treated as a categorical variable (e.g. to use suitable statistical methods or plot types).

R

https://stat.ethz.ch/R-manual/R-devel/library/base/html/factor.html

factor(x = character(), levels, labels = levels,
     exclude = NA, ordered = is.ordered(x), nmax = NA)

Tell R that a variable is nominal by making it a factor. The factor stores the nominal values as a vector of integers in the range [1...k] (where k is the number of unique values in the nominal variable), and an internal vector of character strings (the original values) mapped to these integers.

@djvanderlaan

@rgrp What do you mean by option 1? Is that the introduction of a categorical type? If so, are you planning to store the categories in the datapackage?

I don't think it is necessary to introduce a separate type for categorical variables. In the CSV file the data is stored as a string/integer, so applications that aren't able to handle categorical variables can still read and use the variable as a string/integer.

@roll roll added the backlog label Aug 8, 2016
@roll roll removed the backlog label Aug 29, 2016
@rufuspollock rufuspollock added this to the Current milestone Sep 27, 2016
@danfowler
Contributor

Linking to most recent request for this feature here: https://discuss.okfn.org/t/can-you-add-code-descriptions-to-a-data-package/3532

@rufuspollock rufuspollock self-assigned this Nov 17, 2016
@rufuspollock
Contributor

Will implement with categorical type.

@pwalsh pwalsh modified the milestones: v1.1, v1.0 Feb 5, 2017
@pwalsh
Member

pwalsh commented Feb 5, 2017

@rufuspollock I'm moving this to v1.1 milestone.

@rufuspollock
Copy link
Contributor

@krassowski we definitely want to do this - we are waiting on someone to craft a full pattern and then it could become part of a spec.

@dtufood-kihen

Can we expect this to be implemented in the near future?

@abers

abers commented Nov 11, 2021

@krassowski we definitely want to do this - we are waiting on someone to craft a full pattern and then it could become part of a spec.

Is this something open for anyone to do? If so, what's required for a full pattern?

@pschumm
Contributor

pschumm commented Feb 8, 2022

We are starting to use frictionless for several projects and to recommend it to others, and this issue is a key one for us. Given that, we're willing to help put together a "full pattern" as noted in the comment above. @rufuspollock, could you point me to an example of exactly what is needed here?

Based on the discussion above, I'd be keen to hear people's thoughts on the following:

  1. As noted above, storing data as integers (e.g., 1 for "Yes", 0 for "No") is more efficient, though that is less important today than it was years ago, and especially so when the data are compressed. There are, however, a few significant disadvantages:

    • The data are no longer very useful for software that can't handle this (e.g., if I want to open the file simply as a delimited data file).
    • There can be an arbitrariness to the mapping between integers and categories, and this introduces that arbitrariness into the data file.
    • The risk for errors, especially when using software that can't handle the encoding/mapping automatically, is increased.

    Given this, I would advocate for storing the data as type string, or at least for providing that as an option.

  2. I think it is important to distinguish between three things (all related):

    • Indicating that a field has the ordered property. This is an inherent property of the data, and useful in any context (e.g., even if not analyzing the data with software capable of representing this). Thus, one might argue that something like ordered: true would make sense for any string field.
    • Indicating that the field should be interpreted as a categorical when working in software that can represent this (e.g., Python, R, Julia). By itself, this may not require anything but a single, additional property (e.g., categorical: true). Of course, unlike the ordered property, this is not an inherent property of the data but rather an instruction for how the data should be handled in specific contexts.
    • Providing a mapping between categories and integers that may be used by software to create what is called a value label (e.g., Stata, SPSS) or a format (SAS). This is important if you are using the data with one of those packages (quite common in many fields), but as with the item above, it is not really an inherent property of the data but rather an instruction. Given that, one might argue that these last two items shouldn't be an official part of the standard (but could be an extension).
  3. Finally, I believe this issue is related to the issue of missing values, and in particular, the designation of field-specific missing values. This is because for a lot of fields that would be treated as categorical, certain values can be either legitimate or missing depending on context (e.g., "Don't know" or "Refused" may be relevant for certain types of models, but not for others). Moreover, software such as Stata or SAS includes what are called "extended missing values" which can be included in value labels or formats. Thus, for example, you can map values such as "Don't know" and "Refused" to entities displayed as .a, .b, etc. (stored as very large or very small integers), and thereby include them for some calculations but exclude them for others.

    It's not clear to me at the moment what, if anything, we would need/want to do about this, but for completeness I wanted to mention it.

I'd be keen to hear others' thoughts about any of this. If you'd like, I'd be glad to share how we are handling it at the moment.
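For concreteness, one purely hypothetical way the three pieces above (an ordered flag, a categorical hint, and a value-label mapping) might sit together in a field descriptor; none of these property names are part of the spec, and only constraints.enum exists today:

{
  "name": "satisfaction",
  "type": "string",
  "constraints": {
    "enum": ["Not at all", "Little", "Neutral", "Quite a bit", "Very much"]
  },
  "ordered": true,
  "categorical": true,
  "valueLabels": {"1": "Not at all", "2": "Little", "3": "Neutral", "4": "Quite a bit", "5": "Very much"}
}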

@rufuspollock
Contributor

@pschumm great to hear you are up for taking this forward - very welcome!

We are starting to use frictionless for several projects and to recommend it to others, and this issue is a key one for us. Given that, we're willing to help put together a "full pattern" as noted in the comment above. @rufuspollock, could you point me to an example of exactly what is needed here?

See examples at https://specs.frictionlessdata.io/patterns/ - code at https://github.com/frictionlessdata/specs/tree/master/patterns

Given this, I would advocate for storing the data as type string, or at least for providing that as an option.

Yes, absolutely. Let's take (option of) simplicity for users over (enforced) storage efficiency for sure.

  1. I think it is important to distinguish between three things (all related):

I think your analysis is spot on and worth including in the introduction of your pattern.

Right now I'd focus on proposing a specific approach (it can be in the PR) and would suggest modelling it on whichever of pandas, R, or Julia handles this best.

@krassowski

I just wanted to highlight that pandas went ahead and added ordered: true for Categorical type in their schema (https://pandas.pydata.org/docs/user_guide/io.html#table-schema)
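For reference, the field entry pandas emits for an ordered categorical column in its Table Schema output looks roughly like this (per the pandas docs; exact shape may vary by version):

{
  "name": "rating",
  "type": "any",
  "constraints": {
    "enum": ["disagree", "neutral", "agree"]
  },
  "ordered": true
}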

@ipimpat

ipimpat commented Apr 30, 2022

@pschumm I'm in the same situation, I would also be happy to participate and help in where I can.

The lack of a full pattern for storing categorical data prevents my research group from adopting Frictionless.

All the scientists in my group primarily use SPSS, and we have a lot of SAS datasets too; therefore a full pattern for mapping between those tools/data formats and Frictionless would definitely be a must for our research group.

@pschumm
Contributor

pschumm commented May 1, 2022

Thanks @ipimpat and @krassowski for the nudge. FWIW, we're in a similar situation, needing to support workflows that use both software packages with "value labels" (e.g., Stata, SAS and SPSS) and packages with support for categoricals or similar (e.g., Pandas and R). Thus, our plan is not only to propose an addition to the Frictionless standard, but also to create the necessary packages/plugins to make use of this addition from within these analytic packages as seamlessly as possible (e.g., we have already started with Stata). So I think there is much opportunity for us to collaborate.

I'll try to get a full pattern drafted (as requested by @rufuspollock above) within the next week for comments.

@rjgladish

rjgladish commented May 2, 2022 via email

@ipimpat

ipimpat commented May 2, 2022

Thus, our plan is not only to propose an addition to the frictionless standard, but to create the necessary packages/plugins to make use of this addition from within these analytic packages as seamlessly as possible (e.g., we have already started with Stata). So I think there is much opportunity for us to collaborate.

I'm very open to collaborate.

@pschumm
Contributor

pschumm commented Aug 6, 2023

Finally getting back to this—sorry it's been so long. I have just submitted a PR (#844) for a pattern we are now using to support the use of value labels, categoricals and factors with Frictionless data packages. We are currently working on a full implementation for Stata, but are also keen to expand this to Pandas, R, SAS and SPSS. Would appreciate any comments or feedback, as well as volunteers to help with various implementations.

@roll
Member

roll commented Jan 3, 2024

Thanks a lot @pschumm! I hope this PR #844 closes this issue 🎉
