Supporting optionally ordered factors/enums #156
Comments
The solution we have implemented is to add a list of categories to the field descriptor:
"schema" : {
  "fields" : [
    {
      "name" : "age",
      "title" : "Age",
      "type" : "string",
      "categories" : [
        {
          "name" : "0_50",
          "title" : "0 to 50 years",
          "description" : "..."
        }, {
          "name" : "50_",
          "title" : "50 years and older",
          "description" : "..."
        }
      ]
    }
  ]
}
The name field should be of the same type as the type of the field. In this case the field is of type string.
Another option that was requested by our users was to be able to store the category descriptions in a separate CSV file. In some cases the number of categories can be very large. For example, in the case of regions the number is in the thousands. The data file only stores the unique region codes, while the CSV file with the category descriptions stores the (non-unique) titles. We need the latter when we present the data to end users. We solved this by also allowing the |
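A minimal Python sketch of how a consumer might use such a field descriptor: check that every value in a column is a declared category name, and look up the title for display. The descriptor layout follows the example above; the helper names `category_titles` and `invalid_values` are hypothetical and not part of any Data Package library.

```python
# Field descriptor following the "categories" example in this comment.
field = {
    "name": "age",
    "title": "Age",
    "type": "string",
    "categories": [
        {"name": "0_50", "title": "0 to 50 years", "description": "..."},
        {"name": "50_", "title": "50 years and older", "description": "..."},
    ],
}


def category_titles(field):
    """Map each category name to its human-readable title (hypothetical helper)."""
    return {c["name"]: c["title"] for c in field["categories"]}


def invalid_values(field, values):
    """Return the values that are not declared category names (hypothetical helper)."""
    allowed = {c["name"] for c in field["categories"]}
    return [v for v in values if v not in allowed]


column = ["0_50", "50_", "0_50", "51_"]
titles = category_titles(field)

# Show titles where known, fall back to the raw value otherwise.
print([titles.get(v, v) for v in column])
# Values that would fail validation against the declared categories.
print(invalid_values(field, column))
```

Because the stored values stay plain strings, a reader that ignores the metadata can still load the CSV file unchanged.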
This seems like it would work well. The only thing missing would be a way |
@djvanderlaan this seems a great proposal - in a sense we almost reuse the structure of the fields themselves. One thought is whether
From my perspective this seems like a highly useful addition to the spec and one that is backwards compatible so the moment we have something solid I think we can get this in. PS: @houshuang @djvanderlaan love to have your input in this new discussion thread about how people are using data package |
Sorry for the delay in the reply. I've been a bit busy. @rgrp It is not completely clear to me why But to add to the options:
What also remains open is how to add support for externally stored categories. That is something that our users really need. I can think of two methods. One is to add a field
Another possibility is to use the
|
Just to flag we're probably about to make some progress on this and to highlight the pandas approach: http://pandas.pydata.org/pandas-docs/stable/categorical.html |
@rgrp I'd like to flag that support for remote "controlled vocabularies" comes up pretty often. I have two thoughts:
|
Options:
My sense is that we do option one, though perhaps as a pattern before having it in the spec.
Excerpts
Pandas: http://pandas.pydata.org/pandas-docs/stable/categorical.html
This is an introduction to pandas categorical data type, including a short comparison with R’s factor. Categoricals are a pandas data type, which correspond to categorical variables in statistics: a variable, which can take on only a limited, and usually fixed, number of possible values (categories; levels in R). Examples are gender, social class, blood types, country affiliations, observation time or ratings via Likert scales. In contrast to statistical categorical variables, categorical data might have an order (e.g. ‘strongly agree’ vs ‘agree’ or ‘first observation’ vs. ‘second observation’), but numerical operations (additions, divisions, ...) are not possible. All values of categorical data are either in categories or np.nan. Order is defined by the order of categories, not lexical order of the values. Internally, the data structure consists of a categories array and an integer array of codes which point to the real value in the categories array. The categorical data type is useful in the following cases: A string variable consisting of only a few different values. Converting such a string variable to a categorical variable will save some memory, see here.
R: https://stat.ethz.ch/R-manual/R-devel/library/base/html/factor.html
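The "categories array plus integer codes" layout described in the pandas excerpt can be sketched in plain Python. This is an illustration of the idea only, not pandas' implementation; `encode` and `decode` are hypothetical helpers, and treating any value outside the categories as missing (code -1) is an assumption of this sketch.

```python
def encode(values, categories):
    """Return integer codes pointing into `categories`; -1 marks a missing value."""
    index = {cat: i for i, cat in enumerate(categories)}
    # Values not found in `categories` (including None) become -1.
    return [index.get(v, -1) for v in values]


def decode(codes, categories):
    """Turn codes back into category values; -1 becomes None."""
    return [categories[c] if c >= 0 else None for c in codes]


# The order of this list defines the ordering of the levels.
categories = ["disagree", "neutral", "agree"]
values = ["agree", "disagree", None, "agree"]

codes = encode(values, categories)
print(codes)                      # [2, 0, -1, 2]
print(decode(codes, categories))  # ['agree', 'disagree', None, 'agree']

# Ordering follows category position, not lexical order of the strings:
# "agree" sorts before "disagree" alphabetically, but ranks above it here.
print(codes[0] > codes[1])        # True
```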
|
@rgrp What do you mean with option 1? Is that the introduction of a I don't think it is necessary to introduce a separate type for categorical variables. In the CSV-file the data is stored as a |
Linking to most recent request for this feature here: https://discuss.okfn.org/t/can-you-add-code-descriptions-to-a-data-package/3532 |
Will implement with |
@rufuspollock I'm moving this to |
@krassowski we definitely want to do this - we are waiting on someone to craft a full pattern and then it could become part of a spec. |
Can we expect this to be implemented in the near future? |
Is this something open for anyone to do? If so, what's required for a full pattern? |
We are starting to use frictionless for several projects and to recommend it to others, and this issue is a key one for us. Given that, we're willing to help put together a "full pattern" as noted in the comment above. @rufuspollock, could you point me to an example of exactly what is needed here? Based on the discussion above, I'd be keen to hear people's thoughts on the following:
I'd be keen to hear others' thoughts about any of this. If you'd like, I'd be glad to share how we are handling it at the moment. |
@pschumm great to hear you are up for taking this forward - very welcome!
See examples at https://specs.frictionlessdata.io/patterns/ - code at https://github.com/frictionlessdata/specs/tree/master/patterns
Yes, absolutely. Let's take (option of) simplicity for users over (enforced) storage efficiency for sure.
I think your analysis is spot on and i think worth including this in the introduction of your pattern. Right now I'd focus on proposing a specific approach (can be in the PR) and would suggest modelling on whatever is the best out of pandas, R, Julia in their approach to this. |
I just wanted to highlight that pandas went ahead and added |
@pschumm I'm in the same situation, and I would also be happy to participate and help where I can. The lack of a full pattern for storing categorical data prevents my research group from adopting Frictionless. All the scientists in my group primarily use SPSS, and we have a lot of SAS datasets too, so a full pattern for "mapping" between those tools/data formats and Frictionless would definitely be a must for our research group. |
Thanks @ipimpat and @krassowski for the nudge. FWIW, we're in a similar situation needing to support workflows using both software packages that support "value labels" (e.g., Stata, SAS and SPSS) as well as packages with support for categoricals or similar (e.g., Pandas and R). Thus, our plan is not only to propose an addition to the frictionless standard, but to create the necessary packages/plugins to make use of this addition from within these analytic packages as seamlessly as possible (e.g., we have already started with Stata). So I think there is much opportunity for us to collaborate. I'll try to get a full pattern drafted (as requested by @rufuspollock above) within the next week for comments. |
FWIW, on the mapping front, I've been considering the value of an enumeration option that uses an object instead of an array to define the enumeration. I had decided that the benefit (for my purposes) was not worth the potential for a breaking change, because rdfType could be used to indirectly specify the set of enumerated values (although that is an external definition).
Although I am unfamiliar with the Stata and SPSS capabilities, some form of enumerant mapping that does not involve rdfType interdependencies would be a great option.
Here are a couple of snippets, using a simple "integer to category" mapping and an "enumerant to URI" mapping. Please note that these snippets add an enumset constraints property that would violate the AdditionalProperties = "false" schema constraint in the current Frictionless schema standard.
{
  "name": "GATETYPE",
  "title": "Gate Type",
  "description": "Physical gate type",
  "type": "integer",
  "rdfType": "http://.../lock/gateType#",
  "constraints": {
    "minimum": "1",
    "maximum": "7",
    "enumset": {
      "1": "Miter",
      "2": "deprecated",
      "3": "Sector",
      "4": "Tainter",
      "5": "Vertical & Submersible",
      "6": "Leaf",
      "7": "Replaced"
    }
  }
},
{
  "name": "REGION",
  "title": "Region",
  "description": "Region Code",
  "type": "string",
  "rdfType": "http://.../region/regionCode#",
  "constraints": {
    "pattern": "[A-Z]{2}",
    "enum": [
      "AT",
      "GL",
      "GU",
      "IN",
      "PC"
    ],
    "enumset": {
      "AT": "http://.../region/Atlantic#",
      "GL": "http://.../region/GreatLakes#",
      "GU": "http://.../region/GulfofMexico#",
      "IN": "http://.../region/InlandWaterway#",
      "PC": "http://.../region/PacificCoast#"
    }
  }
},
--
Regards,
Randy
|
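To illustrate how the proposed enumset object from the comment above might be consumed, here is a minimal Python sketch. Note that enumset is only a proposal in this thread, not part of the current standard, and `label_for` is a hypothetical helper. Since JSON object keys are always strings, the integer value has to be converted before the lookup.

```python
# GATETYPE field from the snippet above, trimmed to the parts used here.
gate_type = {
    "name": "GATETYPE",
    "type": "integer",
    "constraints": {
        "minimum": "1",
        "maximum": "7",
        # "enumset" is the property proposed in this thread, not a standard one.
        "enumset": {
            "1": "Miter",
            "2": "deprecated",
            "3": "Sector",
            "4": "Tainter",
            "5": "Vertical & Submersible",
            "6": "Leaf",
            "7": "Replaced",
        },
    },
}


def label_for(field, value):
    """Validate an integer value and return its category label (hypothetical helper)."""
    constraints = field["constraints"]
    low, high = int(constraints["minimum"]), int(constraints["maximum"])
    if not low <= value <= high:
        raise ValueError(f"{value} is outside the range {low}..{high}")
    try:
        # JSON object keys are strings, so convert the integer first.
        return constraints["enumset"][str(value)]
    except KeyError:
        raise ValueError(f"{value} has no entry in enumset") from None


print(label_for(gate_type, 4))  # Tainter
```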
I'm very open to collaborate. |
Finally getting back to this—sorry it's been so long. I have just submitted a PR (#844) for a pattern we are now using to support the use of value labels, categoricals and factors with Frictionless data packages. We are currently working on a full implementation for Stata, but are also keen to expand this to Pandas, R, SAS and SPSS. Would appreciate any comments or feedback, as well as volunteers to help with various implementations. |
In both R and Julia, there is a concept of factors or enums as an alternative to character strings. The idea is to specify a fixed set of possible values, and possibly an ordering of these. An example could be a questionnaire, which might have a field with Male/Female, Student/Working/Unemployed/Retired, or a Likert-style question with Not at all/Little/Neutral/Quite a bit/Very much.
The two main advantages are validation (you can ensure that there are no values other than the ones enumerated, plus NA) and storage efficiency. Internally, factors are often implemented by storing ints and a lookup table; compare storing 65,000 answers to the Likert-style question above as characters with storing 65,000 ints. The ability to order the levels (for example Not at all > Little > Neutral > Quite a bit > Very much) means that the levels will always be presented in the same order when graphing or showing summary statistics, rather than alphabetically or by order of first appearance.
I typically spend quite a bit of time cleaning up incoming questionnaire data by converting them to ordered factors, and would like to be able to preserve this information when sharing the datasets using dataprotocols.
In terms of implementation, there are two options. The first would be to keep the actual data exactly as it is, but note in the metadata that it is a factor with certain levels, and an optional ordering. This would presumably make the file "backwards-compatible", and still enable anyone to simply import the CSV without bothering with the metadata. We could still use the metadata to validate the CSV.
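A rough Python sketch of this first option: the CSV keeps the plain strings, while the metadata lists the levels and whether they are ordered. The property names `levels` and `ordered`, the sample data, and the helper `check_column` are all assumptions made for illustration; no agreed spec exists for them at this point in the thread.

```python
import csv
import io

# Hypothetical metadata for one column; "levels" and "ordered" are assumed names.
field = {
    "name": "satisfaction",
    "type": "string",
    "levels": ["Not at all", "Little", "Neutral", "Quite a bit", "Very much"],
    "ordered": True,
}

# The data file itself is unchanged and readable without the metadata.
csv_text = "id,satisfaction\n1,Neutral\n2,Very much\n3,Sometimes\n4,\n"


def check_column(csv_file, field):
    """Return (line number, value) pairs for values that are not declared levels."""
    allowed = set(field["levels"])
    errors = []
    # Line 1 is the header, so data rows start at line 2.
    for line_no, row in enumerate(csv.DictReader(csv_file), start=2):
        value = row[field["name"]]
        if value != "" and value not in allowed:  # empty string is treated as NA
            errors.append((line_no, value))
    return errors


errors = check_column(io.StringIO(csv_text), field)
print(errors)  # [(4, 'Sometimes')]

# With an ordering declared, values can be presented in level order
# rather than alphabetically.
rank = {level: i for i, level in enumerate(field["levels"])}
observed = ["Very much", "Not at all", "Neutral"]
print(sorted(observed, key=rank.__getitem__))
```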
An alternative, which would in many cases drastically reduce the size of the CSV file, is to store ints in the CSV file and provide a lookup table in the metadata (1=Not at all, 2=Somewhat, etc). It's possible that the difference between the two approaches becomes very small once the files are compressed, which would make this less backwards-compatible approach less important.
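The second option could be sketched as follows: the CSV holds integer codes and the metadata carries the lookup table. The lookup table contents, the sample data, and the helpers `decode_rows` and `sizes` are illustrative assumptions. The `sizes` helper only shows how one might measure raw versus gzip-compressed size to test the compression question on real data; no result is claimed here.

```python
import csv
import gzip
import io

# Hypothetical lookup table that would live in the metadata.
lookup = {
    "1": "Not at all",
    "2": "Little",
    "3": "Neutral",
    "4": "Quite a bit",
    "5": "Very much",
}

# The CSV stores only the integer codes.
coded_csv = "id,answer\n1,3\n2,5\n3,1\n"


def decode_rows(csv_file, column, lookup):
    """Replace the integer codes in `column` with their labels."""
    rows = []
    for row in csv.DictReader(csv_file):
        row[column] = lookup[row[column]]
        rows.append(row)
    return rows


def sizes(text):
    """Return (raw bytes, gzip-compressed bytes) for a CSV string."""
    raw = text.encode("utf-8")
    return len(raw), len(gzip.compress(raw))


decoded = decode_rows(io.StringIO(coded_csv), "answer", lookup)
print([row["answer"] for row in decoded])  # ['Neutral', 'Very much', 'Not at all']
print(sizes(coded_csv))
```

A reader that ignores the metadata would see only the integers, which is why this option is the less backwards-compatible of the two.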
Discussion and support for this suggestion here: http://discuss.okfn.org/t/something-like-rs-ordered-factors-or-enums-as-column-type/122/4