Conventions about the representations of scalar data #86

Closed
ablaom opened this issue Feb 20, 2019 · 9 comments
Labels
design discussion Discussing design issues

Comments

@ablaom
Member

ablaom commented Feb 20, 2019

I think MLJ should have clear conventions regarding the representation of the various "scientific" data types (continuous, ordered factor, and so forth). To this end, I have drafted this document and invite collaborators' responses.

Related: #81

@ablaom ablaom added the design discussion Discussing design issues label Feb 20, 2019
@fkiraly
Collaborator

fkiraly commented Feb 20, 2019

Makes sense to me, small comments:

  • should orderedFactorSth also be parametric with number of classes?
  • should continuous be parametric on a lower and upper bound? There are a few qualitative differences between "continuous on a bounded set" vs "continuous non-negative" vs "continuous (without restrictions)"
  • do we want "disjunction kinds", e.g., for allowing kinds such as "continuous, but could also be NA", or "either a number or one of these special classes"?

In addition, it might be worth considering functionality which:

  • for a table, gives the vector of kinds, in the same sequence as its columns
  • for a table, gives the set of kinds it has

@fkiraly
Collaborator

fkiraly commented Feb 20, 2019

also, I find the name "kind" weird.
Formally, it's simply another type, but the strange thing is that we're saying

"in type system 1 (= Julia standard) it has type X"
"in type system 2 (= scientific) it has type Y"

rather than "in system 1 it has type X and also type Y".
(thus it is not a conjunction type)

More precisely, we want to assert
"if a has type A in system 1, it has type B in system 2".

Is there a more natural way to do this "double typing" (i.e., the first option) in Julia?
Given that typing, and structured types, are key language features.

And given that we do not want to re-write all of Julia's type management in "system 2".

@ablaom
Member Author

ablaom commented Feb 20, 2019

Before reading previous two comments (addressed in additional post(s) below):

Updated document to include built-in Missing type and changed "kind" to "scitype".

@ablaom
Member Author

ablaom commented Feb 20, 2019

Makes sense to me, small comments:

  • should orderedFactorSth also be parametric with number of classes?

Sure. Makes sense.

  • should continuous be parametric on a lower and upper bound? There are a few qualitative differences between "continuous on a bounded set" vs "continuous non-negative" vs "continuous (without restrictions)"

I don't see why not. So I could have, e.g., Continuous{0,Inf}?

  • do we want "disjunction kinds", e.g., for allowing kinds such as "continuous, but could also be NA", or "either a number or one of these special classes"?

Yes. Julia already has Missing we can add to the mix (see updated doc).

In addition, it might be worth considering functionality which:

  • for a table, gives the vector of kinds, in the same sequence as its columns
  • for a table, gives the set of kinds it has

Absolutely, and this is needed for the task interface (see next comment box below). There is a minor technical annoyance: while an element in a column determines the scitype, the type of the element does not. The Tables.jl interface provides the eltypes, but we have to dig inside the table to get the scitype.
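
To make the point about elements versus element types concrete, here is a minimal sketch. Everything in it is hypothetical and for illustration only: the marker types, the `Label` struct (a stand-in for a categorical element whose levels and order live in fields rather than type parameters), and the `scitype`, `column_scitype` and `column_scitypes` helpers are not MLJ's or Tables.jl's API. The table is just a named tuple of column vectors.

```julia
# Hypothetical scitype markers: abstract types used only as labels.
abstract type Continuous end
abstract type Multiclass{N} end
abstract type OrderedFactor{N} end

# Stand-in for a categorical element. Levels and order are stored in
# fields, so the Julia type `Label` alone cannot reveal them.
struct Label
    value::String
    levels::Vector{String}
    ordered::Bool
end

# Scitype of a single element, read from the instance where necessary.
scitype(::AbstractFloat) = Continuous
scitype(::Missing) = Missing
function scitype(x::Label)
    n = length(x.levels)
    return x.ordered ? OrderedFactor{n} : Multiclass{n}
end

# Scitype of a column: the union of the scitypes of its elements.
column_scitype(col) = mapreduce(scitype, (S, T) -> Union{S, T}, col)

# Scitypes of all columns, in column order (a named tuple keyed by column).
column_scitypes(table::NamedTuple) = map(column_scitype, table)

levels = ["low", "mid", "high"]
table = (
    height = [1.7, 1.8, missing],
    grade = [Label("low", levels, true),
             Label("high", levels, true),
             Label("mid", levels, true)],
)

scitypes = column_scitypes(table)
scitype_set = Set(values(scitypes))
```

Here `eltype(table.grade)` is just `Label`, which says nothing about the three levels, whereas `scitypes.grade` is `OrderedFactor{3}` because the helper inspects the elements themselves. The `height` column comes out as `Union{Missing, Continuous}`.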

@ablaom
Member Author

ablaom commented Feb 20, 2019

also, I find the name "kind" weird.
Formally, it's simply another type, but the strange thing is that we're saying

Agreed, anticipated and corrected: kind -> scitype

"in type system 1 (= Julia standard) it has type X"
"in type system 2 (= scientific) it has type Y"

rather than "in system 1 it has type X and also type Y".
(thus it is not a conjunction type)

More precisely, we want to assert
"if a has type A in system 1, it has type B in system 2".

Unfortunately, the Julia type does not determine the scitype. It would if CategoricalValue had order and levels as type parameters, but this is not the case. I can get this information from an object but not from its type.

Is there a more natural way to do this "double typing" (i.e., the first option) in Julia?
Given that typing, and structured types, are key language features.

And given that we do not want to re-write all of Julia's type management in "system 2".

I'm not sure I understand your objection. Let me say that the scientific type hierarchy is a hierarchy of Julia types. So we get all the type semantics for free! So, eg, I can specify things like:

  • model type M can handle a multivariate target whose scitype is any subtype of Tuple{Continuous, Multiclass}, and handle input features with scitype any subtype of Union{Missing, Discrete}.

And the logic for matching models to tasks is compactly expressed. So, e.g., if the task data has multivariate target scitype T (e.g., Tuple{Continuous{0,1}, Binary}) then model M works if T <: Tuple{Continuous, Multiclass} is true.
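
The subtype check described above can be sketched in a few lines. The type definitions below are assumptions made for illustration (the hierarchy in the linked document may differ), and `supports`, `M_TARGET` and `M_INPUT` are hypothetical names. The sketch relies only on standard Julia behaviour: tuple types are covariant, and a parametric type with its parameters left free is a supertype of every concrete parameterisation.

```julia
# Hypothetical scitype hierarchy, assumed for this sketch only.
abstract type Continuous{L,U} end          # L, U: lower and upper bound
abstract type Discrete end
abstract type Multiclass{N} <: Discrete end
const Binary = Multiclass{2}

# What a hypothetical model type M declares it can handle.
const M_TARGET = Tuple{Continuous, Multiclass}
const M_INPUT  = Union{Missing, Discrete}

# A model supports some data if the data's scitype is a subtype of
# the scitype the model declares.
supports(declared, actual) = actual <: declared

T = Tuple{Continuous{0,1}, Binary}

supports(M_TARGET, T)                                        # true
supports(M_TARGET, Tuple{Continuous{0,1}, Continuous{0,1}})  # false
supports(M_INPUT, Union{Missing, Multiclass{3}})             # true
supports(M_INPUT, Continuous{0,1})                           # false
```

No matching logic beyond `<:` is needed, which is the sense in which the type semantics come for free.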

@ablaom
Member Author

ablaom commented Feb 21, 2019

Mmm. One misgiving about parameterising Continuous: one cannot infer the parameters (bounds) from data instances, unlike the other scitypes.

@fkiraly
Collaborator

fkiraly commented Feb 21, 2019

Regarding "objection": it's not an objection, just a comment that it appears that it might not be possible for a number (e.g., 42) to "have" both the type integer and the "scitype" OrderedFactorInfinite, in the sense that both are Julia types of the number.

@fkiraly
Collaborator

fkiraly commented Feb 21, 2019

Regarding bounds of Continuous: I see, that's a bit troubling.
But in a sense it mirrors the issue that integer class labels are automatically recognized as OrderedFactorInfinite, even if the user has a finite number of classes in mind.

Should there hence be an optional step of user input (e.g., triggering a type conversion to CategoricalArray)?
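
One way such an opt-in step might look is sketched below. It is only an illustration under stated assumptions: `FiniteLabel` is a hypothetical wrapper standing in for a categorical element (it is not the CategoricalArrays API), and `scitype` and `as_finite` are made-up helper names. By default an integer is read as OrderedFactorInfinite; only after the user explicitly converts the column does it carry a finite set of levels.

```julia
# Hypothetical scitype markers for this sketch.
abstract type OrderedFactorInfinite end
abstract type Multiclass{N} end

# Hypothetical wrapper: a value together with the finite pool of levels
# the user has in mind.
struct FiniteLabel{T}
    value::T
    levels::Vector{T}
end

# Default reading: a bare integer is an unbounded ordered factor.
scitype(::Integer) = OrderedFactorInfinite
# After conversion, the number of classes is read from the instance.
scitype(x::FiniteLabel) = Multiclass{length(x.levels)}

# Explicit, user-triggered conversion: the levels are taken to be the
# distinct values observed in the vector.
function as_finite(v::AbstractVector)
    lv = sort(unique(v))
    return [FiniteLabel(x, lv) for x in v]
end

raw = [3, 1, 3, 2]
labels = as_finite(raw)

scitype(raw[1])     # OrderedFactorInfinite
scitype(labels[1])  # Multiclass{3}
```

Taking the levels from the observed values is itself a design choice; a user who knows of classes that do not appear in the sample would need to supply the levels explicitly.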

@ablaom
Member Author

ablaom commented Mar 1, 2019

Now implemented as the basis of an overhaul of trait functions (metadata). See Scientific Data Types and the updated Adding New Models guide for details.
