
API for models which can do multiple things (e.g., predict multiple output types) #81

Closed
fkiraly opened this issue Feb 14, 2019 · 22 comments
Labels
design discussion Discussing design issues

Comments

@fkiraly
Collaborator

fkiraly commented Feb 14, 2019

Currently, the API follows the idea of separating deterministic/probabilistic models, which I second, as I think these are different types of tasks.

My question now, though, is: should we also split models by output type, as specified in traits? E.g., as @ablaom has stated in MLJmodels issue no.2, should GLM for discrete and for discrete multi-class outputs be different models?
Or would it even make sense to join the regressors and classifiers in this case, as the central machinery seems highly overlapping?

I think we may want to think about whether we need API patterns for "agglomerating" the functionality of models that solve different task types.

Unfortunately, it is less clear to me how (or why) to do this than the clean "splitting up" alternative.

Note that the latter "splitting up" option can perfectly well use helper functions and split only the interface, so it also allows us to avoid computational redundancy.

@fkiraly fkiraly changed the title API for models which can do multiple things (e.g., predict multiple outputs) API for models which can do multiple things (e.g., predict multiple output types) Feb 14, 2019
@fkiraly
Collaborator Author

fkiraly commented Feb 14, 2019

Would it make sense to set traits depending on hyper-parameter settings (e.g., link = Poisson vs link = Gaussian)?

@tlienart
Collaborator

tlienart commented Feb 14, 2019

Wasn't another motivation for separating by outputs to help with the chaining of models? (and maybe the appropriate selection of metrics)

Initially my thinking behind just interfacing with the generic GLM.glm was that it would be easier to maintain over time but I don't think this is really an issue.

A minor side issue with using the fully generic GLM.glm is that we would potentially allow non-matching distribution/link pairs (or would need to write slightly messy code to check) e.g. Poisson wouldn't go with LogitLink.
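The kind of distribution/link compatibility check mentioned here could be sketched as follows. This is a hypothetical helper, not part of GLM.jl's actual API; the table of admissible pairs is illustrative and incomplete:

```julia
# Hypothetical table of admissible distribution/link pairs (illustrative only).
const ALLOWED_LINKS = Dict(
    :Poisson   => [:LogLink],
    :Bernoulli => [:LogitLink, :ProbitLink, :CloglogLink],
    :Normal    => [:IdentityLink],
)

# Return true if the pair is admissible; throw an informative error otherwise.
function check_link(dist::Symbol, link::Symbol)
    allowed = get(ALLOWED_LINKS, dist, Symbol[])
    link in allowed || error("link $link is not supported for distribution $dist")
    return true
end
```

So `check_link(:Poisson, :LogitLink)` would throw, which is exactly the "slightly messy" validation code a fully generic wrapper around `GLM.glm` would need.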

@fkiraly
Collaborator Author

fkiraly commented Feb 14, 2019

Well, if the traits are set depending on which hyperparameters (e.g., link functions) you choose, that shouldn't be an issue?

@ablaom
Member

ablaom commented Feb 14, 2019

For dispatch reasons, traits are functions of model types, not instances. So if a trait is to depend on a hyperparameter, then that hyperparameter needs to become a type parameter as well. I vote against this complication.

However, we can reduce some code duplication by organizing the trait values (which can be types, instead of symbols as at present) into a type hierarchy. I was planning this anyway to improve things like metric selection. So the hierarchy for output_kind values would be:

Continuous

Discrete
    Multiclass
        Binary
    OrderedFactor
        OrderedFactorFinite
        OrderedFactorInfinite
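A minimal sketch of this idea, with the trait values as abstract types and the trait itself a function of the model type. The root name `OutputKind` and the model `SomeClassifier` are hypothetical, chosen only for illustration:

```julia
# Trait *values* organized into a type hierarchy, following the list above.
abstract type OutputKind end
abstract type Continuous <: OutputKind end
abstract type Discrete <: OutputKind end
abstract type Multiclass <: Discrete end
abstract type Binary <: Multiclass end
abstract type OrderedFactor <: Discrete end
abstract type OrderedFactorFinite <: OrderedFactor end
abstract type OrderedFactorInfinite <: OrderedFactor end

# A trait is a function of the model *type* (not instance), so it can
# drive dispatch:
struct SomeClassifier end
output_kind(::Type{SomeClassifier}) = Multiclass

# Subtyping of trait values then replaces explicit case distinctions:
is_discrete(M::Type) = output_kind(M) <: Discrete
```

Because `Binary <: Multiclass <: Discrete`, code written against `Discrete` automatically covers all the finer-grained kinds, which is where the duplication savings come from.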

@fkiraly
Collaborator Author

fkiraly commented Feb 15, 2019

Hm, I see, @ablaom, makes sense.

I've thought about a hierarchy of traits too, but the problem here is that there are traits that are not in an order relation to each other, but "orthogonal": i.e., probabilistic/deterministic, discrete/continuous, and univariate/multivariate.

Not sure what the most practical way is to account for that if you want to introduce a hierarchy.

@ablaom
Member

ablaom commented Feb 17, 2019

Univariate/Multivariate are values of a different trait, output_quantity. Probabilistic/Deterministic is not currently determined by a trait, but by a model's supertype. So presently a combination of the model type hierarchy and traits is used to specify properties.

Recall that the current model type hierarchy is:

abstract type MLJType end
abstract type Model <: MLJType end
abstract type Supervised{R} <: Model end
abstract type Unsupervised <: Model end
abstract type Probabilistic{R} <: Supervised{R} end
abstract type Deterministic{R} <: Supervised{R} end

And the complete model traits are:

| method | return type | declarable return values | default value |
| --- | --- | --- | --- |
| output_kind | Symbol | :continuous, :binary, :multiclass, :ordered_factor_finite, :ordered_factor_infinite, :mixed | :unknown |
| output_quantity | Symbol | :univariate, :multivariate | :univariate |
| input_kinds | Vector{Symbol} | one or more of: :continuous, :multiclass, :ordered_factor_finite, :ordered_factor_infinite, :missing | Symbol[] |
| input_quantity | Symbol | :univariate, :multivariate | :multivariate |
| is_pure_julia | Symbol | :yes, :no | :unknown |
| load_path | String | unrestricted | "unknown" |
| package_name | String | unrestricted | "unknown" |
| package_uuid | String | unrestricted | "unknown" |
| package_url | String | unrestricted | "unknown" |
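To make the table concrete, here is a sketch of how a model implementation might declare these trait values. `MyRidgeRegressor` is a made-up model; this is not actual MLJ source:

```julia
# A hypothetical model with one hyperparameter.
mutable struct MyRidgeRegressor
    lambda::Float64
end

# Trait declarations, following the table of traits above:
output_kind(::Type{MyRidgeRegressor})     = :continuous
output_quantity(::Type{MyRidgeRegressor}) = :univariate
input_kinds(::Type{MyRidgeRegressor})     = [:continuous]
input_quantity(::Type{MyRidgeRegressor})  = :multivariate
is_pure_julia(::Type{MyRidgeRegressor})   = :yes

# The defaults in the table can be supplied as catch-all fallbacks:
output_kind(::Type) = :unknown
```

The specific method always beats the catch-all under Julia's dispatch, so models that declare nothing simply report the defaults.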

So, in particular, probabilistic/deterministic applies only to Supervised models. Maybe it should be a trait too?

In Julia it is slightly more cumbersome to dispatch on traits than on a type hierarchy. But perhaps the overall design is more consistent if we move probabilistic/deterministic to a trait? (What to call this trait?)

@datnamer

datnamer commented Feb 17, 2019

Is there a general logic determining what should be a trait and what should be part of the type hierarchy?

This package by @mauro3 could help with the trait verbosity: https://github.com/mauro3/SimpleTraits.jl. It's just macro sugar over Holy traits, IIRC.

@ablaom
Member

ablaom commented Feb 17, 2019

Thanks for that. Yes, I know about SimpleTraits. I guess I prefer to keep things explicit and avoid macros where they are not really essential.

Re your question: I guess this is more an issue of pragmatism than logic. Theoretically, we could use a model hierarchy for everything, but we would have a proliferation of types (e.g., SupervisedProbabilisticContinuousUnivariateOutputDiscreteOrOrderedFactorOrMissingUnivariateInputModel, or some such).

@fkiraly
Collaborator Author

fkiraly commented Feb 17, 2019

@ablaom I see; I think the current solution for trait specification perhaps makes the most sense of all the alternatives I can think of. My earlier comment was due to not being fully up to date on the matter of traits.

I.e., currently different "kinds" of traits are simply retrieved by different methods. I'm slightly worried about method proliferation, as opposed to there being a single one, e.g., a single traits method; but that maybe comes too close to the full first-order logic or language specification that we didn't want.

Regarding the issue of deterministic vs probabilistic: I think the distinction between "trait" and "child type" (e.g., prob/det vs uni/multi) makes most sense on the level of interface. If the interface is substantially different, I'd make a new type - if not, manage it as a trait.

The interface is clearly different for prob/det, somewhat different for output_quantity, and not substantially different for output_kind or input_kinds, in my subjective qualitative opinion.

So the current cut between prob/det and output_quantity makes sense to me.

@ablaom
Member

ablaom commented Feb 17, 2019 via email

@fkiraly
Collaborator Author

fkiraly commented Feb 17, 2019

> Okay. So then, in the current state, the Probabilistic/Deterministic distinction applies only to Supervised models. Is that acceptable?

Of course. Though,

  • I'm not sure at what state of maturity the unsupervised interface is, see "Unsupervised learning interfaces - is transformer too narrow?" #51. It might be interesting to think about what "probabilistic" means there, but that's maybe a discussion to have later.
  • Do you think, @ablaom, that the above discussion affects the original issue ("what about models that can do multiple things") in any way?

@ablaom
Member

ablaom commented Feb 17, 2019

Well, as I said, I vote for splitting models according to output type. It seems to me that the alternative is to add trait values as model type parameters, which will complicate the existing task interface. I thought the only substantive issue with splitting was code redundancy, but I think giving output_kind values a hierarchy, together with Julia-style trait dispatch, should mostly avoid that. Splitting also avoids a lot of case distinctions that can make reading and maintaining the code harder.

In issue #33 you did say you were in favour of the "more models" design, so I'm a bit confused about your new concern.

Maybe I misunderstand your original question above.

@fkiraly
Collaborator Author

fkiraly commented Feb 17, 2019

Ok, to explain my concern: this is about methods vs models.

On the side of models (or, more precisely, model types), I still favour disentangling tacked-together functions, or models that are actually composite, over having flags and added methods like predict_proba.
A bad example would be to have "the one model that does everything".

On the side of methods, I am against proliferation - i.e., having not many methods but only a few meaningful ones - simply to avoid "functionality overload" and confusing the user with a large feature set, and to ensure that each interface point is well thought out.
A bad example would be to have multiple named and separately implemented hyperparameter accessor functions, e.g., depending on whether they are regularization parameters, optimizer settings, or something else.

The two "ideals" may actually have to be traded off against each other in cases, similar to the more general design principles of API modularity and avoidance of API (method or class) proliferation.

@fkiraly
Collaborator Author

fkiraly commented Feb 17, 2019

@ablaom , given current information, I also vote for "splitting up" (i.e., status quo), just wanted to have this discussion since chopping up GLM seemed a bit brutal, but I can see how this can be mitigated to avoid redundancy.

@ablaom
Member

ablaom commented Feb 18, 2019

@fkiraly Thanks for the clarification and for articulating some good design principles. I agree that "functionality overload" is to be avoided.

@ablaom
Member

ablaom commented Feb 21, 2019

After further thought, combining models that handle different target scientific types (a la #86) is possible without a big impact on the existing API design. You could introduce the target scientific type as a model type parameter, TargetSciType, for dispatch, so long as the value of TargetSciType can be inferred from hyperparameters in the model's keyword constructor.

So, for example, @tlienart could do GLM{Continuous}, GLM{Multiclass}, and so forth.

There are some implications for the task-matching process as it is planned now, but I don't think they are as worrisome as I first thought. The algorithm searching for models for a given task searches over "root" model types (so GLM is a root type but GLM{TargetSciType, D} is not). So what the algorithm returns in the case of models split by target scitype will be more specific than what is returned if the models are "agglomerated". Since the task will know what the actual target scitype is, this doesn't matter, right?

Technical note: Because of the way model search (and loading) works, you need, in GLM for example, to define target_scitype_upperbound(GLM). It is not enough to define target_scitype_upperbound(GLM{T}) for each T.
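The technical note can be sketched as follows, with hypothetical scitypes standing in for the real ones. The point is that the root (UnionAll) type needs its own declaration in addition to the per-parameter ones:

```julia
# Hypothetical scitypes for illustration.
abstract type Continuous end
abstract type Count end

# The combined model, parameterized by target scitype.
struct GLM{TargetSciType} end

# Declarations for each concrete parameter value:
target_scitype_upperbound(::Type{GLM{Continuous}}) = Continuous
target_scitype_upperbound(::Type{GLM{Count}})      = Count

# Model search sees only the root type GLM, so this is also required;
# the upper bound must cover all parameter values:
target_scitype_upperbound(::Type{GLM}) = Union{Continuous,Count}
```

Julia's dispatch keeps the two kinds of declaration separate: `GLM{Continuous}` matches its own method, while the bare UnionAll `GLM` matches only the root declaration.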

@ablaom
Member

ablaom commented Jun 5, 2019

Okay, revisiting this after some time.

I think we could "combine models" with a simple non-breaking change to base. By combining, I mean one struct (e.g., GLM) with a type parameter taking a finite number of values (e.g., :linear, :log, :logit).

Currently, because of the way the Metadata.toml is constructed, one can't do things like

target_scitype_union(::Type{GLM{:linear}}) = Continuous
target_scitype_union(::Type{GLM{:log}}) = Count

Rather, one can only do one declaration for the UnionAll type, as in

target_scitype_union(::Type{GLM}) = something 

which requires us to split GLM into GLMRegressor and GLMCount, and so forth. To enable combining, I think we can just add one new method, submodels, which enumerates all of the actual cases, as in

submodels(GLM) = [GLM{:linear}, GLM{:log}, GLM{:logit}, etc]

This can have a fallback submodels(M) = [M], meaning nothing is combined. This method tells MLJRegistry what it needs to know to create individual model entries for each parameter value.
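A minimal sketch of this proposed mechanism (the model names are hypothetical):

```julia
# A combined model with a finite set of link-parameter values, and an
# ordinary model that combines nothing.
struct GLM{Link} end
struct RidgeRegressor end

# Fallback: by default a model type has no submodels, i.e. nothing is combined.
submodels(M::Type) = [M]

# A combined model enumerates its concrete cases for the registry:
submodels(::Type{GLM}) = [GLM{:linear}, GLM{:log}, GLM{:logit}]
```

The registry would then iterate over `submodels(M)` for every registered root type `M`, creating one metadata entry per element, so uncombined models incur no change at all.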

This change would not allow you to combine probabilistic and deterministic models, because this is "hard-wired" in the type hierarchy, not "soft-wired" via a trait.

But, should we do this? On the one hand, I am encountering some resistance among MLJ API implementers to the idea of splitting models. But that is partly because they are so used to being vague about the output of their models, which, as we know from other toolboxes, is kinda bad. Do we want to complicate our design to accommodate "bad practice" and a little code reduction?

@fkiraly
Collaborator Author

fkiraly commented Jun 6, 2019

@ablaom, to be fair, I think there is an argument for combining models which is (if I remember correctly) the only reason I brought this up initially: it can be algorithmically or computationally reasonable to have joint subroutines for models with different output scitypes, such as in the case of GLM where you slap a different link function on.

I agree with the main disadvantage: sloppiness in specification. A vagueness that API implementers may find slightly more permissive may, in the worst case, lead to end users cursing us until the end of our days. It's also a slippery slope, as it may lock you into broken secondary API decisions that you can never get rid of without a complete re-design. I may be exaggerating a little, but I guess we agree on the gist of the downside.

Personally, I think the current solution - where joint functionality lives in the interfaced method's home package but the MLJ interfaces remain separate - is workable.
I'm also not sure the change in design, using submodels, doesn't have complicating secondary effects, since after the change you always need to call submodels for inspection not to be broken. I.e., to me it looks like you now need to change this for every case of type inspection?

On a higher level, I'd rather see the problem as lying with your initial "one can't do things like".
I think it would be good if one could - but that may require a re-think of our ad-hoc (sci-)typing system, or of Julia's type system overall.

@datnamer

datnamer commented Jun 6, 2019

The nice thing about Julia's development is that it takes explicit feedback from ML use cases, so if there are any ways the type system is limiting, I'm sure @JeffBezanson would be happy to hear.

@fkiraly
Collaborator Author

fkiraly commented Jun 6, 2019

Just an update: I think what I said is partially incorrect. The "one can't do" is due to limitations in MLJRegistry.jl functionality rather than the Julia type system.

So @ablaom's suggestion is to change the functionality/format of Metadata.toml in MLJRegistry.jl, together with the model API and model registration API?
If so, I may have misunderstood the original intention, and what @ablaom proposes may be more reasonable than the status quo.

@fkiraly fkiraly mentioned this issue Jun 6, 2019
@ablaom
Member

ablaom commented Jun 10, 2019

@fkiraly I think we basically agree that, while not ideal in every way, the status quo, in terms of the API exposed to model implementers, is probably best.

As far as leveraging the type system to somehow overcome "one can't do" without adding new requirements on the model implementer, I don't see how to do this, but I will keep the problem in mind.

@fkiraly
Collaborator Author

fkiraly commented Jun 10, 2019

@ablaom Regarding the "how to": would this not be solved by a smart use of scientific union types? E.g., a GLM model inhabits the union type of "supervised learner with probabilistic count output" and "supervised learner with probabilistic continuous output"?
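This union-type idea could be sketched as follows. The scitype and trait names are hypothetical, used only to illustrate the suggestion:

```julia
# Hypothetical scitypes.
abstract type Continuous end
abstract type Count end

# A made-up combined model.
struct MyGLM end

# One declaration with a Union value expresses that the model inhabits both
# "probabilistic count output" and "probabilistic continuous output":
target_scitype_union(::Type{MyGLM}) = Union{Continuous,Count}

# Task matching can then test membership by subtyping against the union:
handles_target(M::Type, T::Type) = T <: target_scitype_union(M)
```

A task with Count targets and a task with Continuous targets would both match MyGLM via the same subtyping test, without splitting the model type.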

4 participants