
API for models which can do multiple things (e.g., predict multiple output types) #81

Closed
fkiraly opened this issue Feb 14, 2019 · 22 comments
Labels
design discussion Discussing design issues

Comments

@fkiraly
Collaborator

fkiraly commented Feb 14, 2019

Currently, the API follows the idea of separating deterministic/probabilistic models, which I second, as I think these are different types of tasks.

My question now, though, is: should we also split models by output type, as specified in traits? E.g., as @ablaom has stated in MLJmodels issue no.2, should GLM for discrete and for discrete multi-class outputs be different models?
Or would it even make sense to join the regressors and classifiers in this case, as the central machinery seems highly overlapping?

I think we may want to think about whether we need API patterns for "agglomerating" the functionality of models that solve different task types.

Unfortunately, it is less clear to me how (or why) to do this than the clean "splitting up" alternative.

Note that the latter "splitting up" option can perfectly well use helper functions and split only the interface, so it also allows us to avoid computational redundancy.

@fkiraly fkiraly changed the title API for models which can do multiple things (e.g., predict multiple outputs) API for models which can do multiple things (e.g., predict multiple output types) Feb 14, 2019
@fkiraly
Collaborator Author

fkiraly commented Feb 14, 2019

Would it make sense to set traits depending on hyper-parameter settings (e.g., link = Poisson vs link = Gaussian)?

@tlienart
Collaborator

tlienart commented Feb 14, 2019

Wasn't another motivation for separating by outputs to help with the chaining of models? (and maybe the appropriate selection of metrics)

Initially my thinking behind just interfacing with the generic GLM.glm was that it would be easier to maintain over time but I don't think this is really an issue.

A minor side issue with using the fully generic GLM.glm is that we would potentially allow non-matching distribution/link pairs (or would need to write slightly messy code to check) e.g. Poisson wouldn't go with LogitLink.
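The kind of distribution/link compatibility check mentioned here could be sketched as follows. This is a hypothetical helper, not part of GLM.jl's actual API; the table of admissible pairs is illustrative and incomplete:

```julia
# Hypothetical table of admissible distribution/link pairs (illustrative only).
const ALLOWED_LINKS = Dict(
    :Poisson   => [:LogLink],
    :Bernoulli => [:LogitLink, :ProbitLink, :CloglogLink],
    :Normal    => [:IdentityLink],
)

# Return true if the pair is admissible; throw an informative error otherwise.
function check_link(dist::Symbol, link::Symbol)
    allowed = get(ALLOWED_LINKS, dist, Symbol[])
    link in allowed || error("link $link is not supported for distribution $dist")
    return true
end
```

So `check_link(:Poisson, :LogitLink)` would throw, which is exactly the "slightly messy" validation code a fully generic wrapper around `GLM.glm` would need.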

@fkiraly
Collaborator Author

fkiraly commented Feb 14, 2019

Well, if the traits are set depending on which hyperparameters (e.g., link functions) you choose, that shouldn't be an issue?

@ablaom
Member

ablaom commented Feb 14, 2019

For dispatch reasons, traits are functions of model types, not instances. So if a trait is to depend on a hyperparameter, then that hyperparameter needs to become a type parameter as well. I vote against this complication.

However, we can reduce some code duplication by organizing the trait values (which can be types, instead of symbols as at present) into a type hierarchy. I was planning this anyway to improve things like metric selection. So the hierarchy for output_kind values would be:

Continuous

Discrete
    Multiclass
        Binary
    OrderedFactor
        OrderedFactorFinite
        OrderedFactorInfinite
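A minimal sketch of this idea, with the trait values as abstract types and the trait itself a function of the model type. The root name `OutputKind` and the model `SomeClassifier` are hypothetical, chosen only for illustration:

```julia
# Trait *values* organized into a type hierarchy, following the list above.
abstract type OutputKind end
abstract type Continuous <: OutputKind end
abstract type Discrete <: OutputKind end
abstract type Multiclass <: Discrete end
abstract type Binary <: Multiclass end
abstract type OrderedFactor <: Discrete end
abstract type OrderedFactorFinite <: OrderedFactor end
abstract type OrderedFactorInfinite <: OrderedFactor end

# A trait is a function of the model *type* (not instance), so it can
# drive dispatch:
struct SomeClassifier end
output_kind(::Type{SomeClassifier}) = Multiclass

# Subtyping of trait values then replaces explicit case distinctions:
is_discrete(M::Type) = output_kind(M) <: Discrete
```

Because `Binary <: Multiclass <: Discrete`, code written against `Discrete` automatically covers all the finer-grained kinds, which is where the duplication savings come from.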

@fkiraly
Collaborator Author

fkiraly commented Feb 15, 2019

Hm, I see, @ablaom, makes sense.

I've thought about a hierarchy of traits too, but the problem here is that there are traits that are not in an order relation to each other, but "orthogonal": i.e., probabilistic/deterministic, discrete/continuous, and univariate/multivariate.

Not sure what the most practical way is to account for that if you want to introduce a hierarchy.

@ablaom
Member

ablaom commented Feb 17, 2019

Univariate/Multivariate are values of a different trait, output_quantity. Probabilistic/Deterministic is not currently determined by a trait, but by a model's supertype. So presently a combination of the model type hierarchy and traits is used to specify properties.

Recall that the current model type hierarchy is:

abstract type MLJType end
abstract type Model <: MLJType end
abstract type Supervised{R} <: Model end
abstract type Unsupervised <: Model end
abstract type Probabilistic{R} <: Supervised{R} end
abstract type Deterministic{R} <: Supervised{R} end

And the complete model traits are:

| method | return type | declarable return values | default value |
| --- | --- | --- | --- |
| output_kind | Symbol | :continuous, :binary, :multiclass, :ordered_factor_finite, :ordered_factor_infinite, :mixed | :unknown |
| output_quantity | Symbol | :univariate, :multivariate | :univariate |
| input_kinds | Vector{Symbol} | one or more of: :continuous, :multiclass, :ordered_factor_finite, :ordered_factor_infinite, :missing | Symbol[] |
| input_quantity | Symbol | :univariate, :multivariate | :multivariate |
| is_pure_julia | Symbol | :yes, :no | :unknown |
| load_path | String | unrestricted | "unknown" |
| package_name | String | unrestricted | "unknown" |
| package_uuid | String | unrestricted | "unknown" |
| package_url | String | unrestricted | "unknown" |
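To make the table concrete, here is a sketch of how a model implementation might declare these trait values. `MyRidgeRegressor` is a made-up model; this is not actual MLJ source:

```julia
# A hypothetical model with one hyperparameter.
mutable struct MyRidgeRegressor
    lambda::Float64
end

# Trait declarations, following the table of traits above:
output_kind(::Type{MyRidgeRegressor})     = :continuous
output_quantity(::Type{MyRidgeRegressor}) = :univariate
input_kinds(::Type{MyRidgeRegressor})     = [:continuous]
input_quantity(::Type{MyRidgeRegressor})  = :multivariate
is_pure_julia(::Type{MyRidgeRegressor})   = :yes

# The defaults in the table can be supplied as catch-all fallbacks:
output_kind(::Type) = :unknown
```

The specific method always beats the catch-all under Julia's dispatch, so models that declare nothing simply report the defaults.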

So, in particular, probabilistic/deterministic applies only to Supervised models. Maybe it should be a trait too?

In Julia it is slightly more cumbersome to dispatch on traits than on a type hierarchy. But perhaps the overall design is more consistent if we move probabilistic/deterministic to a trait? (What to call this trait?)

@datnamer

datnamer commented Feb 17, 2019

Is there a general logic determining what should be a trait and what should be part of the type hierarchy?

This package by @mauro3 could help with the trait verbosity: https://github.com/mauro3/SimpleTraits.jl. It's just macro sugar over Holy traits, IIRC.

@ablaom
Member

ablaom commented Feb 17, 2019

Thanks for that. Yes, I know about SimpleTraits. I guess I prefer to keep things explicit and avoid macros where they are not really essential.

Re your question: I guess this is more an issue of pragmatism than logic. Theoretically, we could use a model hierarchy for everything, but we would have a proliferation of types (e.g., SupervisedProbabilisticContinuousUnivariateOutputDiscreteOrOrderedFactorOrMissingUnivariateInputModel, or some such).

@fkiraly
Collaborator Author

fkiraly commented Feb 17, 2019

@ablaom I see; I think the current solution for trait specification perhaps makes the most sense of all the alternatives I can think of. My earlier comment was due to not being fully up to date on the matter of traits.

I.e., currently different "kinds" of traits are simply retrieved by different methods. I'm slightly worried about method proliferation, as opposed to there being a single one, e.g., a single traits method; but that maybe comes too close to the full first-order logic or language specification that we didn't want.

Regarding the issue of deterministic vs probabilistic: I think the distinction between "trait" and "child type" (e.g., prob/det vs uni/multi) makes most sense on the level of interface. If the interface is substantially different, I'd make a new type - if not, manage it as a trait.

The interface is clearly different for prob/det, somewhat different for output_quantity, and not substantially different for output_kind or input_kinds, in my subjective qualitative opinion.

So the current cut between prob/det and output_quantity makes sense to me.

@ablaom
Member

ablaom commented Feb 17, 2019 via email

@fkiraly
Collaborator Author

fkiraly commented Feb 17, 2019

> Okay. So then, in the current state, the Probabilistic/Deterministic distinction applies only to Supervised models. Is that acceptable?

Of course. Though,

  • I'm not sure at what state of maturity the unsupervised interface is, see "Unsupervised learning interfaces - is transformer too narrow?" #51. It might be interesting to think about what "probabilistic" means there, but that's maybe a discussion to have later.
  • Do you think, @ablaom, that the above discussion affects the original issue ("what about models that can do multiple things") in any way?

@ablaom
Member

ablaom commented Feb 17, 2019

Well, as I said, I vote for splitting models according to output type. It seems to me that the alternative is to add trait values as model type parameters, which will complicate the existing task interface. I thought the only substantive issue with splitting was code redundancy, but I think giving output_kind values a hierarchy, together with Julia-style trait dispatch, should mostly avoid that. Splitting also avoids a lot of case distinctions that can make reading and maintaining the code harder.

In issue #33 you did say you were in favour of the "more models" design, so I'm a bit confused about your new concern.

Maybe I misunderstand your original question above.

@fkiraly
Collaborator Author

fkiraly commented Feb 17, 2019

Ok, to explain my concern: this is about methods vs models.

On the side of models (or, more precisely, model types), I still favour disentangling tacked-together functions, or models that are actually composite, over having flags and added methods like predict_proba.
A bad example would be to have "the one model that does everything".

On the side of methods, I am against proliferation - i.e., having not many methods but only a few meaningful ones - simply to avoid "functionality overload" and confusing the user with a large feature set, and to ensure that each interface point is well thought out.
A bad example would be to have multiple named and separately implemented hyperparameter accessor functions, e.g., depending on whether they are regularization parameters, optimizer settings, or something else.

The two "ideals" may actually have to be traded off against each other in cases, similar to the more general design principles of API modularity and avoidance of API (method or class) proliferation.

@fkiraly
Collaborator Author

fkiraly commented Feb 17, 2019

@ablaom , given current information, I also vote for "splitting up" (i.e., status quo), just wanted to have this discussion since chopping up GLM seemed a bit brutal, but I can see how this can be mitigated to avoid redundancy.

@ablaom
Member

ablaom commented Feb 18, 2019

@fkiraly Thanks for the clarification and for articulating some good design principles. I agree that "functionality overload" is to be avoided.

@ablaom
Member

ablaom commented Feb 21, 2019

After further thought, combining models that handle different target scientific types (a la #86) is possible without a big impact on the existing API design. You could introduce the target scientific type as a model type parameter, TargetSciType, for dispatch, so long as the value of TargetSciType can be inferred from hyperparameters in the model's keyword constructor.

So, for example, @tlienart could do GLM{Continuous}, GLM{Multiclass}, and so forth.

There are some implications for the task-matching process as it is planned now, but I don't think they are as worrisome as I first thought. The algorithm searching for models for a given task searches over "root" model types (so GLM is a root type but GLM{TargetSciType, D} is not). So what the algorithm returns in the case of models split by target scitype will be more specific than what is returned if the models are "agglomerated". Since the task will know what the actual target scitype is, this doesn't matter, right?

Technical note: Because of the way model search (and loading) works, you need, in GLM for example, to define target_scitype_upperbound(GLM). It is not enough to define target_scitype_upperbound(GLM{T}) for each T.
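The technical note can be sketched as follows, with hypothetical scitypes standing in for the real ones. The point is that the root (UnionAll) type needs its own declaration in addition to the per-parameter ones:

```julia
# Hypothetical scitypes for illustration.
abstract type Continuous end
abstract type Count end

# The combined model, parameterized by target scitype.
struct GLM{TargetSciType} end

# Declarations for each concrete parameter value:
target_scitype_upperbound(::Type{GLM{Continuous}}) = Continuous
target_scitype_upperbound(::Type{GLM{Count}})      = Count

# Model search sees only the root type GLM, so this is also required;
# the upper bound must cover all parameter values:
target_scitype_upperbound(::Type{GLM}) = Union{Continuous,Count}
```

Julia's dispatch keeps the two kinds of declaration separate: `GLM{Continuous}` matches its own method, while the bare UnionAll `GLM` matches only the root declaration.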

@ablaom
Member

ablaom commented Jun 5, 2019

Okay, revisiting this after some time.

I think we could "combine models" with a simple non-breaking change to base. By combining, I mean one struct (e.g., GLM) with a type parameter taking a finite number of values (e.g., :linear, :log, :logit).

Currently, because of the way the Metadata.toml is constructed, one can't do things like

target_scitype_union(::Type{GLM{:linear}}) = Continuous
target_scitype_union(::Type{GLM{:log}}) = Count

Rather, one can only do one declaration for the UnionAll type, as in

target_scitype_union(::Type{GLM}) = something 

which requires us to split GLM into GLMRegressor and GLMCount, and so forth. To enable combining, I think we can just add one new method, submodels, which enumerates all of the actual cases, as in

submodels(GLM) = [GLM{:linear}, GLM{:log}, GLM{:logit}, etc]

This can have a fallback submodels(M) = [M], meaning nothing is combined. This method tells MLJRegistry what it needs to know to create individual model entries for each parameter value.
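A minimal sketch of this proposed mechanism (the model names are hypothetical):

```julia
# A combined model with a finite set of link-parameter values, and an
# ordinary model that combines nothing.
struct GLM{Link} end
struct RidgeRegressor end

# Fallback: by default a model type has no submodels, i.e. nothing is combined.
submodels(M::Type) = [M]

# A combined model enumerates its concrete cases for the registry:
submodels(::Type{GLM}) = [GLM{:linear}, GLM{:log}, GLM{:logit}]
```

The registry would then iterate over `submodels(M)` for every registered root type `M`, creating one metadata entry per element, so uncombined models incur no change at all.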

This change would not allow you to combine probabilistic and deterministic models, because this is "hard-wired" in the type hierarchy, not "soft-wired" via a trait.

But, should we do this? On the one hand, I am encountering some resistance among MLJ API implementers to the idea of splitting models. But that is partly because they are so used to being vague about the output of their models, which, as we know from other toolboxes, is kinda bad. Do we want to complicate our design to accommodate "bad practice" and a little code reduction?

@fkiraly
Collaborator Author

fkiraly commented Jun 6, 2019

@ablaom, to be fair, I think there is an argument for combining models which is (if I remember correctly) the only reason I brought this up initially: it can be algorithmically or computationally reasonable to have joint subroutines for models with different output scitypes, such as in the case of GLM where you slap a different link function on.

I agree with the main disadvantage: sloppiness in specification. A vagueness that API implementers may find slightly more permissive may, in the worst case, lead to end users cursing us until the end of our days. It's also a slippery slope, as it may lock you into broken secondary API decisions that you can never get rid of without a complete re-design. I may be exaggerating a little, but I guess we agree on the gist of the downside.

Personally, I think the current solution - where joint functionality lives in the interfaced method's home package but the MLJ interfaces remain separate - is workable.
I'm also not sure the change in design, using submodels, doesn't have complicating secondary effects, since after the change you always need to call submodels for inspection not to be broken. I.e., to me it looks like you now need to change this for every case of type inspection?

On a higher level, I'd rather see the problem as lying with your initial "one can't do things like".
I think it would be good if one could - but that may require a re-think of our ad-hoc (sci-)typing system, or of Julia's type system overall.

@datnamer

datnamer commented Jun 6, 2019

The nice thing about Julia's development is that it takes explicit feedback from ML use cases, so if there are any ways the type system is limiting, I'm sure @JeffBezanson would be happy to hear.

@fkiraly
Collaborator Author

fkiraly commented Jun 6, 2019

Just an update: I think what I said is partially incorrect. The "one can't do" is due to limitations in MLJRegistry.jl functionality rather than the Julia type system.

So @ablaom's suggestion is to change the functionality/format of Metadata.toml in MLJRegistry.jl, together with the model API and model registration API?
If so, I may have misunderstood the original intention, and what @ablaom proposes may be more reasonable than the status quo.

@fkiraly fkiraly mentioned this issue Jun 6, 2019
@ablaom
Member

ablaom commented Jun 10, 2019

@fkiraly I think we basically agree that, while not ideal in every way, the status quo, in terms of the API exposed to model implementers, is probably best.

As far as leveraging the type system to somehow overcome "one can't do" without adding new requirements on the model implementer, I don't see how to do this, but I will keep the problem in mind.

@fkiraly
Collaborator Author

fkiraly commented Jun 10, 2019

@ablaom Regarding the "how to": would this not be solved by a smart use of scientific union types? E.g., a GLM model inhabits the union type of "supervised learner with probabilistic count output" and "supervised learner with probabilistic continuous output"?
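This union-type idea could be sketched as follows. The scitype and trait names are hypothetical, used only to illustrate the suggestion:

```julia
# Hypothetical scitypes.
abstract type Continuous end
abstract type Count end

# A made-up combined model.
struct MyGLM end

# One declaration with a Union value expresses that the model inhabits both
# "probabilistic count output" and "probabilistic continuous output":
target_scitype_union(::Type{MyGLM}) = Union{Continuous,Count}

# Task matching can then test membership by subtyping against the union:
handles_target(M::Type, T::Type) = T <: target_scitype_union(M)
```

A task with Count targets and a task with Continuous targets would both match MyGLM via the same subtyping test, without splitting the model type.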

4 participants