Unsupervised learning interfaces - is transformer too narrow? #51

Open
fkiraly opened this issue Jan 23, 2019 · 23 comments
Labels
design discussion

Comments

@fkiraly
Collaborator

fkiraly commented Jan 23, 2019

Regarding unsupervised models such as PCA, k-means, etc., discussed in #44.

I know these are commonly encapsulated within the transformer formalism, but that would do the methodology behind them an injustice, as feature extraction is only one of the major use cases of unsupervised models. More precisely, there are, as far as I can see, three use cases:

(i) feature extraction. For clusterers, create a column with cluster assignment. For continuous dimension reducers, create multiple continuous columns.

(ii) model structure inference - essentially, inspection of the fitted parameters. E.g., PCA components and loadings, cluster separation metrics, etc. These may be of interest in isolation, or used as a (hyper-parameter) input to other atomic models in a learning pipeline.

(iii) full probabilistic modelling aka density estimation. This behaves as a probabilistic multivariate regressor/classifier on the input variables.

To start with, it makes sense to implement only "transformer" functionality, but it is maybe good to keep in mind for the implementation that one may eventually like to expose the other outputs via interfaces. E.g., the estimated multivariate density in a fully probabilistic implementation of k-means.
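
To make the three use cases concrete, here is a minimal, hypothetical Julia sketch (not the MLJ API; the types, function names, and the unit-variance Gaussian mixture are illustrative assumptions only):

struct ToyKMeansFit
    centres::Matrix{Float64}   # d x k matrix of cluster centres, learned by fit
end

# (i) feature extraction: map each row of X (n x d) to its nearest-centre index
transform(fr::ToyKMeansFit, X::Matrix{Float64}) =
    [argmin([sum(abs2, x .- c) for c in eachcol(fr.centres)]) for x in eachrow(X)]

# (ii) model structure inference: expose the fitted centres for inspection
fitted_params(fr::ToyKMeansFit) = (centres = fr.centres,)

# (iii) density estimation: likelihood of x under an equal-weight, unit-variance
# Gaussian mixture located at the centres
function density(fr::ToyKMeansFit, x::Vector{Float64})
    k = size(fr.centres, 2)
    sum(exp(-sum(abs2, x .- c) / 2) for c in eachcol(fr.centres)) /
        (k * (2π)^(length(x) / 2))
end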

@ablaom
Member

ablaom commented Jan 23, 2019

I think this is a good point. There are two choices for exposing extra functionality at present:

(i) fit may return additional information in its report dictionary (this could include functions/closures but was not the original intention)

(ii) one implements methods beyond transform dispatched on the fit-result. This presently requires adding ("registering") the method name to MLJBase.

ablaom added the design discussion label Jan 23, 2019
@fkiraly
Collaborator Author

fkiraly commented Jan 24, 2019

@ablaom, I think the report dictionary returned by fit should, at most, contain diagnostic reports of the fitting itself, and should not be abused for parameter inference or reporting.

I'd personally introduce a single method for all models, e.g., fitted_params, which could return a dictionary of model parameters and diagnostics. These would be different for each model - for example, for ordinary least squares regression, it might return coefficients, CIs, R-squared, and t/F test results.

What we may want to be careful about is the interaction with the parameter interface. I usually like to distinguish hyper-parameters = set externally, not changed by fit, and model parameters = no external access, set by fit.
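
Purely as an illustration of the kind of return value being proposed (field names and values are placeholders, not an agreed interface):

# Illustrative only: what such a fitted_params call might return for an
# ordinary least squares fit; every value below is a placeholder.
ols_fitted_params = Dict(
    :coefficients => [1.0, -2.0],                  # estimated coefficients
    :conf_int     => [(0.5, 1.5), (-2.6, -1.4)],   # 95% confidence intervals
    :r_squared    => 0.9,                          # in-sample R-squared
    :f_test       => (statistic = 10.0, dof = (2, 17), p_value = 0.001),
)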

@ablaom
Member

ablaom commented Jan 24, 2019

Two issues here:

Type of information to be accessed after a fit call. I suppose we can classify these into "parameter inference" and "other". It's not clear to me how "other" can be unambiguously divided further, but help me out here if you can.

Method of access. Dictionary or method. The original idea of the dictionary was that it would be a persistent kind of thing, or even some kind of log/history. A dictionary has the added convenience that one adds keys according to circumstance (e.g., if I set a hyperparameter requesting fit to rank features, then :feature_rankings is a key of the report dictionary; otherwise it is not). Actually, report isn't currently used to maintain a running log (by the corresponding machine), but it could be. A method has the advantage that extra computation required to produce the information wanted can be avoided until the user calls for it. Now that I think of it, method and dictionary could be combined - the method computes a dictionary that it returns.

I like the simplicity of returning a single object to report all information of possible interest, computed after every fit, whether it be fitted parameters or whatever. What is less clear to me is whether information that requires extra computation should be accessed:

(i) by requesting the computation through an "instruction" hyperparameter and returning the result in the same report object; or

(ii) having a dedicated method dispatched on the fit-result, like predict.

Your thoughts?

What we may want to be careful about is the interaction with the parameter interface. I usually like to distinguish hyper-parameters = set externally, not changed by fit, and model parameters = no external access, set by fit.
Agreed!

@fkiraly
Collaborator Author

fkiraly commented Feb 4, 2019

Some thoughts (after a longer time of thinking):

I think it would be a good idea to have a dedicated interface for fitted parameters, just as we have for hyperparameters, i.e., dictionary-style, and following exactly the same structure, nesting and accessor conventions for the fitting result as we have for the models.

What is automatically returned in this extension of fitresult should be "standard model parameters that are easy to compute", i.e., it can be more than what predict needs, but it shouldn't add a lot of computational overhead. It should consist of data-agnostic model structure parameters (e.g., model coefficients), or easy-to-obtain intermediate results for diagnostics (e.g., R-squared?).

Separate from this should be operations on the model that require significant computational overhead over fit/predict (e.g., variable importance), or that are data-dependent (e.g., F-test in-sample).

The standard stuff - i.e., standard methodology for diagnostics and parameter inference (e.g., for OLS: t-tests, CIs, F-test, R-squared, diagnostic plots) - I'd put in fixed dispatch methods: diagnose (returning a pretty-printable dict-like object of summaries) and diagnose_visualize (producing plots/visualizations).

Advanced and non-standard diagnostics (e.g., specialized diagnostics or non-canonical visualizations) should be external, but these will be facilitated through the standardized model parameter interface once it exists.

Thoughts?

@ablaom
Member

ablaom commented Mar 5, 2019

@fkiraly I have come around to accepting your suggestion for a dedicated method to retrieve fitted parameters, separate from the report field of a machine. I also agree that params and fitted_params (which will have "nested" values for composite models) should return the same kind of object. I think a Julia NamedTuple (like a dict but with ordered keys and type parameters for each value) is the way to go. This will also be the form of the (possibly nested) report field, and report will get an accessor function, so that params, fitted_params, report are all methods that can be called on a (fitted) machine to return a named tuple.

I am working on implementing these various things simultaneously.
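
For illustration (hypothetical names; nothing here is implemented yet), the kind of nested named tuple in view, for a composite model consisting of a standardizer followed by a ridge regressor, might look like:

fp = (standardizer = (means = [0.0, 1.5], stds = [1.0, 0.7]),
      ridge        = (coefs = [0.4, -1.2], bias = 0.05))

fp.ridge.coefs   # [0.4, -1.2]: ordered, typed keys accessed by dot syntax
keys(fp)         # (:standardizer, :ridge)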

@tlienart
Collaborator

tlienart commented Mar 6, 2019

I think a Julia NamedTuple (like a dict but with ordered keys and type parameters for each value) is the way to go

A noteworthy difference is that a NamedTuple is immutable - could that cause a problem here?
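
For reference, a minimal illustration of the immutability in question, and the usual workaround of constructing a replacement tuple:

nt = (coefs = [1.0, 2.0], bias = 0.5)
# nt.bias = 1.0                  # would throw an error: NamedTuples are immutable
nt2 = merge(nt, (bias = 1.0,))   # instead, build a new NamedTuple with the change
nt.coefs[1] = 9.9                # mutable values *inside* can still be mutated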

@fkiraly
Collaborator Author

fkiraly commented Mar 6, 2019

@ablaom, I'm on board with a NamedTuple or dictionary returned by a method. The method should be able to return abstract structs in its fields, and its return value should be able to change with each run of fit.

Regarding user interface: I'd make it a method (by dispatch), and call it "inspect" unless you have a better idea.

On a side note, I think this would also help greatly with the issue highlighted in the visualization issue #85, namely the "report" being possibly arcane and non-standardized.

Further to this, I think computationally expensive diagnostics, such as "interpretable machine learning" style meta-methods, should not be bundled with "inspect", but rather with external "interpretability meta-methods" (to be dealt with at a much later point).
The "inspect" interface point should be reserved for parameters or properties which do not add substantial computational overhead over "fit" - this could, for example, be defined as only constant (or log(#training data points)) added computational effort above "fit".

@fkiraly
Collaborator Author

fkiraly commented Mar 6, 2019

Hm, maybe two more default interface points - "print" and "plot" - would be great?
These are default interface points in R.

"print" gives back a written summary, for example:

Call:
lm(formula = weight ~ group - 1)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.0710 -0.4938  0.0685  0.2462  1.3690 

Coefficients:
         Estimate Std. Error t value Pr(>|t|)    
groupCtl   5.0320     0.2202   22.85 9.55e-15 ***
groupTrt   4.6610     0.2202   21.16 3.62e-14 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.6964 on 18 degrees of freedom
Multiple R-squared:  0.9818,	Adjusted R-squared:  0.9798 
F-statistic: 485.1 on 2 and 18 DF,  p-value: < 2.2e-16

"plot" produces a series of standard diagnostic plots, which may differ by model type and/or task. I would conjecture there's some that you always want for a task (e.g., cross-plot and residual plot for deterministic supervised regerssion; calibration curves for probabilistic classification), and some that you only want for a specific model class (e.g., learning curves for SGD based methods, heatmaps for tuning methods)

@fkiraly
Collaborator Author

fkiraly commented Mar 6, 2019

Interesting question: where would "cross-plots out-of-sample" sit? Probably only available in the evaluation/validation phase, i.e., with the benchmark orchestrator.

@fkiraly
Collaborator Author

fkiraly commented Mar 6, 2019

Actually, I notice you already made a suggestion for a name: fitted_params.
Also fine with me - though I wonder: should this include easy-to-compute stuff such as the F-statistic and in-sample R-squared as well? Or should that be left to (a separate interface point!) "inspect"? Thoughts?

@fkiraly
Collaborator Author

fkiraly commented Mar 6, 2019

Also, I realize I've already said some of these things, albeit slightly differently, on Feb 4.
So greetings, @fkiraly from the past, I reserve the right to not fully agree with you.

@ablaom
Member

ablaom commented Mar 7, 2019

To clarify the existing design, we have these methods (dispatched on machines, params also on models):

  • params to retrieve possibly nested hyperparameters
  • fitted_params to retrieve possibly nested learned parameters
  • report to retrieve most everything else (could be nested), including computationally expensive stuff

As laid out in the guide (see below): whether or not a computationally expensive item is actually computed is controlled by an "instruction" hyperparameter of the model. If a default value is not overridden, the item is empty (but the key is still there), a clue to the user that more is available. I prefer this to a separate method, to avoid method name proliferation.
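
A minimal sketch of this "instruction" hyperparameter pattern (the model, its field, and the ranking computation are made up for illustration; this is not an existing MLJ model, and the real fit signature also takes a verbosity argument, omitted here):

mutable struct ToyRegressor
    rank_features::Bool          # the "instruction" hyperparameter
end

function fit(model::ToyRegressor, X::Matrix{Float64}, y::Vector{Float64})
    fitresult = X \ y                               # least-squares coefficients
    rankings = model.rank_features ?
        sortperm(abs.(fitresult), rev = true) :     # crude stand-in for a ranking
        nothing                                     # key still present, but empty
    report = (feature_rankings = rankings,)
    return fitresult, nothing, report               # (fitresult, cache, report)
end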

I think the above covers mlr's "print" method. But we could overload Base.show for named tuples to make them more user-friendly. I don't like the name "print". Print what? Just about every command prints something. (Edit: but you could say the same about "report" - aarrgh! Maybe "extras"??)

Not so keen on changing the name of "report", as this is breaking.

@tlienart I think every item of report should be regenerated at every call to fit (or update), so that the information there is synchronised with the hyperparameter values attached to the machine's current model. So immutability is not an issue. So far, the params method is just a convenience method for the user; tuning is carried out using other methods.


From the guide:

  1. report is a (possibly empty) NamedTuple, for example,
    report=(deviance=..., dof_residual=..., stderror=..., vcov=...).
    Any training-related statistics, such as internal estimates of the
    generalization error, and feature rankings, should be returned in
    the report tuple. How, or if, these are generated should be
    controlled by hyperparameters (the fields of model). Fitted
    parameters, such as the coefficients of a linear model, do not go
    in the report as they will be extractable from fitresult (and
    accessible to MLJ through the fitted_params method, see below).

...

A fitted_params method may be optionally overloaded. Its purpose is
to provide MLJ access to a user-friendly representation of the
learned parameters of the model (as opposed to the
hyperparameters). They must be extractable from fitresult.

MLJBase.fitted_params(model::SomeSupervisedModelType, fitresult) -> friendly_fitresult::NamedTuple

For a linear model, for example, one might declare something like
friendly_fitresult=(coefs=[...], bias=...).

The fallback is to return (fitresult=fitresult,).
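
For example, a hypothetical implementation for a linear regressor whose fitresult happens to be the tuple (coefs, bias) might read (SomeLinearRegressor and that fitresult layout are assumptions, mirroring the guide's SomeSupervisedModelType placeholder):

MLJBase.fitted_params(model::SomeLinearRegressor, fitresult) =
    (coefs = fitresult[1], bias = fitresult[2])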

@fkiraly
Collaborator Author

fkiraly commented Mar 7, 2019

Very sensible. Maybe, do you want to make plot a specified/uniform interface point as well, along the lines of your suggestion in #85 (and/or mine above)?

A small detail regarding your reference to "mlr's print":
mlr doesn't have a particularly good interface for pretty-printing or plotting.

It is actually the R language itself (i.e., base R) which has "print" and "plot" as designated interface points.
Agreed with "print" being a strange choice of name though for pretty-printed reports - when I first saw this long long ago, I thought it might mean saving to a file, or calling an actual printer.

@fkiraly
Collaborator Author

fkiraly commented Mar 7, 2019

"report" could be "inspect" the next time we write an MLJ, but let's not change a working system.

@ablaom
Member

ablaom commented Mar 8, 2019

At the moment, the Plots.jl package's "plot" function is just about the "standard" Julia interface point for plotting, although the future is not clear to me and others may have a better crystal ball.

Plots.jl is a front end for plotting and, at present, most of the backends are still wrapped C/Python/Java code. It is a notorious nuisance to load and execute the first time. However, there is a "PlotsBase" (called PlotRecipes) which allows you to import the "plot" function you overload in your application, without loading Plots or a backend (until you need it).
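
For concreteness, a sketch of such a recipe using the @recipe macro from RecipesBase.jl (assuming that is the lightweight recipes package referred to; the ResidualPlot wrapper is hypothetical):

using RecipesBase   # lightweight; defines @recipe without pulling in Plots

struct ResidualPlot             # hypothetical wrapper around a machine's residuals
    residuals::Vector{Float64}
end

# Once the *user* loads Plots.jl, plot(ResidualPlot(r)) produces a scatter plot,
# but the package defining this recipe never depends on Plots or a backend.
@recipe function f(rp::ResidualPlot)
    seriestype := :scatter
    xlabel --> "observation"
    ylabel --> "residual"
    rp.residuals
end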

@fkiraly
Collaborator Author

fkiraly commented Mar 8, 2019

... we could factor this out into an MLJplots module, thus solving the dependency issue?
I'm starting to appreciate how Julia's dispatch philosophy makes this easy (though its package management functionality could be improved).

@ablaom
Member

ablaom commented Mar 8, 2019

No, no. This is not necessary. We only need PlotsBase (lightweight) as a dependency. The user does need to manually load Plots.jl if they want to plot, but I don't think that's a big deal. The backends get lazy-loaded (i.e., as needed).

@ablaom
Member

ablaom commented May 26, 2019

@fkiraly and others. Returning to your original comment opening this thread, where should one-class classification fit into our scheme? Unsupervised, yes?

@fkiraly
Collaborator Author

fkiraly commented May 27, 2019

In terms of taxonomy, I'd consider that something completely different, i.e., neither supervised nor unsupervised.

I'd consider one-class classifiers (including one-class kernel SVM) as an instance of outlier detectors, or anomaly detectors (if also on-line).

Even in the case where labelled outliers/artefacts/anomalies are provided in the training set, it's different from the (semi-)supervised task, since there is a designated "normal" class.

It's also different from unsupervised, since unsupervised methods have no interface point to feed back "this is an anomaly".

I.e., naturally, the one-class SVM would have a task-specific fit/detect interface (or similar, I'm not too insistent on naming here).

One could also consider it sitting in the wider class of "annotator" tasks.

@datnamer

Does this mean the type hierarchy is not granular enough? Maybe it should be traits.

@fkiraly
Collaborator Author

fkiraly commented May 27, 2019

@datnamer, that's an interesting question for @ablaom - where do we draw the distinction between type and trait?

If I recall an earlier discussion correctly, whenever we need to dispatch or inherit differently?

It's just a feeling, but I think anomaly detectors and (un)supervised learners should be different - you can use the latter to do the former, so it feels more like a wrapper/reduction rather than a trait variation.

@ablaom
Member

ablaom commented May 28, 2019

Some coarse distinctions are realised in a type hierarchy. From the docs:


The ultimate supertype of all models is MLJBase.Model, which
has two abstract subtypes:

abstract type Supervised <: Model end
abstract type Unsupervised <: Model end

Supervised models are further divided according to whether they are
able to furnish probabilistic predictions of the target (which they
will then do by default) or directly predict "point" estimates, for each
new input pattern:

abstract type Probabilistic <: Supervised end
abstract type Deterministic <: Supervised end

All further distinctions are realised with traits some of which take values in the scitype hierarchy or in types derived from them. An example of such a trait is target_scitype_union.

So, I suppose we create a new abstract subtype of MLJ.Model, called AnomalyDetection? With a predict method that only predicts Bool? Or only predicts objects of scitype Finite{2} (a CategoricalValue{Bool})? With the same traits delineating input scitypes that we have for Unsupervised models, yes?
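
A minimal sketch of that proposal (hypothetical; neither the abstract type nor the method exists in MLJ at the time of writing):

import MLJBase

abstract type AnomalyDetection <: MLJBase.Model end

# A concrete detector would implement fit as usual, plus a prediction operation
# returning Bool (or a two-class categorical) per input pattern, e.g.:
#     predict(detector, fitresult, Xnew) -> Vector{Bool}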

Obviously this is not a priority right now, but it did recently come up.

@fkiraly
Collaborator Author

fkiraly commented Jun 13, 2019

@ablaom, regarding AnomalyDetection: agreed, though I'd just call it detect rather than predict.

Regarding unsupervised learners: have we made progress on the distinction between at least (i) and (ii) from the first post? For #161 especially, a "transformer" type (or sub-type? aspect?) as in (i) would be necessary.

Update: actually, I think we will be fine with (i), i.e., transformer-style behaviour only, for ManifoldLearning.jl in #161.
