Port InferenceData conversion code to pymc3 codebase #4489

Merged · 7 commits · Mar 26, 2021

Conversation

@OriolAbril (Member) commented Feb 26, 2021

I have started porting the arviz.from_pymc3 converter to the pymc3 codebase.
This should allow us to simplify the code significantly (which has already happened quite a bit, and it can probably be simplified further) and to have an ArviZ-independent test suite.

closes arviz-devs/arviz#1278, closes arviz-devs/arviz#939 and closes arviz-devs/arviz#1470

I think we can also solve arviz-devs/arviz#1224: by vectorizing and being fast enough, a progress bar no longer makes sense.

There is surely a lot of room for improvement, as I am not yet too familiar with all the v4 changes.

@OriolAbril (Member, Author) left a comment

I have added some comments about places where I expect more changes to be needed. I have pushed to a branch on pymc3 instead of my fork, so that anyone with permission can work directly on this.

The gist of each of the converter methods is to generate a dict of str: np.ndarray, with the arrays following the (chain, draw, *shape) convention, and then call dict_to_dataset.
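For illustration, a minimal sketch of that convention (not the PR code; the import path and the "school" dimension name are just assumptions for the example):

import numpy as np
from arviz.data.base import dict_to_dataset

posterior_dict = {
    "m": np.random.randn(4, 100),         # (chain, draw)
    "theta": np.random.randn(4, 100, 3),  # (chain, draw, 3)
}
# dict_to_dataset adds the chain/draw dimensions and returns an xarray.Dataset
dataset = dict_to_dataset(posterior_dict, dims={"theta": ["school"]})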

I don't know if it would make sense to try to use __numpy_func__ with Aesara (and I don't know enough about either Aesara or __numpy_func__ to venture a guess); I only mention it because, if it did, InferenceData objects could have Aesara TensorVariables as the underlying data structure, and when working with xarray objects one could use .values to get a numpy array or .data to get the actual data array used. Right now ArviZ forces conversion to numpy, but we could change that.
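As a small illustration of those xarray accessors (plain numpy here, just to show the distinction):

import numpy as np
import xarray as xr

da = xr.DataArray(np.arange(6).reshape(2, 3), dims=("chain", "draw"))
print(da.values)  # always converted to a numpy array
print(da.data)    # the underlying data container (numpy here, but it can be another array type)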

import xarray as xr

from aesara.gof.graph import ancestors
from aesara.tensor.var import TensorVariable

Member Author

It used to check for PyMC3Variable; I had some doubts about whether to use Variable or TensorVariable.

Contributor

Usually, the difference comes down to whether or not the code requires shape information (e.g. the dtype and broadcastable fields that are exposed by TensorVariable via TensorVariable.type).
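A tiny illustration of that distinction (standard Aesara usage, not code from this PR):

import aesara.tensor as at
from aesara.graph.basic import Variable
from aesara.tensor.var import TensorVariable

x = at.vector("x")
assert isinstance(x, Variable) and isinstance(x, TensorVariable)
# TensorVariable exposes shape/type information through its type:
print(x.type.dtype)          # e.g. "float64"
print(x.type.broadcastable)  # e.g. (False,)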

Member Author

I don't think we explicitly use this info, but it should be taken into account when converting to numpy to make sure we preserve the dtypes of the original parameters and data. Thanks for the explanation, I will test to make sure dtypes are preserved.

Comment on lines 182 to 281
def log_likelihood_vals_point(self, point, var, log_like_fun):
    """Compute log likelihood for each observed point."""
    log_like_val = utils.one_de(log_like_fun(point))
    if var.missing_values:
        mask = var.observations.mask
        if np.ndim(mask) > np.ndim(log_like_val):
            mask = np.any(mask, axis=-1)
        log_like_val = np.where(mask, np.nan, log_like_val)
    return log_like_val

def _extract_log_likelihood(self, trace):
    """Compute log likelihood of each observation."""
    if self.trace is None:
        return None
    if self.model is None:
        return None

    if self.log_likelihood is True:
        cached = [(var, var.logp_elemwise) for var in self.model.observed_RVs]
    else:
        cached = [
            (var, var.logp_elemwise)
            for var in self.model.observed_RVs
            if var.name in self.log_likelihood
        ]
    log_likelihood_dict = _DefaultTrace(len(trace.chains))
    for var, log_like_fun in cached:
        for chain in trace.chains:
            log_like_chain = [
                self.log_likelihood_vals_point(point, var, log_like_fun)
                for point in trace.points([chain])
            ]
            log_likelihood_dict.insert(var.name, np.stack(log_like_chain), chain)
    return log_likelihood_dict.trace_dict

Member Author

I am hoping that all of this will go away and that the pointwise log likelihood can now be evaluated in a vectorized way; looping over variables, then over chains, then over draws should not be the way to go.

Contributor

Yes, it should be possible to evaluate a likelihood in a vectorized fashion. You'll still have to compile an Aesara function with a fixed set of dimensions for the input, but, assuming that the number/order of the input dimensions are always the same (e.g. chains + samples + var dims), that need only be done once for a given model.
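As a rough sketch of what that could look like (pure Aesara, with made-up shapes and a hand-written Normal log density rather than anything from pymc3):

import aesara
import aesara.tensor as at
import numpy as np

mu = at.tensor3("mu")              # assumed layout: (chain, draw, 1)
obs = np.array([0.5, -1.0, 2.0])   # hypothetical observed values

# elementwise Normal(mu, 1) log density, broadcast over chains, draws and observations
loglik = -0.5 * (mu - obs) ** 2 - 0.5 * np.log(2 * np.pi)
loglik_fn = aesara.function([mu], loglik)  # compiled once per model

samples = np.random.randn(4, 100, 1)       # (chain, draw, 1)
pointwise = loglik_fn(samples)             # shape (4, 100, 3)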


@requires(["trace", "predictions"])
@requires("model")
def constant_data_to_xarray(self):

Member Author

This may also need a significant refactor, but honestly I have no idea.

@brandonwillard (Contributor) commented Feb 27, 2021

Most of those attributes in self.model are still available, but they might correspond to different variables. For example, model.vars returns the log-likelihood-space TensorVariables, and the model.*_RVs methods return the sample-space TensorVariables.

Regardless, you can always get the corresponding log-likelihood-space TensorVariable from the sample-space variable using var.tag.value_var (I should probably rename it to var.tag.measure_space_var or var.tag.loglik_var, though).
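A quick illustration of that mapping, assuming the v4 development branch behaviour described in this thread:

import pymc3 as pm

with pm.Model() as model:
    x = pm.Normal("x", 0.0, 1.0)

rv = model.free_RVs[0]          # sample-space TensorVariable
value_var = rv.tag.value_var    # corresponding log-likelihood-space ("value") variable
print(rv, value_var)            # both display as "x", but they are distinct variables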

Member Author

All these checks and variable selections have a single goal: gathering all named variables that are neither sampled parameters nor observed values/observations. Up until now that basically meant pm.Data objects that were not passed as observed, things like the x in a linear regression, or the county and floor indexes in the radon example. Maybe we can think of a better way to identify these variables.

@brandonwillard (Contributor) commented Mar 13, 2021

Absolutely; we can very easily obtain variables like that using something as simple as aesara.graph.basic.vars_between, where the ins are the sampled/unobserved variables and the outs are the observed variables.
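Something along these lines, maybe (a rough sketch with a made-up model; the exact filtering would still need thought):

import numpy as np
import pymc3 as pm
from aesara.graph.basic import vars_between

with pm.Model() as model:
    x = pm.Data("x", np.linspace(0, 1, 10))
    beta = pm.Normal("beta", 0.0, 1.0)
    y = pm.Normal("y", mu=beta * x, sigma=1.0, observed=np.zeros(10))

# named variables in the subgraph between the free RVs and the observed RVs;
# after dropping the RVs themselves, data-like variables such as x remain
rvs = set(model.free_RVs) | set(model.observed_RVs)
named = [v for v in vars_between(model.free_RVs, model.observed_RVs) if v.name and v not in rvs]
print(named)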

@OriolAbril (Member, Author)

Thanks @Spaak! On a more general note, does ArviZ preserve the chain indexes when they don't start at 0? I fear the current code may work when different chain indexes are used, but the resulting xarray coordinates will still be 0, 1, 2, ...

This also poses a question about how we want to handle this going forward. Do we want to keep chain_idx handling within sample, or leave it to the conversion to InferenceData? I am not sure this can be solved satisfactorily before updating the backend from MultiTrace to something else.
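For reference, a small illustration of the chain-coordinate concern (hypothetical values, only showing the default behaviour):

import numpy as np
import arviz as az

idata = az.from_dict(posterior={"m": np.random.randn(2, 50)})
print(idata.posterior.chain.values)  # defaults to [0, 1] regardless of the original chain indexes
# preserving e.g. chains 2 and 3 would mean overwriting the coordinate explicitly:
posterior = idata.posterior.assign_coords(chain=[2, 3])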

Comment on lines 203 to 271
(var, var.logp_elemwise)
for var in self.model.observed_RVs
if var.name in self.log_likelihood

Contributor

This will need to be updated, because the var.log* methods are gone now. Instead, you can use something like logpt(var).

Member Author

I want all of log_likelihood_vals_point and _extract_log_likelihood to go away and to instead evaluate the likelihood in a vectorized fashion, looping over observed variables at most. I am reading the developer guide and trying to see how that would be done.

Member Author

I have given this a try now, and I am not understanding what is happening, nor can I find the relevant docs between the pymc3 docstrings and Aesara. Say I have this model:

with pm.Model() as model:
    m = pm.Normal("m", 0.0, 1.0)
    a = pm.Normal("a", mu=m, sigma=1, observed=0.0)

I then understood that I have to use logpt on a to get its pointwise log likelihood: log_lik = pm.distributions.logpt(a), and the observed values should be pulled from a. I then tried to compile a function to calculate log_lik (for now with a manual scalar input for m, but ideally with a numpy array containing the whole trace). However, I can't seem to do that:

aesara.function([], log_lik)

errors out with:

...
    raise MissingInputError(error_msg, variable=var)
aesara.graph.fg.MissingInputError: Input 0 (m) of the graph (indices start from 0), used to compute InplaceDimShuffle{x}(m), was not provided and not given a value. Use the Aesara flag exception_verbosity='high', for more information on this error.

Backtrace when that variable is created:

  File "<stdin>", line 2, in <module>
  File "/home/oriol/miniconda3/envs/arviz/lib/python3.8/site-packages/pymc3/distributions/distribution.py", line 96, in __new__
    rv_out = cls.dist(*args, rng=rng, **kwargs)
  File "/home/oriol/miniconda3/envs/arviz/lib/python3.8/site-packages/pymc3/distributions/continuous.py", line 477, in dist
    return super().dist([mu, sigma], **kwargs)
  File "/home/oriol/miniconda3/envs/arviz/lib/python3.8/site-packages/pymc3/distributions/distribution.py", line 105, in dist
    rv_var = cls.rv_op(*dist_params, **kwargs)

and

aesara.function([m], log_lik)

errors with:

...
    raise UnusedInputError(msg % (inputs.index(i), i.variable, err_msg))
aesara.compile.function.types.UnusedInputError: aesara.function was asked to create a function computing outputs given certain inputs, but the provided input variable at index 0 is not part of the computational graph needed to compute the outputs: m.
To make this error into a warning, you can pass the parameter on_unused_input='warn' to aesara.function. To disable it completely, use on_unused_input='ignore'.

Contributor

Looks like you need to use m.tag.value_var.
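For completeness, a sketch of that fix on the example above (same model, assuming the logpt / value_var API discussed in this thread):

import aesara
import pymc3 as pm

with pm.Model() as model:
    m = pm.Normal("m", 0.0, 1.0)
    a = pm.Normal("a", mu=m, sigma=1, observed=0.0)

log_lik = pm.distributions.logpt(a)
# the compiled function takes the value variable corresponding to m, not m itself
log_lik_fn = aesara.function([m.tag.value_var], log_lik)
print(log_lik_fn(0.5))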

@brandonwillard (Contributor) commented Mar 13, 2021

It's rather confusing, because the display names for these two variables (i.e. m and m.tag.value_var) are exactly the same. I thought about changing the name of the latter to something like "{rv_name}_value" or "{rv_name}_lik", but I couldn't decide.

Anyway, it's very important to understand that there are now two distinct, but complementary "variables" for each random variable in a model: TensorVariables that represent sample-able random variables, and TensorVariables that represent concrete values of those random variables. This idea is directly related to the observed parameter, because the values passed as observed are essentially a concrete instance of the latter type of variable.

More specifically, when one (sloppily) writes P(Y = y) in "math"-like notation, Y is the random variable and y is an "instance" of Y, but they are two distinct things with different properties/constraints. The latter term (i.e. the "value" variable), y, is the input argument used in density functions, e.g. f(y) = exp(- y**2 * ...) * ...; this is why the "value" variable is the only one used in the log-likelihood graphs.

Really, what I'm pointing out here is that random variables are actually (measurable) functions, both mathematically and, now, in v4. Technically, in v4, random variables are Aesara graphs, but, since Aesara graphs can represent—and be converted into—functions, it's the same idea.

N.B.: These "value" variables are conveniently stored within their corresponding random variable's tag (i.e. a Variable's arbitrary "storage" space) under the name value_var.

@brandonwillard (Contributor) left a comment

Looks like you might've accidentally merge-committed; you'll definitely need to rebase instead.

for obs in self.model.observed_RVs:
    if hasattr(obs.tag, "observations"):
        aux_obs = obs.tag.observations
        observations[obs.name] = aux_obs.data if hasattr(aux_obs, "data") else aux_obs

Member Author

Is data the right field? Or should it be value? Or try both?

Contributor

That will work for Constant objects, but not for shared variables, which will require aux_obs.get_value().

There's also the case where obs.tag.observations is a partially observed variable (i.e. has missing data). The result will be an AdvancedIncSubtensor1 in that case.

Here's a helper function for all those situations:

import numpy as np

from aesara.graph.basic import Constant
from aesara.tensor.sharedvar import SharedVariable
from aesara.tensor.subtensor import AdvancedIncSubtensor1


def extract_data(x):
    if isinstance(x, Constant):
        return x.data
    if isinstance(x, SharedVariable):
        return x.get_value()
    if x.owner and isinstance(x.owner.op, AdvancedIncSubtensor1):
        array_data = extract_data(x.owner.inputs[0])
        mask_idx = extract_data(x.owner.inputs[2])
        mask = np.zeros_like(array_data)
        mask[mask_idx] = 1
        return np.ma.MaskedArray(array_data, mask)

    raise TypeError(f"Data cannot be extracted from {x}")
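(As a hypothetical usage example, the loop quoted earlier could then become something like this:)

observations = {}
for obs in self.model.observed_RVs:
    if hasattr(obs.tag, "observations"):
        observations[obs.name] = extract_data(obs.tag.observations)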

def log_likelihood_to_xarray(self):
    """Extract log likelihood and log_p data from PyMC3 trace."""
    # TODO: add pointwise log likelihood extraction to the converter
    return None

Member Author

I have disabled the log likelihood conversion for now, so we can already start working on the tests that depend on this converter. The ones I am adding from ArviZ do test specifically for log likelihood conversion, but I don't think the pymc3 tests that use sample will. I hope this will already work.

Contributor

Sounds good!

@brandonwillard (Contributor)

I'm about to push a rebased version of this branch. Once that's done, you can fetch and pull this branch. If you don't have any unpushed local changes, you can git rebase --skip any conflicts to get a clean local copy of this rebased branch.

@OriolAbril (Member, Author)

Thanks! I have no unpushed local changes. I also realized that the code I'm using runs only on ArviZ master, so tests won't pass until we make the next release. We don't have a very clear timeline, but it should happen soon.

@brandonwillard (Contributor)

> I also realized that the code I'm using runs only on ArviZ master, so tests won't pass until we make the next release.

I'm updating all that now. I'll push those changes shortly.

@brandonwillard force-pushed the to_inference_data branch 2 times, most recently from 816a3fa to 9622401, on March 26, 2021 01:10
library=pymc3,
coords=self.coords,
dims=self.dims,
# default_dims=[],

Member Author

I don't think this will work yet: dict_to_dataset will try to add the chain and draw dimensions and fail. The old version (old meaning the current release) repeated all the dict_to_dataset code here so as not to assume chain and draw were present.

Contributor

dims=self.dims won't work?

Member Author

No, the problem is commenting out default_dims=[], because that reverts to the default behaviour of assuming chain and draw are present, and here they are not.

Contributor

test_idata_conversion.py passes locally with that commented out, but I had to comment out dims as well.

@brandonwillard (Contributor) left a comment

It looks like I've gotten most/all of the major things converted, so I'll stop making edits for now.

Otherwise, @OriolAbril, I don't know what the deal is with dims, so you'll have to look at that one.

@OriolAbril (Member, Author)

> I don't know what the deal is with dims, so you'll have to look at that one.

I'll try to make a workaround so we can merge asap. To properly fix that we need an ArviZ release to break all the entanglement.

dict_to_dataset was an internal function that I extended and added to the docs so that other libraries can easily write their own converters. It is now public and documented, so pymc3 doesn't need to do any xarray black magic to write the converter, but it has not been released yet. The old dict_to_dataset can be imported, but it is very limited and there is no way to make it work for the observed or constant data groups.
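For illustration, the distinction looks roughly like this with the dict_to_dataset on ArviZ master (the import path is assumed, and default_dims is not in the released version, so treat this as a sketch):

import numpy as np
from arviz.data.base import dict_to_dataset

# posterior-style group: chain and draw dimensions are added by default
posterior = dict_to_dataset({"m": np.random.randn(4, 100)})

# constant-data-style group: no chain/draw dimensions, so the defaults are disabled
constant_data = dict_to_dataset({"x": np.arange(10)}, default_dims=[])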

@brandonwillard previously approved these changes on Mar 26, 2021