
pm.Model(coords=coords) should preserve coordinate type if it is a pandas Index #5994

Closed
aseyboldt opened this issue Jul 21, 2022 · 7 comments

Comments

@aseyboldt
Member

aseyboldt commented Jul 21, 2022

We currently convert all coordinates to tuples, but this removes valuable information if the coordinates already are a pandas Index.

I think we should only convert the input values if they are not already a pandas Index, and in that case wrap them in a pd.Index instead of a tuple. A pd.Index is also immutable, but it preserves the meaning of the coordinate.

import pymc as pm
import pandas as pd

coords = {
    "dim": pd.date_range(start="2022-01-01", end="2022-02-01")
}

with pm.Model(coords=coords) as model:
    pm.Normal("x")

type(model.coords["dim"])  # -> tuple
# We lost the valuable information from the original pandas `DatetimeIndex`.
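
A minimal sketch of the proposed conversion (the helper name and placement are mine, not actual PyMC code):

import pandas as pd

def coerce_coord_values(values):
    # Hypothetical helper: keep an existing pd.Index as-is, otherwise wrap
    # the values in one. Like a tuple this is immutable, but it preserves
    # the dtype and index-specific metadata (e.g. a DatetimeIndex frequency).
    if isinstance(values, pd.Index):
        return values
    return pd.Index(values)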

Relevant PR that changed this: #5061

@michaelosthege
Member

michaelosthege commented Jul 21, 2022

I'm not convinced. What valuable information exactly is lost? Can you give an example?

Also, with the current API (tuples) one can do pmodel.coords["city"].index("London") which is incredibly useful when building long-form models.
A pd.Index doesn't have .index(), so using pandas indices would be a major breaking change.

In addition, remember that the coords become coords in the InferenceData/xarray, so any "useful information" that a pd.Index could hold would get lost at that later point.

If you can give examples of useful information that should be captured for a coord, we should consider adding it here:

https://github.com/michaelosthege/mcbackend/v0.1.1/2706f8570e395bdcc006111911f6716d73ecc5ee/protobufs/meta.proto#L40-L46

@aseyboldt
Member Author

Fair warning up front: I feel strongly about this. I hope this doesn't get too preachy, and if it does, I'm sorry ;-)

Just to get the model.coords["something"].index("London") out of the way: you can do that with an index just as well, but safer, and with more functionality:

[Screenshot: a code example doing the same lookup with a pd.Index instead of tuple(...).index()]
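
Roughly what the screenshot shows (my reconstruction; the city names are illustrative):

import pandas as pd

cities = pd.Index(["London", "Paris", "Berlin"])
cities.get_loc("London")                 # -> 0, same as tuple(cities).index("London")
cities.get_indexer(["Paris", "Berlin"])  # -> array([1, 2]), vectorized lookup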

About the lost info:
A pandas index (or xarray index; for the most part xarray just reuses pandas) is more than the set of its values. A few examples:

  • The simplest example is probably pd.RangeIndex, which behaves a lot like the sequence of values it represents, but it is smart enough to not actually store them. Its get_loc also just computes the correct position without any searching.
  • CategoricalIndex takes advantage of the fact that often the same string in an index appears more than once. So it keeps a list of all strings that could appear (in the categorical dtype) and stores just integers for the index itself. This way it speeds up a lot of operations and saves memory. It also means that it can handle the case where a category contains values that do not appear in the index at all. Say we have four cities in our complete dataset, but the subset we are using in a model happens to only contain three of those. This information would be captured in the CategoricalIndex (if used correctly at least). (By the way, pd.factorize is really useful when working with those: we get the Index and the integer indices into that index for use in the model at the same time; see the sketch after this list.)
  • All the time-related indexes in pandas: DatetimeIndex, PeriodIndex, TimedeltaIndex. They all represent time information, but store it efficiently and also carry a lot of context: are we talking about a time point or a duration, do all days count or only work days, which timezone, etc. This info is also used during plotting.
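
A small sketch of the CategoricalIndex / pd.factorize point (the city names are made up):

import pandas as pd

# The full dataset knows about four cities, but the subset used in the
# model only contains three of them.
observed = pd.Categorical(
    ["London", "Paris", "London", "Berlin"],
    categories=["Berlin", "Hamburg", "London", "Paris"],
)

# factorize gives the integer codes for use inside the model and the
# corresponding categorical values in one step.
city_idx, city_cats = pd.factorize(observed)
city_idx              # integer codes pointing into city_cats
city_cats.categories  # still includes "Hamburg", even though it never appears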

I guess the reason why I feel so strongly is that I usually put quite a bit of thought into what kind of Index I use for each dimension. If all of that is then just turned into a tuple that knows nothing about what it represents, that effort is wasted.

@michaelosthege
Member

Sorry for the delayed response, I was pretty busy this week...

No doubt that pd.Index is really useful for people who know the tricks, but I'm worried that it could create some really nasty headaches in other places.

First of all, I'm investing a lot of thought into standardizing the metainformation around models/MCMCs in a way that is PPL and language-agnostic. This has many benefits including things like live-streaming traces, RVs with varying shapes and much more. I'd be happy to elaborate in a call or something.
But this also means that we should think about data structures independently of pandas or xarray.
I'm sure that's possible for pd.Index too, and IMO we should start there.
After all, if a pd.Index really holds so much more information than a tuple of its values, we'd have to store a bunch of additional fields.

Is the xarray.Index a feature subset/superset of the pd.Index?

The current implementation just does values = tuple(values), which should preserve the Datetime/Period/Timedelta values and so on.
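
For example (a quick illustration of what tuple() keeps, and what only lives on the index):

import pandas as pd

dates = pd.date_range("2022-01-01", "2022-02-01", tz="UTC")
as_tuple = tuple(dates)

type(as_tuple[0])  # pd.Timestamp -> element-level info like the timezone survives
dates.freq         # <Day> -> the frequency lives on the index and has no tuple equivalent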
Memory efficiency of coords at runtime is super irrelevant compared to all the rest.

So it keeps a list of all strings that could appear (in the categorical dtype) and stores just integers for the index itself.
[...]
It also means that it can handle the case where a category contains values that do not appear in the index at all. Say we have four cities in our complete dataset, but the subset we are using in a model happens to only contain three of those.

What does this mean for the dims in a PyMC model? What if it's a MultiIndex?
Are xarray and ArviZ compatible with masked-out indexes?

And most people don't know about this feature, which makes it super confusing that pd.Index.get_level_values() can sometimes return a superset of set(index.values).

@aseyboldt
Member Author

Ok, I guess I'm now starting to understand why you prefer to just have tuples, and I can see that in the context of mcbackend that might make a few things easier.
I don't think it is a good idea, though, to limit (or remove; this broke a bunch of my old models that I was porting to pymc4) functionality because that is more convenient for an external project.

It seems to me we already have something like a standard for traces, namely arviz. It sure isn't perfect (e.g. I'd love to have hierarchical variables, for which I guess we'd first need something like pydata/xarray#4118), but overall I'm pretty happy with it. If you think you can improve on it and write a different one, that's great, but it'll take a lot of great features to make me use something based on protocol buffers when I can have xarray...

But this also means that we should think about data structures independently of pandas or xarray.

Again, I don't think I want to go from pandas and xarray to protocol buffers. Pandas and xarray sure have their flaws, but for pretty much everything I might want to do (except maybe streaming traces) they seem like a much better fit. And to be honest, I don't care that much about streaming traces.

@aseyboldt
Member Author

Hm, maybe that was a bit harsh...

I actually think mcbackend looks pretty nice; it's just that I'm not so sure it should be the default, and I don't want to get worse interoperability with pandas and xarray because of it.

If you think live preview of the trace while it is sampling is important, I think we can find a solution for that which is much less involved. I'll play around with nutpie in that regard a little; I think the infrastructure there should make it pretty easy to experiment.

If we want to send the trace to a remote machine, then using protobuf to send the updates might be a good solution, but I don't think this needs to be the default. For that use case we could also just pickle the index object if that makes it easier.
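
A minimal illustration of that pickling fallback (nothing mcbackend-specific here):

import pickle
import pandas as pd

idx = pd.date_range("2022-01-01", "2022-02-01")
payload = pickle.dumps(idx)       # bytes that could be sent to the remote machine
restored = pickle.loads(payload)  # round-trips as a DatetimeIndex, freq and all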

@michaelosthege
Member

If you think live preview of the trace while it is sampling is important, I think we can find a solution for that which is much less involved.

This is already possible even without changes to PyMC, but we can reduce complexity in pm.sampling a lot by replacing BaseTrace with Backend/Run/Chain.

McBackend is essentially a cleaned-up refactoring of pm.backends.base.BaseTrace:

mcbackend.Backend is the equivalent of the weakly typed thing that you can pass to pm.sample(trace=...)

mcbackend.Backend.init_run() is the equivalent of BaseTrace.setup(), and mcbackend.Run.create_chain() is the equivalent of calling _choose_backend(), but with a cleaned-up execution order:
In PyMC we call _init_trace() once per chain, which in the default case instantiates an NDArray, thereby compiling initial point functions in BaseTrace.__init__ once per chain 😬

In McBackend the order is:

  1. Create a Run with RunMeta information about the variables, shapes, dtypes, and coords in the model (the RunMeta class is autogenerated from the protobuf), and thereafter
  2. Run.create_chain() implementations can simply hand over the RunMeta information to the Chain object.

So for the initialization of the storage backend, McBackend takes the good things from pm.BaseTrace and cleans up the mess that we have in pm.sampling.*.

And then there's conversion to InferenceData: starting from the standardized Backend/Run/Chain interface, and most importantly because metadata about model variables is passed to RunMeta before starting the MCMC, this is pretty straightforward.
Meanwhile, PyMC has >1000 lines of code just for InferenceData conversion and corresponding tests.
And I'm not even counting the PPL-specific converters in the ArviZ codebase.

@michaelosthege
Member

Oh and just to be clear, I think our intentions are well aligned:

  • Making working with dims, coords more capable & robust
  • Promoting good Bayesian workflow practices (with and without PyMC)
  • Making our shared implementations more capable such that people with special needs (slow models, memory-intensive draws, chain restarting...) write fewer custom solutions

@pymc-devs pymc-devs locked and limited conversation to collaborators Aug 30, 2022
@ricardoV94 ricardoV94 converted this issue into discussion #6081 Aug 30, 2022
