Adjusting the iarange dim allocation scheme #1115

Closed
neerajprad opened this issue May 2, 2018 · 10 comments

@neerajprad
Member

neerajprad commented May 2, 2018

Currently, our iarange dimension allocator assigns dimensions to iaranges starting from the right - i.e. the outermost iarange is assigned dim -1, the next one dim -2, and so on. Consider the following code:

with pyro.iarange("outer", 5):
    outer_sample = pyro.sample("outer_sample", dist.Categorical(torch.ones(5, 4) / 4))
    assert outer_sample.shape == (5,)
    with pyro.iarange("inner", 3):
        inner_sample = pyro.sample("inner_sample", dist.Categorical(torch.ones(3, 5, 4) / 4))
        # Note how the shape is inverted from the order in which the `iarange`s appear.
        assert inner_sample.shape == (3, 5)  

After discussing with @fritzo, @eb8680 and @martinjankowiak, I would like to propose the following change to the iarange dimension allocation: when the user specifies max_iarange_nesting, we allocate dims starting from -max_iarange_nesting and moving to the right, i.e. iarange dims follow their order of appearance in the model and guide. Notice how the code sample above would look under this scheme:

# max_iarange_nesting = 2
with pyro.iarange("outer", 5):
    # Note the additional 1 in the batch_shape below, which corresponds to the
    # additional batch dim specified by `max_iarange_nesting`.
    # With implicit broadcasting, this won't be needed.
    outer_sample = pyro.sample("outer_sample", dist.Categorical(torch.ones(5, 1, 4) / 4))
    assert outer_sample.shape == (5, 1)
    with pyro.iarange("inner", 3):
        inner_sample = pyro.sample("inner_sample", dist.Categorical(torch.ones(5, 3, 4) / 4))
        assert inner_sample.shape == (5, 3)

What we would finally like to have with implicit broadcasting (no need to adjust the batch sizes manually):

# max_iarange_nesting = 2
with pyro.iarange("outer", 5):
    outer_sample = pyro.sample("outer_sample", dist.Categorical(torch.ones(4) / 4))  
    assert outer_sample.shape == (5, 1)
    with pyro.iarange("inner", 3):
        inner_sample = pyro.sample("inner_sample", dist.Categorical(torch.ones(4) / 4))
        assert inner_sample.shape == (5, 3)

What does this give us?

  • Intuitive shape scoping. e.g. in the above case, if the outer dim were used for parallelizing over num_particles, the outermost iarange dimension would also be the outermost dimension of the sample shape, with shape[0] = num_particles.

  • More importantly, I think this will make it much easier to implement implicit broadcasting (using the iarange size param to automatically broadcast sample sites rather than relying on the user to .expand_by), as discussed in #791 (Parallelize ELBO computation over num_particles). For instance, in the second example, all this involves is looking at the cond_indep_stack and doing .expand_by(iarange_sizes + (1,) * (max_iarange_nesting - len(cond_indep_stack))) (only for the topmost nodes in the graph); see the sketch after this list. Doing this with our current dim allocation scheme would involve replicating this logic by hand through the dim arg to iarange to make the batch shapes align correctly, which rather defeats the point of doing implicit broadcasting.

  • Note that this will be more general than the particular issue of parallelizing over num_particles. e.g. we could have the following structure where the outermost dimension corresponds to running parallel chains (for HMC) or parallel evolution candidates, and the inner one corresponds to the Monte Carlo estimator for ELBO.

    with iarange("num_chains", 4):
        with iarange("num_particles", 100):
            ...
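
To make the expand_by bookkeeping in the second bullet concrete, here is a minimal shape-arithmetic sketch; implicit_batch_shape and its arguments are hypothetical, only illustrating the padding logic:

def implicit_batch_shape(iarange_sizes, max_iarange_nesting):
    # Enclosing iarange sizes (outermost first), right-padded with singleton
    # dims up to max_iarange_nesting.
    return tuple(iarange_sizes) + (1,) * (max_iarange_nesting - len(iarange_sizes))

assert implicit_batch_shape([5], 2) == (5, 1)     # outer site in the example above
assert implicit_batch_shape([5, 3], 2) == (5, 3)  # inner site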

Cons:

  • The biggest one is that users have to unsqueeze sample sites to accommodate max_iarange_nesting (as also mentioned in the design doc). (EDIT: This will not be an issue in practice, as even a manual .expand_by without .unsqueeze will be allowed with implicit broadcasting.) I believe that this will be completely obviated by implicit broadcasting, and it is not much of a cognitive burden when the user is aware of max_iarange_nesting anyway.

Let me know your thoughts, and if I may have overlooked something.

Related: #791

@martinjankowiak
Collaborator

questions:

  • does implicit broadcasting automatically play nicely with any deterministic computations in between sample statements (assuming said computations broadcast nicely) or would the user need additional (un)squeezes?
  • how would this support/complicate implementing something like #671 (Implement IWELBO), where we would like to automagically add two sample dimensions that are treated differently when constructing the ELBO?

@neerajprad
Member Author

does implicit broadcasting automatically play nicely with any deterministic computations in between sample statements (assuming said computations broadcast nicely) or would the user need additional (un)squeezes?

Since the batch shape is completely determined by iarange and hence by max_iarange_nesting, I cannot think of a situation where the user would have to manually .unsqueeze or .expand_by.
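
For concreteness, a plain-torch sketch (no Pyro machinery) of why an intervening deterministic computation broadcasts without extra (un)squeezes under this scheme:

import torch

# Shapes under from-the-left allocation with max_iarange_nesting = 2:
outer_sample = torch.zeros(5, 1)   # site inside iarange("outer", 5)
loc = outer_sample * 2.0 + 1.0     # deterministic transform keeps shape (5, 1)
inner_sample = torch.zeros(5, 3)   # site inside the nested iarange("inner", 3)

# The trailing singleton dim lines up under standard broadcasting rules.
assert (loc + inner_sample).shape == (5, 3)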

how would this support/complicate implementing something like #671?

I don't have full context on this, but you can always wrap the model internals inside an iarange of size 2, and treat the sample sites appropriately when you are computing the ELBO; they can be accessed as sample[0] and sample[1], which seems simpler? Maybe @jpchen can comment on this.
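
A rough sketch of that wrapping (site names are hypothetical and not taken from #671):

import torch
import pyro
import pyro.distributions as dist

def model():
    # An outer iarange of size 2 gives every sample site two independent
    # "copies" along one batch dimension.
    with pyro.iarange("iw_particles", 2):
        z = pyro.sample("z", dist.Normal(torch.zeros(2), torch.ones(2)))
    # The two copies can then be treated differently when assembling the
    # objective, e.g. as z[0] and z[1].
    return z[0], z[1]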

@fritzo
Member

fritzo commented May 2, 2018

How does this affect model composability? For example, suppose I compose a model:

def temporal_model(times):
    with pyro.iarange("time"):
        ...

def spatial_model(pixels):
    with pyro.iarange("x", 320):
        with pyro.iarange("y", 200):
            ...

def model(videos):
    with pyro.iarange("batch", len(videos), dim=-3):
        temporal_noise = temporal_model(
            torch.arange(len(videos[0])).expand(len(videos), -1))
        spatial_noise = spatial_model(videos)
        ...

Here temporal_model and spatial_model are "leaf" models that should act on the rightmost dimensions (the rightmost one and the rightmost two dimensions, respectively). That way I can use them independently without any batching.

Now I've built a more complex model using these components and batching. Since I know that there are at most two iaranges in my submodels, I can set iarange("batch", ..., dim=-3) to avoid those two dims. Note that the component models do not need to specify dim. The assumption in Pyro 0.2 is that model components will have more stable dimensions than model aggregates, and hence we've designed things so that model aggregates incur more bookkeeping overhead (needing to specify max headroom via dim=-3 here).

Would the proposed change preserve this assumption that model components are more stable than model aggregates, and that most bookkeeping should be at the aggregate level? Are any extra utilities or patterns needed to make it easy to compose models?

@neerajprad
Member Author

neerajprad commented May 3, 2018

Since this is only exercised at the time of inference, I don't think it should need any change to the component models. With max_iarange_nesting=3, spatial_noise will have shape (len(videos), 320, 200).

Also assume that the user decides not to rely on implicit broadcasting and manually calls .expand_by([320, 200]) in spatial_model. Then when we are inside the aggregate model, we'll check that the fn.batch_shape of the sample returned by the component model is [320, 200], and simply tack on an additional "batch" dim on the left to make it [len(videos), 320, 200]. If max_iarange_nesting=4, then the sample would have shape [len(videos), 320, 200, 1].
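
A small sketch of that shape bookkeeping (plain Python; num_videos is a hypothetical stand-in for len(videos)):

max_iarange_nesting = 4
component_shape = (320, 200)   # batch shape produced inside spatial_model
num_videos = 8                 # stand-in for len(videos)

# Prepend the aggregate "batch" dim, then right-pad with singleton dims
# up to max_iarange_nesting.
target_shape = (num_videos,) + component_shape
target_shape = target_shape + (1,) * (max_iarange_nesting - len(target_shape))
assert target_shape == (8, 320, 200, 1)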

@neerajprad
Member Author

Specifically for fritz's example, it's not apparent to me how to get the batch dimensions for broadcasting without it being explicitly specified (-3).

We currently need to specify this but we can just use max_iarange_nesting (when available) and order them from the left. We shouldn't need any more information than that.

@fritzo
Member

fritzo commented May 3, 2018

I don't think it should need any change to the component models

It is currently nice that the iarange dims inside components do not depend on max_iarange_nesting. This allows us to reuse those components in different places without clumsy .reshape()ing based on inspecting the iarange.dim attribute. It is currently gross that iarange dims inside components do depend on iaranges in wrapping code. Ideally I could write model components that intelligently allocate fixed rightmost dims without needing to specify those dims. To achieve this under from-the-left allocation, we'd at least need an extra iarange to pad the temporal_model:

def model(videos):
    with pyro.iarange("batch", len(videos)):  # <--- automatically allocated
        with pyro.iarange("padding", 1):      # <--- manually padded
            temporal_noise = temporal_model(
                torch.arange(len(videos[0])).expand(len(videos), -1))
        spatial_noise = spatial_model(videos)
        ...

Aside: This discussion seems better suited for a design doc than a GitHub issue.

@neerajprad
Member Author

This discussion seems better suited for a design doc than a GitHub issue.

Good idea, I will move this to a design doc, but I want to keep this issue open for some time because I am finding it useful for understanding the issues raised and gathering some problematic examples that would need to be addressed.

Will sync offline on the issue of composability that you mentioned above.

@eb8680
Member

eb8680 commented May 3, 2018

Agreed that this issue is metastasizing into a design doc.

At any rate, I'm pretty sure I'm still misunderstanding everything, but here are a couple thoughts I had after a discussion with @neerajprad:

First, in this proposal iarange is doing double duty as an annotation and a modifier. I'd prefer to have implicit broadcasting happen via a separate annotation, so that the default semantics of Pyro remains vanilla Python/PyTorch.

Second, here's a sketch of an alternative allocation strategy that favors composition and preserves the current indexing scheme in user-land (again, modulo ellipses to the left of every user-land indexing operation):

Consider a slightly-crazy scenario in which we'd want to compose a bunch of different effects that exploit broadcasting in a model that itself has iaranges nested in components. For example, suppose we wanted to use an ES optimizer to optimize a multi-sample importance-weighted elbo with some discrete variables enumerated in @fritzo's model above.

We could give each effect its own identical, independent sandbox: allocate blocks of dimensions to each broadcasting effect, starting with user-specified iaranges in the right-most block and adding new blocks to the left of existing ones. iarange's job would simply be to unsqueeze/expand_by a site to have block_max_iarange_nesting - len(site.batch_shape) extra singleton dimensions, like in @neerajprad's original proposal above but without the non-singleton outermost dimensions.

The order of dimension allocation can then be forward or reverse (or manual, via iarange(..., dim=...)) independently in each block; for example, we might want to preserve the current ordering in the user block, but allocate from left to right in the leftmost block (the ES block in this scenario) to facilitate reduction or ellipsis-based slicing along the outermost dimension.

We'd need an extra construction to delimit blocks of effect dimensions and ensure that iarange blocks are applied in the proper order, just like we currently group all enumeration dimensions to the left of all iarange dimensions. One potential solution is to use something like ContinuationMessenger (for lack of a better term) from #950 and the factor graph design doc:

  • each instance of this Messenger should have its own internal _DimAllocator and broadcasting effect (realized as, say, a callable or method applied at each site, like expand_by-ing only top-level sites or enumerating the support of each discrete site),
  • each should capture and be responsible for entering/exiting/applying any iaranges used between it and the next ContinuationMessenger entering the handler stack,
  • and each should apply an offset to any internal iarange dims that is its block depth * max_iarange_nesting.

This explicit delimiting construction should allow each broadcasting effect in the hypothetical scenario to be handled as cleanly and separately as in @neerajprad's original comment, as if there were no others, and by not delimiting in model and using the reverse allocation order, we can preserve the existing broadcasting/iarange semantics of model.

One potential problem I see with this proposal is incorporating dependent enumeration (#915) which in principle could require a dimension block of unbounded length. Two hacky solutions are to allocate a large but fixed number of enumeration dimensions (say, 1000 instead of the usual 32 or so), or require enumeration to be below all other ContinuationMessengers in the global handler stack in order to maintain the status quo of always having all enumeration dimensions be left of all others. There may be a similar issue with non-tree-structured iarange configurations. A less hacky solution might involve compressing the block into a single dimension, storing and updating a block shape internally, applying a reshape using the internal shape, applying the enumeration effect, and recompressing the block at each sample site.
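
A shape-arithmetic sketch of the block-offset bookkeeping described above (global_dim and its arguments are hypothetical, not an actual Pyro implementation):

def global_dim(dim_within_block, block_depth, max_iarange_nesting):
    # Each broadcasting effect owns a block of max_iarange_nesting dims; a dim
    # allocated inside a block is shifted left by block_depth * max_iarange_nesting.
    assert -max_iarange_nesting <= dim_within_block <= -1
    return dim_within_block - block_depth * max_iarange_nesting

# With max_iarange_nesting = 2:
assert global_dim(-1, block_depth=0, max_iarange_nesting=2) == -1   # user block
assert global_dim(-1, block_depth=1, max_iarange_nesting=2) == -3   # e.g. ELBO-particle block
assert global_dim(-2, block_depth=2, max_iarange_nesting=2) == -6   # e.g. ES block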

@neerajprad
Member Author

After some discussions with @fritzo, I have come to the conclusion that the best way forward is to implement automatic distribution broadcasting (#1119), so that users do not need to manually expand distribution instances inside iarange. For many cases, we will need to manually allocate the dim for the iarange via the dim arg (e.g. for chains, num_particles, or data batching), and I am convinced that there is no easy way to automate that while also giving users the flexibility to locate their iarange dim manually. The dim allocation scheme plays a small role in that on-the-left allocation aligns well with the external data that we might see in the observe statements, but we could alternatively permute the data to align with on-the-right allocation as well.

I believe that #1119 will address the bulk of the inconvenience associated with using nested iaranges, and introducing this additional complexity of dim ordering doesn't seem worth it at this point. There are some great points here, which we could revisit in a design doc if we are not happy with the state of iarange broadcasting later.

@fritzo
Member

fritzo commented May 4, 2018

Just to clarify: #1119 is completely manual. It addresses only the lowest-level issue. I think a good plan is to

  1. start with manual iarange(..., dim=-n) annotations everywhere;
  2. implement very limited auto-broadcasting to increase the sizes of individual batch_shape dimensions via TorchDistribution.expand() (#1119; see the usage sketch below);
  3. implement parallel num_particles (#791);
  4. reopen the interesting discussion/design doc about automatic dim allocation.
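
A rough usage sketch for step 2, assuming TorchDistribution.expand() takes a target batch_shape as proposed in #1119:

import torch
import pyro.distributions as dist

d = dist.Normal(torch.tensor(0.), torch.tensor(1.))   # batch_shape == ()
d5 = d.expand(torch.Size([5]))                         # batch_shape == (5,)
assert d5.batch_shape == torch.Size([5])
assert d5.sample().shape == (5,)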
