Can't mix offset/multiplier with upper/lower bounds #659

Open
andrjohns opened this issue Jul 31, 2020 · 24 comments
Labels: feature (New feature or request)

@andrjohns (Contributor)

The following code:

data { 
  int<lower=0> N; 
  int<lower=0,upper=1> y[N];
} 
parameters {
  real<multiplier=1,lower=0,upper=1> theta;
} 
model {
  theta ~ beta(1,1);  // uniform prior on interval 0,1
  y ~ bernoulli(theta);
}

Gives the error:

   -------------------------------------------------
     4:  } 
     5:  parameters {
     6:    real<multiplier=1,offset=0,lower=0,upper=1> theta;
                                     ^
     7:  } 
     8:  model {
   -------------------------------------------------

Expected '>' after multiplier expression.

Should this be legal syntax?
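(Note: multiplier = 1 is the identity under any of the compositions discussed below, so for this particular model the currently legal declaration real<lower=0, upper=1> theta; gives the same posterior.)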

@nhuurre (Collaborator)

nhuurre commented Jul 31, 2020 via email

@bob-carpenter (Contributor)

We definitely need to figure out how to compose transformations (what TensorFlow Probability calls "bijectors"). I'd prefer to have the transformations separated with:

real<multiplier = m, offset = o><lower = l, upper = u> theta;

being read left to right as:

  • real: start with unconstrained real
  • real<multiplier = m, offset = o>: start with unconstrained, multiply by m then add o
  • real<multiplier = m, offset = o><lower = l, upper = u>: start with unconstrained, multiply by m then add o, then finally apply the inverse logit transform designated by <lower = l, upper = u>.

That way, the subsequences of the type expression denote reasonable types, and each additional transform gets added to the end.

@nhuurre --- any idea how hard this would be to do? It'd be really useful for things like non-centered lognormal parameterizations.

> IIRC correlation matrix unconstraining functions are not autodiffable so corr_matrix<offset=x> might not work.

It's the constraining functions that need to be autodiffable, and they are here (it's an inverse logit plus a stick-breaking process to get each unit-length row with a positive element on the diagonal; the details are in the reference manual). So we could compose an offset and a correlation-matrix constraint, but it'd only make sense to do the offset/multiplier first, as otherwise the result won't be a correlation matrix. But I don't see an application for it.

We could stack offset/multiplier for correlation matrices, but I don't think it'd gain us anything. I don't think we could easily stack lower/upper bounds.
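
To make the left-to-right reading concrete, here is roughly what real<multiplier = m, offset = o><lower = l, upper = u> would have to compute, written out by hand in current Stan. This is only a sketch: m, o, l, and u stand in for declared data, and a real declaration of this form would add the Jacobian term implicitly.

parameters {
  real theta_raw;  // unconstrained
}
transformed parameters {
  real z = o + m * theta_raw;  // <multiplier = m, offset = o>
  real<lower = l, upper = u> theta = l + (u - l) * inv_logit(z);  // <lower = l, upper = u>
}
model {
  // log absolute Jacobian of the composed transform
  target += log(m) + log(u - l) + log_inv_logit(z) + log1m_inv_logit(z);
}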

@bob-carpenter (Contributor)

Sorry, forgot to answer @andrjohns's question. No, it should not be legal syntax now, but we want something like it in the future.

@avehtari (Contributor)

I just had a model for which optimizing failed because the inits are far from sensible, but making the posterior closer to unit scale helps. I have these different length-scale and magnitude-scale parameters

  real<lower=0> rho;   // lengthscale of f1
  real<lower=0> alpha; // scale/magnitude of f1
  real<lower=0> sigma; // residual scale

and then the priors were

  rho ~ normal(0, 1000);
  alpha ~ normal(0, 500);
  sigma ~ normal(0, 500);

For me the most logical thing was to try

  real<lower=0,multiplier=1000> rho;   // lengthscale of f1
  real<lower=0,multiplier=500> alpha; // scale/magnitude of f1
  real<lower=0,multiplier=500> sigma; // residual scale

which, btw, gives a different error than when multiplier comes before lower. @bob-carpenter suggested

real<multiplier = m, offset = o><lower = l, upper = u>

to mean "start with unconstrained, multiply by m then add o, then finally apply the inverse logit transform", but can you confirm whether

real<lower = l, upper = u><multiplier = m, offset = o>

then applies the inverse logit transform first, and then multiplies the constrained value by m and adds o?

This combined feature would be useful for me as now to get optimizing to work for my model I need to scale the data to have unit scale to get the parameters also to unit scale.
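
(A workaround that is possible today: declare unit-scale raw parameters and rescale in transformed parameters. A sketch; because the scales are fixed constants, the Jacobian adjustment is constant and can be dropped.)

parameters {
  real<lower=0> rho_raw;    // unit-scale working parameters
  real<lower=0> alpha_raw;
  real<lower=0> sigma_raw;
}
transformed parameters {
  real rho = 1000 * rho_raw;     // lengthscale of f1
  real alpha = 500 * alpha_raw;  // scale/magnitude of f1
  real sigma = 500 * sigma_raw;  // residual scale
}
model {
  rho ~ normal(0, 1000);
  alpha ~ normal(0, 500);
  sigma ~ normal(0, 500);
}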

@bob-carpenter (Contributor)

bob-carpenter commented Dec 28, 2020

The natural way to set it up is from the inside out, thinking of each < ... > as applying an inverse transform.

real<lower = l, upper = u><multiplier = m, offset = o> y;

So we start with a type real and an unconstrained value x. Then we apply <lower = l, upper = u> to get a variable w of type real<lower = l, upper = u> with value

w = l + (u - l) * inv_logit(x)

Then we apply the final transform <multiplier = m, offset = o> to get

y = o + m * w

The resulting y will satisfy the constraint <lower = o + m * l, upper = o + m * u>, but there's nothing to do about that at compile time, so we don't have to worry about the algebra.

The other way around,

real<offset = o, multiplier = m><lower = l, upper = u> y;

will guarantee that y satisfies the constraint lower = l, upper = u.
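
Spelled out the same way (a sketch, with x again the unconstrained value):

z = o + m * x;                   // apply <offset = o, multiplier = m> first
y = l + (u - l) * inv_logit(z);  // then <lower = l, upper = u>, so the bounds hold exactly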

@avehtari (Contributor)

Could you use something other than u for the variable, so that u is not on both the left-hand and right-hand side?

u = l + (u - l) * inv_logit(x)

@bob-carpenter (Contributor)

> Could you use something other than u for the variable,

Oops. I edited the original comment.

@betanalpha

I was reading through #947, but I think the right place to comment is here. Apologies for the late addition. I'm hesitant about the claimed utility of all of this machinery given how few transformations actually compose and the relative inaccessibility of those compositions to end users.

Right now all of our transformations map from a constrained space (a one-dimensional subspace of R in the scalar case, or a lower-dimensional subspace of R^{N} in the multivariate case) to the same unconstrained space (all of R in the scalar case, or all of R^{D}, where D is the dimension of the subspace). This means that the only valid compositions would be maps from constrained spaces to constrained spaces, of which we currently have none, and maps from unconstrained spaces to unconstrained spaces, of which we have only the R -> R multiplier/offset.

This means that there's just one valid set of compositions (1D constrained -> 1D unconstrained, then 1D unconstrained -> 1D unconstrained) and, critically, the behavior of that second map depends on how the first map is implemented. Right now users see only the bounds that define the constrained spaces; the implementation of the constrained -> unconstrained maps is completely abstracted away, which means there are no guarantees in the language for what any prepended unconstrained -> unconstrained maps would actually do. In particular, if the constrained -> unconstrained maps were ever changed, a discussion which has happened in the past, then the effective action of any prepended maps would change!

I think this also highlights a problem with how multiplier/offset was implemented in the language. Before multiplier/offset, all of the <> annotations on types simply defined a constraint, and hence a constrained space of valid values, but they did not define how the algorithms ensure that the constraint is satisfied. Again, that was all abstracted behind the user-facing language. In other words, the <> annotations did not define a map.

The multiplier/offset <> annotation, however, behaves fundamentally differently. Instead of defining a space, it defines an explicit, user-facing transformation. Because these two annotations technically define different concepts, they don't actually compose naturally from the user's perspective. Again, one has to jump from the constraint to the constrained -> unconstrained map, which is abstracted from the user, before one can consider compositions with the transformation defined by the multiplier/offset <> annotation.

I doubt anyone will agree with me at this point, but personally I think this overloaded use of the <> annotation is too confusing to be useful, as demonstrated by the creation of this issue in the first place. For example, a more consistent usage would be to drop constraints entirely and instead require users to specify explicit constrained -> unconstrained maps. Then constrained -> unconstrained -> unconstrained maps could be defined without the need for a full compositional backend. To be clear, this is just an example, and not one that I think would be all that useful given the cognitive burden that understanding constrained -> unconstrained maps, instead of just constraints, would place on users. I personally would prefer to introduce additional transformations between the space defined by the parameters block and the space considered by the algorithms in a different way, for example via a new block or something.

@bob-carpenter (Contributor)

bob-carpenter commented Sep 21, 2021

@betanalpha: I think the main issue for the language is whether to go with something that looks like composable transforms, such as

real<offset=m, multiplier=s><lower = 0> a;

Or whether to just roll together the ones that work:

real<offset=m, multiplier=s, lower = 0> a;

The only other constrained types that are naturally scalable are ordered and positive_ordered.

I think Oryx, the probabilistic programming library that sits on top of JAX, does just what you're proposing in terms of specifying transforms more explicitly.

There's also the related issue of whether we let users define their own transforms somehow, but that's messy given the need for the transform, the inverse transform, and the inverse transform plus the log absolute Jacobian determinant. We'd ideally want to validate their inverse and its derivatives, as a general determinant calculation wouldn't be efficient.
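
For concreteness, here are the three pieces a user-defined transform would have to supply, illustrated with the existing lower-bound transform (a sketch in Stan's function syntax; the my_lb_* names are hypothetical, not an existing API):

functions {
  // constrained -> unconstrained
  real my_lb_free(real y, real l) {
    return log(y - l);
  }
  // unconstrained -> constrained
  real my_lb_constrain(real x, real l) {
    return l + exp(x);
  }
  // unconstrained -> constrained, incrementing the log Jacobian
  real my_lb_constrain_lp(real x, real l) {
    target += x;  // log |d/dx (l + exp(x))| = x
    return l + exp(x);
  }
}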

@betanalpha

My point is that <lower = 0> doesn't explicitly define a transformation! There are an infinite number of transformations from the positive real line to the full real line, and for <lower=0> to even implicitly define a transformation would require a global convention for which transformation is chosen. Right now a transformation is chosen implicitly behind the scenes so that users don't need to be aware of that convention, but if a declaration like real<offset=m, multiplier=s><lower = 0> a; is allowed, then that abstraction is broken and users would have to educate themselves on that convention. Implicit implementation assumptions that have user-facing effects are asking for confusion.

The problem holds for both real<offset=m, multiplier=s><lower = 0> a; and real<offset=m, multiplier=s, lower = 0> a;.

If one moved to explicit transformations, then one could write something like real<log><offset=m><multiply=s> a, where at least each annotation is the same kind of object. Again, I'm not advocating for this; I think it's a completely different feature than the original constraint annotations and, if added, it should have its own syntax. This is one of the reasons why I complained about the original addition of offset/multiplier, although I may not have been as clear about it.

@WardBrian (Member)

WardBrian commented Sep 21, 2021

I think @betanalpha brings up an interesting question that I had asked myself while working on the composable transforms, which is "why are offset and multiplier considered transforms and not <something else>?" Composition becomes a lot easier if there are transforms plus some other kind of "thing" that is not a transform. If instead of "every variable can have a transform" we had "every variable can have a transform and a Scale" (for lack of imagination), that would cover probably 90% of the use cases with 10% of the complexity, because the transform could stay the basic non-list type it is now.

I have seen a few people request other compositions, like a positive simplex, but I suspect what they were hoping for would be the intersection of those constraints (the variable is both a simplex and has all positive elements) rather than the composition (a vector that is stick-broken and then exp'd).

Edit: This doesn't fix the other point you bring up about needing to understand which transform is occurring, or the syntax concern, both of which I agree are confusing in this context

@bob-carpenter (Contributor)

@WardBrian: Simplexes have all positive elements by definition. I think what people were asking for is an ordered simplex. But we don't have any way to declare an ordered simplex because both are types, not something we currently write in angle brackets.

@betanalpha: I'm less worried about implementation dependency than you are. It happens in every programming language that you have to break the encapsulation and understand how something is implemented in order to write efficient code: for example, that Stan uses column-major order for matrices, or that we use autodiff, which can fail in places where the algebra doesn't, or that non-centered parameterizations are more efficient because of the sampler we happen to be using, and so on. More constructively, how would you propose someone code a non-centered, positive-constrained variable with prior alpha ~ lognormal(mu, sigma)? That's what I'd like to make easier for users.

@WardBrian (Member)

WardBrian commented Sep 21, 2021

You're right, I meant ordered simplex. And while those are treated as types in the syntax, in the implementation both of them just declare variables with the underlying type and a transformation, so in theory one could imagine syntax which (combined with #947) would declare a variable as both ordered and a simplex.

This was brought up when I was first discussing the composable-transforms idea, and it can be very tricky to explain why the output is not actually something that is both ordered and a simplex. It's also not clear to me how useful the result actually is. Making the offset and multiplier options orthogonal to the transformation of a variable allows the compositions we'd want, but would not allow things like this.

@betanalpha

betanalpha commented Sep 21, 2021 via email

@bob-carpenter (Contributor)

Thanks, @betanalpha. I get the conceptual distinction you're drawing between transforms and constraints. I'm more of a lumper, so I see a transform like the affine transform as a constraint that doesn't impose a constraint, i.e., it maps from reals to reals.

For non-centering, I currently do something like the following manually.

parameters {
  vector[N] alpha_std;
  real mu;
  real<lower = 0> sigma;
}
transformed parameters {
  vector<lower = 0>[N] alpha = exp(mu + sigma * alpha_std);
}
model {
  alpha_std ~ normal(0, 1);
  mu ~ ...;
  sigma ~ ...;
}

What I'd like to do is be able to write something like the following instead.

parameters {
  vector<offset = mu, multiplier = sigma><lower = 0>[N] alpha;
}
model {
  alpha ~ lognormal(mu, sigma);
}

With the understanding that the transform is a log followed by a z-score, and the inverse transform is an affine transform followed by an exp.
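
Spelled out (a sketch of the two directions, plus the Jacobian term a declaration would supply automatically):

x = (log(alpha) - mu) / sigma;  // unconstrain: log, then z-score
alpha = exp(mu + sigma * x);    // constrain: affine, then exp
// log absolute Jacobian of the constraining map: log(sigma) + mu + sigma * x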

@nhuurre (Collaborator)

nhuurre commented Sep 25, 2021

> Right now a transformation is chosen implicitly behind the scenes so that users don't need to be aware of that convention, but if a declaration like real<offset=m, multiplier=s><lower = 0> a; is allowed then that abstraction is broken and users would have to educate themselves on that convention.
>
> For example, many use multiplier/offset to try to hack better initial conditions, but when that transformation is applied after an unconstraining transformation, the user would have to know exactly how the unconstraining is implemented in order to set the right multiplier/offset.

This is an important point. Currently the (un)constraining functions are not exposed in the language (although maybe they should be?).

In my opinion the only interpretable abstraction is to automatically unconstrain the user-supplied offset value.

real<lower=low, offset=ofs> x;

would then transpile to

var x = lb_constrain(x_raw + lb_free(ofs, low), low);

This guarantees that the offset and the parameter are comparable despite the implicit transforms: in particular, x equals ofs exactly when x_raw is zero.

It's not at all obvious how to handle the multiplier, though.

@bob-carpenter (Contributor)

@nhuurre: Neither the constraining nor the unconstraining functions are exposed. Nevertheless, it's like any compiled language in that using it at maximal efficiency requires understanding what the compiler does under the hood. For example, C++ compilers make all kinds of decisions that aren't part of the spec, and part of writing efficient C++ is figuring out what those are.

The offset can be anything, so I'm not sure what you mean by "unconstraining the user-supplied offset". The multiplier needs to be positive. With our current constraints, we require the lower bound to be less than the upper bound, but otherwise we don't constrain arguments.

My suggestion was to transpile as follows:

real<offset = mu, multiplier = sigma><lower = L> x;
if (sigma <= 0) throw ...;
var x_mid = offset_multiplier_constrain(x_raw, mu, sigma, lp__);
var x = lb_constrain(x_mid, L, lp__);

With vectors, we can formulate a natural lognormal hierarchical model as:

vector<offset = mu, multiplier = sigma><lower = 0>[N] tau;

tau[1:N] ~ lognormal(mu, sigma);

This is indeed relying on our understanding of how multiplier and offset are implemented, as well as how lower bounds are implemented. If those things change, the program doesn't become wrong, it just becomes less efficient.

@WardBrian (Member)

@bob-carpenter's suggested C++ code is very similar to what I implemented in #971 (the if (sigma <= 0) throw ... is in the math library, so it's actually a direct composition: x = lb_constrain(OM_constrain(x_raw, mu, sigma, lp__), L, lp__);)

@nhuurre (Collaborator)

nhuurre commented Sep 28, 2021

> The offset can be anything, so I'm not sure what you mean by "unconstraining the user-supplied offset".

I mean that typical Stan code should look like this:

data {
  int K;
  simplex[K] init;
}
parameters {
  simplex<offset=init>[K] par;
}

and not like this

data {
  int K;
  simplex[K] init;
}
parameters {
  simplex<offset=simplex_free(init)>[K] par;
}

and definitely not like this

functions {
  // Invert Stan's stick-breaking simplex transform by hand: returns the
  // unconstrained (size k - 1) representation of the simplex v.
  vector calc_offset(vector v) {
    int k = size(v);
    vector[k-1] w;
    real l = v[k];
    for (i in 1:k-1) {
      l += v[k-i];                          // running tail sum v[k-i] + ... + v[k]
      w[k-i] = logit(v[k-i] / l) + log(i);  // stick-breaking fraction, recentered
    }
    return w;
  }
}
data {
  int K;
  simplex[K] init;
}
parameters {
  simplex<offset=calc_offset(init)>[K] par;
}

@betanalpha

betanalpha commented Oct 1, 2021 via email

@bob-carpenter (Contributor)

@betanalpha I'd like to be able to write this:

parameters {
  vector<lower=0><offset=mu, multiplier=sigma>[N] a;
}
model {
  a ~ lognormal(mu, sigma);
}

instead of what I currently do, which is this:

parameters {
  vector<offset = mu, multiplier = sigma>[N] log_a;
}
transformed parameters {
  vector<lower=0>[N] a = exp(log_a);
}
model {
  log_a ~ normal(mu, sigma);
}

or, even worse, what I'd have to do without offset/multiplier, which is this:

parameters {
  vector[N] log_std_a;
}
transformed parameters {
  vector<lower = 0>[N] a = exp(mu + sigma * log_std_a);
}
model {
  log_std_a ~ normal(0, 1);
}

Not to bring up another can of worms, but this is exactly the situation where I'd like to be able to write a one-liner to keep everything together for maximal readability:

parameters {
  vector<lower=0><offset=mu, multiplier=sigma>[N] a ~ lognormal(mu, sigma);
}

@nhuurre (Collaborator)

nhuurre commented Oct 1, 2021

> Not to bring up another can of worms, but this is exactly the situation where I'd like to be able to write a one-liner to keep everything together for maximal readability,

If we're going to open that can of worms may I propose that

parameters {
  vector[N] a ~ lognormal(mu, sigma);
}

means quantile function application

parameters {
  vector<lower=0, upper=1>[N] a_raw;
}
transformed parameters {
  vector[N] a;
  for (i in 1:N)
    a[i] = lognormal_qf(a_raw[i] | mu, sigma);
}

No need for additional transformations, and it's automatically sort of "non-centered."
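
For the lognormal, the quantile function has a closed form in terms of existing builtins, so (assuming no built-in lognormal_qf) the loop body above could be written today as:

a[i] = exp(mu + sigma * inv_Phi(a_raw[i]));  // lognormal quantile at probability a_raw[i]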

@betanalpha

betanalpha commented Oct 1, 2021 via email

@nhuurre (Collaborator)

nhuurre commented Oct 1, 2021

> That said, either way the implicit transformation perspective only allows you to monolithically center or non-center all of the parameters, which is the optimal parameterization only in special circumstances.

You can work around that with array-valued offsets/multipliers, for example:

data {
  int N;
  int K;
  real y[N,K];
  int<lower=0, upper=1> is_centered[N];
}
transformed data {
  int centered_idx[N];
  for (n in 1:N)
    centered_idx[n] = is_centered[n] ? 1 : 2;  // 1 = centered, 2 = non-centered
}
parameters {
  real mu;
  real<lower=0> sigma;  // used as a multiplier, so it must be positive
  vector<offset=[0.0, mu]'[centered_idx], multiplier=[1.0, sigma]'[centered_idx]>[N] eta;
}
model {
  eta ~ normal(mu, sigma);  // hierarchical prior; centering choice is per element
  for (n in 1:N) {
    y[n] ~ normal(eta[n], 1.0);
  }
}

WardBrian added the feature (New feature or request) label on Nov 9, 2021
WardBrian linked a pull request on May 16, 2022 that will close this issue