Add jitter to Cholesky factorization in Gaussian ops #3151
Conversation
pyro/ops/tensor_utils.py (Outdated)

```python
return x.sqrt()

# Add adaptive jitter.
x = x.clone()
```
are you sure you need/want clones? fwiw i do this in millipede, which is similar to what's done in gpytorch
Yup, this `.clone()` is needed because we're mutating the matrix. Nice, I'll rename this to `safe_cholesky()` as in millipede.
Thanks, I did try various gpytorch-style lazy tactics that avoid adding jitter until a failure has occurred. I found that since the Cholesky error happens only late in the filtering process, by that point the filter state had already been corrupted by nearly-singular matrices that just barely didn't trigger an error. The best solution I've found so far is to add a tiny amount of noise to all matrices so that error doesn't build up during filtering. Other solutions include using `svd` or `pinv` or `ldl_factor`, but they were more expensive.
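For concreteness, here is a minimal sketch of the always-add-jitter approach described above (illustrative only; the constant's value and the exact scaling are assumptions, not the code as merged):

```python
import torch

CHOLESKY_RELATIVE_JITTER = 4.0  # hypothetical knob, in units of dtype epsilon

def safe_cholesky(x):
    """Cholesky factorization with a tiny adaptive diagonal jitter (sketch)."""
    if x.size(-1) == 1:
        return x.sqrt()
    if CHOLESKY_RELATIVE_JITTER:
        # Clone before mutating, then add jitter scaled so it is just barely
        # detectable relative to the largest matrix entry.
        x = x.clone()
        jitter = CHOLESKY_RELATIVE_JITTER * torch.finfo(x.dtype).eps * x.abs().max()
        x.diagonal(dim1=-2, dim2=-1).add_(jitter)
    return torch.linalg.cholesky(x)
```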
Note the core piece of linear algebra is in `Gaussian.marginalize()`, which is repeatedly called in the filter pass of `sequential_gaussian_filter_sample()`. It's just the blockwise symmetric matrix inverse formula:

```python
# in Gaussian.marginalize():
P_aa = self.precision[..., a, a]
P_ba = self.precision[..., b, a]
P_bb = self.precision[..., b, b]
P_b = safe_cholesky(P_bb)  # Note if we add a little jitter here...
P_a = triangular_solve(P_ba, P_b, upper=False)  # ...then this is smaller...
P_at = P_a.transpose(-1, -2)
precision = P_aa - matmul(P_at, P_a)  # ...so this is even better conditioned.
```
This code has the nice property that if we add a little bit of jitter before Cholesky factorizing, the next precision matrix becomes only better-conditioned. Empirically this allowed me to get away with much smaller jitter than was needed if I waited for an error to occur.
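As a standalone sanity check on that formula (not PR code): the Schur complement computed this way matches the precision of the marginal over block `a`, i.e. the inverse of the corresponding covariance block.

```python
import torch

torch.manual_seed(0)
n, k = 5, 2  # total dimension, size of block a
A = torch.randn(n, n, dtype=torch.float64)
P = A @ A.T + n * torch.eye(n, dtype=torch.float64)  # well-conditioned SPD precision
a, b = slice(0, k), slice(k, n)

# Blockwise formula, mirroring Gaussian.marginalize():
L = torch.linalg.cholesky(P[b, b])
P_a = torch.linalg.solve_triangular(L, P[b, a], upper=False)  # L^{-1} P_ba
schur = P[a, a] - P_a.T @ P_a  # P_aa - P_ab P_bb^{-1} P_ba

# Reference: invert the full precision, take covariance block aa, invert back.
expected = torch.linalg.inv(torch.linalg.inv(P)[a, a])
assert torch.allclose(schur, expected, atol=1e-8)
```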
BTW it looks like you could speed up millipede by switching from `try: c = cholesky()` to the faster `c, info = cholesky_ex(); if not info.any(): return c`, which is used in gpytorch. The only reason I'm not using `cholesky_ex()` here is that I found the decision-based version was too unstable.
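A sketch of that `cholesky_ex` fast path, roughly the gpytorch/millipede-style pattern (the retry logic and names here are illustrative assumptions, not code from either library):

```python
import torch

def cholesky_with_retry(x, max_tries=3):
    # Fast path: cholesky_ex reports failure via `info` instead of raising,
    # which avoids Python exception overhead on the common success path.
    L, info = torch.linalg.cholesky_ex(x)
    if not info.any():
        return L
    # Slow path: add increasing diagonal jitter until factorization succeeds.
    jitter = torch.finfo(x.dtype).eps * x.diagonal(dim1=-2, dim2=-1).abs().max()
    for _ in range(max_tries):
        x = x + jitter * torch.eye(x.size(-1), dtype=x.dtype)
        L, info = torch.linalg.cholesky_ex(x)
        if not info.any():
            return L
        jitter = jitter * 10.0
    raise RuntimeError("Cholesky failed even after adding jitter")
```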
lgtm. but `CHOLESKY_JITTER` isn't actually easily toggle-able, is it?
pyro/ops/tensor_utils.py (Outdated)

```python
if x.size(-1) == 1:
if CHOLESKY_JITTER:
```
Do you intend to clamp by `(CHOLESKY_JITTER * finfo(x.dtype).eps) ** 2` here?
My intention was to scale by about `finfo(x.dtype).eps * x.max()` so that the jitter was just barely detectable by the largest matrix entry before Cholesky factorizing. That way if we set `RELATIVE_CHOLESKY_JITTER = 1/2`, then jitter will only affect matrix entries less than half the size of the max. And it kind of makes sense to me that each additional bit of precision would mean we would need to add half as much jitter, thus jitter would be proportional to `finfo(x.dtype).eps`. Mostly the proportional scaling helps us keep a constant `RELATIVE_CHOLESKY_JITTER` across float32 and float64.
What's your intuition behind the square here, is it to keep constant error post-Cholesky-factorization?
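The eps-proportional scaling can be checked directly: with one constant, the absolute jitter shrinks automatically when moving from float32 to float64 (the constant and the max entry below are made-up illustrative values):

```python
import torch

RELATIVE_CHOLESKY_JITTER = 0.5  # illustrative value from the comment above
x_max = 1000.0                  # pretend this is the largest matrix entry

jitter32 = RELATIVE_CHOLESKY_JITTER * torch.finfo(torch.float32).eps * x_max
jitter64 = RELATIVE_CHOLESKY_JITTER * torch.finfo(torch.float64).eps * x_max

# Each precision gets a jitter just below what its largest entry can resolve,
# so the same relative constant works for both dtypes.
print(jitter32, jitter64)
```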
I missed the `x_max` term in the above comment. Using the square seems to be more consistent w.r.t. cases where `x.size(-1) > 1`, but I like your clamp by `tiny` better.
Re `x_max`: using the global `max` makes sense, but I feel that it might be better to use the `max` of rows instead; e.g. considering the diagonal matrix `[0.0001, 10000]`, the global jitter is large w.r.t. the first diagonal term.
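The point can be seen numerically on that diagonal example (a quick illustrative sketch):

```python
import torch

x = torch.diag(torch.tensor([0.0001, 10000.0], dtype=torch.float64))
eps = torch.finfo(x.dtype).eps

global_jitter = eps * x.abs().max()      # one scale shared by every row
row_jitter = eps * x.abs().amax(dim=-1)  # per-row scale

# Relative to the small diagonal entry 1e-4, the global jitter is a far
# larger perturbation than the row-wise one.
print(global_jitter / 1e-4, row_jitter[0] / 1e-4)
```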
Nice idea! I've switched to using a row-wise max. This required increasing `CHOLESKY_RELATIVE_JITTER` from 1.0 to 4.0, but this way still seems better 👍
is this really preferred? this changes the eigenvalues and eigenvectors as opposed to jitter that is proportional to the identity (which only changes eigenvalues)
@martinjankowiak that's a good point, I didn't know about eigenvector preservation. I'd be ok with either version.
One thing I like about @fehiepsi's solution is that users can on their side rotate the system before performing Gaussian ops, e.g. I'm approximately diagonalizing via QR:

```python
evals, evecs = torch.linalg.eig(transition_matrix)
Q, R = torch.linalg.qr(evecs.real)
transition_matrix = Q.T @ transition_matrix @ Q
```

which shrinks my diagonal perturbations:

```diff
- [5.45, 1.81, 0.86, 0.76] * eps * CHOLESKY_RELATIVE_JITTER
+ [4.91, 1.67, 0.21, 0.17] * eps * CHOLESKY_RELATIVE_JITTER
```
That's correct, I'm hoping we won't actually need to toggle it, and I'm intending the global variable as an emergency switch in case I need to change something post-release in a production model and want to avoid monkey patching. Actually we ought to have some sort of standard interface for all of Pyro's global settings, similar to GPyTorch's settings. Mind if I attempt that in a follow-up PR #3152? Whichever of these PRs merges first, I'll be sure to add to the second PR a registration:

```python
@settings.register("cholesky_relative_jitter", __name__, "CHOLESKY_RELATIVE_JITTER")
def _validate_jitter(value):
    assert isinstance(value, (int, float))
    assert 0 <= value
```

EDIT: done in #3152.
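A minimal sketch of what such a settings registry could look like (names and interface here are hypothetical; the actual interface is whatever landed in #3152):

```python
import importlib

_REGISTRY = {}  # alias -> (module name, attribute name, validator)

def register(alias, modname, varname, validator=None):
    """Declare a module-level global as a named, validated setting."""
    _REGISTRY[alias] = (modname, varname, validator)

def set_setting(alias, value):
    modname, varname, validator = _REGISTRY[alias]
    if validator is not None:
        validator(value)
    setattr(importlib.import_module(modname), varname, value)

def get_setting(alias):
    modname, varname, _ = _REGISTRY[alias]
    return getattr(importlib.import_module(modname), varname)
```

Usage would be along the lines of `register("cholesky_relative_jitter", "pyro.ops.tensor_utils", "CHOLESKY_RELATIVE_JITTER", validator=...)`, after which users toggle the global through one documented entry point instead of monkey patching.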
Thanks for reviewing!
This numerically stabilizes Gaussian parallel-scan operations for use in very long chains, say > 10000 items. This use case is important for heterogeneous-length batching, where a batch of sequences can be concatenated and operated on as a single chain. In this setting I was seeing Cholesky errors in both single and double precision.
The fix in this PR is to modify `pyro.ops.tensor_utils.cholesky()` to add a small adaptive amount of jitter to all computations. This appears both to pass all existing tests (which are quite strong; it doesn't work to add a non-adaptive jitter) and to make Gaussian parallel-scan filtering work for very long sequences. 🎉

Tested