Enforce pred_var is always greater than zero on GRF #480

Merged
merged 12 commits into from
Aug 7, 2021

Conversation

arose13
Contributor

@arose13 arose13 commented Jun 8, 2021

Ensure that pred_var is always greater than zero.

This prevents NaNs from appearing in some output values when the confidence interval is computed.

The NaNs were previously created when the variance was converted to a standard deviation for scipy's distribution models.

PS: I also removed duplicated code that used to appear on lines 798 and 799.
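
To make the failure mode concrete, here is a minimal sketch; it is not the code from this PR. It assumes NumPy, uses made-up values for `point` and `pred_var`, and takes the normal quantile from the standard library's `NormalDist` instead of scipy to stay self-contained. Clipping at zero is shown only as one possible remedy.

```python
import numpy as np
from statistics import NormalDist

# Hypothetical inputs: one ordinary variance, one tiny negative variance
# caused by floating-point error, and one exact zero.
point = np.array([1.0, 1.0, 1.0])
pred_var = np.array([4.0e-2, -2.0e-28, 0.0])

# Taking the square root of a negative variance yields NaN.
with np.errstate(invalid="ignore"):  # silence the RuntimeWarning for the demo
    sd_raw = np.sqrt(pred_var)

# Assumed remedy: clip negative variances to zero before the square root.
sd_clipped = np.sqrt(np.clip(pred_var, 0, None))

# Two-sided 95% normal interval (alpha = 0.05).
z = NormalDist().inv_cdf(1 - 0.05 / 2)
lower_raw, upper_raw = point - z * sd_raw, point + z * sd_raw
lower_ok, upper_ok = point - z * sd_clipped, point + z * sd_clipped

print(lower_raw, upper_raw)  # the second entry is NaN in both bounds
print(lower_ok, upper_ok)    # no NaNs once the variance is clipped
```

With clipping, the interval for the second entry collapses to the point estimate, which matches the reading that the variance is numerically zero.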

Collaborator

@vsyrgkanis vsyrgkanis left a comment

Looks great! Weird that the Bayesian debiasing was not ensuring that, but maybe here you were getting exact zeros, which would also be problematic.

@arose13
Contributor Author

arose13 commented Jun 8, 2021

I was also surprised when I looked through the _predict_point_and_var function, but the numbers I was getting that were breaking things were variances like -2e-28 and -3e-31, so numerically they might actually be zero.

Collaborator

@kbattocchi kbattocchi left a comment

Thanks for this contribution! Looks good, although I added one minor suggestion.

Also, would it be possible to add a simple test where the current code is failing but this change works so that we can make sure not to regress in the future?

econml/grf/_base_grf.py (review thread: outdated, resolved)
@arose13 arose13 requested a review from kbattocchi June 22, 2021 01:34
Collaborator

@kbattocchi kbattocchi left a comment

The new changes look good, thanks for contributing! I'm approving the PR because it looks good code-wise, but before we can merge it there are two issues:

  • A minor line-too-long linting problem
  • We have a real test failure when running the notebooks/Generalized Random Forests.ipynb notebook where the assertion is being triggered. Could you check if this is just a case where we should be using a slightly looser tolerance, or if we're really getting big negative values there for some reason?

(there are also a couple of other random test failures that I suspect are sporadic and could be fixed by just rerunning)

@@ -793,10 +793,13 @@ def predict_full(self, X, interval=False, alpha=0.05):
        """
        if interval:
            point, pred_var = self._predict_point_and_var(X, full=True, point=True, var=True)
            assert np.isclose(pred_var[pred_var < 0], 0, atol=1e-8).all(), '`pred_var` should not produce large negative values'
Collaborator

Unfortunately this line is failing our linting step because it's too long, so you'll need to break it up over two lines instead before we can merge.
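
One way to satisfy the line-length check is to pull the negative entries into a local variable and wrap the assertion message in parentheses. The sketch below is illustrative only: `check_pred_var` is a hypothetical helper name, not a function in econml, and the tolerance simply mirrors the `atol=1e-8` from the diff.

```python
import numpy as np


def check_pred_var(pred_var, atol=1e-8):
    """Assert that any negative variances are within `atol` of zero.

    Hypothetical helper for illustration; not part of econml.
    """
    negative = pred_var[pred_var < 0]
    # An empty selection passes, because `.all()` on an empty array is True.
    assert np.isclose(negative, 0, atol=atol).all(), (
        '`pred_var` should not produce large negative values'
    )


# Tiny negative values from round-off pass the check.
check_pred_var(np.array([1.0, -2e-28, 0.0]))

# A value such as -9.7e-05 is well outside the tolerance and triggers it.
try:
    check_pred_var(np.array([1.0, -9.71148020e-05]))
except AssertionError as err:
    print("assertion triggered:", err)
```

Because the comparison target is 0, the relative tolerance of `np.isclose` contributes nothing, so the check reduces to `abs(value) <= atol`.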

@arose13
Contributor Author

arose13 commented Jun 23, 2021

Some of the failed tests appear to be caused by a ModuleNotFoundError on import for a library called ipyparallel, and by inexplicable kernel timeout errors.

@kbattocchi
Collaborator

@arose13 The transient test failures are gone; the remaining notebook failures are due to triggering the assert within notebooks/Generalized Random Forests.ipynb as I mentioned in my previous comment. Could you run through this notebook locally on your branch and see whether it's being triggered by negative values that are just barely too large for the current check or whether they are far from zero?

@arose13
Contributor Author

arose13 commented Jun 29, 2021

This is the list of negative numbers that the GRF is generating in cell 5 of the Generalized Random Forests notebook. They are not large, but they are orders of magnitude larger than the numbers that my dataset produced.

[-9.71148020e-05 -9.71148020e-05 -3.71913346e-04 -3.71913346e-04
 -1.81617759e-04 -1.81617759e-04 -1.90164152e-03 -1.90164152e-03
 -3.72337774e-05 -3.72337774e-05 -3.12026815e-03 -3.12026815e-03
 -2.87514374e-03 -2.87514374e-03 -1.64442561e-03 -1.64442561e-03
 -3.89969475e-03 -3.89969475e-03 -4.03462227e-03 -4.03462227e-03
 -3.57581040e-03 -3.57581040e-03 -4.10402059e-03 -4.10402059e-03
 -9.66940790e-05 -9.66940790e-05 -9.64839622e-05 -9.64839622e-05
 -2.56454874e-03 -2.56454874e-03 -7.59402132e-03 -7.59402132e-03
 -3.00155712e-03 -3.00155712e-03 -9.65701081e-04 -9.65701081e-04
 -3.91570169e-03 -3.91570169e-03 -1.14883637e-03 -1.14883637e-03
 -1.67404798e-03 -1.67404798e-03 -2.44639496e-03 -2.44639496e-03
 -4.44131189e-03 -4.44131189e-03 -1.78629255e-03 -1.78629255e-03
 -5.10844232e-03 -5.10844232e-03 -5.39320397e-03 -5.39320397e-03
 -3.21969607e-04 -3.21969607e-04 -1.89189299e-03 -1.89189299e-03
 -1.07363686e-03 -1.07363686e-03 -4.63204525e-04 -4.63204525e-04
 -6.20925242e-03 -6.20925242e-03 -5.48326105e-04 -5.48326105e-04
 -7.92044922e-03 -7.92044922e-03 -1.67513014e-03 -1.67513014e-03
 -1.91783295e-03 -1.91783295e-03 -2.66217332e-03 -2.66217332e-03
 -8.06691208e-03 -8.06691208e-03 -3.80646330e-03 -3.80646330e-03
 -1.02306806e-03 -1.02306806e-03 -6.63986327e-03 -6.63986327e-03
 -2.49315492e-03 -2.49315492e-03 -5.62818743e-03 -5.62818743e-03
 -4.81508894e-03 -4.81508894e-03 -1.57566769e-02 -1.57566769e-02
 -1.91329013e-03 -1.91329013e-03 -1.39286160e-03 -1.39286160e-03
 -4.21758085e-03 -4.21758085e-03 -3.99219257e-04 -3.99219257e-04
 -7.18139476e-03 -7.18139476e-03 -4.56547781e-03 -4.56547781e-03
 -5.17726669e-03 -5.17726669e-03 -3.46554222e-03 -3.46554222e-03
 -1.61704394e-02 -1.61704394e-02 -1.79248516e-02 -1.79248516e-02
 -4.90204604e-02 -4.90204604e-02 -2.02323367e-02 -2.02323367e-02
 -6.68202655e-03 -6.68202655e-03 -5.17581754e-02 -5.17581754e-02
 -1.38303209e-02 -1.38303209e-02 -5.87194234e-03 -5.87194234e-03
 -1.24390342e-02 -1.24390342e-02 -1.71015258e-02 -1.71015258e-02
 -7.49818211e-03 -7.49818211e-03 -2.96637026e-02 -2.96637026e-02
 -1.38563185e-02 -1.38563185e-02 -8.27955845e-02 -8.27955845e-02
 -8.46787177e-02 -8.46787177e-02 -5.53417005e-02 -5.53417005e-02
 -8.17552761e-02 -8.17552761e-02 -5.25008524e-02 -5.25008524e-02]

Let me know what you think.
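
To judge whether values like these are "barely too large" or far from zero, one option is to compare their magnitudes against the tolerance directly. The sketch below assumes NumPy and uses only a handful of the values reported above, not the full list; the variable names are illustrative.

```python
import numpy as np

# A few of the negative variances reported above (not the full list).
neg = np.array([
    -9.71148020e-05,
    -3.72337774e-05,
    -1.57566769e-02,
    -8.46787177e-02,
])
atol = 1e-8  # tolerance used by the assertion in the diff

worst = neg.min()                         # most negative variance
n_fail = int((np.abs(neg) > atol).sum())  # entries that violate the tolerance
orders = np.log10(np.abs(neg) / atol)     # orders of magnitude above atol

print(f"most negative value: {worst:.3e}")
print(f"{n_fail} of {neg.size} values exceed atol={atol:g}")
print(f"orders of magnitude above atol: {orders.min():.1f} to {orders.max():.1f}")
```

For this sample, every value exceeds the tolerance by several orders of magnitude, so a slightly looser `atol` alone would not make the assertion pass.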

@vsyrgkanis
Collaborator

I'm confused: the cell below plots the confidence interval and seems to have no problem. Is this triggered by some change here?

econml/grf/_base_grf.py (review thread: outdated, resolved)
Collaborator

@vsyrgkanis vsyrgkanis left a comment

LGTM

@kbattocchi kbattocchi enabled auto-merge (squash) July 9, 2021 19:02
@kbattocchi kbattocchi disabled auto-merge August 7, 2021 13:50
@kbattocchi kbattocchi enabled auto-merge (squash) August 7, 2021 13:50
@kbattocchi kbattocchi merged commit 2783e09 into py-why:master Aug 7, 2021
3 participants