Westfall and Young p-values for feols #698
Conversation
Very cool @marcandre259! I'll review tomorrow after work. I will also check if we can test against my R package - I recall not being happy with some simulation results I had seen, and will try to piece together why I never properly released it. Re Stata - the problem is that both the wyoung and rwolf implementations are very slow, so in order to get results matching at even the first decimal place, we'd have to wait quite a long time. And then if results don't match, we'd have to increase the number of iterations in Stata and wait even longer... Though if we prepare tests, we could ask @leostimpfle to make use of his license and help us with Stata results? 😀
Hi @marcandre259, the implementation looks good to me, though I am sceptical about using t-stats rather than p-values. I think it is fine as long as the t-stats are rank preserving, but I would still prefer p-values, as it will help users follow the implementation if we link to the Stata Journal paper. Beyond that, the implementation looks very good & I think the algorithm is implemented correctly =)

I have two other suggestions: a) I would like to add support for randomization inference, just as in the rwolf method; and b) what do you think about economizing a bit between the rwolf and wyoung functions? I think that both functions could share all of their code except for this part:

```python
t_stats[i] = wildboot_res_df["t value"]
boot_t_stats[:, i] = bootstrapped_t_stats
pval = _get_wyoung_pval(t_stats, boot_t_stats)
```

For example, we could define a function

```python
import warnings
from typing import Union

import numpy as np
import pandas as pd

# Feols, Fepois, _post_processing_input_checks, _get_rwolf_pval and
# _get_wyoung_pval are pyfixest internals.


def _multcomp_resample(
    type: str,
    models: list[Union[Feols, Fepois]],
    param: str,
    reps: int,
    seed: int,
) -> pd.DataFrame:
    models = _post_processing_input_checks(models)
    all_model_stats = pd.DataFrame()

    full_enumeration = False
    S = 0
    for model in models:
        if param not in model._coefnames:
            raise ValueError(
                f"Parameter '{param}' not found in the model {model._fml}."
            )

        if model._is_clustered:
            # model._G is a list of length 3;
            # for oneway clustering, the number of clusters is repeated three times
            G = min(model._G)
            if reps > 2**G:
                warnings.warn(
                    f"""
                    2^(the number of clusters) < the number of boot iterations for at least one model,
                    setting full_enumeration to True and reps = {2**G}.
                    """
                )
                full_enumeration = True

        model_tidy = model.tidy().xs(param)
        all_model_stats = pd.concat([all_model_stats, model_tidy], axis=1)
        S += 1

    if type == "rwolf":
        t_stats = np.zeros(S)
        boot_t_stats = np.zeros((2**G, S)) if full_enumeration else np.zeros((reps, S))
    else:
        p_vals = np.zeros(S)
        boot_p_vals = np.zeros((2**G, S)) if full_enumeration else np.zeros((reps, S))

    for i in range(S):
        model = models[i]

        wildboot_res_df, bootstrapped_t_stats = model.wildboottest(
            param=param,
            reps=reps,
            return_bootstrapped_t_stats=True,
            seed=seed,  # all S iterations require the same bootstrap samples, hence seed needs to be reset
        )

        if type == "rwolf":
            t_stats[i] = wildboot_res_df["t value"]
            boot_t_stats[:, i] = bootstrapped_t_stats
        else:
            p_vals[i] = ...
            boot_p_vals[:, i] = ...

    if type == "rwolf":
        pval = _get_rwolf_pval(t_stats, boot_t_stats)
    else:
        pval = _get_wyoung_pval(p_vals, boot_p_vals)

    all_model_stats.loc["WY Pr(>|t|)"] = pval
    all_model_stats.columns = pd.Index([f"est{i}" for i, _ in enumerate(models)])

    return all_model_stats
```

and then define the public functions as thin wrappers?

```python
def rwolf(...):
    return _multcomp_resample(type="rwolf", ...)


def wyoung(...):
    return _multcomp_resample(type="wyoung", ...)
```
Hi @s3alfisc, thank you for the great suggestions. Indeed, I agree with the refactoring you suggest and I'll implement it. How hard is it to get the unadjusted p-values out of the already-developed randomization/bootstrapping functions? It would be nice if I could keep using the existing tools. Best,
You can take the unadjusted p-values directly from the feols object. I only take the t-stat from the wildboottest output because wildboottest() might use different small sample corrections than pyfixest, so the comparison of the "original" t-stat and the "bootstrapped" t-stats might be incorrect. I.e. we might end up comparing a t-stat scaled by feols' small sample correction against bootstrapped t-stats scaled by wildboottest's correction, which I thought we should avoid =)

For the p-value, this does not matter, so you can simply call

```python
import pyfixest as pf

data = pf.get_data()
fit = pf.feols("Y ~ X1", data=data)

fit.wildboottest(param="X1", reps=10).xs("Pr(>|t|)")
fit.ritest(resampvar="X1", reps=10).xs("Pr(>|t|)")
```
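To make the first point concrete: the unadjusted, analytical p-value is already part of the tidy output that the PR uses elsewhere, so something like the following should work (the column label is taken from the snippets in this thread, not verified beyond that):

```python
import pyfixest as pf

data = pf.get_data()
fit = pf.feols("Y ~ X1", data=data)

# analytical, unadjusted p-value for X1, straight from the fitted model
print(fit.tidy().xs("X1")["Pr(>|t|)"])
```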
Hi @s3alfisc, the issue I'm currently facing is that `fit.wildboottest(param="X1", reps=10).xs("Pr(>|t|)")` returns a single p-value, based on this method of wildboottest:

```python
def get_pvalue(self, pval_type="two-tailed"):
    if pval_type == "two-tailed":
        self.pvalue = np.mean(np.abs(self.t_stat) < abs(self.t_boot))
    elif pval_type == "equal-tailed":
        pl = np.mean(self.t_stat < self.t_boot)
        ph = np.mean(self.t_stat > self.t_boot)
        self.pvalue = 2 * min(pl, ph)
    elif pval_type == ">":
        self.pvalue = np.mean(self.t_stat < self.t_boot)
    else:
        self.pvalue = np.mean(self.t_stat > self.t_boot)
```

In the wyoung algorithm, I need to compare against each of the B resampled sets of p-values, and I'm not sure what the best way forward is. On the one hand, wildboottest could offer an option to output an array of B p-values; my question is then how to compute them. On the other hand, keeping things as they are in my initial commit w.r.t. getting the S adjusted p-values is not really different from what already happens in rwolf? Namely, the same small sample corrections are applied, although they differ from those of fixest. I'm looking forward to getting your opinion :)
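As a toy illustration of what the two-tailed branch above computes (made-up numbers, not wildboottest output): the bootstrap p-value is just the share of bootstrapped t-statistics that are larger in absolute value than the sample t-statistic.

```python
import numpy as np

rng = np.random.default_rng(42)
t_stat = 1.8                       # "sample" t-statistic
t_boot = rng.standard_normal(999)  # stand-in for bootstrapped t-statistics

# share of bootstrap draws more extreme (in absolute value) than the sample statistic
p_two_tailed = np.mean(np.abs(t_stat) < np.abs(t_boot))
print(p_two_tailed)  # close to 2 * (1 - Phi(1.8)) ≈ 0.07 under a standard normal reference
```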
The wild bootstrap computes a "bootstrapped p-value" by counting how often the "sample" t-statistic is larger than the individual bootstrap t-statistics. That's why we only get a single p-value, computed e.g. via `np.mean(np.abs(self.t_stat) < abs(self.t_boot))` as in the method you quoted above.

If I understand the WY method correctly, we instead compute a p-value for the test of interest for each of the B bootstrap samples. So if we have the bootstrapped t-statistics, we could do something like this:

```python
import numpy as np
import pyfixest as pf
from scipy.stats import t

data = pf.get_data()
fit = pf.feols("Y ~ X1", data=data)

_, t_boot = fit.wildboottest(param="X1", reps=1000, return_bootstrapped_t_stats=True)
t_boot[0:5]
# array([ 0.75552776, -0.84928264, -1.04821196,  0.84407764,  1.15154982])

df = fit._N - fit._k if fit._vcov_type in ["iid", "hetero"] else fit._G - 1
p_vals_boot = 2 * (1 - t.cdf(np.abs(t_boot), df))
p_vals_boot[0:5]
# array([0.45011102, 0.39592812, 0.29479525, 0.3988287 , 0.24978246])
```

Here I compute the p-values the same way the analytical feols p-values are computed, i.e. from a t-distribution with the matching degrees of freedom.
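A minimal sketch, not the PR's actual `_get_wyoung_pval`, of how a Westfall-Young free step-down adjustment could then combine the raw p-values with such a `(B, S)` matrix of bootstrapped p-values (helper name and conventions are illustrative only):

```python
import numpy as np


def wyoung_stepdown_pvals(pvals: np.ndarray, boot_pvals: np.ndarray) -> np.ndarray:
    """pvals: (S,) raw p-values; boot_pvals: (B, S) bootstrapped p-values."""
    S = pvals.shape[0]
    order = np.argsort(pvals)        # hypotheses from most to least significant
    p_sorted = pvals[order]
    boot_sorted = boot_pvals[:, order]

    # step-down: within each bootstrap draw, take successive minima over the
    # "less significant" hypotheses
    q = np.minimum.accumulate(boot_sorted[:, ::-1], axis=1)[:, ::-1]

    # adjusted p-value: share of bootstrap draws at least as extreme as observed
    adj_sorted = np.mean(q <= p_sorted, axis=0)

    # enforce monotonicity of the adjusted p-values themselves
    adj_sorted = np.maximum.accumulate(adj_sorted)

    # undo the sorting
    adj = np.empty(S)
    adj[order] = adj_sorted
    return adj


# toy usage with random inputs
rng = np.random.default_rng(0)
raw = rng.uniform(0, 0.2, size=4)
boot = rng.uniform(size=(999, 4))
print(wyoung_stepdown_pvals(raw, boot))
```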
Yes and no, I think =D Per se, it does not matter which small sample correction wildboottest applies. When computing p-values, the small sample corrections cancel out in this step:

```python
wb = fit.wildboottest()  # wb is a WildBootstrap object
np.mean(np.abs(ssc * wb.t_stat) < abs(ssc * wb.t_boot))
```

If we now did something like

```python
np.mean(np.abs(ssc_feols * self.t_stat) < abs(ssc_wb * wb.t_boot))
```

i.e. compared the t-stat from feols (scaled by feols' small sample correction) against the bootstrapped t-stats from wildboottest (scaled by wildboottest's correction), the two corrections would no longer cancel. So we always need to make sure that in all comparisons, we apply the small sample corrections consistently.
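A quick numeric check of the cancellation argument (toy numbers, nothing from wildboottest): scaling both sides of the comparison by the same positive factor leaves the bootstrap p-value unchanged, while mixing two different factors generally does not.

```python
import numpy as np

rng = np.random.default_rng(1)
t_stat = 2.0
t_boot = rng.standard_normal(1000)

ssc_a, ssc_b = 1.10, 0.95  # two hypothetical small sample corrections

no_scaling = np.mean(np.abs(t_stat) < np.abs(t_boot))
same_scaling = np.mean(np.abs(ssc_a * t_stat) < np.abs(ssc_a * t_boot))
mixed_scaling = np.mean(np.abs(ssc_a * t_stat) < np.abs(ssc_b * t_boot))

print(same_scaling == no_scaling)   # True: a common factor cancels
print(mixed_scaling == no_scaling)  # generally False: mixed factors distort the comparison
```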
Hi @s3alfisc, thank you for the extensive response, I am learning a lot. Based on your feedback, I will include this step prior to running the main WY algorithm:

```python
df = fit._N - fit._k if fit._vcov_type in ["iid", "hetero"] else fit._G - 1
p_vals_boot = 2 * (1 - t.cdf(np.abs(t_boot), df))
```

That way it will at least be clearer to follow the steps. I am also going to use the t-stats obtained from the wildboottest package.
Sounds like a good plan to me! Let me know when I should take another look =)
Going to close this pull request and refer to it in a new one.
In response to #485, I followed the algorithm given in the appendix of the paper. The only difference is that we work with t-statistics instead of p-values directly.
A lot of code is copied from the rwolf function. I've removed the randomization option and kept only the wild bootstrap, because I was getting strange results with randomization inference. I did not look into it deeply, however, so it could be a simple issue.
For testing, I wrote a private "slow" function that follows the algorithm given in the above paper quite literally. It's not ideal, but I do not have access to Stata. I did find @s3alfisc's experimental R package, and that could maybe also be used as a baseline?
Cheers