Westfall and Young p-values for feols #698
Conversation
Very cool @marcandre259! I'll review tomorrow after work. I will also check if we can test against my R package - I recall not being happy with some simulation results I had seen, and will try to piece together why I never properly released it. Re Stata - the problem is that both the wyoung and rwolf implementations are very slow, so in order to get results matching at even the first decimal place, we'd have to wait quite a long time. And then if results don't match, we'd have to increase the number of iterations in Stata and wait even longer... Though if we prepare tests, we could ask @leostimpfle to make use of his license and help us with Stata results? 😀
Hi @marcandre259, the implementation looks good to me, though I am sceptical about using t-stats rather than p-values. I think it is fine as long as the t-stats are rank preserving, but I would still prefer p-values, as it will help users follow the implementation if we link to the Stata Journal paper. Beyond that, the implementation looks very good & I think the algorithm is implemented correctly =)

I have two other suggestions: a) I would like to add support for randomization inference, just as in the rwolf method; and b) what do you think about economizing a bit between the rwolf and wyoung functions? I think that both functions could share all of their code except for this part:

```python
t_stats[i] = wildboot_res_df["t value"]
boot_t_stats[:, i] = bootstrapped_t_stats
pval = _get_wyoung_pval(t_stats, boot_t_stats)
```

For example, we could define a function

```python
import warnings
from typing import Union

import numpy as np
import pandas as pd

# Feols, Fepois, _post_processing_input_checks, _get_rwolf_pval and
# _get_wyoung_pval are pyfixest internals.


def _multcomp_resample(
    type: str,
    models: list[Union[Feols, Fepois]],
    param: str,
    reps: int,
    seed: int,
) -> pd.DataFrame:
    models = _post_processing_input_checks(models)
    all_model_stats = pd.DataFrame()

    full_enumeration = False
    S = 0
    for model in models:
        if param not in model._coefnames:
            raise ValueError(
                f"Parameter '{param}' not found in the model {model._fml}."
            )

        if model._is_clustered:
            # model._G is a list of length 3;
            # for oneway clustering, the number of clusters is repeated three times
            G = min(model._G)
            if reps > 2**G:
                warnings.warn(
                    f"""
                    2^(the number of clusters) < the number of boot iterations for at least one model,
                    setting full_enumeration to True and reps = {2**G}.
                    """
                )
                full_enumeration = True

        model_tidy = model.tidy().xs(param)
        all_model_stats = pd.concat([all_model_stats, model_tidy], axis=1)
        S += 1

    if type == "rwolf":
        t_stats = np.zeros(S)
        boot_t_stats = np.zeros((2**G, S)) if full_enumeration else np.zeros((reps, S))
    else:
        p_vals = np.zeros(S)
        boot_p_vals = np.zeros((2**G, S)) if full_enumeration else np.zeros((reps, S))

    for i in range(S):
        model = models[i]

        wildboot_res_df, bootstrapped_t_stats = model.wildboottest(
            param=param,
            reps=reps,
            return_bootstrapped_t_stats=True,
            seed=seed,  # all S iterations require the same bootstrap samples, hence seed needs to be reset
        )

        if type == "rwolf":
            t_stats[i] = wildboot_res_df["t value"]
            boot_t_stats[:, i] = bootstrapped_t_stats
        else:
            p_vals[i] = ...
            boot_p_vals[:, i] = ...

    if type == "rwolf":
        pval = _get_rwolf_pval(t_stats, boot_t_stats)
    else:
        pval = _get_wyoung_pval(p_vals, boot_p_vals)

    all_model_stats.loc["WY Pr(>|t|)"] = pval
    all_model_stats.columns = pd.Index([f"est{i}" for i, _ in enumerate(models)])

    return all_model_stats
```

and then define the public functions as thin wrappers?

```python
def rwolf(...):
    return _multcomp_resample(type="rwolf", ...)


def wyoung(...):
    return _multcomp_resample(type="wyoung", ...)
```
Hi @s3alfisc, thank you for the great suggestions. Indeed, I agree with the refactoring you suggest and I'll implement it. How hard is it to get the unadjusted p-values out of the already-developed randomization/bootstrapping functions? It would be nice if I could keep using the existing tools. Best,
You can take the unadjusted p-values directly from the feols object. I only take the t-stat from the wildboottest output because wildboottest() might use different small sample corrections than pyfixest, so the comparison of the "original" t-stat and the "bootstrapped" t-stats might be incorrect. I.e. we might end up comparing a t-stat scaled by feols' small sample correction against bootstrapped t-stats scaled by wildboottest's correction, which I thought we should avoid =)

For the p-value, this does not matter, so you can simply call

```python
import pyfixest as pf

data = pf.get_data()
fit = pf.feols("Y ~ X1", data=data)

fit.wildboottest(param="X1", reps=10).xs("Pr(>|t|)")
fit.ritest(resampvar="X1", reps=10).xs("Pr(>|t|)")
```
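To make the first point concrete: the unadjusted, analytical p-value is already part of the tidy output that the PR uses elsewhere, so something like the following should work (the column label is taken from the snippets in this thread, not verified beyond that):

```python
import pyfixest as pf

data = pf.get_data()
fit = pf.feols("Y ~ X1", data=data)

# analytical, unadjusted p-value for X1, straight from the fitted model
print(fit.tidy().xs("X1")["Pr(>|t|)"])
```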
Hi @s3alfisc, the issue I'm currently facing is that `fit.wildboottest(param="X1", reps=10).xs("Pr(>|t|)")` returns a single p-value, based on this method of wildboottest:

```python
def get_pvalue(self, pval_type="two-tailed"):
    if pval_type == "two-tailed":
        self.pvalue = np.mean(np.abs(self.t_stat) < abs(self.t_boot))
    elif pval_type == "equal-tailed":
        pl = np.mean(self.t_stat < self.t_boot)
        ph = np.mean(self.t_stat > self.t_boot)
        self.pvalue = 2 * min(pl, ph)
    elif pval_type == ">":
        self.pvalue = np.mean(self.t_stat < self.t_boot)
    else:
        self.pvalue = np.mean(self.t_stat > self.t_boot)
```

In the wyoung algorithm, I need to compare against each of the B resampled sets of p-values, and I'm not sure what the best way forward is. On the one hand, wildboottest could offer an option to output an array of B p-values; my question is then how to compute them. On the other hand, keeping things as they are in my initial commit w.r.t. getting the S adjusted p-values is not really different from what already happens in rwolf? Namely, the same small sample corrections are applied, although they differ from those of fixest. I'm looking forward to getting your opinion :)
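As a toy illustration of what the two-tailed branch above computes (made-up numbers, not wildboottest output): the bootstrap p-value is just the share of bootstrapped t-statistics that are larger in absolute value than the sample t-statistic.

```python
import numpy as np

rng = np.random.default_rng(42)
t_stat = 1.8                       # "sample" t-statistic
t_boot = rng.standard_normal(999)  # stand-in for bootstrapped t-statistics

# share of bootstrap draws more extreme (in absolute value) than the sample statistic
p_two_tailed = np.mean(np.abs(t_stat) < np.abs(t_boot))
print(p_two_tailed)  # close to 2 * (1 - Phi(1.8)) ≈ 0.07 under a standard normal reference
```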
The wild bootstrap computes a "bootstrapped p-value" by counting how often the "sample" t-statistic is larger than the individual bootstrap t-statistics. That's why we only get a single p-value, computed e.g. via `np.mean(np.abs(self.t_stat) < abs(self.t_boot))` as in the method you quoted above.

If I understand the WY method correctly, we instead compute a p-value for the test of interest for each of the B bootstrap samples. So if we have the bootstrapped t-statistics, we could do something like this:

```python
import numpy as np
import pyfixest as pf
from scipy.stats import t

data = pf.get_data()
fit = pf.feols("Y ~ X1", data=data)

_, t_boot = fit.wildboottest(param="X1", reps=1000, return_bootstrapped_t_stats=True)
t_boot[0:5]
# array([ 0.75552776, -0.84928264, -1.04821196,  0.84407764,  1.15154982])

df = fit._N - fit._k if fit._vcov_type in ["iid", "hetero"] else fit._G - 1
p_vals_boot = 2 * (1 - t.cdf(np.abs(t_boot), df))
p_vals_boot[0:5]
# array([0.45011102, 0.39592812, 0.29479525, 0.3988287 , 0.24978246])
```

Here I compute the p-values the same way the analytical feols p-values are computed, i.e. from a t-distribution with the matching degrees of freedom.
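A minimal sketch, not the PR's actual `_get_wyoung_pval`, of how a Westfall-Young free step-down adjustment could then combine the raw p-values with such a `(B, S)` matrix of bootstrapped p-values (helper name and conventions are illustrative only):

```python
import numpy as np


def wyoung_stepdown_pvals(pvals: np.ndarray, boot_pvals: np.ndarray) -> np.ndarray:
    """pvals: (S,) raw p-values; boot_pvals: (B, S) bootstrapped p-values."""
    S = pvals.shape[0]
    order = np.argsort(pvals)        # hypotheses from most to least significant
    p_sorted = pvals[order]
    boot_sorted = boot_pvals[:, order]

    # step-down: within each bootstrap draw, take successive minima over the
    # "less significant" hypotheses
    q = np.minimum.accumulate(boot_sorted[:, ::-1], axis=1)[:, ::-1]

    # adjusted p-value: share of bootstrap draws at least as extreme as observed
    adj_sorted = np.mean(q <= p_sorted, axis=0)

    # enforce monotonicity of the adjusted p-values themselves
    adj_sorted = np.maximum.accumulate(adj_sorted)

    # undo the sorting
    adj = np.empty(S)
    adj[order] = adj_sorted
    return adj


# toy usage with random inputs
rng = np.random.default_rng(0)
raw = rng.uniform(0, 0.2, size=4)
boot = rng.uniform(size=(999, 4))
print(wyoung_stepdown_pvals(raw, boot))
```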
Yes and no, I think =D Per se, it does not matter which small sample correction wildboottest applies. When computing p-values, the small sample corrections cancel out in this step:

```python
wb = fit.wildboottest()  # wb is a WildBootstrap object
np.mean(np.abs(ssc * wb.t_stat) < abs(ssc * wb.t_boot))
```

If we now did something like

```python
np.mean(np.abs(ssc_feols * self.t_stat) < abs(ssc_wb * wb.t_boot))
```

i.e. compared the t-stat from feols (scaled by feols' small sample correction) against the bootstrapped t-stats from wildboottest (scaled by wildboottest's correction), the two corrections would no longer cancel. So we always need to make sure that in all comparisons, we apply the small sample corrections consistently.
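A quick numeric check of the cancellation argument (toy numbers, nothing from wildboottest): scaling both sides of the comparison by the same positive factor leaves the bootstrap p-value unchanged, while mixing two different factors generally does not.

```python
import numpy as np

rng = np.random.default_rng(1)
t_stat = 2.0
t_boot = rng.standard_normal(1000)

ssc_a, ssc_b = 1.10, 0.95  # two hypothetical small sample corrections

no_scaling = np.mean(np.abs(t_stat) < np.abs(t_boot))
same_scaling = np.mean(np.abs(ssc_a * t_stat) < np.abs(ssc_a * t_boot))
mixed_scaling = np.mean(np.abs(ssc_a * t_stat) < np.abs(ssc_b * t_boot))

print(same_scaling == no_scaling)   # True: a common factor cancels
print(mixed_scaling == no_scaling)  # generally False: mixed factors distort the comparison
```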
Hi @s3alfisc, thank you for the extensive response, I am learning a lot. Based on your feedback, I will include this step prior to running the main WY algorithm:

```python
df = fit._N - fit._k if fit._vcov_type in ["iid", "hetero"] else fit._G - 1
p_vals_boot = 2 * (1 - t.cdf(np.abs(t_boot), df))
```

That way it will at least be clearer to follow the steps. I am also going to use the t-stats obtained from the wildboottest package.
Sounds like a good plan to me! Let me know when I should take another look =)
Going to close this pull request and refer to it in a new one.
In response to #485, I followed the algorithm given in the appendix of the paper. The only difference is that we work with t-statistics instead of p-values directly.
A lot of code is copied from the rwolf function. I've removed the randomization option and kept only the wild bootstrap, because I was getting strange results with randomization inference. I did not look into it deeply, however, so it could be a simple issue.
For testing, I wrote a private "slow" function that follows the algorithm given in the above paper quite literally. It's not ideal, but I do not have access to Stata. I did find @s3alfisc's experimental R package, and that could maybe also be used as a baseline?
Cheers