Figure 5.3 appears to be total log revenue rather than average, conflicts with text; model uses mean(log(revenue)) rather than log(mean(revenue)) #17

Open
shane-kercheval opened this issue Dec 28, 2020 · 5 comments

Comments

@shane-kercheval

Pg 143/144 & https://github.com/TaddyLab/BDS/blob/master/examples/paidsearch.R

Text:

Figure 5.3 shows the log difference between average revenues in each group.

Caption:

The log-scale average revenue difference ..

In the code, however, both plots use totalrev and are created before semavg is defined.

The total and average log differences produce the same pattern on different scales, but the mismatch initially confused me as I walked through the code and example.


Relatedly, let's assume the graphs plot the mean instead of the total, so that they match the model.

The graphs first take the average (or the total, in the current code) and then take the log of that average, i.e. log(mean(revenue)).

The model uses y from semavg, which takes the log first and then the mean. In the code, y is defined as y=mean(log(revenue)).

Whether we use sum or mean in the model, it seems like we would want to take the log after the mean (or sum). This seems especially true if we were going to use sum rather than mean.
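To make the order-of-operations point concrete, here is a minimal sketch on simulated data (not the eBay data; the `revenue` vector below is a made-up stand-in for the daily rows of one DMA and period). It only illustrates how the three aggregations relate to one another.

```r
# Simulated daily revenue for one hypothetical DMA-period cell (illustration only)
set.seed(1)
revenue <- rlnorm(30, meanlog = 10, sdlog = 0.5)
n <- length(revenue)

mean_of_log <- mean(log(revenue))  # what y = mean(log(revenue)) computes
log_of_mean <- log(mean(revenue))  # what the plots compute (log after averaging)
log_of_sum  <- log(sum(revenue))   # log after totalling
sum_of_log  <- sum(log(revenue))   # log before totalling

# log is concave, so by Jensen's inequality mean(log(x)) <= log(mean(x))
print(mean_of_log <= log_of_mean)

# log(sum) differs from log(mean) only by the additive constant log(n)
print(all.equal(log_of_sum, log_of_mean + log(n)))

# sum(log) is mean(log) multiplied by n, so it scales with the number of rows
print(all.equal(sum_of_log, n * mean_of_log))
```

The gap between mean(log(revenue)) and log(mean(revenue)) grows with the spread of revenue within a cell, which is one reason the two versions of y can give different coefficients.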


Original Code (mean(log(revenue)))

library(data.table)
sem <- as.data.table(sem)
sem_avg_log <- sem[, 
			list(d=mean(1-search.stays.on), y=mean(log(revenue))), 
			by=c("dma","treatment_period")]
setnames(sem_avg_log, "treatment_period", "t") # names to match slides
sem_avg_log <- as.data.frame(sem_avg_log)
coef(glm(y ~ d*t, data=sem_avg_log))['d:t']

gives -0.006586852


log(mean(revenue)):

sem_log_avg <- sem[, 
			list(d=mean(1-search.stays.on), y=log(mean(revenue))), 
			by=c("dma","treatment_period")]
setnames(sem_log_avg, "treatment_period", "t") # names to match slides
sem_log_avg <- as.data.frame(sem_log_avg)
coef(glm(y ~ d*t, data=sem_log_avg))['d:t']

gives -0.005775498


If we were to use sum rather than mean and then take the log, i.e. log(sum(revenue)):

sem_log_sum <- sem[, 
			list(d=mean(1-search.stays.on), y=log(sum(revenue))), 
			by=c("dma","treatment_period")]
setnames(sem_log_sum, "treatment_period", "t") # names to match slides
sem_log_sum <- as.data.frame(sem_log_sum)
coef(glm(y ~ d*t, data=sem_log_sum))['d:t']

gives -0.005775498, which is the same as log(mean(revenue))
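This match is what one would expect if every DMA has the same number of daily rows within a given period. That is an assumption about the data, not something checked here. As a sketch, write n_{it} for the number of rows for DMA i in period t (with t coded 0/1), r_{itk} for the individual revenues, and \bar r_{it} for their mean:

```latex
\begin{align*}
\log\Big(\sum_{k=1}^{n_{it}} r_{itk}\Big) &= \log(n_{it}) + \log(\bar r_{it}) \\
\text{if } n_{it} = n_t \text{ for all } i:\quad
\log(n_t) &= \log(n_0) + t\,\big[\log(n_1) - \log(n_0)\big], \qquad t \in \{0, 1\}
\end{align*}
```

Under that assumption the extra term depends only on t, so it shifts the intercept and the coefficient on t and leaves the d and d:t coefficients unchanged.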


If we were to do sum(log(revenue)), which would clearly be wrong because the control is a larger group, then we'd get -0.2534986...


Is there a reason we should specifically use mean(log(revenue)) rather than log(mean(revenue))?

@mataddy
Member

mataddy commented Feb 8, 2021

@shane-kercheval many thanks for this. Apologies for the delayed reply; I'm revising the book for a new edition (will be pretty cool; we're making it into a much more readable full-service text) and I just got to this section for revision.

You are absolutely correct! I got the order of operations mixed up, and the plot descriptions as well. The fixed analysis script is:

```
library(data.table)
ebay <- as.data.table(ps)
ebay <- ebay[, list(ssm.turns.off=mean(1-search.stays.on),
                    revenue=mean(revenue)),
             by=c("dma","treatment_period")]
setnames(ebay, "treatment_period", "post.treat")
ebay <- as.data.frame(ebay)
head(ebay)

# run the DiD analysis
did <- glm(log(revenue) ~ ssm.turns.off*post.treat, data=ebay)
coef(did)

library(sandwich)
library(lmtest)
coeftest(did, vcov=vcovCL(did, cluster=ebay$dma))
```

Results are exactly as you describe. I've also attached an updated draft section in case you are curious. Comments/errata are highly welcome :-)
draft.pdf

@joshualeond

@mataddy, maybe not the best place for this question but do you have an idea of when the 2nd edition could be released? Thanks!

@mataddy
Member

mataddy commented May 1, 2021

Soon! Chapters are with production now. I'd say summer if I'm optimistic.

@joshualeond

I'm at risk of being a bit obnoxious here on Github but was curious if there was an update on the 2nd edition?

@mataddy
Member

mataddy commented Nov 11, 2021

Soon! It's with production, so hopefully early 2022.
