# Introduction
Bayesian analysis is now fairly common in applied work. It is no longer a surprising thing to see it utilized in non-statistical journals, though it is still fresh enough that many researchers feel they have to put 'Bayesian' in the title of their papers when they implement it. However, to be clear, one doesn't conduct a Bayesian analysis per se. A Bayesian logistic regression is still just logistic regression. The *Bayesian* part comes into play with the perspective on probability that one uses to interpret the results, and in how the estimates are arrived at.
The Bayesian approach itself is very old at this point. Bayes and Laplace started the whole shebang in the 18^th^ and 19^th^ centuries[^evenolder], and even the modern implementation of it has its foundations in the 30s, 40s, and 50s of the last century[^old]. So while it may still seem somewhat newer relative to more common techniques, much of the groundwork has long since been hashed out, and there is no more need to justify a Bayesian analysis than there is to justify the standard maximum likelihood or other approaches[^justify]. While there are many possible reasons why the Bayesian approach did not catch on until relatively recently, perhaps the biggest is simply computational power. Bayesian analysis requires an iterative and time-consuming estimation approach that simply wasn't viable for most applied researchers until modern computing. But nowadays, one can conduct such an analysis quite easily, even on a laptop.
The Bayesian approach to data analysis requires a different way of thinking about things, but its implementation can be seen as an extension of traditional approaches. In fact, as we will see later, it incorporates the very likelihood one uses in standard statistical techniques. The key difference regards the notion of probability, which, while different from that of Fisherian or frequentist statistics, is actually more akin to how the average Joe thinks about probability. Furthermore, p-values and intervals will have the interpretation that many applied researchers incorrectly think their current methods provide. On top of this, one gets a very flexible toolbox that can handle many complex analyses. In short, the reason to engage in Bayesian analysis is that it has a lot to offer and can potentially handle whatever you throw at it.
As we will see shortly, one must also get used to thinking about distributions rather than fixed points. With Bayesian analysis, we are not so much making guesses about specific values, as in the traditional setting, as trying to understand the limits of our knowledge and getting a healthy sense of the uncertainty of those guesses.
## Bayesian Probability
This section will probably be about as formal as this document gets, and will be very minimal even then. The focus will still be on conceptual understanding, which will then be illustrated with a by-hand example in the next section.
### Conditional probability & Bayes theorem
Bayes theorem is illustrated in terms of probability as follows:
$$p(\mathcal{A}|\mathcal{B}) = \frac{p(\mathcal{B}|\mathcal{A})p(\mathcal{A})}{p(\mathcal{B})}$$
In short, we are attempting to ascertain the conditional probability of $\mathcal{A}$ given $\mathcal{B}$ based on the conditional probability of $\mathcal{B}$ given $\mathcal{A}$ and the respective probabilities of $\mathcal{A}$ and $\mathcal{B}$. This is perhaps not altogether enlightening in and of itself, so we will frame it in other ways, and for the upcoming depictions we will ignore the denominator[^denominator].
$$p(hypothesis|data) \propto p(data|hypothesis)p(hypothesis)$$
In the above formulation[^propto], we are trying to obtain the probability of a hypothesis given the evidence at hand (data) and our initial (prior) beliefs regarding that hypothesis. Here we are already able to see at least one key difference between Bayesian and classical statistics. The Bayesian approach provides a probability of the hypothesis given the data, which is something very desirable from a scientific perspective.
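To make this concrete, here is a minimal numeric sketch in R with just two competing hypotheses. The prior and likelihood values are entirely made up for illustration, and the denominator is simply the sum of the numerator across the hypotheses.

```r
# Made-up prior beliefs and likelihoods for two competing hypotheses.
prior      <- c(H1 = 0.5, H2 = 0.5)    # p(hypothesis)
likelihood <- c(H1 = 0.8, H2 = 0.3)    # p(data | hypothesis) for the observed data

# Numerator of Bayes theorem for each hypothesis.
unnormalized <- likelihood * prior

# The denominator p(data) is the sum of the numerator over all hypotheses.
posterior <- unnormalized / sum(unnormalized)

posterior
#        H1        H2
# 0.7272727 0.2727273
```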
Here is yet another way to consider this:
$$posterior \propto likelihood * prior$$
For this depiction, let us consider a standard regression coefficient $b$[^parameters]. Here we have a prior belief about $b$ expressed as a probability distribution. As a preliminary example, we will assume that perhaps the distribution is normal, centered on some value $\mu_b$ and with some variance $\sigma_b^2$. The likelihood here is exactly the same one used in classical statistics: if $y$ is our variable of interest, then the likelihood is $p(y|b)$, as in the standard regression approach using maximum likelihood estimation. What we end up with in the Bayesian context, however, is not a specific value of $b$ that would make the data most likely, but a probability distribution for $b$ that serves as a weighted combination of the likelihood and prior. Given that *posterior distribution* for $b$, we can then get the mean, median, 95% *credible interval*[^credible], and a whole host of other statistics that might be of interest to us.
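As a rough sketch of how this weighting plays out, the following R code grid-approximates the posterior for a single coefficient $b$ under a normal prior, using simulated data and treating the residual standard deviation as known purely to keep things short. None of the specific numbers matter; the point is simply that the posterior mean, median, and 95% credible interval all fall directly out of the posterior distribution.

```r
# Grid approximation of: posterior proportional to likelihood * prior,
# for a single regression coefficient b (no intercept, residual sd known).

set.seed(123)
n <- 50
x <- rnorm(n)
y <- 0.5 * x + rnorm(n, sd = 1)          # simulated data with true b = 0.5

b_grid <- seq(-2, 2, length.out = 2000)  # candidate values of b

# Log prior: b ~ Normal(mu_b = 0, sigma_b = 1).
log_prior <- dnorm(b_grid, mean = 0, sd = 1, log = TRUE)

# Log likelihood: the same likelihood used in maximum likelihood estimation.
log_lik <- sapply(b_grid, function(b) sum(dnorm(y, mean = b * x, sd = 1, log = TRUE)))

# Combine and normalize on the log scale for numerical stability.
log_post <- log_lik + log_prior
post     <- exp(log_post - max(log_post))
post     <- post / sum(post)

# Summaries of the posterior distribution for b.
post_mean <- sum(b_grid * post)
cdf       <- cumsum(post)
post_med  <- b_grid[which.min(abs(cdf - 0.5))]
ci_95     <- b_grid[c(which.min(abs(cdf - 0.025)), which.min(abs(cdf - 0.975)))]

c(mean = post_mean, median = post_med, lower = ci_95[1], upper = ci_95[2])
```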
To summarize conceptually, we have some belief about the state of the world, expressed as a mathematical model (such as the linear model used in regression). The Bayesian approach provides an updated belief as a weighted combination of prior beliefs regarding that state, and the currently available evidence. In addition, there is the possibility of the current evidence overwhelming prior beliefs, or prior beliefs remaining largely intact in the face of scant evidence.
$$\text{updated belief} = \text{current evidence} * \text{prior belief or evidence}$$
We will make these concepts more concrete in the next section.
[^evenolder]: Bayes theorem possibly [predates](https://en.wikipedia.org/wiki/Nicholas_Saunderson) Bayes himself by some accounts.
[^old]: Jeffreys, Metropolis etc.
[^justify]: Though some might suggest that the typical practice of hypothesis testing that comes with standard methods would need *more*.
[^denominator]: The denominator reflects the sum of the numerator for *all* values $\mathcal{A}$ might take on. For example:
$$p(\mathcal{A_i}|\mathcal{B}) = \frac{p(\mathcal{B}|\mathcal{A_i})p(\mathcal{A_i})}{p(\mathcal{B}|\mathcal{A_1})p(\mathcal{A_1}) + \dots + p(\mathcal{B}|\mathcal{A_n})p(\mathcal{A_n})}$$
[^propto]: The $\propto$ means 'proportional to'.
[^parameters]: If we think of $y$ as our outcome and $\Theta$ as our *set* of parameters, which includes all the regression coefficients $b$ and the variance $\sigma^2$, i.e. everything we need to estimate for the model: $$p(\Theta|y) = \frac{p(y|\Theta)p(\Theta)}{p(y)}$$
[^credible]: More on this later.