# Bayesian competition between hypotheses {#sec-bayes}
```{r include=FALSE}
source("../_startup.R")
set_chapter(28)
```
::: {.hidden .content-visible when-format="html"}
$$\newcommand{\Ptest}{\mathbb{P}}
\newcommand{\Ntest}{\mathbb{N}}
\newcommand{\Sick}{\cal S}
\newcommand{\Healthy}{\cal H}
\newcommand{\given}{\ |\!\!|\ }$$
:::
> *The test of a first-rate intelligence is the ability to hold two opposed ideas in the mind at the same time, and still retain the ability to function.* -- F. Scott Fitzgerald, [1936](https://quoteinvestigator.com/2020/01/05/intelligence/)
Our actions are guided by what we know about the state of things and the mechanisms that shape that state. Experience shows, however, that sometimes our so-called "knowledge" is only approximate or even dead wrong. Acknowledging this situation, in these Lessons we have avoided the word "*knowledge*" in favor of "*hypothesis*."
"To entertain a hypothesis" is to express a willingness to take that hypothesis into consideration as we try to make sense of the world. We can entertain more than one hypothesis about any given situation, but frequently we give more credence to one hypothesis than to another. It is as if multiple hypotheses are competing for credibility as explanations for the events seen in the world.
**Bayesian inference** provides a framework for calculating how we should balance the credence we give different hypotheses in the light of accumulating data. It does this *quantitatively*, using the language of probability to represent what in English we call variously `r glossary_term("credibility, credence, belief, faith, credit, credulity, confidence, certitude, and so on", "belief")`.
In this Lesson, we introduce Bayesian inference in a simple context: a competition between two opposing hypotheses. Later, in Lesson [-@sec-NHT], we will use the Bayesian framework to illuminate two other widely-used forms of statistical inference: null hypothesis testing and the Neyman-Pearson framework.
## Two hypotheses in the light of data
Bayesian inference is applicable to many situations. We will use the familiar one of health. In particular, imagine a disease or condition that a person might or might not have, for instance, COVID or breast cancer. The two competing hypotheses are that you are sick or you are healthy. Medical diagnosis generally involves more than two hypotheses: a range of possible illnesses or conditions. But, in this introduction, we will keep things simple.
We will denote the two hypotheses---sick or healthy---as $\Sick$ and $\Healthy$.
A medical test (such as mammography or an at-home COVID test) is a common source of data to inform the relative credibility of the two hypotheses. Many such medical tests are arranged to produce a binary outcome: either a positive ($\Ptest$) or a negative ($\Ntest$) result. By convention, a *positive* test result points toward the subject of the test being sick and a *negative* test points toward being healthy.
The test produces a definite **observable result** that can have one, and only one, outcome: $\Ptest$ or $\Ntest$. In contrast, $\Sick$ and $\Healthy$ are **hypotheses** that are competing for `r glossary_term("credulity", "belief")`. We can entertain both hypotheses at the same time. A typical situation in medical testing is that a $\Ptest$ result triggers additional tests, for instance a biopsy of tissue. It's often the case that the additional tests contradict the original test. A biopsy performed after a $\Ptest$ often turns out to be negative. In many situations, "often" means "the large majority of the time."
Confusing an observable result with a hypothesis leads to faulty reasoning. The fault is not necessarily obvious. For example, it seems reasonable to say that "a positive test implies that the patient is sick." In the formalism of logic, this could be written $\Sick\impliedby\Ptest$. One clue that this simple reasoning is wrong is that $\Ptest$ and $\Sick$ are different kinds of things: one is an observable and the other is a hypothesis.
Another clue that $\Sick\impliedby\Ptest$ is fallacious comes from causal reasoning. A positive test does not cause you to become sick. To the contrary, sickness creates the physical or biochemical conditions that cause the test to come out $\Ptest$. And we know that $\Ptest$s can occur even among the healthy. (These cases are called "**false positives**".)
Bayesian inference improves on the iffy logic of implication when considering which of two hypotheses---$\Sick$ or $\Healthy$---is to be `r glossary_term("preferred", "belief")` based on the observables.
## Probability and likelihood
One step toward a better form of reasoning is to replace the hard certainty of logical implication with something softer: probability and likelihood. Instead of $\Ptest \impliedby \Sick$, it's more realistic to say that the likelihood of $\Ptest$ is high [Recall from Lesson [-@sec-likelihood] that a likelihood is the probability of observing a particular outcome---$\Ptest$ here---in a world where a hypothesis---$\Sick$ here---is true.]{.aside} when the patient is $\Sick$. Or, in the notation of `r glossary_term("likelihood")`:
$$p(\Ptest\given\Sick)\ \text{ is high, say, 0.9.}$$ {#eq-sick-likelihood}
If the test is any good, a similar likelihood statement applies to healthy people and their test outcomes:
$$p(\Ntest\given\Healthy)\ \text{is also high, say, 0.9.}$$ {#eq-healthy-likelihood}
These two likelihood statements represent well, for instance, the situation with mammography to detect breast cancer in women. Unfortunately, although both statements are correct, neither is *directly* relevant to a woman or her doctor interpreting a test result. Instead, the appropriate interpretation of $\Ptest$ comes from the answer to this question:
> Suppose your test result is $\Ptest$. To what extent should you believe that you are $\Sick$? That is, how much `r glossary_term("credence", "belief")` should you give the $\Sick$ hypothesis, as opposed to the competing hypothesis $\Healthy$?
It's reasonable to quantify the `r glossary_term("extent of credibility or belief", "belief")` in terms of probability. Doing this, the above question becomes one of finding $p(\Sick\given\Ptest)$. Also relevant, at least for a woman getting a $\Ntest$ result, is the probability $p(\Healthy\given\Ntest)$.
Using probabilities to encode the `r glossary_term("extent of credibility", "belief")` has the benefit that we can compute some things from others. For example, from $p(\Sick\given\Ptest)$ we can easily calculate $p(\Healthy\given\Ptest)$: the relationship is $$p(\Healthy\given\Ptest) = 1 - p(\Sick\given\Ptest) .$$ Notice that *both* of these probabilities have the observation $\Ptest$ as the given. Similarly, $p(\Sick\given\Ntest) = 1 - p(\Healthy\given\Ntest)$. Both of these have the observation $\Ntest$ as given.
Now consider the likelihoods as expressed in Statements [-@eq-sick-likelihood] and [-@eq-healthy-likelihood]. There is no valid calculation on the likelihoods that is similar to the probability calculations in the previous paragraph.
$1-p(\Ptest\given\Sick)$ is not necessarily even close to $p(\Ptest\given\Healthy)$. To see why, consider the metaphor about planets made in @sec-planets-and-hypotheses. Planet $\Sick$ is where $p(\Ptest\given\Sick)$ can be tabulated, but $p(\Ptest\given\Healthy)$ is about the conditions on Planet $\Healthy$. Being on different planets, the two probabilities have no simple relationship. Let's emphasize this.
i. In $p(\Ptest\given\Sick)$, we *stipulate* that you are $\Sick$ and ask how likely would be a $\Ptest$ outcome under the $\Sick$ condition.
ii. In $p(\Sick\given\Ptest)$, we know the observed test outcome is $\Ptest$ and we want to express how strongly we should believe in hypothesis $\Sick$.
To avoid confusing (i) and (ii), we will write (i) using a different notation, one that emphasizes that we are talking about a ${\cal L}\text{ikelihood}$. Instead of $p(\Ptest\given\Sick)$, we will write ${\cal L}_\Sick(\Ptest)$. Read this as "the *likelihood*, on planet $\Sick$, of observing a $\Ptest$ result." This can also be stated (with greater dignity) in either of these ways: "the likelihood, under the assumption of $\Sick$, of observing a $\Ptest$ result," or "stipulating that the patient is $\Sick$, the likelihood of observing $\Ptest$." $p(\Ptest\given\Sick)$ and ${\cal L}_\Sick(\Ptest)$ are just two names for the same quantity, but ${\cal L}_\Sick(\Ptest)$ is a reminder that this quantity is a likelihood.
## Prior and posterior probability
Recall the story up to this point: You go to the doctor's office and are going to get a test. It matters greatly *why* you are going. Is it just for a scheduled check-up, or are you feeling symptoms or seeing signs of illness?
In the language of probability, this *why* is summarized by what's called a "**prior probability**," which we can write $p(\Sick)$. It's called a "prior" because it's relevant *before* you take the test. If you are going to the doctor for a regular check-up, the prior probability is small, no more than the *prevalence* of the sickness in the population you are a part of. However, if you are going because of signs or symptoms, the prior probability is larger, although not necessarily large in absolute terms.
Your objective in going to the doctor and getting the test is to get more information about whether you might be sick. We express this as a "**posterior probability**," that is, your revised probability of being $\Sick$ once you know the test result.
The point of Bayesian reasoning is to use new observations to turn a prior probability into a posterior probability. That is, the new observations allow you to **update** your previous idea of the probability of being $\Sick$ based on the new information.
There is a formula for calculating the posterior probability. The formula can be written most simply if both the prior and the posterior are represented as *odds* rather than as probability. Recall that if the probability of some outcome is $p(\text{outcome})$, then the *odds* of that outcome are $$\text{Odds}(\text{outcome}) \equiv \frac{p(\text{outcome})}{1 - p(\text{outcome})} .$$
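If you want to experiment with this conversion yourself, here is a minimal sketch in plain R (no packages assumed); the variable names are ours, chosen just for illustration:
```{r}
# Converting between probability and odds with plain arithmetic
p <- 0.01                    # a probability, e.g., a prevalence of 1%
odds <- p / (1 - p)          # probability -> odds: 1/99, about 0.0101
p_back <- odds / (1 + odds)  # odds -> probability: recovers 0.01
c(odds = odds, probability = p_back)
```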
{{< include LearningChecks/L28/LC28-04.qmd >}}
For future reference, here is the formula for the posterior odds. We present and use it here, and will explain its origins in the next sections of this Lesson. It is called `r glossary_term("Bayes' Rule")`.
$$\text{posterior odds of }\Sick = \frac{\cal{L}_\Sick(\text{test result})}{\cal{L}_\Healthy(\text{test result})}\ \times \text{prior odds of }\Sick$$ {#eq-bayes-rule-odds}
Formula [-@eq-bayes-rule-odds] involves two hypotheses and one observed test result. Each hypothesis contributes a likelihood for the observed test result. The relative credit we give the hypotheses is measured by the odds. If the odds of $\Sick$ are greater than 1, $\Sick$ is preferred. If the odds of $\Sick$ are less than 1, $\Healthy$ is preferred. The prior odds apply *before* the test result is observed; the posterior odds apply *after* the test result is known.
The two hypotheses---$\Sick$ or $\Healthy$---are competing with one another. The quantitative representation of this competition is called the `r glossary_term("Likelihood ratio")`.
$$\textbf{likelihood ratio:}\ \ \frac{\cal{L}_\Sick(\text{test result})}{\cal{L}_\Healthy(\text{test result})} .$$
Let's illustrate using the situation of a 50-year-old woman getting a regularly scheduled mammogram. Before the test, in other words, *prior* to the test, her probability of having breast cancer is low, say 1%. Or, reframed as *odds*, 1/99.
It turns out that the outcome of the test is $\Ptest$. Based on this, the formula lets us calculate the *posterior odds*. Since the test result is $\Ptest$, the relevant likelihoods are $\cal L_\Sick(\Ptest)$ and $\cal L_\Healthy(\Ptest)$. Referring to Statements [-@eq-sick-likelihood] and [-@eq-healthy-likelihood] in the previous section, and noting that $\cal L_\Healthy(\Ptest) = 1 - p(\Ntest\given\Healthy)$, these are
$$\cal L_\Sick(\Ptest) = 0.90 \ \text{and}\ \ \cal L_\Healthy(\Ptest) = 0.10 .$$
Consequently the likelihood ratio is
$$\frac{\cal L_\Sick(\Ptest)}{\cal L_\Healthy(\Ptest)} = \frac{0.90}{0.10} = 9 .$$
{{< include LearningChecks/L28/LC28-02.qmd >}}
The formula says to find the posterior odds by multiplying the prior odds by the likelihood ratio. The prior odds were 1/99, so the posterior odds are $9/99 = 0.091$. This is in the format of odds, but most people would prefer to reformat it as a probability. This is easily done: the posterior probability of $\Sick$ is $\frac{0.091}{1 + 0.091} = 0.083$.
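As a check on the arithmetic, here is a minimal sketch of the same calculation in plain R. The helper `posterior_odds()` is a name we made up for illustration; it is not from any package.
```{r}
# Bayes' Rule in odds form, applied to the mammography example from the text
posterior_odds <- function(prior_odds, L_sick, L_healthy) {
  (L_sick / L_healthy) * prior_odds
}
prior_odds <- 1 / 99                       # prior probability of 1%, written as odds
post_odds  <- posterior_odds(prior_odds, L_sick = 0.90, L_healthy = 0.10)
post_prob  <- post_odds / (1 + post_odds)  # reformat the odds as a probability
c(posterior_odds = post_odds, posterior_probability = post_prob)  # about 0.091 and 0.083
```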
Perhaps this is surprising to you. The posterior probability is small, even though the woman had a $\Ptest$ result. This surprise illustrates why it is so important to understand Bayesian calculations.
{{< include LearningChecks/L28/LC28-03.qmd >}}
## Basis for Bayes' Rule
To see where Bayes' Rule comes from, let's look back into the process of development of the medical test that gives our observable, either $\Ptest$ or $\Ntest$.
Test developers start with an idea about a physically detectable condition that can indicate the presence of the disease of interest. For example, observations that cancerous lumps are denser than other tissue probably lie at the origins of the use of x-rays in the mammography test. Similarly, at-home COVID tests detect fragments of proteins encoded by the virus's genetic sequence. In other words, there is some physical or biochemical science behind the test. But this is not the only science.
The test developers also carry out a trial to establish the performance of the test. One part of doing this is to construct a group of people who are known to have the disease. Each of these people is then given the test in question and the results tabulated to find the fraction of people in the group who got a $\Ptest$ result. This fraction is called the test's `r glossary_term("sensitivity")`. Estimating sensitivity is not necessarily quick or easy; it may require use of a more expensive or invasive test to confirm the presence of the disease, or even waiting until the disease becomes manifest in other ways and constructing the $\Sick$ group retrospectively.
At the same time, the test developers assemble a group of people known to be $\Healthy$. The test is administered to each person in this group and the fraction who got a $\Ntest$ result is tabulated. This is called the `r glossary_term("specificity")` of the test.
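To make the tabulation concrete, here is a minimal sketch in plain R. The two tiny trial groups are invented for illustration only; real trials involve far more people.
```{r}
# Tabulating sensitivity and specificity from (made-up) trial groups
Sick_trial    <- data.frame(test = c("P","P","P","P","P","P","P","P","P","N"))
Healthy_trial <- data.frame(test = c("N","N","N","N","N","N","N","N","P","P"))
sensitivity <- mean(Sick_trial$test == "P")     # fraction of the Sick group with a P result
specificity <- mean(Healthy_trial$test == "N")  # fraction of the Healthy group with an N result
c(sensitivity = sensitivity, specificity = specificity)  # 0.9 and 0.8 in this toy example
```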
```{r echo=FALSE}
set.seed(111)
group_sim1 <- datasim_make(
sick <- bernoulli(n, prob=0.05, labels=c("H", "S")),
test <- ifelse(sick=="S",
bernoulli(n, p=0.9, labels=c("N", "P")),
bernoulli(n, p=0.2, labels=c("N", "P"))),
x <- runif(n),
y <- runif(n),
size <- runif(n, min=1, max=1.1),
xsplit <- ifelse(sick=="H", x, runif(n, 1.1, 1.2)),
color <- categorical(n, "steelblue", "skyblue", "blue", "lightblue", "purple", "darkblue"),
pcolor <- ifelse(test=="P", "red", color),
ysplit <- ifelse(test=="P", runif(n, 1.1, 1.4), y),
shape <- 16L + (sick=="H")
)
Your_group <- group_sim1 |> take_sample(n=5000)
```
@fig-development-groups shows schematically the people in the $\Sick$ group alongside the people in the $\Healthy$ group. The large majority of the people in $\Sick$ tested $\Ptest$: 90% in this illustration. Similarly, the large majority of people in $\Healthy$ tested $\Ntest$: 80% in this illustration.
::: {#fig-development-groups}
```{r echo=FALSE}
#| layout-ncol: 2
#| fig-subcap:
#| - The $\Sick$ group used for measuring the sensitivity.
#| - The $\Healthy$ group used for measuring the specificity
Your_group |>
filter(sick == "S") |>
gf_point(y ~ x, shape = 16, size = ~ size,
color = ~ pcolor, alpha=0.5) +
scale_color_identity() +
theme_void() + theme(legend.position = "none")
Your_group |>
filter(sick == "H") |>
head(200) |>
gf_point(y ~ x, shape = 17, size = ~ size,
color = ~ pcolor, alpha=0.5) +
scale_color_identity() +
theme_void() + theme(legend.position = "none")
```
The two trial groups used in assessing the performance of the test. Shape indicates the person's condition (circles for $\Sick$, triangles for $\Healthy$). Color indicates the test result (red for $\Ptest$).
:::
The sensitivity and specificity are both *likelihoods*. Sensitivity is the probability of a $\Ptest$ outcome *given* that the person is $\Sick$. Specificity is the probability of a $\Ntest$ outcome *given* that the person is $\Healthy$. To use the planet metaphor of Lesson [-@sec-hypo-thinking], the specificity is calculated on a planet where everyone is $\Healthy$. The sensitivity is calculated on a different planet, one where everyone is $\Sick$.
Now turn to the use of the test among people who might be sick or might be healthy. We don't know the status of any individual person, but we do have some information about the population as a whole: the fraction of the population who have the disease. This fraction is called the `r glossary_term("prevalence")` of the disease and is known from demographic and medical records.
For the sake of illustration, @fig-prevalence shows a simulated population of 1000 people with a prevalence of 5%. That is, fifty dots are circles, the rest triangles. Each of the 1000 simulated people was given the test, with the result shown in color (red is $\Ptest$).
::: {#fig-prevalence}
```{r echo=FALSE}
Your_group |>
head(1000) |>
gf_point(y ~ x, shape = ~shape, size = ~ size,
color = ~ pcolor, alpha=0.5) +
scale_shape_identity() +
scale_color_identity() +
theme_void() + theme(legend.position = "none")
```
A *simulated* population based on the prevalence of the disease (5%) and the actual sensitivity and specificity of the test.
:::
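If you would like to build a population like this yourself, here is a minimal sketch using base R, assuming a prevalence of 5%, a sensitivity of 90%, and a specificity of 80%. It is a stand-in for the simulation actually used to draw the figures, not the same code.
```{r}
# Simulating health status and test results for 1000 people (base R)
set.seed(101)
n <- 1000
sick <- sample(c("S", "H"), n, replace = TRUE, prob = c(0.05, 0.95))
test <- ifelse(sick == "S",
               sample(c("P", "N"), n, replace = TRUE, prob = c(0.90, 0.10)),  # sensitivity 90%
               sample(c("P", "N"), n, replace = TRUE, prob = c(0.20, 0.80)))  # specificity 80%
table(sick, test)
```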
The nature of a simulation is that we know all the salient details about the individual people, in particular their health status and their test result. This makes it easy to calculate the posterior probability, that is, the fraction of people with a $\Ptest$ who are actually $\Sick$. To illustrate, we will move people into a quadrant of the field depending on their status:
::: {#fig-split-up}
```{r echo=FALSE}
#| fig-cap: "The same people as in @fig-prevalence repositioned according to their health status and test result."
#| layout-ncol: 2
#| fig-subcap:
#| - "Splitting into healthy and sick sub-groups."
#| - "Further splitting the subgroups according to the test result."
Your_group |>
head(1000) |>
gf_point(y ~ xsplit, shape = ~ sick, size = ~ size,
color = ~ pcolor, alpha=0.5) +
scale_color_identity() +
theme_void() + theme(legend.position = "none")
Your_group |>
head(1000) |>
gf_point(ysplit ~ xsplit, shape = ~ sick, size = ~ size,
color = ~ pcolor, alpha=0.5) +
scale_color_identity() +
theme_void() + theme(legend.position = "none")
```
:::
@fig-split-up(a) illustrates the prevalence of the disease, that is, the fraction of people with the disease. This is the fraction of people moved to the right side of the field, all of whom are triangles. Almost all of the $\Sick$ people had a $\Ptest$ result, reflecting a test sensitivity of 90%.
@fig-split-up(b) further divides the field. All the people with a $\Ptest$ result have been moved to the top of the field. Altogether there are four clusters of people. On the bottom left are the healthy people who got an appropriate $\Ntest$ result. On the top right are the $\Sick$ people who also got the appropriate test result: $\Ptest$. On the bottom right and top left are people who were mis-classified by the test.
The group on the bottom right is $\Sick$ people with a $\Ntest$ result. These test results are called `r glossary_term("false negatives")`---"false" because the test gave the wrong result, "negative" because that result was $\Ntest$.
Similarly, the group on the top left is mis-classified. All of them are $\Healthy$, but nonetheless they received a $\Ptest$ result. Such results are called `r glossary_term("false positives")`. Again, "false" indicates that the test result was wrong, "positive" that the result was $\Ptest$.
Now we are in a position to calculate, simply by counting, the posterior probability: that is, the probability that a person with a $\Ptest$ test result is actually $\Sick$.
```{r echo=FALSE}
Your_group |>
head(1000) |>
count(sick, test)
```
If you take the time to count, in @fig-split-up you will find 185 people who are $\Healthy$ but erroneously have a $\Ptest$ result. There are 49 people who are $\Sick$ and were correctly given a $\Ptest$ result. Thus, the *odds* of being $\Sick$ given a $\Ptest$ result are 49/185 = 0.26, which is equivalent to a probability of about 21%.
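Here is the same counting calculation written out as a minimal sketch in plain R, using the counts from the table above:
```{r}
# Posterior odds and probability from the simulation counts
true_positives  <- 49    # Sick people with a P result
false_positives <- 185   # Healthy people with a P result
odds_sick_given_P <- true_positives / false_positives                     # about 0.26
prob_sick_given_P <- true_positives / (true_positives + false_positives)  # about 0.21
c(odds = odds_sick_given_P, probability = prob_sick_given_P)
```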
Let's compare the counting result to the result from Bayes' Rule (@eq-bayes-rule-odds).
- Since the prevalence is 5%, the prior odds are 5/95.
- The likelihood of a $\Ptest$ result for $\Sick$ people is exactly what the sensitivity measures: 90%.
- The likelihood of a $\Ptest$ result for $\Healthy$ people is one minus the specificity: 1 - 80% = 20%.
Putting these three numbers together using @eq-bayes-rule-odds gives the posterior odds:
$$\underbrace{\frac{90\%}{20\%}}_\text{Likelihood ratio}\ \ \ \times\ \ \underbrace{\frac{5}{95}}_\text{prior odds} = \underbrace{\ 0.24\ }_\text{posterior odds}$$
The result from Bayes' Rule differs very little from the posterior odds we found by counting in the simulation. The difference is due to *sampling variation*; our sample from the simulation had size $n=1000$. Had we used a much larger sample, the counting result would converge on the result from Bayes' Rule.
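The Bayes' Rule calculation itself is a single line of arithmetic; here it is as a minimal sketch in plain R:
```{r}
# Checking the Bayes' Rule calculation for the 5%-prevalence population
likelihood_ratio <- 0.90 / 0.20   # sensitivity over (1 - specificity)
prior_odds <- 5 / 95              # prevalence of 5%, written as odds
likelihood_ratio * prior_odds     # posterior odds, about 0.24
```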
## Exercises
```{r eval=FALSE, echo=FALSE}
# need to run this in console after each change
emit_exercise_markup(
paste0("../LSTexercises/28-Bayes/",
c("Q28-101.Rmd",
"Q28-105.Rmd",
"Q28-106.Rmd")),
outname = "L28-exercise-markup.txt"
)
```
<!--
# Exercises from 2024 Math 300 at USAFA. Maybe revise in some way.
../LSTexercises/fromSummerDraft/bayes-odds-form.qmd
../LSTexercises/Lesson-33/Q33-4.Rmd
../LSTexercises/Lesson-34/Q34-1.Rmd
-->
{{< include L28-exercise-markup.txt >}}
## Short projects
```{r eval=FALSE, echo=FALSE}
emit_exercise_markup(
paste0("../LSTexercises/28-Bayes/",
c("Q28-301.Rmd",
"Q28-302.Rmd",
"Q28-303.Rmd",
"Q28-304.Rmd")),
outname = "L28-project-markup.txt"
)
```
{{< include L28-project-markup.txt >}}
## Enrichment topics
{{< include Enrichment-topics/ENR-28/Topic28-01.qmd >}}
{{< include Enrichment-topics/ENR-28/Topic28-02.qmd >}}
{{< include Enrichment-topics/ENR-28/Topic28-03.qmd >}}
{{< include Enrichment-topics/ENR-28/Topic28-04.qmd >}}
{{< include Enrichment-topics/ENR-28/Topic28-05.qmd >}}
- Accumulating evidence: the cycle of accumulation. Let's look at a biopsy that follows a mammogram, working through both a $\Ptest$ and a $\Ntest$ result.