---
title: "Computational Cognitive Science 2022-2023"
output:
  pdf_document: default
  word_document: default
  html_document: default
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
require(DirichletReg) # We'll be using Dirichlet distributions. gtools includes
                      # rdirichlet and ddirichlet, but not with log densities,
                      # which are useful here.
```

# Tutorial 4: Bayesian Model Comparison

## Nisbett and Wilson: Are people biased?

We previously used count data from a Nisbett and Wilson study to look at different approaches to parameter estimation. We will now use the same data to look at model comparison. Options A-D are identical consumer products arranged from left to right in a display. Observations are the number of participants who chose each item in the array.

---------------------------------------------
Choice              A       B     C      D
----------------- ------- ----- ------ ------
Observations        6       9    16     21
---------------------------------------------

```{r}
choice_data <- c(6,9,16,21)
```

Let's compare two models:

* M1: Uniform categorical distribution. This captures the hypothesis that people are unbiased in their selections.
* M2: Categorical distribution with Dirichlet $\alpha=2$ priors. This captures the hypothesis that there is bias in participants' choices. We aren't using Jeffreys priors because they attach substantial weight to extreme biases, which are unlikely.

We will now try to see which model is supported by the data. We'll ignore the fact that the answer is fairly obvious, and that a null hypothesis significance test would probably be adequate here.

**Question 1**: Are these descriptive or process models?

**Solution**: *Descriptive; there is no psychological content or account of why choices might look the way they do in either model.*

**Question 2**: We have chosen these models for simplicity. If our goal is to detect left/right bias, can you think of changes that make one or both models more reasonable, without adding much complexity?

**Solution**: *There are many possible answers. One issue is that there are many reasons to prefer M2 over M1 aside from left/right bias, so data favoring M2 doesn't imply there's a left/right bias overall: e.g., c(100,0,0,100) would strongly favor M2 while providing evidence against a left/right bias. If detecting left/right bias is our goal, we might collapse A and B into one group and C and D into another and use a beta-binomial distribution. We might also decide there's a range of biases so subtle that they don't matter, leading us to permit some variability in the binomial parameters under M1. We might also have intuitions about plausible biases that favor different priors, and those priors matter: if we aren't prepared to defend our choice of priors, we might not trust our marginal likelihoods.*

**Question 3**: What is the marginal log likelihood for M1?

**Solution**: *We don't need to integrate over a range of thetas because M1 commits to $\theta_i=.25$. A [multinomial distribution](https://en.wikipedia.org/wiki/Multinomial_distribution) describes the probability of independent choice counts. We can use R's built-in `dmultinom` function, which produces log probabilities directly.*

```{r}
dmultinom(choice_data,prob=c(.25,.25,.25,.25),log=TRUE)
```
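
*One optional way to double-check this number is to write the multinomial log pmf out by hand, $\log P(x) = \log n! - \sum_i \log x_i! + \sum_i x_i \log \theta_i$ with $\theta_i = .25$; the snippet below is a minimal sketch of that check using base R.*

```{r}
# Hand-rolled check of the multinomial log pmf under M1 (all theta_i = .25):
lfactorial(sum(choice_data)) - sum(lfactorial(choice_data)) +
  sum(choice_data * log(.25))
```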
**Question 4**: Finish implementing the importance sampler below.

```{r}
impSamp <- function(logTargD,  # log target density
                    propR,     # a function to sample proposals
                    logPropD,  # the log density for our proposal distribution
                    # Somewhat arbitrarily, we are computing
                    # the expectation of the 4th parameter
                    ef=function(x) {x[,4]}) {
  nSamps <- 100000 # The more the better
  proposals <- propR(nSamps)
  pDens <- logPropD(proposals)
  unnP <- logTargD(proposals)
  w <- exp(unnP-pDens)
  print(paste("Expected value of target function:",
              sprintf("%2.3f",sum(w*ef(proposals))/sum(w))))
  # The average importance weight estimates the normalizing constant of the
  # target, i.e. the marginal likelihood.
  print(paste("Log of average importance weight:",
              sprintf("%2.6f",log(sum(w)/nSamps))))
}

alpha <- 2

# Define logTargD, propR, logPropD. You can use ef to sanity-check --
# it returns the posterior expectation of the 4th parameter, which you can
# compute exactly as alpha_k/(sum_i alpha_i), where the alphas combine real
# counts and pseudocounts.
# Then call impSamp on these arguments.
```

**Solution**:

```{r}
prop_alpha <- c(alpha,alpha,alpha,alpha)
propR <- function(n) {rdirichlet(n,prop_alpha)}
logPropD <- function(p) {ddirichlet(p,prop_alpha,log=TRUE)}

# We want to compute the likelihood for each of the proposals, all in one go.
# dmultinom doesn't inherently support that, so we use apply to run dmultinom
# on each row/proposal.
likeli <- function(p) {apply(p,1,function(ps) {dmultinom(choice_data,prob=ps,log=TRUE)})}
m2_prior <- function(p) {ddirichlet(p,alpha=prop_alpha,log=TRUE)}

# Note that the prior shows up in both the proposal density and the target;
# this is the special case of importance sampling we discussed in lecture.
# We could remove those terms from both density functions since they cancel out.
logTargD <- function(p) {likeli(p) + m2_prior(p)}

impSamp(logTargD,propR,logPropD,function(x) {x[,4]})
```

*Because we are dealing with a conjugate prior, we can also compute the marginal likelihood directly, though deriving that is outside the scope of the course:*

```{r}
ml <- function(pd,d) {lgamma(sum(pd)) + lgamma(sum(d)+1) - lgamma(sum(pd)+sum(d)) +
                      sum(lgamma(pd+d)) - sum(lgamma(pd)+lgamma(d+1))}
ml(c(alpha,alpha,alpha,alpha),choice_data)
```

**Question 5**: What is the Bayes factor favoring the more complex model? How should we interpret this?

**Solution**: *The log Bayes factor is the M2 log marginal likelihood minus the M1 log marginal likelihood; exponentiating turns this log ratio back into a ratio. The numbers below may differ from yours because the importance sampler's output is stochastic.*

```{r}
exp(-9.47 - (-11.16))
```

*The Bayes factor favors the model that says people have a bias. We would have to believe, a priori, that the simpler model is about 5.4 times as probable in order to still prefer it in light of our data.*
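
*Since `ml` gives the exact M2 log marginal likelihood in this conjugate setting, an optional cross-check is to compute the Bayes factor from the exact quantities rather than the rounded sampler output; the sketch below reuses the `ml`, `alpha`, and `choice_data` objects defined above.*

```{r}
log_ml_m1 <- dmultinom(choice_data,prob=c(.25,.25,.25,.25),log=TRUE)
log_ml_m2 <- ml(c(alpha,alpha,alpha,alpha),choice_data)
exp(log_ml_m2 - log_ml_m1) # should roughly match the value above (about 5.4)
```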
**Question 6**: Which model is favored by the BIC? Does the BIC make sense to use here?

**Solution**: *Fitting $\alpha$ to maximize likelihood will converge to the same likelihood as setting the multinomial parameters directly. We could also argue that the effective number of parameters is the same as in the multinomial case: 3 rather than 4, since the final parameter is determined by the sum-to-one constraint.*

```{r}
mle_nll <- -2*dmultinom(choice_data,prob=choice_data/sum(choice_data),log=TRUE)
penalty <- 4*log(sum(choice_data)) # or 3; see above. We get the same qualitative result.
print(mle_nll+penalty) # BIC for M2
print(-2*dmultinom(choice_data,prob=c(.25,.25,.25,.25),log=TRUE)) # BIC for M1 (no free parameters, so no penalty)
```

*No. If we can compute marginal likelihoods, there's little point in using the BIC, and it's a bit peculiar to fit a prior in this way: conceptually the prior becomes superfluous, and we could have just fit the multinomial parameters directly.*
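
*For a concrete sense of how the two approaches compare, one optional sketch, under the standard assumption that the BIC approximates $-2$ times the log marginal likelihood (so that $\exp((\mathrm{BIC}_{M1}-\mathrm{BIC}_{M2})/2)$ gives a rough Bayes factor), reuses the quantities computed above:*

```{r}
# Sketch only: a BIC-based approximation to the Bayes factor favoring M2.
bic_m2 <- mle_nll + penalty
bic_m1 <- -2*dmultinom(choice_data,prob=c(.25,.25,.25,.25),log=TRUE)
exp((bic_m1 - bic_m2)/2) # compare with the exact Bayes factor computed earlier
```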