---
title: "Computational Cognitive Science 2022-2023"
output:
  pdf_document: default
  html_document: default
urlcolor: blue
---

# Tutorial 7: Multi-armed bandits

## Bandits

Today's tutorial is a tour of a classic multi-armed Bernoulli bandit task and some of the models we discussed in lecture.

First, let's define a function that will try a policy on a particular multi-armed bandit problem, and return the mean reward as well as a data frame containing all choices and rewards.

```{r}
banditTask <- function(policy,horizon,armProbs) {
  nArms <- length(armProbs)
  choices <- c()
  rewards <- c()
  for(t in 1:horizon) {
    # The policy sees the full history of choices and rewards so far
    nextChoice <- policy(choices,rewards,nArms)
    # Bernoulli reward drawn with the chosen arm's success probability
    nextReward <- rbinom(1,1,armProbs[nextChoice])
    choices <- append(choices,nextChoice)
    rewards <- append(rewards,nextReward)
  }
  # [[1]] is the mean reward, [[2]] is the trial-by-trial data
  list(mean(rewards),data.frame(choices=choices,rewards=rewards))
}
```

**Question 1**: What is the `horizon` argument? Our bandit policies are not given the value of `horizon`. Would we expect this to be the case for all bandit policies we discussed? If so, why? If not, which one(s) need to be aware of the horizon?

Next, we will define three policies: random, WSLS, and greedy.

```{r}
randomPolicy <- function(choices,rewards,nArms) {
  sample(1:nArms,1,replace=TRUE,prob=rep(1/nArms,nArms))
}

wslsPolicy <- function(choices,rewards,nArms) {
  if(length(choices) == 0) {
    # First trial: choose uniformly at random
    sample(1:nArms,1,replace=TRUE,prob=rep(1/nArms,nArms))
  } else {
    lastReward <- rewards[length(rewards)]
    lastChoice <- choices[length(choices)]
    if(lastReward==1) {
      # Win-stay: repeat the last choice
      lastChoice
    } else {
      # Lose-shift: choose uniformly among the other arms
      options <- (1:nArms)[-lastChoice]
      sample(options,1,replace=TRUE,prob=rep(1/(nArms-1),nArms-1))
    }
  }
}

# A lazy implementation; recomputing counts on every trial is inefficient
greedyPolicy <- function(alpha,beta) {
  function(choices,rewards,nArms) {
    ratios <- rep(0,nArms)
    for(i in 1:nArms) {
      # Posterior mean of each arm's reward probability under a Beta(alpha,beta) prior
      num <- sum(rewards[choices==i])+alpha
      den <- sum(choices==i)+alpha+beta
      ratios[i] <- num/den
    }
    # Break ties among the highest-rated arms uniformly at random
    validArms <- which(ratios == max(ratios))
    nV <- length(validArms)
    if(nV == 1) {
      validArms
    } else {
      sample(validArms,1,prob=rep(1/nV,nV))
    }
  }
}
```

**Question 2**: What are the memory requirements for each policy? How does the cost of computing the next choice change with the number of trials?

Let's take a Bernoulli bandit task with reward probabilities of 0, 1/4, 1/2, and 3/4, and see how WSLS compares to a greedy policy. Here's a single run, with a random policy for comparison (though we can straightforwardly compute the expected reward of a random policy).

```{r}
exampleArms <- c(0,.25,.5,.75)
randRes <- banditTask(randomPolicy,2000,exampleArms)[[1]]
wslsRes <- banditTask(wslsPolicy,2000,exampleArms)[[1]]
greedyRes <- banditTask(greedyPolicy(1,1),2000,exampleArms)[[1]]

randRes
wslsRes
greedyRes
```

Now let's look at distributions of reward over several runs of the task, with 200 trials each.

```{r}
nRuns <- 1000
greedyResBatch <- numeric(nRuns)
wslsResBatch <- numeric(nRuns)
for(i in 1:nRuns) {
  greedyResBatch[i] <- banditTask(greedyPolicy(0.001,0.001),200,exampleArms)[[1]]
  wslsResBatch[i] <- banditTask(wslsPolicy,200,exampleArms)[[1]]
}
hist(greedyResBatch,breaks=40)
hist(wslsResBatch,breaks=40)
```

**Question 3**: Is it surprising that the greedy policy outperformed WSLS? Give an example to illustrate why or why not.

**Question 4**: Why are there multiple modes in the histogram for the greedy strategy? What might we do to make the worse modes disappear?

**Question 5**: How would we implement an $\epsilon$-greedy model without writing much new code?
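
As a starting point for Question 5, here is one possible sketch, not the only approach: wrap an existing policy in a thin layer that explores uniformly at random with probability $\epsilon$ and otherwise defers to the wrapped policy. The function name `epsilonGreedyPolicy` and the arguments `epsilon` and `basePolicy` are introduced here purely for illustration.

```{r}
# Sketch only: epsilonGreedyPolicy, epsilon, and basePolicy are illustrative names.
epsilonGreedyPolicy <- function(epsilon, basePolicy) {
  function(choices, rewards, nArms) {
    if(runif(1) < epsilon) {
      # Explore: pick an arm uniformly at random
      sample(1:nArms, 1, replace=TRUE, prob=rep(1/nArms, nArms))
    } else {
      # Exploit: defer to the wrapped policy (e.g., the greedy policy above)
      basePolicy(choices, rewards, nArms)
    }
  }
}

# Example usage, reusing the functions defined above:
# banditTask(epsilonGreedyPolicy(0.1, greedyPolicy(1,1)), 200, exampleArms)[[1]]
```

Rerunning the batch comparison above with this wrapper is one way to check whether the extra exploration changes the shape of the greedy histogram.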
**Question 6**: What do you think is important to consider in a model of decision making in bandit tasks, e.g., phenomena it should be able to capture? Discuss.

**Question 7**: What might you do to make bandit tasks more realistic or representative of analogous tasks in everyday life? Discuss.