---
title: "Computational Cognitive Science 2025-2026"
output:
  pdf_document: default
  html_document: default
urlcolor: blue
---

# Tutorial 6: Multi-armed bandits

## Bandits

Today's tutorial is a tour of a classic multi-armed Bernoulli bandit task and some of the models we discussed in lecture.

First, let's define a function that will try a policy on a particular multi-armed bandit problem and return the mean reward as well as a dataframe containing all choices and rewards.

```{r}
banditTask <- function(policy,horizon,armProbs) {
  nArms <- length(armProbs)
  choices <- c()
  rewards <- c()
  for(t in 1:horizon) {
    nextChoice <- policy(choices,rewards,nArms)
    nextReward <- rbinom(1,1,armProbs[nextChoice])
    choices <- append(choices,nextChoice)
    rewards <- append(rewards,nextReward)
  }
  # Use list() rather than c() so the second element stays a data frame;
  # c() would flatten the data frame into separate list entries.
  list(mean(rewards),data.frame(choices=choices,rewards=rewards))
}
```

**Question 1**: What is the `horizon` argument? Our bandit policies are not given the value of `horizon`. Would we expect this to be the case for all bandit policies we discussed? If so, why? If not, which one(s) need to be aware of the horizon?

**Solution 1**: *The horizon is the total number of trials. Optimal bandit policies do need to know the horizon, as the tradeoff between exploration and exploitation changes as we get closer to the final trial.*

Next, we will define three policies: random, WSLS, and greedy.
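Before turning to those policies, note one baseline we can compute in closed form: a uniformly random policy's expected per-trial reward is simply the average of the arm probabilities. A quick standalone check (a hypothetical snippet; the arm probabilities here are chosen purely for illustration):

```{r}
# Expected reward of a uniformly random policy: each arm is chosen with
# probability 1/nArms, so E[reward] = mean(armProbs).
armProbs <- c(0, .25, .5, .75)
analytic <- mean(armProbs)   # 0.375

# Monte Carlo check: choose arms uniformly and draw Bernoulli rewards.
set.seed(1)
choices <- sample(seq_along(armProbs), 10000, replace = TRUE)
simulated <- mean(rbinom(10000, 1, armProbs[choices]))

analytic
simulated  # close to 0.375
```

Any learning policy worth considering should beat this baseline by concentrating its pulls on the better arms.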
```{r}
randomPolicy <- function(choices,rewards,nArms) {
  sample(1:nArms,1,replace=TRUE,prob=rep(1/nArms,nArms))
}

wslsPolicy <- function(choices,rewards,nArms) {
  if(length(choices) == 0) {
    sample(1:nArms,1,replace=TRUE,prob=rep(1/nArms,nArms))
  } else {
    lastReward <- rewards[length(rewards)]
    lastChoice <- choices[length(choices)]
    if(lastReward==1) {
      lastChoice
    } else {
      options <- (1:nArms)[-lastChoice]
      sample(options,1,replace=TRUE,prob=rep(1/(nArms-1),nArms-1))
    }
  }
}

# - A lazy implementation; recomputing counts is inefficient
# - This is a policy generator, which returns a policy with particular
#   hyperparameters alpha, beta of the Beta prior over each arm.
greedyPolicy <- function(alpha,beta) {
  function(choices,rewards,nArms) {
    ratios <- rep(0,nArms)
    for(i in 1:nArms) {
      num <- sum(rewards[choices==i])+alpha
      den <- sum(choices==i)+alpha+beta
      ratios[i] <- num/den
    }
    validArms <- which(ratios == max(ratios))
    nV <- length(validArms)
    # Guard against sample()'s length-1 behavior: sample(x,1) with a
    # single-element x would draw from 1:x instead of returning x.
    if(nV == 1) {
      validArms
    } else {
      sample(validArms,1,prob=rep(1/nV,nV))
    }
  }
}
```

**Question 2**: The greedy policy has hyperparameters $\alpha$ and $\beta$. Why can't we just initialize all ratios to 0 before any choices have been made?

**Solution 2**: *If we initialized all ratios to zero, a greedy policy would get stuck on the first arm that returns a reward. $\alpha=1$ and $\beta=1$ is a prior that is equivalent to having observed one success and one failure on each arm.*

**Question 3**: What does the `ratios` variable correspond to in the greedy policy (hint: have a look at the moments of the Beta distribution)?

**Solution 3**: *Each ratio $\frac{\alpha}{\alpha+\beta}$ (with observed successes and failures added to $\alpha$ and $\beta$ respectively) is the expected value/mean of the posterior Beta distribution, i.e., the expected reward probability of an arm.*

**Question 4**: What are the memory requirements for each policy? How does the computational cost of computing the next choice change with the number of trials?

**Solution 4**: *The random policy is completely memoryless.
The WSLS policy needs to track its last choice and whether it won, so 3 bits for 4 arms (2 bits for the choice, 1 for the outcome). The greedy policy needs to track the number of wins and losses for each arm. In all cases, the computational cost per choice is constant given a reasonable implementation, though our lazy greedy implementation recomputes counts from scratch, so its cost scales linearly with the number of trials.*

Let's take a Bernoulli bandit task with reward probabilities of 0, 1/4, 1/2, and 3/4, and see how WSLS compares to a greedy policy. Here's a single run, with a random policy for comparison (though we can straightforwardly compute the expected reward of a random policy).

```{r}
exampleArms <- c(0,.25,.5,.75)
randRes <- banditTask(randomPolicy,2000,exampleArms)[[1]]
wslsRes <- banditTask(wslsPolicy,2000,exampleArms)[[1]]
greedyRes <- banditTask(greedyPolicy(1,1),2000,exampleArms)[[1]]
randRes
wslsRes
greedyRes
```

Now let's look at distributions of reward over several runs of the task, with 200 trials each.

```{r}
nRuns <- 1000
randResBatch <- numeric(nRuns)
greedyResBatch <- numeric(nRuns)
wslsResBatch <- numeric(nRuns)
for(i in 1:nRuns) {
  randResBatch[i] <- banditTask(randomPolicy,200,exampleArms)[[1]]
  greedyResBatch[i] <- banditTask(greedyPolicy(1,1),200,exampleArms)[[1]]
  wslsResBatch[i] <- banditTask(wslsPolicy,200,exampleArms)[[1]]
}
hist(randResBatch,breaks=40, xlim=c(0,1))
hist(greedyResBatch,breaks=40, xlim=c(0,1))
hist(wslsResBatch,breaks=40, xlim=c(0,1))
```

**Question 5**: Is it surprising that the greedy policy outperformed WSLS? Give an example to illustrate why or why not.

**Solution 5**: *No, it isn't: a greedy policy will often converge to reliably picking the best arm, whereas WSLS will abandon the best arm every time it fails (1 time in 4).*

**Question 6**: Why are there multiple modes in the histogram for the greedy strategy? What might we do to make the worse modes disappear?
**Solution 6**: *As we discussed in lecture, greedy strategies can get unlucky and see early evidence suggesting that good options are not worth considering, e.g., an initial failure for the 0.75 arm and a win for the 0.5 arm. There are a few remedies, including using an $\epsilon$-greedy strategy or adopting an "optimistic" prior over the rewards.*

**Question 7**: How would we implement an $\epsilon$-greedy model without writing much new code?

**Solution 7**: *We can wrap our existing greedy policy with a "coin toss".*

```{r}
epsilonGreedyPolicy <- function(innerGreedy,epsilon) {
  function(choices,rewards,nArms) {
    if(runif(1) < epsilon) {
      randomPolicy(choices,rewards,nArms)
    } else {
      innerGreedy(choices,rewards,nArms)
    }
  }
}

myEpsGreedy <- epsilonGreedyPolicy(greedyPolicy(0.001,0.001),.1)
egBatch <- numeric(nRuns)
for(i in 1:nRuns) {
  egBatch[i] <- banditTask(myEpsGreedy,200,exampleArms)[[1]]
}
hist(egBatch,breaks=40, xlim=c(0,1))
mean(egBatch)
mean(greedyResBatch)
```

**Question 8**: In the lecture, we discussed Thompson sampling as a viable alternative. How would you modify the code below to implement Thompson sampling for our case of Bernoulli bandits?

**Solution 8**: *In Thompson sampling we maintain beliefs over the reward probabilities of the arms, using a Beta distribution for each arm. We sample a reward distribution for each arm and pick the one with the highest expected reward. The reward distributions in our case are Bernoulli distributions with parameters $\lambda_i$, so we first sample a $\lambda_i$ for each arm from its Beta distribution. The expected reward in the Bernoulli case is simply $\lambda_i$, so we pick the arm with the highest sampled $\lambda_i$.*

```{r}
# - A lazy implementation; recomputing counts is inefficient
# - This is a policy generator, which returns a policy with particular
#   hyperparameters alpha, beta of the Beta prior over each arm.
thompsonSampling <- function(alpha,beta) {
  function(choices,rewards,nArms) {
    successes <- rep(0,nArms)
    failures <- rep(0,nArms)
    for(i in 1:nArms) {
      successes[i] <- sum(rewards[choices==i])+alpha
      failures[i] <- sum(rewards[choices==i]==0)+beta
    }
    # YOUR CODE GOES HERE. Remember: you can sample from a beta distribution with rbeta
    lambdas <- rep(0, nArms)
    for (i in 1:nArms) {
      lambdas[i] <- rbeta(1, successes[i], failures[i])
    }
    # which.max returns a single index even in the (probability-zero) event of a tie
    which.max(lambdas)
  }
}
```

**Question 9**: Let's compare Thompson sampling to the greedy policy. What can we say?

```{r}
nRuns <- 1000
randResBatch <- numeric(nRuns)
thompsonResBatch <- numeric(nRuns)
greedyResBatch <- numeric(nRuns)
wslsResBatch <- numeric(nRuns)
for(i in 1:nRuns) {
  randResBatch[i] <- banditTask(randomPolicy,200,exampleArms)[[1]]
  thompsonResBatch[i] <- banditTask(thompsonSampling(1,1),200,exampleArms)[[1]]
  greedyResBatch[i] <- banditTask(greedyPolicy(0.001,0.001),200,exampleArms)[[1]]
  wslsResBatch[i] <- banditTask(wslsPolicy,200,exampleArms)[[1]]
}
hist(thompsonResBatch,breaks=40, xlim=c(0,1))
hist(greedyResBatch,breaks=40, xlim=c(0,1))
```

**Solution 9**: *Thompson sampling avoids getting stuck the way the greedy policy can, as it keeps exploring arms as long as it is highly uncertain about their reward probabilities. Thompson sampling therefore reliably discovers the best arm. Compared to the best-case scenario of the greedy policy, in which the best arm is exploited from early on, Thompson sampling spends more time on low-reward arms. The single mode of the plot for the Thompson policy is therefore slightly shifted to the left compared to the best mode of the greedy policy.*

**Question 10**: What do you think is important to consider in a model of decision making in bandit tasks, e.g., phenomena it should be able to capture? Discuss.
**Solution 10**: *A few possibilities came up in lecture, e.g., switching between exploratory and exploitative modes, the possibility that rewards might shift over time, and sensitivity to costs and benefits shaping policy choice, to name a few. Hopefully students will come up with some of their own ideas as well.*

**Question 11**: What might you do to make bandit tasks more realistic or representative of analogous tasks in everyday life? Discuss.

**Solution 11**: *As with the previous question, there are many possibilities, including restless bandits, potentially informative features, a more stateful representation of the agent (e.g., preferences for novelty), different reward distributions, and so on.*
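To make the restless-bandit idea concrete, here is a minimal sketch (hypothetical code mirroring `banditTask` above; the `drift` parameter and Gaussian random-walk dynamics are illustrative assumptions, not the only choice): after every trial, each arm's reward probability takes a small random step, clipped to $[0,1]$, so yesterday's best arm need not stay best.

```{r}
# Hypothetical sketch of a restless Bernoulli bandit: same interface as
# banditTask, but each arm's reward probability drifts after every trial.
restlessBanditTask <- function(policy, horizon, armProbs, drift = 0.02) {
  nArms <- length(armProbs)
  choices <- c()
  rewards <- c()
  for(t in 1:horizon) {
    nextChoice <- policy(choices, rewards, nArms)
    nextReward <- rbinom(1, 1, armProbs[nextChoice])
    choices <- append(choices, nextChoice)
    rewards <- append(rewards, nextReward)
    # Gaussian random-walk step for every arm, clipped to [0, 1].
    armProbs <- pmin(pmax(armProbs + rnorm(nArms, 0, drift), 0), 1)
  }
  list(mean(rewards), data.frame(choices = choices, rewards = rewards))
}

# Example: a uniformly random policy on a drifting two-armed bandit.
set.seed(1)
res <- restlessBanditTask(function(choices, rewards, nArms) sample(1:nArms, 1),
                          500, c(0.2, 0.8))
res[[1]]  # mean reward over 500 trials
```

In this setting a policy must keep exploring: a greedy policy that locks onto one arm will fall behind once the probabilities drift past each other.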