---
title: "Computational Cognitive Science 2025-2026"
output:
  pdf_document: default
  html_document: default
urlcolor: blue
---

# Tutorial 6: Multi-armed bandits

## Bandits

Today's tutorial is a tour of a classic multi-armed Bernoulli bandit task and some of the models we discussed in lecture.

First, let's define a function that runs a policy on a particular multi-armed bandit problem and returns the mean reward as well as a data frame containing all choices and rewards.

```{r}
banditTask <- function(policy,horizon,armProbs) {
  nArms <- length(armProbs)
  choices <- c()
  rewards <- c()
  for(t in 1:horizon) {
    nextChoice <- policy(choices,rewards,nArms)
    # each arm pays out 1 with its own fixed probability, 0 otherwise
    nextReward <- rbinom(1,1,armProbs[nextChoice])
    choices <- append(choices,nextChoice)
    rewards <- append(rewards,nextReward)
  }
  # return the mean reward along with the full choice/reward history
  list(mean(rewards),data.frame(choices=choices,rewards=rewards))
}
```

**Question 1**: What is the `horizon` argument? Our bandit policies are not given the value of `horizon`. Would we expect this to be the case for all bandit policies we discussed? If so, why? If not, which one(s) need to be aware of the horizon?

Next, we will define three policies: random, win-stay lose-shift (WSLS), and greedy. A policy $\pi(a|s)$ is a mapping from a state $s$ to an action $a$. In a bandit task, the state is the record of all previous actions and obtained rewards.

```{r}
randomPolicy <- function(choices,rewards,nArms) {
  # sample a random arm
  sample(1:nArms,1,replace=TRUE,prob=rep(1/nArms,nArms))
}

wslsPolicy <- function(choices,rewards,nArms) {
  if(length(choices) == 0) {
    # if there was no prior choice, sample a random arm
    sample(1:nArms,1,replace=TRUE,prob=rep(1/nArms,nArms))
  } else {
    # if there was a prior choice, stay if it involved a reward,
    # sample another option randomly otherwise
    lastReward <- rewards[length(rewards)]
    lastChoice <- choices[length(choices)]
    if(lastReward==1) {
      lastChoice
    } else {
      options <- (1:nArms)[-lastChoice]
      sample(options,1,replace=TRUE,prob=rep(1/(nArms-1),nArms-1))
    }
  }
}

# - A lazy implementation; recomputing the counts on every trial is inefficient
# - This is a policy generator: it returns a policy with particular
#   hyperparameters alpha, beta of the Beta prior over each arm.
greedyPolicy <- function(alpha,beta) {
  function(choices,rewards,nArms) {
    ratios <- rep(0,nArms)
    for(i in 1:nArms) {
      num <- sum(rewards[choices==i])+alpha
      den <- sum(choices==i)+alpha+beta
      ratios[i] <- num/den
    }
    # arms tied for the highest ratio; break ties uniformly at random
    validArms <- which(ratios == max(ratios))
    nV <- length(validArms)
    if(nV == 1) {
      validArms
    } else {
      sample(validArms,1,prob=rep(1/nV,nV))
    }
  }
}
```

**Question 2**: The greedy policy has hyperparameters `alpha` and `beta`. Why can't we just initialize all ratios to 0 before any choices have been made?

**Question 3**: What does the `ratios` quantity correspond to in the greedy policy (hint: have a look at the Beta distribution)?

**Question 4**: What are the memory requirements for each policy? How does the cost of computing the next choice change with the number of trials?

Let's take a Bernoulli bandit task with reward probabilities of 0, 1/4, 1/2, and 3/4, and see how WSLS compares to a greedy policy. Here's a single run, with a random policy for comparison (though we can straightforwardly compute the expected reward of a random policy).

```{r}
exampleArms <- c(0,.25,.5,.75)
randRes <- banditTask(randomPolicy,2000,exampleArms)[[1]]
wslsRes <- banditTask(wslsPolicy,2000,exampleArms)[[1]]
greedyRes <- banditTask(greedyPolicy(1,1),2000,exampleArms)[[1]]
randRes
wslsRes
greedyRes
```

Now let's look at the distribution of mean reward over several runs of the task, with 200 trials each.
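Before turning to the histograms, we can make the baseline from the parenthetical above concrete: a uniformly random policy picks each arm with equal probability, so its expected reward is simply the mean of the arm probabilities.

```{r}
# Expected reward of the uniformly random policy: each of the four arms is
# chosen with probability 1/4, so the expectation is the mean of armProbs.
exampleArms <- c(0, .25, .5, .75)
mean(exampleArms)  # 0.375
```

The histogram for the random policy below should therefore be centred near 0.375.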
```{r}
nRuns <- 1000
randResBatch <- numeric(nRuns)
greedyResBatch <- numeric(nRuns)
wslsResBatch <- numeric(nRuns)
for(i in 1:nRuns) {
  randResBatch[i] <- banditTask(randomPolicy,200,exampleArms)[[1]]
  greedyResBatch[i] <- banditTask(greedyPolicy(1,1),200,exampleArms)[[1]]
  wslsResBatch[i] <- banditTask(wslsPolicy,200,exampleArms)[[1]]
}
hist(randResBatch,breaks=40, xlim=c(0,1))
hist(greedyResBatch,breaks=40, xlim=c(0,1))
hist(wslsResBatch,breaks=40, xlim=c(0,1))
```

**Question 5**: Is it surprising that the greedy policy outperformed WSLS? Give an example to illustrate why or why not.

**Question 6**: Why are there multiple modes in the histogram for the greedy strategy? What might we do to make the worse modes disappear?

**Question 7**: How would we implement an $\epsilon$-greedy policy without writing much new code?

**Question 8**: In the lecture, we discussed Thompson sampling as a viable alternative. How would you modify the code below to implement Thompson sampling for our case of Bernoulli bandits?

```{r}
# - A lazy implementation; recomputing the counts on every trial is inefficient
# - This is a policy generator: it returns a policy with particular
#   hyperparameters alpha, beta of the Beta prior over each arm.
thompsonSampling <- function(alpha,beta) {
  function(choices,rewards,nArms) {
    successes <- rep(0,nArms)
    failures <- rep(0,nArms)
    for(i in 1:nArms) {
      successes[i] <- sum(rewards[choices==i])+alpha
      failures[i] <- sum(rewards[choices==i]==0)+beta
    }
    # YOUR CODE GOES HERE. Remember: you can sample from a beta
    # distribution with rbeta
  }
}
```

**Question 9**: Let's compare Thompson sampling to the greedy policy. What can we say?
```{r}
# (this chunk will only run once thompsonSampling from Question 8 is completed)
nRuns <- 1000
randResBatch <- numeric(nRuns)
thompsonResBatch <- numeric(nRuns)
greedyResBatch <- numeric(nRuns)
wslsResBatch <- numeric(nRuns)
for(i in 1:nRuns) {
  randResBatch[i] <- banditTask(randomPolicy,200,exampleArms)[[1]]
  thompsonResBatch[i] <- banditTask(thompsonSampling(1,1),200,exampleArms)[[1]]
  greedyResBatch[i] <- banditTask(greedyPolicy(0.001,0.001),200,exampleArms)[[1]]
  wslsResBatch[i] <- banditTask(wslsPolicy,200,exampleArms)[[1]]
}
hist(thompsonResBatch,breaks=40, xlim=c(0,1))
hist(greedyResBatch,breaks=40, xlim=c(0,1))
```

**Question 10**: What do you think is important to consider in a model of decision making in bandit tasks, e.g., phenomena it should be able to capture? Discuss.

**Question 11**: What might you do to make bandit tasks more realistic or representative of analogous tasks in everyday life? Discuss.
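As a closing aside relevant to Question 10 (this is an illustration, not one of the exercises): besides mean reward, a standard quantity for comparing bandit policies is cumulative regret, the expected reward foregone by not always playing the best arm. A minimal self-contained sketch for the random policy on the arms used above:

```{r}
# Per-trial expected regret is max(armProbs) minus the expected payoff of
# the arm actually chosen; for a uniformly random policy it is easy to
# simulate directly. (Self-contained sketch, not part of the exercises.)
set.seed(1)
armProbs <- c(0, .25, .5, .75)
horizon <- 1000
choices <- sample(seq_along(armProbs), horizon, replace = TRUE)
regret <- cumsum(max(armProbs) - armProbs[choices])
regret[horizon] / horizon  # roughly 0.375, the random policy's expected regret
```

For a good learning policy, `regret` should grow much more slowly than linearly in the horizon, because the policy eventually concentrates its choices on the best arm.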