---
title: "Computational Cognitive Science 2025-2026"
output:
  pdf_document: default
  html_document: default
urlcolor: blue
---

# Tutorial 6: Multi-armed bandits

## Bandits

Today's tutorial is a tour of a classic multi-armed Bernoulli bandit task and some of the models we discussed in lecture.

First, let's define a function that will try a policy on a particular multi-armed bandit problem and return the mean reward as well as a dataframe containing all choices and rewards.

```{r}
banditTask <- function(policy,horizon,armProbs) {
  nArms <- length(armProbs)
  choices <- c()
  rewards <- c()
  for(t in 1:horizon) {
    nextChoice <- policy(choices,rewards,nArms)
    nextReward <- rbinom(1,1,armProbs[nextChoice])
    choices <- append(choices,nextChoice)
    rewards <- append(rewards,nextReward)
  }
  # Use list() rather than c() so the second element stays a data frame;
  # c() would flatten the data frame into separate list entries.
  list(mean(rewards),data.frame(choices=choices,rewards=rewards))
}
```

**Question 1**: What is the `horizon` argument? Our bandit policies are not given the value of `horizon`. Would we expect this to be the case for all bandit policies we discussed? If so, why? If not, which one(s) need to be aware of the horizon?

**Solution 1**: *The horizon is the total number of trials. Optimal bandit policies do need to know the horizon, as the tradeoff between exploration and exploitation changes as we get closer to the final trial.*

Next, we will define three policies: random, WSLS, and greedy.
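Before turning to those policies, note one baseline we can compute in closed form: a uniformly random policy's expected per-trial reward is simply the average of the arm probabilities. A quick standalone check (a hypothetical snippet; the arm probabilities here are chosen purely for illustration):

```{r}
# Expected reward of a uniformly random policy: each arm is chosen with
# probability 1/nArms, so E[reward] = mean(armProbs).
armProbs <- c(0, .25, .5, .75)
analytic <- mean(armProbs)   # 0.375

# Monte Carlo check: choose arms uniformly and draw Bernoulli rewards.
set.seed(1)
choices <- sample(seq_along(armProbs), 10000, replace = TRUE)
simulated <- mean(rbinom(10000, 1, armProbs[choices]))

analytic
simulated  # close to 0.375
```

Any learning policy worth considering should beat this baseline by concentrating its pulls on the better arms.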
```{r}
randomPolicy <- function(choices,rewards,nArms) {
  sample(1:nArms,1,replace=TRUE,prob=rep(1/nArms,nArms))
}

wslsPolicy <- function(choices,rewards,nArms) {
  if(length(choices) == 0) {
    sample(1:nArms,1,replace=TRUE,prob=rep(1/nArms,nArms))
  } else {
    lastReward <- rewards[length(rewards)]
    lastChoice <- choices[length(choices)]
    if(lastReward==1) {
      lastChoice
    } else {
      options <- (1:nArms)[-lastChoice]
      sample(options,1,replace=TRUE,prob=rep(1/(nArms-1),nArms-1))
    }
  }
}

# - A lazy implementation; recomputing counts is inefficient
# - This is a policy generator, which returns a policy with particular
#   hyperparameters alpha, beta of the Beta prior over each arm.
greedyPolicy <- function(alpha,beta) {
  function(choices,rewards,nArms) {
    ratios <- rep(0,nArms)
    for(i in 1:nArms) {
      num <- sum(rewards[choices==i])+alpha
      den <- sum(choices==i)+alpha+beta
      ratios[i] <- num/den
    }
    validArms <- which(ratios == max(ratios))
    nV <- length(validArms)
    # Guard against sample()'s length-1 behavior: sample(x,1) with a
    # single-element x would draw from 1:x instead of returning x.
    if(nV == 1) {
      validArms
    } else {
      sample(validArms,1,prob=rep(1/nV,nV))
    }
  }
}
```

**Question 2**: The greedy policy has hyperparameters $\alpha$ and $\beta$. Why can't we just initialize all ratios to 0 before any choices have been made?

**Solution 2**: *If we initialized all ratios to zero, a greedy policy would get stuck on the first arm that returns a reward. $\alpha=1$ and $\beta=1$ is a prior that is equivalent to having observed one success and one failure on each arm.*

**Question 3**: What does the `ratios` variable correspond to in the greedy policy (hint: have a look at the moments of the Beta distribution)?

**Solution 3**: *Each ratio $\frac{\alpha}{\alpha+\beta}$ (with observed successes and failures added to $\alpha$ and $\beta$ respectively) is the expected value/mean of the posterior Beta distribution, i.e., the expected reward probability of an arm.*

**Question 4**: What are the memory requirements for each policy? How does the computational cost of computing the next choice change with the number of trials?

**Solution 4**: *The random policy is completely memoryless.
The WSLS policy needs to track its last choice and whether it won, so 3 bits for 4 arms (2 bits for the choice, 1 for the outcome). The greedy policy needs to track the number of wins and losses for each arm. In all cases, the computational cost per choice is constant given a reasonable implementation, though our lazy greedy implementation recomputes counts from scratch, so its cost scales linearly with the number of trials.*

Let's take a Bernoulli bandit task with reward probabilities of 0, 1/4, 1/2, and 3/4, and see how WSLS compares to a greedy policy. Here's a single run, with a random policy for comparison (though we can straightforwardly compute the expected reward of a random policy).

```{r}
exampleArms <- c(0,.25,.5,.75)
randRes <- banditTask(randomPolicy,2000,exampleArms)[[1]]
wslsRes <- banditTask(wslsPolicy,2000,exampleArms)[[1]]
greedyRes <- banditTask(greedyPolicy(1,1),2000,exampleArms)[[1]]
randRes
wslsRes
greedyRes
```

Now let's look at distributions of reward over several runs of the task, with 200 trials each.

```{r}
nRuns <- 1000
randResBatch <- numeric(nRuns)
greedyResBatch <- numeric(nRuns)
wslsResBatch <- numeric(nRuns)
for(i in 1:nRuns) {
  randResBatch[i] <- banditTask(randomPolicy,200,exampleArms)[[1]]
  greedyResBatch[i] <- banditTask(greedyPolicy(1,1),200,exampleArms)[[1]]
  wslsResBatch[i] <- banditTask(wslsPolicy,200,exampleArms)[[1]]
}
hist(randResBatch,breaks=40, xlim=c(0,1))
hist(greedyResBatch,breaks=40, xlim=c(0,1))
hist(wslsResBatch,breaks=40, xlim=c(0,1))
```

**Question 5**: Is it surprising that the greedy policy outperformed WSLS? Give an example to illustrate why or why not.

**Solution 5**: *No, it isn't: a greedy policy will often converge to reliably picking the best arm, whereas WSLS will abandon the best arm every time it fails (1 time in 4).*

**Question 6**: Why are there multiple modes in the histogram for the greedy strategy? What might we do to make the worse modes disappear?
**Solution 6**: *As we discussed in lecture, greedy strategies can get unlucky and see early evidence suggesting that good options are not worth considering, e.g., an initial failure for the 0.75 arm and a win for the 0.5 arm. There are a few remedies, including using an $\epsilon$-greedy strategy or adopting an "optimistic" prior over the rewards.*

**Question 7**: How would we implement an $\epsilon$-greedy model without writing much new code?

**Solution 7**: *We can wrap our existing greedy policy with a "coin toss".*

```{r}
epsilonGreedyPolicy <- function(innerGreedy,epsilon) {
  function(choices,rewards,nArms) {
    if(runif(1) < epsilon) {
      randomPolicy(choices,rewards,nArms)
    } else {
      innerGreedy(choices,rewards,nArms)
    }
  }
}

myEpsGreedy <- epsilonGreedyPolicy(greedyPolicy(0.001,0.001),.1)
egBatch <- numeric(nRuns)
for(i in 1:nRuns) {
  egBatch[i] <- banditTask(myEpsGreedy,200,exampleArms)[[1]]
}
hist(egBatch,breaks=40, xlim=c(0,1))
mean(egBatch)
mean(greedyResBatch)
```

**Question 8**: In the lecture, we discussed Thompson sampling as a viable alternative. How would you modify the code below to implement Thompson sampling for our case of Bernoulli bandits?

**Solution 8**: *In Thompson sampling we maintain beliefs over the reward probabilities of the arms, using a Beta distribution for each arm. We sample a reward distribution for each arm and pick the one with the highest expected reward. The reward distributions in our case are Bernoulli distributions with parameters $\lambda_i$, so we first sample a $\lambda_i$ for each arm from its Beta distribution. The expected reward in the Bernoulli case is simply $\lambda_i$, so we pick the arm with the highest sampled $\lambda_i$.*

```{r}
# - A lazy implementation; recomputing counts is inefficient
# - This is a policy generator, which returns a policy with particular
#   hyperparameters alpha, beta of the Beta prior over each arm.
thompsonSampling <- function(alpha,beta) {
  function(choices,rewards,nArms) {
    successes <- rep(0,nArms)
    failures <- rep(0,nArms)
    for(i in 1:nArms) {
      successes[i] <- sum(rewards[choices==i])+alpha
      failures[i] <- sum(rewards[choices==i]==0)+beta
    }
    # YOUR CODE GOES HERE. Remember: you can sample from a beta distribution with rbeta
    lambdas <- rep(0, nArms)
    for (i in 1:nArms) {
      lambdas[i] <- rbeta(1, successes[i], failures[i])
    }
    # which.max returns a single index even in the (probability-zero) event of a tie
    which.max(lambdas)
  }
}
```

**Question 9**: Let's compare Thompson sampling to the greedy policy. What can we say?

```{r}
nRuns <- 1000
randResBatch <- numeric(nRuns)
thompsonResBatch <- numeric(nRuns)
greedyResBatch <- numeric(nRuns)
wslsResBatch <- numeric(nRuns)
for(i in 1:nRuns) {
  randResBatch[i] <- banditTask(randomPolicy,200,exampleArms)[[1]]
  thompsonResBatch[i] <- banditTask(thompsonSampling(1,1),200,exampleArms)[[1]]
  greedyResBatch[i] <- banditTask(greedyPolicy(0.001,0.001),200,exampleArms)[[1]]
  wslsResBatch[i] <- banditTask(wslsPolicy,200,exampleArms)[[1]]
}
hist(thompsonResBatch,breaks=40, xlim=c(0,1))
hist(greedyResBatch,breaks=40, xlim=c(0,1))
```

**Solution 9**: *Thompson sampling avoids getting stuck the way the greedy policy can, as it keeps exploring arms as long as it is highly uncertain about their reward probabilities. Thompson sampling therefore reliably discovers the best arm. Compared to the best-case scenario of the greedy policy, in which the best arm is exploited from early on, Thompson sampling spends more time on low-reward arms. The single mode of the plot for the Thompson policy is therefore slightly shifted to the left compared to the best mode of the greedy policy.*

**Question 10**: What do you think is important to consider in a model of decision making in bandit tasks, e.g., phenomena it should be able to capture? Discuss.
**Solution 10**: *A few possibilities came up in lecture, e.g., switching between exploratory and exploitative modes, the possibility that rewards might shift over time, and sensitivity to costs and benefits shaping policy choice, to name a few. Hopefully students will come up with some of their own ideas as well.*

**Question 11**: What might you do to make bandit tasks more realistic or representative of analogous tasks in everyday life? Discuss.

**Solution 11**: *As with the previous question, there are many possibilities, including restless bandits, potentially informative features, a more stateful representation of the agent (e.g., preferences for novelty), different reward distributions, and so on.*
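To make the restless-bandit idea concrete, here is a minimal sketch (hypothetical code mirroring `banditTask` above; the `drift` parameter and Gaussian random-walk dynamics are illustrative assumptions, not the only choice): after every trial, each arm's reward probability takes a small random step, clipped to $[0,1]$, so yesterday's best arm need not stay best.

```{r}
# Hypothetical sketch of a restless Bernoulli bandit: same interface as
# banditTask, but each arm's reward probability drifts after every trial.
restlessBanditTask <- function(policy, horizon, armProbs, drift = 0.02) {
  nArms <- length(armProbs)
  choices <- c()
  rewards <- c()
  for(t in 1:horizon) {
    nextChoice <- policy(choices, rewards, nArms)
    nextReward <- rbinom(1, 1, armProbs[nextChoice])
    choices <- append(choices, nextChoice)
    rewards <- append(rewards, nextReward)
    # Gaussian random-walk step for every arm, clipped to [0, 1].
    armProbs <- pmin(pmax(armProbs + rnorm(nArms, 0, drift), 0), 1)
  }
  list(mean(rewards), data.frame(choices = choices, rewards = rewards))
}

# Example: a uniformly random policy on a drifting two-armed bandit.
set.seed(1)
res <- restlessBanditTask(function(choices, rewards, nArms) sample(1:nArms, 1),
                          500, c(0.2, 0.8))
res[[1]]  # mean reward over 500 trials
```

In this setting a policy must keep exploring: a greedy policy that locks onto one arm will fall behind once the probabilities drift past each other.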