---
title: "Computational Cognitive Science 2025-2026"
output:
  pdf_document: default
  html_document: default
urlcolor: blue
---

# Tutorial 6: Multi-armed bandits

## Bandits

Today's tutorial is a tour of a classic multi-armed Bernoulli bandit task and some of the models we discussed in lecture.

First, let's define a function that runs a policy on a particular multi-armed bandit problem and returns the mean reward as well as a data frame containing all choices and rewards.

```{r}
banditTask <- function(policy,horizon,armProbs) {
  nArms <- length(armProbs)
  choices <- c()
  rewards <- c()
  for(t in 1:horizon) {
    nextChoice <- policy(choices,rewards,nArms)
    # each arm pays out 1 with its own fixed probability, 0 otherwise
    nextReward <- rbinom(1,1,armProbs[nextChoice])
    choices <- append(choices,nextChoice)
    rewards <- append(rewards,nextReward)
  }
  # return the mean reward along with the full choice/reward history
  list(mean(rewards),data.frame(choices=choices,rewards=rewards))
}
```

**Question 1**: What is the `horizon` argument? Our bandit policies are not given the value of `horizon`. Would we expect this to be the case for all bandit policies we discussed? If so, why? If not, which one(s) need to be aware of the horizon?

Next, we will define three policies: random, win-stay lose-shift (WSLS), and greedy. A policy $\pi(a|s)$ is a mapping from a state $s$ to an action $a$. In a bandit task, the state is the record of all previous actions and obtained rewards.

```{r}
randomPolicy <- function(choices,rewards,nArms) {
  # sample a random arm
  sample(1:nArms,1,replace=TRUE,prob=rep(1/nArms,nArms))
}

wslsPolicy <- function(choices,rewards,nArms) {
  if(length(choices) == 0) {
    # if there was no prior choice, sample a random arm
    sample(1:nArms,1,replace=TRUE,prob=rep(1/nArms,nArms))
  } else {
    # if there was a prior choice, stay if it involved a reward,
    # sample another option randomly otherwise
    lastReward <- rewards[length(rewards)]
    lastChoice <- choices[length(choices)]
    if(lastReward==1) {
      lastChoice
    } else {
      options <- (1:nArms)[-lastChoice]
      sample(options,1,replace=TRUE,prob=rep(1/(nArms-1),nArms-1))
    }
  }
}

# - A lazy implementation; recomputing the counts on every trial is inefficient
# - This is a policy generator: it returns a policy with particular
#   hyperparameters alpha, beta of the Beta prior over each arm.
greedyPolicy <- function(alpha,beta) {
  function(choices,rewards,nArms) {
    ratios <- rep(0,nArms)
    for(i in 1:nArms) {
      num <- sum(rewards[choices==i])+alpha
      den <- sum(choices==i)+alpha+beta
      ratios[i] <- num/den
    }
    # arms tied for the highest ratio; break ties uniformly at random
    validArms <- which(ratios == max(ratios))
    nV <- length(validArms)
    if(nV == 1) {
      validArms
    } else {
      sample(validArms,1,prob=rep(1/nV,nV))
    }
  }
}
```

**Question 2**: The greedy policy has hyperparameters `alpha` and `beta`. Why can't we just initialize all ratios to 0 before any choices have been made?

**Question 3**: What does the `ratios` quantity correspond to in the greedy policy (hint: have a look at the Beta distribution)?

**Question 4**: What are the memory requirements for each policy? How does the cost of computing the next choice change with the number of trials?

Let's take a Bernoulli bandit task with reward probabilities of 0, 1/4, 1/2, and 3/4, and see how WSLS compares to a greedy policy. Here's a single run, with a random policy for comparison (though we can straightforwardly compute the expected reward of a random policy).

```{r}
exampleArms <- c(0,.25,.5,.75)
randRes <- banditTask(randomPolicy,2000,exampleArms)[[1]]
wslsRes <- banditTask(wslsPolicy,2000,exampleArms)[[1]]
greedyRes <- banditTask(greedyPolicy(1,1),2000,exampleArms)[[1]]
randRes
wslsRes
greedyRes
```

Now let's look at the distribution of mean reward over several runs of the task, with 200 trials each.
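Before turning to the histograms, we can make the baseline from the parenthetical above concrete: a uniformly random policy picks each arm with equal probability, so its expected reward is simply the mean of the arm probabilities.

```{r}
# Expected reward of the uniformly random policy: each of the four arms is
# chosen with probability 1/4, so the expectation is the mean of armProbs.
exampleArms <- c(0, .25, .5, .75)
mean(exampleArms)  # 0.375
```

The histogram for the random policy below should therefore be centred near 0.375.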
```{r}
nRuns <- 1000
randResBatch <- numeric(nRuns)
greedyResBatch <- numeric(nRuns)
wslsResBatch <- numeric(nRuns)
for(i in 1:nRuns) {
  randResBatch[i] <- banditTask(randomPolicy,200,exampleArms)[[1]]
  greedyResBatch[i] <- banditTask(greedyPolicy(1,1),200,exampleArms)[[1]]
  wslsResBatch[i] <- banditTask(wslsPolicy,200,exampleArms)[[1]]
}
hist(randResBatch,breaks=40, xlim=c(0,1))
hist(greedyResBatch,breaks=40, xlim=c(0,1))
hist(wslsResBatch,breaks=40, xlim=c(0,1))
```

**Question 5**: Is it surprising that the greedy policy outperformed WSLS? Give an example to illustrate why or why not.

**Question 6**: Why are there multiple modes in the histogram for the greedy strategy? What might we do to make the worse modes disappear?

**Question 7**: How would we implement an $\epsilon$-greedy policy without writing much new code?

**Question 8**: In the lecture, we discussed Thompson sampling as a viable alternative. How would you modify the code below to implement Thompson sampling for our case of Bernoulli bandits?

```{r}
# - A lazy implementation; recomputing the counts on every trial is inefficient
# - This is a policy generator: it returns a policy with particular
#   hyperparameters alpha, beta of the Beta prior over each arm.
thompsonSampling <- function(alpha,beta) {
  function(choices,rewards,nArms) {
    successes <- rep(0,nArms)
    failures <- rep(0,nArms)
    for(i in 1:nArms) {
      successes[i] <- sum(rewards[choices==i])+alpha
      failures[i] <- sum(rewards[choices==i]==0)+beta
    }
    # YOUR CODE GOES HERE. Remember: you can sample from a beta
    # distribution with rbeta
  }
}
```

**Question 9**: Let's compare Thompson sampling to the greedy policy. What can we say?
```{r}
# (this chunk will only run once thompsonSampling from Question 8 is completed)
nRuns <- 1000
randResBatch <- numeric(nRuns)
thompsonResBatch <- numeric(nRuns)
greedyResBatch <- numeric(nRuns)
wslsResBatch <- numeric(nRuns)
for(i in 1:nRuns) {
  randResBatch[i] <- banditTask(randomPolicy,200,exampleArms)[[1]]
  thompsonResBatch[i] <- banditTask(thompsonSampling(1,1),200,exampleArms)[[1]]
  greedyResBatch[i] <- banditTask(greedyPolicy(0.001,0.001),200,exampleArms)[[1]]
  wslsResBatch[i] <- banditTask(wslsPolicy,200,exampleArms)[[1]]
}
hist(thompsonResBatch,breaks=40, xlim=c(0,1))
hist(greedyResBatch,breaks=40, xlim=c(0,1))
```

**Question 10**: What do you think is important to consider in a model of decision making in bandit tasks, e.g., phenomena it should be able to capture? Discuss.

**Question 11**: What might you do to make bandit tasks more realistic or representative of analogous tasks in everyday life? Discuss.
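As a closing aside relevant to Question 10 (this is an illustration, not one of the exercises): besides mean reward, a standard quantity for comparing bandit policies is cumulative regret, the expected reward foregone by not always playing the best arm. A minimal self-contained sketch for the random policy on the arms used above:

```{r}
# Per-trial expected regret is max(armProbs) minus the expected payoff of
# the arm actually chosen; for a uniformly random policy it is easy to
# simulate directly. (Self-contained sketch, not part of the exercises.)
set.seed(1)
armProbs <- c(0, .25, .5, .75)
horizon <- 1000
choices <- sample(seq_along(armProbs), horizon, replace = TRUE)
regret <- cumsum(max(armProbs) - armProbs[choices])
regret[horizon] / horizon  # roughly 0.375, the random policy's expected regret
```

For a good learning policy, `regret` should grow much more slowly than linearly in the horizon, because the policy eventually concentrates its choices on the best arm.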