--- title: "Computational Cognitive Science 2022-2023" output: pdf_document: default html_document: default urlcolor: blue references: - author: - family: Kemp given: Charles - family: Perfors given: Andrew - family: Tenenbaum given: Joshua id: Kemp issued: year: 2007 publisher: Wiley Online Library title: Learning overhypotheses with hierarchical Bayesian models type: article-journal container-title: Developmental science volume: 10 number: 3 --- # Tutorial 6: Overhypotheses ## The Dirichlet Distribution and the Dirichlet-Categorical Distribution The Dirichlet distribution is a distribution over probability distributions; that is, draws from a Dirichlet are probability distributions. In the lecture on overhypotheses, the Dirichlet distribution was parameterised with two values: $\text{Dirichlet}(\alpha \boldsymbol{\beta})$, where $\alpha > 0$ is a scalar and $\boldsymbol{\beta}$ is a probability distribution of size $K$: all $\beta_k > 0$ and $\sum_k \beta_k = 1$. The two parameters play different roles: $\boldsymbol{\beta}$ is the **base distribution** and $\alpha$ is the **concentration parameter**, governing how much draws from $\text{Dirichlet}(\alpha \boldsymbol{\beta})$ will diverge from the base This exercise is meant to strengthen your intuitions about the role of the $\alpha$ and $\boldsymbol{\beta}$ hyperparameters in the context of Dirichlet priors with categorical likelihoods, by examining how the hyperparameters influence the prediction of the next draw. Dirichlet priors with categorical likelihoods have a convenient closed form for the predictive posterior, integrating over all possible draws from the Dirichlet ($\theta$ in the lectures). With the $\text{Dirichlet}(\alpha \boldsymbol{\beta})$ parameterisation, the predictive posterior is: $$ p(y = k | D, \alpha, \boldsymbol{\beta}) = \frac{\alpha \beta_k + N_{k}} {\sum_{k'} \alpha \beta_{k'} + N_{k'}} $$ where $N_k$ refers to the number of items of category $k$ in $D$. **Question 1**: Consider a dataset consisting of ten marbles, two of which are black (marbles can only be black and white; so $K=2; N=10; N_\text{black} = 2$). Given this dataset, calculate the probability of the next marble being black, using the following hyperparameter settings (4 different settings; preferably vary $\alpha$ while keeping $\boldsymbol{\beta}$ stable): - $\boldsymbol{\beta} = [0.5, 0.5]$, $\boldsymbol{\beta} = [0.2, 0.8]$ - $\alpha = 0.01, 100$ You can calculate these by hand (which is fairly quick to do) or write code code to calculate the probabilities. **Question 2**: What is the effect of a small $\alpha$? A large $\alpha$? Kemp et al.'s hierarchical model does not specify $\alpha$ directly, but instead specifies a prior over possible values of $\alpha$. **Question 3**: Give an example where this leads to very different predictions than we'd see from a model that uses a fixed value of $\alpha$. **Question 4**: What prior over $\alpha$ did Kemp et al. use? Describe one or more features that are desirable in a prior over $\alpha$. Kemp et al.'s model can discover that some features (like shape) are important for some "ontological kinds" (like rigid objects), whereas other features (like color) are important for other kinds (like substances). **Question 5**: Do we need to specify the numbers of kinds of things in advance, in this model? Why or why not? **Question 6**: Give a high level summary of how a "Chinese restaurant process" works. ## References