## Homework #5: Statistical models

### Due Thu., Sep 29


### Preliminary Information

Here are the homework exercises from the book.

### Homework Exercises:

**Self-study:** Work these exercises, but do not turn them in.

- Exercises 6E1, 6E3, 6E4
- Exercises 7E1, 7E3, 7E4

**Turn in:** Work these exercises and turn them in.

- Exercises 6M2–6M3
- Exercises 7M2–7M6

**Optional:** The following exercises are optional. You can turn them in for extra credit.

- Exercise 6H1

### Notes on Homework:

Because I was late posting the exercises from the textbook, everyone can have an extension until Friday to turn the homework in. Turn it in any time before the weekend and you’ll get full credit.

### Exercise 7M3:

For exercise 7M3, I recommend using the hominin model 7.4 in section 7.1:

First, create the data (code 7.1 and 7.2), and then use model 7.4 from
code 7.8, on p. 198. Make several variations, in which you change
the standard deviation on the prior for `b` from 10 to smaller values,
such as 5, 2, 1, 0.5, 0.1, and 0.01.
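If you don't have the book's code at hand, the data construction in codes 7.1 and 7.2 looks roughly like this (values transcribed from the book; double-check them against your copy):

```
# Code 7.1: brain volume (cc) and body mass (kg) for seven hominin species
sppnames <- c("afarensis", "africanus", "habilis", "boisei",
              "rudolfensis", "ergaster", "sapiens")
brainvolcc <- c(438, 452, 612, 521, 752, 871, 1350)
masskg <- c(37.0, 35.5, 34.5, 41.5, 55.5, 61.0, 53.5)
d <- data.frame(species = sppnames, brain = brainvolcc, mass = masskg)

# Code 7.2: standardize mass; rescale brain volume so it lies in (0, 1]
d$mass_std <- (d$mass - mean(d$mass)) / sd(d$mass)
d$brain_std <- d$brain / max(d$brain)
```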

```
lst_7.1 <- alist(
  brain_std ~ dnorm(mu, exp(log_sigma)),
  mu <- a + b[1] * mass_std + b[2] * mass_std^2 +
    b[3] * mass_std^3 + b[4] * mass_std^4,
  a ~ dnorm(0.5, 1),
  b ~ dnorm(0, 10),
  log_sigma ~ dnorm(0, 1)
)
lst_7.1a <- alist(
  brain_std ~ dnorm(mu, exp(log_sigma)),
  mu <- a + b[1] * mass_std + b[2] * mass_std^2 +
    b[3] * mass_std^3 + b[4] * mass_std^4,
  a ~ dnorm(0.5, 1),
  b ~ dnorm(0, 5),
  log_sigma ~ dnorm(0, 1)
)
```

and so forth, for smaller and smaller values of the standard deviation on the
prior for `b`.

Then fit each model to the data like this:

```
mdl_7.1 <- quap(lst_7.1, data = d, start = list(b = rep(0,4)))
mdl_7.1a <- quap(lst_7.1a, data = d, start = list(b = rep(0,4)))
...
```

Use `precis` to inspect the values of the parameters and see how they change
as you apply more and more informative priors.
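For example (assuming the model objects fitted above; `depth = 2` is needed so that `precis` displays the individual elements of the vector parameter `b`, which it hides by default):

```
precis(mdl_7.1, depth = 2)
precis(mdl_7.1a, depth = 2)
```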

Then you can sample from the models, following the code in 7.10:

```
new_data <- data.frame(
  mass_std = seq(min(d$mass_std), max(d$mass_std), length.out = 100)
)
lnk_7.1 <- link(mdl_7.1, data = new_data)
lnk_7.1a <- link(mdl_7.1a, data = new_data)
...
```

Finally, you can plot the models:

```
mu_7.1 <- apply(lnk_7.1, 2, mean)
ci_7.1 <- apply(lnk_7.1, 2, PI)
mu_7.1a <- apply(lnk_7.1a, 2, mean)
ci_7.1a <- apply(lnk_7.1a, 2, PI)
...
plot(brain_std ~ mass_std, data = d)
lines(new_data$mass_std, mu_7.1)
shade(ci_7.1, new_data$mass_std)
plot(brain_std ~ mass_std, data = d)
lines(new_data$mass_std, mu_7.1a)
shade(ci_7.1a, new_data$mass_std)
...
```

and so forth.

As you make the prior for `b` narrower and narrower, how do things change in
terms of overfitting and underfitting?

### Exercise 7M4:

For exercise 7M4, I recommend using the fungus treatment example from section 6.2 or the educational attainment example from section 6.3.2.

Start with the code for creating the data (code 6.13 for the fungus example, or 6.25–6.26 for the education example). Modify the code to generate a data frame with 200 rows (\(N = 200\)).

Then, instead of using the code for fitting models from 6.15–6.17 or
6.27–6.28, make the formula lists separately and then call `quap` with
different amounts of data:

```
lst_6.6 <- alist(
  h1 ~ dnorm(mu, sigma),
  mu <- h0 * p,
  p ~ dlnorm(0, 0.25),
  sigma ~ dexp(1)
)
```

and make similar lists for models 6.7 and 6.8.
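For reference, the formula list for model 6.7 would look something like the following (transcribed from code 6.16 in the book; check it against your copy). Model 6.8 is the same with the `bf * fungus` term dropped:

```
lst_6.7 <- alist(
  h1 ~ dnorm(mu, sigma),
  mu <- h0 * p,
  p <- a + bt * treatment + bf * fungus,
  a ~ dlnorm(0, 0.2),
  bt ~ dnorm(0, 0.5),
  bf ~ dnorm(0, 0.5),
  sigma ~ dexp(1)
)
```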

If you’re using the educational attainment model, there are only two models
in the book, so I’d recommend making up a 3rd model with fewer parameters
(maybe only look at `G` and ignore `P`).

You can pick \(n\) rows from the data frame `d` using the `slice_sample`
function from the `dplyr` package:

```
d_20 <- slice_sample(d, n = 20)
d_50 <- slice_sample(d, n = 50)
d_100 <- slice_sample(d, n = 100)
```

Now you can fit each model to a different amount of data:

`mdl_6.6_20 <- quap(lst_6.6, data = d_20)`

First, for a single model, vary the amount of data and see how WAIC changes with the number of samples.
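For example, assuming you have fit model 6.6 to each subset as above:

```
WAIC(mdl_6.6_20)
WAIC(mdl_6.6_50)
WAIC(mdl_6.6_100)
```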

Then fit the different models to the same data (`d_20`, `d_50`, `d_100`), and
use the `compare` function to see which has the smallest WAIC. Are the
results consistent for the different amounts of data (20, 50, or 100 points)?
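For example, for the 20-row subset (assuming the fits above):

```
compare(mdl_6.6_20, mdl_6.7_20, mdl_6.8_20)
```

By default, `compare` ranks the models by WAIC, with the smallest (best) value first.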

Next, compare models fit to different amounts of data: `mdl_6.6_20`,
`mdl_6.7_50`, and `mdl_6.8_100`, and other combinations.
Does the `compare` function rank the models differently when you
compare models fit to different amounts of data?
(You will see a warning from `compare()` when you do this, because the
models were fit to different numbers of observations.)