lec 20
all right welcome back everybody I have

a lot of slides this is gonna be a great

lecture yeah I’d say the best responses

oh no all right this is the last lecture

but certainly not the least we are going

to cover some topics that I think are

extremely important when you’re doing

applied statistics before we get to that

let’s talk about pancakes okay these are

my very bad drawings of pancakes this is

three pancakes and the hatching on a

side of a pancake indicates that that

side is burnt so imagine like you come

over to my apartment for pancakes and I

turn on the griddle and the temperature

isn’t right it’s too hot and the first

couple pancakes get burnt this is

actually how it happens right and so by

the third pancake it’s perfect so the

first pancake is burnt on both sides the

second pancake is burnt on only one side

and the third pancake is just right yeah

now I serve you at random a pancake with

a random side up and you look down at

your pancake and it’s burnt and I want

to ask you since you’ve taken my stats

course what’s the posterior probability

that the other side is also burnt I’ll

let you think about it for a second all

the information you need to solve this

problem is on the slide you don’t know

which of the three pancakes it is right

you know which one it’s not right it’s

not number three so what’s the

probability the other side is burnt what’s

the answer

is that it anybody else think a

half we got two votes for a half

two out of three or yeah it’s the answer

is two out of three it’s not a half half

is the standard intuition and it’s a

perfectly reasonable intuition to have

that just means you’re human

the intuition is half cuz there’s two

pancakes it could be right you think it

could be either one at random right no

it’s two out of three and let me show

you how to figure that out but my aim in

doing this wasn’t to make you feel

bad at all it was my intuition

too a half the first time I saw this

problem and this is the famous logic

problem actually it wasn’t originally

done with pancakes but I liked it better

with pancakes yeah it’s really like

boxes with balls in them or something

stupid like that

so pancakes everybody likes pancakes

right yeah okay good answer so the point

of these logic puzzles is not to make

you feel bad it’s to teach you some

intuitions about or correct your

intuitions and teach you methods for

solving these things and as I’ve said

many times in this course I think one of

the values of learning probability

theory is that it means you don’t have

to be clever now you can just ruthlessly

apply the rules of conditioning and you

don’t have to feel like you need to

intuit the right answer to this thing so

don’t trust your intuitions would be my

advice instead be ruthless ruthlessly

condition what do I mean the way we

figure things out in probability theory

or Bayesian inference which is just

probability theory is we want to know

the probability of something and so we

condition on what we know and see if

that updates the probability right

if there’s any information in what we

already know for inferring the thing

we’d like to know then if we compute the

probability of the thing we want to know

conditional on the stuff we already know

it’ll be there right and the rules of

probability tell us the only way to do

that so we do that let’s do that for the

pancakes all right here’s your burnt

pancake we want to know the probability

that the side we can’t see is burnt

conditional on what we know which is

that the side up is burnt now of course

you could peek right so this is a

probability puzzle yeah the rules of

probability tell us that a conditional

probability is defined as the

probability that both things are true

divided by the probability

that the thing we condition on is true so

that’s what the expression on the right

hand side is the probability of burnt up

and burnt down that both sides are burnt

divided by the probability of burnt up this

is Bayes’ theorem by the way just

disguised right so this is just a

definition of conditional probability if

you ever forget this just Google

definition of conditional probability

there’ll be a nice Wikipedia page for

you right we have all the information

needed to compute this first we need the

probability that the burnt side is up

and there are two ways this can happen

well let’s think of it this way

methodically there are three pancakes right

so you’ve got a BB pancake a BU

pancake and a UU pancake that means

BB is burnt burnt right BU is burnt

unburnt and then UU is unburnt unburnt

three pancakes the probability

that you would see a

burnt side up if the pancake was burnt

burnt is one because either side will

show you burnt

yeah so that’s why it’s probability BB

times 1 plus probability BU times 0.5

because only one of the sides of that

pancake is burnt so if I’m randomly

flopping it down on your plate then

there’s a fifty-fifty chance the up side is

burnt and then for the unburnt pancake

there’s no way so you already eliminated

that right there your intuition

was working each of these pancakes has a

one third chance of being served to you so

there’s a half chance that you get the

burnt side up on your pancake and then

the last thing is just the top well

it’s the probability of the burnt up

burnt down pancake it’s one third

because there are three pancakes right and

I served you one at random so it’s 1/3

divided by a half and the answer’s 2/3

weird right
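
The calculation above can be checked by brute force. The following is a minimal Python sketch, not the simulation from the course text: the function names, the seed, and the coding of sides (1 for burnt, 0 for unburnt) are assumptions made here for illustration. It serves a random pancake with a random side up many times and keeps only the servings where the up side is burnt.

```python
import random


def simulate_pancake(rng):
    """Serve one random pancake with a random side up; return (up, down)."""
    # Each pancake is a pair of sides: 1 = burnt, 0 = unburnt.
    pancakes = [(1, 1), (1, 0), (0, 0)]  # BB, BU, UU
    sides = list(rng.choice(pancakes))   # pick one pancake at random
    rng.shuffle(sides)                   # random side ends up on top
    return sides[0], sides[1]


def estimate_burnt_down_given_burnt_up(n_sims=100_000, seed=1):
    """Estimate Pr(down side burnt | up side burnt) by counting."""
    rng = random.Random(seed)
    burnt_up = 0
    burnt_both = 0
    for _ in range(n_sims):
        up, down = simulate_pancake(rng)
        if up == 1:              # condition on what we observed
            burnt_up += 1
            burnt_both += down   # counts servings where both sides are burnt
    return burnt_both / burnt_up


# Exact answer from the definition of conditional probability:
# Pr(burnt up and burnt down) / Pr(burnt up) = (1/3) / (1/2)
exact = (1 / 3) / (1 / 2)
estimate = estimate_burnt_down_given_burnt_up()
print(f"exact = {exact:.4f}, simulated = {estimate:.4f}")
```

The simulated proportion should land close to the exact value, and counting only the burnt-up servings is the simulation version of conditioning.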

let me give you some intuition about why

now that probability theory has given

you the right answer in the text there’s

a simulation a pancake-based

individual-based pancake simulation of

this right some of you will appreciate

this right it’s like agent-based but

they’re pancakes to prove to you

that it’s 2/3 if you didn’t believe the

calculation let me try to give you some

intuition though now that I just told

you you don’t need to be clever and in


I’m not clever at all I just ruthlessly

applied conditional probability and

figured things out but if you want some

intuition at the end which is a

perfectly legitimate goal let me try to

give you a little bit the mistake is

focusing on pancakes you want to focus

on sides so look in this picture there’s

a side at the bottom and we want to know

what the other side is so if you look at

the top there are three burnt sides the

side at the bottom could be any of those

right so there are three possibilities

look at the other side of those how many of

those three possibilities are also burnt

the answer’s two so the answer is two out

of three you got to focus on sides we’re

talking about sides not pancakes right

that’s the mistake pancakes are so

delicious you were thinking of a pancake and

not the side right mental bias I forgive you


alright so the point of all this is

everything we’ve done in this course

everything you do in any statistics

course really is underlain by the idea

of ruthlessly applying conditional

probability to solve problems there’s

some information we have there’s some

other thing we’d like to infer if

there’s any evidence in the information

we have about the thing we’d like to

infer we reveal it through

conditioning through the rules of

probability or another way to

think about this we express our information

as constraints and distributions and

then we just let logic do the rest and

that’s what Bayesian inference is it’s

just logic so the rest of the time today I

want to take this approach and show you

how it produces automatic

solutions to two very common problems

which are historically in statistics

very hard to solve because people tried

to be clever if you avoid being clever

and just ruthlessly apply conditioning you

can get useful solutions to two very big

problems these are measurement error which

for sure is always present in data and

usually ignored yeah and missing data

which is a special kind of measurement

error right it’s not there at all it’s

just the extreme version of measurement

error these are really common so let’s

think about measurement error first

there’s always some error measurement

for sure

and when you do an ordinary linear

regression it’s captured in this

residual variance right there’s this

sigma thing it’s supposed to capture that

even if you had the true relationship

you don’t expect a perfect fit that’s

what that Sigma error variance on the

end is supposed to be that’s fine if the

only error is on the outcome variable

what if there’s also an error on the

predictor variables and imagine for

example that that error is not

uniform across the cases or the

variables then you’re in trouble

you could be in trouble and there are a

whole bunch of ad hoc procedures to try

and deal with this like reduced major

axis regression and such and none of

them are terribly reliable so let’s

think about an approach that avoids

trying to be clever and just conditions

on what we know how to get there so

think back to waffles right we had

pancakes now waffles the waffle divorce

data set from early on in the course

yeah and in that data set we had a bunch

of states in the United States of

America for each one we had measured

divorce rate median age of marriage and

the marriage rate and if you go back

and look at the data set you’ll see

there were columns which were standard

errors on a couple of the variables that

is these are taken from partial census

data from records they’re measured with

some error and that error has been

quantified in terms of the standard

error on each of those values we don’t

know the divorce rate in a particular

state what we have is an estimate of it

and we also have a number that tells us

our confidence in it the standard error

okay there’s a lot of heterogeneity in

this error in this data set so what I’m

showing you on this graph is the divorce

rate on the vertical median age at

marriage on the horizontal and the line

segments show you the standard error for

each state’s divorce rate so some

of these are big the reason some of them

are big is because some of the states

are small and so they produce relatively

little evidence in any given census

period so here’s another way to look at

these data the left graph is the

graphic from the previous slide it’s just

divorce rate against median age of

marriage with the line segments for the

standard error of each divorce rate and

then on the right the divorce rate again

vertical axis is the same now the

horizontal is the log population of

that state so really big states are on

the right California

right it’s the one all the way on the

right and in California in any

particular year you get so many of the

events that you get a fantastic measure

of the true if you want to say the true

divorce rate in the state cuz there’s just

so many Californians right they’re all

over the place I’m one right produced

all over the world and on the left

you’ve got smaller states

I won’t single any of them out well

later I will but you know pick your

favorite small state there there are

states with very small populations

particularly west of the Mississippi

where there are more livestock than

people by far in many parts of the western

United States and in those states

there’s so few of these census events in

any given period that you’re gonna have

a lot of error all right your confidence in

having estimated the long run rate is

very small and so the standard errors

on those states will be big does this make

sense let’s think about this in terms of

a causal model measurement error can be

put into a DAG almost anything could

be put into a DAG not anything but

almost anything and so we want to think

about the observed divorce rate here

which I named D sub observed as being a

function of the true divorce rate which

we don’t get to observe here that’s D

and the population size of the state

which I call N and in the data set

we’ve already taken the N part of this

and summarized it as a

standard error a reliability right but in

general what’s going on is that

your observed variable for divorce

rate is a function of these two things

and that’s that’s what generates the

measurement error yeah does it make

sense so how can we approach this

statistically well let’s not be clever

let’s just be ruthless and apply

conditional probability we state what we

know and we see if it’ll figure out what

we don’t know so here’s the idea about

how the observed variables are generated

right so there’s some true divorce rate

which we haven’t observed and we’d like

to use that as our outcome variable not

this crummy measured thing right with

highly unreliable values for some states

so generatively thinking our observed

divorce rate is sampled from some normal

distribution all right central limit

theorem makes this almost true in this

case and the mean of this normal

distribution will be the true rate and

then there’s a standard deviation which

has been summarized it’s

inversely related to the population

size using the standard error

yeah saving you some work you could do

this yourself somebody did this right

there’s a formula for this does this make

sense conceptually what’s going on it’s

like you’ve you sampled this process

every month divorces are happening in a

given state but in California you get

thousands and thousands in any given

month in Idaho mostly zero every month

correct yeah

and so but in the long run there’s some

D true that you’d observe in the long

run averaging over a really long data

set but in any finite period there’s

error and that error will be inversely

related to the population size it’s

a function of the population size and

that’s what the standard error is yeah
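
The sample size effect described here can be seen in a small forward simulation. This is a hedged sketch under assumptions that are not from the lecture: the per-person event probability (`true_rate = 0.05`), the three population sizes, and the number of periods are made-up illustration values, and each person is treated as an independent coin flip per period. The same true rate is observed through populations of different sizes, and the spread of the observed rate across periods is compared with the binomial standard error.

```python
import random
import statistics


def observed_rates(true_rate, population, n_periods, rng):
    """Simulate the observed event rate in each of n_periods periods."""
    rates = []
    for _ in range(n_periods):
        # Each person independently has the event with probability true_rate.
        events = sum(1 for _ in range(population) if rng.random() < true_rate)
        rates.append(events / population)
    return rates


rng = random.Random(20)
true_rate = 0.05  # hypothetical long-run rate, identical for every "state"
spread = {}
for population in (200, 2_000, 20_000):
    rates = observed_rates(true_rate, population, n_periods=100, rng=rng)
    spread[population] = statistics.stdev(rates)
    # Binomial standard error of a proportion: sqrt(p * (1 - p) / N)
    theory = (true_rate * (1 - true_rate) / population) ** 0.5
    print(f"N={population:>6}: sd of observed rate = {spread[population]:.5f}, "
          f"theory = {theory:.5f}")
```

Under these assumptions the spread falls with the square root of the population size, so a hundred times more people gives roughly ten times less error on the observed rate.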

so the thing you observe comes out of

this generative process make sense and


let’s think about the statistical model

now so this was our DAG before to remind

you A is median age at marriage M is the

marriage rate D is the divorce rate we

were evaluating the direct and indirect

effects of age at marriage and this is

the statistical model we had before

where the thing on the top is D true we

acted as if we knew the divorce rate but

now we don’t so we need to add something

to this model what are we going to

add we’re going to put a line on top of

it which is the observation process and

this observation process you just add

this to the model now and now the D true

is a vector of parameters because we

don’t know it we replace all those

things which were previously the

outcomes with a bunch of unknown

parameters and then the observation line

at the top estimates them for us

because we have information with which

to estimate them what is that

information well we have the standard

error okay that’s great but if that was

all we had that would not

be enough we also have the whole

regression the whole regression

relationship is going to help pin down

values for the different states and you

may be getting a tingling sensation in

your skull which tells you that

shrinkage is going to happen right cuz

it’s my class and there’s always

shrinkage right so let me well I should

pause here and say does this make sense

yeah what’s happening one way to think

about this is if you were going to

simulate measurement error this is the

model you use it’s just you write down

the same model and then it runs

backwards right Bayesian models are

generative and you can run them in both

directions so if you run them forwards

you simulate fake data if you run them

in the reverse they spit out a posterior

distribution right you feed in a

distribution they spit out data you feed

in data they spit out a distribution you

can run them in both directions and so

this if you were going to simulate

measurement error you do it with the top

line you’d have some true value but it

will be sampled from a distribution so

it would have some error on it yeah

because in a finite time period imagine

the census period was a day in a big

state like California you get many more

events and so that day provides much

more information about the true rate

right the the variance from day to day

will be higher in a small state

it’s like sample size right so imagine

you pick a village instead of Los

Angeles and you try to estimate the

divorce rate it’s easier if you pick

Los Angeles yeah

does it make sense it’s just sample size

it’s just sample size look California

is bigger than Idaho and if the periods

if the census periods are the same

for both it’s the sample size effect

yeah you never observe the rate the rate

is an unobservable always and so the

sample size constrains your precision of

it you’ll sleep on it

it’s just sample size it’s nothing but

sample size all right how do we do this

in a model exactly as it was on the

previous slide the only trick here you

got to define D true as a vector of

parameters as long as the whole

dataset so for every state there’s a

D true and that’s what the vector N

thing is in this model so at the top

we’ve got D true which is a vector of

parameters and on the second line we define

it as a vector of length N and it gets

the likelihood but there’s a parameter

on the left now not an observed variable

but it works the same remember the

distinction between likelihoods and

priors in a Bayesian model that’s that’s

cognitive that’s something you know

probability theory doesn’t care it works

the same on unobserved and observed

variables it doesn’t care at all so when

when something in your data set becomes

unobserved the model doesn’t change yeah

then this is like a brain bending thing

for people because you’re taught like

the data and parameters are

fundamentally different things but

they’re not they’re just variables

and sometimes you observe them and

sometimes you don’t and this is the part

where I said you don’t have to be clever

you just have to be ruthless the model

exists before you know the sample right

it’s like a representation of the

generative process the fact that you

haven’t observed some of the data

doesn’t mean the model changes yeah it

just means that now you have parameters

there because you haven’t observed it

yet because the parameter is just an

unobserved variable yeah

the fact that you could observe it makes

it feel really different than something

you can’t observe like a rate yeah

okay and then the rest is just the old

priors right good yeah
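
Written out, the model described above might look like the following. This is a sketch reconstructed from the spoken description, not a copy of the slide: the symbol names (D_obs, D_SE, D_true) follow the lecture, but the specific prior values in the last three lines are assumptions, since the lecture only says the rest is the old priors.

```latex
\[
\begin{aligned}
D_{\mathrm{obs},i} &\sim \mathrm{Normal}(D_{\mathrm{true},i},\ D_{\mathrm{SE},i})
  && \text{observation process, standard error taken from the data} \\
D_{\mathrm{true},i} &\sim \mathrm{Normal}(\mu_i,\ \sigma)
  && \text{same regression as before, outcome now a parameter} \\
\mu_i &= \alpha + \beta_A A_i + \beta_M M_i \\
\alpha &\sim \mathrm{Normal}(0,\ 0.2)
  && \text{assumed prior} \\
\beta_A,\ \beta_M &\sim \mathrm{Normal}(0,\ 0.5)
  && \text{assumed prior} \\
\sigma &\sim \mathrm{Exponential}(1)
  && \text{assumed prior}
\end{aligned}
\]
```

The second line has the same form whether D_true is observed or not, which is the point being made about likelihoods and priors: only the first line is new.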

some of you were nodding at something

here like I don’t like this making me

unhappy and what happens in this


so there’s shrinkage as I said the

tingling sensation in your skull is

shrinkage and now I’m plotting the

relationship with median age at

marriage because you know there’s a strong

relationship in these data between the

median age of marriage and the divorce

rate plotting median age of marriage

against the divorce rate both of these

are standardized variables the blue

points in this are the values that were

observed for a divorce rate in the data

set the ones we previously ran the

regression on back in Chapter I don’t

know what it was five or something and

the open circles are the values from the

posterior distribution the D trues the

posterior means for each of the D trues

that have been estimated and the line

segments connect them for each state

with me so what has happened here well

there’s shrinkage some have moved more

than others and since you’re pros at

shrinkage now right you can explain this

pattern what has happened why have

some of these moved way more than others

and why have they moved where they’ve

moved so they’ve moved toward the

regression line because that’s the

expectation right that’s the expectation

if some state’s observed divorce rate is

really far from the regression line then

it will shrink more but how much it

shrinks is also a function of its

standard error so a really big state

like California could stick far from the

regression line it doesn’t here it’s

actually pretty typical but it could because

it has such a precisely measured divorce

rate but a small state like Idaho look

on the left of this slide you see Idaho

that’s ID on the far left

Idaho has more sheep than people right

mostly potatoes just potatoes and mountains

for the most part beautiful state and

it has a very imprecisely measured

divorce rate it gets shrunk it’s still

off the regression line by a good amount

right but this model says like given

this relationship in these variables

that measured rate is way too extreme to

be believable probably due partly to

sampling error and it shrinks it a lot

you get similar effects for North Dakota

down on the bottom
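
The pattern described here, more shrinkage for states that are far from the line and more for states with large standard errors, can be illustrated with the standard normal-normal update. This is a simplified sketch, not the lecture's model fit: it assumes the regression expectation `mu` and residual standard deviation `sigma` are known numbers, whereas the full model estimates them together with the true rates. The function name and all numeric values are hypothetical.

```python
def posterior_mean_true_rate(d_obs, se, mu, sigma):
    """Posterior mean of the true rate for one state.

    Assumes d_obs ~ Normal(d_true, se) and d_true ~ Normal(mu, sigma),
    with mu and sigma treated as known (a simplification).
    The result is a precision-weighted average of d_obs and mu.
    """
    w_obs = 1.0 / se**2     # precision of the measurement
    w_reg = 1.0 / sigma**2  # precision of the regression expectation
    return (w_obs * d_obs + w_reg * mu) / (w_obs + w_reg)


# Hypothetical standardized values: both states report the same extreme rate.
mu, sigma = 0.0, 0.5
d_obs = 1.5
precise = posterior_mean_true_rate(d_obs, se=0.1, mu=mu, sigma=sigma)  # big state
noisy = posterior_mean_true_rate(d_obs, se=1.0, mu=mu, sigma=sigma)    # small state

print(f"precisely measured state: {precise:.3f}")
print(f"noisily measured state:   {noisy:.3f}")
```

With the same observed value, the noisily measured state is pulled much closer to the regression expectation, while the precisely measured one barely moves.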

Wyoming is an interesting one there

that’s the next one over WY it’s not so

far but it’s so uncertain that it gets

shrunk directly to the line

right because Wyoming is another one of

these states it’s mostly sheep right

another beautiful state and so on Maine

has an extremely high divorce rate and

it gets shrunk a lot too and it’s small

and Rhode Island’s a small state etc you

can look at this again compare it to

this plot with the standard deviations

where we calculate the shrinkage so on

the left side of this plot what I’ve

done is I’ve taken the difference

between the estimated and the observed so

this is the mean of the posterior

distribution of the D trues and

subtracted the observed rate in the data

set and that’s the vertical axis and on

the horizontal axis I put the standard

deviation that’s in your data set the

standard error of the measure and you’ll

see that any state that has a

difference of zero that means there is no

shrinkage so on the far left you’ve got

California for example and then as you

move to the right the standard deviation

increases you get more movement right

that’s what the shrinkage is there’s a

bigger difference between the observed

rate and what the model thinks is

plausible so makes sense okay that’s

error on the outcome when it’s when it’s

not constant it varies across cases you

can also of course have error on

predictor variables and those can go

into the model the same way again be

ruthless don’t be clever we’ve got a

generative model imagine sampling an


version of a predictor variable but now

with error and then you insert that

observation process into your model

so that’s what we’re gonna do here so on

this plot on this slide is now we’re

looking at the marriage rate in each

state and that’s also measured for the

same reason with error and again plotted

on the horizontal against log population

California’s on the far right Idaho

Wyoming on the far left the line

segments get bigger because of the sample

size issues here’s the model I’ll do

some annotation on this on the next slide

so don’t panic top part is the same

as you saw before the very first line is

the observation process on divorce rate

then we’ve got the regression of the

true divorce rate on age at marriage and

marriage rate which is M but now inside

the regression the linear model we have

M true not the observed M what is M

true it’s a parameter right there’s a

parameter for each state here and it

goes in so this last term is a parameter

times a parameter don’t you like the

electing the course now yeah and so

every state is gonna have one of these M

trues we don’t know it but it goes

into this thing exactly the same it’s

the same model because it’s the same

generative process

you haven’t observed it but that doesn’t

change the model and then we’ve got what

you might call the likelihood for the

observed rate the M observed the

marriage rate for each state that we saw

comes from this sampling process again

normal M true with its standard

error which is also in the data set good
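
In symbols, the model with error on both the outcome and a predictor might be written as below. This is a sketch assembled from the spoken description: the names M_obs, M_SE and M_true mirror the divorce-rate notation, and the Normal(0, 1) prior on M_true is the one the lecture attributes to the text for a standardized variable. The priors on the regression parameters are omitted here because the lecture does not restate them.

```latex
\[
\begin{aligned}
D_{\mathrm{obs},i} &\sim \mathrm{Normal}(D_{\mathrm{true},i},\ D_{\mathrm{SE},i})
  && \text{observation process for divorce rate} \\
D_{\mathrm{true},i} &\sim \mathrm{Normal}(\mu_i,\ \sigma) \\
\mu_i &= \alpha + \beta_A A_i + \beta_M M_{\mathrm{true},i}
  && \text{a parameter times a parameter} \\
M_{\mathrm{obs},i} &\sim \mathrm{Normal}(M_{\mathrm{true},i},\ M_{\mathrm{SE},i})
  && \text{observation process for marriage rate} \\
M_{\mathrm{true},i} &\sim \mathrm{Normal}(0,\ 1)
  && \text{prior for the unobserved predictor}
\end{aligned}
\]
```

The only structural changes from the outcome-only version are the M_true term inside the linear model and the two lines that describe how M_obs is generated.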

yeah again you got to go home and draw

the owl of course but this is really all

it is is you just think generatively

about the thing and then write down the

stats model you can run in both

directions one last thing to talk about

before I show you what how this model

behaves you’ve got to put in a prior

for these M trues right and you can do

lots of interesting things here in this

example in the text I set it to zero one

because it’s a standardized variable

it’s not terrible what happens as a

consequence of that though is that

you’re ignoring information in the data

because if I tell you

the age at marriage in a state you get

information about the marriage rate right

look at the DAG on this slide if you

believe this DAG and you do yeah you

don’t have to but there’s lots of

evidence this is a reasonable DAG

for the system age of marriage

influences marriage rate and so we can

have a better prior for the marriage

rate if we use all of the regression

information put the whole DAG into this

model I’ll do an example of that not in

this but later on today I’ll have an

example when we do that we put the whole

DAG into the stats model all the

variables have relationships and if we

do it all at once

there’s even more information that can

help us pin things down and get better

estimates of the true values but this

will be awful okay okay I know there’s a

lot going on in this graph it’s not the

most beautiful graph of the course

what’s happening here we’ve got

two variables now which are observed

with error and I’m plotting them against

one another we’ve got divorce rate on

the vertical standardized and marriage

rate on the horizontal again

standardized blue points are the

observed values in the data set all

right so the combination of the observed

marriage rate observed divorce rate and

then the open points connected

by the line segments are the

corresponding pairs of posterior means

for the estimated true

rates in both cases yeah so you’ve got

shrinkage in two directions now right

both of them towards some regression

relationship but you can you can see I

haven’t drawn it but you can kind of see

it there right it’s like a constellation

this is like the Milky Way and some of

these shrink a lot more than others so I

want you to notice the first thing

you’re going to notice is if you’re

really far from the regression line you shrink

more you expected that and you see that

here right away there’s a more subtle

thing going on which is that there’s

more shrinkage for the divorce rate than

for the marriage rate so look at a case

like in the upper left there’s some state up


I should have labeled this I know let’s

guess that’s Wyoming or something like

that well no it’s probably

Maine or something and it’s extreme in

both right it has an extremely low

marriage rate yeah this is almost

certainly Maine and an extremely high

divorce rate and so it comes down really

far but it comes down a lot more on

divorce rate than it does on marriage rate

why and then you can see in the other

cases this is also true for the most

part yeah all those things at the top

similarly why does that happen and the

answer is because the regression says

that marriage rate is not really

strongly related to divorce rate so the

shrinkage isn’t as strong for

marriage rate because there’s not as

much information in the regression to

move it the model doesn’t know where to

move it exactly so it doesn’t it isn’t

attracted to the regression relationship

as strongly because the real causal

effect in this model is age at marriage

remember there’s this association

between these two variables that arises

through the backdoor that other path

but that means you don’t get as much

shrinkage on this predictor does that

make sense yeah I thought this was a

cool effect when I made up this example

anyway go home and draw the owl you will

have fun with this I am absolutely sure

let me try to put some context around

this before we move on to the next topic

so measurement error comes in many

disguises and the version I just gave

you is the simplest where there’s a

variable in your data set that’s called

the error right sometimes you’re lucky

and you get that and some expert has

told you here’s the measurement error in

this variable then you can proceed like

this but there are many many more subtle

forms as well one of the things that I

see a lot and I think it’s a waste of

information is people will pre average

some variable and then enter that as a

predictor in a regression that hides

the fact that you have a finite sample

with which to estimate that mean yeah

so people will do this all the time like

you’ve got some sample from a state and

then you just create state averages and

then you put them in the data set right

that takes variation out of the data set


if you’re doing that consciously as a

way to get your p-values to be

smaller I’d call it cheating but I

think usually people aren’t cheating

they’re just doing what they were taught

to do yeah and but we don’t have to do

that what else could you do you could

just use a multilevel model right

that’s what you could do and then the

means are varying effects right they’re

parameters and you don’t have to do any

averaging do the averaging in the model

not outside the model and then all of

the uncertainty that has to do with

different sample sizes is taken care of
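
To see what pre-averaging throws away, here is a small sketch under assumptions that are not from the lecture: two hypothetical states share the same underlying mean and spread but are sampled with very different sample sizes (10 versus 10,000). The function name, seed, and sizes are made up for illustration.

```python
import random
import statistics

rng = random.Random(7)


def state_mean_and_se(n, true_mean=0.0, sd=1.0):
    """Draw n observations for one state; return its sample mean and standard error."""
    sample = [rng.gauss(true_mean, sd) for _ in range(n)]
    mean = statistics.fmean(sample)
    se = statistics.stdev(sample) / n**0.5  # standard error of the sample mean
    return mean, se


# Same underlying mean, very different amounts of evidence.
small_mean, small_se = state_mean_and_se(n=10)
large_mean, large_se = state_mean_and_se(n=10_000)

print(f"small state: mean = {small_mean:+.3f}, se = {small_se:.3f}")
print(f"large state: mean = {large_mean:+.3f}, se = {large_se:.3f}")
# A data set that stores only the two means treats them as equally well known;
# the standard errors show that they are not.
```

A multilevel model keeps this difference in precision automatically because it works from the raw observations rather than from the averages.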

parentage analysis is a fun case this is

done in people but it’s also done in

other interesting animals right you’re

trying to figure out you got some

population of say wild rodents and

you’re trying to figure out who sired

whom right who is whose parent and you

can’t interview them unlike people and

so you get their genotypes and you

try to figure out how they’re related

and there’s there’s uncertainty you can

do exclusion but then there will be some

number of individuals who could be the

parent so there’ll be different

probabilities and this is a standard

sort of thing that happens in this


so when I did the example last week I

used a single tree that’s a strong

fiction there we don’t know the history

in practice phylogenies are

rarely very certain and so on the

right-hand side of this slide there’s

this trend now to plot phylogenies

like this I’m a really big fan of this

I see this in papers all the time

now I like this a lot so you’re showing

the whole posterior distribution of

trees there’s a big fuzzy graph in some

cases there’s massive uncertainty in

some parts of trees and you want to keep

that in mind and you can do the analysis

over the whole distribution of trees

actually right imagine you just feed

it into Bayes like this and now you’ve

got this whole distribution you need to

average over it works like lots of other

ways that we have distributions in Bayes

in archaeology paleontology forensics

measurement error is the norm

that’s absolutely standard your your

data are decimated in some way so think

about like radiocarbon dating you don’t

know the radiocarbon date you’ve got you

know a few hundred years or thousands of years

right the archaeologists in the audience

know the pain of this

and people take this very seriously now

at least most people do take this very

very seriously

I had sexing in here I had a colleague

back in California who was trying to

sex fossils and this is no joke right to

try and do this correctly you can never

be sure it’s very very difficult to sex

a primate fossil and but you can assign

probabilities absolutely you can okay

yeah determining ages is another issue

in my department we work in places where

people don’t keep track of birthdays so

you can ask people for their age they’ll

give you a number but you don’t want to

you know usually use it right

so using biological facts like

average birth spacing things like that

you can reduce the error on those

estimates okay let’s shift to a very

related example the grown-up version of measurement

error is missing data there’s lots of

things which are sort of mechanically

similar about this but it feels really

different because often it feels like

the data is missing there’s nothing you

can do you want to do something about

missing data typically and I want to

teach you today why so this is really

common right you’re used to this most of

the standard regression tools in

software packages will automatically

remove cases with missing values without

saying anything to you right so if you

run lm or glm in R if it finds in

any of the variables you include in

the regression a missing

value it deletes that whole case so all of

the variables for that case are removed

from the dataset this squanders

information right so that’s the first

thing to worry about but it can also

create confounds so missingness can

create confounds complete case deletion

is not harmless that’s what I wanted to

impress on you and there are ways to deal

with this sometimes there’s no

guarantees so how do people deal with

this there are lots of different

approaches the worst approach and you

should absolutely never do this as I say

on this slide is to replace the missing

values with the mean of that column this

is tragic this is really really bad idea

why because the model will interpret it as

if you knew that value

yeah but you don’t know that value

you want something with error there

that’s what you want and the mean if

you just put it in there you cognitively

know that you don’t know that value but

the model doesn’t the model thinks

it’s the mean and then bad things happen

really bad things so don’t do this
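A minimal sketch of the problem with mean imputation, on simulated data. The variable names, the sample size and the 40 percent missingness rate are assumptions made up for illustration, not anything from the lecture. Filling the gaps with the column mean makes the column look less variable than it really is, because every filled-in value is treated as if it had been measured exactly.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated "true" predictor; in real data we would never see all of it.
x_true = rng.normal(size=1000)

# Knock out roughly 40% of the values completely at random.
missing = rng.random(1000) < 0.4
x_obs = x_true[~missing]

# Mean imputation: every gap is replaced by the observed column mean.
x_filled = np.where(missing, x_obs.mean(), x_true)

sd_obs = x_obs.std()        # spread among the values we actually measured
sd_filled = x_filled.std()  # spread after pretending the gaps equal the mean

print(f"sd of observed values:    {sd_obs:.3f}")
print(f"sd after mean imputation: {sd_filled:.3f}")
```

The second number is always smaller: the filled-in rows sit exactly on the mean, so they add cases without adding any spread, and a model fit to this column has no way to know those rows are guesses.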

almost I haven’t seen this in a very

long time which is nice the word has

gotten out that you should absolutely

never do this what else could you do

there’s this procedure called multiple

imputation which works it’s one of these

things which shouldn’t work but works

really really well and it’s it’s a

frequentist technique for imitating what

we’re gonna do today in fact it was

invented by a Bayesian

Don Rubin back when desktop computing power

couldn’t do these things and so he

created the frequentist technique it

works unreasonably well it’s really

effective and basically you do the model

multiple times on different samples from

some stochastic model of that variable

that’s what multiple imputation is you

don’t need very many multiples to get a

really good estimate of the uncertainty

it turns out we’re just going to go full

flavor Bayesian imputation here and just

put in the probability statement about

how this works okay impute what does it mean


impute means to assert some feature of a

thing in law it’s usually a crime that

you’re imputing to somebody right but

here there’s no there’s no valence like

that implied so before we get to the

mechanics of how to run a model that

does imputation that tries to guess the

value of that missing variable

let’s talk about DAGs again and try to

get missingness into the DAG this is a

literature that is deeply confusing

because the terminology is really awful

it’s just about the worst terminology

I’ve ever come across in any region of

Statistics and I will convince you of

that on the next slide I’m confident so

let’s think about the primate milk data

again as an example this was a small

data set we’ve got a number of primate

species we’re interested in

understanding why the energy content of

milk varies so much across different

species and we’re focusing on body mass

and proportion of the brain that is

neocortex so here in this DAG I have

M as body mass B for brain is the

proportion of the brain that is neocortex

why are we focused on that because

humans have a lot of neocortex and we

focus on things that we have a lot of

right it’s just this kind of narcissism

of our species right the whole field of

anthropology is an exercise in

narcissism and K is the milk energy

kilocalories of milk U is some unobserved

thing that is generating a positive

correlation between body mass and the

proportion of the brain that is neocortex and

there is a strong positive correlation

across species in these two things but

we don’t know the mechanism so I’m just

putting a U there to say I don’t know

something is generating it right it’s

not directly causal from M to B

because it’s not just allometry right

so if we were just talking brain

size then maybe it would just be an

arrow from M to B but that’s not

what’s going on here right you can have

brains the same size but some of them

will have more gray stuff right than

others do so there’s something else going

on here we just don’t know what it is

now let’s talk about the three types of

missingness all types of missingness can

be classified usefully in this taxonomy

because it tells us what we need to do

and that’s why this taxonomy exists and

this is where the totally confusing

vocabulary comes in so you’ve got to

know this vocabulary or at least when

you recognize it you can come back and

check the notes to figure out what’s

going on this is horrible

right so on the Left we’ve got something

called missing completely at random I’m

gonna walk through that starting on the

next slide and explain what that means

it’s abbreviated MCAR and I’ll

explain this DAG to you in the middle

we’ve got the second type is called

missing at random does that sound different

to you than missing completely at random in

a sense it sounds the same to me

but it’s totally different in terms of

its consequences and then I’ll explain

that to you as well after I explain

MCAR and then there’s MNAR sometimes

written NMAR instead because you

know in English you can put a not

anywhere and it has some random effect

on the meaning of the sentence right

that’s it and so this is missing not at

random so I don’t know about you but as

a native speaker of English when I hear

missing completely at random missing at

random missing not at random I don’t

think of these processes at all that I

will explain to you this is a tragedy

this is another example of the law that

statisticians should not be allowed to

make terminology but I think you can

understand what’s going on here and

these three types are incredibly

important because they tell you what you

need to do to make an unconfounded

inference about things so let’s start

with the first here’s our our milk

energy dag the triangle at the bottom is

the basic dag and now we’re not going to

get to see B the true proportion of

neocortex anymore because it has missing

values in it and in this data set it

does about half of the values are

missing almost half of them are missing

I think I mean there are 12 missing

values in this data set lots of primates

that no one’s ever measured this for

maybe they measured brain mass but they

didn’t measure the percent neocortex and

so we instead we’ve got this variable B

observed it has deletions in it gaps

missing values in it and to get this we

know it’s partly caused by B that’s why

there’s an arrow coming up from B but

it’s also caused by this missingness

mechanism whatever it is that places

missing values in particular on

particular species and we want to name

this thing and in this literature these

missingness processes are given the

letter R for some reason I guess for

random and then you use a subscript to say

which variable it’s affecting so R sub B

is the missingness mechanism that

creates missing values in B observed

think about this what DAGs always

say is they say for any variable the arrows

entering it are the things the

arguments to some function which

generates that variable right so this

says B observed is a function of B and

the missingness mechanism that’s all it

says it makes sense now since you’re all

experts at graph analysis I ask you are

there any back doors from B observed

to K why is that the question because

we’re going to condition on B observed

we can’t condition on B because we don’t

have it but the graph stays the same and

now K is our outcome B observed is our

predictor are there backdoor paths

from B observed to K I’ll give you a moment

any back doors remember what a back door

was do you guys remember that’s not a

back door

that’s a front door so that’s the

distinction there’s two things

to think about here right there’s no

back door the answer is no

there’s no back doors it’s only a back door

when the arrow enters the back of the predictor

be careful right there’s no back door but

there are two paths from B observed to K

there’s a direct effect and an indirect

effect but the total causal effect of B

observed on K can be estimated by simple

regression with just B observed on the

right-hand side right because there’s an indirect

effect through U and M yeah exactly

indirect path so there’s some you know

other paths there but there’s no back

door make sense so in fact you

can condition on M here and you can get

the direct effect of B but there’s no

path that takes you through

R sub B no back door that takes you

there through it you see that it’s just

like stuck on the end of the graph and

this means that the missingness

mechanism is ignorable it’s not a

confound you can analyze it just like

any other variable right there’s

something that’s influencing the thing

you’re gonna condition on but it doesn’t

create any backdoor confound and so you

can ignore it right remember the rules

you don’t have to condition on it and

that’s true here you don’t have to

condition on it you don’t need to know

and that’s nice because usually we can’t

discover the missingness mechanism yeah

so this is the benign case

it’s called missing completely at random

yes remember that it’s completely MCAR

okay it’s MCAR when K is

unconditionally independent of the

missingness mechanism you don’t have to

do anything

condition on anything in this graph to

keep your inference about K independent

of the missingness mechanism yeah

you’re safe it means it’s ignorable so

this is the benign case it’s what

everybody hopes for when you find

missing values right and people

pretend oh it’s MCAR but let’s pause

for a second and think about this yeah

yes you can so complete case analysis

will not create a confound when this is

true this is what licenses you to drop

all those values is this assumption and

this is the only time it’s going to turn

out that that is okay however if you do

the imputation which we will you can do

even better because you get more power

right you’ll get a more precise estimate

of the causal effect if you do the

imputation so you still don’t want to

drop the values you could drop the

values it’s not a sin right but you

could do even better if you didn’t drop

the values and imputed that’s the MCAR

case

in MCAR you don’t have to impute but

you should yeah right exactly and that’s

why you lose power

exactly you with me before we leave

MCAR I want us to think about does this

ever happen so what would this graph

mean the only way you can get MCAR is

it’s like your research assistant used a

random number generator to delete values

in the spreadsheet right what what could

do this now maybe there are cases where

it could truly be unrelated to every

other variable in the graph that’s what

this means

the missingness mechanism is not

influenced by anything else we know

right or anything else we might

need to know even an unobserved variable

and that would be yeah it’s your

research assistant he comes in

and just like randomly deletes some

values right

it could happen could happen probably

has happened but probably not this is

the monkeys on typewriters sort of

missing this mechanism I assert that

this is highly implausible in most

real research situations that this is

the case I’m sorry to say I think if

you can come up with a convincing

example for your data set

congratulations but it’s really hard to

think of one where this happens in this

case what else is probably going on well

here’s one proposal

M is influencing the missingness

mechanism perhaps this will give us the

situation called missing at random I

know we had missing completely at random

now it’s merely at random or

something like that it’s missing at

random yeah I should say the person who

came up with this terminology is Don

Rubin who is an absolute genius

just not at terminology he’s the

first person to analyze these cases and

talk about the conditional probability

requirements for each it’s a super

achievement but this

terminology is not helping to spread

the gospel but I don’t mean to make

fun of the topic so M is now entering

R sub B influencing it what does this mean


the missingness mechanism depends upon

the body mass values species that have

particularly large or small body masses

are more likely to have missing

neocortex values

how might that happen well

anthropologists have different research

interests and they find different kinds

of species attractive to study and

measure in particular for example maybe

small ones are really hard to measure

neocortex for or we’re just not as

interested in them right there’s like a

bunch of callitrichids they’re out

there and they’re little and furry and they’re

cute but there’s not nearly as much

effort on measuring them as on

chimpanzees right where there are whole

armies of people in this building

studying chimpanzees right so and that

makes sense but it generates a pattern

where some features of a species

predict missing values they are

associated causally with the missingness

in other variables does that make sense I

think this is extremely common and in

this case we get the missing at random

case and now I ask you the same question

as before is there a back door path

from B observed to K and I won’t you know

go through the socratic thing of waiting

for you to say something yes there is

now because there’s an arrow entering R sub B

from the back you’ve got a complete path

all the way from B observed around

through M

to K how can you close that back door

you condition on M as it says

on this slide yeah sorry I shouldn’t have

put that there

you condition on M and then it shuts

that path but still you should look at

it right remember this I know it’s it’s

rusty but remember all the path closing

procedures yeah if we condition on M we

block that fork right M forms the fork

and you can close the fork by

conditioning on the middle of the fork

so to close the fork we’re conditioning on

M and then again the missingness

mechanism is ignorable we don’t have to

know it but we do have to condition on

M and we do have to impute otherwise

there will be a bias in the estimate

but you don’t have to know R

which is nice because R is usually not

knowable right in any detailed sense

here you don’t have to know

the missingness mechanism I’ve got a

summary on the next slide but you do

need to do imputation otherwise you’re

gonna have a confounded a biased

estimate of the causal effect so what is

MAR missing at random missing simply

at random maybe is any case in which K

is conditionally independent of the

missingness mechanism that is there is some

variable in the graph or set of

variables in the graph we can condition

on and separate the two d-separate if

you remember that term from way back

when chapter six or so and this is

missing at random missing at random is a

nice situation to be in and it’s the

situation is probably most common there

is something else in our system which is

associated with it causing the

missingness if we can condition on that

and do imputation we have hope of

getting a good causal inference out of

it I’ll show you how to do this why do

you need to impute because all the other

variables are associated with this missingness

pattern and this can create really

strong biases in the inference yeah the

final case which is the worst case to be

in is called missing not at random I know

the last one wasn’t at random either right

but this one’s really not at random

and in this case there are a couple ways

to get this let me show you the most

obvious the variable itself causes the

missingness particular values of

neocortex percent are more likely to go

missing than others right how could this

happen well for

example in this case I can’t think of an

example where it would be true but I’ve

got another mechanism that could do

it in this data set

but maybe species with low neocortex

you guessed that from the background

information and so you don’t measure

those and so you don’t have any precise

estimate for those species in that case

it would be the actual value that’s

doing it this is nasty because you get a

backdoor that you can’t close the

missingness mechanism is not ignorable

in this case there’s nothing you

can condition on that will shut it

because you don’t know the missingness

mechanism if you did know the missingness

mechanism you could shut that

backdoor path and that’s what’s required

and your only hope in this case is to

model the missingness mechanism and

thereby condition on it so if you’ve got

enough scientific information about how

that missingness works you can do that

if you’re lucky in the right case but

there’s no guarantee the other way you

can get this effect it isn’t just an

arrow from B to RB you could have a

latent variable that does it good times

right so here I’ve drawn on the right

another version of missing not at

random there’s another unobserved

variable U2 and this is a fork

which influences both neocortex percent

and missingness what could this be like

it could be like phylogeny imagine

humans since we’re narcissistic we like

to study things that are closely related

to us things that are closely related to

us have brains with a lot of neocortex

so in this case U2 if it’s

phylogenetic proximity to humans will

influence the neocortex percent and it

will also influence missingness good

times right this happens like all the

time in primatology okay so

there’s my summary of missing not at

random this is a case where K is

unconditionally dependent on R sub B there’s

nothing you can condition on except for

the missingness mechanism itself which will

shut that back door
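A rough simulation of the three cases, under assumptions that are mine rather than the lecture's: the simulated M and B share an unobserved cause U, the deletion probabilities are arbitrary logistic curves, and the target is simply the mean of B rather than a slope on K, because that makes the pattern easy to see. "Conditioning on M" is done here with a plain regression of B on M in the observed rows, a stand-in for the imputation model and not the method used later in the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# U is an unobserved common cause of body mass M and neocortex B (assumed structure).
U = rng.normal(size=n)
M = U + rng.normal(scale=0.5, size=n)
B = U + rng.normal(scale=0.5, size=n)   # true mean of B is about 0


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def estimate_mean_B(p_missing):
    """Delete B with probability p_missing, then estimate mean(B) two ways."""
    missing = rng.random(n) < p_missing
    seen = ~missing
    # 1) complete-case estimate: just average what is left
    raw = B[seen].mean()
    # 2) condition on M: fit B ~ M on observed rows, predict B for every row
    slope, intercept = np.polyfit(M[seen], B[seen], 1)
    adjusted = (intercept + slope * M).mean()
    return raw, adjusted


results = {
    "MCAR": estimate_mean_B(np.full(n, 0.5)),    # missingness ignores everything
    "MAR": estimate_mean_B(sigmoid(2.0 * M)),    # missingness depends on M
    "MNAR": estimate_mean_B(sigmoid(2.0 * B)),   # missingness depends on B itself
}

for name, (raw, adjusted) in results.items():
    print(f"{name}: complete-case mean {raw:+.2f}, after conditioning on M {adjusted:+.2f}")
```

In this setup the complete-case mean is fine under MCAR, is pulled well below zero under MAR until M is conditioned on, and stays biased under MNAR even after conditioning on M, which is the sense in which only the last mechanism cannot be ignored.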

and this happens too yeah I can’t say how

commonly but when you find

yourself in this situation your hope is

to model the missingness mechanism which

can be done okay I’ve got ten minutes to

go let me do this slide because this is

about concept and then there’s a bunch

of mechanical slides to come that I’ll try

to go quickly through because all that’s

in the text and it’s just how to run the

model yeah so here’s here’s my other I’m

trying to develop a way to teach this

stuff so here’s my other attempt at the

redefining these let’s think about dogs

eating homework in many parts of the

english-speaking world this is

a way we talk about people lying about why

they don’t have their homework done

right my dog ate it

it’s a standard joke my doggy ate my

homework sometimes a dog does eat your

homework though right so it happens

happened to me when I was a kid my dog

ate my homework it’s true you can

imagine me I was a straight-A student going

to the teacher my dog ate my homework

she’s like really you Richard I thought

better of you so that’s why cats are better

they don’t eat your homework so imagine

a DAG not dog DAG with four variables in

it labeled H H star A and D H is your

homework and that is the score it’s

worth the quality of your homework as a

quantitative variable H star is the

version with missing values so a bunch

of students are coming in turning in their

homework and some of it is missing

A is a certain attribute of the student

which causally influences the quality of

their homework like attention span working

memory adderall

whatever and D is our dog the missingness

mechanism right it was R on the previous

graphs it’s now a dog and on the Left

we’ve got missing completely at random

here I relabeled it the dog eats any

homework and this is the DAG for it so

the attribute influences homework on the

top of that the true state of the

homework influences H star the version

of homework with missing values and the

dog influences H star but nothing

influences the dog the dog will eat any

homework it’s not selective right that’s

missing completely at random

you with me okay I’m working on this it

does need some development I know so in

the middle we’ve got missing at random

the dog eats particular students’

homework the dog cares about the

attribute of the student so now it’s like

students who have particular

values of this attribute are more likely

to get their homework eaten right I don’t

have a mechanism as I said I need to work

on the mechanism of this the attribute

could be something like attention span

or something like that and so if you

don’t if you don’t pay close attention

to your homework and you turn away the

dog eats it something like that right so

it’s an attribute of the student not of

the homework yeah now of course it’s

correlated getting your homework

eaten is correlated with the score of

the homework but not because of the

score of the homework it’s because of

the attribute of the student who is

working on it that’s missing at random

and that’s the thing I said is really

common in science incredibly common in

science okay finally the worst case dog

only eats bad homework the dog sniffs

the homework assesses its score eats it

right so you know or more likely the

homework’s bad the student feeds it to

the dog so that’s another way this could

happen but it depends upon the score on

the homework right and so that’s missing

not at random the dog eats bad homework

and now in the DAG we’ve got an arrow

directly from the true H to the missingness

mechanism D does this help I’m

working on this something I think I’ll

add this to the book it might be a

little too weird but that’s never

stopped me before okay so let me show

you a little bit about the mechanics of

this and this is all in the book the

code to do this so I’ll necessarily move

a little bit faster this is just to give

you you know the first step of drawing

the owl the conceptual bit of it and the

key insight again is just we think

about the generative process and we

write down the same model as before

every missing value just gets a

parameter now because we haven’t

observed it so it becomes a parameter

the model stays the same we run the model so

that’s it basically that’s it well

there’s all this drawing the owl part

in between it has to do with the

algorithm but let me give you the

intuition so there’s 12 missing values

for neocortex in this dataset on the

right I’m showing you the whole dataset

the last column on the far right is the

neocortex percent each of those NAs

is a missing value

and we’re going to assume missing at

random that M is influencing the

missingness the body mass is influencing

missingness conceptually what we’re

going to do is we’re going to replace

each of the NAs in this column with a

parameter and then we’re going to get

posterior distributions for each of the

missing values and the information in

those will also flow into the regression

so you’ll get different slopes out of

this too so let me show you how this

works the idea is we think about each of

these gaps now getting assigned a

parameter because it’s unobserved and

unobserved variables are by definition

parameters that’s what they are in Bayes

and these things will be imputed by the

model this is what the model looks like

B is now a vector in which some

positions are observed values and some

are parameters all mixed together
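A small sketch of what that mixed vector could look like in code. The function names `merge_missing` and `log_prob_B`, and the toy numbers, are hypothetical and only for illustration; this is not the code from the book or the course package. Observed positions keep their data values, missing positions are filled from a vector of candidate parameter values, and the same normal density is then evaluated over the whole merged vector.

```python
import numpy as np


def merge_missing(observed, imputed):
    """Return a copy of `observed` with its NaNs replaced, in order, by `imputed`."""
    observed = np.asarray(observed, dtype=float)
    imputed = np.asarray(imputed, dtype=float)
    miss = np.isnan(observed)
    if miss.sum() != imputed.size:
        raise ValueError("need exactly one imputed value per missing entry")
    merged = observed.copy()
    merged[miss] = imputed
    return merged


def normal_logpdf(x, mu, sigma):
    """Log density of Normal(mu, sigma) evaluated elementwise at x."""
    return -0.5 * np.log(2.0 * np.pi) - np.log(sigma) - 0.5 * ((x - mu) / sigma) ** 2


def log_prob_B(observed, imputed, nu, sigma_B):
    """Sum of Normal(nu, sigma_B) log densities over the merged vector.

    For observed entries this term acts as a likelihood for nu and sigma_B;
    for the imputed entries the very same term acts as their prior.
    """
    merged = merge_missing(observed, imputed)
    return normal_logpdf(merged, nu, sigma_B).sum()


# Toy standardized neocortex column with two gaps (made-up numbers).
B_obs = np.array([0.3, np.nan, -1.1, np.nan, 0.8])
B_impute = np.array([0.0, 0.5])   # one candidate parameter value per gap

B_merge = merge_missing(B_obs, B_impute)
print(B_merge)
print(log_prob_B(B_obs, B_impute, nu=0.0, sigma_B=1.0))
```

A sampler would propose new values for `B_impute` along with the other parameters and re-evaluate this term each time; here the merge is written out by hand only to show the bookkeeping.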

and we’re going to stick that mix of

things into the linear model in an

ordinary regression model and then the

only additional thing we have to do

shown in blue here is have some prior

for the B values you can think about

this as the model of B and sometimes

your DAG will tell you this right what

is B caused by your DAG will give you

some information about this in the case

when B is observed it informs the

parameters inside this so nu

that’s nu the Greek letter that little

v is nu it is the mean of

the B values of the neocortex

values this is a standardized variable

so it’ll be very near zero and sigma sub

B is the standard deviation those will

be estimated from the observed values

when the value is not observed this is

then a prior for that value it keeps

it from being any old thing mind blowing
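Written out, the structure just described looks roughly like the following. The symbols for the intercept, slopes and outcome standard deviation are notation chosen here, and M_i is taken to be standardized log body mass; the priors on those parameters are left out because this passage does not state them.

```latex
\begin{align*}
K_i   &\sim \mathrm{Normal}(\mu_i, \sigma) \\
\mu_i &= \alpha + \beta_B B_i + \beta_M M_i \\
B_i   &\sim \mathrm{Normal}(\nu, \sigma_B)
\end{align*}
```

The last line does double duty: where B_i is observed it is a likelihood that informs nu and sigma_B, and where B_i is missing it is the prior for that unobserved value.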

I can see there’s at least one mind

blowing excellent but it’s the same

model as before all right here’s my

annotation to say the same thing

that B is a mix and then when B

is observed this thing is a likelihood but

when it’s not observed

it’s a prior in code form it looks

exactly the same except we add this

prior right here what ulam is going to

do is it’s going to detect the NAs and

it’s going to construct that mixed vector

for you it tries to help you it automates

this and what you get out in the

posterior distribution there’s a

parameter for every missing value so

you see 12 B imputes here and each of

these is an imputed neocortex value and

the question on all your minds I’m sure

is what does this do to the slopes in

this model now we’ve added 12 cases to

the data that’s nice it almost doubles

our sample size and so let’s compare the

same model you don’t have to change the

model code at all you just delete the

cases that have missing values and rerun


all right the model stays exactly the

same and now we can compare the slopes

remember this is one of these masking

effect cases we’ve got two predictors

that are positively associated with one

another but one is negatively associated

with the outcome the other one’s positively

associated model 15.3 is our

new imputed model that uses the full

sample and 15.4 is the old one from way

back in the previous chapter notice what

has happened is the estimates have

gotten more precise they’ve also shrunk

a little bit towards zero so the

previous model is probably

overestimating the influence of each of

these but we’ve gotten precision by

adding twelve parameters for things we

don’t know we’ve got more precise

estimates of the slope yeah this is what

you expect in a missing at random or dog

eats particular students homework

situation is that you de-confound

and get extra precision by doing the

imputation let me show you what happens

as a pattern in the data so we can plot

these imputed values up mixed in with

all the observed values but they’re

going to have standard errors on them

and that’s what I’m showing you here we

don’t know exactly what these values are

the posterior distributions are pretty

big right despite that they help us

understand the slopes more so on

this graph we’ve got neocortex percent

on the horizontal each of those open

circles is an imputed value the blue

circles are observed values they follow

the regression line the posterior means

follow the regression trend right

because the regression informs them if

you’ve got some species with a big body

mass that tells you something about its

neocortex percent because those two

variables are strongly associated in all

of the observed cases yeah so the model

automatically accounts for this you

don’t have to be clever cool right I

don’t like being clever it’s very hard

the disappointing thing about this model is

that the relationship between the

imputed values and the other predictor

is zero which is wrong so if you look at

this graph you see a regression trend

for the blue points the observed

points there’s a strong positive

correlation between log body mass and

neocortex percent but the imputed values

don’t follow this at all

and that’s because we didn’t tell the

model that these two things are

associated so in the text I show you how

to fix this just very quickly because I

know I’m out of time we do this by

saying M and B come from a multivariate

normal and we model their correlation we

did this just like an instrumental

variable right it’s the same trick the

same kind of code but you can do it now

even though B has missing values inside

of it you can still do it it’s a mixed

vector of parameters and such and

observed values there’s code to show you

how to do this you just have to manually

construct this mixed vector of things

and there’s some code to do that that’s

the drawing the owl part and then at the

end happy days they’re associated you

get even more precision from the

estimate of these things okay this is a

really big topic missing data and there

are lots of things which are kind of

like missing data but don’t feel exactly

the same one of the areas that I think

is most important in this family of

models is in ecology called occupancy

models mark-recapture methods these are

really missing data problems in a sense

you think of them they’re kind of like

measurement error there’s a true

occupancy is the species there but you

can’t observe that zeroes are not

trustworthy right and so there’s a

special relationship you do imputations

of the true States right there’s this

latent thing whether the species is there

and you need to impute it and that’s how

these models work they’re like missing

data models but they have special

structure which comes from the detection

process that you model okay I realize

I’m out of time I’m gonna put up a final

homework later this afternoon after I do

it myself I think it’s good but we’ll

see after I try it in which you will do

some imputation practice with some

primates and it’s due in a week even

though the course ends today please turn

it in in a week for a full sense of

satisfaction and if you so want some

certificate of completion I’ll be happy

to give you one as well

so with that thank you for your

indulgence for the last 10 weeks we’ve

gone a long way from the Golem of Prague

and as you go home and deploy your

golems I just want you to remain humble

in their presence and I hope you’ve

learned something valuable thank you