Lecture 20
All right, welcome back everybody. I have a lot of slides; this is going to be a great lecture. This is the last lecture, but certainly not the least: we are going to cover some topics that I think are extremely important. Before we get to that, let's talk about pancakes. These are my very bad drawings of pancakes. There are three pancakes, and the hatching on a side of a pancake indicates that that side is burnt. So imagine you come over to my apartment for pancakes, and I turn on the griddle and the temperature isn't right — it's too hot — and the first couple of pancakes get burnt. This is actually how it happens. By the third pancake it's perfect. So the first pancake is burnt on both sides, the second pancake is burnt on only one side, and the third pancake is just right.
Now I serve you a pancake at random, with a random side up, and you look down at your pancake and it's burnt. I want to ask you, since you've taken my stats course: what's the posterior probability that the other side is also burnt? I'll let you think about it for a second. All the information you need to solve this problem is on the slide. You don't know which of the three pancakes it is — well, you know it's not number three. So what's the probability the other side is burnt? What's the answer? Anybody else think a half? We've got two votes for a half. Two out of three? Yes — the answer is two out of three, not a half.
Half is the standard intuition, and it's a perfectly reasonable answer — it just means you're human. The intuition is half because there are two pancakes it could be, and you think it could be either one at random. But no, it's two out of three, and let me show you how to figure that out. In doing this, the point isn't to make you feel bad — it was my intuition too the first time I saw this problem. This is a famous logic problem; it wasn't originally done with pancakes, but I like it better with pancakes. The original uses boxes with balls in them or something like that. So: pancakes. Everybody likes pancakes, right? Good.
The point of these logic puzzles is not to make you feel bad; it's to teach you some intuitions — or correct your intuitions — and to teach methods for solving these things. As I've said many times in this course, one of the values of learning probability theory is that it means you don't have to be clever: you can just ruthlessly apply the rules of conditioning, and you don't have to feel like you need to intuit the right answer. So my advice is: don't trust your intuitions; instead, ruthlessly condition. What do I mean? The way we figure things out in probability theory — or Bayesian inference, which is just probability theory — is that we want to know the probability of something, so we condition on what we know and see if that updates the probability. If there's any information in what we already know that bears on the thing we'd like to know, then when we compute the probability of the thing we want to know, conditional on the stuff we already know, it'll be there. The rules of probability tell us the only way to do that. So let's do that for the pancakes. All right, here's your burnt pancake.
We want to know the probability that the side we can't see is burnt, conditional on what we know, which is that the side facing up is burnt. Of course you could just peek, but this is a probability puzzle. The rules of probability tell us that a conditional probability is defined as the probability that both things are true, divided by the probability that the thing we condition on is true. That's what the expression on the right-hand side is: the probability of burnt up and burnt down — both sides burnt — divided by the probability of burnt up. This is Bayes' theorem, by the way, just in disguise; really it's just the definition of conditional probability. If you ever forget it, just Google "definition of conditional probability" and there'll be a nice Wikipedia page for you.
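For reference, that definition written out for the pancake events — this is just the standard definition, nothing new — is:

```latex
\Pr(\text{down burnt} \mid \text{up burnt})
  = \frac{\Pr(\text{up burnt},\ \text{down burnt})}{\Pr(\text{up burnt})}
```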
We have all the information needed to compute this. First we need the probability that the burnt side is up, and there are two ways this can happen. Let's think about it methodically: there are three pancakes, so you've got a BB pancake, a BU pancake, and a UU pancake. BB means burnt on both sides, BU means burnt on one side and unburnt on the other, UU means unburnt on both. Three pancakes. The probability that you would see a burnt side up, if the pancake were the burnt-burnt one, is one, because either side will show you burnt. That's why it's the probability of BB times 1, plus the probability of BU times 0.5 — because only one of the sides of that pancake is burnt, so if I'm randomly flopping it down on your plate, there's an even chance the up side is burnt. For the unburnt pancake there's no way; you already eliminated that one, and there your intuition was working. Each of these pancakes has a one-third chance of being served to you, so overall there's a one-half chance that you see a burnt side up on your pancake. Then the last thing, the numerator, is the probability of the burnt-up, burnt-down pancake, which is one third, because there are three pancakes and I served you one at random. So it's one third divided by one half, and the answer is two thirds. Weird, right?
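Plugging in the numbers from the slide, the whole calculation is:

```latex
\Pr(\text{up burnt}) = \tfrac{1}{3}(1) + \tfrac{1}{3}(0.5) + \tfrac{1}{3}(0) = \tfrac{1}{2},
\qquad
\Pr(\text{down burnt} \mid \text{up burnt}) = \frac{1/3}{1/2} = \frac{2}{3}.
```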
Let me give you some intuition about why, now that probability theory has given you the right answer. In the text there's a simulation of this — an individual-based pancake simulation; it's like agent-based modeling, but the agents are pancakes — to prove that it's 2/3, in case you didn't believe the calculation.
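The simulation in the text does this properly; here is just a minimal sketch of the same idea (my own code, not the book's), simulating many random servings and checking the frequency:

```r
# Minimal pancake simulation (a sketch, not the book's code).
# Rows are the three pancakes; 1 = burnt side, 0 = unburnt side.
set.seed(3)
pancakes <- matrix(c(1, 1,
                     1, 0,
                     0, 0), nrow = 3, byrow = TRUE)
n <- 1e5
pick <- sample(1:3, n, replace = TRUE)   # which pancake is served
side <- sample(1:2, n, replace = TRUE)   # which side lands up
up   <- pancakes[cbind(pick, side)]      # the side you see
down <- pancakes[cbind(pick, 3 - side)]  # the side you don't
mean(down[up == 1])                      # converges to about 2/3
```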
Let me try to give you some intuition, though, now that I've just told you that you don't need to be clever — and in fact I'm not clever at all; I just ruthlessly applied conditional probability and figured it out. But if you want some intuition at the end, which is a perfectly legitimate goal, here's a little bit. The mistake is focusing on pancakes; you want to focus on sides. Look at this picture: there's a side at the bottom, and we want to know what the other side is. If you look at the top, there are three burnt sides, and the side at the bottom could be any of those — so there are three possibilities. How many of those three possibilities have the other side also burnt? The answer is two, so the answer is two out of three. You've got to focus on sides; we're talking about sides, not pancakes. That's the mistake: pancakes are so delicious that you were thinking about the pancake and not the side. It's a natural mental bias; I forgive you.
All right, so the point of all this is that everything we've done in this course — everything you do in any statistics course, really — is underpinned by the idea of ruthlessly applying conditional probability to solve problems. There's some information we have, and there's some other thing we'd like to infer. If there's any evidence in the information we have about the thing we'd like to infer, you reveal it through conditioning, through the rules of probability. Another way to think about this: we express our information as constraints and distributions, and then we just let logic do the rest. That's what Bayesian inference is — just logic. So for the rest of the time today I want to take this approach and show how it produces automatic solutions to two very common problems which are historically very hard to solve in statistics, because people tried to be clever. If you avoid being clever and just ruthlessly apply conditioning, you can get useful solutions to two very big problems. These are measurement error, which is always present in data and usually ignored, and missing data, which is a special kind of measurement error — the data aren't there at all; it's the extreme version of measurement error. These are really common, so let's think about measurement error first. There is always some error in measurement, for sure.
When you do an ordinary linear regression, error is captured in the residual variance — that sigma thing. It's supposed to capture the fact that even if you had the true relationship, you wouldn't expect a perfect fit; that's what the sigma error variance on the end is for. That's fine if the only error is on the outcome variable. But what if there's also error on the predictor variables? And imagine, for example, that the error is not uniform across the cases or the variables — then you're in trouble. You could be in trouble, and there are a whole bunch of ad hoc procedures to try to deal with this, like reduced major axis regression and such, and none of them are terribly reliable. So let's think about an approach that avoids trying to be clever and just conditions on what we know. To get there, think back to the waffles — we had pancakes, now waffles — the waffle-divorce data set from early in the course. In that data set we had a bunch of states in the United States of America, and for each one we had measured the divorce rate, the median age at marriage, and the marriage rate.
If you go back and look at the data set, you'll see there were columns with standard errors on a couple of the variables. These values are taken from partial census data, from records; they're measured with some error, and that error has been quantified as the standard error on each of those values. We don't know the divorce rate in a particular state; what we have is an estimate of it, and we also have a number that tells us our confidence in it — the standard error. There's a lot of heterogeneity in this error in this data set. What I'm showing you on this graph is the divorce rate on the vertical axis, median age at marriage on the horizontal, and the line segments show the standard error of each state's divorce rate. Some of these are big, and the reason is that some of the states are small, so they produce relatively little evidence in any given census period.
Here's another way to look at these data. The left graph is the one from the previous slide: divorce rate against median age at marriage, with line segments for the standard error of each divorce rate. On the right, the divorce rate is again on the vertical axis, but now the horizontal axis is the log population of the state. So really big states are on the right — California is the one all the way on the right — and in California, in any particular year, you get so many of the events that you get a fantastic measure of the true divorce rate in the state, if you want to call it that, because there are just so many Californians. They're all over the place — I'm one, produced out there and found all over the world. On the left you've got smaller states. I won't single any of them out — well, later I will — but pick your favorite small state. There are states with very small populations, particularly west of the Mississippi, where there is far more livestock than people in many parts of the western United States, and in those states there are so few of these census events in any given period that your confidence in having estimated the long-run rate is low, and so the standard errors on those states will be big. Does this make sense?
Let's think about this in terms of a causal model. Measurement error can be put into a DAG — almost anything can be put into a DAG; not anything, but almost anything. We want to think of the observed divorce rate, which I've named D-observed, as a function of the true divorce rate, which we don't get to observe — that's D — and the population size of the state, which I call N. In the data set, the N part has already been summarized for us as a standard error, a reliability. But in general this is what's going on: your observed variable for divorce rate is a function of those two things, and that's what generates the measurement error. Does that make sense? So how can we approach this statistically? Let's not be clever; let's just be ruthless and apply conditional probability. We state what we know, and we see if it'll figure out what we don't know.
we don’t know so here’s the idea about
how the observed variables are generated
right so there’s some true divorce rate
which we haven’t observed and we’d like
to use that as our outcome variable not
this crummy measured thing right with
highly unreliable things for some state
so generatively thinking our observed
divorce rate is sampled from some normal
distribution all right central limit
theorem makes this almost true in this
case and the mean of this normal
distribution will be the true rate and
then there’s a standard deviation which
has been summarized in this proportional
inversely proportional to the population
size using the standard error
yeah saving you some work you can do
this yourself somebody did this right
there’s a formula for this this make
sense conceptually what’s going on it’s
like you’ve you sampled this process
every month divorces are happening in a
given state but in California you get
thousands and thousands in any given
month in Idaho mostly zero every month
correct yeah
and so but in the long run there’s some
D true that you’d observe in the long
run averaging over a really long data
set but in any finite period there’s
error and that error will be inversely
proportional to the population size it’s
a function of the population size and
that’s what the standard error is yeah
so the thing you observe comes out of
this generative process make sense and
Now let's think about the statistical model. This was our DAG before — to remind you, A is median age at marriage, M is the marriage rate, and D is the divorce rate — and we were evaluating the direct and indirect effects of age at marriage. This is the statistical model we had before, where the thing on the top is D-true: we acted as if we knew the divorce rate. But now we don't, so we need to add something to this model. What are we going to add? We're going to put a line on top of it, which is the observation process. You just add this line to the model, and now D-true is a vector of parameters, because we don't know it. We replace all those things which were previously the outcomes with a bunch of unknown parameters, and the observation line at the top lets us estimate them, because we have information with which to estimate them. What is that information? Well, we have the standard errors — that's great, but if that were all we had, it would not be enough. We also have the whole regression relationship, which is going to inform the plausible values for the different states. And you may be getting a tingling sensation in your skull which tells you that shrinkage is going to happen — because it's my class and there's always shrinkage. Let me pause here and ask: does this make sense? What's happening?
One way to think about this: if you were going to simulate measurement error, this is the model you would use. You just write down the same model, and then it runs backwards. Bayesian models are generative, and you can run them in both directions: run them forwards and you simulate fake data; run them in reverse and they spit out a posterior distribution. You feed in a distribution, they spit out data; you feed in data, they spit out a distribution. So if you were going to simulate measurement error, you'd do it with that top line: you'd have some true value, but what you record would be sampled from a distribution around it, so it would have some error on it.
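To make the forward direction concrete, here is a minimal sketch of that simulation — my own toy numbers, with an assumed form for how the standard error shrinks with population:

```r
# Forward simulation of the observation process: true rates plus
# population-dependent measurement error. Purely illustrative numbers.
set.seed(7)
N      <- 50
N_pop  <- exp(runif(N, log(5e5), log(4e7)))   # state population sizes
D_true <- rnorm(N, mean = 10, sd = 2)          # true long-run divorce rates
D_se   <- 15 / sqrt(N_pop / 1e3)               # assumed: error shrinks with population
D_obs  <- rnorm(N, mean = D_true, sd = D_se)   # what the census actually reports
plot(log(N_pop), abs(D_obs - D_true),
     xlab = "log population", ylab = "absolute measurement error")
```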
Because in a finite time period — imagine the census period was a day — a big state like California gets many more events, and so that day provides much more information about the true rate; the variance from day to day will be higher in a small state. It's like sample size: imagine you pick a village instead of Los Angeles and try to estimate the divorce rate — it's easier if you have all of Los Angeles. Does that make sense? It's just sample size, nothing but sample size. New York is bigger than Idaho; if the census periods are the same for both, the difference is a sample size effect. You never observe the rate — the rate is an unobservable, always — so the sample size constrains your precision about it. Sleep on it if you need to; it's just sample size, nothing but sample size. All right, how do we do this in a model?
Exactly as it was on the previous slide. The only trick is that you've got to define D-true as a vector of parameters as long as the whole data set — for every state there's a D-true, and that's what the vector[N] declaration is in this model. At the top we've got D-true, a vector of parameters; on the second line we define it as a vector of length N, and it gets the likelihood, but now there's a parameter on the left, not an observed variable. It works the same. Remember, the distinction between likelihoods and priors in a Bayesian model is cognitive — it's something about you. Probability theory doesn't care; it works the same on unobserved and observed variables. So when something in your data set becomes unobserved, the model doesn't change. This is a brain-bending thing for people, because you're taught that data and parameters are fundamentally different things — but they're not. They're just variables, and sometimes you observe them and sometimes you don't. This is the part where I said you don't have to be clever; you just have to be ruthless. The model exists before you know the sample — it's a representation of the generative process — and the fact that you haven't observed some of the data doesn't change the model. It just means that now you have parameters there, because you haven't observed those values yet; a parameter is just an unobserved variable. The fact that you could observe it makes it feel really different from something you can't observe, like a rate, but it isn't. And then the rest is just the old priors. Good.
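In code, the model on the slide looks roughly like this — a sketch in the style of the rethinking package; the object names are mine and the priors are the usual standardized-scale defaults, so check the book for the exact version:

```r
library(rethinking)
data(WaffleDivorce)
d <- WaffleDivorce
dlist <- list(
  D_obs = standardize(d$Divorce),
  D_sd  = d$Divorce.SE / sd(d$Divorce),   # standard errors, on the standardized scale
  A     = standardize(d$MedianAgeMarriage),
  M     = standardize(d$Marriage),
  N     = nrow(d)
)
m_err <- ulam(
  alist(
    D_obs ~ dnorm(D_true, D_sd),           # observation process
    vector[N]:D_true ~ dnorm(mu, sigma),   # the regression, on the unknown true rates
    mu <- a + bA * A + bM * M,
    a ~ dnorm(0, 0.2),
    bA ~ dnorm(0, 0.5),
    bM ~ dnorm(0, 0.5),
    sigma ~ dexp(1)
  ),
  data = dlist, chains = 4, cores = 4
)
precis(m_err)
```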
Some of you were nodding at something here like, "I don't like this, it's making me unhappy." So what happens in this regression? There's shrinkage, as I said — that tingling sensation in your skull. Now I'm plotting the relationship between median age at marriage and the divorce rate; you know there's a strong relationship in these data between the two. Both variables are standardized. The blue points are the observed divorce-rate values in the data set — the ones we previously ran the regression on, back in chapter five or whatever it was — and the open circles are the values from the posterior distribution: the posterior means for each of the D-trues that have been estimated. The line segments connect them for each state. With me? So what has happened here?
There's shrinkage, and some points have moved more than others. Since you're pros at shrinkage now, you can explain this pattern: why have some of these moved way more than others, and why have they moved where they've moved? They've moved toward the regression line, because that's the expectation. If a state's observed divorce rate is really far from the regression line, it will shrink more — but how much it shrinks is also a function of its standard error. So a really big state like California could sit far from the regression line — it doesn't here, it's actually pretty typical — because it has such a precisely measured divorce rate. But a small state like Idaho — look on the left of this slide, that's ID on the far left; Idaho has more sheep than people, mostly potatoes and mountains, a beautiful state — has a very imprecisely measured divorce rate, and it gets shrunk. It's still off the regression line by a good amount, but the model says: given the relationship in these variables, that measured rate is way too extreme to be believable, probably due in part to sampling error, and it shrinks it a lot. You get similar effects for North Dakota, down at the bottom.
Wyoming is an interesting one — that's the next one over, WY. It's not so far from the line, but it's so uncertain that it gets shrunk right onto the line, because Wyoming is another one of these states that's mostly sheep — another beautiful state. And so on: Maine has an extremely high divorce rate and it gets shrunk a lot too; Rhode Island is a small state; et cetera. You can compare this to the plot with the standard deviations, where we calculate the shrinkage. On the left side of that plot, what I've done is take the difference between the estimated and the observed — the mean of the posterior distribution of the D-trues minus the observed rate in the data set — and that's the vertical axis. On the horizontal axis is the standard deviation in your data set, the standard error of the measurement. A state with a difference of zero has no shrinkage — on the far left you've got California, for example — and as you move to the right, the standard deviation increases and you get more movement. That's what the shrinkage is: a bigger difference between the observed rate and what the model thinks is plausible. Makes sense?
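Continuing the earlier sketch, something like the following would reproduce this kind of shrinkage plot (again my own code, assuming the m_err sketch above):

```r
# Shrinkage: posterior mean of each state's true rate minus the observed rate,
# plotted against that state's standard error.
post      <- extract.samples(m_err)
D_true_mu <- apply(post$D_true, 2, mean)
plot(dlist$D_sd, D_true_mu - dlist$D_obs,
     xlab = "standard error of the divorce measurement",
     ylab = "posterior mean minus observed (std. units)")
abline(h = 0, lty = 2)
```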
Okay, that's error on the outcome, when it's not constant but varies across cases. You can also, of course, have error on predictor variables, and those go into the model the same way. Again, be ruthless, don't be clever: we've got a generative model, so imagine sampling an observed version of a predictor variable, now with error, and then insert that observation process into your model. That's what we're going to do here. On this slide we're now looking at the marriage rate in each state, which is also measured with error, for the same reason, and again plotted against log population on the horizontal: California on the far right, Idaho and Wyoming on the far left, with the line segments getting bigger because of the sample-size issue. Here's the model — I'll add some notation on the next slide, so don't panic. The top part is the same as you saw before.
The very first line is the observation process on the divorce rate. Then we've got the regression of the true divorce rate on age at marriage and marriage rate, which is M — but now, inside the regression, inside the linear model, we have M-true, not the observed M. What is M-true? It's a parameter; there's one for each state, and it goes into the linear model. So this last term is a parameter times a parameter — don't you love this course now? Every state is going to have one of these M-trues. We don't know it, but it goes into the model exactly the same way, because it's the same generative process: the fact that you haven't observed it doesn't change the model. And then we've got what you might call the likelihood for the observed marriage rate: the M-observed for each state comes from a sampling process again — normal, with M-true as the mean and the standard error, which is also in the data set, as the standard deviation.
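A sketch of that model in code, extending the earlier one — again my labels, and the book's version may differ in details:

```r
dlist2 <- list(
  D_obs = standardize(d$Divorce),
  D_sd  = d$Divorce.SE / sd(d$Divorce),
  M_obs = standardize(d$Marriage),
  M_sd  = d$Marriage.SE / sd(d$Marriage),
  A     = standardize(d$MedianAgeMarriage),
  N     = nrow(d)
)
m_err2 <- ulam(
  alist(
    D_obs ~ dnorm(D_true, D_sd),             # observation process for divorce
    vector[N]:D_true ~ dnorm(mu, sigma),
    mu <- a + bA * A + bM * M_true[i],       # a parameter times a parameter
    M_obs ~ dnorm(M_true, M_sd),             # observation process for marriage rate
    vector[N]:M_true ~ dnorm(0, 1),          # prior for the true marriage rates
    a ~ dnorm(0, 0.2),
    bA ~ dnorm(0, 0.5),
    bM ~ dnorm(0, 0.5),
    sigma ~ dexp(1)
  ),
  data = dlist2, chains = 4, cores = 4
)
```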
Again, you've got to go home and draw the owl, of course, but this is really all there is to it: you think generatively about the thing and then write down the statistical model, which you can run in both directions. One last thing before I show you how this model behaves: you've got to put in a prior for these M-trues, and you can do lots of interesting things here. In this example, in the text, I set it to normal(0, 1), because it's a standardized variable; that's not terrible. What happens as a consequence, though, is that you're ignoring information in the data, because if I tell you the age at marriage in a state, you get information about its marriage rate. Look at the DAG on this slide: if you believe this DAG — and you don't have to, but there's lots of evidence it's a reasonable DAG for the system — age at marriage influences marriage rate, and so we could have a better prior for the marriage rate if we used all of the regression information and put the whole DAG into the model. I'll show an example of that later today: when we put the whole DAG into the statistical model, all the variables have relationships, and if we do it all at once there's even more information to help us pin down and get better estimates of the true values. But the next graph is going to be a bit awful, okay?
Okay, I know there's a lot going on in this graph; it's not the most beautiful graph of the course. What's happening here: we've now got two variables which are observed with error, and I'm plotting them against one another. We've got divorce rate on the vertical, standardized, and marriage rate on the horizontal, also standardized. Blue points are the observed values in the data set — the combination of observed marriage rate and observed divorce rate — and the open points, connected by line segments, are the corresponding pairs of posterior means for the estimated true rates in both cases. So you've got shrinkage in two directions now, both toward some regression relationship; I haven't drawn it, but you can kind of see it there — it's like a constellation, like the Milky Way — and some of these shrink a lot more than others. The first thing you'll notice is that if you're really far from the regression line, you shrink more; you expected that, and you see it right away. But there's a more subtle thing going on, which is that there's more shrinkage for the divorce rate than for the marriage rate. Look at a case like the upper left: there's some state up there — I should have labeled this, I know.
Let's guess it's Wyoming or something like that — actually it's probably Maine. It's extreme in both: it has an extremely low marriage rate — this is almost certainly Maine — and an extremely high divorce rate, and it comes down really far, but it comes down a lot more on divorce rate than it does on marriage rate. Why? You can see in the other cases that this is also true for the most part — all those points at the top, similarly. Why does that happen? The answer is that the regression says marriage rate is not strongly related to divorce rate, so the shrinkage isn't as strong for marriage rate: there's not as much information in the regression to move it. The model doesn't know exactly where to move it, so it isn't attracted to the regression relationship as strongly — because the real causal effect in this model is age at marriage. Remember, the association between these two variables arises through that other, back-door path. But that means you don't get as much shrinkage on this predictor. Does that make sense? I thought this was a cool effect when I made up this example. Anyway, go home and draw the owl; you will have fun with this, I am absolutely sure.
let me try to put some context around
this before we move on to the next topic
so measurement error comes in many
disguises and the version I just gave
you is the simplest where there’s a
variable in your data set that’s called
the error right sometimes you’re lucky
and you get that and some expert has
told you here’s the measurement error in
this variable then you can proceed like
this but there are many many more subtle
forms as well one of the things that I
see a lot and I think it’s a waste of
information is people will pre average
some variable and then enter that as a
predictor in a regression that removes
the fact that you have a finite sample
in which to estimate that mean from yeah
so people will do this all the time like
you’ve got some sample from a state and
then you just create state averages and
then you put them in the data set right
that takes variation out of the data set
and
If you're doing that consciously as a way to get your p-values to be smaller, I'd call it cheating, but I think usually people aren't cheating — they're just doing what they were taught to do. But we don't have to do that. What else could you do? You could just use a multilevel model. Then the means are varying effects — they're parameters — and you don't have to do any averaging: do the averaging in the model, not outside the model, and then all of the uncertainty that comes from different sample sizes is taken care of.
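As a small illustration of that point, here's a toy sketch — entirely made-up data and names — where the group means are varying effects rather than pre-computed averages:

```r
# Toy example: individuals nested in groups; instead of pre-averaging y by group,
# let the group means be parameters with a pooled prior.
library(rethinking)
set.seed(1)
n_groups  <- 20
group_n   <- sample(2:30, n_groups, replace = TRUE)   # very unequal sample sizes
g         <- rep(1:n_groups, times = group_n)
true_mean <- rnorm(n_groups, 0, 1)
dat <- list(y = rnorm(length(g), true_mean[g], 1), g = g)
m_groups <- ulam(
  alist(
    y ~ dnorm(mu, sigma),
    mu <- a[g],
    a[g] ~ dnorm(a_bar, tau),   # group means are varying effects, not pre-averages
    a_bar ~ dnorm(0, 1),
    tau ~ dexp(1),
    sigma ~ dexp(1)
  ),
  data = dat, chains = 4, cores = 4
)
```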
Parentage analysis is a fun case. It's done in people, but it's also done in other interesting animals: you've got some population of, say, wild rodents, and you're trying to figure out who sired whom — who is whose parent — and you can't interview them, unlike people. So you get their genotypes and you try to figure out how they're related, and there's uncertainty: you can do exclusion, but then there will be some number of individuals who could be the parent, so there will be different probabilities. The same sort of thing happens in phylogenetics.
When I did the phylogenetics example last week, I used a single tree, and there's a strong fiction there: we don't know the history of primates, and phylogenies are rarely very certain. On the right-hand side of this slide you can see this trend now of plotting phylogenies like this — I'm a really big fan of it; I see it in papers all the time now and I like it a lot — where you show the whole posterior distribution of trees as a big fuzzy graph. In some cases there's massive uncertainty in some parts of the tree, and you want to keep that in mind. You can do the analysis over the whole distribution of trees: imagine you just feed it into Bayes like this, and now you've got a whole distribution you need to average over. It works like lots of other places where we have distributions in Bayes.
In archaeology, paleontology, and forensics, measurement error is the norm — absolutely standard; your data are degraded in some way. Think about radiocarbon dating: you don't know the radiocarbon date exactly; you've got a few hundred years of error, sometimes more. The archaeologists in the audience know the pain of this, and people take it very seriously now — at least most people take it very, very seriously. I had sexing on the slide here: I had a colleague back in California who was trying to sex fossils. This is no joke — it's very, very difficult to sex a primate fossil correctly, and you can never be sure, but you can absolutely assign probabilities. Determining ages is another issue. In my department we work in places where people don't keep track of birthdays, so you can ask people for their age and they'll give you a number, but you usually don't want to use it directly. Instead you use biological facts, like average birth spacing, and things like that can reduce the error on those estimates. Okay, let's shift to a closely related topic.
The grown-up version of measurement error is missing data. Lots of things about it are mechanically similar, but it feels really different, because often it feels like the data are just missing and there's nothing you can do. You typically do want to do something about missing data, and I want to teach you today why. This is really common, and you're used to it: most of the standard regression tools in software packages will automatically remove cases with missing values without saying anything to you. If you run lm or glm in R and it finds a missing value in any of the variables you include in the regression, it deletes that whole case, so all of the variables for that case are removed from the data set. This squanders information — that's the first thing to worry about — but it can also create confounds. Missingness can create confounds; complete-case deletion is not harmless. That's what I want to impress on you. There are ways to deal with this — sometimes; there are no guarantees. So how do people deal with it?
There are lots of different approaches. The worst approach — and you should absolutely never do this, as I say on the slide — is to replace the missing values with the mean of that column. This is tragic; it is a really, really bad idea. Why? Because the model will interpret it as if you knew that value. You don't — you want something with error there. If you just plug in the mean, you cognitively know that you don't know that value, but the model doesn't: the model thinks it is the mean, and then bad things happen, really bad things. So don't do this. I haven't seen it in a very long time, which is nice; the word has gotten out that you should absolutely never do it. What else could you do?
There's a procedure called multiple imputation, which works — it's one of these things which shouldn't work but works really, really well. It's a frequentist technique for imitating what we're going to do today; in fact it was invented by a Bayesian, Don Rubin, back when desktop computing power couldn't do these things, and so he created a frequentist technique. It works unreasonably well; it's really effective. Basically, you fit the model multiple times on different samples from some stochastic model of the missing variable — that's what multiple imputation is — and it turns out you don't need very many multiples to get a really good estimate of the uncertainty. We're just going to go full-flavor Bayesian imputation here and put in the probability statement about how this works. Okay — "impute": what does it mean?
To impute means to assert some feature of a thing. In law it's usually a crime that you're imputing to somebody, but here there's no valence like that implied. Before we get to the mechanics of how to run a model that does imputation — that tries to guess the value of the missing variable — let's talk about DAGs again and try to get missingness into the DAG. This is a literature that is deeply confusing, because the terminology is really awful — just about the worst terminology I've ever come across in any region of statistics, and I will convince you of that on the next slide, I'm confident.
that on the next slide I’m confident so
let’s think about the primate milk data
again as an example this was a small
data set we’ve got a number of primate
species we’re interested in
understanding why the energy content of
milk varies so much across different
species and we’re focusing on body mass
and proportion of the brain that his
neocortex so here in this dead I have an
M as body mass B for brain is the
proportion of the brain this neocortex
why are we focused on that because
humans have a lot of neocortex and we
focus on things that we have a lot of
right it’s just this kind of narcissism
of our species right the whole field of
anthropology is an exercise in
narcissism and K is the milk energy
kilocalories of milk use some unobserved
thing that is generating a positive
correlation between body mass and the
proportion of the brain is neocortex and
there is a strong positive correlation
across species in these two things but
we don’t know the mechanism so I’m just
putting a you there to say I don’t know
something it’s generating it right it’s
not directly causal from them to be
because it’s not just a llama tree right
so brain sizing we’re just talking brain
size that would be maybe it was just an
arrow from him to be but that’s not
what’s going on here right you can have
brains the same size but some of them
will have more gray stuff right than
others do so there’s something was going
on here we just don’t know what it is
Now let's talk about the three types of missingness. All types of missingness can be usefully classified in this taxonomy, because it tells us what we need to do — that's why the taxonomy exists — and this is where the totally confusing vocabulary comes in. You've got to know this vocabulary, or at least recognize it so you can come back and check the notes to figure out what's going on, because it is horrible. On the left we've got something called missing completely at random; I'm going to walk through that starting on the next slide and explain what it means. It's abbreviated MCAR, and I'll explain its DAG to you. In the middle, the second type is called missing at random. How is that different from missing completely at random? In a sense it sounds the same to me, but it's totally different in terms of its consequences, and I'll explain that as well after I explain MCAR. And then there's MNAR, sometimes written NMAR instead — because, you know, in English you can put a "not" anywhere and it has some random effect on the meaning of the sentence. That one is missing not at random. I don't know about you, but as a native speaker of English, when I hear "missing completely at random," "missing at random," "missing not at random," I don't think of the processes that I'm about to explain to you at all. This is a tragedy — another example of the law that statisticians should not be allowed to make terminology. But I think you can understand what's going on here, and these three types are incredibly important, because they tell you what you need to do to make an unconfounded inference. So let's start with the first one.
Here's our milk-energy DAG. The triangle at the bottom is the basic DAG, and now we're not going to get to see B, the true proportion of neocortex, anymore, because it has missing values in it — in this data set about half the values are missing; I think there are 12 missing values. Lots of primates that no one has ever measured this for: maybe they measured brain mass, but they didn't measure the percent neocortex. So instead we've got this variable B-observed, which has deletions in it — gaps, missing values. We know it's partly caused by B, which is why there's an arrow coming up from B, but it's also caused by this missingness mechanism, whatever it is, that places missing values on particular species. We want to name this thing, and in this literature these missingness processes are given the letter R — for "random," I guess — with a subscript saying which variable they affect. So R-sub-B is the missingness mechanism that creates missing values in B-observed. Think about what DAGs always say: for any variable, the arrows entering it are the arguments to some function which generates that variable. So this says B-observed is a function of B and the missingness mechanism. That's all it says. Makes sense?
Now, since you're all experts at graph analysis, I ask you: are there any back doors from B-observed to K? Why is that the question? Because we're going to condition on B-observed — we can't condition on B, because we don't have it — but the graph stays the same, and now K is our outcome and B-observed is our predictor. Are there back doors from B-observed to K? I'll give you a moment. Any back doors? Remember what a back door is — does everybody remember? No, that's not a back door; that's a front door. That's the distinction, and there are two things to think about here. There's no back door — the answer is no. It's only a back door when the arrow enters through the back. There's no back door, but there are two paths from B-observed to K: a direct effect and an indirect effect. The total causal effect of B-observed on K can be estimated by a simple regression with just B-observed on the right-hand side — there's an indirect path through M, exactly — but there's no back door. Make sense? You can, in fact, condition on M here and get the direct effect of B, but there is no back-door path that takes you through R-sub-B. You can see it's just stuck on the end of the graph.
This means that the missingness mechanism is ignorable: it's not a confound, and you can analyze B-observed just like any other variable. There is something influencing the thing you're going to condition on, but it doesn't create any back-door confound, so you can ignore it — remember the rules: you don't have to condition on it, and that's true here. You don't need to know the missingness mechanism, which is nice, because usually we can't discover it. So this is the benign case, called missing completely at random — remember the "completely": MCAR. In MCAR, K is unconditionally independent of the missingness mechanism: you don't have to condition on anything in this graph to keep your inference about K independent of the missingness mechanism. You're safe; the mechanism is ignorable. This is the benign case, and it's what everybody hopes for when they find missing values — people pretend, "oh, it's MCAR." But let's pause for a second and think about this.
Yes, you can — complete-case analysis will not create a confound when this is true. This assumption is what licenses you to drop all those cases, and it's going to turn out this is the only situation where that's okay. However, if you do the imputation, which we will, you can do even better, because you get more power: you'll get a more precise estimate of the causal effect. So you still don't want to drop the cases. You could drop them — it's not a sin — but you could do even better if you didn't drop them and imputed instead. That's the MCAR case: in MCAR you don't have to impute, but you should — right, exactly — otherwise you lose power. You with me? Before we leave MCAR, I want us to think about whether this ever actually happens. What would this graph mean?
The only way you can get MCAR is if, say, your research assistant used a random number generator to delete values in the spreadsheet. What could do this? Maybe there are cases where the missingness could truly be unrelated to every other variable in the graph — that's what this means: the missingness mechanism is not influenced by anything else we know, or anything else we might need to know, even an unobserved variable. It would be like your research assistant coming in and just randomly deleting some values. It could happen; it has probably happened somewhere; but probably not in your data. This is the monkeys-on-typewriters sort of missingness mechanism. I assert that it is highly implausible in most real research situations, I'm sorry to say. If you can come up with a convincing example for your data set, congratulations, but it's really hard to think of one. So what else is probably going on?
Well, here's one proposal: M is influencing the missingness mechanism. Perhaps this gives us the situation called missing at random — I know, we had missing completely at random, and now it's merely missing at random, or something like that. I should say, the person who came up with these terms is Don Rubin, who is an absolute genius; it's just not great terminology. He was the first person to analyze these cases and work out the conditional probability requirements for each; it's a super achievement. But the terminology is not helping to spread the gospel — and I don't mean to make fun of the topic.
So M is now entering R-B, influencing it. What does this mean? The missingness mechanism depends upon the body mass values: species that have particularly large or small body masses are more likely to have missing neocortex values. How might that happen? Well, anthropologists have different research interests, and they find different kinds of species attractive to study and measure. Maybe small ones are really hard to measure neocortex for, or we're just not as interested in them — there's a bunch of callitrichids out there, little and furry and cute, but there's not nearly as much effort on measuring them as on chimpanzees; there are whole armies of people in this building studying chimpanzees. That makes sense, but it generates a pattern where some features of a species predict missing values — features that are causally associated with the missingness in other values. Does that make sense? I think this is extremely common, and in this case we get the missing-at-random situation. Now I ask you the same question as before: is there a back-door path from B-observed to K?
I won't go through the Socratic routine of waiting for you to say something: yes, there is now, because there's an arrow entering R-B from the back. You've got a complete path all the way from B-observed, through R-B, back through M, to K. How can you close that back door? You condition on M — as it says at the bottom of this slide; sorry, I shouldn't have put that there. You condition on M, and that shuts the path. But still, you should look at it. I know it's rusty, but remember the path-closing procedures: if we condition on M, we block that fork — M forms the fork, and you close a fork by conditioning on the middle of it. So we close the fork by conditioning on M, and then again the missingness mechanism is ignorable: we don't have to know it, but we do have to condition on M, and we do have to impute, otherwise there will be a bias in the estimate.
But you don't have to know R, which is nice, because R is usually not knowable in any detailed sense. You don't have to know the missingness mechanism — I've got a summary on the next slide — but you do need to do imputation, otherwise you're going to have a confounded, biased estimate of the causal effect. So what is MAR, missing at random? It is any case in which K is conditionally independent of the missingness mechanism: there is some variable, or set of variables, in the graph that we can condition on to separate the two — to d-separate them, if you remember that term from way back, chapter six or so. Missing at random is a nice situation to be in, and it's probably the most common: there is something else in the system that is associated with, and causing, the missingness, and if we can condition on that and do imputation, we have hope of getting a good causal inference out of it. I'll show you how to do this. Why do you need to impute? Because the other variables are associated with the missingness pattern, and ignoring that can create really strong biases in the inference.
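If you want to check this back-door logic mechanically rather than by eye, something like the following dagitty sketch would do it — the DAG encoding and names here are my own reading of the slide:

```r
library(dagitty)
# MAR version of the milk DAG: M influences the missingness mechanism R_B.
dag_mar <- dagitty('dag {
  U [latent]
  R_B [latent]
  U -> M
  U -> B
  M -> K
  B -> K
  B -> B_obs
  R_B -> B_obs
  M -> R_B
}')
# List every path from B_obs to K and whether it is open;
# the ones running back through R_B <- M are open back doors...
paths(dag_mar, from = "B_obs", to = "K")
# ...and conditioning on M closes them.
paths(dag_mar, from = "B_obs", to = "K", Z = "M")
```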
The final case, which is the worst case to be in, is called missing not at random. I know — the last one wasn't "at random" either, but this one's really not at random. There are a couple of ways to get this; let me show you the most obvious: the variable itself causes the missingness, so particular values of neocortex percent are more likely to go missing than others. How could this happen? I can't think of an example where it would be true in this particular case — I've got another mechanism that could do it in this data set in a moment — but maybe species with low neocortex are guessed to be low from background information, and so you don't measure them, and so you don't have any precise estimate for those species. In that case it would be the actual value that's doing it. This is nasty, because you get a back door that you can't close: the missingness mechanism is not ignorable in this case. There's nothing you can condition on that will shut it, because you don't know the missingness mechanism. If you did know the missingness mechanism, you could shut that back-door path, and that's what's required: your only hope in this case is to model the missingness mechanism and thereby condition on it. If you've got enough scientific information about how the missingness works, you can do that — if you're lucky and in the right case — but there's no guarantee.
The other way you can get this effect isn't an arrow from B to R-B directly; you could have a latent variable that does it. Good times. So on the right I've drawn another version of missing not at random: there's another unobserved variable — a U again — and it's a fork which influences both neocortex percent and the missingness. What could this be? It could be something like phylogeny. Imagine that humans, since we're narcissistic, like to study things that are closely related to us, and things that are closely related to us have brains with a lot of neocortex. In that case, phylogenetic proximity to humans influences the neocortex percent, and it also influences the missingness. Good times — this happens all the time in primatology. Okay, so there's my summary of missing not at random: this is a case where K is dependent on R-B and there's nothing you can condition on, except the missingness mechanism itself, that will shut that back door. And this happens — I can't say how commonly — but when you find yourself in this situation, your hope is to model the missingness mechanism, which can be done. Okay, I've got ten minutes to go.
Let me do this slide, because it's about the concept, and then there's a bunch of mechanical slides to come that I'll go through quickly, because it's all in the text and it's just how to run the model. So here's my other attempt at teaching this — I'm trying to develop a way to teach this stuff — by relabeling these cases. Let's think about dogs eating homework. In many parts of the English-speaking world, this is the standard excuse people give for why their homework isn't done: "my dog ate it." It's a standard joke — my dog ate my homework — but sometimes a dog really does eat your homework. It happened to me when I was a kid; my dog ate my homework, it's true. You can imagine me, a straight-A student, going up to the teacher: "my dog ate my homework," and she's like, "really — you, Richard? I thought better of you." That's why cats are better: they don't eat your homework.
So imagine a DAG — not dog, DAG — with four variables in it, labeled H, H*, A, and D. H is your homework — that is, the score it's worth, the quality of your homework as a quantitative variable. H* is the version with missing values: a bunch of students are turning in their homework, and some of the assignments are missing. A is some attribute of the student which causally influences the quality of their homework — attention span, working memory, Adderall, whatever. And D is our dog, the missingness mechanism; it was R on the previous graphs, and now it's a dog. On the left we've got missing completely at random, which I've relabeled "dog eats any homework," and this is the DAG for it: the attribute influences the homework, the true state of the homework influences H*, the version with missing values, and the dog influences H*, but nothing influences the dog. The dog eats any homework; it's not selective. That's missing completely at random.
You with me? Okay — I'm working on this; it does need some development, I know. In the middle we've got missing at random: the dog eats particular students' homework. The dog cares about the student — the attribute of the student — so the dogs of students who have particular values of this attribute are more likely to eat their homework. I don't have a great mechanism here, I admit; the attribute could be something like attention span, so that if you don't pay close attention to your homework and turn away, the dog eats it — something like that. It's an attribute of the student, not of the homework. Of course, getting your homework eaten is now correlated with the score of the homework, but not because of the score; it's because of the attribute of the student who was working on it. That's missing at random, and that's the case I said is really common in science — incredibly common.
Okay, finally, the worst case: the dog only eats bad homework. The dog sniffs the homework, assesses its score, and eats it — or, more likely, the homework is bad and the student feeds it to the dog; that's another way this could happen. Either way, it depends on the score of the homework, and so that's missing not at random: the dog eats bad homework. Now in the DAG we've got an arrow directly from the true H to the missingness mechanism D. Does this help? I'm working on this; I think I'll add it to the book. It might be a little too weird, but that's never stopped me before.
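Here's a toy simulation of the three dogs — my own sketch, not from the slides — just to see which complete-case analyses go wrong:

```r
# Dog-eats-homework simulation: A = student attribute, H = homework score.
library(rethinking)
set.seed(9)
N <- 1000
A <- rnorm(N)
H <- rnorm(N, 0.8 * A)                    # A causes homework quality
D_mcar <- rbern(N, 0.5)                   # dog eats any homework
D_mar  <- rbern(N, inv_logit(2 * A))      # dog eats based on the student's attribute
D_mnar <- rbern(N, inv_logit(-2 * H))     # dog eats bad homework
slope <- function(keep) coef(lm(H[keep] ~ A[keep]))[2]
round(c(full = slope(rep(TRUE, N)),
        mcar = slope(D_mcar == 0),
        mar  = slope(D_mar == 0),
        mnar = slope(D_mnar == 0)), 2)
# Complete cases under MCAR, and under MAR with A in the model, recover roughly 0.8;
# under MNAR, where missingness depends on H itself, the estimate is biased.
```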
Okay, so let me show you a little bit about the mechanics of this. This is all in the book — the code to do it — so I'll necessarily move a little faster; this is just to give you the first step of drawing the owl, the conceptual bit. The key insight, again, is that we think about the generative process and we write down the same model. Every missing value just gets a parameter, because we haven't observed it, so it becomes a parameter; the model stays the same; we run the model. That's it, basically — well, there's all the drawing-the-owl part in between, which has to do with the algorithm, but let me give you the intuition. There are 12 missing values for neocortex in this data set. On the right I'm showing you the whole data set; the last column on the far right is the neocortex percent, and each of those NAs is a missing value. We're going to assume missing at random: that M, the body mass, is influencing the missingness. Conceptually, what we're going to do is replace each of the NAs in this column with a parameter, and then we'll get posterior distributions for each of the missing values. The information in those will also flow into the regression, so you'll get different slopes out of this too. Let me show you how this works.
The idea is that each of these gaps now gets assigned a parameter, because it's unobserved, and unobserved variables are, by definition, parameters — that's what they are in Bayes — and these things will be imputed by the model. Here's what the model looks like. B is now a vector in which some positions are observed values and some are parameters, all mixed together, and we stick that mixed vector into the linear model of an ordinary regression. The only additional thing we have to do, shown in blue here, is to give the B values some prior. You can think of this as the model of B, and sometimes your DAG will tell you what it should be — what is B caused by? Your DAG gives you some information about that. When B is observed, this line informs the parameters inside it: nu, the Greek letter — that thing that looks like a little v — is the mean of the B values, the neocortex values; this is a standardized variable, so it'll be very near zero; and sigma-sub-B is the standard deviation. Those get estimated from the observed values. When a value is not observed, the same line acts as a prior for that value; it keeps it from being just any old thing. Mind-blowing — I can see at least one mind being blown; excellent — but it's the same model as before.
Here's my annotation saying the same thing: B is a mix, and when B is observed this line acts like a likelihood, but when it's not observed it acts like a prior. In code it looks exactly the same, except that we add this prior line. What ulam is going to do is detect the NAs and construct that mixed vector for you — it tries to help you; it automates this.
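Here's roughly what that looks like in code — a sketch following the model just described; the exact book code may differ slightly in names and priors:

```r
library(rethinking)
data(milk)
d <- milk
dat <- list(
  K = standardize(d$kcal.per.g),
  B = standardize(d$neocortex.perc),   # contains 12 NAs; ulam will impute them
  M = standardize(log(d$mass))
)
m_imp <- ulam(
  alist(
    K ~ dnorm(mu, sigma),
    mu <- a + bB * B + bM * M,
    B ~ dnorm(nu, sigma_B),   # likelihood for observed B, prior for the missing values
    nu ~ dnorm(0, 0.5),
    sigma_B ~ dexp(1),
    a ~ dnorm(0, 0.5),
    bB ~ dnorm(0, 0.5),
    bM ~ dnorm(0, 0.5),
    sigma ~ dexp(1)
  ),
  data = dat, chains = 4, cores = 4
)
precis(m_imp, depth = 2)   # includes the 12 imputed B parameters
```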
What you get out in the posterior distribution is a parameter for every missing value — you see 12 B-impute parameters here, and each of them is an imputed neocortex value. The question on all your minds, I'm sure, is: what does this do to the slopes? We've added 12 cases to the data, which is nice — it almost doubles our sample size — so let's compare with the same model run on complete cases only. You don't have to change the model code at all; you just delete the cases that have missing values and rerun it. The model stays exactly the same, and now we can compare the slopes. Remember, this is one of those masking-effect cases: we've got two predictors that are positively associated with one another, but one is negatively associated with the outcome and the other positively. Model 15.3 is our new imputation model that uses the full sample, and 15.4 is the old one from back in the previous chapter.
Notice what has happened: the estimates have gotten more precise, and they've also shrunk a little bit toward zero, so the previous model was probably overestimating the influence of each of these predictors. We've gained precision by adding twelve parameters for things we don't know — we've got more precise estimates of the slopes. This is what you expect in a missing-at-random, dog-eats-particular-students'-homework situation: you de-confound and you get extra precision by doing the imputation. Let me show you what happens as a pattern in the data.
We can plot these imputed values mixed in with all the observed values, but they're going to have standard errors on them, and that's what I'm showing you here. We don't know exactly what these values are — the posterior distributions are pretty wide — but despite that, they help us understand the slopes better. On this graph we've got neocortex percent on the horizontal; each open circle is an imputed value, and the blue circles are observed values. The imputed posterior means follow the regression trend, because the regression informs them: if you've got some species with a big body mass, that tells you something about its neocortex percent, because those two variables are strongly associated in all of the observed cases. The model automatically accounts for this; you don't have to be clever. Cool, right? I don't like being clever; it's very hard.
The disappointing thing about this model is that the relationship between the imputed values and the other predictor is zero, which is wrong. If you look at this graph, you'll see a regression trend for the blue points, the observed points: there's a strong positive correlation between log body mass and neocortex percent. But the imputed values don't follow it at all, and that's because we didn't tell the model that these two things are associated. In the text I show you how to fix this — just very quickly, because I know I'm out of time: we do it by saying that M and B come from a multivariate normal, and we model their correlation. We did this before, with the instrumental-variable example; it's the same trick, the same kind of code, and you can do it even though B has missing values inside it — it's still a mixed vector of parameters and observed values. There's code showing you how to do this; you just have to construct that mixed vector manually, and there's some code to do that. That's the drawing-the-owl part. And at the end, happy days: they're associated, and you get even more precision in the estimates of these things.
Okay, missing data is a really big topic, and there are lots of things which are kind of like missing data but don't feel exactly the same. One of the areas I think is most important is a family of models in ecology: occupancy models and mark-recapture methods. These are really missing-data problems in a sense — or kind of like measurement error. There's a true occupancy — is the species there? — but you can't observe it, and the zeroes in your detection data are not trustworthy. So there's a special structure: you do imputation of the true states — there's this latent thing, whether the species is really there, and you need to impute it. That's how these models work: they're like missing-data models, but they have special structure which comes from the detection process that you model.
Okay, I realize I'm out of time. I'm going to put up a final homework later this afternoon, after I do it myself — I think it's good, but we'll see once I've tried it — in which you will do some imputation practice with some primates. It's due in a week, even though the course ends today; please turn it in within a week for a full sense of satisfaction, and if you want a certificate of completion, I'll be happy to give you one as well. So, with that, thank you for your indulgence over the last ten weeks. We've come a long way from the Golem of Prague, and as you go home and deploy your golems, I just want you to remain humble in their presence. I hope you've learned something valuable. Thank you.