On a scale of 0 to 10, how much does the average citizen of the Republic of
Elbonia trust the president?
You’re conducting a survey to find out, and you’ve calculated that in order
to get the precision you want, you’re going to need a sample of 100
statistically independent individuals. Now you have to decide how to do
this.
You could stand in the central square of the capital city and survey the
next 100 people who walk by. But these opinions won’t be independent:
probably politics in the capital isn’t representative of politics in
Elbonia as a whole.
So you consider travelling to 100 different locations in the country and
asking one Elbonian at each. But apart from anything else, this is far
too expensive for you to do.
Maybe a compromise would be OK. You could go to 10 locations and ask… 20
people at each? 30? How many would you need in order to match the
precision of 100 independent individuals — to have an “effective
sample size” of 100?
The answer turns out to be closely connected to a quantity I’ve written
about many times before: magnitude.
Let me explain…
The general situation is that we have a large population of individuals (in
this case, Elbonians), and with each there is associated a real number
(in this case, their level of trust in the president). So we have a probability
distribution, and we’re interested in discovering some statistic $\theta$
(in this case, the mean, but it might instead be the median
or the variance or the 90th percentile). We do this by taking some sample
of $n$ individuals, and then doing something with the sampled data to
produce an estimate of $\theta$.
The “something” we do with the sampled data is called an estimator.
So, an estimator is a real-valued function on the set of possible sample
data. For instance, if you’re trying to estimate the mean of the
population, and we denote the sample data by $Y_1, \ldots, Y_n$, then the
obvious estimator for the population mean would be just the sample mean,
$$\frac{1}{n} Y_1 + \cdots + \frac{1}{n} Y_n.$$
But it’s important to realize that the best estimator for a given statistic
of the population (such as the mean) needn’t be that same statistic applied
to the sample. For example, suppose we wish to know the mean mass of
men from Mali. Unfortunately, we’ve only weighed three men from Mali, and
two of them are brothers. You could use
$$\frac{1}{3} Y_1 + \frac{1}{3} Y_2 + \frac{1}{3} Y_3$$
as your estimator, but since body mass is somewhat genetic, that would give
undue importance to one particular family. At the opposite extreme, you
could use
$$\frac{1}{2} Y_1 + \frac{1}{4} Y_2 + \frac{1}{4} Y_3$$
(where $Y_1$ is the mass of the non-brother). But that would be going too
far, as it gives the non-brother as much importance as the two brothers put
together. Probably the best answer is somewhere in between. Exactly
where in between depends on the correlation between masses of brothers,
which is a quantity we might reasonably estimate from data gathered elsewhere
in the world.
(There’s a deliberate echo here of something I wrote previously: in what
proportions should we sow poppies, Polish wheat and Persian wheat in order
to maximize
biological diversity? The similarity is no coincidence.)
There are several qualities we might seek in an estimator. I’ll focus on
two.
- High precision. The precision of an estimator is the
reciprocal of its variance. To make sense of this, you have to realize
that estimators are random variables too! An estimator with high
precision, or low variance, is not much changed by the effects of
randomness. It will give more or less the same answer if you run it
multiple times.
For instance, suppose we’ve decided to do the Elbonian survey by asking
30 people in each of the 5 biggest cities and 20 people from each of 3
chosen villages, then taking some specific weighted mean of the resulting
data. If that’s a high-precision estimator, it will give more or
less the same final answer no matter which specific Elbonians happen to
have been stopped by the pollsters.
- Unbiased. An estimator of some statistic is unbiased if its expected value is
equal to that statistic for the population.
For example, suppose we’re trying to estimate the variance of some
distribution. If our sample consists of a measly two individuals, then the
variance of the sample is likely to be much less than the variance of the
population. After all, with only two individuals observed, we’ve barely
begun to glimpse the full variation of the population as a whole. It can
actually be shown that with a sample size of two, the expected value of the
sample variance (the mean squared deviation from the sample mean) is half
the population variance. So the sample variance is a biased
estimator of the population variance, but twice the sample variance is an
unbiased estimator.
(Being unbiased is perhaps a less crucial property of an estimator than
it might at first appear. Suppose the boss of a chain of pizza takeaways
wants to know the average size of pizzas ordered. “Size” could be measured
by diameter — what you order by — or area — what you eat.
But since the relationship between diameter and area is quadratic rather
than linear, an unbiased estimator of one will be a biased estimator of the
other.)
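The factor of one half can be checked exactly rather than taken on trust. The sketch below is only an illustration: the population (trust scores 0 to 10, all equally likely) and the function names are assumptions, and the sample variance is the divisor-$n$ version. It averages that sample variance over every ordered sample of size two drawn with replacement.

```python
from fractions import Fraction
from itertools import product

def population_variance(values):
    """Variance of a finite population in which each value is equally likely."""
    n = len(values)
    mean = Fraction(sum(values), n)
    return sum((v - mean) ** 2 for v in values) / n

def expected_sample_variance(values, sample_size):
    """Exact expected value of the divisor-n sample variance.

    Averages over every ordered sample drawn with replacement, so no
    random simulation is involved.
    """
    samples = list(product(values, repeat=sample_size))
    total = sum(population_variance(sample) for sample in samples)
    return total / len(samples)

# Hypothetical population: trust scores 0..10, all equally likely.
scores = list(range(11))
sigma2 = population_variance(scores)
mean_s2 = expected_sample_variance(scores, 2)
print(sigma2, mean_s2, mean_s2 / sigma2)  # prints: 10 5 1/2
```

Doubling the sample variance therefore recovers the population variance on average, which is the unbiased estimator described above.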
No matter what statistic you’re trying to estimate, you can talk about
the “effective sample size” of an estimator. But for simplicity, I’ll only
talk about estimating the mean.
Here’s a loose definition:
The effective sample size of an estimator of the population mean is
the number $n_{eff}$ with the property that our estimator has the same
precision (or variance) as the estimator got by sampling $n_{eff}$
independent individuals.
Let’s unpack that.
Suppose we choose $n$ individuals at random from the population (with
replacement, if you care). So we have independent, identically distributed
random variables $Y_1, \ldots, Y_n$. As above, we take the sample mean
$$\frac{1}{n} Y_1 + \cdots + \frac{1}{n} Y_n$$
as our estimator of the population mean. Since variance is additive for
independent random variables, the variance of this estimator is
$$n \cdot \operatorname{Var}\Bigl( \frac{1}{n} Y_1 \Bigr) = n \cdot \frac{1}{n^2} \operatorname{Var}(Y_1) = \frac{\sigma^2}{n}$$
where $\sigma^2$ is the population variance. The precision of the
estimator is, therefore, $n/\sigma^2$. That makes sense: as your sample
size $n$ increases, the precision of your estimate increases too.
Now, suppose we have some other estimator $\hat{\mu}$ of the population
mean. It’s a random variable, so it has a variance $\operatorname{Var}(\hat{\mu})$. The
effective sample size of the estimator $\hat{\mu}$ is the number $n_{eff}$
satisfying
$$\sigma^2/n_{eff} = \operatorname{Var}(\hat{\mu}).$$
This doesn’t entirely make sense, as the unique number $n_{eff}$ satisfying
this equation needn’t be an integer, so we can’t sensibly talk about a
sample of size $n_{eff}$. Nevertheless, we can absolutely rigorously
define the effective sample size of our estimator $\hat{\mu}$ as
$$n_{eff} = \sigma^2/\operatorname{Var}(\hat{\mu}).$$
And that’s the definition. Differently put,
$$\text{effective sample size} = \text{precision} \times \text{population variance}.$$
Trivial examples If $\hat{\mu}$ is the mean value of $n$
uncorrelated individuals, then the effective sample size is $n$. If
$\hat{\mu}$ is the mean value of $n$ extremely highly correlated
individuals, then the variance of the estimator is little less than the
variance of a single individual, so the effective sample size is little
more than $1$.
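Between those two extremes sits the clustered survey from the opening question. Here is a minimal sketch under an assumed toy model, which is not something the survey itself tells us: every respondent has the same variance, two respondents at the same location have correlation `rho`, and respondents at different locations are uncorrelated. Under that model the unweighted mean of $m$ locations with $k$ people each has variance $\sigma^2 (1 + (k-1)\rho)/(mk)$, so its effective sample size is $mk/(1 + (k-1)\rho)$. The function name and the value 0.05 are hypothetical.

```python
from fractions import Fraction

def n_eff_of_plain_mean(locations, per_location, rho):
    """Effective sample size of the unweighted mean under a toy cluster model.

    Assumed model (not from the survey itself): every respondent has the
    same variance, two respondents at the same location have correlation
    rho, and respondents at different locations are uncorrelated.
    """
    n = locations * per_location
    # Sum of all entries of the correlation matrix: n ones on the diagonal
    # plus rho for each ordered pair of distinct people sharing a location.
    total_correlation = n + locations * per_location * (per_location - 1) * rho
    # Var(mean) = sigma^2 * total_correlation / n^2, so n_eff = n^2 / total.
    return Fraction(n * n) / total_correlation

rho = Fraction(1, 20)  # hypothetical within-location correlation of 0.05
for k in (1, 10, 20, 30):
    print(k, float(n_eff_of_plain_mean(10, k, rho)))
```

With this assumed correlation, 20 people at each of 10 locations gives an effective sample size of about 102.6. The answer is very sensitive to the assumption, though: in this model the effective sample size stays below $m/\rho$ however many people are asked at each location, so with $\rho = 0.1$ no amount of extra polling at 10 locations would reach 100.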
Now, suppose our pollsters have come back from their trips to various parts
of Elbonia. Together, they’ve asked $n$ individuals how much they trust the
president. We want to take that data and use it to estimate the population
mean — that is, the mean level of trust in the president across
Elbonia — in as precise a way as possible.
We’re going to restrict ourselves to unbiased estimators, so that the
expected value of the estimator is the population mean. We’re also going
to consider only linear estimators: those of the form
$$a_1 Y_1 + \cdots + a_n Y_n$$
where $Y_1, \ldots, Y_n$ are the trust levels expressed by the $n$
Elbonians surveyed.
Question:
What choice of unbiased linear estimator maximizes the effective sample
size?
To answer this, we need to recall some basic statistical notions…
Correlation and covariance
Variance is a quadratic form, and covariance is the corresponding bilinear
form. That is, take two random variables $X$ and $Y$, with respective
means $\mu_X$ and $\mu_Y$. Then their covariance is
$$\operatorname{Cov}(X, Y) = E((X - \mu_X)(Y - \mu_Y)).$$
This is bilinear in $X$ and $Y$, and $\operatorname{Cov}(X, X) = \operatorname{Var}(X)$.
$\operatorname{Cov}(X, Y)$ is bounded above and below by $\pm \sigma_X \sigma_Y$, the
product of the standard deviations. It’s natural to normalize, dividing
through by $\sigma_X \sigma_Y$ to obtain a number between $-1$ and $1$.
This gives the correlation coefficient
$$\rho_{X, Y} = \frac{\operatorname{Cov}(X, Y)}{\sigma_X\sigma_Y} \in [-1, 1].$$
Alternatively, we can first scale $X$ and $Y$ to have variance $1$, then
take the covariance, and this also gives the correlation:
$$\rho_{X, Y} = \operatorname{Cov}(X/\sigma_X, Y/\sigma_Y).$$
Now suppose we have $n$ random variables, $Y_1, \ldots, Y_n$. The
correlation matrix $R$ is the $n \times n$ matrix whose $(i, j)$-entry
is $\rho_{Y_i, Y_j}$. Correlation matrices have some easily-proved properties:
- The entries are all in $[-1, 1]$.
- The diagonal entries are all $1$.
- The matrix is symmetric.
- The matrix is positive semidefinite. That’s because the corresponding
  quadratic form is $(a_1, \ldots, a_n) \mapsto \operatorname{Var}(\sum a_i Y_i/\sigma_i)$,
  and variances are nonnegative.
And actually, it’s not so hard to prove that any matrix with these
properties is the correlation matrix of some sequence of random variables.
In what follows, for simplicity, I’ll quietly assume that the correlation
matrices we encounter are strictly positive definite. This only amounts to
assuming that no nontrivial linear combination of the $Y_i$s has variance zero,
in other words, that there are no exact linear relationships between the
random variables involved.
Back to the main question
Here’s where we got to. We surveyed $n$ individuals from our population,
giving $n$ identically distributed but not necessarily independent random
variables $Y_1, \ldots, Y_n$. Some of them will be correlated because of
geographical clustering.
We’re trying to use this data to estimate the population mean in as precise
a way as possible. Specifically, we’re looking for numbers $a_1, \ldots, a_n$
such that the linear estimator $\sum a_i Y_i$ is unbiased and has the
maximum possible effective sample size.
The effective sample size was defined as $n_{eff} = \sigma^2/\operatorname{Var}(\sum a_i Y_i)$,
where $\sigma^2$ is the variance of the distribution we’re drawing
from. Now we need to work out the variance in the denominator.
Let $R$ denote the correlation matrix of $Y_1, \ldots, Y_n$. I said a
moment ago that $(a_1, \ldots, a_n) \mapsto \operatorname{Var}(\sum a_i Y_i)$ is the
quadratic form corresponding to the bilinear form represented by the
covariance matrix. Since each $Y_i$ has variance $\sigma^2$, the
covariance matrix is just $\sigma^2$ times the correlation matrix $R$. Hence
$$\operatorname{Var}(a_1 Y_1 + \cdots + a_n Y_n) = \sigma^2 \cdot a^\ast R a$$
where $\ast$ denotes a transpose and $a = (a_1, \ldots, a_n)$.
So, the effective sample size of our estimator is
$$1/a^\ast R a.$$
We also wanted our estimator to be unbiased. Its expected value is
$$E(a_1 Y_1 + \cdots + a_n Y_n) = (a_1 + \cdots + a_n) \mu$$
where $\mu$ is the population mean. So, we need $\sum a_i = 1$.
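The two formulas so far, effective sample size $1/a^\ast R a$ and the unbiasedness constraint $\sum a_i = 1$, are easy to try out numerically. Here is a minimal sketch that revisits the three men from Mali. The correlation of $1/2$ between the brothers, the assumption that the non-brother is uncorrelated with both, and the function name `effective_sample_size` are all made up for illustration.

```python
from fractions import Fraction

def effective_sample_size(weights, corr):
    """Return 1 / (a^T R a) for an unbiased linear estimator sum(a_i * Y_i)."""
    if sum(weights) != 1:
        raise ValueError("weights must sum to 1 for the estimator to be unbiased")
    n = len(weights)
    quad = sum(weights[i] * corr[i][j] * weights[j]
               for i in range(n) for j in range(n))
    return 1 / quad

# Hypothetical correlation of 1/2 between the brothers' masses (Y_2 and Y_3);
# the non-brother Y_1 is assumed uncorrelated with both.
r = Fraction(1, 2)
R = [[1, 0, 0],
     [0, 1, r],
     [0, r, 1]]

equal = [Fraction(1, 3)] * 3
family = [Fraction(1, 2), Fraction(1, 4), Fraction(1, 4)]
print(effective_sample_size(equal, R))   # 9/4
print(effective_sample_size(family, R))  # 16/7
```

Under this assumed correlation, both candidate estimators are worth fewer than three independent men, and the family-weighted one does slightly better than the plain mean. Exact fractions are used so that the check on the weights summing to 1 is not upset by rounding.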
Putting this together, the maximum possible effective sample size among all
unbiased linear estimators is
$$\sup \Bigl\{ \frac{1}{a^\ast R a} \, : \, a \in \mathbb{R}^n, \, \sum a_i = 1 \Bigr\}.$$
Which $a \in \mathbb{R}^n$ achieves this maximum, and what is the maximum
possible effective sample size? That’s easy, and in fact it’s something
that’s appeared many times at this blog before…
The magnitude of a matrix
The magnitude $|R|$ of an invertible $n \times n$ matrix $R$ is the sum of
all $n^2$ entries of $R^{-1}$. To calculate it, you don’t need to go as
far as inverting $R$. It’s much easier to find the unique column vector
$w$ satisfying
$$R w = \begin{pmatrix} 1 \\ \vdots \\ 1 \end{pmatrix}$$
(the weighting of $R$), then calculate $\sum_i w_i$. This sum is the
magnitude of $R$, since $w_i$ is the $i$th row-sum of $R^{-1}$.
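The recipe above (solve $R w = (1, \ldots, 1)$, then add up the entries of $w$) can be sketched in a few lines. This is just one possible implementation, using Gauss–Jordan elimination over exact fractions; the function names `weighting` and `magnitude` are chosen here for illustration.

```python
from fractions import Fraction

def weighting(matrix):
    """Solve R w = (1, ..., 1) by Gauss-Jordan elimination with exact fractions."""
    n = len(matrix)
    # Augmented matrix [R | 1].
    rows = [[Fraction(x) for x in row] + [Fraction(1)] for row in matrix]
    for col in range(n):
        # Find a row with a nonzero pivot in this column and swap it into place.
        pivot = next((r for r in range(col, n) if rows[r][col] != 0), None)
        if pivot is None:
            raise ValueError("matrix is singular")
        rows[col], rows[pivot] = rows[pivot], rows[col]
        # Scale the pivot row, then clear the column from every other row.
        rows[col] = [x / rows[col][col] for x in rows[col]]
        for r in range(n):
            if r != col and rows[r][col] != 0:
                factor = rows[r][col]
                rows[r] = [x - factor * p for x, p in zip(rows[r], rows[col])]
    return [row[n] for row in rows]

def magnitude(matrix):
    """Sum of the weighting, which equals the sum of all entries of the inverse."""
    return sum(weighting(matrix))

R = [[1, Fraction(1, 2)],
     [Fraction(1, 2), 1]]
print(weighting(R), magnitude(R))
```

For this $2 \times 2$ matrix the weighting is $(2/3, 2/3)$ and the magnitude is $4/3$.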
Most of what I’ve written about magnitude
has been in the situation where we start with a finite metric space
$X = \{x_1, \ldots, x_n\}$, and we use the matrix $Z$ with entries
$Z_{i j} = \exp(-d(x_i, x_j))$. This turns out to give interesting information about
$X$. In the metric situation, the entries of the matrix $Z$ are between
$0$ and $1$. Often $Z$ is positive definite (e.g. when
$X \subset \mathbb{R}^n$), as correlation matrices are.
When $R$ is positive definite, there’s a third way to describe the
magnitude:
$$|R| = \sup \Bigl\{ \frac{1}{a^\ast R a} \, : \, a \in \mathbb{R}^n, \, \sum a_i = 1 \Bigr\}.$$
The supremum is attained just when $a = w/|R|$, and the proof is a simple
application of the Cauchy–Schwarz inequality.
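Here is one way the Cauchy–Schwarz step might be filled in (a sketch; the notation $\langle x, y \rangle = x^\ast R y$ and $\mathbf{1}$ for the column vector of $1$s is introduced only for this argument). Since $R$ is positive definite, $\langle \cdot, \cdot \rangle$ is an inner product on $\mathbb{R}^n$. For any $a$ with $\sum a_i = 1$, the defining equation $R w = \mathbf{1}$ gives

$$
\langle a, w \rangle = a^\ast R w = a^\ast \mathbf{1} = 1,
\qquad
\langle w, w \rangle = w^\ast R w = w^\ast \mathbf{1} = |R|.
$$

In particular $|R| > 0$, because $w \neq 0$. Cauchy–Schwarz then says

$$
1 = \langle a, w \rangle^2 \leq \langle a, a \rangle \, \langle w, w \rangle = (a^\ast R a) \, |R|,
\qquad\text{so}\qquad
\frac{1}{a^\ast R a} \leq |R|.
$$

Equality holds exactly when $a$ is a scalar multiple of $w$, and the constraint $\sum a_i = 1$ forces that scalar to be $1/|R|$.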
But that supremum is exactly the expression we had for maximum effective sample size! So:
The maximum possible value of $n_{eff}$ is $|R|$.
Or more wordily:
The maximum effective sample size of an unbiased linear estimator of the
mean is the magnitude of the sample correlation matrix.
Or wordily but approximately:
Effective sample size = magnitude of correlation matrix.
Moreover, we know how to attain that maximum. It’s attained if and only if
our estimator is
$$\frac{1}{|R|} (w_1 Y_1 + \cdots + w_n Y_n)$$
where $w = (w_1, \ldots, w_n)$ is the weighting of the correlation matrix.
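As a worked instance, return to the three men from Mali, under the hypothetical assumption that the two brothers’ masses have correlation $\rho \in [0, 1)$ and the non-brother is uncorrelated with both. Solving $R w = (1, 1, 1)$ gives

$$
R = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & \rho \\ 0 & \rho & 1 \end{pmatrix},
\qquad
w = \Bigl( 1, \frac{1}{1 + \rho}, \frac{1}{1 + \rho} \Bigr),
\qquad
|R| = \frac{3 + \rho}{1 + \rho},
$$

so the maximum-precision estimator under this assumption is

$$
\frac{1 + \rho}{3 + \rho} Y_1 + \frac{1}{3 + \rho} Y_2 + \frac{1}{3 + \rho} Y_3.
$$

At $\rho = 0$ this is the plain mean with weights $(\frac{1}{3}, \frac{1}{3}, \frac{1}{3})$, and as $\rho \to 1$ the weights tend to $(\frac{1}{2}, \frac{1}{4}, \frac{1}{4})$. For example, $\rho = \frac{1}{2}$ gives weights $(\frac{3}{7}, \frac{2}{7}, \frac{2}{7})$ and effective sample size $\frac{7}{3}$.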
I’m not too sure where this “result” — observation, really —
comes from. I learned it from the statistician Paul Blackwell at
Sheffield, who, like me, had been reading this paper:
Andrew Solow and Stephen Polasky, Measuring biological diversity.
Environmental and Ecological Statistics 1 (1994), 95–103.
In turn, Solow and Polasky refer to this:
Morris Eaton, A group action on covariances with applications to the
comparison of linear normal experiments. In: Moshe Shaked and Y.L. Tong
(eds.), Stochastic inequalities: Papers from the AMS-IMS-SIAM Joint Summer
Research Conference held in Seattle, Washington, July 1991, Institute of
Mathematical Statistics Lecture Notes — Monograph Series, Volume 22,
1992.
But the result is so simple that I’d imagine it’s much older. I’ve been
wondering whether it’s essentially the Gauss–Markov theorem; I
thought it was, then I thought it wasn’t. Does anyone know?
The surprising behaviour of effective sample size
You might expect the effective size of a sample of $n$ individuals to be at
most $n$. It’s not.
You might expect the effective sample size to go down as the correlations
within the sample go up. It doesn’t.
This behaviour appears in even the simplest nontrivial example:
Example Suppose our sample consists of just two individuals.
Call the sampled values $Y_1$ and $Y_2$, and write the correlation matrix
as
$$R = \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix}.$$
Then the maximum-precision unbiased linear estimator is
$\frac{1}{2}(Y_1 + Y_2)$, and its effective sample size is
$$|R| = \frac{2}{1 + \rho}.$$
As the correlation $\rho$ between the two variables increases from $0$ to
$1$, the effective sample size decreases from $2$ to $1$, as you’d expect.
But when $\rho < 0$, the effective sample size is greater than $2$. In
fact, as $\rho \to -1$, the effective sample size tends to $\infty$.
That’s intuitively plausible. For if $\rho$ is close to $-1$ then, writing
$Y_1 = \mu + \varepsilon_1$ and $Y_2 = \mu + \varepsilon_2$, we have
$\varepsilon_1 \approx -\varepsilon_2$, and so $\frac{1}{2}(Y_1 + Y_2)$ is a very good estimator
of $\mu$. In the extreme, when $\rho = -1$, it’s an exact estimator of
$\mu$: it’s infinitely precise.
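The computation behind this example is short enough to write out, for $-1 < \rho < 1$. The weighting equations are

$$
w_1 + \rho w_2 = 1, \qquad \rho w_1 + w_2 = 1.
$$

Subtracting gives $(1 - \rho)(w_1 - w_2) = 0$, so $w_1 = w_2 = \frac{1}{1 + \rho}$. Hence

$$
|R| = w_1 + w_2 = \frac{2}{1 + \rho},
\qquad
\frac{w}{|R|} = \Bigl( \frac{1}{2}, \frac{1}{2} \Bigr).
$$

As a direct check, $\operatorname{Var}\bigl( \frac{1}{2}(Y_1 + Y_2) \bigr) = \frac{\sigma^2}{4} (1 + 1 + 2\rho) = \sigma^2 \cdot \frac{1 + \rho}{2}$, and dividing $\sigma^2$ by this gives $\frac{2}{1 + \rho}$ again.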
The fact that the effective sample size can be greater than the actual
sample size seems to be very well known. For instance, there’s a whole
page about it in the documentation for Q, which is apparently “analysis software for
market research”.
What’s interesting is that this doesn’t only occur when
some of the variables are negatively correlated. It can also happen when
all the correlations are nonnegative, as in the following example from the
paper by Eaton cited above.
Example Consider the correlation matrix
$$R = \begin{pmatrix} 1 & 0 & \rho \\ 0 & 1 & \rho \\ \rho & \rho & 1 \end{pmatrix}$$
where $0 \leq \rho < \sqrt{2}/2 = 0.707\ldots$. This is positive
definite, so it’s the correlation matrix of some random variables
$Y_1, Y_2, Y_3$.
A routine computation shows that
$$|R| = \frac{3 - 4\rho}{1 - 2\rho^2}.$$
As we’ve shown, this is the greatest possible effective sample size you can achieve by taking an unbiased linear combination of $Y_1$, $Y_2$ and $Y_3$.
When $\rho = 0$, it’s $3$, as you’d
expect: the variables are uncorrelated. As $\rho$ increases, $|R|$
decreases, again as you’d expect: more correlation between the variables
leads to a smaller effective sample size. This behaviour continues until
$\rho = 1/2$, where $|R| = 2$.
But then something strange happens. As $\rho$ increases from $1/2$ to
$\sqrt{2}/2$, the effective sample size increases from $2$ to $\infty$.
Increasing the correlation increases the effective sample size. For
instance, when $\rho = 0.7$, we have $|R| = 10$: the
maximum-precision estimator is as precise as if we’d chosen $10$
independent individuals! For that value of $\rho$, the maximum-precision
estimator turns out to be
$$\frac{3}{2} Y_1 + \frac{3}{2} Y_2 - 2 Y_3.$$
Go figure!
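For anyone who would rather see the numbers than trust the “routine computation”, here is a small sketch. It uses the closed-form weighting $w = \bigl( \frac{1 - \rho}{1 - 2\rho^2}, \frac{1 - \rho}{1 - 2\rho^2}, \frac{1 - 2\rho}{1 - 2\rho^2} \bigr)$, which is my own working rather than something taken from Eaton’s paper, and it verifies $R w = (1, 1, 1)$ before using it. The function name `eaton_example` is hypothetical.

```python
from fractions import Fraction

def eaton_example(rho):
    """Magnitude and optimal weights for R = [[1,0,rho],[0,1,rho],[rho,rho,1]]."""
    rho = Fraction(rho)
    det = 1 - 2 * rho ** 2  # determinant of R; positive when |rho| < sqrt(2)/2
    if det <= 0:
        raise ValueError("R is not positive definite for this rho")
    w = [(1 - rho) / det, (1 - rho) / det, (1 - 2 * rho) / det]
    R = [[1, 0, rho], [0, 1, rho], [rho, rho, 1]]
    # Sanity check that w really is the weighting: every entry of R w is 1.
    assert all(sum(R[i][j] * w[j] for j in range(3)) == 1 for i in range(3))
    mag = sum(w)
    return mag, [x / mag for x in w]

for rho in ("0", "1/4", "1/2", "3/5", "7/10"):
    mag, a = eaton_example(rho)
    print(rho, mag, [str(x) for x in a])
```

The weight on $Y_3$ falls to zero at $\rho = 1/2$ and is negative beyond it, which is exactly where the effective sample size turns around and starts to rise.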
This is very like the fact that a metric space with $n$ points can have
magnitude (“effective number of points”) greater than $n$, even if the
associated matrix $Z$ is positive definite.
These examples may seem counterintuitive, but Eaton cautions us
to beware of our feeble intuitions:
These examples show that our rather vague intuitive feeling that
“positive correlation tends to decrease information content in an
experiment” is very far from the truth, even for rather simple normal
experiments with three observations.
Anyone with any statistical knowledge who’s still reading will easily have
picked up on the fact that I’m a total amateur. If that’s you, I’d love to
hear your comments!