Since the mean of many highly correlated quantities has higher variance than does the mean of many quantities that are not as highly correlated, the test error estimate resulting from LOOCV tends to have higher variance than does the test error estimate resulting from k-fold CV.
Can you give me a numerical example, e.g. in R, to check the validity of this claim from [1]? I tried to check it using the following code:
    x = 1:100        # highly correlated data
    y = sample(100)  # the same data, without correlation
    var(x) == var(y) # TRUE
[1] James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013), An Introduction to Statistical Learning with Applications in R, Springer Texts in Statistics, Springer Science+Business Media, New York.
Comments:

- "If you sort(sample(100)) you will see it is identical to 1:100 and hence their variances are identical. Can't help you with the first bit of your post - I would have thought correlated quantities have lower variance (e.g. intra-cluster correlations), but then I don't know what LOOCV is." (Jul 12, 2016 at 21:23)
- "I know they are the same and just the ordering is changed. However, 1:100 are correlated numbers but sample(100) are not." (Jul 12, 2016 at 21:27)
- "The vector 1:100 is no more correlated than the vector sample(100). They may have been generated differently but, apart from the ordering, are identical. Certainly the calculation of variance does not take ordering into account. There are examples online of how to simulate correlated data, which are probably what you need." (Jul 12, 2016 at 21:38)
- "Try acf(x) and acf(y) and see for yourself!" (Jul 12, 2016 at 21:47)
- "Ah, I didn't think about autocorrelation. Still, the reason the variances are equal is that the ordering is irrelevant to the var function, just as it would be if you calculated the variance by hand. Also, and possibly more helpfully, see link." (Jul 12, 2016 at 22:04)

The variance computed in the code views each array as if it were one sample of 100 separate values. Because both the array and its permuted version contain the same 100 values, they have the same variance.
The right way to simulate the situation in the quotation requires repetition. Generate a sample of values. Compute its mean. (This plays the role of "test error estimate.") Repeat many times. Collect all these means and look at how much they vary. This is the "variance" referred to in the quotation.
We may anticipate what will happen: within a positively correlated sample the values tend to be high together or low together, so their deviations from the true mean reinforce rather than cancel in the average; the means of correlated samples should therefore be more spread out than the means of uncorrelated samples, where high and low values tend to cancel.
R makes it easy to put this into action. The main trick is to generate correlated samples. One way is to use standard Normal variables: linear combinations of them can be used to induce any amount of correlation you might like.
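For instance, here is a small sketch of one such construction (an assumption on my part, one recipe among several, not necessarily the one behind the figures below): mixing a shared standard Normal component into independent components produces values with unit variances and any pairwise correlation rho between 0 and 1.

    # Sketch: induce pairwise correlation rho by mixing a shared Normal
    # component z0 into independent Normal components z1 and z2.
    rho   <- 0.9
    n.sim <- 5000
    z0 <- rnorm(n.sim)                 # shared component, one per sample
    z1 <- rnorm(n.sim)                 # individual components
    z2 <- rnorm(n.sim)
    y1 <- sqrt(rho) * z0 + sqrt(1 - rho) * z1
    y2 <- sqrt(rho) * z0 + sqrt(1 - rho) * z2
    cor(y1, y2)                        # close to rho = 0.9
    c(var(y1), var(y2))                # each close to 1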
Here, for instance, are the results of this repeated experiment when it was conducted 5,000 times using samples of size $n=2$. In one case the samples were obtained from a standard Normal distribution. In the other they were obtained in a similar way, again with zero means and unit variances, but from a distribution with a correlation coefficient of $90\%$.
The top row shows the frequency distributions of all 5,000 means. The bottom row shows the scatterplots generated by all 5,000 pairs of data. From the difference in spreads of the histograms, it is clear the set of means from the uncorrelated samples is less scattered than the set of means from the correlated samples, exemplifying the "canceling out" argument.
The difference in the amount of spread becomes more pronounced with higher correlation and with larger sample sizes. The R code below allows you to specify these as rho and n, respectively, so you can experiment. Like the code in the question, its aim is to produce arrays x (from the uncorrelated samples) and y (from the correlated samples) for further comparison.
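Here is a minimal sketch of such code, reusing the shared-component construction from above (an assumed choice; the exact numbers in the output below depend on the seed and construction, so a re-run will give similar but not identical values):

    set.seed(17)     # any seed; exact results vary with it
    n.sim <- 5000    # number of repetitions
    n     <- 2       # sample size
    rho   <- 0.9     # pairwise correlation within the correlated samples

    # x: means of n.sim uncorrelated samples of size n
    x <- replicate(n.sim, mean(rnorm(n)))

    # y: means of n.sim correlated samples of size n, built by mixing a
    # shared Normal component into n independent components, as above
    y <- replicate(n.sim, {
      z0 <- rnorm(1)
      mean(sqrt(rho) * z0 + sqrt(1 - rho) * rnorm(n))
    })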
Now when you compute the variances of the arrays of means x and y , their values will differ:
    > var(x)
    [1] 0.5035174
    > var(y)
    [1] 0.9590535
Theory tells us these variances will be close to $(1+1)/2^2 = 0.5$ and $(1 + 2\times 0.9 + 1)/2^2 = 0.95$. They differ from the theoretical values only because just 5,000 repetitions were done. With more repetitions, the variances of x and y will tend closer to their theoretical values.
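More generally, for a sample of $n$ values with unit variances and common pairwise correlation $\rho$, $\operatorname{Var}(\bar Y) = \frac{1}{n^2}\sum_{i,j}\operatorname{Cov}(Y_i, Y_j) = \frac{1+(n-1)\rho}{n}$. As $n$ grows this tends to $\rho$, whereas the variance $1/n$ of the uncorrelated means tends to $0$, which is why the difference in spread becomes more pronounced with higher correlation and with larger sample sizes.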