Introduction
Correlation has been a cue-word for a number of alleged and real
findings. Usually, correlation describes the tendency of two measures
to vary in a (more or less) similar manner. In this playful study I am dealing with Pearson's product moment correlation coefficient, or Pearson's r, if you prefer.
One way correlation is used in the literature is to check whether an
estimator you have created is good enough at predicting a quantity,
or whether, e.g., a medical index is predictive of a problem. Let us
consider a finance-based example, where someone creates a stock
predictor and happily claims that his “predicted values
show a 91% linear correlation, with a p-value much below
0.0001, to the actual value of the stock”. This seems to
be really something, doesn't it? It implies that our
claiming friend can predict the value of the stock with
his stock predictor.
But you know me: I can be really annoying at times. I wanted to
see whether such a number is always what we expect it to be. So, I
went the other way round: I started with a clear linear correlation
between an estimate and a real value, and worked my way towards tricking the
statistical test (Pearson's) or, at least, towards seeing how its output is
affected by noise. And the findings were interesting. Let's
follow through this statistical playground below, but I want to make
a comment early on: if one works directly with the analytical formula for
Pearson's r, I am certain it will be quite clear how each type
of noise can affect it; however, in this setting, I try to work
empirically, to improve our intuition through experimentation.
On with the show!
NOTES:
i) The text is quite long, so take your time... I have split it into sections to help with reading.
ii) I do not expect to cause any surprise to mathematicians or statisticians. But I want to slowly raise awareness of the dangers of scientific claims and the fallacies built upon (erroneously used) statistics.
iii) In the text below I am using (simple) R language functions to illustrate my points, without going into detail about the technical specifics and without any effort to optimize. The code is merely a tool to demonstrate my ideas.
First things first: nice correlations
First, I will create a sequence of 801 increasing numbers between -0.3 and 0.7 (arbitrary bounds, in 800 steps of 1/800), which we consider
to be the set of REAL values we want to predict:
x = seq(-0.3, 0.7, by = 1/800)
OK. Now, let us create a variable which has a linear relation to
x.
y = 0.5 + 0.1 * x
The figure below clearly shows this relation: the horizontal axis
represents x, while the vertical represents y. It is clear that if x
goes up, so will y. Or, if we get two y values, e.g. y1 < y2,
we can be quite safe in expecting that the corresponding x1
< x2. So, y is a useful predictor of (the increase or decrease of) x. If y were an estimate
of the value of a stock (x), and the y for today’s market were lower
than the y for tomorrow, then we would expect the market to go up
tomorrow (as y does).
Let us try to get the Pearson correlation of these instances. Using the
cor.test(x, y, method='pearson')
function from the R language, we find that the
correlation is 1, with a p-value of almost zero (< 10^(-15))
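By the way, if you want to grab these numbers programmatically, cor.test returns a standard R htest object; its estimate and p.value components hold exactly what we read above:
ct <- cor.test(x, y, method='pearson')
ct$estimate  # the correlation coefficient (here: 1)
ct$p.value   # the p-value of the test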
Pearson easily detects the correlation. Keep in mind that the figure
is not actually a line, but a drawing including many, many small
circles close to one another. These small circles are our
instances (x, y). Again, we consider that x is the real-world
variable (the one we would like to have an estimate of), while y
is our estimate of the value of x.
Let us, now, add some noise (about an order of magnitude less than
the actual x value) to the y variable, based on a sine function:
y = 0.5 + 0.1 * x + 0.01 * sin(2 * pi * x)
...and the correlation is...
0.9757339 with a p-value of almost zero (< 10^(-15))
The correlation is clear, despite the (limited) noise.
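If you want to reproduce a figure like the one I describe, a minimal plotting sketch (the graphical parameters are arbitrary picks of mine) would be:
plot(x, y, pch = 20, cex = 0.3, xlab = 'x (real value)', ylab = 'y (our estimate)')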
Let's now increase the frequency of the sine function by 10-fold:
y = 0.5 + 0.1 * x + 0.01 * sin(20 * pi * x)
The correlation is now 0.9704897, with a p-value of almost zero (< 10^(-15)).
Thus, even though the changes are more abrupt, our test detects no
significant change. It is still certain of the linear correlation.
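A quick sketch to sweep over a few frequencies at once (the specific frequency factors are arbitrary picks of mine) and print the resulting r:
for (f in c(2, 20, 200)) {
  y.f <- 0.5 + 0.1 * x + 0.01 * sin(f * pi * x)
  cat('frequency factor', f, '-> r =', cor(x, y.f), '\n')
}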
Moving to bigger challenges
What happens if we change the strength
of the original noise function (the height of the sine function) by
increasing it 10-fold?
y = 0.5 + 0.1 * x + 0.1 * sin(2 * pi * x)
The correlation now is
0.555662 with a p-value of almost zero (as above)
Now, this is a problem. One can see (visually in the figure above) that the
connection between our predictor (y) and the real value (x) is far
from a simple linear relation. But our statistical test still sees
a medium to strong correlation.
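Again, a small sketch to watch r degrade as the amplitude grows (the amplitude values are arbitrary picks of mine):
amps <- c(0.01, 0.05, 0.1, 0.2)
sapply(amps, function(a) cor(x, 0.5 + 0.1 * x + a * sin(2 * pi * x)))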
Prediction and correlation
One step before closing this playground on Pearson’s r coefficient, I will try to see
whether we can really predict x based on y. I will only use the first
and the last versions of our formulas to see what we can do.
In the case of the clear, linear correlation:
y = 0.5 + 0.1 * x  <=>  x = (y - 0.5) / 0.1  <=>  x = 10 * y - 5
Indeed, a linear regression model based on our samples only
(i.e. without exposing the true formula) finds
this exact relation between x and y.
We use the R command
myModel <- lm(x ~ y)
and get
Coefficients:
(Intercept)            y
         -5           10
It is now quite clear that, given a value of our prediction y,
we can estimate the true value of x.
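For example, we can feed a new value of y to the fitted model through predict() (the value 0.55 below is an arbitrary pick of mine):
predict(myModel, newdata = data.frame(y = 0.55))
# since x = 10 * y - 5, the model should return (roughly) 0.5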
Nice! Let us now try to do the same for the last case.
y = 0.5 + 0.1 * x + 0.1 * sin(2 * pi * x)
(I will not try to solve this for x here...)
The figure above illustrates the true relation (curve) and the estimated
relation (straight line).
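A minimal sketch to reproduce such a figure, assuming x and y are defined as above:
myModel2 <- lm(x ~ y)
plot(y, x, pch = 20, cex = 0.3)  # the point cloud follows the true (curved) relation
abline(myModel2, col = 'red')    # the straight line is the estimated relation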
For the skeptics, I will remove the dependency on x in the sine function,
adding a really orthogonal random variable between 0 and 1 (uniform
distribution):
x = seq(-0.3, 0.7, by = 1/800)
y = 0.5 + 0.1 * x + 0.1 * sin(5 * pi * runif(length(x)))  # runif(length(x)), so the noise vector matches x in length
The correlation is now
0.3756015 with a p-value of almost zero
Still a medium correlation. The
figure we get by repeating the regression process is the following. Let
me stress again that the red
line is the connection between y (i.e. what we call our predictor)
and the real value (x).
What the illustration shows is that, essentially, even
though we have a clearly statistically significant, medium
correlation, building a linear prediction model is not effective/helpful.
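To put a number on “not effective”, here is a sketch that refits the uniform-noise case and inspects the R-squared of the regression (I fix the random seed so the run is reproducible; the seed value is an arbitrary pick of mine):
set.seed(42)  # arbitrary seed, for reproducibility
x <- seq(-0.3, 0.7, by = 1/800)
y <- 0.5 + 0.1 * x + 0.1 * sin(5 * pi * runif(length(x)))
summary(lm(x ~ y))$r.squared
# R^2 equals r^2 in simple linear regression: a medium r of about 0.37
# translates to an R^2 of about 0.14, i.e. y explains little of x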
Finale – Making the worst out of correlation
I will provide a last, hand-crafted example of how linear correlation
through Pearson’s product moment correlation coefficient can be
misleading. Imagine a case where the correlation is 0.9183222
with a p-value of essentially zero.
Now let us depict this correlation and the estimated linear model
line (strong red line) in the figure below:
To generate the data use:
y = 100 * (0.5 + 0.1 * (2 + sin(16 * pi * seq(-3, 7, by = 1/800))) * seq(-0.6, 1.4, by = 2/8000))
x = 100 * (seq(-0.14, 0.67, by = -(-0.14 - 0.67)/8000))
NOTE: Don't ask how I ended up here: I was simply playing with various parameters and combinations for the fluctuation of the y values. This was the last state of my playground, where my decision to write a blog post was made, and I had not kept all the intermediate steps.
Essentially, this data offers values of x between -14 and 67 (the seq bounds, scaled by 100), over 8001 equidistant samples.
What one can see is that the predictive value of y regarding x is very good at specific points (i.e. where the variance along the y axis becomes minimal), while in other places it becomes almost useless (e.g. the right extreme of the figure, where very different values of y map to very similar values of x). Thus, even though there exists a possible underlying linear correlation between x and y, the 0.91 value of our statistic and the supportive p-value could easily fool us into believing we had an excellent case. Things become worse if one chooses a subset of the points, e.g. the ones with x values over 50, as shown below:
In this (targeted) case the
correlation calculation returns a correlation of 0.2461371 with
an almost-zero p-value. Thus, on a subset of the original
dataset we get a completely different picture of the correlation. And since we rarely have the full data existing in the world, our idea of the correlation between two quantities can be heavily biased by the subset we use to measure it.
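The subsetting step itself is a one-liner, assuming x and y are the vectors generated above:
keep <- x > 50
cor.test(x[keep], y[keep], method='pearson')  # returns the ~0.246 correlation reported above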
What is worse is the fact that, in this case, the predictive value of y with respect to x is almost zero (i.e. we cannot really detect x based on y, since y has a very strong fluctuation with only minor corresponding changes in x).