Thursday, 18 May 2017

An empirical (dummy) study on Pearson correlation and linear model prediction

Introduction

Correlation has been a buzzword behind a number of findings, both real and alleged. Usually, correlation describes the tendency of two measures to vary in a (more or less) similar manner. In this playful study I deal with Pearson's product moment correlation coefficient, or Pearson's r, if you prefer.
One way correlation is used in the literature is to check whether an estimator you have created is good enough at predicting a quantity, or whether, e.g., a medical index is predictive of a condition. Let us consider a finance-based example, where someone creates a stock predictor and happily claims that his “predicted values show a 91% linear correlation with the actual value of the stock, with a p-value well below 0.0001”. This seems to be really something, doesn't it? It implies that our claiming friend can predict the value of the stock with his predictor.
But you know me: I can be really annoying at times. I wanted to see whether such a number is always what we expect it to be. So, I went the other way round: I started with a clear linear correlation between an estimate and a real value and worked my way towards tricking the statistical test (Pearson's), or at least seeing how its output is affected by noise. The findings were interesting. Let's follow through this statistical playground below, but allow me a comment early on: if one works directly with the analytical formula for Pearson's r, it becomes quite clear how each type of noise affects it; here, however, I work empirically, to improve our intuition through experimentation.
On with the show!
NOTES: 
i) The text is quite long, so take your time... I have split it into sections to help with reading.
ii) I do not expect to surprise mathematicians or statisticians. But I want to slowly raise awareness of the dangers of scientific claims and of the fallacies built upon (erroneously used) statistics.
iii) In the text below I use (simple) R language functions to illustrate my points, without going into the technical specifics and without any effort to optimize. The code is merely a tool to demonstrate my ideas.

First things first: nice correlations
First, I will create a sequence of 801 increasing numbers between -0.3 and 0.7 (arbitrary bounds), which we consider to be the set of REAL values we want to predict:
x = seq(-0.3, 0.7, by = 1/800)
OK. Now, let us create a variable which has a linear relation to x.
y = 0.5 + 0.1 * x
The figure below clearly shows this relation: the horizontal axis represents x, while the vertical axis represents y. It is clear that if x goes up, so will y. Or, if we get two y values, e.g. y1 < y2, we can quite safely expect that the corresponding x1 < x2. So, y is a useful predictor of (the increase or decrease of) x. If y were an estimate of the value of a stock (x), and the value of y for today’s market were lower than our y for tomorrow, then we would expect the market to go up tomorrow (as y does).
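If you want to reproduce the figure yourself, a minimal plotting call along these lines should do (the styling is my own choice):
plot(x, y, pch = 20, cex = 0.3, xlab = 'x (real value)', ylab = 'y (estimate)')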

Let us try to get the Pearson correlation of these instances. Using the 
cor.test(x, y, method='pearson') 
function from the R language we find out that

correlation is 1 with p-value almost zero (< 10^(-15))

Pearson's test easily captures the correlation. Keep in mind that the figure does not actually show a line, but a drawing of many, many small circles close to one another. These small circles are our instances (x, y). Again, we consider x to be the real-world variable (the one we would like to estimate), while y is our estimate of the value of x.
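As a side note, if you prefer to read the numbers programmatically instead of from the printed test report, cor.test returns an object you can query (a minimal sketch):
ct <- cor.test(x, y, method = 'pearson')
ct$estimate  # Pearson's r (here: 1)
ct$p.value   # essentially zero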

Let us now add some noise to the y variable, based on a sine function (with an amplitude about an order of magnitude smaller than the overall variation of y):
y= 0.5 + 0.1 * x + 0.01*sin(2*pi*x)



...and the correlation is...

0.9757339 with p-value almost zero (< 10^(-15))
The correlation is clear, despite the (limited) noise.
Let's now increase the frequency of the sine function by 10-fold:

y= 0.5 + 0.1 * x + 0.01*sin(20*pi*x)

The correlation is now 0.9704897 with p-value almost zero (< 10^(-15)).
Thus, even though the changes are more abrupt, our test barely notices the difference: it remains certain of the linear correlation.
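To see this more systematically, here is a small sketch that sweeps the sine frequency and recomputes r each time (the two larger frequencies are my own additions); r stays in the same ballpark throughout:
freqs <- c(2, 20, 50, 100)
sapply(freqs, function(f) cor(x, 0.5 + 0.1 * x + 0.01 * sin(f * pi * x)))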

Moving to bigger challenges

What happens if we instead increase the strength of the original noise function (the amplitude of the sine) ten-fold?

y= 0.5 + 0.1 * x + 0.1 * sin(2*pi*x)

The correlation now is
0.555662, with a p-value of almost zero (as above)
Now, this is a problem. One can see (visually, in the figure above) that the connection between our predictor (y) and the real value (x) is far from a simple linear relation. Yet our statistical test still reports a medium-to-strong correlation.
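A similar sweep over the amplitude (the extra values are, again, my own picks) shows how steadily r degrades as the noise grows:
amps <- c(0.01, 0.05, 0.1, 0.2)
sapply(amps, function(a) cor(x, 0.5 + 0.1 * x + a * sin(2 * pi * x)))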

Prediction and correlation

Before closing this playground on Pearson’s r coefficient, I will try to see whether we can really predict x based on y. I will only use the first and the last versions of our formulas to see what we can do.
In the case of the clear, linear correlation:
y = 0.5 + 0.1*x <=> x = (y - 0.5) / 0.1 <=> x = 10*y - 5
Indeed, a linear regression model based only on our samples (i.e. without being given the true formula) finds this exact relation between x and y.
We use the R command
myModel <- lm(x~y)
and get
Coefficients:
(Intercept)            y
         -5           10
It is now quite clear that given a value of our prediction y, we can estimate the true value of x.
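For instance (a minimal sketch; the example value 0.55 is arbitrary):
predict(myModel, newdata = data.frame(y = 0.55))  # returns 0.5, i.e. 10*0.55 - 5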
Nice! Let us now try to do the same for the last case.

y= 0.5 + 0.1 * x + 0.1 * sin(2*pi*x) (I will not try to solve this here...)


The figure above illustrates the true relation (curve) and the estimated relation (straight line).
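For completeness, a sketch of how such a figure can be produced from the y defined just above (the model name is mine):
noisyModel <- lm(x ~ y)                   # fit on the noisy samples only
plot(y, x, pch = 20, cex = 0.3)           # the true relation: a curve
abline(noisyModel, col = 'red', lwd = 2)  # the estimated relation: a straight line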

For the skeptics, I will remove the dependency on x inside the sine term, feeding it instead a truly independent random variable between 0 and 1 (uniform distribution):
x = seq(-0.3, 0.7, by = 1/800)
y = 0.5 + 0.1 * x + 0.1 * sin(5*pi*runif(length(x)))  # one random draw per sample, so the vectors match
The correlation is now
0.3756015, with a p-value of almost zero.

Still a medium correlation. Repeating the regression process gives the figure below. Once more, the red line is the estimated connection between y (i.e. what we call our predictor) and the real value (x). What the illustration shows is that, even though we have a clearly statistically significant medium correlation, building a linear prediction model is clearly not effective or helpful.
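The figure can be reproduced with a sketch like the following (the model name is mine); the R-squared of the fit also confirms how little of x the model explains:
randModel <- lm(x ~ y)
plot(y, x, pch = 20, cex = 0.3)
abline(randModel, col = 'red', lwd = 2)
summary(randModel)$r.squared  # ~0.14 for the run reported above, i.e. 0.3756^2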

Finale – Making the worst out of correlation


I will provide a last, hand-crafted example of how linear correlation through Pearson’s product moment correlation coefficient can be misleading. Imagine a case where the correlation is 0.9183222 with a p-value of essentially zero. Now let us depict this correlation and the estimated linear model line (bold red line) in the figure below:


To generate the data use:

y = 100 * (0.5 + 0.1 * (2 + sin(16*pi*seq(-3, 7, by = 1/800))) * seq(-0.6, 1.4, by = 2/8000))
x = 100 * seq(-0.14, 0.67, by = (0.67 - (-0.14))/8000)
NOTE: Don't ask how I ended up here: I was simply playing with various parameters and combinations for the fluctuation of the y values. This was the last state of my playground when the decision to write a blog post was made, and I had not kept all the intermediate steps.
Essentially, this data gives values of x between -14 and 67, over 8000 equidistant samples.
What one can see is that the predictive value of y regarding x is very good at specific points (i.e. where the variance along the y axis becomes minimal), while at other places it becomes almost useless (e.g. at the right extreme of the figure, where very different values of y map to very similar values of x). Thus, even though there may exist an underlying linear relation between x and y, the 0.91 value of our statistic and the supportive p-value could easily have fooled us into believing we had an excellent case. Things become worse if one chooses a subset of the points, e.g. the ones with x values over 50, as shown below:
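In code, the subset experiment behind that figure is essentially a one-liner (a sketch; 'sel' is my own name):
sel <- x > 50
cor.test(x[sel], y[sel], method = 'pearson')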

In this (targeted) case the correlation calculation returns a correlation of 0.2461371 with an almost-zero p-value. Thus, on a subset of the original dataset we get a completely different picture of the correlation. And since we rarely have all the data that exist in the world, our idea of the correlation between two quantities can be heavily biased by the subset we use to measure it.
What is worse is the fact that, in this case, the predictive value of y with respect to x is almost zero (i.e. we cannot really infer x based on y, since y fluctuates strongly with only minor corresponding changes in x).
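To put a number on that last claim, the R-squared of a fit on the subset says it all (a sketch, reusing the sel mask from above):
subModel <- lm(x[sel] ~ y[sel])
summary(subModel)$r.squared  # about 0.06 (0.2461^2): y explains almost none of x's variance here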