Thursday, May 18, 2017

An empirical (dummy) study on Pearson correlation and linear model prediction

Introduction

Correlation has been a buzzword behind a number of findings, both alleged and real. Usually, correlation describes the tendency of two measures to vary in a (more or less) similar manner. In this playful study I am dealing with Pearson's product-moment correlation coefficient, or Pearson's r if you prefer.
One way correlation is used in the literature is to check whether an estimator you have created is good enough at predicting a quantity, or whether, e.g., a medical index is predictive of a condition. Let us consider a finance-based example, where someone creates a stock predictor and happily claims that his “predicted values show a 91% linear correlation with the actual value of the stock, with a p-value much below 0.0001”. This seems to be really something, doesn't it? It implies that our friend can indeed predict the value of the stock with his predictor.
But you know me: I can be really annoying at times. I wanted to see whether such a number is always what we expect it to be. So, I went the other way round: I started with a clear linear correlation between an estimate and a real value and worked my way towards tricking the statistical test (Pearson's) or, at least, seeing how its output is affected by noise. And the findings were interesting. Let's follow through this statistical playground below, but I want to make a comment early on: if one works directly with the analytical formula for Pearson's r, I am certain it will be quite clear how each type of noise can affect it; here, however, I try to work empirically, to improve our intuition through experimentation.
On with the show!
NOTES: 
i) The text is quite long, so take your time... I have split it into sections to ease reading.
ii) I do not expect to surprise any mathematicians or statisticians. But I want to slowly raise awareness of the dangers of scientific claims and the fallacies built upon (erroneously used) statistics.
iii) In the text below I am using (simple) R language functions to illustrate my points, without going into the technical specifics and without any effort to optimize. The code is merely a tool to demonstrate my ideas.

First things first: nice correlations
First, I will create a sequence of increasing numbers between -0.3 and 0.7 (arbitrary bounds), in steps of 1/800 (roughly 800 values), which we consider to be the set of REAL values we want to predict:
x = seq(-0.3, 0.7, by = 1/800)
OK. Now, let us create a variable which has a linear relation to x.
y = 0.5 + 0.1 * x
The figure below clearly shows this relation: the horizontal axis represents x, while the vertical represents y. It is clear that if x goes up, so will y. Or, if we get two y values, e.g. y1 < y2, we are quite safe to expect that the corresponding x1 < x2. So, y is a useful predictor of (the increase or decrease of) x. If y were an estimate of the value of a stock (x), and our estimate y for tomorrow is higher than the y for today, then we expect that the market will go up tomorrow (as y does).
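(For reference, a plot like the one below can be produced with something along these lines; the styling of my original figures is not preserved, so the parameters here are just my assumptions.)
plot(x, y, pch = 20, cex = 0.3, xlab = "x (real value)", ylab = "y (our estimate)")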

Let us try to get the Pearson correlation of these instances. Using the 
cor.test(x, y, method='pearson') 
function from the R language we find out that

correlation is 1 with p-value almost zero (< 10^(-15))
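If you want to grab these numbers programmatically rather than read them off the printed output, the object returned by cor.test exposes them directly; a minimal sketch:
ct <- cor.test(x, y, method = 'pearson')
ct$estimate   # the correlation coefficient (here: 1)
ct$p.value    # the p-value (here: practically zero)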

Pearson's test easily picks up the correlation. Keep in mind that the figure is not actually a line, but a drawing of many, many small circles close to one another. These small circles are our instances (x, y). Again, we consider that x is the real-world variable (the one we would like to have an estimate of), while y is our estimate of the value of x.

Let us now add some noise (about an order of magnitude smaller than the x-dependent term) to the y variable, based on a sine function:
y= 0.5 + 0.1 * x + 0.01*sin(2*pi*x)



...and the correlation is...

0.9757339 with p-value almost zero (< 10^(-15))
The correlation is clear, despite the (limited) noise.
Let's now increase the frequency of the sine function by 10-fold:

y= 0.5 + 0.1 * x + 0.01*sin(20*pi*x)

The correlation is now 0.9704897 with p-value almost zero (< 10^(-15)).
Thus, even though the changes are more abrupt, our test barely registers any change. It is still certain of the linear correlation.
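As a quick sanity check, the two frequency settings can be compared side by side (a sketch reusing the x defined above; the variable names are mine):
y_slow = 0.5 + 0.1 * x + 0.01 * sin(2*pi*x)    # original sine noise
y_fast = 0.5 + 0.1 * x + 0.01 * sin(20*pi*x)   # 10-fold frequency
cor(x, y_slow)   # ~0.976, as reported above
cor(x, y_fast)   # ~0.970, as reported above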

Moving to bigger challenges

What happens if we increase the strength of the original noise (the amplitude of the sine function) 10-fold?

y= 0.5 + 0.1 * x + 0.1 * sin(2*pi*x)

The correlation now is
0.555662 with a p-value of almost zero (as above)
Now, this is a problem. One can see (visually in the figure above) that the connection between our predictor (y) and the real value (x) is far from a simple linear relation. But our statistical test still sees a medium to strong correlation.
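For completeness, the same check for the 10-fold amplitude (again, the variable name is mine):
y_big = 0.5 + 0.1 * x + 0.1 * sin(2*pi*x)    # 10-fold amplitude of the sine noise
cor.test(x, y_big, method = 'pearson')       # r drops to ~0.56, p-value still practically zero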

Prediction and correlation

Before wrapping up this playground on Pearson’s r coefficient, I will try to see whether we can really predict x based on y. I will only use the first and the last versions of our formulas to see what we can do.
In the case of the clear, linear correlation:
y = 0.5 + 0.1*x <=> x = (y – 0.5) / 0.1 <=> x = 10y – 5
Indeed, a linear regression model based on our samples only (i.e. without exposing the true formula) finds this exact relation between x and y.
We use the R command
myModel <- lm(x~y)
and get
Coefficients:
(Intercept)            y
         -5           10
It is now quite clear that given a value of our prediction y, we can estimate the true value of x.
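To turn the fitted model into an actual prediction, we can feed it a new y value (a sketch; the value 0.55 is just an arbitrary example of mine):
newData <- data.frame(y = 0.55)   # a hypothetical new estimate
predict(myModel, newData)         # returns ~0.5, matching x = 10*0.55 - 5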
Nice! Let us now try to do the same for the last case.

y= 0.5 + 0.1 * x + 0.1 * sin(2*pi*x) (I will not try to solve this here...)


The figure above illustrates the true relation (curve) and the estimated relation (straight line).
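The regression behind this figure can be sketched as follows (plot styling is my own assumption, not the original):
yNoisy = 0.5 + 0.1 * x + 0.1 * sin(2*pi*x)
noisyModel <- lm(x ~ yNoisy)
plot(yNoisy, x, pch = 20, cex = 0.3)         # the (estimate, real value) pairs
abline(noisyModel, col = 'red', lwd = 2)     # the estimated linear relation (straight line)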

For the skeptics, I will remove the dependency on x in the sine term, adding a truly orthogonal random variable between 0 and 1 (uniform distribution):
x = seq(-0.3, 0.7, by = 1/800)
y = 0.5 + 0.1 * x + 0.1 * sin(5*pi*runif(length(x)))
The correlation is now
0.3756015 with a p-value almost at zero.

Still a medium correlation. The figure we get by repeating the regression process is the following. Let me stress again that the red line is the connection between y (i.e. what we call our predictor) and the real value (x). What the illustration shows is that, essentially, even though we have a clearly statistically significant medium correlation, building a linear prediction model is clearly not effective or helpful.
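For those who want to reproduce this last computation, a sketch follows (the random seed is an assumption of mine, added only for reproducibility, so the exact numbers will differ slightly from the ones reported above):
set.seed(1)   # assumed seed, for reproducibility
yRand = 0.5 + 0.1 * x + 0.1 * sin(5*pi*runif(length(x)))
cor.test(x, yRand, method = 'pearson')   # a medium, yet "significant", correlation
summary(lm(x ~ yRand))$r.squared         # how little of x the linear fit actually explains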

Finale – Making the worst out of correlation


I will provide a last, hand-crafted example of how linear correlation through Pearson’s product-moment correlation coefficient can be misleading. Imagine a case where the correlation is 0.9183222 with a p-value of essentially zero. Now let us depict this correlation and the estimated linear model line (strong red line) in the figure below:


To generate the data use:

y= 100 * (0.5 + 0.1 * (2 + sin(16*pi*seq(-3, 7, by = 1/800))) * seq(-0.6, 1.4, by = 2/8000))
x = 100 * (seq(-0.14, 0.67, by = -(-0.14 - 0.67)/8000))
NOTE: Don't ask how I ended up here: I was simply playing with various parameters and combinations for the fluctuation of the y values. This was the last state of my playground, at which point the decision to write a blog post was made, and I had not kept all the intermediate steps.
Essentially, this data offers values of x between -14 and 67 (i.e. 100 times the range -0.14 to 0.67), over roughly 8000 equidistant samples.
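With the data above, the reported correlation and the red regression line of the figure can be recomputed like this (plot styling is assumed):
cor.test(x, y, method = 'pearson')       # should reproduce the ~0.92 reported above
handModel <- lm(x ~ y)
plot(y, x, pch = 20, cex = 0.2)
abline(handModel, col = 'red', lwd = 2)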
What one can see is that the predictive value of y regarding x is very good at specific points (i.e. where the variance along the y axis becomes minimal), while in other places it becomes almost useless (e.g. the right extreme of the figure, where very different values of y map to very similar values of x). Thus, even though there exists a possible underlying linear correlation between x and y, the 0.91 value of our statistic and the supportive p-value could easily fool us into thinking we had an excellent case. Things become worse if one chooses a subset of the points, e.g. the ones with x values over 50, as shown below:

In this (targeted) case the correlation calculation returns a correlation of 0.2461371 with an almost zero p-value. Thus, on a subset of the original dataset we get a completely different picture of the correlation. And since we rarely have all the data existing in the world, our idea of the correlation between two quantities can be heavily biased by the subset we use to measure it.
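The subset computation is a one-liner (sketch; the variable name keep is mine):
keep <- x > 50                                    # keep only the points with x values over 50
cor.test(x[keep], y[keep], method = 'pearson')    # drops to ~0.25, p-value still practically zero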
What is worse is that, in this case, the predictive value of y with respect to x is almost zero (i.e. we cannot really estimate x based on y, since y fluctuates strongly with only minor corresponding changes in x).

Monday, October 31, 2016

Required qualities of scientific writing

In this first post, I want to visit, in a non-technical manner, some qualities of scientific works, which highlight potential caveats in research. Keep in mind that from now on, wherever I feel a reference/citation makes sense in the text, I either use a link or a number in brackets (e.g. [1]), which indicates a numbered reference you can find at the end of the text (similar to an endnote).
 
In my years as a researcher, I have been called upon to perform experiments, support or reject hypotheses and review my work, as well as the work of others. Trying to decipher what is acceptable in science and as science, I pursued a first understanding of how things work by reading published articles. My early understanding was that a piece of scientific writing, describing some research, should have the following qualities (non-exhaustively):
  1. it should be well motivated (i.e. should try to solve a meaningful problem or add to a clear theoretic question/line of thought);
  2. it should be clear (i.e. understandable without too much effort);
  3. it should be concise (i.e. not bloated with unneeded information or - to the other extreme - lacking significant information);
  4. it should be self-sufficient (i.e. introducing the terms used, as well as the context of the problem, in an adequate manner, such that minimal information beyond the text is required);
  5. it should be innovative and correctly positioned with respect to the related work (i.e. showing what is the missing piece of knowledge it offers the world);
  6. it should be as unbiased as possible with respect to the experimental setup and findings (whether the latter be positive or negative);
  7. it should be useful and reusable, in that it should offer insights about things we do not fully know, allowing reuse of the findings in further pursuit of scientific (or applied) goals;
  8. the work should provide enough information on experiments to make them repeatable.
Unfortunately, a number of the (mind you: published) papers I read exhibited only some of the above qualities. Thus, I started to understand that not all is well in science.

My second source of scientific-method know-how was other scientists. During my PhD I received, or simply heard about, a number of pieces of "advice" on how to perform research, essentially promoting practices that oppose the above qualities. I provide a few examples below:
  • Add complex math formulas to your presentations/articles: people like to see things they do not understand. They feel your work is worth more. Opposing qualities: clarity, conciseness
  • A dissertation should be at least/at most X pages. Opposing qualities: conciseness / self-sufficiency
  • Has anyone else tried using method M for your setting? Everyone uses method M these days! Opposing qualities: motivation, innovation, unbiased approach.
  • It is no big deal if you hack the numbers a bit. No one will notice. Opposing qualities: unbiased approach, usefulness and reusability.
  • Make your own data for testing and see how well you are doing there. This is enough, as long as you get a nice p-value in the experiments. Opposing qualities: innovation and correct positioning, unbiased approach, usefulness and reusability, repeatability.
  • Do not refer to any failures of the method; just show the strong points. Opposing qualities: usefulness and reusability, correct positioning, unbiased approach. 
Are the above pieces of advice meaningful? Do they help? Let us see what I found out myself during my practice as a researcher/reviewer/supervisor/professor, with respect to the above "advice" and how other scientists react to those who follow it:
  • As a reviewer, if I find complex formulas (or any other unclear statement, no matter how scientific-looking) in an article without textual/intuitive support, I comment negatively and reduce the "clarity" grade.
  • Students who write long dissertations without good reason are asked to rewrite the text. And that can hurt a lot...
  • When a student or a paper I review starts with "others used method M in other settings, so we use it in our setting", I clearly state that the positioning and the motivation of the paper are problematic and (guess what!) I reduce its "technical quality" grade.
    NOTE: I will try to offer some insight regarding scientific "hype" in another post. This hype is a very common cause of such badly motivated works.
  • When I submitted my first scientific journal paper [1], one of the reviewers actually repeated ALL my experiments and contacted me to validate the findings. Thankfully, my method was simply good, so the findings were true and validated. In other words, I found out early enough that being honest is the only thing that makes sense. If you lie, sooner or later you will be found out and, probably, humiliated, no matter the scientific status you may have attained (cf. here and here for cases of false evidence and their outcomes).
  • Working on one set of data, made by yourself, is usually not enough to get a work accepted (in most established conferences and journals). Even if your work does get accepted, the lack of reusable data minimizes the impact of your work (i.e. few people will actually cite it). It is much better practice to simply put effort into creating a sharable dataset and get it out there for use by others. This is one of the best ways to see your work being reused and cited.
  • Do not count on p-values too much: they have been heavily debated and criticized lately [2,3]. This is due to a simple fact: the p-value is usually not what we expect it to be.
    NOTE: I will try to cover the "reproducibility crisis" and the problems of statistical significance in later posts.
  • Omitting the downsides of a method allows others to criticize your work as insufficient. Such debates have started in the past, ending in no good outcomes. Once again, see the identification of downsides as a means to propose a next publication (a.k.a. future work).
Based on the above, it was quite clear from my early research steps that "building on the shoulders of giants" may not be enough, if the giants have feet of clay. I will start with a strong claim (such claims are to be avoided in scientific writings):
Science is simply a guess of what reality is about. It also appears to be the best one we have, when it comes to measurable and observable phenomena.

This is what makes science amazingly useful. This is also why we need to be ready to surpass every claim science makes, to reach further towards the truth, when new evidence surfaces to open new ways and indicate new challenges.


References:
[1] Giannakopoulos, George, et al. "Summarization system evaluation revisited: N-gram graphs." ACM Transactions on Speech and Language Processing (TSLP) 5.3 (2008): 5.
[2] Wasserstein, Ronald L., and Nicole A. Lazar. "The ASA's statement on p-values: context, process, and purpose." The American Statistician (2016). 
[3] Halsey, Lewis G., et al. "The fickle P value generates irreproducible results." Nature methods 12.3 (2015): 179-185.

Monday, October 10, 2016

Back to the basics

I had to put my research questions and thoughts under one roof to, hopefully, reach out to students and researchers alike. My aim is to provide intuition related to things I have studied, revisited and matured over, and to share my fears, hopes, faults and successes related to the workings of science.

The main topics of this blog-roof are (more or less):
  • science and scientific method
  • data, as in data mining, data analysis and learning from data.
  • intelligence, as in artificial intelligence and intelligent systems.
Above all these, I aim to provide a critical view of phenomena in the above domains, as a human being, i.e. the only nature I am aware of possessing.

Thus, in the next posts, I plan to provide some thoughts on existing practices, methods, but also on future steps and challenges.

As always, I rely on the good-willed reader for comments and feedback, with the main requirement I have from myself, my colleagues and my students:
Every time you claim you do not agree with something, provide a good-willed suggestion for improvement. 
Otherwise the critique is useless.

In this collaborative spirit, let the knowledge sharing begin!