In Part I of The Jellybean Trilogy, we looked at the statistical science behind why most new medical research could well mean nothing before, in Part II, extending that line of thinking to the realm of financial research. In this final part of our trilogy, we will now raise a large question mark over an aspect of research that is particularly popular in financial circles – the practice of back-testing.
You will no doubt be aware of the idea of the roomful of monkeys all bashing away at typewriters with a view to reproducing the complete works of Shakespeare. Well, imagine for a moment that, across the corridor, there is another roomful of monkeys – only this lot are randomly flicking switches up and down. If a switch is flipped up, you buy a stock; if a switch is flipped down, you sell a stock.
Given enough time and switch-flipping, a number of apparently successful simian traders will eventually emerge, but the truth is that these are just lucky monkeys. Back in the real world, modern computing power means that financial researchers can today have the equivalent of quadrillions of monkeys all flipping switches simultaneously and, again, some success stories will emerge.
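The lucky-monkey effect is easy to reproduce. Here is a minimal sketch in Python – the figures (10,000 monkeys, 252 daily flips, one point per correct call) are purely illustrative assumptions, not taken from any study:

```python
import random

N_MONKEYS = 10_000   # number of random "strategies" (illustrative figure)
N_PERIODS = 252      # roughly one year of daily switch-flips

def monkey_pnl(rng):
    """Cumulative score of one monkey: each flip is a coin toss,
    +1 when the random call matches the market, -1 when it doesn't."""
    return sum(rng.choice((-1, 1)) for _ in range(N_PERIODS))

rng = random.Random(0)
results = [monkey_pnl(rng) for _ in range(N_MONKEYS)]

best = max(results)
print(f"Average monkey score: {sum(results) / N_MONKEYS:+.1f}")
print(f"Best monkey score:    {best:+d} over {N_PERIODS} flips")
```

The average monkey scores around zero, as you would expect of pure chance – yet the best of the 10,000 typically calls the market right well over 60% of the time, a track record that would look like genuine skill if you only ever saw that one monkey.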
So how can the researchers then set about assessing which of these success stories could continue to work going forwards and which are just spurious accidents of historical data – the lucky monkeys, as it were? One possibility is back-testing – essentially the historical simulation of an algorithmic investment strategy – but the wrinkle here is that this can be carried out in two ways.
Say our research has thrown up the stunning revelation that a portfolio of companies whose CEOs are Sagittarians with a 36-inch inside-leg measurement significantly outperforms the wider market. We could then back-test the idea either ‘in sample’ or ‘out of sample’. With the former, we would look at the performance of a data series over, say, the last 20 years up to today to see if the strategy worked or not.
The problem is that there are now so many different data series around, and so much computing power at researchers’ fingertips, that even if in-sample back-testing knocks down our ‘Tall Sagittarian’ thesis, the chances are something else can be shown to ‘work’ – perhaps a portfolio of companies beginning with a ‘T’ that have at least three bald directors on the board.
As David Bailey and his co-authors observe in their 2014 paper, Pseudo-Mathematics and Financial Charlatanism: The Effects of Backtest Overfitting on Out-of-Sample Performance: “The number of possible model configurations (or trials) is enormous, and naturally the researcher would like to select the one that maximises the performance of the strategy.”
Thus, in a bid to discover “the optimal specification of an investment strategy” – more bluntly, to persuade investors to part with their cash – the researchers can, if so minded, use many different variables, from the size and frequency of trades to risk measures, stop-losses and so on, to tailor their findings and achieve a very impressive result on the full set of historical data.
Now, a better approach to back-testing would be to split our 20 years of data into two parts – using the first 15 years’ worth, say, to come up with what we think could be a successful trading strategy and then using the final five-year period to test whether or not the strategy works. This is the ‘out of sample’ method of back-testing and it is apparently rarely to be found in academic research.
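The split described above can be sketched in a few lines of Python. This is a hypothetical illustration, not a reproduction of any published study: we generate 5,000 entirely skill-less strategies (monthly returns that are pure noise, with an assumed 4% monthly volatility), pick the one that looks best over the first 15 years, and then see how it fares over the final five:

```python
import random

rng = random.Random(1)

N_STRATEGIES = 5_000                       # hypothetical number of trials
IN_SAMPLE, OUT_SAMPLE = 15 * 12, 5 * 12    # months: 15 years train, 5 years test

def random_track_record(n_months):
    """Monthly returns of a skill-less strategy: pure noise."""
    return [rng.gauss(0.0, 0.04) for _ in range(n_months)]

tracks = [random_track_record(IN_SAMPLE + OUT_SAMPLE) for _ in range(N_STRATEGIES)]

def total(returns):
    """Cumulative return (simple sum, for illustration)."""
    return sum(returns)

# Pick the strategy that looks best on the first 15 years only...
best = max(tracks, key=lambda t: total(t[:IN_SAMPLE]))

# ...then check it against the five years it has never seen.
print(f"Best strategy, in sample:     {total(best[:IN_SAMPLE]):+.2f}")
print(f"Same strategy, out of sample: {total(best[IN_SAMPLE:]):+.2f}")
```

The winner of the in-sample beauty contest typically boasts a striking 15-year record – and then, out of sample, performs exactly as a coin-flipper should: around zero.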
“We would feel sufficiently rewarded in our efforts if this paper succeeds in drawing the attention of the mathematical community to the widespread proliferation of journal publications, many of them claiming profitable investment strategies on the sole basis of in-sample performance,” observe Bailey and co.
“This is understandable in business circles, but a higher standard is and should be expected from an academic forum.” Still, if researchers continue to insist on back-testing their entire series of data ‘in-sample’, an important question for anyone thinking about relying on any conclusions that result is, how long ought the series to be?
According to Bailey and co, the minimum back-test period depends on the number of trials conducted. Thus, as you can see from the chart below, research involving more than 1,000 trials should have at least 10 years of history backing it up – which chimes neatly enough with Warren Buffett’s view that you need at least a decade of history before you can differentiate between a good investor and a lucky one.
Source: Bailey, 2014
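The relationship in the chart can be sketched from the approximation in Bailey and co’s paper, as we read it: the expected maximum Sharpe ratio among N skill-less trials grows with the logarithm of N, and the minimum back-test length in years is roughly its square (assuming, as the paper does, an annualised target Sharpe ratio of 1). A Python sketch using only the standard library:

```python
from math import e
from statistics import NormalDist

EULER_GAMMA = 0.5772156649  # Euler–Mascheroni constant

def min_backtest_length(n_trials):
    """Approximate minimum back-test length in years, per Bailey et al.
    (2014): the expected maximum Sharpe ratio among N skill-less trials is
    roughly e_max = (1-γ)·Φ⁻¹(1-1/N) + γ·Φ⁻¹(1-1/(N·e)); a back-test
    shorter than e_max² years cannot separate the best of N lucky monkeys
    from a genuinely skilled strategy with an annualised Sharpe ratio of 1."""
    z = NormalDist().inv_cdf  # inverse standard normal CDF, Φ⁻¹
    e_max = ((1 - EULER_GAMMA) * z(1 - 1 / n_trials)
             + EULER_GAMMA * z(1 - 1 / (n_trials * e)))
    return e_max ** 2

for n in (10, 100, 1_000, 10_000):
    print(f"{n:>6} trials -> at least {min_backtest_length(n):4.1f} years of data")
```

For 1,000 trials the formula gives a little over 10 years, which is the figure quoted above – and note that the requirement keeps growing as the number of trials does.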
The computational power of modern technology means that nowadays most studies conduct well in excess of 1,000 trials and yet, far from using at least a decade of data, most back-tests will use two or three years’ worth. As such, whenever you are faced with an investment strategy underpinned by back-tested data, ask yourself how many trials were conducted before the researchers came up with the one lucky monkey you are now being sold.
Read The Jellybean Trilogy (Part I) here. Read The Jellybean Trilogy (Part II) here.