Assessment Methodology

Posted on 2015-01-10 by Admin

This is a high-level overview of the procedure we use to assess predictions that have associated market data, as well as the people (or other sources) that made them.

Prediction Assessments

Because prediction timeframes are generally not known exactly, the assessment of a prediction is actually an aggregation of assessments over a range of times, with weightings described by a trapezoidal time kernel function. Time kernel parameters are based on available timing information, typically an interpretation of the language used in a quote. This approach minimises sensitivity to an arbitrary cut-off date.
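
As a rough illustration of the weighting idea, here is a minimal sketch of a trapezoidal kernel and a kernel-weighted aggregate. The function names, the four kernel parameters (start, full_from, full_to, end, all in days since the prediction date) and the use of a simple weighted mean are assumptions made for the example, not a description of the production code.

```python
def trapezoid_weight(t, start, full_from, full_to, end):
    """Weight of day t under a trapezoidal kernel (0 outside (start, end))."""
    if t <= start or t >= end:
        return 0.0
    if t < full_from:
        return (t - start) / (full_from - start)   # rising edge
    if t <= full_to:
        return 1.0                                  # flat top
    return (end - t) / (end - full_to)              # falling edge


def aggregate(daily_assessments, start, full_from, full_to, end):
    """Kernel-weighted average of per-day assessment values.

    daily_assessments maps a day offset (days since the prediction date)
    to the assessment value computed for that day.
    """
    weights = {t: trapezoid_weight(t, start, full_from, full_to, end)
               for t in daily_assessments}
    total = sum(weights.values())
    if total == 0:
        raise ValueError("no assessed day falls inside the kernel")
    return sum(weights[t] * v for t, v in daily_assessments.items()) / total


# Day 5 sits halfway up the rising edge (weight 0.5), day 15 on the flat top.
print(aggregate({5: 1.0, 15: 4.0}, start=0, full_from=10, full_to=20, end=30))  # 3.0
```

Because the weights ramp up and down gradually, moving the kernel edges by a few days changes the aggregate only slightly, which is the point of avoiding a hard cut-off.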

For a given day, the assessment of a prediction makes use of a probability distribution of the associated topic prices (and comparison topic prices where there is a comparison topic) taken from a statistical model of market movements that is constructed using only knowledge available at the prediction stated date.

During an assessment, we calculate a number of values, the most important being the Risk Adjusted Annualized Excess Return (RAAER) and the Standardized Score.

RAAER is the annualised return of a market position corresponding to the prediction, relative to the average return expected by the market model (which is currently an estimate of the risk free rate for all topics) and scaled according to the topic variance. Maintaining a given level of excess return over a longer time is more impressive (less likely to happen by chance alone) than the same annualized return over a shorter time, so for short term or partially assessed predictions, some very large RAAERs will occur by chance.
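
To make the definition concrete, here is a simplified sketch of a RAAER-like quantity. It assumes log returns, a constant risk-free rate and scaling by the annualized standard deviation of the topic; the exact scaling and the market model are not spelled out in this overview, so the function below and its parameters (including the direction flag for bearish predictions) are illustrative only.

```python
import math


def raaer(price_start, price_end, years, risk_free_rate,
          annual_volatility, direction=1):
    """Illustrative risk-adjusted annualized excess return.

    direction is +1 for a bullish prediction and -1 for a bearish one
    (an assumption made for this sketch).
    """
    if years <= 0 or annual_volatility <= 0:
        raise ValueError("years and annual_volatility must be positive")
    annualized = math.log(price_end / price_start) / years
    excess = direction * (annualized - risk_free_rate)
    return excess / annual_volatility


# The same 5% price move looks far more dramatic over one month than one year.
one_month = raaer(100.0, 105.0, years=1 / 12, risk_free_rate=0.02, annual_volatility=0.2)
one_year = raaer(100.0, 105.0, years=1.0, risk_free_rate=0.02, annual_volatility=0.2)
print(one_month, one_year)
```

The comparison at the end shows why short or partially assessed predictions can produce very large values by chance: a modest price move is multiplied up when it is annualized over a short period.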

The standardized score is effectively the RAAER adjusted to account for this time effect, and is designed so that a random collection of prediction scores will have a standard deviation of 1 and a mean of 0.
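
One simple way to get this behaviour, shown here only as a sketch, is to multiply the RAAER by the square root of the assessment period in years. Under a plain random-walk assumption (which is much cruder than the market model described above), the annualized excess return over T years has a standard deviation proportional to 1/sqrt(T), so the rescaled value has unit standard deviation. The function names and the Gaussian simulation are assumptions made for the example.

```python
import math
import random
import statistics


def standardized_score(raaer, years):
    """Rescale a RAAER so that chance results have unit spread (random-walk assumption)."""
    return raaer * math.sqrt(years)


def simulate_scores(years, n=5000, annual_volatility=0.2, seed=0):
    """Scores of n skill-free predictions, each assessed over the same period."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n):
        # Excess log return over the period: mean 0, std = volatility * sqrt(years).
        excess = rng.gauss(0.0, annual_volatility * math.sqrt(years))
        raaer = (excess / years) / annual_volatility
        scores.append(standardized_score(raaer, years))
    return scores


scores = simulate_scores(years=0.25)
print(statistics.fmean(scores), statistics.stdev(scores))  # close to 0 and 1
```

Real market returns are heavy-tailed, so a Gaussian simulation like this understates how often large scores appear.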

RAAER is the best estimate of the underlying under- or out-performance due to skill of the source, while the standardized score is a measure of the statistical significance of the result. Note that the distribution of standardized scores is not normal - the heavy-tailed distribution of market returns (on all time scales) frequently results in some relatively large standardized scores.

Some topics have a benchmark index defined (for example, the benchmark currently associated with all US stocks is the S&P 500 Total Return). For predictions where this is the case, we assess the prediction both with and without the benchmark index set as the comparison topic.

To assess comparison predictions we replace the primary topic by a composite formed by subtracting a multiple of the comparison topic returns from the primary topic returns. For benchmarking, this multiple is calculated using historical variance and correlation and chosen so that the composite is theoretically uncorrelated to the comparison topic. Thus benchmarked assessments represent the return with exposure to the broader market removed as far as possible.
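
The benchmarking multiple can be illustrated with a short sketch. Choosing the multiple as correlation times the ratio of standard deviations (equivalently, covariance divided by the comparison variance) makes the composite uncorrelated with the comparison topic over the sample used to estimate it. The helper names and the toy return series are made up for the example.

```python
import statistics


def covariance(xs, ys):
    """Sample covariance of two equally long return series."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)


def hedge_multiple(primary, comparison):
    """corr * sd(primary) / sd(comparison), which simplifies to cov / var."""
    return covariance(primary, comparison) / statistics.variance(comparison)


def composite_returns(primary, comparison, multiple):
    """Primary returns with a multiple of the comparison returns subtracted."""
    return [p - multiple * c for p, c in zip(primary, comparison)]


# Toy historical returns (hypothetical numbers).
stock = [0.02, -0.01, 0.03, 0.00, -0.02]
index = [0.01, -0.01, 0.02, 0.005, -0.015]

multiple = hedge_multiple(stock, index)
composite = composite_returns(stock, index, multiple)
print(multiple, covariance(composite, index))  # covariance is zero up to rounding
```

In an assessment the multiple would be estimated from history before the prediction date and applied to later returns, so the composite is only theoretically uncorrelated with the comparison topic, not exactly so.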

Source Assessments

We produce a number of assessments for each person (or other source) based on various subsets of their assessed predictions. The most comprehensive of these (referred to as "Mixed") is derived from the set of all benchmarked prediction assessments, where available, and absolute prediction assessments where not. Other assessments include "Benchmarked" (the set of all benchmarked assessments only), "Absolute" (the set of all absolute assessments only), "Market" (the set of assessments of predictions about broad based US market indices) and "< 1yr" (the set of assessments of predictions that have a representative timeframe less than one year).

One of the components of a person assessment is the person RAAER - a weighted average of the RAAERs of their assessed predictions. The person RAAER is a best estimate of the excess return that might be achieved by following the person's advice (assuming a person's ability is constant through time and continues into the future). We also display this value multiplied by the standard deviation of the yearly S&P 500 returns, making the two directly comparable. Unfortunately, the person RAAER is usually very unreliable (has a large associated error), and so in practice it is not very useful.
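
A minimal sketch of the weighted average and the rescaling follows. The weights are left as an input because this overview does not say how they are chosen, and the S&P 500 standard deviation used in the example is a placeholder, not a measured value.

```python
def person_raaer(raaers, weights):
    """Weighted average of prediction RAAERs (the weights are an input here)."""
    if len(raaers) != len(weights):
        raise ValueError("raaers and weights must have the same length")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(w * r for w, r in zip(weights, raaers)) / total


def as_sp500_equivalent(person_raaer_value, sp500_annual_stdev):
    """Express a risk-adjusted value as a return at S&P 500 volatility."""
    return person_raaer_value * sp500_annual_stdev


value = person_raaer([0.5, -0.1, 0.2], weights=[2.0, 1.0, 1.0])
# 0.2 is a placeholder for the standard deviation of yearly S&P 500 returns.
print(value, as_sp500_equivalent(value, sp500_annual_stdev=0.2))
```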

As part of a person assessment we also produce a p-value, which is an estimate of the probability that a person RAAER at least this extreme could have been obtained by chance alone (no predictive skill). People who are more likely to possess skill (positive or negative) will have smaller p-values. p-values below 0.05 (5%) are usually considered to be "statistically significant". We do see a number of people with p-values close to or below the statistical significance level, suggesting there is some value in looking at this number (and noting whether the RAAER is positive or negative).
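
The idea of a chance-only p-value can be sketched with a small Monte Carlo simulation. The version below is deliberately simplified: it treats prediction scores as independent, uses an unweighted mean as the test statistic, and draws skill-free scores from a Student t distribution (4 degrees of freedom, rescaled to unit variance) as a stand-in for a heavy-tailed null. All of these choices are assumptions made for the example.

```python
import math
import random


def heavy_tailed_score(rng, df=4):
    """One skill-free score: Student t with df > 2, rescaled to unit variance."""
    z = rng.gauss(0.0, 1.0)
    chi2 = rng.gammavariate(df / 2.0, 2.0)   # chi-square with df degrees of freedom
    t = z / math.sqrt(chi2 / df)
    return t * math.sqrt((df - 2) / df)


def p_value(scores, n_sims=10000, df=4, seed=0):
    """Two-sided Monte Carlo p-value for the mean of the observed scores."""
    rng = random.Random(seed)
    n = len(scores)
    observed = abs(sum(scores) / n)
    hits = 0
    for _ in range(n_sims):
        simulated = sum(heavy_tailed_score(rng, df) for _ in range(n)) / n
        if abs(simulated) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_sims + 1)


print(p_value([0.8, 1.1, -0.2, 0.9, 1.4, 0.3], n_sims=2000))
```

The test is two-sided because skill can be positive or negative. Ignoring correlation between predictions, as this sketch does, makes the p-value look smaller than it should.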

There are a number of factors that we account for which complicate calculation of the person RAAER and p-values considerably. The most important is probably correlation between prediction RAAERs - the results of multiple predictions about the same or similar topics over overlapping time periods convey (often much) less information than independent results. Another very important consideration is the heavy-tailed distribution of standardised scores. There are also a number of other less important factors that we have incorporated into our methodology.
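
A standard way to see how much information correlation destroys is the effective sample size of n equally correlated results, n / (1 + (n - 1) * rho). The sketch below assumes every pair of prediction results shares the same correlation rho, which is a simplification chosen for illustration rather than the model we use.

```python
def effective_sample_size(n, rho):
    """Number of independent results that n equicorrelated results are worth."""
    if n < 1:
        raise ValueError("n must be at least 1")
    if not 0.0 <= rho <= 1.0:
        raise ValueError("rho must be between 0 and 1")
    return n / (1.0 + (n - 1) * rho)


# Twenty overlapping predictions on similar topics with pairwise correlation 0.5
# carry about as much information as two independent ones.
print(effective_sample_size(20, 0.5))  # about 1.90
```

As n grows, the effective sample size approaches 1 / rho, so piling up more predictions about the same topic over the same period quickly stops adding information.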

In addition to the RAAER and p-value, we produce a set of simpler metrics. The Annualized Excess Return (AER) is an annualized return corrected for the risk-free rate of return but not historical volatility, so it represents the return in excess of the risk-free rate obtained by holding the recommended assets without leverage. The BM AER is the equivalent annualized return generated by holding the primary benchmark (S&P 500) instead of the recommended assets in the same amount over the same period, while the Excess AER is the difference. Excess AER can be interpreted as the return that would have been achieved by following a person's advice, relative to simply investing in the S&P 500, without making any attempt to correct for differences in risk.
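
The arithmetic behind these simpler metrics can be sketched as follows. Geometric annualization and a constant risk-free rate are assumptions made for the example, and the function names are illustrative.

```python
def annualized_return(total_return, years):
    """Geometric annualization of a total return earned over a number of years."""
    return (1.0 + total_return) ** (1.0 / years) - 1.0


def aer(total_return, years, risk_free_rate):
    """Annualized return in excess of the risk-free rate, with no volatility scaling."""
    return annualized_return(total_return, years) - risk_free_rate


def excess_aer(asset_total_return, benchmark_total_return, years, risk_free_rate):
    """AER of the recommended assets minus the AER of the benchmark (BM AER)."""
    asset_aer = aer(asset_total_return, years, risk_free_rate)
    bm_aer = aer(benchmark_total_return, years, risk_free_rate)
    return asset_aer - bm_aer


# A 21% gain over two years against a 10.25% gain in the benchmark:
# roughly 10% a year versus 5% a year, so an Excess AER of about 0.05.
print(excess_aer(0.21, 0.1025, years=2.0, risk_free_rate=0.02))
```

Note that the risk-free rate cancels in the difference, so under these assumptions the Excess AER depends only on the two annualized returns.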


The person assessment results together with all predictions that are currently open or partially open are used to produce a statistical market model that we hope has some predictive power.

We start with an instance of the market model identical to that which would be used to assess predictions with a stated date equal to the current time. This model captures the nature of market volatility quite well, but has no exploitable predictive power. We then update this model (by way of a Monte Carlo simulation) so that the assessments of all open predictions are as consistent as possible with all person assessments.

A key function of the forecasting algorithm is to deal with conflict in the best way possible - not every source with a high positive score will be in agreement about a particular topic, nor will they always disagree with sources with negative scores. We are effectively trying to find topics that have the least controversy amongst sources with the highest significant scores (either positive or negative).

There are a number of factors that make generating a reliable forecast very difficult. The most important of these is correlation between the person assessments. One reason for the correlation is that different people might work from the same sort of analysis or model. It could be a good approach, but when it fails everyone using it is wrong at the same time. A more pernicious effect is that people are swayed by what others are saying, which at the extreme can lead to situations of group-think where an incorrect consensus emerges, often seen before a market crash. By modelling at least some correlation between all sources, we retain a residual level of doubt in the face of overwhelming agreement.

Note: Forecaster is still experimental and we don't expose its output yet. It appears to be working quite well, but requires more testing and more prediction data before we are prepared to declare the results useful.


Welcome!

Posted on 2014-12-01 by Matt

Backrecord is a tool for collecting and analyzing the opinions of people in the public spotlight, currently with a strong focus on economics and finance. We've been working on this on and off for about four years now (it's turned out to be a much bigger challenge than we ever imagined!) and today we are ready to go live with a public beta.

One of the key things we wanted to do with backrecord was bring some level of accountability to publicly stated opinion - if we can define a good measure of the quality of someone's previous opinions, this should be useful in forming an opinion about whether their current ideas are likely to be any good or not.

Our initial focus has been to assess predictions[1] about topics that have corresponding securities in publicly traded markets. One benefit of focusing on this subset of the problem is that it allows us to develop a very rigorous assessment methodology. A disadvantage is that such a rigorous approach is necessary - in practice, public markets are very information efficient and the value of most opinion about them is at best small compared to their price volatility.

But there is evidence to suggest that some people possess some skill in predicting the likelihood of future movements of some markets[2], which brings us to the first goal of backrecord - to what extent is it possible to differentiate luck from skill in market prediction results (and can this be automated)? We are hopeful that a combination of carefully formulated statistical methods and attention to detail in data collection will allow us to extract some signal from the noise that is significant enough to be useful - if not about individuals then maybe in aggregate. As I write, we do not have a strong opinion about whether we are going to succeed at this or not (more data is required).

The second goal of backrecord is really to further our own understanding of economic systems (as well as to help others who may be similarly inclined). We enjoy thinking about macroeconomics and our starting point is often the thoughts of others. We wanted a tool to bring all of these thoughts together in a high information density, easily navigable form.

Finally, humans have a strong tendency to believe a good narrative, confidently delivered by someone with authority. Unfortunately in finance, this confidence is often misplaced. Our third goal is simply to make this plain to see. When you are tempted to be highly influenced by someone's opinion, have a look on backrecord and see how similar statements have turned out in the past.

So that's backrecord. We hope you find it interesting and useful.

If you would like to learn more, we suggest reading the blog post "Assessment Methodology".

[1] By prediction we mean any statement that can be construed as an opinion about the likelihood of something happening in the future. This includes, for example, stock recommendations, which can generally be interpreted as an opinion about the likelihood of a particular stock under- or outperforming a market index over some time period ahead.

[2] The paper "Luck Versus Skill in the Cross Section of Mutual Fund Returns" by Fama and French is particularly good ... and sets realistic expectations.

Before It Began

Posted on 2010-09-13 by Matt

I thought it would be interesting to include myself in the experiment. The web site is not going to be launched for a number of months, but it will be better if I start going on record now...

Over the next 5 years, residential real estate in Australia will prove to be a poor investment. I think the median house price will decline in real terms and will be no higher than it is today in nominal terms. There is also a non-negligible chance of a sharp correction in house prices over this time frame. Same opinion for Vancouver - house prices there are also too high.

Blog Post Archive

2015-01-10 Assessment Methodology
2014-12-01 Welcome!
2010-09-13 Before It Began