Cut by Occam's Razor

Occam's razor is a popular scientific rule of thumb. If you are faced with several theories and can't choose between them, use the simplest one. It's a fancy version of K.I.S.S.

I love that advice and always try to err on the side of parsimony. But in my World Cup postmortem, I crossed the line from simple to stupid. The scoring I used to evaluate the forecasts of World Cup outcomes was flawed.

I measured the error of a forecast as the absolute value of the outcome (100% if the team advanced, 0% if they didn't) minus the forecasted probability of advancing. If I said a team had a 75% chance of advancing and they did advance, the error was 25%. If they did not advance the error was 75%.

Example                               Case 1    Case 2
Forecast Advance Probability (A)      90%       100%
True Advance Probability (B)          90%       90%
Error if Advance (C = 1-A)            10%       0%
True Fail Probability (D = 1-B)       10%       10%
Error if Fail (E = A)                 90%       100%
Expected Error (F = B×C + D×E)        18%       10%

An example helps illustrate why that is a bad measure. Say that the truth is that a particular team would advance 90% of the time. Then 90% is the best possible guess and should have the lowest expected error.

If I guess 90%, my absolute error will be 10% in the 90% of cases where the team advances and 90% in the 10% of cases where it doesn't. The expected error is 0.9 × 10% + 0.1 × 90% = 18%.

A guess of 100% is wrong, so the expected error should be bigger. If I guess 100%, my error will be 0% in the 90% of cases where the team advances and 100% in the 10% of cases where it doesn't. The expected error is 0.9 × 0% + 0.1 × 100% = 10%.

So I got a better score (lower error) for a worse guess?! That should not happen.
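The arithmetic is easy to check in a few lines of Python. This is only a sketch of the calculation described above; the function name expected_abs_error is mine, not something from the original postmortem.

```python
def expected_abs_error(true_prob, forecast):
    """Expected absolute error of a probability forecast for a yes/no outcome."""
    error_if_advance = 1.0 - forecast  # outcome counts as 100%
    error_if_fail = forecast           # outcome counts as 0%
    return true_prob * error_if_advance + (1.0 - true_prob) * error_if_fail


# The truth is a 90% chance of advancing in both cases.
honest = expected_abs_error(true_prob=0.9, forecast=0.9)
overconfident = expected_abs_error(true_prob=0.9, forecast=1.0)

print(f"guess 90%:  expected error {honest:.0%}")
print(f"guess 100%: expected error {overconfident:.0%}")
```

The overconfident guess comes out with the lower expected error, which is exactly the problem.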

Improper Behavior

The mean absolute error did not yield the best score when the probability was guessed correctly.

That's a big problem. In terms of decision theory, my scoring rule was not proper. Trying to do something more intuitive and easier to interpret, I mistakenly sacrificed accuracy.

Graphically you can see the problem below. Each chart represents a different version of reality, where a team had a 20%, 50%, and 80% chance of advancing respectively. The blue line shows the expected score for every possible probability forecast according to the difference rule I used in the original article (absolute value of outcome minus forecast).

You can see that when 20% is the real probability, the forecast with the best (highest) expected score was 0%. For the rule to be proper, the expected score should peak at 20%. In the 50% case, all forecasts had the same expected score when it should uniquely peak at 50% to be a proper scoring rule.
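The same behavior can be reproduced without the charts. The sketch below assumes the plotted score is simply 1 minus the absolute error (my guess at how an error becomes a "higher is better" score), and it searches a grid of forecasts in 1% steps. The names expected_abs_score and best_forecast are illustrative.

```python
def expected_abs_score(true_prob, forecast):
    """Expected score under the absolute-difference rule (assumed: 1 - error)."""
    expected_error = true_prob * (1 - forecast) + (1 - true_prob) * forecast
    return 1 - expected_error


# Candidate forecasts: 0%, 1%, ..., 100%.
grid = [i / 100 for i in range(101)]


def best_forecast(true_prob):
    """Forecast on the grid with the highest expected score."""
    return max(grid, key=lambda f: expected_abs_score(true_prob, f))


for p in (0.2, 0.8):
    print(f"truth {p:.0%}: best forecast is {best_forecast(p):.0%}")

# At a true probability of 50% every forecast scores the same,
# so the gap between the best and worst forecast is (numerically) zero.
scores_at_half = [expected_abs_score(0.5, f) for f in grid]
spread_at_half = max(scores_at_half) - min(scores_at_half)
print(f"truth 50%: score spread across forecasts = {spread_at_half:.2e}")
```

Algebraically, the expected error is p + (1 - 2p) × f for true probability p and forecast f. That is a straight line in f, so the best forecast is always an extreme (0% or 100%) unless p is exactly 50%, where the line is flat.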

All Positions

On the same charts I drew a line for the log score in red, a strictly proper scoring rule defined as: Actual×ln(Forecast) + (1-Actual)×ln(1-Forecast). (Nerd note: I exponentiated the log score in the chart so that it takes a value between 0 and 1.)

Notice that the log score peaks in all the right places. A 20% forecast gives the highest expected score if reality is a 20% probability. A 50% forecast gives the highest expected score if reality is a 50% probability. An 80% forecast gives the highest expected score if reality is an 80% probability. That means the rule is proper! Hooray.
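Here is a rough numerical check of that claim, using the log score formula given above. The grid of candidate forecasts (1% to 99%, skipping 0% and 100% where the logarithm blows up) and the helper names are my own choices for illustration.

```python
import math


def expected_log_score(true_prob, forecast):
    """Expected log score when the event truly happens with probability true_prob."""
    return (true_prob * math.log(forecast)
            + (1 - true_prob) * math.log(1 - forecast))


# Candidate forecasts: 1%, 2%, ..., 99%.
grid = [i / 100 for i in range(1, 100)]


def best_log_forecast(true_prob):
    """Forecast on the grid with the highest expected log score."""
    return max(grid, key=lambda f: expected_log_score(true_prob, f))


best_by_truth = {p: best_log_forecast(p) for p in (0.2, 0.5, 0.8)}
for p, best in best_by_truth.items():
    print(f"truth {p:.0%}: best forecast is {best:.0%}")
```

In each version of reality the winning forecast is the true probability itself.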

Results using the Log Score

If I calculate the log score for each forecast and average for each forecaster, VividNumeral still comes out ahead. The mean log score is -0.59 for VividNumeral, higher than -0.62 for FiveThirtyEight. Those numbers are really hard to interpret, but if I exponentiate them (e^(log score)) they are 55.2% and 53.7% respectively. To provide some context, a naïve 50% forecast for every team would have a score of 50%.
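For anyone who wants to reproduce the conversion, here is a minimal sketch. The outcomes list is made-up placeholder data, not the actual World Cup results, and mean_log_score is a name I picked for illustration.

```python
import math


def mean_log_score(forecasts, outcomes):
    """Average log score; outcomes are 1 (advanced) or 0 (did not advance)."""
    scores = [
        outcome * math.log(f) + (1 - outcome) * math.log(1 - f)
        for f, outcome in zip(forecasts, outcomes)
    ]
    return sum(scores) / len(scores)


# Placeholder outcomes for four teams (hypothetical, for illustration only).
outcomes = [1, 0, 1, 1]

# A naive forecaster says 50% for everyone and lands on 50% after exponentiating.
naive_forecasts = [0.5] * len(outcomes)
naive_exp_score = math.exp(mean_log_score(naive_forecasts, outcomes))
print(f"naive 50% forecaster: {naive_exp_score:.1%}")

# Exponentiating the rounded mean log scores quoted above.
for name, score in (("VividNumeral", -0.59), ("FiveThirtyEight", -0.62)):
    print(f"{name}: exp({score}) = {math.exp(score):.3f}")
```

Starting from the two-decimal log scores gives roughly 0.55 and 0.54; the 55.2% and 53.7% figures presumably come from the unrounded averages. One way to read the exponentiated number: it is the geometric mean of the probabilities each forecaster assigned to what actually happened.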

My predictions were better, but it's still hard to gauge how much better and the degree to which that would be consistent over time. I hope this post proves I'm open to suggestions of better ways of doing things. If anyone has a better way to compare these forecasts, feel free to email me ideas:

Finally, a big thank you to Ben for raising questions about the original postmortem.