Is that sound you hear Goliath falling?
At the start of the World Cup, I published predictions of group stage results, assigning each team a probability of advancing. A day later Nate Silver at FiveThirtyEight.com published his own predictions. Thus began an epic battle of statistical forecasters: David (VividNumeral, me writing out of my basement garage apartment) versus Goliath (FiveThirtyEight, a subsidiary of ESPN).
Well, the group stage is over and the results are in. The chart on the left tells you everything you need to know - FiveThirtyEight made bigger prediction errors on average than VividNumeral. I can officially declare a resounding victory for VividNumeral.
Goliath has fallen.
Or is that the sound a shrug makes?
Now that I've satisfied my ego, it's time to say out loud what my quant-savvy readers have probably already figured out. The chart I confidently claimed "tells you everything you need to know" actually tells you almost nothing you need to know.
My average prediction error was indeed smaller than FiveThirtyEight's, but plotted on a relevant scale the difference is totally imperceptible. (I'll try not to feel too bad about my brief chart manipulation at the expense of an ESPN affiliate, the sports network that brought you this.)
The chart on the left uses a more informative scale and introduces a third bar that shows the average error for a naive prediction that gives equal weights to each team.
I measured the error of a forecast as the absolute value of the outcome (100% if the team advanced, 0% if they didn't) minus the forecasted probability of advancing. If I said a team had a 75% chance of advancing and they did advance, the error was 25%. If they did not advance the error was 75%.
A completely ignorant forecast that gave a 50% chance to each team would have an average error of 50% by definition, like the grey bar in the chart. A perfect forecaster would have 0% error. At 41.625% and 41.614% respectively, FiveThirtyEight's and my forecasts were closer to ignorant than perfect. That randomness makes predictions difficult and the World Cup fun.
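For readers who want to check the arithmetic, here is a minimal sketch of the error measure described above. The function names are mine, and the 32-team field with 16 teams advancing is an assumption about the tournament format rather than a figure taken from my forecast data.

```python
def forecast_error(prob_advance, advanced):
    """Absolute error: |outcome - forecast|, where outcome is 1.0 if the team advanced, else 0.0."""
    outcome = 1.0 if advanced else 0.0
    return abs(outcome - prob_advance)


def mean_error(forecasts):
    """Average absolute error over (prob_advance, advanced) pairs."""
    errors = [forecast_error(p, a) for p, a in forecasts]
    return sum(errors) / len(errors)


# The 75% example from the text.
print(forecast_error(0.75, True))   # 0.25
print(forecast_error(0.75, False))  # 0.75

# Naive forecast: 50% for every team (assumed format: 32 teams, 16 advance).
naive = [(0.5, True)] * 16 + [(0.5, False)] * 16
print(mean_error(naive))  # 0.5
```

The naive forecast lands on 50% no matter which teams advance, because every individual error is exactly 0.5.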
I haven't run statistical significance tests, but my common-sense significance test screams that a difference of 0.011% in average forecast error is completely irrelevant. If I had lowered the probability of a single successful team's advancement by a meager 0.35%, the difference in average error would be zero.
Instead of a loud victory featuring David over Goliath, the true result is more of an inconclusive shrug with a strong nod to the power of randomness.
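The 0.35% figure can be reproduced with a quick back-of-the-envelope sketch. It assumes a 32-team field (not stated above) and relies on the fact that moving one team's forecast by x points moves the average error by x / 32 points.

```python
N_TEAMS = 32  # assumption: 32 teams in the group stage

fivethirtyeight_avg = 41.625  # average error, in percent
vividnumeral_avg = 41.614     # average error, in percent

# Gap between the two average errors, in percentage points.
gap = fivethirtyeight_avg - vividnumeral_avg

# Lowering the forecast for one team that advanced by `shift` points raises
# that team's error by `shift`, and the average error by shift / N_TEAMS.
shift_needed = gap * N_TEAMS

print(round(gap, 3))           # 0.011
print(round(shift_needed, 2))  # 0.35
```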
The anatomy of a tie
This is about as close to a tie as any result could have been, but the internal sources of error were pretty varied. In the graphic on the right, the dots indicate forecast errors made by FiveThirtyEight and VividNumeral for each country. The line connecting the dots is colored according to who made the better forecast for that country (i.e. missed by less) to easily highlight where mistakes were made. You can find the original forecasts for both sites here.
Bigger mistakes are at the top of the chart and smaller errors at the bottom. Spain, at the very top of the chart, was a huge miss for both forecasters. It's no surprise that the reigning champs failing to advance was a surprise.
I discussed why I think FiveThirtyEight's model is better ex-ante (third paragraph here). That superiority led their model to generally be more certain about predictions than mine. That means they gave really good teams (Brazil, Argentina, Germany) very high probabilities of advancing and really bad teams (Australia) very low probabilities. Even though the VividNumeral model also got those four predictions right, the model's higher uncertainty led to more moderate predictions. As a result, FiveThirtyEight had the four smallest errors, as indicated by the four red lines at the bottom of the chart.
However, FiveThirtyEight was burned by their certainty that Algeria would not advance, assigning the African nation the second-lowest probability of success. I was more sanguine about Algeria's chances, giving them a 47% probability to FiveThirtyEight's 19%. If Russia had advanced instead of Algeria, my ever-so-slight victory (0.011%) would have turned into a pretty substantial defeat (1.68%).
In fact, if most of the close groups had broken differently, the tenuous win turns into a loss. If Cote d'Ivoire had advanced instead of Greece, I would have lost by 2.5%. Ecuador instead of Switzerland = 2.8% loss. Ghana instead of US = 1.2% loss. Italy, who came within a lick (or a bite) of advancement, actually would have helped me a bit, leading to a 0.9% win.
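To see how much a single team can move the overall result, here is a rough sketch of Algeria's contribution to the gap between the two average errors. It uses only the two Algeria probabilities quoted above and assumes a 32-team field. Russia's forecasts are not quoted here, so this covers only Algeria's share of the swing and will not reproduce the 1.68% figure by itself.

```python
N_TEAMS = 32  # assumption: 32 teams in the group stage


def abs_error(prob_advance, advanced):
    """Absolute forecast error for one team."""
    return abs((1.0 if advanced else 0.0) - prob_advance)


def gap_contribution(p_mine, p_theirs, advanced):
    """One team's contribution to (their average error - my average error).

    Positive values favor my forecast, negative values favor theirs.
    """
    return (abs_error(p_theirs, advanced) - abs_error(p_mine, advanced)) / N_TEAMS


# Algeria: 47% from me, 19% from FiveThirtyEight.
actual = gap_contribution(0.47, 0.19, advanced=True)
counterfactual = gap_contribution(0.47, 0.19, advanced=False)

print(round(100 * actual, 3))          # 0.875 points in my favor
print(round(100 * counterfactual, 3))  # -0.875 points, i.e. against me
```

One team flipping from a 0.875-point gain to a 0.875-point loss is far larger than the 0.011-point margin I actually won by.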
An ugly win
Those alternative scenarios, a bounce or two away from reality, highlight how difficult it is to judge the process that goes into these forecasts from a single tournament. I really enjoy telling people that my forecasts were more accurate than Nate Silver's, but my conscience only allows me a couple of seconds to bask in that glow before I come clean.
I'm pretty sure FiveThirtyEight's model is better. Like a soccer team that squeaks out a win on PKs despite being outshot 5 to 20, I'll take the win, but I know I was lucky.