Rating Systems and The Confederations Cup

I like to keep tabs on my rating systems from time to time, always looking for ways to improve them. Since it appears most everybody took the Confederations Cup quite seriously, why not have a look at how the rating systems did in predicting the outcomes of games?

The four systems I looked at were the Official FIFA Rankings, the World Football ELO Ratings and my two rating systems. Each system has its ratings from before the start of the tournament documented online. They were:

Team/System   Voros1OFF   Voros1DEF    Voros2    ELO   FIFA
Brazil             2.01      320.55   1078.21   2036   1288
Spain              1.56      353.22   1246.33   2093   1761
Italy              1.38      264.66    825.36   1961   1292
USA                1.29      191.68    420.40   1795    947
Egypt              1.03      149.63    253.41   1700    719
South Africa       0.72      129.02    125.36   1520    471
Iraq               0.76      116.74    108.39   1509    450
New Zealand        0.63       75.11     59.94   1518    431

For my systems, Voros1 rates each team based on its ability to score and prevent goals. This is the system I use to produce the World Cup Qualification odds. Voros2 produces a single rating: each result is credited as a percentage of a win based on the scoreline, with the increase in that percentage shrinking for each additional goal in the margin (so that the difference between winning by three and by four is very small, and the difference between winning by six and winning by sixty is irrelevant).

First off, here is the won/loss record for each system over the 15 competitive games (I didn’t count the third place game, though every system got that one right anyway).

System    Wins  Draws  Losses
Voros1      11      2     2
Voros2      11      2     2
ELO         11      2     2
FIFA        10      2     3

Two things are notable about that list. The first is that the systems pick the same team most of the time. The two Voros systems picked the same team in every match. There was one difference between the Voros systems and the ELO system: the Voros systems both picked Iraq over New Zealand, while ELO took New Zealand over Iraq. The match wound up being one of the two draws. The FIFA Rankings picked the same team as the Voros systems every time except for the Brazil/Italy matchup. FIFA took Italy (oops!); the other three systems all took Brazil.
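The picks for the single-number systems can be reproduced from the ratings table above by taking the higher-rated side. Here is a minimal sketch, assuming each system's pick is simply the team with the larger rating. Voros1 is left out because combining its offense and defense numbers into a pick is not described here. The names `RATINGS` and `pick` are made up for illustration.

```python
# Pre-tournament ratings copied from the table above (Voros2, ELO, FIFA columns).
RATINGS = {
    "Brazil":       {"Voros2": 1078.21, "ELO": 2036, "FIFA": 1288},
    "Spain":        {"Voros2": 1246.33, "ELO": 2093, "FIFA": 1761},
    "Italy":        {"Voros2": 825.36,  "ELO": 1961, "FIFA": 1292},
    "USA":          {"Voros2": 420.40,  "ELO": 1795, "FIFA": 947},
    "Egypt":        {"Voros2": 253.41,  "ELO": 1700, "FIFA": 719},
    "South Africa": {"Voros2": 125.36,  "ELO": 1520, "FIFA": 471},
    "Iraq":         {"Voros2": 108.39,  "ELO": 1509, "FIFA": 450},
    "New Zealand":  {"Voros2": 59.94,   "ELO": 1518, "FIFA": 431},
}

def pick(system, team_a, team_b):
    """Return the team the given system rates higher (assumed to be its pick)."""
    a = RATINGS[team_a][system]
    b = RATINGS[team_b][system]
    return team_a if a >= b else team_b

# The two matchups where the systems disagreed.
for system in ("Voros2", "ELO", "FIFA"):
    print(system,
          "|", pick(system, "Iraq", "New Zealand"),
          "|", pick(system, "Brazil", "Italy"))
```

Run against the table, this shows ELO as the only system taking New Zealand over Iraq and FIFA as the only one taking Italy over Brazil.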

The other is that, contrary to what people sometimes say, every single time I do this the systems (even FIFA) do way better than the proverbial “picking names out of a hat.” Past results can be used to predict future outcomes, and can do so fairly well. Two losses in 15 matches is fairly close to what I usually get when I check the systems from time to time.

There are a few problems with the FIFA Rankings. FIFA does not give you any guidance as to how they would like you to adjust for home field (it’s not even factored into their rankings). FIFA doesn’t even tell you how to use their rankings to predict matchups. This limits your ability to compare them to the others, because you don’t really know what the rankings are trying to say when they rate one team at 947 and another at 719. Both my systems and ELO have ratings you can use to predict the result of a matchup between two teams.
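For ELO, the conversion from two ratings to a matchup prediction is the standard Elo expected-result formula. A minimal sketch follows. The function name is made up for illustration, and `home_bonus` is a placeholder for a home-field adjustment (the World Football ELO Ratings add 100 points for a home side, as far as I know; it is zero here for a neutral venue).

```python
def elo_expected(rating_a, rating_b, home_bonus=0.0):
    """Expected result for team A (win = 1, draw = 0.5, loss = 0) under the
    standard Elo formula. home_bonus is added to team A's rating."""
    diff = rating_a + home_bonus - rating_b
    return 1.0 / (1.0 + 10 ** (-diff / 400.0))

# Spain (2093) against the USA (1795) at a neutral site, ratings from the table above.
print(round(elo_expected(2093, 1795), 2))  # 0.85
```

Nothing comparable can be written for the FIFA numbers, since FIFA publishes no mapping from a points gap to a probability.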

So all of the systems got most of the games right. What else can we do? FIFA’s limitations also limit how in depth we can analyze things. One thing we can run is a confidence pool. I don’t know how popular these are abroad, but they are the most popular way in America to run office NFL football pools. The idea is that you not only pick which team is going to win, but also how confident you are in that pick. That way you can differentiate between your confidence in Brazil beating Italy and your confidence in Brazil beating American Samoa.

These are done by assigning a numerical value to each pick, from the least confident (1) to the most confident (in this case 15, since there are 15 games). If you win, you get those points; if you lose, you get nothing. For our purposes, a draw means you get half the points you risked. The person with the most points wins the pool.
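The scoring rule above can be sketched in a few lines. The function name and the three-game card are made up for illustration and are not any system's actual picks.

```python
def pool_score(picks):
    """Score a confidence pool card.

    picks is a list of (points_risked, result) pairs, where result is
    'win', 'draw' or 'loss' from the point of view of the team picked.
    """
    credit = {"win": 1.0, "draw": 0.5, "loss": 0.0}
    return sum(points * credit[result] for points, result in picks)

# Hypothetical three-game card: 3 points on a winner, 2 on a draw, 1 on a loser.
example = [(3, "win"), (2, "draw"), (1, "loss")]
print(pool_score(example))  # 3 + 1 + 0 = 4.0
```

With 15 games, the points 1 through 15 add up to 120, so that is the most any card can score.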

Every system had the same two games as their 14 and 15 picks: Spain over New Zealand and Spain over Iraq. Every system won those two games. In fact, every single system won every one of its 12 through 15 picks. The highest-valued pick not to win was FIFA’s 11-pointer on Spain, who were upset by the USA. All the systems had Spain of course, but Voros1 had only 6 points on it, Voros2 had 8 and ELO had 10. For the Egypt over Italy upset, Voros1 had Italy at 7, Voros2 had Italy at 9, ELO also had Italy at 9, and FIFA had Italy at 10. The worst pick from Voros1 (besides those) was the draw between Iraq and New Zealand. Voros1 risked the most points of any system (5) on that one. In the other draw (South Africa and Iraq), Voros1 only risked 2 points on the hosts.

The final point totals were:

Voros1 = 103.5
Voros2 = 99.5
ELO = 98
FIFA = 95.5

This is pretty consistent with what I usually find. Voros1 tends to slightly outperform Voros2 in international play (the reverse is true for club play), ELO is pretty close behind and then FIFA behind them. It’s obviously way too small of a sample to draw any conclusions from, but like I said that’s generally the order they tend to come in when I do this.

If we throw FIFA’s out, we can compare the predicted win% with the actual outcome and gauge the systems that way. I’ll use the Root Mean Squared Error for that. What that simply means is that I’ll take the system’s predicted win% for the favorite (win% = [wins + [draws/2]]/games), take the difference from the actual result (1 if the favorite wins, 0 if it loses, 0.5 if it’s a draw) and square it. Then I average those numbers over all of the games and take the square root. It’s a fairly common way to measure errors in predicted values. Here’s the RMSE for each system (a lower number indicates a more accurate prediction):

Voros1 = 0.341
Voros2 = 0.363
ELO = 0.366
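The RMSE steps described above can be sketched as follows. The four predictions and results are hypothetical numbers chosen only to show the arithmetic; they are not the tournament data.

```python
import math

def rmse(predictions, outcomes):
    """Root mean squared error between predicted win% for the favorite
    and the actual result (1 = favorite won, 0.5 = draw, 0 = favorite lost)."""
    squared_errors = [(p - o) ** 2 for p, o in zip(predictions, outcomes)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical predictions for four games and what actually happened.
predicted = [0.85, 0.70, 0.60, 0.55]
actual = [0.0, 1.0, 1.0, 0.5]
print(round(rmse(predicted, actual), 3))  # 0.494
```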

If you take the average of the absolute error (the errors without squaring them), you get:

Voros1 = 0.281
Voros2 = 0.299
ELO = 0.265

ELO has the advantage here, mainly because it tends to make its favorites bigger favorites across the board than my systems do. So when it misses, like the USA over Spain (Spain = 85%), it takes much less of a hit than it does under the measure that squares the errors. As I said before, if my systems do have a weakness, it’s that they may underrate the chances a favorite has of winning. This tends to be a bigger problem for Voros2 than Voros1. I have yet to find an acceptable fix for it, though.
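A small made-up example shows why the two measures can rank systems differently. It assumes five games in which the favorite wins four and loses one, with a "bold" system giving every favorite 85% and a "cautious" one giving 70%. None of these numbers come from the tournament.

```python
import math

def mean_abs_error(preds, outcomes):
    """Average of the unsquared errors."""
    return sum(abs(p - o) for p, o in zip(preds, outcomes)) / len(preds)

def rmse(preds, outcomes):
    """Square root of the average squared error."""
    return math.sqrt(sum((p - o) ** 2 for p, o in zip(preds, outcomes)) / len(preds))

# Hypothetical: the favorite wins four games and loses one.
outcomes = [1, 1, 1, 1, 0]
bold = [0.85] * 5      # makes favorites big favorites
cautious = [0.70] * 5  # makes favorites smaller favorites

for name, preds in (("bold", bold), ("cautious", cautious)):
    print(name,
          round(mean_abs_error(preds, outcomes), 3),
          round(rmse(preds, outcomes), 3))
```

The bold system leads by 0.09 on average absolute error (0.29 against 0.38) but only by about 0.01 on RMSE (0.403 against 0.412), because its one miss costs 0.7225 once squared compared with 0.49 for the cautious system.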

That’s a lot of stuff, but to sum it up: I think my systems compare well to ELO. I won’t say they’re better, because that’s not really supported by the data. They were (with some qualifications) better at predicting results in this tournament, but that’s obviously not the same thing. The differences aren’t large, and the sample size is quite small.

The key thing is that these systems really do have some relationship to team quality, even FIFA’s much maligned rankings. They aren’t “meaningless” and do tell you quite a bit about the quality of the rated teams. They all can be better, and I’m always looking for ways to improve mine.

vorosmccracken.com
