
The National Team Rating System Explanation

October 5th, 2009 · 21 Comments

Just what the hell am I doing back in the vorosmccracken.com command bunker when I’m running all of these ratings and simulations?

Now I’m not going to bother pretending I know anything about the effects subtle tactical and formation changes have on the outcome of a match. That’s not what I do. I’m a numbers guy and that’s how the analysis is going to be done.

If you really don’t care what precisely is going on, but have looked over the ratings and decided that they seem reasonable enough to you, you might want to stop here. This post will be long and a bit technical, so if this kind of explanation doesn’t interest you then you don’t have to continue.

You could get very in depth with a whole series of numbers and analyze all of them to rate teams. I have no problem with this approach and encourage people to have a try if they’d like. The issue is that I’m running simulations on dozens of qualifiers throughout every region of the world and I have neither the time nor the resources to do that kind of analysis on every team in the world. I also don’t have the time to verify and test such analyses to make sure the information they are giving me is reliable.

So I rely on my National Team Ratings to do the heavy lifting for all of these games. So how do those work? Unlike the FIFA Ranking or even ELO, my system uses an iterative method to derive the ratings. At its core, the system asks: “what ratings would produce a predicted total of goals scored and allowed for every team that matches their actual goals scored and allowed in their matches?” To put it simply, if Team A plays Team B 10 times, and Team A scores 15 goals and Team B scores 8 in those 10 matches, what ratings for Team A and Team B would produce those exact goal totals over the 10 matches (while not necessarily matching the exact result of each individual match)?

Add in 206 more national teams, uneven schedules, weighting of games by recentness and match importance, home field advantage and a few other minor details and that question becomes rather difficult to solve directly. The iterative method allows me to take guesses at it, and then refine my guesses until the numbers eventually converge on the answer. I’m not saying my system is the best, but I will say that whatever system out there is the best, I’m 95% sure it’s going to be an iterative system.
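
For anyone who wants a feel for what “iterative” means in practice, here is a bare-bones sketch in Python. The teams, results, starting values and damping rule are all illustrative; this is not the actual code or data behind the ratings, just the refine-the-guesses idea: start everyone at the same ratings, predict every match with the ratio formula given further down, and repeatedly nudge each offensive and defensive rating until the predicted goal totals line up with the actual ones.

    # Bare-bones iterative fit; every name and number here is illustrative.
    from collections import defaultdict

    SCALE = 180  # half-minutes per match (see the formula section below)

    # (home, away, home_goals, away_goals) -- toy results, not real data
    matches = [
        ("Brazil", "Chile", 3, 0),
        ("Chile", "Peru", 2, 1),
        ("Peru", "Brazil", 1, 2),
    ]

    teams = {t for m in matches for t in m[:2]}
    off = {t: 1.0 for t in teams}     # offensive ratings
    dfn = {t: 100.0 for t in teams}   # defensive ratings

    def exp_goals(attacker, defender):
        """Expected goals for the attacking team, using the ratio form of the ratings."""
        return SCALE * off[attacker] / (off[attacker] + dfn[defender])

    for _ in range(500):  # keep refining until the ratings settle down
        pred_for, pred_agn = defaultdict(float), defaultdict(float)
        act_for, act_agn = defaultdict(float), defaultdict(float)
        for home, away, hg, ag in matches:
            pred_for[home] += exp_goals(home, away); pred_agn[away] += exp_goals(home, away)
            pred_for[away] += exp_goals(away, home); pred_agn[home] += exp_goals(away, home)
            act_for[home] += hg; act_agn[home] += ag
            act_for[away] += ag; act_agn[away] += hg
        for t in teams:
            # Score more than predicted -> offense goes up; concede less than
            # predicted -> defense goes up. The square root damps the updates.
            off[t] *= ((act_for[t] + 0.1) / (pred_for[t] + 0.1)) ** 0.5
            dfn[t] *= ((pred_agn[t] + 0.1) / (act_agn[t] + 0.1)) ** 0.5

    print({t: (round(off[t], 2), round(dfn[t], 1)) for t in sorted(teams)})

The real system layers match weights, home field advantage and 200-odd national teams on top of that loop, but the guess-and-refine structure is the same.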

The System Explained

But why goals? Why not wins and losses? That’s a fair question, and the answer is that I think it gives a better estimation of actual team quality than one based strictly on wins, draws and losses. This has been shown pretty conclusively in baseball and other sports, and my experience looking at the issue in football/soccer is that it applies here as well (though maybe not as absolutely). I do include in my ratings an additional rating (calculated by a slightly different system) which takes into account wins, draws and losses (as well as the margin of victory); you can see that rating in the right-hand columns. Testing has convinced me that the goals method is slightly more accurate for future predictions when it comes to international football.

The reason why the goals method works best in international football is somewhat of a happy accident. In league play for clubs, all teams play the same number of games. In international football, the better the team, the more games that team is likely to play. This fact essentially solves the “running up the score” problem, because the outcome of such a game will disproportionately affect the rating of the weaker team: that one game is a much larger chunk of their total games played than it is of the stronger team’s. So Australia gained little by the message they sent to the OFC with their qualifier against American Samoa. All that happened was that American Samoa’s already bad defensive rating became that much worse, while Australia’s offensive rating barely nudged up. The other reason why it works is that Australia could only try to run up one of their two ratings. With the other half, all they could do was not let American Samoa score, an outcome as unimpressive in the ratings as it sounds on paper. However, this does mean that the ratings start to get less accurate as you get past the top 100 and into the bottom half of teams. We know those teams are bad, but because of a lack of games, and a lack of games against good opposition, precisely how bad is up for debate.

One of the adjustments I make in the system is that one game does not necessarily equal “one game” in the system. All games are counted as fractions of one game, with “games=1.0” being the maximum any single match is worth. Such a match would be a World Cup Finals match that ended today. That sounds a little complicated, but what you really need to know is that if a team scores 35 goals and allows 31 in 30 games, the system might only see it as 7.6 goals scored and 7.1 allowed in 7.4 games depending on what kind of matches were played and when.
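
In code terms that bookkeeping is nothing fancier than a weighted sum. A tiny illustration (the weights and scores below are made up):

    # Each entry is (goals_for, goals_against, weight) for one match -- made-up values.
    results = [
        (2, 0, 1.00),   # recent World Cup finals match: full weight
        (1, 1, 0.45),   # older friendly: heavily discounted
        (3, 2, 0.70),   # qualifier from a couple of years back
    ]

    weighted_games = sum(w for _, _, w in results)
    weighted_scored = sum(gf * w for gf, _, w in results)
    weighted_allowed = sum(ga * w for _, ga, w in results)

    # Three matches count as about 2.15 "games", 4.55 goals scored and 1.85 allowed.
    print(round(weighted_games, 2), round(weighted_scored, 2), round(weighted_allowed, 2))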

So how do I assign those weights? Well, that’s really hard to do actually, and I’m always looking for better ways to do it. As it stands right now, a friendly is worth about half as much as a World Cup finals match. Having studied the issue, I’ve found that friendlies do tell us something about the quality of a national team, and for statistical purposes they provide a much greater sample of games, which helps the overall accuracy of the system. I’m open to the idea of lowering their value, but so far trial and error has led me to settle on that kind of weighting. All types of matches that are neither friendlies nor World Cup finals matches lie somewhere in between. Again, it’s just trial and error (and a little common sense) to come up with the right figures for those matches. I’m working on ways to accomplish this task better in the future. My current project is to devise an “outlier” system where the weight of a match is affected by how unusual a result it is. This may sound unfair to underdogs, but if China beats Brazil 3-0 in a non-World Cup match, that itself suggests Brazil wasn’t exactly treating the match very seriously. This type of adjustment has not been implemented (or tested) yet.

Another adjustment is for the recentness of games. This one is a little easier to get at empirically, and I’m a lot more confident in these weights. The matches used in the system go back eight years. Eight years! Yes, eight years. Rest assured, a match that took place eight years ago has almost no bearing on anything. Truthfully, the biggest reason for such a long time period is that it helps with teams that don’t play that many games. For the system not to return errors and zeroes, everybody has to be rated, and the long time period helps with that. Another thing: if there’s an error in the public perception of team strength, I think it’s that the general public puts far more weight on the most recent results than is really warranted. Club seasons last 30 games or more, and the champion at the end of 30 games is quite often not the champion at the end of 5. A great example of the problem was in this year’s Confederations Cup, where an Egyptian team whose previous two games were extremely impressive came up against a higher ranked American team whose previous two games against the same teams were abysmal. The U.S. won easily (and then got very lucky with the result of the other game) and wound up making the tournament final. At the end of the day, the true talent level of a team is better identified with more games used to make that identification.
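
Putting the two adjustments together, the weight of any one match is basically an importance factor times a recency factor. Here is a rough sketch of that calculation; the only importance values actually stated above are friendly ≈ 0.5 and World Cup finals = 1.0 (the in-between figures are placeholders), and the straight-line decay over the eight-year window is the form described in the comments below.

    from datetime import date

    # Importance factors: only "World Cup finals = 1.0" and "friendly ~ 0.5" are
    # figures from the post; the in-between values are placeholders.
    IMPORTANCE = {
        "wc_finals": 1.0,
        "continental_finals": 0.85,      # placeholder
        "wc_qualifier": 0.80,            # placeholder
        "continental_qualifier": 0.75,   # placeholder
        "friendly": 0.5,
    }

    WINDOW_DAYS = 8 * 365.25  # matches older than eight years carry no weight at all

    def match_weight(match_type, played, today):
        """Importance factor times a linear decay from 1.0 (today) to 0.0 (eight years ago)."""
        age_days = (today - played).days
        recency = max(0.0, 1.0 - age_days / WINDOW_DAYS)
        return IMPORTANCE[match_type] * recency

    # A friendly played four years ago counts as roughly a quarter of a game:
    print(round(match_weight("friendly", date(2005, 10, 5), date(2009, 10, 5)), 2))

Those per-match weights are exactly the fractions that add up to the “7.4 games out of 30” kind of totals described above.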

System Structure

It’s been almost 7 years now since I started trying to do this (where does the time go?), and after about a year I settled on the current structure as one that could yield me accurate results, fit the competition format of international soccer (actually it was college baseball, but it works for both) and was within my mental and computer capabilities. At the very base of it was KRACH, a college hockey rating system by Ken Butler. Since then it’s been modified and adjusted beyond any real recognition of those original roots (this system has two ratings instead of one, just for starters) so as to best take care of the task at hand. I suppose the most interesting thing about that history is that the Boston Red Sox indirectly helped create an international football national team rating system. Funny how things work out.

The ratings have this format:

In a game between Team A and Team B:

Team A Predicted Goals Scored = (Team A OFFrat/(Team A OFFrat + Team B DEFrat)) * 180
Team B Predicted Goals Scored = (Team B OFFrat/(Team B OFFrat + Team A DEFrat)) * 180

You might ask what that “180” is about. Well, sometimes numbers are easier to work with in a binomial distribution format. What this basically means is that all possible outcomes lie somewhere between 0 and 1. To convert goals scored in a soccer match to this format, I needed to break the game down into smaller parts so that I could keep the number “1” as the upper bound. I decided that breaking the 90 minute game into half minutes was the best way. This does in fact work, I have tested it, and in many ways it’s quite similar to a Poisson distribution. Because that’s the way the numbers went into making the ratings, they have to be reformatted that way once you want to use them.

If that sounds like an awful lot of math gibberish, well, it is. Working through math gibberish is what I do. The point is simply that the ratings have a relationship with goals scored and allowed that can be gotten at using that formula. Those results I then adjust for home field advantage. That adjustment is currently the same for all home teams, though my very next change to the system will probably be a way to allow that number to vary by team (it’s a little harder to do than you might think). So not only can I predict the outcome of a game about to be played using the ratings, I can also predict the outcomes of games that have already been played. As stated above, the latter predictions are used to adjust the ratings so that they best fit the actual results.
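
To make that concrete, here is the prediction step in code form. The ratings plugged in are placeholder numbers (roughly in the ranges the published tables use), and the single multiplicative home field factor is an illustrative stand-in, not the adjustment the system actually applies.

    SCALE = 180  # half-minute chunks per match, per the formulas above

    def predicted_goals(off_a, def_a, off_b, def_b, hfa=1.1):
        """Expected goals for Team A (at home) and Team B.
        hfa is a placeholder home field multiplier, not the system's real adjustment."""
        goals_a = (off_a / (off_a + def_b)) * SCALE * hfa
        goals_b = (off_b / (off_b + def_a)) * SCALE / hfa
        return goals_a, goals_b

    # Made-up ratings for a stronger home side against a weaker visitor:
    home_goals, away_goals = predicted_goals(off_a=2.4, def_a=260.0, off_b=1.2, def_b=190.0)
    print(round(home_goals, 2), round(away_goals, 2))  # roughly 2.47 and 0.75

Run the same calculation over matches that have already been played and you get the predicted totals that the iterative fit compares against the actual ones.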

These ratings are then the backbone of the simulations I run. In my next post I’ll show exactly how that is done, with the upcoming Russia/Germany match as the example.

Tags: Soccer!! · South Africa 2010 · Uncategorized

21 responses so far ↓

  • 1 dorian // Oct 6, 2009 at 6:27 am

    Hi Voros,

    Very interesting. Thanks for sharing.

    With regard to the breaking a game into 180 half-minutes and then mentioning the Poisson distribution, are you looking at the time in the matches that goals occurred? So, if a team averages one goal every 45 minutes, there’s an expectation based on the time in the prior match that the most recent goal was scored.

    Also, out of curiosity, how do you adjust for the recentness of the game? Is it a linear or exponential decay; what’s the half-life?

  • 2 Voros // Oct 6, 2009 at 7:20 am

    I’m not really breaking the game into 180 sections. It could be a number like 360 or 5,000 or whatever. I just need a number big enough so that I won’t ever have to worry about going over 1. IE:

    3 goals in a game is converted to a percentage like:

    3/180. The main point is that goals per game has to be converted to a number between 0 and 1 for the system to work, and that’s what the 180 is for.

    As for the decay on match recentness: it actually is linear, but not out of laziness. I studied the issue and tried out an exponential decay, and it simply didn’t predict future results as well as a vanilla linear model.

    Eight years ago is 0, today is 1, and it’s linear for each day in between. I’m willing to consider going to an exponential decay system (it would be very easy to implement), but I’d need the right exponent and a good reason to change it from what it is now. So far I haven’t come up with either.

  • 3 Mitz // Oct 6, 2009 at 8:19 am

    Hi Voros, and to echo Dorian thanks for sharing this.

    I’m looking forward to seeing the practical example that is forthcoming. In the meantime, a note regarding the importance rating of each game, especially friendlies.

    Personally, I feel that a very low importance co-efficient should be attached to friendly internationals. The hypothetical example you cite re: China beating Brazil 3-0 is one reason; another is the huge variation from nation to nation in how “up for” certain games they are. There are some teams that seem to produce their best football when playing high quality opposition in the most important matches, whereas other teams (with an arguably more professional approach) will always play at the top of their game no matter what the occasion. Sad but true, a lot of players are not going to give of their best these days in an international friendly match, if they know that a few days later they will be playing in an important league game for their club. There is often very little resemblance between an experimental side put out in a friendly and the first XI that kicks off in a major tournament.

    For all these reasons and more, if it were my system, a friendly would have no more than a quarter of the weight of a World Cup Finals match. I would go with something like:

    WC Finals – 1
    Continental Finals – 0.9
    WC qualifier – 0.8
    Continental qualifier – 0.7
    Other tournament – 0.5
    Friendly – 0.25

    Regarding the length of time that games are relevant in the calculations, eight years does seem a very long time. I haven’t checked, but I would guess that the proportion of players that were in South Korea & Japan in 2002 that will also be going to South Africa next year (among the teams that qualify for both of course) will be a pretty low percentage. And the team managers common to both tournaments will be a lower percentage still. It seems a little unfair to base even a small part of a rating score on the deeds of a team that has most likely transmogrified beyond recognition. For me, four years would be sufficient.

    Anyway, details, details. One thing that would be very interesting would be an insight into your accuracy record…!

  • 4 dorian // Oct 6, 2009 at 12:53 pm

    Hi Voros,

    For the decay, have you looked at what I’ll call an “S-curve”? That is, the rate of decay is slow for more recent events, then speeds up until it reaches the halfway point (4 years in your case), at which point it slows down again. The rationale would be that matches a year ago would be worth more than a linear decay would show, while matches 7 years ago would be worth less than a linear decay would produce.

    There are a few equations that produce an S-curve. (Apologies if posting equations is poor etiquette.) Here is one:

    X = days ago of the match divided by (8 years times 365.25) times 2

    (The “times 2” simplifies the equation below. Note X ranges from 0 to 2, with X=1 representing 4 years ago.)

    (a) if X is less than 1 then WT = 1 – 0.5 * (X)^P

    (b) if X is greater than 1 then WT = 0.5 * (2-X)^P

    P is an exponent. When P is 1, the weights are your linear decay function. When P is greater than 1, the weights are an S-curve. Here is a table showing how a match would be weighted with P = 1 and P = 2.35 (which produced some roundish numbers):

    Years Ago   P=1    P=2.35

    0           1.00   1.00
    1           0.88   0.98
    2           0.75   0.90
    3           0.63   0.75
    4           0.50   0.50
    5           0.38   0.25
    6           0.25   0.10
    7           0.13   0.02
    8           0.00   0.00

    The midpoint (year 4) will always be 0.50. The larger the value of P, the steeper the decline in the middle years, and the more the overall rating is impacted by the most recent four years. Using a P greater than 1 could balance your need to have more data points with, say, Mitz’s belief that the most recent four years should carry more (or all) of the weight.

    So, not sure if any of this would provide any value to your overall system, but thought I would throw it out there.

  • 5 dorian // Oct 6, 2009 at 12:58 pm

    Oops! Looks like my “less than” and “greater than” signs were read as HTML. Rookie mistake. (Table spacing didn’t work either…)

    Here is the original equation:

    (a) if X is less than 1 then WT = 1 – 0.5 * (X)^P

    (b) if X is greater than 1 then WT = 0.5 * (2-X)^P

  • 6 Voros // Oct 6, 2009 at 6:00 pm

    That’s right dorian, basically any multi-term function can produce the kind of curve you’re talking about as long as you can fiddle with the coefficients and exponents.

    Like I said, to make the change I have to nail down what function I’d like to try and then see if it’s an improvement. I generally like to run my system tests after big tournaments like the Euros and World Cup to get a handle on what can be tweaked to improve the system. So I expect next fall I’ll be fiddling with this and other issues (home field being my big target right now).

    As for the issue with friendlies, the basic issue is that there’s really not a ton of competitive games between teams from different confederations. Unfortunately, unlike clubs, you’re always in a sample size battle. Ultimately if I’m counting games you maybe think I shouldn’t be, the reason will usually be because I need them to have a decent size sample.

    Like I said, I checked a few years back (after the 2006 World Cup), and doing the system where _only_ friendlies counted produced ratings that were well in line with the consensus on team strength. I used to have friendlies reduced to that 25% range you suggested; I’ve since had enough reason to raise it from there. Maybe somewhere in between there and where I have it now (40%?).

  • 7 Ross // Oct 6, 2009 at 7:46 pm

    Hi Voros,

    Does an African World Cup Qualifier carry the same weighting of importance as a European World Cup Qualifier?

    The reason I ask is that, if I’m not mistaken, the number of “quality” teams divided by the total number of participants is much higher in the African Confederation than it is in UEFA. The distribution of team ratings is much more radical in the former than in the latter (am I right?).

    Essentially, the top African teams play a higher percentage of games against very poor opposition.

    This would lead me to suggest that WC and regional qualifying games should be weighted differently depending on the Confederation. This would then be revised for regional tournaments representing the quality spread/participating teams differences.

    Great write up and read, as usual.

  • 8 Mitz // Oct 7, 2009 at 12:59 am

    I was really talking from an instinctive point of view rather than with any empirical evidence backing me up. I’m sure that the tweaks and adjustments you have made over the last several years using actual data have led to a system that works. Alvaro for one will swear by your method if Serbia (as expected) do the necessary on Saturday…

  • 9 Voros // Oct 7, 2009 at 3:02 am

    “Does an African World Cup Qualifier carry the same weighting of importance as a European World Cup Qualifier?”

    Well, disparity in teams should be covered by the strength of schedule function and not really by the weighting function. That said I do have OFC qualifiers weighted slightly less than qualifiers from elsewhere.

    I really want to make the relative strength of the confederations a strictly objective affair. I don’t want to thumb the scales because if the system is constructed right, I shouldn’t have to. The way match importance is currently weighted is far from perfect, but it’s a tough nut to crack.

    Mitz, I want people to challenge the way I’ve constructed the system, because sometimes those challenges help me think about ways to improve it. So by all means, fire away at any and all aspects of it you think might be able to be worked out better. Again, with the match weighting I’m always trying to come up with improvements that can be accomplished with the data and time that I have.

  • 10 Mitz // Oct 7, 2009 at 4:12 am

    FIFA do use a weighting co-efficient for different continents when working out their rankings, and while I do have some issues with their list I do think this policy makes sense. As has been pointed out in relation to another argument hereabouts recently, the only World Cup winners to date have come from Europe and South America, and those two continents utterly dominate the latter stages of all World Cup tournaments right back to 1930. From memory the only semifinalists from elsewhere were USA (1930) and South Korea (incredibly dodgy refereeing and all, 2002). Pele keeps banging on about it, but the best Africa has managed so far is a couple of quarterfinalists, although some of their top teams do seem to be making progress – we shall see next year. Three, or maybe four tiers would make sense to me:

    Europe & South America = 1
    Africa = 0.9
    Asia & Concacaf = 0.8
    Oceania = not a lot, especially with Australia now officially being Asian.

    With regards to how far back games count, what would constitute a useful sample? Taking England as an example, since October 7th 2001 they have played 94 games:

    42 friendlies
    20 Euro qualifiers
    4 Euro finals
    18 WC qualifiers
    10 WC finals

    Unsurprisingly, since October 7th 2005, they have played exactly half that amount – 47:

    Friendlies – 20
    Euro qualifiers – 12
    Euro finals – 0 (boo hoo – damned Croatia!)
    WC qualifiers – 10
    WC finals – 5

    Is the smaller sample from the more relevant 4 year period not sufficient?

  • 11 Amir // Oct 7, 2009 at 1:07 pm

    Hi voros,
    About the importance factor, I would use FIFA Rankings’ factors, because they reflect the way teams should treat the matches, which means:
    WC -4
    CC – 3
    WC/CC qualifier – 5
    I’m not sure if you remember, but we exchanged emails a few months ago… The discussion we had was also about home field advantage.

    here is what we wrote back then in case you forgot about our ideas:

    amir wrote:

    I actually have a nice idea about dealing with home advantage objectively.
    You can treat every team as two different teams: one in home matches, and one in away matches (and on neutral ground). For each of them, compute an offense rating and a defense rating (or win ratings).
    That way, if team A loses on the road to team B, and team B is especially good at home, team A’s rankings won’t decrease by much.
    I know there aren’t many matches on neutral ground, so the multigraph of matches is an “almost bipartite” graph (home teams and away teams), but it may be nice to see the rankings made in such a way.

    Thanks again for your answers,
    Amir

    The problem with that approach, unfortunately, is one of sample size. You really have two separate sample size issues. The first issue is the standard sample size problem: making sure each team has enough games to accurately rate it. The double bubble sample size problem is a relative lack of games between teams in different confeds, particularly meaningful games. A system like ELO actually works fairly well within confeds, but breaks down when you try to compare Asia to Europe to Africa to America and so on. By cutting teams’ games in half, you start to run that risk.

    Another approach I could do would be to have a different home park factor for each team. I actually know precisely how I’d do that (a second round of iterations for HFAs would do it), but it doesn’t really solve the “pseudo home field” problem, and again you’re weighing being able to differentiate between an Ecuadorian home field advantage and a Canadian one against the fact that the smaller samples will increase the error ranges on those numbers. Maybe a compromise would be to go half way between a standard home field advantage and the derived one for each team. I’ll probably give it a whirl in between the end of qualifying and the World Cup.

  • 12 Amir // Oct 7, 2009 at 1:08 pm

    The factors are of course:
    WC -4
    CC – 3
    WC/CC qualifier – 2.5
    friendly – 1

  • 13 Amir // Oct 7, 2009 at 1:13 pm

    And somehow Mitz’s factors are very similar to FIFA’s factors…

  • 14 dorian // Oct 8, 2009 at 8:14 am

    Hi Voros,

    Some perhaps crazy ideas:

    (1) If you used match halves, would you double your sample size?

    (2) For Friendly outliers, could you use winsorization?

    (3) For Friendly outliers, could you adjust the weight down by dividing by the z-score?

    (4) If there were a database of which ref was used for each match, could the ref be an additional variable? There are several possible interpretations; for example, “bad” refs might cause you to downweight the match. (Of course, there may be reasons not to touch this possibility…)

    (3) For

  • 15 Russia vs Germany // Oct 8, 2009 at 10:29 pm

    […] RSS ← The National Team Rating System Explanation […]

  • 16 scaryice // Oct 9, 2009 at 1:02 am

    Confederations should definitely not be weighted – that’s not fair to the individual teams. A match shouldn’t be worth more because of where it’s being played.

    Voros, have you considered just having two weights for matches: one for friendlies and one for matches that count?

  • 17 Road to South Africa 11 // Nov 13, 2009 at 1:33 am

    […] important note: One minor change has been made to the National Team rankings. In the explanation to the rating system posted earlier, I explained that I needed to convert goal scoring to a number between 0 and 1. To […]

  • 18 Marko // Nov 15, 2009 at 1:30 am

    Hi,

    I still don’t understand why the OFFRat is a value around 1, 2 or 3 and the DEFRat is a value around 100-300 in your tables.

    Why don’t you take the results simply as they are and have:

    Spain scored 15 goals and conceded 8 in the last ten matches, so their OFFRat = 1.5 and their DEFRat = 0.8? Afterwards, these values could be modified for home/away, match importance, and so on.

    But why two different ranges for OFF and DEF in your system?!

  • 19 Marko // Nov 15, 2009 at 1:44 am

    Additionally, I don’t think that calculating strengths based only on goals suits the “soul of soccer”.

    One example: if you have a crucial, all-or-nothing match that is 1-0 midway through the second half, then the team that is behind will open gaps in defence trying to score the equalizer. I often see that team then concede a second or third goal, not because it was that much worse, but because the history of the match led to it.

    But maybe in a qualifier on the last day, the team behind only needs to avoid a two goal defeat. Think of Algeria yesterday and their situation in the match in Egypt. They never really tried to win that match.

    Another indicator is the tables of the European qualifying groups. The goals scored and conceded often do not give a real picture of strength.

    In Group 2, Greece and Israel both scored 20 goals and conceded 10, but Greece finished 4 points above Israel.

    Group 3 has Poland with 3 wins, 2 draws and 5 losses and a goal difference of 19-14, whereas Finland in Group 5 has a goal difference of 14-14 with 5 wins, 3 draws and 2 losses! They got 7 points more than Poland.

  • 20 Marko // Nov 15, 2009 at 1:53 am

    Ok, I should have read it more thoroughly.

  • 21 Ryusuke // Dec 24, 2009 at 10:04 am

    Hi Voros, I have been doing research on soccer score modelling recently, but I am stuck on the time-series part: I would like to apply an exponential decay function, but I have no idea how to calculate the optimal decay rate. Any good ideas?
