Thanks to Barry Revzin for this analysis.
With the first four matches of the second season of the ISL in the books, we’ve now seen all ten teams compete and can start to form a real picture of the relative strengths and weaknesses of the teams. We already put together a set of qualitative rankings in the league, so we thought we’d also take a quantitative approach and see how things turn out.
The chess world has long used a ratings system called Elo (named after its creator Arpad Elo). Elo is based on giving every player a rating, where the difference in two players’ ratings gives a sense of what the relative win probability would be if they played against each other. A player with a rating 100 points higher than their opponent would be expected to win 64% of the time. 200 points higher and they would be expected to win 76% of the time. After a match, the two players’ ratings are adjusted based on the outcome — but in such a way as to be weighted based on this difference. If Magnus Carlsen beats me in a game of chess, as he would be expected to basically 100.0% of the time, his rating wouldn’t go up and mine wouldn’t go down. But if I were to somehow beat him (say, if he falls unconscious early enough in the match but is still able to erratically move pieces legally), then my rating would go up fairly dramatically based on that new information. Lots of other sports use Elo or Elo-based ratings systems. 538’s forecasts for NFL, NBA, and MLB games are also based on Elo.
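The win percentages quoted above fall straight out of the Elo formula. Here is a quick sketch (the specific 1500-point ratings and the K-factor of 32 are just common illustrative defaults, not anything tied to the ISL):

```python
def elo_expected(rating_a: float, rating_b: float) -> float:
    """Probability that player A beats player B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

def elo_update(rating_a: float, rating_b: float, a_won: bool, k: float = 32.0):
    """Post-match ratings; K controls how fast ratings move."""
    expected_a = elo_expected(rating_a, rating_b)
    score_a = 1.0 if a_won else 0.0
    delta = k * (score_a - expected_a)
    return rating_a + delta, rating_b - delta

# A 100-point favorite is expected to win about 64% of the time,
# a 200-point favorite about 76%.
print(round(elo_expected(1600, 1500), 2))  # → 0.64
print(round(elo_expected(1700, 1500), 2))  # → 0.76
```

Note how the update is weighted by expectation: a heavy favorite who wins gains almost nothing, while an upset moves both ratings dramatically, which is exactly the Carlsen scenario described above.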
But Elo is based solely on 1-on-1 matchups. The ISL doesn’t have those; its matches are 1-on-1-on-1-on-1. So we can’t use Elo (at least, not directly). Instead, we’ll turn to Microsoft, which needed to solve this problem for its Xbox Live ratings. The system it came up with, also a Bayesian probability model in the same vein as Elo, is called TrueSkill. TrueSkill extends Elo by tracking two values – a player’s rating and a degree of uncertainty about that rating – and it also works in more kinds of configurations, such as the one we need. For those interested in a long description of how TrueSkill works, I would recommend this blog or, if you want more of the math behind it, this description.
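To give a flavor of the machinery, here is a sketch of the simplest TrueSkill special case: the closed-form update for a 1-on-1 match with no draw. The starting values (μ = 25, σ = 25/3, β = 25/6) are the conventional Xbox Live defaults; the full system generalizes this to draws and multi-team races via a factor graph, which is what the ISL case actually requires.

```python
import math

def pdf(x):
    """Standard normal density."""
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

def cdf(x):
    """Standard normal cumulative distribution."""
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def trueskill_update(winner, loser, beta=25 / 6):
    """Two-player, no-draw TrueSkill update.

    winner/loser are (mu, sigma) pairs; returns the updated pairs.
    """
    mu_w, sig_w = winner
    mu_l, sig_l = loser
    c = math.sqrt(2 * beta ** 2 + sig_w ** 2 + sig_l ** 2)
    t = (mu_w - mu_l) / c
    v = pdf(t) / cdf(t)   # mean-shift factor: large for an upset
    w = v * (v + t)       # variance-shrink factor
    new_mu_w = mu_w + sig_w ** 2 / c * v
    new_mu_l = mu_l - sig_l ** 2 / c * v
    new_sig_w = sig_w * math.sqrt(1 - sig_w ** 2 / c ** 2 * w)
    new_sig_l = sig_l * math.sqrt(1 - sig_l ** 2 / c ** 2 * w)
    return (new_mu_w, new_sig_w), (new_mu_l, new_sig_l)

# Two unknown players with the conventional starting values.
(w_mu, w_sig), (l_mu, l_sig) = trueskill_update((25.0, 25 / 3), (25.0, 25 / 3))
```

The key difference from Elo is that both sigmas shrink after every match: the system becomes more confident about both players, win or lose, and a confident rating moves less on new results.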
So far though, we’ve only had 4 matches, and most teams have only competed once, so we don’t really have enough data points for TrueSkill to give a meaningful result (the system would need at least 5 matches per team, which we will get to by the end of the season). But let’s find out where we’re at anyway. I’ve chosen a starting set of values to preserve the original Elo meaning – a difference of 200 points equates to a win probability of 76%. That gives us:
Note that TrueSkill, just like Elo, has no notion of margin of victory. A win is a win, so the fact that London won by an enormous amount doesn’t affect their rating any more than Cali’s lower-margin win over Energy Standard. The third column here, sigma, measures the uncertainty in the rating, which, as you can see, is quite large for basically every team. You should expect it to narrow over time.
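For the curious, here is one way the scale mentioned above could be set (a sketch of the idea, not necessarily the exact procedure used): under the Gaussian skill model, a player whose mean rating is d points higher wins with probability Φ(d / (√2·β)) once the rating uncertainties are negligible, so we can solve for the β that maps a 200-point gap to 76%.

```python
import math

def cdf(x):
    """Standard normal cumulative distribution."""
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def cdf_inv(p, lo=-10.0, hi=10.0):
    """Invert the normal cdf by bisection (plenty accurate for this)."""
    for _ in range(100):
        mid = (lo + hi) / 2
        if cdf(mid) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

# Solve 0.76 = cdf(200 / (sqrt(2) * beta)) for beta.
beta = 200 / (math.sqrt(2) * cdf_inv(0.76))
print(round(beta))  # → 200
```

It is a pleasant coincidence of the numbers that the β preserving “200 points ⇒ 76%” comes out at almost exactly 200 itself.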
But I’m quite impatient and I don’t want to wait several more weeks. Can’t we do better now?
Last week we wrote about what would happen if we swapped London and NY in the first matchup, and how those two new matchups would look. That’s two new matches, in addition to the two we had in real life. But there are a lot more matches we could look at. In fact, there are 70: one for every way of choosing 4 of the 8 teams that competed that week. We could simulate all 70 of those possible matchups and produce ratings for the teams based on those new results. Since mostly we’re just swapping times around and re-ranking, those simulations are going to be pretty accurate. We just have to make a few assumptions:
1) Specific races would play out the same regardless of opponent.
2) Team lineups would be largely similar regardless of opponent.
3) Teams that win a medley relay would pick the same Skins they did, regardless of opponent. For teams that didn’t win the medley relay in real life but would in a simulated match, we’re going to guess what stroke they would choose.
4) If in a matchup, we end up picking a stroke for Skins that a team did not swim in real life, we’re going to first fall back on the previous weeks’ results, and then fall back from that to the individual 50m race.
The first assumption seems fairly reasonable. The second probably holds for Day 1 of the match, but teams are clearly going to choose their Day 2 lineups based on the specific choice of Skins stroke, and I’m not even trying to account for that. Any attempt at simulating Skins is going to be inherently noisy and error-prone. The third and fourth assumptions are really best-effort.
To attempt to reduce at least some of the noise, I’m going to consider close meets (within 20 points) as ties.
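Putting those assumptions together, the simulation loop itself is straightforward. This sketch uses a hypothetical `times[team][event]` structure and a placeholder points table (real ISL scoring, with jackpots, relays, and Skins, is considerably more involved), but it shows the shape of the computation, including the 20-point tie margin:

```python
from itertools import combinations

# Hypothetical shape: times[team][event] = list of that team's recorded
# times in the event. The placeholder table awards points by finish
# order across the 8 lanes of a 4-team event (2 swimmers per team).
POINTS = [9, 7, 6, 5, 4, 3, 2, 1]
TIE_MARGIN = 20  # meets decided by 20 points or fewer count as ties

def score_match(teams, times, events):
    """Re-score one hypothetical matchup by pooling recorded times."""
    totals = {t: 0 for t in teams}
    for event in events:
        field = [(time, team) for team in teams
                 for time in times[team][event]]
        for place, (_, team) in enumerate(sorted(field)[:len(POINTS)]):
            totals[team] += POINTS[place]
    return totals

def simulate_all(teams, times, events):
    """Score every possible 4-team matchup among the given teams."""
    results = []
    for matchup in combinations(teams, 4):
        totals = score_match(matchup, times, events)
        ranked = sorted(totals.items(), key=lambda kv: -kv[1])
        tied = ranked[0][1] - ranked[1][1] <= TIE_MARGIN
        results.append((matchup, ranked, tied))
    return results
```

Each simulated result (a win, or a tie under the 20-point rule) then gets fed into TrueSkill exactly as if it were a real match.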
Given all of the above, we now have 140 matches to consider instead of just 4 (70 from each of the two weeks), which lets us really get a sense of how all the teams compare to each other. Doing so, we get the following ratings:
These rankings line up pretty closely with SwimSwam’s Power Ratings (the only difference is that here Toronto is ahead of Iron and NY), but they additionally give a sense of the margins between teams. The Cali Condors, thus far, seem well ahead of Energy Standard, who are themselves well ahead of LA, Tokyo, and London. We can expect a fairly close three-way battle among those three teams for the remaining two finals slots. And at the bottom, the DC Trident and Aqua Centurions seem very unlikely to make it into the semifinals.
In contrast, the ISL has its own ratings system, based on summing the ratings of each team’s individual swimmers. It suffers somewhat from the problem described earlier – we’ve only had 4 matchups, and London in particular has not yet faced one of the top teams. As a result, in the ISL’s ratings, London is the second-ranked team (just as it was in the table above), with Energy, Tokyo, and LA rounding out the top 5.
We’ll keep updating these numbers as the season progresses.