This post is basically an update and a revision of the OP of this thread. Any discussions from that thread should continue here.
Before I start, let me say this explicitly:
Smogon Usage Stats are weighted according to Glicko rating, not Elo. A 1500 Elo rating is incredible. A 1500 Glicko rating is the starting value. When I refer to ratings in this thread, it is always referring to Glicko ratings.
Now, moving on...
Unlike in many of my other threads, I'm going to be fairly lax in "pruning" replies to this post--as long as your post is vaguely on topic and doesn't violate any of the rules of the site, I probably won't delete it and will most likely try to respond. What I'm saying is if you just want to make a post that says, "this is a stupid decision, and you're all elitist pigs," go for it. But I hope you'll at least read through this post first and give your responses some thought.
So as many of you may have already read, it was decided this month that the stats used for the OU<->UU cutoff would be the ones calculated off of a baseline of 1695 instead of 1760. That might sound like a relaxing of our standards, but it actually reflects the fact that the way I calculate the stats has changed quite a bit since the last period.
The purpose of this post is to clarify the original decision to base the stats on a higher-than-1500 baseline, explain what's changed since the original announcement and discuss what it all means for the future of usage stats and tiering.
Background
Over a year ago, we decided to move away from using unweighted stats for tiering and instead assign weights to players' teams based on the player's rating. For the full details, read here, but here's the tl;dr version:
- Tiers are first and foremost threatlists: the Pokemon that are classified as OU are all Pokemon that your team *needs* to be able to deal with in order to succeed. It doesn't matter as much if your team gets demolished by, say, Shedinja, if Shedinja only appears in one out of every 200 matches*.
- Building on that reasoning, if that 1 Shedinja in 200 is used solely by players who don't know how to use it, who give it defense investment* and run Giga Impact,** then your team with the huge Shedinja weakness is probably fine even then.
- THUS, the weighting we assign to a player's team is based on the probability--given the player's Glicko rating--that the player is "above average," that is, more skilled than the "average" player, who by definition has a Glicko rating of 1500.
The Problem
This definition worked fine for a time, and it was definitely an improvement over our previous, non-weighted stats system. But the problem was that Pokemon Showdown's popularity continued to grow, and while we are certainly seeing more competitive players battling on Showdown than ever before, the fact of the matter is that the overwhelming majority of players on Pokemon Showdown are not competitive players. This sounds elitist to say, but let me clarify: I'm not saying the problem is that players are using ineffective teams, where EV spreads are sub-optimal or where the players are using subpar sets or moves because they don't know any better--I'm saying the problem is players who have no desire to be competitive, who run monotype teams, Metronome teams, Anime-based teams (Blastoise's top teammate is Snorlax, with Pikachu at #4*) and teams with Hyper Beam variants (4%--weighted--of Sylveon run Hyper Beam, as do 5% of Heliolisk and 14% of Porygon-Z*). These players are not interested in being "the very best"--they're merely interested in having fun. And that's fine for them, but they should not be influencing our tiers, because a player who's in it to win will almost always beat a player who's just playing for kicks. So it doesn't matter if your team can't handle a good Blastoise if the only Blastoise you're actually seeing runs Hydro Cannon**.
Raising the Cutoff to 1760
To combat this problem, we are entertaining several solutions. My favorite is that we establish an "unrated" OU ladder on Showdown where players could play if they're not interested in laddering. But this does nothing for us in the short-term. In the short-term, the best and easiest solution was simply to raise the cutoff to the level we believed corresponds to the strength of the "average competitive" player, which we chose to be 1760 for OU, though that was strictly a judgement call.
Weights and relative contributions to the stats
At the end of the day, the purpose of this decision was to minimize the relative contribution of "bad" players and maximize that of good ones.
I put some calcs into the previous thread's OP:
- The player at the top of the OU ladder right now has a rating of 1933±28, which translates to a weight of 1.0 for both 1500 and 1760 stats, meaning each time that player battles, his or her team contributes roughly 0.00007% to the 1500 stats and 0.002% to the 1760 stats.
- Compared to most "competitive" teams, my OU team is subpar (I'm working on it**). My current rating is 1711±45*. My weighting for 1500 stats is also 1, while my weighting for 1760 is 0.138. This means that one of my battles contributes roughly 0.00007% to the 1500 stats and 0.0003% to the 1760 stats.
- A new player just starting out has a rating of 1500±130. That player's weight is 0.5 under 1500 and 0.0228 for 1760. One battle by such a player contributes roughly 0.00004% to the 1500 stats and roughly 0.00005% to the 1760 stats.
Here's the take-away: unless you're at the very top or the bottom half of the ladder, your contribution to the stats won't change much (if anything it'll rise slightly). But if you're a bad player, your contribution will be removed, and if you're one of the very best players, your contribution will mean a lot, lot more.
That sounded nice at the time, but that last bullet proved to be this particular decision's undoing.
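For anyone who wants to check the weights quoted in those bullets, here is a minimal sketch. It assumes the weight is the probability that a player's "true rating" exceeds the baseline, with the Glicko rating treated as a normal distribution centred on R with standard deviation RD. That assumption does reproduce the 0.138 and 0.0228 figures above, but the function name and layout are illustrative only, not the actual stats scripts.

```python
import math


def weight(rating: float, rd: float, baseline: float) -> float:
    """Probability that the true rating is above the baseline.

    Assumption: true rating ~ Normal(rating, rd**2), so the weight is a
    normal CDF evaluated at (rating - baseline) / rd.
    """
    z = (rating - baseline) / (rd * math.sqrt(2))
    return 0.5 * (1.0 + math.erf(z))


# The three players from the bullets above: (label, R, RD).
players = [
    ("top of ladder", 1933, 28),
    ("1711±45", 1711, 45),
    ("brand new alt", 1500, 130),
]

for label, r, rd in players:
    w1500 = weight(r, rd, 1500)
    w1760 = weight(r, rd, 1760)
    print(f"{label}: weight@1500={w1500:.4f}, weight@1760={w1760:.4f}")
```

Note that this only gives the weight of a single team. The per-battle percentage contributions in the bullets also depend on the total weight of every team on the ladder, which this sketch does not model.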
Candles
As I said, the choice of 1760 was strictly a judgement call. I had this idea that we could compute metrics called "candles" which would basically be a series of indicators we can look at to objectively assess at which points gimmicks and non-competitive teams "fall away." In Little Cup, an example would be the use of Leftovers, Sitrus Berry and Assault Vest, which never have any competitive use in the metagame. These candles are things that are completely unambiguous. Donphan being used when there are better options? Not a candle. Donphan carrying Giga Impact?** Yes.
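To make the idea concrete, here is a rough sketch of how a single "candle" could be computed at different baselines. Everything here is an assumption for illustration: the `(rating, rd, has_gimmick)` tuple layout, the made-up sample data, and the normal-CDF weight function. It is one possible reading of the metric, not the code behind the real curves.

```python
import math


def weight(rating: float, rd: float, baseline: float) -> float:
    """Assumed weighting: P(true rating > baseline) under a normal model."""
    z = (rating - baseline) / (rd * math.sqrt(2))
    return 0.5 * (1.0 + math.erf(z))


def candle_value(team_instances, baseline: float) -> float:
    """Weighted share of team-instances that show the gimmick.

    team_instances: iterable of (rating, rd, has_gimmick) tuples
    (a hypothetical layout chosen for this example).
    """
    total = 0.0
    flagged = 0.0
    for rating, rd, has_gimmick in team_instances:
        w = weight(rating, rd, baseline)
        total += w
        if has_gimmick:
            flagged += w
    return flagged / total if total > 0 else 0.0


# Made-up data: one new alt running the gimmick, three teams that don't.
sample = [
    (1500, 130, True),
    (1500, 130, False),
    (1695, 25, False),
    (1800, 30, False),
]

for baseline in (1500, 1630, 1695, 1760):
    print(baseline, round(candle_value(sample, baseline), 4))
```

Sweeping the baseline like this and plotting the result is what I mean by a "candle" curve.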
Here's the thing: I expected that as the baseline rose, the "candle" values would go down. A nice hypothesis, but it turned out to be wrong. In many cases, the "candle" value actually went up with increased baseline.
The reason? As I quoted above, the contribution of brand new alts actually increases as baseline increases. This is because, if the baseline is 1760, an unknown / brand new alt has a greater chance of belonging to a player whose "true rating" is at least 1760 than does an alt whose current Glicko rating is 1695±25. This is a fact--that 1695 player has demonstrated unambiguously that he or she is not quite as skilled as we're looking for, but the unknown player could be anyone. You can see already that this presents a problem in "gaming" the ladder--if you want to contribute to the stats, but you're not good enough to get a Glicko R of at least 1760, then the best thing you can do is reset alts after every battle.
But even leaving off players who are purposefully trying to game the system, the fact is that about half of alts currently on the OU ladder belong to players who have played five or fewer matches. And even if 1% of those players have a "true rating" above 1760, that means that for every "good" team we get from players at that level, we're also collecting 99 "bad" teams. And that led to stats that were, to be blunt, garbage.
Maximum RD and the Awesome Size of Showdown
I considered a few different schemes to salvage the weighted stats system, and what I ended up deciding on was a relatively simple fix: throw out all teams belonging to players whose RD (rating deviation) is greater than 100. This means that a player with rating 3/4 of the way to the baseline will always be weighted more heavily than a player with an R of 1500 (the starting rating).
From a practical perspective, this means throwing out the first five or so battles of any new alt.
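Here is a small sketch of the cutoff, plus a check of the "3/4 of the way to the baseline" claim. The weight function is the assumed normal-CDF form, and `MIN_RD = 25` is my own assumption about the lowest RD a player can reach; it is only there to test the worst case for the 3/4 player.

```python
import math

MAX_RD = 100  # teams from players with RD above this are discarded
MIN_RD = 25   # assumed lower bound on RD, used only for the check below


def weight(rating: float, rd: float, baseline: float) -> float:
    """Assumed weighting: P(true rating > baseline) under a normal model."""
    z = (rating - baseline) / (rd * math.sqrt(2))
    return 0.5 * (1.0 + math.erf(z))


def filtered_weight(rating: float, rd: float, baseline: float) -> float:
    """Weight after the RD cutoff; the cutoff only applies above 1500."""
    if baseline > 1500 and rd > MAX_RD:
        return 0.0
    return weight(rating, rd, baseline)


baseline = 1695
three_quarters = 1500 + 0.75 * (baseline - 1500)

# A brand new alt (1500±130) is now thrown out entirely.
print("new alt:", filtered_weight(1500, 130, baseline))

# Best case for a 1500-rated player who survives the cutoff (RD = 100)
# versus worst case for the 3/4 player (RD at the assumed minimum).
print("1500 player, RD 100:", filtered_weight(1500, MAX_RD, baseline))
print("3/4 player, RD 25:", filtered_weight(three_quarters, MIN_RD, baseline))
```

Under these assumptions the two printed weights come out essentially equal, and any larger RD for the 3/4 player or smaller RD for the 1500 player only widens the gap in the 3/4 player's favor.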
This might or might not sound reasonable to you, but the fact is that it was a drastic policy change: where previously I was simply reducing the contribution made by bad players, now I'm explicitly throwing out data. And as I described in the above section, there are likely more than a few babies in with that bathwater.
I would not have even considered a decision this drastic a year ago. But given the fact that OU sees about 2.5 million battles each month, and even "unpopular" metagames like Little Cup get more than a battle a minute on average, "throwing out" data isn't the worst thing in the world, even if it does make a data scientist like me cringe a little. And it certainly produced the desired results.
Below are the "candle" curves for OU
(note that "no item" does not include Pokemon that can potentially learn Acrobatics)
Up until a very high level, each "candle" "diminishes in intensity" the higher we raise the baseline. Beyond ~1950, we encounter issues where there are simply no players with Glicko ratings that high, so everyone ends up getting weighted equally (badly). This is almost exactly the behavior I was expecting to find when I came up with the idea of "candles."
Note also the sharp drop from 1500 to 1510, the first data point above 1500: that's because I only start throwing out teams with high RD when the baseline is greater than 1500. Why is the dropoff so sharp? I theorize it's because a lot of players make new alts for each team, and typos happen, but are usually detected within the first few matches. So "no item" especially--if you left that Choice Scarf off your Garchomp, you're gonna figure it out pretty quickly. Anyway, that's one hypothesis. Another is that bad players rarely stick around past the first few matches. But either way, the system works.
"Lowering" the baseline
Now that I was explicitly throwing out so many battles, I needed to counterbalance that by lowering the baselines--we didn't want to get to the point where the stats were being determined by the whims of an extremely small group of players, no matter how highly rated.
To put it another way, I was faced with two competing interests: a desire to remove as much "bad stuff" as possible from the usage stats, and a desire to base the stats off of as large a number of teams as possible.
To quantify this second measure, I defined a quantity called "median team-instances," which is the smallest number of team-instances (# of battles x2) needed to make up 50% of the stats.
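As a sketch of that definition: sort the team-instances from most to least heavily weighted and count how many it takes to reach half of the total weight. The function name and the toy weights are made up for illustration, and this is my reading of "smallest number," not the production code.

```python
def median_team_instances(weights) -> int:
    """Smallest number of team-instances whose weights reach 50% of the total.

    Taking the heaviest team-instances first gives the smallest possible count.
    """
    total = sum(weights)
    if total <= 0:
        return 0
    running = 0.0
    for count, w in enumerate(sorted(weights, reverse=True), start=1):
        running += w
        if running >= total / 2:
            return count
    return len(weights)


# Toy example: two fully weighted teams dominate three lightly weighted ones.
toy_weights = [1.0, 1.0, 0.5, 0.25, 0.25]
print(median_team_instances(toy_weights))
```

The more the weight is concentrated in a handful of highly rated players, the smaller this number gets, which is exactly the thing we wanted to keep an eye on.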
Here's a graph showing how median team-instances decreases as the baseline increases:
Based on this data, and after careful consultation with the upper echelons of the Smogon staff, I ended up choosing baselines of 1695 / 1630 / 1630 for OU, UU and RU, which correspond to ~150k / 75k / 40k median team-instances, respectively. What that means is that if you play 20 battles a day on the OU ladder, and your rating is just about 1695, then over the course of a month (30 days), your team(s) make up 0.2% of the stats. Not too shabby!
What Does This Mean For You?
The bottom line is that by raising the cutoff from 1500, the OU-UU list better reflects the competitive OU metagame, especially if I throw out contributions from alts with a small number of battles. And that's better for all involved: better for the OU player who's trying to decide who needs countering and better for the UU player, whose banlist better removes major threats. It means that if you're a competitive player, your contribution to our tiers will likely increase, and if you're not, then, well, you're probably not reading this thread.
It might seem unfair that your first five or so battles on an alt in a metagame don't get counted, but again, if you're truly a competitive player, you're going to be playing a lot more than five matches, and in the long run, it'll work out that those battles you had after those first five will contribute a lot more to the stats than your entire record would have otherwise.
I'm not going to publish any new calcs regarding percentage contribution to the stats per battle, but the take-away is still: unless you're at the very top or the bottom half of the ladder, your contribution to the stats won't change much (if anything it'll rise slightly). But if you're a bad player, your contribution will be removed, and if you're one of the very best players, your contribution will mean a lot, lot more.
Any Questions?
As I said at the top of the post, feel free to ask anything. I'm not going to be nearly as ruthless in terms of deleting posts as I have been in my other threads.
Footnotes
*This is actually true
**This is exaggerated