Unlike in many of my other threads, I'm going to be fairly lax in "pruning" replies to this post--as long as your post is vaguely on topic and doesn't violate any of the rules of the site, I probably won't delete it and will most likely try to respond. What I'm saying is if you just want to make a post that says, "this is a stupid decision, and you're all elitist pigs," go for it. But I hope you'll at least read through this post first and give your responses some thought.
So I realize I'm a little late in making this post, but better late than never, right?
As many of you are already aware, UU is soon to be official, and it was decided that the initial UU banlist will be based not off of "standard" stats but off of "1760" stats.
The purpose of this post is to clarify that decision, what it means, and what it means for the future of usage stats and tiering.
Background
Just over a year ago, we decided to move away from using unweighted stats for tiering and instead assign weights to players' teams based on the player's rating. For the full details, read here, but here's the tl;dr version:
- Tiers are first and foremost threatlists: the Pokemon that are classified as OU are all Pokemon that your team *needs* to be able to deal with in order to succeed. It doesn't matter as much if your team gets demolished by, say, Shedinja, if Shedinja only appears in one out of every 200 matches*.
- Building on that reasoning, if that 1 Shedinja in 200 is used solely by players who don't know how to use it, who give it defense investment* and run Giga Impact,** then your team with the huge Shedinja weakness is probably fine even then.
- THUS, the weighting we assign to a player's team is based on the probability--given the player's Glicko rating--that the player is "above average," that is, more skilled than the "average" player, who by definition has a Glicko rating of 1500.
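To make that last point concrete, here is a minimal sketch of how such a weight could be computed. It assumes the player's true skill is normally distributed around the Glicko rating, with the rating deviation as the standard deviation. The function name `team_weight` and its parameters are made up for illustration and are not taken from the actual stats scripts.

```python
import math


def team_weight(rating: float, deviation: float, cutoff: float = 1500.0) -> float:
    """Probability that a player's true skill exceeds `cutoff`.

    Assumption: true skill ~ Normal(rating, deviation**2). This is an
    illustrative model, not necessarily what the real stats scripts do.
    """
    # Normal CDF written with the error function:
    # P(skill > cutoff) = 0.5 * (1 + erf((rating - cutoff) / (deviation * sqrt(2))))
    z = (rating - cutoff) / (deviation * math.sqrt(2.0))
    return 0.5 * (1.0 + math.erf(z))


if __name__ == "__main__":
    # A player sitting exactly on the cutoff is a coin flip.
    print(team_weight(1500, 130))  # 0.5
```

Under this model, a player whose rating sits well above the cutoff with a small deviation gets a weight close to 1, and a player well below it gets a weight close to 0.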
The Problem
This definition worked fine for a time, and it was definitely an improvement over our previous, non-weighted stats system. But the problem was that Pokemon Showdown's popularity continued to grow, and while we are certainly seeing more competitive players battling on Showdown than ever before, the fact of the matter is that the overwhelming majority of players on Pokemon Showdown are not competitive players. This sounds elitist to say, but let me clarify: I'm not saying the problem is that players are using ineffective teams, where EV spreads are sub-optimal or where the players are using subpar sets or moves because they don't know any better--I'm saying the problem is players who have no desire to be competitive, who run monotype teams, Metronome teams, anime-based teams (Blastoise's top teammate is Snorlax, with Pikachu at #4*) and teams with Hyper Beam variants (4%--weighted--of Sylveon run Hyper Beam, as do 5% of Heliolisk and 14% of Porygon-Z*). These players are not interested in being "the very best"--they're merely interested in having fun. And that's fine for them, but they should not be influencing our tiers, because a player who's in it to win will almost always beat a player who's just playing for kicks. So it doesn't matter if your team can't handle a good Blastoise if the only Blastoise you're actually seeing runs Hydro Cannon.**
Raising the Cutoff to 1760
To combat this problem, we are entertaining several solutions. My favorite is that we establish an "unrated" OU ladder on Showdown where players could play if they're not interested in laddering. But this does nothing for us in the short-term. In the short-term, the best and easiest solution is simply to raise the cutoff to the level we believe corresponds to the strength of the "average competitive" player.
Unfortunately, this is strictly a judgement call. Moving forward, I'll be producing some metrics that I'm calling "candles of known brightness," basically a series of indicators we can look at to objectively assess at which points gimmicks and non-competitive teams "fall away." In Little Cup, an example would be the use of Leftovers, Sitrus Berry and Assault Vest, which never have any competitive use in the metagame. These will be things that are completely unambiguous. Donphan being used when there are better options? Not a candle. Donphan carrying Giga Impact?** Yes.
In the short-term, however, we're just doing it by feel, and what felt right this month was 1760.
This means the cutoff won't always be 1760. Some months it might be lower, some higher. Nor will the cutoff be the same for all metagames (I suspect OU will have the highest cutoff by far). And the decision of where to put the cutoff won't be made by me, but rather by the leaders or councils of each metagame. It'll work like this: each month, I'll provide the councils with a series of stats as well as metrics for "candle intensity," and they'll use those figures to decide where the cutoff will lie that month. Note that I will *not* be providing tiering councils with the list of changes to the tiers that would result from their decision--the justification for choosing a cutoff should never be "I want X out of UU but want UU to keep Y, so I'm choosing this number." The rationale has to be that the cutoff was chosen because these are the stats that best reflect the state of the competitive metagame.
An Aside: Monthly Usage Stats
As many of you have noticed and commented on, the usage stats threads have gotten a bit unmanageable--there are too many tiers and too many analysis types to fit neatly in one thread, and this problem is only going to get worse if I start generating stats at three or four cutoff levels each month. So starting with March's stats, instead of making a stats thread, I'll be putting all the stats at all the levels on a web server (which I actually do already) and then just linking to the web folder by way of an announcement.
Individual metagames can decide if they want dedicated threads for their tier posted in their subforums for the purpose of discussion, at which point the decision can be made about which cutoff(s) to post.
What Does This Mean For You?
The bottom line is that by raising the cutoff from 1500, the OU-UU list will better reflect the competitive OU metagame, and that's better for all involved: better for the OU player who's trying to decide who needs countering and better for the UU player, whose banlist better removes major threats. It means that if you're a competitive player, your contribution to our tiers will likely increase, and if you're not, then, well, you're probably not reading this thread.
I'd like to close with some sample calculations to give you an idea of how individual battles influence the usage stats.
Note that there were 2,549,546 OU battles last month on the ladder. (That's a lot.) Using a cutoff of 1500, the "average weight" was 0.559. With a cutoff of 1760, that number drops to 0.016. These numbers mean that the sum of all weights was roughly 1,420,000 for 1500 and 40,800 for 1760.
Thus:
- The player at the top of the OU ladder right now has a rating of 1933±28, which translates to a weight of 1.0 for both 1500 and 1760 stats, meaning each time that player battles, his or her team contributes roughly 0.00007% to the 1500 stats and 0.002% to the 1760 stats.
- Compared to most "competitive" teams, my OU team is subpar (I'm working on it**). My current rating is 1711±45*. My weighting for 1500 stats is also 1, while my weighting for 1760 is 0.138. This means that one of my battles contributes roughly 0.00007% to the 1500 stats and 0.0003% to the 1760 stats.
- A new player just starting out has a rating of 1500±130. That player's weight is 0.5 under 1500 and 0.0228 for 1760. One battle by such a player contributes roughly 0.00004% to the 1500 stats and roughly 0.00005% to the 1760 stats.
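For anyone who wants to check the arithmetic, the figures above can be roughly reproduced with the sketch below. It assumes the weight is the normal-distribution probability that a player's true skill exceeds the cutoff, and it uses the rounded weight totals quoted above. The names `team_weight`, `TOTAL_WEIGHT` and `contribution_percent` are hypothetical and not part of the real stats scripts.

```python
import math


def team_weight(rating: float, deviation: float, cutoff: float) -> float:
    """P(true skill > cutoff), assuming skill ~ Normal(rating, deviation**2)."""
    z = (rating - cutoff) / (deviation * math.sqrt(2.0))
    return 0.5 * (1.0 + math.erf(z))


# Sum of all weights last month, using the rounded totals quoted in the post.
TOTAL_WEIGHT = {1500: 1_420_000, 1760: 40_800}


def contribution_percent(rating: float, deviation: float, cutoff: int) -> float:
    """Percentage of the stats contributed by a single battle from this player."""
    return 100.0 * team_weight(rating, deviation, cutoff) / TOTAL_WEIGHT[cutoff]


if __name__ == "__main__":
    # (rating, deviation) pairs taken from the three examples above.
    players = {
        "top of ladder": (1933, 28),
        "my OU team": (1711, 45),
        "new player": (1500, 130),
    }
    for name, (rating, deviation) in players.items():
        for cutoff in (1500, 1760):
            weight = team_weight(rating, deviation, cutoff)
            share = contribution_percent(rating, deviation, cutoff)
            print(f"{name:>14} @ {cutoff}: weight={weight:.4f}, share={share:.6f}%")
```

Because the totals are rounded, the printed shares will only match the figures above to about one significant digit.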
Here's the take-away: unless you're at the very top or in the bottom half of the ladder, your contribution to the stats won't change much (if anything it'll rise slightly). But if you're a bad player, your contribution will be removed, and if you're one of the very best players, your contribution will mean a lot, lot more.
Any Questions?
As I said at the top of the post, feel free to ask anything. I'm not going to be nearly as ruthless in terms of deleting posts as I have been in my other threads.
Footnotes
*This is actually true
**This is exaggerated