The decision to base UU off of 1695 stats

Antar · Jun 4, 2014

This post is basically an update and a revision of the OP of this thread. Any discussions from that thread should continue here.

Before I start, let me say this explicitly:

Smogon Usage Stats are weighted according to Glicko rating, not Elo. A 1500 Elo rating is incredible. A 1500 Glicko rating is the starting value. When I refer to ratings in this thread, it is always referring to Glicko ratings.

Now, moving on...

Unlike in many of my other threads, I'm going to be fairly lax in "pruning" replies to this post--as long as your post is vaguely on topic and doesn't violate any of the rules of the site, I probably won't delete it and will most likely try to respond. What I'm saying is if you just want to make a post that says, "this is a stupid decision, and you're all elitist pigs," go for it. But I hope you'll at least read through this post first and give your responses some thought.

So as many of you may have already read, it was decided this month that the stats used for the OU<->UU cutoff would be the ones calculated off of a baseline of 1695 instead of 1760. That might sound like a relaxing of our standards, but it actually reflects the fact that the way I calculate the stats has changed quite a bit since the last period.

The purpose of this post is to clarify the original decision to base the stats on a higher-than-1500 baseline, explain what's changed since the original announcement, and discuss what it all means for the future of usage stats and tiering.

Background

Over a year ago, we decided to move away from using unweighted stats for tiering and instead assign weights to players' teams based on the player's rating. For the full details, read here, but here's the tl;dr version:

Tiers are first and foremost threatlists: the Pokemon that are classified as OU are all Pokemon that your team *needs* to be able to deal with in order to succeed. It doesn't matter as much if your team gets demolished by, say, Shedinja, if Shedinja only appears in one out of every 200 matches*.
Building on that reasoning, if that 1 Shedinja in 200 is used solely by players who don't know how to use it, who give it defense investment* and run Giga Impact,** then your team with the huge Shedinja weakness is probably fine even then.
THUS, the weighting we assign to a player's team is based on the probability--given the player's Glicko rating--that the player is "above average," that is, more skilled than the "average" player, who by definition has a Glicko rating of 1500.
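As a rough sketch of that weighting: if a player's true skill is treated as normally distributed around the Glicko rating R with standard deviation RD, then the probability of being above a baseline B is the normal CDF evaluated at (R - B) / RD. The normal-distribution reading and the function name `weight` are illustrative assumptions, not taken from the actual stats scripts.

```python
import math

def weight(rating: float, deviation: float, baseline: float = 1500.0) -> float:
    """Probability that a player's true rating exceeds `baseline`.

    Assumes (for illustration) that true skill ~ Normal(rating, deviation).
    """
    z = (rating - baseline) / deviation
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# A brand-new player (1500 +/- 130) is a coin flip against the 1500 baseline.
print(round(weight(1500, 130), 3))   # 0.5
# A well-established 1933 +/- 28 player is all but certain to be above it.
print(round(weight(1933, 28), 3))    # 1.0
```

Under this reading, a low RD matters as much as a high rating: the weight only approaches 1 once the rating sits several RDs above the baseline.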

The Problem

This definition worked fine for a time, and it was definitely an improvement over our previous, non-weighted stats system. But the problem was that Pokemon Showdown's popularity continued to grow, and while we are certainly seeing more competitive players battling on Showdown than ever before, the fact of the matter is that the overwhelming majority of players on Pokemon Showdown are not competitive players. This sounds elitist to say, but let me clarify: I'm not saying the problem is that players are using ineffective teams, where EV spreads are sub-optimal or where the players are using subpar sets or moves because they don't know any better--I'm saying the problem is players who have no desire to be competitive, who run monotype teams, Metronome teams, Anime-based teams (Blastoise's top teammate is Snorlax, with Pikachu at #4*) and teams with Hyper Beam variants (4%--weighted--of Sylveon run Hyper Beam, as do 5% of Heliolisk and 14% of Porygon-Z*). These players are not interested in being "the very best"--they're merely interested in having fun. And that's fine for them, but they should not be influencing our tiers, because a player who's in it to win will almost always beat a player who's just playing for kicks. So it doesn't matter if your team can't handle a good Blastoise if the only Blastoise you're actually seeing runs Hydro Cannon**.

Raising the Cutoff to 1760

To combat this problem, we are entertaining several solutions. My favorite is that we establish an "unrated" OU ladder on Showdown where players could play if they're not interested in laddering. But this does nothing for us in the short-term. In the short-term, the best and easiest solution was simply to raise the cutoff to the level we believed corresponds to the strength of the "average competitive" player, which we chose to be 1760 for OU, though that was strictly a judgement call.

Weights and relative contributions to the stats

At the end of the day, the purpose of this decision was to minimize the relative contribution of "bad" players and maximize that of good ones.

I put some calcs into the previous thread's OP:

Antar said:
The player at the top of the OU ladder right now has a rating of 1933±28, which translates to a weight of 1.0 for both 1500 and 1760 stats, meaning each time that player battles, his or her team contributes roughly 0.00007% to the 1500 stats and 0.002% to the 1760 stats.

Compared to most "competitive" teams, my OU team is subpar (I'm working on it**). My current rating is 1711±45*. My weighting for 1500 stats is also 1, while my weighting for 1760 is 0.138. This means that one of my battles contributes roughly 0.00007% to the 1500 stats and 0.0003% to the 1760 stats.

A new player just starting out has a rating of 1500±130. That player's weight is 0.5 under 1500 and 0.0228 for 1760. One battle by such a player contributes roughly 0.00004% to the 1500 stats and roughly 0.00005% to the 1760 stats.

Here's the take-away: unless you're at the very top or the bottom half of the ladder, your contribution to the stats won't change much (if anything it'll rise slightly). But if you're a bad player, your contribution will be removed, and if you're one of the very best players, your contribution will mean a lot, lot more.
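The percentages quoted above are consistent with each battle's share being its weight divided by the total weight of all team-instances in the month. The sketch below back-computes the two totals from the top player's figures (weight 1.0, shares of 0.00007% and 0.002%) and applies them to the other two players. The totals are inferred from rounded percentages, so treat them as order-of-magnitude estimates, and `share_percent` is a hypothetical helper.

```python
def share_percent(weight: float, total_weight: float) -> float:
    """One team-instance's share of the stats, as a percentage."""
    return 100.0 * weight / total_weight

# Totals implied by the top player's quoted shares (weight 1.0 in both cases).
# These are inferred from rounded figures, not published values.
total_1500 = 1.0 / (0.00007 / 100.0)   # roughly 1.4 million
total_1760 = 1.0 / (0.002 / 100.0)     # roughly 50 thousand

players = {
    "1711 +/- 45": {"w1500": 1.0, "w1760": 0.138},
    "1500 +/- 130 (new)": {"w1500": 0.5, "w1760": 0.0228},
}
for name, w in players.items():
    print(name,
          f"{share_percent(w['w1500'], total_1500):.5f}%",
          f"{share_percent(w['w1760'], total_1760):.5f}%")
```

The much smaller total under the 1760 baseline is why a weight of 1.0 is worth so much more there: the same battle is divided by roughly 50 thousand instead of roughly 1.4 million.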

That sounded nice at the time, but that last bullet proved to be this particular decision's undoing.

Candles

As I said, the choice of 1760 was strictly a judgement call. I had this idea that we could compute metrics called "candles" which would basically be a series of indicators we can look at to objectively assess at which points gimmicks and non-competitive teams "fall away." In Little Cup, an example would be the use of Leftovers, Sitrus Berry and Assault Vest, which never have any competitive use in the metagame. These candles are things that are completely unambiguous. Donphan being used when there are better options? Not a candle. Donphan carrying Giga Impact?** Yes.

Here's the thing: I expected that as the baseline rose, the "candle" values would go down. A nice hypothesis, but it turned out to be wrong. In many cases, the "candle" value actually went up with increased baseline.

The reason? As I quoted above, the contribution of brand new alts actually increases as baseline increases. This is because, if the baseline is 1760, an unknown / brand new alt has a greater chance of belonging to a player whose "true rating" is at least 1760 than a player whose current Glicko rating is 1695±25. This is a fact--that 1695 player has demonstrated unambiguously that he or she is not quite as skilled as we're looking for, but the unknown player could be anyone. You can see already that this presents a problem in "gaming" the ladder--if you want to contribute to the stats, but you're not good enough to get a Glicko R of at least 1760, then the best thing you can do is reset alts after every battle.
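To put numbers on that comparison, the sketch below treats the weight as the normal-CDF probability of being above the baseline. That is an assumption here, though it reproduces the 0.0228 figure quoted earlier. `prob_above` is a hypothetical name.

```python
from statistics import NormalDist

def prob_above(rating: float, deviation: float, baseline: float) -> float:
    """P(true rating > baseline), treating skill as Normal(rating, deviation)."""
    return 1.0 - NormalDist(mu=rating, sigma=deviation).cdf(baseline)

BASELINE = 1760
new_alt = prob_above(1500, 130, BASELINE)      # unknown player, wide uncertainty
established = prob_above(1695, 25, BASELINE)   # known player, narrow uncertainty

print(f"new alt:      {new_alt:.4f}")
print(f"1695 +/- 25:  {established:.4f}")
print(f"ratio:        {new_alt / established:.1f}x")
```

Under these assumptions the fresh alt ends up weighted several times more heavily than the established 1695 player, which is exactly the incentive to keep resetting alts.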

But even leaving off players who are purposefully trying to game the system, the fact is that about half of alts currently on the OU ladder belong to players who have played five matches or fewer. And even if 1% of those players have a "true rating" above 1760, that means that for every "good" team we get from players at that level, we're also collecting 99 "bad" teams. And that led to stats that were, to be blunt, garbage.

Maximum RD and the Awesome Size of Showdown

I considered a few different schemes to salvage the weighted stats system, and what I ended up deciding on was a relatively simple fix: throw out all teams belonging to players whose RD>100. This means that a player with rating 3/4 of the way to the baseline will always be weighted more heavily than a player with an R of 1500 (the starting rating).

From a practical perspective, this means throwing out the first five or so battles of any new alt.
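A minimal sketch of that filter, with the "3/4 of the way" claim checked numerically. It assumes the weight is the normal-CDF probability of being above the baseline (which reproduces the 0.138 and 0.0228 figures quoted earlier) and a minimum RD of 25 (suggested by the 1695±25 example above, but still an assumption). The names are hypothetical.

```python
import math

MAX_RD = 100.0   # cutoff from the post
MIN_RD = 25.0    # assumed floor on RD (hypothetical for this sketch)

def weight(rating, deviation, baseline):
    """Normal-approximation weight; 0 for teams thrown out by the RD cutoff."""
    if deviation > MAX_RD:
        return 0.0
    z = (rating - baseline) / deviation
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

baseline = 1760.0
three_quarters = 1500.0 + 0.75 * (baseline - 1500.0)  # 1695 for this baseline

# A brand-new alt (RD 130) is dropped outright.
print(weight(1500, 130, baseline))   # 0.0

# Most favorable surviving 1500 player (RD exactly at the cutoff) versus the
# least favorable player three quarters of the way up (RD at the floor).
worst_case_1500 = weight(1500, MAX_RD, baseline)
tightest_three_quarters = weight(three_quarters, MIN_RD, baseline)
print(worst_case_1500, tightest_three_quarters)
```

At those two extremes the weights tie. With any larger RD for the higher-rated player, or any smaller RD for the 1500 player, the player three quarters of the way to the baseline comes out ahead.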

This might or might not sound reasonable to you, but the fact is that it was a drastic policy change: where previously I was simply reducing the contribution made by bad players, now I'm explicitly throwing out data. And as I described in the above section, there are likely more than a few babies in with that bathwater.

I would not have even considered a decision this drastic a year ago. But given the fact that OU sees about 2.5 million battles each month, and even "unpopular" metagames like Little Cup get more than a battle a minute on average, "throwing out" data isn't the worst thing in the world, even if it does make a data scientist like me cringe a little. And it certainly produced the desired results.

Below are the "candle" curves for OU

(note that "no item" does not include Pokemon that can potentially learn Acrobatics)

Up until a very high level, each "candle" "diminishes in intensity" the higher we raise the baseline. Beyond ~1950, we encounter issues where there are simply no players with Glicko ratings that high, so everyone ends up getting weighted equally (badly). This is almost exactly the behavior I was expecting to find when I came up with the idea of "candles."

Note also the sharp drop from 1500 to 1510, the first data point above 1500: that's because I only start throwing out teams with high RD when the baseline is greater than 1500. Why is the dropoff so sharp? I theorize it's because a lot of players make new alts for each team, and typos happen, but are usually detected within the first few matches. So "no item" especially--if you left that Choice Scarf off your Garchomp, you're gonna figure it out pretty quickly. Anyway, that's one hypothesis. Another is that bad players rarely stick around past the first few matches. But either way, the system works.

"Lowering" the baseline

Now that I was explicitly throwing out so many battles, I needed to counterbalance that by lowering the baselines--we didn't want to get to the point where the stats were being determined by the whims of an extremely small group of players, no matter how highly rated.

To put it another way, I was faced with two competing interests: a desire to remove as much "bad stuff" as possible from the usage stats, and a desire to base the stats off of as large a number of teams as possible.

To quantify this second measure, I defined a quantity called "median team-instances," which is the smallest number of team-instances (# of battles x2) needed to make up 50% of the stats.
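One way to compute such a quantity, as a sketch: sort the team-instance weights from heaviest to lightest and count how many are needed before their running sum reaches half of the total weight. The function name and the toy data are made up for illustration.

```python
def median_team_instances(weights):
    """Smallest number of team-instances whose weights sum to >= 50% of the total."""
    total = sum(weights)
    if total <= 0:
        return 0  # nothing to count (empty or all-zero input)
    running = 0.0
    # Taking the heaviest instances first gives the smallest possible count.
    for count, w in enumerate(sorted(weights, reverse=True), start=1):
        running += w
        if running >= total / 2.0:
            return count
    return len(weights)

# Toy data: two highly weighted instances and eight lightly weighted ones.
toy = [1.0, 1.0] + [0.1] * 8
print(median_team_instances(toy))  # 2
```

A small value means a handful of heavily weighted teams dominate the stats; a large value means the stats rest on a broad base of players.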

Here's a graph showing how median team-instances decreases as the baseline increases:

Based on this data, and after careful consultation with the upper echelons of the Smogon staff, I ended up choosing baselines of 1695 / 1630 / 1630 for OU, UU and RU, respectively, which correspond to ~150k / 75k / 40k median team-instances. What that means is that if you play 20 battles a day on the OU ladder, and your rating is just about 1695, then over the course of a month (30 days), your team(s) make up 0.2% of the stats. Not too shabby!

What Does This Mean For You?

The bottom line is that by raising the cutoff from 1500, the OU-UU list better reflects the competitive OU metagame, especially if I throw out contributions from alts with a small number of battles. And that's better for all involved: better for the OU player who's trying to decide who needs countering and better for the UU player, whose banlist better removes major threats. It means that if you're a competitive player, your contribution to our tiers will likely increase, and if you're not, then, well, you're probably not reading this thread.

It might seem unfair that your first five or so battles on an alt in a metagame don't get counted, but again, if you're truly a competitive player, you're going to be playing a lot more than five matches, and in the long run, the battles you play after those first five will contribute a lot more to the stats than your entire record would have.

I'm not going to publish any new calcs regarding percentage contribution to the stats per battle, but the take-away is still: unless you're at the very top or the bottom half of the ladder, your contribution to the stats won't change much (if anything it'll rise slightly). But if you're a bad player, your contribution will be removed, and if you're one of the very best players, your contribution will mean a lot, lot more.

Any Questions?

As I said at the top of the post, feel free to ask anything. I'm not going to be nearly as ruthless in terms of deleting posts as I have been in my other threads.

Footnotes

*This is actually true
**This is exaggerated

Shog · Jul 9, 2014

I don't get this. First, Pokemon Tiers weren't established because "Tiers are first and foremost threatlists"; they were made because some good Pokemon were, well, OVERUSED (*shock*!). Furthermore, I have played competitively since Gen 4 and linger around 1400 - 1500. So what? That doesn't count? For example, I try to establish that Type-Resist Berries actually are pretty damn useful, simply because without Frisk you can't actually see them (unlike the Balloon item). Saving yourself from priority is damn good.

How am I supposed to give my Moveset input in the OU Metagame if I don't play as much as others? I mean come on. I have lots of battles, but that doesn't count because my rating changes obviously and they don't make the cut? Elitism at its best.

Antar · Jul 9, 2014

Let's break this down:

Shog said:
Furthermore, I have played competitively since Gen 4 and linger around 1400 - 1500. So what? That doesn't count?

Is that 1400 Elo (which is what gets displayed when you win or lose a match) or 1400 Glicko? Glicko is what's used for the weighting. You can pull it up if you type /rating, but if your Elo rating is 1400, you're almost certainly above the threshold.

First, Pokemon Tiers weren't established because "Tiers are first and foremost threatlists"; they were made because some good Pokemon were, well, OVERUSED (*shock*!).

I think we're saying the same thing. Read the "What is OU?" section of this post, and let me know if you still disagree. What I'm primarily saying is that tiers don't really correspond to power/viability ratings.

How am I supposed to give my Moveset input in the OU Metagame if I don't play as much as others? I mean come on. I have lots of battles, but that doesn't count because my rating changes obviously and they don't make the cut? Elitism at its best.

Again, assuming your rating is 1400 Elo, your battles are being counted just fine. And if your rating is actually 1400 Glicko, then I'm sorry, you're not very good--on a ladder with no matchmaking you'd lose more often than you'd win. And, seeing as how this is a competitive Pokemon site, your contributions give us no information on what typical Smogon players should be prepared for.

But, again, I'm pretty sure that your post boils down to a misunderstanding about which rating system we use to weight our stats.

Shog · Jul 9, 2014

I see.

ou 1388 74 1701 ± 80 -- 2140
-> That is from an alt, that means these battles would count?
...
I get it, I made an obvious mistake. Ignore my post

Antar · Jul 10, 2014

Shog said:
ou 1388 74 1701 ± 80 -- 2140
-> That is from an alt, that means these battles would count?

At that rating level, your team-instances (that is, your sides of each of your battles) are given a weight of 0.53, which means they each contribute roughly two parts in one million to the OU stats.

If we were using a 1500 baseline, your team-instances would be given a weight of essentially 1.0, but each instance would only contribute roughly four parts in ten million to the OU stats.

Lord Wallace · Jul 19, 2014

Man don't I feel freaking stupid now.
I remember when it was first announced that the cutoff would be at 1760 (neglecting to read the Glicko part), and I assumed this referred to Elo, and, well, I became determined to count in the tiering process and ended up at a solid 1816 Elo that session, for no real reason after all, it seems.
Still though I guess it feels good to have a constantly high Elo since after that session I rarely fall below 1680 or so in OU even when I'm not actively laddering, meaning I can always get decent battles whenever I have to practice for tournaments.

phantom · Aug 7, 2014

Is it an option to increase the weight of the current stats (just anything at all to put more emphasis on the weighted stats?) or is that simply unfeasible now due to "people with only a couple battles were shooting up the rankings then dropping down" thing that was brought up in the last page? I'm mainly asking this because there's a significant disparity between the current stats used for almost all tiers and the 1825 ones. The ones with 1825 at least give a much better idea as to what the metas look like (I realize there's subjectivity to that, but bear with me please), whereas in the stats that are currently used, there are quite a few things being inflated by a number of new players who have no knowledge about the tiers they're playing, so Pokemon that have like 5-3.41% usage in the current stats drop to the lower tier if the weight was higher.

To solve the whole "people with only a couple battles were shooting up the rankings then dropping down", is it possible to decrease their influence on the stats when they lose a match, even if they have a high ranking? I feel like if something like that could be applied, it would be the proper step towards using the higher weighted stats again.

Antar · Aug 8, 2014

Spirit, please read the OP in this thread, and try to understand the reasoning behind the decision. A very important takeaway is this: thanks to how PS does matchmaking, your experience of the metagame will differ significantly from my experience which will differ significantly from the experience of a player using an Ash team.

preserve · Aug 22, 2014

I find it kind of ironic that when deciding on what tiers pokemon go in, usage is taken into account. But when deciding whether or not to ban a poke, usage isn't taken into account.

Also, I'm not so sure how I feel about basing the cutoff on people's ratings. Just because someone doesn't have a good rating, it doesn't mean they aren't good or generally don't know what they are doing. I feel that a combination of how many battles people played and a w/l ratio of maybe over 50 or 60 percent should decide which pokes should count in the statistics.

Antar · Aug 22, 2014

Okay, I originally just deleted the above post, but I did say I wouldn't be doing that in this thread, so let's break down point by point why this is possibly the most ignorant post I've ever read:

preserve said:
I find it kind of ironic that when deciding on what tiers pokemon go in, usage is taken into account. But when deciding whether or not to ban a poke, usage isn't taken into account.

Uh... how is this ironic? We're trying to create fun and enjoyable metagames. Ideally, this would be based strictly on power, but that's pretty subjective, and we can't run 800-whatever suspect tests to determine what is broken in what tier. We let usage approximate power and then we use suspect tests to help out where usage cutoffs fail.

Just because someone doesn't have a good rating, it doesn't mean they aren't good or generally don't know what they are doing.

Uh... that's *exactly* what rating means. How about you take a gander at this here thread before you start talking about something you clearly know nothing about?

I feel that a combination of how many battles people played and a w/l ratio of maybe over 50 or 60 percent should decide which pokes should count in the statistics.

Win-loss ratio is useless on a ladder with matchmaking. In fact, it's worse than useless. This thread on COIL (our rating system for suspect tests) goes into this in depth.

And then we have the signature!

1. PS is based in the US. Follow US laws.

2. The First Amendment does not apply to PS.

3. What?

First Amendment applies to the government, not to private individuals or entities. If you go into your work and say you want to bang your boss' wife, he or she has every right in the world to fire you.

preserve · Aug 23, 2014

Antar said:
Okay, I originally just deleted the above post, but I did say I wouldn't be doing that in this thread, so let's break down point by point why this is possibly the most ignorant post I've ever read:

Uh... how is this ironic? We're trying to create fun and enjoyable metagames. Ideally, this would be based strictly on power, but that's pretty subjective, and we can't run 800-whatever suspect tests to determine what is broken in what tier. We let usage approximate power and then we use suspect tests to help out where usage cutoffs fail.

I don't really have a problem with how the tiers are made. But to say let's use usage to determine which pokemon goes in which tier, and then dismiss it as an argument when suspecting, seems wrong. I don't know, maybe there is a flaw with that logic, but it seems like you mentioned something about that in the last sentence, which I kind of agree with. Just because something has low usage doesn't mean it's not effective in a certain tier. I think pokemon that are not really effective in standard play should drop to lower tiers. Deciding on this should be easy as people who play competitively should know more or less which pokemon is effective and which one is not and a vote should be used to decide this.

Uh... that's *exactly* what rating means. How about you take a gander at this here thread before you start talking about something you clearly know nothing about?
Win-loss ratio is useless on a ladder with matchmaking. In fact, it's worse than useless. This thread on COIL (our rating system for suspect tests) goes into this in depth.

Not really, because someone new to showdown can be actually good and not have a good rating because they haven't played much. But that's not the only reason why. From my experiences, I've been both high and low on the ladder on most tiers. I'm not the best battler and I don't have super hindsight vision, but I know more or less what I'm doing.. most of the time. So I don't think it's fair for people like me, who sometimes either (one) don't really care about ratings or (two) have low ratings, to not have our "influence" count in tier making when we are not complete morons.

And then we have the signature!
First Amendment applies to the government, not to private individuals or entities. If you go into your work and say you want to bang your boss' wife, he or she has every right in the world to fire you.

Freedom of speech has its limits, but I generally disagree with you when it applies to smogon from what I've seen. I've seen people delete posts just because someone's opinion doesn't conform to theirs or the majority's opinion. People have a right to say what they feel as long as it's not crossing any lines, no matter how ignorant or wrong you think it is. I don't need to be restricted to just a government entity to exercise that right.

But I don't want to argue about that. The point of my signature is to show how hypocritical the rules are. They say you must follow the laws of the USA, but restrict one of its main ones.

Antar · Aug 23, 2014

preserve said:
Deciding on this should be easy as people who play competitively should know more or less which pokemon is effective and which one is not and a vote should be used to decide this.

Have you *ever* read a suspect testing thread? Like, even once? Competitive players disagree about *everything.*

Not really because someone new to showdown can be actually good and not have a good rating because they haven't played much.

Rating converges within a couple dozen battles (usually fewer). And the reason we don't reset the ladders any more (outside of suspect tests) is precisely to preserve accuracy. Since you seem unwilling to read, rating is a mathematical assessment, based on past battles, of a player's skill with regards to the rest of each ladder. My favorite rating system is GXE, which gives, to a statistically proven (you forget, or maybe you just didn't know--I have access to the logs, I can check these things) high degree of accuracy, the odds that a player will win a match against a randomly selected person on the ladder.
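For the curious, here is a rough sketch of how a number like GXE can be computed from a Glicko rating. It uses the standard Glicko-1 expected-score formula and models the "randomly selected person" as an opponent at 1500±130 (the starting values mentioned in this thread). That modeling choice and the function name are assumptions, not a statement of how Showdown implements it.

```python
import math

Q = math.log(10) / 400.0  # Glicko-1 scaling constant

def gxe_estimate(rating, deviation, opp_rating=1500.0, opp_deviation=130.0):
    """Expected score against an assumed 'average' opponent, per Glicko-1."""
    # Uncertainty in both players' ratings pulls the expectation toward 50%.
    combined = math.sqrt(deviation ** 2 + opp_deviation ** 2)
    g = 1.0 / math.sqrt(1.0 + 3.0 * (Q * combined) ** 2 / math.pi ** 2)
    return 1.0 / (1.0 + 10.0 ** (-g * (rating - opp_rating) / 400.0))

print(f"{100 * gxe_estimate(1500, 130):.0f}%")  # 50%
print(f"{100 * gxe_estimate(1800, 50):.0f}%")
```

Because the result is a win probability rather than a point total, it reads the same way on every ladder, regardless of how many people play it.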

I've been both high and low on the ladder on most tiers.

Different ratings mean different things on different ladders. Roughly speaking, ratings correlate to percentile (again, read the ratings faq), so yes, it's a lot harder to get top 100 in OU than it is to get top 100 in Little Cup. But, on the flip side, it's a lot harder to get an 1800 (Glicko) rating in Little Cup than in OU.

I generally disagree with you when it applies to smogon from what I've seen

This isn't a matter of opinion, it's a matter of law!

preserve · Aug 23, 2014

Antar said:
Have you *ever* read a suspect testing thread? Like, even once? Competitive players disagree about *everything.*

But we still suspect pokemon. Just because people disagree on something, that doesn't mean we shouldn't vote on something.

This isn't a matter of opinion, it's a matter of law!

Maybe we have different views on free speech. You may view it as a "law" but I view it as an unalienable right.

Stoo · Aug 23, 2014

this probably doesnt even need to be said but in regards to the above post, deciding tiers via a vote is a completely pointless process. if something is good enough to be used in a tier, people will use it > it will rise to that tier. the fact we use high ladder stats such as 1695 is to ensure that usage is coming from people who know what they're doing.

if you think you're good enough to influence the tiering process, then make unorthodox pokes work higher on the ladder, or make suspect reqs. simple as that.

Zebstrika · Aug 24, 2014

Sorry if I'm beating a dead Rapidash, but tiers are supposed to represent different metagames, rather than different power levels. We remove the most common pokemon from OU not because they would necessarily be overpowered in UU, but just to juice up the metagame as much as possible. It would be really dumb if we saw Conkeldurr all over OU with its 8.3% usage last month* but was voted not viable enough, and then we saw Conkeldurr all over UU as well, and repeat that for a handful of other pokemon.

*On a somewhat unrelated note, I was looking through last month's stats and sitting at #18 was Mawile. Ok, Mawile was just banned, well, Aegislash was sitting pretty at the top at 21.8% usage. And then I look down, and see the Deoxys forms. Can this month end faster so we can get some stats that aren't 3 suspect tests outdated?

Mowtom · Aug 25, 2014

preserve said:
But we still suspect pokemon. Just because people disagree on something, that doesn't mean we shouldn't vote on something.

Congratulations, you just discovered the point of a suspect test! People disagree on whether something belongs in the tier, so they battle, discuss, and then vote on it!

Maybe we have different views on free speech. You may view it as a "law" but I view it as an unalienable right.

Try telling any law enforcement official that you disagree with a law and see how far that gets you.
