Digging into some issues only makes you wish you hadn’t, once you realize how messed up they are at a fundamental level.

Here’s a great example involving statistical analysis, as applied to win/loss (“WL”) records of sports teams, the base concept of which is that it’s possible to estimate what a team’s WL record “should” have been, based on the number of goals/runs/points that it scored, and allowed, over a defined number of games (typically, a full season or more). This blog post by Bill James partially motivates my thoughts here.

Just where and when this basic idea originated I’m not 100 percent sure, but it appears to have been James, three to four decades ago, under the name “Pythagorean Expectation” (PE). Bill James, if you don’t know, is the originator, and/or popularizer, of a number of statistical methods or approaches applied to baseball data, which launched the so-called “SABR-metric” baseball analysis movement (SABR = Society for American Baseball Research). He is basically that movement’s founder.

In the linked post above, James uses the recent American League MVP votes for Jose Altuve and Aaron Judge, to make some great points regarding the merit of WAR (Wins Above Replacement), arguably the most popular of the *many* SABR-metric variables. The legitimacy of WAR is an involved topic on which much virtual ink has been spilled, but is *not* my focus here; in brief, it tries to estimate the contribution each player makes to his team’s WL record. In the article, James takes pointed exception to how WAR is used (by some, who argue based upon it, that the two players were basically about equally valuable in 2017). In the actual MVP vote, Altuve won by a landslide, and James agrees with the voters’ judgement (pun intended): WAR is flawed in evaluating true player worth in this context. Note that numerous problems have been identified with WAR, but James is bringing a new and serious one, and from a position of authority.

One of James’ main arguments involves inappropriate use of the PE, specifically that the “expected” number of wins by a team is quite irrelevant–it’s the *actual* number that matters when assessing any given player’s contribution to it. For the 2017 season, the PE estimates that Judge’s team, the New York Yankees, “should” have gone 101-61, instead of their actual 91-71, and thus in turn, every Yankee player is getting some additional proportion of those ten extra, imaginary wins, added to his seasonal WAR estimate. For Altuve’s team, the Houston Astros, that’s not an issue because their actual and PE WL records were identical (both 101-61). The WAR-mongers, and most self-identified SABR-metricians for that matter, automatically then conclude that a team like this year’s Yanks was “unlucky”: they should have won 101 games, but doggone lady luck was against ’em in distributing their runs scored (and allowed) across their 162 games…such that they only won 91 instead. Other league teams balance the overall ledger by being luck beneficiaries–if not outright pretenders. There are major problems with this whole mode of thought, some of which James rips in his essay, correctly IMO.

But one additional major problem here is that James started the PE craze to begin with, and neither he, nor anybody else who has subsequently either modified or used it, seems to understand the problems inherent in *that* metric. James instead addresses issues in the *application* of the PE as input to the metric (WAR) that he takes issue with, not the legitimacy of the PE itself. Well, there are in fact several issues with the PE, ones that collectively illustrate important issues in statistical philosophy and practice. If you’re going to criticize, start at the root, not the branches.

The issue is one of statistical methodology, and the name of the metric is itself a big clue–it was chosen because the PE formula is similar to the Pythagorean theorem of geometry: A^2 + B^2 = C^2, where A and B are the two legs, and C the hypotenuse, of a right triangle. The original (James) PE equation was: W = S^2 / (S^2 + A^2), where W = a team’s winning percentage, S = its total runs scored and A = its total runs allowed (not the triangle’s side A), summed over one or more seasons. That is, it supposedly mimicked the ratio of squared lengths between one side, and the hypotenuse, of a right triangle. Just how James came to this structural form, and parameter values, I don’t know and likely very few besides James himself do; presumably the details are in one of his annual *Baseball Abstracts* from 1977 to 1988, since he doesn’t discuss the issue that I can see, in either of his “Historical Baseball Abstract” books. Perhaps he thought that runs scored and allowed were fully independent of each other, orthogonal, like the two sides of a right triangle. I don’t know.
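For concreteness, here is a minimal Python sketch of the original exponent-2 formula as written above. The function name and the run totals are made up purely for illustration; they are not any real team’s numbers.

```python
def pythagorean_expectation(runs_scored, runs_allowed, exponent=2.0):
    """Expected winning fraction, W = S^x / (S^x + A^x); x = 2 is the original form."""
    s = runs_scored ** exponent
    a = runs_allowed ** exponent
    return s / (s + a)


# Hypothetical season totals, for illustration only (not real team data).
games = 162
win_fraction = pythagorean_expectation(800, 700)
expected_wins = win_fraction * games

print(f"expected winning fraction: {win_fraction:.3f}")
print(f"expected wins over {games} games: {expected_wins:.1f}")
```

Note that the formula sees only the two season totals; how those runs were spread across individual games never enters into it.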

It seems to me very likely that James derived his equation via fitting various curves to some empirical data set, although it is possible he was operating from some (unknown) theoretical basis. Others who followed him, and supposedly “improved” the metric’s accuracy definitely fitted curves to data, since all parameters (exponents) were lowered to values (e.g. 1.81) for which no theoretical basis is even possible to conceive of: show me the theoretical basis for anything that scales up/down according to the ratio of a sum of parts, and one component thereof, by the power of 1.81. The current PE incarnation (claimed as the definitive word on the matter by some) has the exponents themselves as variables, dependent on the so-called “run environment”, the total number of runs scored and allowed, per game. Thus, the exponents for any given season are estimated by R^0.285, where R is the average number of runs scored per game (both teams) over all games of a season.
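The variable-exponent version described above can be sketched the same way. This is only my reading of the description in the text, with the exponent computed as R^0.285; the function names and all input numbers are hypothetical.

```python
def run_environment_exponent(runs_per_game, power=0.285):
    """Exponent estimated from the run environment R (runs per game, both teams)."""
    return runs_per_game ** power


def pythagorean_expectation(runs_scored, runs_allowed, exponent):
    """Expected winning fraction, W = S^x / (S^x + A^x)."""
    s = runs_scored ** exponent
    a = runs_allowed ** exponent
    return s / (s + a)


# Hypothetical inputs, for illustration only (not real league or team data).
league_runs_per_game = 9.0
exponent = run_environment_exponent(league_runs_per_game)

fixed = pythagorean_expectation(800, 700, 2.0)
variable = pythagorean_expectation(800, 700, exponent)

print(f"exponent from run environment: {exponent:.3f}")
print(f"winning fraction, exponent 2:  {fixed:.4f}")
print(f"winning fraction, variable:    {variable:.4f}")
```

An exponent below 2 pulls every team’s expectation toward .500, which is the kind of adjustment you arrive at by tuning a curve to data rather than by deriving it.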

Even assuming that James did in fact try to base his PE on theory somehow, he didn’t do it right, and that’s a big problem, because there is in fact a very definite theoretical basis for *exactly* this type of problem…but one never followed, and apparently never even recognized, by SABR-metricians. At least I’ve seen no discussion of it anywhere, and I’ve read my share of baseball analytics essays. Instead, it’s an example of the curve-fitting mentality that is utterly ubiquitous among them. (I have seen some theoretically driven analytics in baseball, but mostly as applied to ball velocity and trajectory off the bat, as predicted from e.g., bat and ball elasticity, temperature, launch angle, etc., and also the analysis of bat breakage, a big problem a few years back. And these were by Alan Nathan, an actual physicist).

Much of science, especially non-experimental science, involves estimating relationships from empirical data. And there’s good reason for that–most natural systems are complex, and often, one simply does not know, quantitatively and a priori, the fundamental operating relationships upon which to build a theory, much less how those interact with each other in complex ways at the time and space scales of interest. Therefore one tries instead to estimate those relationships by fitting models to empirical data–often some type of regression model, but not necessarily. It goes without saying that since the system is complex, you can only hope to detect some part of the full signal from the noise, often just one component of it. It’s an inverse, or inferential, approach to understanding a system, as opposed to forward modeling driven by theory; these are the two opposing approaches to understanding a system.

On those (rare) occasions when you do have a system amenable to theoretical analysis…well you dang well better do so. Geneticists know this: they don’t ignore binomial/multinomial models, in favor of curve fitting, to estimate likely nuclear transmission genetic processes in diploid population genetics and inheritance. That would be entirely stupid, given that we know for sure that diploid chromosomes conform to a binomial process during meiosis the vast majority of the time. We understand the underlying driving process–it’s simple and ubiquitous.

The binomial must be about the simplest possible stochastic model…but the Poisson isn’t too far behind. The Poisson predicts the expected distribution of the occurrence of discrete events in a set of sample units, given knowledge of the average occurrence rate determined over the full set thereof. It is in fact *exactly* the appropriate model for predicting the per-game distribution of runs/goals scored (and allowed), in sports such as baseball, hockey, golf, soccer, lacrosse, etc. (i.e. sports in which scoring is integer-valued and all scoring events are positive and of equal value).
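As a small illustration of what the Poisson predicts, the sketch below computes the expected number of games with 0, 1, 2, … runs, given only an average scoring rate. The rate and the number of games are hypothetical values chosen for the example.

```python
import math


def poisson_pmf(k, rate):
    """Probability of exactly k events when events occur at an average of `rate` per unit."""
    return math.exp(-rate) * rate ** k / math.factorial(k)


# Hypothetical inputs, for illustration only (not real team data).
rate = 4.5    # average runs scored per game
games = 162

for runs in range(11):
    p = poisson_pmf(runs, rate)
    print(f"{runs:2d} runs: P = {p:.4f}, expected games = {p * games:5.1f}")
```

The only input is the average rate; the whole per-game distribution follows from it, which is what makes it a theoretical expectation rather than a fitted curve.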

To start with, the Poisson model can test a wider variety of hypotheses. The PE can only predict a team’s WL record, whereas the Poisson can test whether or not a team’s actual distribution of runs scored (and allowed) follows expectation. To the extent that it does, that is corresponding evidence of true randomness generating the variance in scores across games. This in turn means that the run scoring (or allowing) process is *stationary*, i.e., it is governed by an unchanging set of drivers. Conversely, if the observed distributions differ significantly from expectation, that’s corresponding evidence that those drivers are *not* stationary, meaning that teams’ inherent ability to score (and/or allow) runs is dynamic–it changes over time (i.e. between games). That’s an important piece of knowledge in and of itself.
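One simple way to look at this (a sketch, and certainly not the only possible check) is to line up the observed per-game counts against the Poisson expectation, and to compute the variance-to-mean ratio, which should be close to 1 for a Poisson process. The per-game scores below are invented for illustration, and the function names are my own.

```python
import math
import statistics
from collections import Counter


def poisson_pmf(k, rate):
    """Probability of exactly k events at an average of `rate` per unit."""
    return math.exp(-rate) * rate ** k / math.factorial(k)


def compare_to_poisson(runs_per_game):
    """Return (mean rate, variance/mean ratio, rows of (runs, observed games, expected games))."""
    n = len(runs_per_game)
    rate = statistics.mean(runs_per_game)
    observed = Counter(runs_per_game)
    rows = [(k, observed.get(k, 0), n * poisson_pmf(k, rate))
            for k in range(max(runs_per_game) + 1)]
    dispersion = statistics.variance(runs_per_game) / rate
    return rate, dispersion, rows


# Invented per-game run totals, for illustration only (not real team data).
scores = [3, 5, 2, 7, 4, 0, 6, 4, 1, 9, 3, 5, 4, 2, 8, 4, 3, 6, 5, 1]

rate, dispersion, rows = compare_to_poisson(scores)
print(f"mean runs per game: {rate:.2f}, variance/mean: {dispersion:.2f}")
for runs, obs, exp in rows:
    print(f"{runs:2d} runs: observed {obs:2d}, expected {exp:5.2f}")
```

A ratio well above 1 (more spread than Poisson) would point toward a scoring rate that changes between games; a formal test with a full season of games would be needed before concluding anything.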

But the primary question of interest here involves the WL record and its relationship to runs scored and allowed. If a team’s runs scored and allowed both closely follow Poisson expectations–then prediction of the WL record follows from theory. Specifically, the difference between two independent Poisson-distributed variables follows the Skellam distribution, described by the British statistician J.G. Skellam in 1946, as part of his extensive work on point processes. That is, the Skellam directly predicts the WL record whenever the Poisson assumptions are satisfied. However, even if a team’s run distribution deviates significantly from Poisson expectation, it is still possible to accurately estimate the expected WL record, by simply resampling–drawing randomly several thousand times from the *observed* distributions–allowing computers to do what they’re really good at. [Note that in low scoring sports like hockey and baseball, many ties will be predicted, and sports differ greatly in how they break ties at the end of regulation play. The National Hockey League and Major League Baseball vary greatly in this respect, especially now that NHL ties can be decided by shoot-out, which is a completely different process than regulation play. In either case, it’s necessary to identify games that are tied at the end of regulation.]
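Both routes can be sketched in a few lines of Python. The first function gets the Skellam probabilities by directly summing the joint probabilities of two independent Poisson variables, rather than using a closed-form expression; the second resamples from observed per-game scores, drawing runs scored and allowed independently, which is itself an assumption. All names and numbers are hypothetical, and ties at the end of regulation are reported separately rather than allocated.

```python
import math
import random


def poisson_pmf(k, rate):
    """Probability of exactly k events at an average of `rate` per unit."""
    return math.exp(-rate) * rate ** k / math.factorial(k)


def skellam_outcomes(rate_scored, rate_allowed, max_runs=60):
    """P(win), P(tie), P(loss) at end of regulation for independent Poisson scoring."""
    p_for = [poisson_pmf(k, rate_scored) for k in range(max_runs + 1)]
    p_against = [poisson_pmf(k, rate_allowed) for k in range(max_runs + 1)]
    win = tie = loss = 0.0
    for s, ps in enumerate(p_for):
        for a, pa in enumerate(p_against):
            if s > a:
                win += ps * pa
            elif s == a:
                tie += ps * pa
            else:
                loss += ps * pa
    return win, tie, loss


def resampled_outcomes(scored, allowed, draws=10_000, seed=0):
    """Same three probabilities, estimated by drawing from the observed scores."""
    rng = random.Random(seed)
    win = tie = loss = 0
    for _ in range(draws):
        s, a = rng.choice(scored), rng.choice(allowed)
        if s > a:
            win += 1
        elif s == a:
            tie += 1
        else:
            loss += 1
    return win / draws, tie / draws, loss / draws


# Hypothetical rates and invented per-game scores, for illustration only.
print(skellam_outcomes(4.9, 4.3))
print(resampled_outcomes([3, 5, 2, 7, 4, 0, 6, 4], [4, 1, 3, 6, 2, 5, 3, 4]))
```

Multiplying the win probability by the number of games gives an expected win total, once the tied games have been dealt with according to how the sport in question breaks them.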

If instead you take an empirical data set and fit some equation to those data–any equation, no matter how good the fit–you run the risk of committing a very big error indeed, one of the biggest you can in fact make. Specifically, if the data *do* in fact deviate from Poisson expectation, i.e. non-stationary processes are operating, *you will mistake your data-fitted model for the true expectation*–the baseline reference point from which to assess random variation. Show me a bigger error that you can make than that one–it will affect every conclusion you subsequently come to. So, if you want to assess how “lucky” a team was with its WL record, relative to runs scored and allowed, don’t do that. And don’t get me started on use of the term “luck” in SABR-metrics, when what they really mean is chance, or stochastic, variation. The conflation of such terms in sports that very clearly involve heavy doses of both skill and chance, is a fairly flagrant violation of the whole point of language. James is quite right in pointing this out.

I was originally hoping to get into some data analysis to demonstrate the above points but that will have to wait–the underlying statistical concepts needed to be discussed first and that’s all I have time for right now. Rest assured that it’s not hard to analyze the relevant data in R (but it can be a time-consuming pain to obtain and properly format it).

I would also like to remind everyone to try to lay off high fastballs, keep your stick on the ice, and stay tuned to this channel for further fascinating discussions of all kinds. Remember that Tuesdays are dollar dog night, but also that we discontinued 10 cent beer night 40 years ago, given the results.

Interesting stuff but it was hard to follow the reasoning without examples from data sets. Specifically, can you show how a Poisson distribution would calculate a different number of expected wins for the 2017 season for the Yankees and Astros than the Pythagorean?

Yes, but the most effective demo would probably be with simulated data, because with any specific team-year you could get similar results between the two different methods. Will try to get to it soon…