# Severe analytical problems in dendroclimatology, part fifteen

I’m going to give this topic another explanatory shot with some different graphics, because many still don’t grasp the serious problems inherent in trying to signal and noise from tree ring size. The most advanced method for attempting this is called Regional Curve Standardization, or RCS, in which ring size is averaged over a set of sampled trees, according to the rings’ biological age (i.e. ring number, counting from tree center), and then dividing each individual series by this average. I include five time series graphs, successively containing more information, to try to illustrate the problem. I don’t know that I can make it any clearer than this.

First, shown below are the hypothetical series of 11 trees sampled at a single sampling location.

Each black line shows the annual ring area progression for each of 11 trees having origin dates spaced exactly 10 years apart (the bold line is just the oldest tree of the group). By using ring area as the metric we automatically remove part of the non-climatic trend, which is the purely geometric (inverse quadratic) effect from each series. Any remaining variation is then entirely biological and it exhibits a very standard tree growth pattern, one in which growth rate increases to a maximum value reached relatively early life (here, around age 80 or so) and then declines more slowly toward a stable asymptote, which I fix at 1.0. Each tree’s trajectory occurs in a constant climate over the 300-400 year period measured.

The next figure adds two components:

First, the blue line represents a constantly increasing climatic parameter over time, say temperature, expressed as a ratio of its effect on ring size at year 0. Thus, at year 400, the cumulative climatic effect on ring area, regardless of biological age, is exactly 3-fold of its year zero value (scale at right). The second addition is the series of red lines, which simply represent those same 11 trees’ growth trajectories growing under this climate trend. The climatic effect on growth is a super simple linear ramp in all cases–I am not invoking any kind of problematic, complex growth response (e.g. “divergence”), or any other complication. Thus, by definition, if we divide the two corresponding ring series for each tree, we get exactly the blue line, in all cases.

In the third figure:

I add a green line–this is the estimated RCS curve, computed the standard way (by aligning each tree according to its biological age and then averaging the ring sizes over all trees). This RCS curve is thus the estimated non-climatic ring size variation, which we accordingly remove from each tree by dividing the red growth series by it. Finally, we average the resulting 11 index series, over each of the 400 years, giving the stated goal: the estimated climatic time series.

It is at first glance entirely clear that the green RCS curve does not even come close to matching any of the black curves representing the true non-climatic variation…which it must. According to standard dendroclimatological practice we would now divide the 11 red curves by this green RCS curve–which is thereby guaranteed not to return the true climatic signal. So what will it return?

It returns the orange line shown above. No that’s not a mistake: it will return an estimated climatic trend of zero.

And this is the entire point–the supposedly most advanced tree ring detrending method is fully incapable of returning the real climatic trend when one exists. Note that I’m keeping everything very simple here–this result does not depend on: (1) either the direction or magnitude of the true trend, or (2) the magnitude, or shape, of the non-climatic trend in the sampled trees (including no such whatsoever). That is, this type or magnitude of result is not specific to the situation I set up. The problem can be reduced, but never eliminated, by increasing the variance in tree ages in the sample. But since standard field sampling practice is to sample the oldest possible trees at a site, this is very rare, a fact which the data of the International Tree Ring Database (ITRDB) shows clearly–which is ironic given that Keith Briffa and Ed Cook mentioned the importance of exactly this issue in a white paper available at the ITRDB site.

Lastly, suppose now that the last usable year for all ring series occurred a few decades ago. This will occur, for example, due to many ITRDB field samples being collected decades ago now, or for any perceived problems in the climate-to-ring response calibration function, which is must be stable and dependable (notably, the “divergence” effect, in which linear relationships between climate and ring size break down, badly). What will be the result of eliminating, say, the last five decades of data, and replace them with instrumental data? Well, you will then get exactly this:

Look familiar? Does that look like anything remotely approaching success to you? Again, I have not even broached other possibly confounding problems, such as co-varying growth determinants (e.g. increasing CO2- or N-fertilization, changing soil moistures, or inter-tree competition), nor non-linear responses in the calibration function, nor any of the thorny issues in large-scale sampling strategies, reconstructions and their corresponding data analysis methods. Those things would all exacerbate the problem, not improve it. It’s a total analytical mess–beginning and end of story.

I can’t make it any clearer than this. And yes I have the R code that generated these data if you want to see it.

# One hundred years of NHL hockey; some analysis

This post has been updated, with corrected data and modified discussion, as detailed in the text.

Does anything say “100 Years of the National Hockey League” like say, a Tampa Bay vs Vegas matchup.  Montreal, Toronto, Ottawa?  Please; bunch of party crashers them.

In case you missed it, National Hockey League play is, today, exactly 100 years old. On December 19, 1917 the first two games in the new league had the Toronto Arenas at the Montreal Wanderers, and the Montreal Canadiens at the Ottawa Senators. This limited slate was due in large part to those being the only four teams in the league.  It turns out that the Wanderers got their first and only win in franchise history, which lasted just six games. They got past the Arenas in the common hockey score of 10-9. The Arenas, conversely, went on–along with the Canadiens–to become one of the two most storied franchises in NHL history: today’s Toronto Maple Leafs. The Senators’ first incarnation lasted until 1934, and after a 58 year absence came the second (and current) version in 1992.

So anyway, there’s hype and hoopla happening, and also discussions of the greatest seasons, teams, players, etc. As for me, I thought it would be great fun to crunch 90 years of team-season numbers to see what they indicated about team records, actual versus expected. Two minutes for tripping, and without even inhaling anything.

# WAR, Pythagoras, Poisson and Skellam

Getting into some issues only makes you wish that you hadn’t, when you realize how messed up they are, at a fundamental level.

Here’s a great example involving statistical analysis, as applied to win/loss (“WL”) records of sports teams, the base concept of which is that it’s possible to estimate what a team’s WL record “should” have been, based on the number of goals/runs/points that it scored, and allowed, over a defined number of games (typically, a full season or more). This blog post by Bill James partially motivates my thoughts here.

Just where and when this basic idea originated I’m not 100 percent sure, but it appears to have been James, three to four decades ago, under the name “Pythagorean Expectation” (PE). Bill James, if you don’t know, is the originator, and/or popularizer, of a number of statistical methods or approaches applied to baseball data, which launched the so-called “SABR-metric” baseball analysis movement (SABR = Society for American Baseball Research). He is basically that movement’s founder.

In the linked post above, James uses the recent American League MVP votes for Jose Altuve and Aaron Judge, to make some great points regarding the merit of WAR (Wins Above Replacement), arguably the most popular of the many SABR-metric variables. The legitimacy of WAR is an involved topic on which much virtual ink has been spilled, but is not my focus here; in brief, it tries to estimate the contribution each player makes to his team’s WL record. In the article, James takes pointed exception to how WAR is used (by some, who argue based upon it, that the two players were basically about equally valuable in 2017). In the actual MVP vote, Altuve won by a landslide, and James agrees with the voters’ judgement (pun intended): WAR is flawed in evaluating true player worth in this context. Note that numerous problems have been identified with WAR, but James is bringing a new and serious one, and from a position of authority.

One of James’ main arguments involves inappropriate use of the PE, specifically that the “expected” number of wins by a team is quite irrelevant–it’s the *actual* number that matters when assessing any given player’s contribution to it. For the 2017 season, the PE estimates that Judge’s team, the New York Yankers, “should” have gone 101-61, instead of their actual 91-71, and thus in turn, every Yanker player is getting some additional proportion of those ten extra, imaginary wins, added to his seasonal WAR estimate. For Altuve’s team, the Houston Astros, that’s not an issue because their actual and PE WL records were identical (both 101-61). The WAR-mongers, and most self identified SABR-metricians for that matter, automatically then conclude that a team like this year’s Yanks were “unlucky”: they should have won 101 games, but doggone lady luck was against ’em in distributing their runs scored (and allowed) across their 162 games…such that they only won 91 instead. Other league teams balance the overall ledger by being luck beneficiaries–if not outright pretenders. There are major problems with this whole mode of thought, some of which James rips in his essay, correctly IMO.

But one additional major problem here is that James started the PE craze to begin with, and neither he, nor anybody else who have subsequently either modified or used it, seems to understand the problems inherent in that metric. James instead addresses issues in the application of the PE as input to the metric (WAR) that he takes issue with, not the legitimacy of the PE itself. Well, there are in fact several issues with the PE, ones that collectively illustrate important issues in statistical philosophy and practice. If you’re going to criticize, start at the root, not the branches.

The issue is one of statistical methodology, and the name of the metric is itself a big clue–it was chosen because the PE formula is similar to the Pythagorean theorem of geometry: A^2 + B^2 = C^2, where A, B and C are the three sides of a right triangle. The original (James) PE equation was: W = S^2 / (S^2 + A^2), where W = winning percentage, S = total runs scored and A = total runs allowed, summed over all the teams in a league, over one or more seasons. That is, it supposedly mimicked the ratio of squared lengths between one side, and the hypotenuse, of a right triangle. Just how James came to this structural form, and parameter values, I don’t know and likely very few besides James himself do; presumably the details are in one of his annual Baseball Abstracts from 1977 to 1988, since he doesn’t discuss the issue that I can see, in either of his “Historical Baseball Abstract” books. Perhaps he thought that runs scored and allowed were fully independent of each other, orthogonal, like the two sides of a right triangle. I don’t know.

It seems to me very likely that James derived his equation via fitting various curves to some empirical data set, although it is possible he was operating from some (unknown) theoretical basis. Others who followed him, and supposedly “improved” the metric’s accuracy definitely fitted curves to data, since all parameters (exponents) were lowered to values (e.g. 1.81) for which no theoretical basis is even possible to conceive of: show me the theoretical basis for anything that scales up/down according to the ratio of a sum of parts, and one component thereof, by the power of 1.81. The current PE incarnation (claimed as the definitive word on the matter by some) has the exponents themselves as variables, dependent on the so-called “run environment”, the total number of runs scored and allowed, per game. Thus, the exponents for any given season are estimated by R^0.285, where R is the average number of runs scored per game (both teams) over all games of a season.

Even assuming that James did in fact try to base his PE on theory somehow, he didn’t do it right, and that’s a big problem, because there is in fact a very definite theoretical basis for exactly this type of problem…but one never followed, and apparently never even recognized, by SABR-metricians. At least I’ve seen no discussion of it anywhere, and I’ve read my share of baseball analytics essays. Instead, it’s an example of the curve-fitting mentality that is utterly ubiquitous among them. (I have seen some theoretically driven analytics in baseball, but mostly as applied to ball velocity and trajectory off the bat, as predicted from e.g., bat and ball elasticity, temperature, launch angle, and etc, and also the analysis of bat breakage, a big problem a few years back. And these were by Alan Nathan, an actual physicist).

Much of science, especially non-experimental science, involves estimating relationships from empirical data. And there’s good reason for that–most natural systems are complex, and often, one simply does not know, quantitatively and apriori, the fundamental operating relationships upon which to build a theory, much less how those interact with each other in complex ways at the time and space scales of interest. Therefore one tries instead to estimate those relationships by fitting models to empirical data–often some type of regression model, but not necessarily. It goes without saying that since the system is complex, you can only hope to detect some part of the full signal from the noise, often just one component of it. It’s an inverse, or inferential, approach to understanding a system, as opposed to forward modeling driven by theory; these are the two opposing approaches to understanding a system.

On those (rare) occasions when you do have a system amenable to theoretical analysis…well you dang well better do so. Geneticists know this: they don’t ignore binomial/multinomial models, in favor of curve fitting, to estimate likely nuclear transmission genetic processes in diploid population genetics and inheritance. That would be entirely stupid, given that we know for sure that diploid chromosomes conform to a binomial process during meiosis the vast majority of the time. We understand the underlying driving process–it’s simple and ubiquitous.

The binomial must be about the simplest possible stochastic model…but the Poisson isn’t too far behind. The Poisson predicts the expected distribution of the occurrence of discrete events in a set of sample units, given knowledge of the average occurrence rate determined over the full set thereof. It is in fact exactly the appropriate model for predicting the per-game distribution of runs/goals scored (and allowed), in sports such as baseball, hockey, golf, soccer, lacrosse, etc. (i.e. sports in which scoring is integer-valued and all scoring events are positive and of equal value).

To start with, the Poisson model can test a wider variety of hypotheses. The PE can only predict a team’s WL record, whereas the Poisson can test whether or not a team’s actual runs scored (and allowed) distribution, follows expectation. To the extent that they do follow is corresponding evidence of true randomness generating the variance in scores across games. This in turn means that the run scoring (or allowing) process is stationary, i.e., it is governed by an unchanging set of drivers. Conversely, if the observed distributions differ significantly from expectation, that’s corresponding evidence that those drivers are not stationary, meaning that teams’ inherent ability to score (and/or allow) runs is dynamic–they change over time (i.e. between games). That’s an important piece of knowledge in and of itself.

But the primary question of interest here involves the WL record and its relationship to runs scored and allowed. If a team’s runs scored and allowed both closely follow Poisson expectations–then prediction of the WL record follows from theory. Specifically, the distribution of differences in two Poisson distributions follows the Skellam distribution, described by the British statistician J.G. Skellam in the 1950s, as part of his extensive work on point processes. That is, the Skellam directly predicts the WL record whenever the Poisson assumptions are satisfied. However, even if a team’s run distribution deviates significantly from Poisson expectation, it is still possible to accurately estimate the expected WL record, by simply resampling–drawing randomly several thousand times from the observed distributions–allowing computers to do what they’re really good at. [Note that in low scoring sports like hockey and baseball, many ties will be predicted, and sports differ greatly in how they break ties at the end of regulation play. The National Hockey League and Major League Baseball vary greatly in this respect, especially now that NHL ties can be decided by shoot-out, which is a completely different process than regulation play. In either case, it’s necessary to identify games that are tied at the end of regulation.]

If instead you take an empirical data set and fit some equation to those data–any equation, no matter how good the fit–you run the risk of committing a very big error indeed, one of the biggest you can in fact make. Specifically, if the data do in fact deviate from Poisson expectation, i.e. non-stationary processes are operating, you will mistake your data-fitted model for the true expectation–the baseline reference point from which to assess random variation. Show me a bigger error that you can make then that one–it will affect every conclusion you subsequently come to. So, if you want to assess how “lucky” a team was with its WL record, relative to runs scored and allowed, don’t do that. And don’t get me started on use of the term “luck” in SABR-metrics, when what they really mean is chance, or stochastic, variation. The conflation of such terms in sports that very clearly involve heavy doses of both skill and chance, is a fairly flagrant violation of the whole point of language. James is quite right in pointing this out.

I was originally hoping to get into some data analysis to demonstrate the above points but that will have to wait–the underlying statistical concepts needed to be discussed first and that’s all I have time for right now. Rest assured that it’s not hard to analyze the relevant data in R (but it can be a time-consuming pain to obtain and properly format it).

I would also like to remind everyone to try to lay off high fastballs, keep your stick on the ice, and stay tuned to this channel for further fascinating discussions of all kinds.  Remember that Tuesdays are dollar dog night, but also that we discontinued 10 cent beer night 40 years ago, given the results.

# SABR-toothed

Well they’ve been running around on the flat expanses of the early Holocene lake bed with impressively large machines, whacking down and gathering the soybeans and corn. This puts dirt clods on the roads that cause one on a road bike at dusk to weave and swear, but I digress. The Farmer’s Almanac indicates says that it must therefore be about World Series time, which in turn is just about approximately guaranteed to initiate various comments regarding the role of luck, good or bad, in deciding important baseball game outcomes.

There are several important things to be blurted out on this important topic and with the Series at it’s climax and the leaves a fallin’ now’s the time, the time is now.

It was Bill James, the baseball “sabermetric” grandpa and chief guru, who came up with the basic idea some time ago, though not with the questionable terminology applied to it I think, which I believe came later from certain disciples who knelt at his feet.

The basic idea starts off well enough but from there goes into a kind of low-key downhill slide, not unlike the truck that you didn’t bother setting the park brake for because you thought the street grade was flat but found out otherwise a few feet down the sidewalk. At which point you also discover that the bumper height of said truck does not necessarily match that of a Mercedes.

The concept applies not just to baseball but anything involving integer scores. Basic idea is as follows (see here). Your team plays 162 baseball games, 25 soccer matches or whatever, and of course you keep score of each. You then compute the fraction S^x/(S^x + A^x), where using the baseball case, S = runs scored, A = runs allowed and x = an exponent that varies depending on the data used (i.e. the teams and years used). You do this for each team in the league and also compute each team’s winning percentage (WP = W/G, where W = number of wins and G = games played in the season(s)). A nonlinear regression/optimization returns the optimal value of x, given the data. The resulting fraction is known as the “pythagorean expectation” of winning percentage, claiming to inform us of how many games a given team “should” have won and lost over that time, given their total runs scored and allowed.

Note first that the value of x depends on the data used: the relationship is entirely empirically derived, and exponents ranging from (at least) 1.8 to 2.0 have resulted. There is no statistical theory here whatsoever, and in no description of “the pythag” have I ever seen any mention of such. This is a shame because (1) there can and should be, and (2) it seems likely that most “sabermatricians” don’t have any idea as to how or why. Maybe not all, but I haven’t seen any discuss the matter. Specifically, this is a classic case for application of Poisson-derived expectations.

However the lack of theory is one, but not really the main, point here. More at issue are the highly questionable interpretations of the causes of observed deviations from pythag expectations, where the rolling truck smashes out the grill and lights of the Mercedes.

You should base an analysis like this on the Poisson distribution for at least two very strong reasons. First, interpretations of the pythag always involve random chance. That is, the underlying view is that departures of a given team’s won-loss record from pythag expectation is always attributed to the action of randommness–random chance. Great, if you want to go down that road, that’s exactly what the Poisson distribution is designed to address. Secondly, it will give you additional information regarding the role of chance that you cannot get from “the pythag”.

Indeed, the Poisson gives the expected distribution of integer-valued data around a known mean, under the assumption that random deviations from that mean are solely the result of sampling error, which in turn results from the combination of Complete Spatial Randomness (CSR) complete randomness of the objects, relative to the mean value and the size of the sampling frame. In our context, the sampling frame is a single game and the objects of analysis are the runs scored, and allowed, in each game. The point is that the Poisson is inherently designed to test just exactly what the SABR-toothers are wanting to test. But they don’t use it–they instead opt for the fully ad-hoc pythag estimator (or slight variations thereof). Always.

So, you’ve got a team’s total runs scored and allowed over its season. You divide that by the number of games played to give you the mean of each. That’s all you need–the Poisson is a single parameter distribution, the variance being a function of the mean. Now you use that computer in front of you for what it’s really ideal at–doing a whole bunch of calculations really fast–to simply draw from the runs scored, and runs allowed, distributions, randomly, say 100,000 times or whatever, to estimate your team’s real expected won-loss record under a fully random score distribution process. But you can also do more–you can test whether either the runs scored or allowed distribution fits the Poisson very well, using a chi-square goodness-of-fit test. And that’s important because it tells you basically, whether or not they are homogeneous random processes–processes in which the data generating process is unchanging through the season. In sports terms: it tells you the degree to which the team’s performance over the year, offensive and defensive, came from the same basic conditions (i.e. unchanging team performance quality/ability).

The biggest issue remains however–interpretation. I don’t how it all got started, but somewhere, somebody decided that a positive departure from “the pythag” (more wins than expected) equated to “good luck” and negative departures to “bad luck”. Luck being the operative word here. Actually I do know the origin–it’s a straight forward conclusion from attributing all deviations from expectation to “chance”. The problem is that many of these deviations are not in fact due to chance, and if you analyze the data using the Poisson as described above, you will have evidence of when it is, and is not, the case.

For example, a team that wins more close games than it “should”, games won by say just one or two runs, while getting badly smoked in a small subset of other games, will appear to benefit from “good luck”, according to the pythag approach. But using the Poisson approach, you can identify whether or not a team’s basic quality likely changed at various times during the season. Furthermore, you can also examine whether the joint distribution of events (runs scored, runs allowed), follows random expectation, given their individual distributions. If they do not, then you know that some non-random process is going on. For example, that team that wins (or loses) more than it’s expected share of close games most likely has some ability to win (or lose) close games–something about the way the team plays explains it, not random chance. There are many particular explanations, in terms of team skill and strategy, that can explain such results, and more specific data on a team’s players’ performance can lend evidence to the various possibilities.

So, the whole “luck” explanation that certain elements of the sabermetric crowd are quite fond of and have accepted as the Gospel of James, may be quite suspect at best, or outright wrong. I should add however that if the Indians win the series, it’s skill all the way while if the Cubs win it’ll most likely be due to luck.

# Natural selection, genetic fitness and the role of math–part two

I’ve been thinking some more about this issue–the idea that selection should tend to favor those genotypes with the smallest temporal variations in fitness, for a given mean fitness value (above 1.00). It’s taken some time to work through this and get a grip on what’s going on and some additional points have emerged.

The first point is that although I surely don’t know the entire history, the idea appears to be strictly mathematically derived, from modeling: theoretical. At least, that’s how it appears from the several descriptions that I’ve read, including Orr’s, and this one. These all discuss mathematics–geometric and arithmetic means, absolute and relative fitness, etc., making no mention of any empirical origins.

The reason should be evident from Orr’s experimental description, in which he sets up ultra-simplified conditions in which the several other important factors that can alter genotype frequencies over generations, are made unvarying. The point is that in a real world experimental test you would also have to control for these things, either experimentally or statistically, and that would not be easy. It’s hard to see why anybody would go to such trouble if the theory weren’t there to suggest the possibility in the first place. There is much more to say on the issue of empirical evidence. Given that it’s an accepted idea, and that testing it as the generalization it claims to be is difficult, then the theoretical foundation had better be very solid. Well, I can readily conceive of two strictly theoretically-based reasons of why the idea might well be suspect. For time’s sake, I’ll focus on just one of those here.

The underlying basis of the argument is that, if a growth rate (interest rate, absolute fitness, whatever) is perfectly constant over time, the product of the series gives the total change at the final time point, but if it is made non-constant, by varying it around that rate, then the final value–and thus the geometric mean–will decline. The larger the variance around the point, the greater the decline. For example, suppose a 2% increase of quantity A(0) per unit time interval (g), that is, F = 1.020. Measuring time in generations here, after g = 35 generations, A(35) = F^g = 1.020^35 = 2.0; A is doubled in 35 generations. The geometric (and arithmetic) mean over the 35 years is 1.020, because all the yearly rates are identical. Now cause F to instead vary around 1.02 by setting it as the mean of a normal distribution with some arbitrarily chosen standard deviation, say 0.2. The geometric mean of the series will then drop (on average, asymptotically) to just below 1.0 (~ 0.9993). Since the geometric mean is what matters, genotype A will then not increase at all–it will instead stay about the same.

```pstep = 0.00001; probs = seq(pstep, 1-pstep, pstep)
q = qnorm(p=probs, mean=1.02, sd=0.2)
gm = exp(mean((log(q)))); gm
```

This is a very informative result. Using and extending it, now imagine an idealized population with two genotypes, A and B, in a temporally unvarying selection environment, with equal starting frequencies, A = B = 0.50. Since the environment doesn’t vary, there is no selection on either, that is F.A = F.B = 1.0 and they will thus maintain equal relative frequencies over time. Now impose a varying selection environment where sometimes conditions favor survival of A, other times B. We would then repeat the above exercise, except that now the mean of the distribution we construct is 1.000, not 1.020. The resulting geometric mean fitness of each genotype is now 0.9788 (just replace 1.02 with 1.00 in the above code).

So what’s going to happen? Extinction, that’s what. After 35 generations, each will be down to 0.9788^35 = 0.473 of it’s starting value, on average, and on the way to zero. The generalization is that any population having genotypes of ~ equal arithmetic mean (absolute) fitness and normally distributed values around that mean, will have all genotypes driven to extinction, and at a rate proportional to the magnitude of the variance. If instead, one genotype has an arithmetic mean fitness above 1.00 a threshold value determined by it’s mean and variance, while all others are below it, then the former will be driven to fixation and the latter to extinction. These results are not tenable–this is decidedly not what we see in nature. We instead see lots of genetic variation, including vast amounts maintained over vast expanses of time. I grant that this is a fairly rough and crude test of the idea, but not an unreasonable one. Note that this also points up the potentially serious problem caused by using relative, instead of absolute, fitness, but I won’t get into that now.

Extinction of course happens in nature all the time, but what we observe in nature is the result of successful selection–populations and species that survived. We know, without question, that environments vary–wildly, any and all aspects thereof, at all scales, often. And we also know without question that selection certainly can and does filter out the most fit genotypes in those environments. Those processes are all operating but we don’t observe a world in which alleles are either eliminated or fixed. The above examples cannot be accurate mathematical descriptions of a surviving species’ variation in fitness over time–something’s wrong.

The “something wrong” is the designation of normally distributed variation, or more exactly, symmetrically distributed variation. To keep a geometric mean from departing from it’s no-variance value, one must skew the distribution around the mean value, such that values above it (x) are inverses (1/x) (mean/x) of those below it–that is the only way to create a stable geometric mean while varying the individual values. [EDIT: more accurately, the mean must equal the product of the values below the mean, multiplied by the mean divided by the product of the values above the mean, but the values will be skewed in any case.] Mathematically, the way to do so is to work with the logarithms of the original values–the log of the geometric mean is designated as the mean of normally distributed logarithms of the individual values, of whatever size variance one wants. Exponentiation of the sum of the logarithms will equal the product of the fitness series.

Hopefully, what I’m driving at is emerging. If the variance structure must obey this mathematical necessity to preserve a genotype’s mean fitness at 1.00, while still allowing the individual series values to vary…then why should we not expect the same to hold true when the mean geometric fitness is not equal to 1.00? I would argue that that’s exactly what we should expect, and that Gillespie’s original arguments–and Orr’s, and others’ summaries thereof–are not particularly defensible theoretical expectations of what is likely to be happening in nature. Specifically, the idea that the variance in fitness around an arithmetic mean should necessarily arise from symmetrically (normally) distributed values, is questionable.

As alluded to above, there is (at least) a second theoretical argument as well, but I don’t have time to get into it now (nor for this one for that matter). Suffice it to say that it involves simultaneous temporal changes in total population size and selective environments. All this without even broaching the entire hornet’s nest of empirically testing the idea, a topic reviewed five years ago by Simons. For starters, it’s not clear to me just how conservative “bet hedging” could ever be distinguished from the effects of phenotypic plasticity.

References

Simons, A.M. (2011) Modes of response to environmental change and the elusive empirical evidence for bet hedging. doi:10.1098/rspb.2011.0176

Other references are linked to in the previous post.

# On natural selection, genetic fitness and the role of math

I really am not quite sure what to make of this one.

Last week at the blog Dynamic Ecology it was argued that natural selection behaves like a “risk-averse” money investor. That is, assuming that fitness varies over time (due to e.g. changing environmental variables or other selective factors), natural selection favors situations in which the mean fitness is maximized while the variance is minimized. The idea is explained in this short paper by Orr (2007), whose goal was to explain previous findings (Gillespie, 1973) intuitively. This presumes that knowledge of investor behavior is commonplace, but for my money, an examination of the math details and assumptions is what’s really needed.

This conclusion seems entirely problematic to me.

# Does the Poisson scale up?

I often get obsessed with certain topics, especially statistical and mathematical ones. Lately I’ve been thinking about the Poisson distribution a lot, as it figures heavily in analyses of random populations, and thus also in assessing deviations therefrom, a topic that I’ve been heavily involved for a while, relative to historic and current tree populations. I also find that often a specific question will arise, related to whatever it is I’m doing, that is best (and often quickest) answered in a way that gives me the most confidence, by direct simulation, for which I use R.

The topic of varying scales of variation, and the information that can be obtained by analysis thereof, is a pretty interesting one IMO. When the generating processes for a phenomenon of interest are complex, or poorly understood for whatever reason, one can (and should!) obtain valuable information regarding likelihoods of various hypothetical process drivers, by multi-scale analysis–essentially, obtaining evidence for the scale(s) at which departures from randomness are greatest and using that information to suggest, or constrain, possible explanations. There are a number of ways to go about doing so.

The Poisson distribution is the appropriate descriptor of a homogeneous (constant) rate process whose individual event outcomes are random. An “under-dispersed” population at a particular scale of analysis will be more “regular” in its arrangement than expected by a random process, and it is clear that in such a situation there must necessarily also be under-dispersion at at least some other scales as well, both smaller and larger. To illustrate via an extreme example, suppose some location that gets 36 inches of precipitation (P) per year on average, distributed as exactly three inches per month, every month. The probability of such a result arising, when P varies randomly (Poisson) at any sub-monthly scale, is extremely low. It won’t occur over any extended period of time. The same principle holds, though muted, if there is some monthly variance around 3.0.

In an over-dispersed (“clustered”) population, again at some defined scale, the situation is different. Such a population will also be over-dispersed at smaller scales, but not necessarily at larger ones, at least not at the same intensity, and there should be some unknown scale at which the variation reduces to the Poisson. This means that a Poisson population is not necessarily Poisson at smaller scales, but it should be so at larger scales. That is, it should “scale up” according to Poisson expectation, i.e. with the same rate but a greater absolute number, and variance therein, per sample unit.

But does it? Or rather, what does R think about the matter?

Well, here’s what I get, using the example case of a mean annual P/yr of 36″ and 100,000 simulated monthly, or weekly, sums from randomly sampling the Poisson expectation at sub-interval (weekly or daily) time scales.

```rm(list=ls())
options(digits=3)
# 1. Observed annual mean and corresponding scaled sub-int. means.  Year = 364, month = 28, days.
obs.ann.mn = 36			# observed annual mean, from record
(monthly.mn = obs.ann.mn/13)	# 13 months/yr!
(weekly.mn = obs.ann.mn/52)
(daily.mn = obs.ann.mn/364)

# 2. Poisson CDF expectations, ignore slight variations in days:
# Equal interval CDF probs determined by no. time intervals in a year, eases interpr.
# Set CDF probs to correspond to the various time ints. to give temporal distribution
# NOTE that qpois, for cdf = 1.0, = Inf., so omit last interval
# Poisson
P.Pmonth = qpois(p=(1:12)/13, lambda = monthly.mn) 		# 13 mos
P.Pweek = qpois(p=(1:51)/52, lambda = weekly.mn)		# 52 weeks
P.Pday = qpois(p=(1:363)/364, lambda = daily.mn)		# 364 days
table (P.Pmonth); table(P.Pweek); table(P.Pday)

# 3. Simulations: do repeated samples taken from shorter time periods and summed, match Poisson/gamma expectations at longer periods?
n.trials = 1e5
P.month.week = rep(NA,n.trials)
for (i in 1:n.trials) P.month.week[i] = sum(sample(P.Pweek, 4, replace=T))			# Exactly 4 weeks to our months
q.P.month.week = as.vector(quantile(P.month.week, probs = (1:12)/13)); rm(P.month.week)
P.month.day = rep(NA,n.trials)
for (i in 1:n.trials) P.month.day[i] = sum(sample(P.Pday, 28, replace=T))
q.P.month.day = as.vector(quantile(P.month.day, probs = (1:12)/13)); rm(P.month.day)
P.week.day = rep(NA,n.trials)
for (i in 1:n.trials) P.week.day[i] = sum(sample(P.Pday, 7, replace=T))
q.P.week.day = as.vector(quantile(P.week.day, probs = (1:51)/52)); rm(P.week.day)

mw = data.frame(table (P.Pmonth), table(q.P.month.week))[,-3]; colnames(mw)=c("Precip, monthly", "Poisson Expect.", "Aggr., weekly")
md = data.frame(table (P.Pmonth), table(q.P.month.day))[,-3]; colnames(md)=c("Precip, monthly", "Poisson Expect.", "Aggr., daily")
wd = data.frame(table (P.Pweek), table(q.P.week.day))[,-3]; colnames(wd)=c("Precip, weekly", "Poisson Expect.", "Aggr., daily")
mw; md; wd
```

```Precip, monthly 	Poisson Exp. 	Aggr., weekly
1               3             3
2               3             3
3               3             3
4               2             2
5               1             1
Precip, monthly 	Poisson Exp.	Aggr., daily
1               3            3
2               3            3
3               3            3
4               2            2
5               1            1
Precip, weekly 		Poisson Exp.	 Aggr., daily
0              26           26
1              18           18
2               6            6
3               1            1
```

*However, I also evaluated gamma expectations, mainly as a check and/or curiosity (the gamma interpolates between the Poisson’s integer values). I didn’t always get the exact correspondence as expected, and I’m not really sure why. Close, but not close enough to be due to rounding errors, so that’s kind of interesting, but not enough to pursue further.

Funding for this post was provided in equal part by the French Fish Ectoderm and Statistics Foundation and the American Association for Advancement of Amalgamated and Aggregated Associations. These organizations are solely responsible for any errors herein, and associated management related consequences.

# “We live in a Chi-square society due to political correctness”

So, without getting into the reasons, I’m reading through the entry in the International Encyclopedia of Statistical Science on “Statistical Fallacies: Misconceptions and Myths”, written by one “Shlomo Sawilowsky, Professor, Wayne State University, Detroit MI, USA”. Within the entry, 20 such fallacies are each briefly described.

Sawilowsky introduces the topic by stating:

Compilations and illustrations of statistical fallacies, misconceptions, and myths abound…The statistical faux pas is appealing, intuitive, logical, and persuasive, but demonstrably false. They are uniformly presented based on authority and supported based on assertion…these errors spontaneously regenerate every few years, propagating in peer reviewed journal articles…and dissident literature. Some of the most egregious and grievous are noted below.

Great, let’s get after it then.

He then gets into his list, which proceeds through a set of +/- standard types of issues, including misunderstanding of the Central Limit Theorem, Type I errors, p values, effect sizes and etc. Up comes item 14:

14. Chi-square
(a) We live in a Chi-square society due to political correctness that dictates equality of outcome instead of equality of opportunity. The test of independence version of this statistic is accepted sans voire dire by many legal systems as the single most important arbiter of truth, justice, and salvation. It has been asserted that any statistical difference between (often even nonrandomly selected) samples of ethnicity, gender, or other demographic as compared with (often even inaccurate, incomplete, and outdated) census data is primae faciea evidence of institutional racism, sexism, or other ism. A plaintiff allegation that is supportable by a significant Chi-square is often accepted by the court (judges and juries) praesumptio iuris et de iure. Similarly, the goodness of fit version of this statistic is also placed on an unwarranted pedestal.

Bingo Shlomo!!

Now this is exactly what I want from my encyclopedia entries: a strictly apolitical, logical description of the issue at hand. In fact, I hope to delve deep into other statistical writings of Dr. Sawilowsky to gain, hopefully, even better insights than this one.

Postscript: I’m not really bent out of shape on this, and would indeed read his works (especially this one: Sawilowsky, S. (2003) Deconstructing arguments from the case against hypothesis testing. J. Mod. Appl. Stat. Meth. 2(2):467-474). I can readily overlook ideologically driven examples like this to get at the substance I’m after, but I do wonder how a professional statistician worked that into an encyclopedia entry.

I note also that the supposed “screening fallacy” popular on certain blogs is not included in the list…and I’m not the least bit surprised.

# On clustering, part five

This post constitutes a wrap-up and summary of this series of articles on clustering data.

The main point thereof is that one needs an objective method for obtaining evidence of meaningful groupings of values (clusters) in a given data set. This issue is most relevant to non-experimental science, in which one is trying to obtain evidence of whether the observed data are explainable by random processes alone, versus processes that lead to whatever structure may have been observed in the data.

But I’m still not happy with my description in part four of this series, regarding observed and expected data distributions, and what these imply for data clustering outputs. In going over Ben Bolker’s outstanding book for the zillionth time (I have literally worn the cover off of this book, parts of it freely available here), I find that he explains what I was trying to better, in his description of the negative binomial distribution relative to the concept of statistical over-dispersion, where he writes (p. 124):

…rather than counting the number of successes obtained in a fixed number of trials, as in a binomial distribution, the negative binomial counts the number of failures before a pre-determined number of successes occurs.

This failure-process parameterization is only occasionally useful in ecological modeling. Ecologists use the negative binomial because it is discrete, like the Poisson, but its variance can be larger than its mean (i.e. it can be over-dispersed). Thus, it’s a good phenomenological description of a patchy or clustered distribution with no intrinsic upper limit, that has more variance than the Poisson…The over-dispersion parameter measures the amount of clustering, or aggregation, or heterogeneity in the data…

Specifically, you can get a negative binomial distribution as the result of a Poisson sampling process where the rate lambda itself varies. If lambda is Gamma-distributed (p.131) with shape parameter k and mean u, and x is Poisson-distributed with mean lambda, then the distribution of x will be a negative binomial distribution with mean u and over-dispersion parameter k (May, 1978; Hilborn and Mangel, 1997). In this case, the negative binomial reflects unmeasured (“random”) variability in the population.

The relevance of this quote is that a distribution that is over-dispersed, that is, one that has longer right or left (or both) tails than expected from a Poisson distribution having a given mean, is evidence for a non-constant process structuring the data. The negative binomial distribution describes this non-constancy, in the form of an “over-dispersion parameter” (k). In that case, the process that is varying is doing so smoothly (as defined by a gamma distribution), and the resulting distribution of observations will therefore also be smooth. In a simpler situation, one where there are say, just two driving process states, a bi-modal distribution of observations will result.

Slapping a clustering algorithm on the latter will return two clusters whose distinction is truly meaningful–the two sets of values were likely generated by two different generating parameters. A clustering applied to a negative binomial distribution will be arbitrary with respect to just which values get placed in which cluster, and even to the final number of clusters delineated, but not with respect to the idea that the observations do not result from a single homogeneous process, which is a potentially important piece of information. Observation of the data, followed by some maximum likelihood curve fits of negative binomial distributions, would then inform one that the driving process parameters varied smoothly, rather than discretely/bimodally.

# On clustering, part four

I have not written any post, or series thereof, in which I had more questions or made me think more precisely about purposes and products of statistical approaches/algorithms, than this one. I’m finding this topic of statistical clustering of multivariate data to be far subtler and more interesting than I thought when I waded into it. Justifiably counter-intuitive even; I encourage you to read through this one as I think the root concepts are pretty important.

It’s easy to misunderstand exactly what the output of a statistical/mathematical procedure represents, and thus vital to grasp just what is going on, and why. This might seem like a fully obvious statement, but my experience has often been that we who work in mostly non-experimental fields (e.g. ecology, geology, climatology, etc) frequently do not in fact have a clear and very exact grasp of what our various quantitative methods are doing. I’m not just referring to highly complex methods either, but even to simple methods we often take for granted. This applies to me personally, and to all kinds of papers by others that I’ve read, in various fields.

The first point to be made here is to retract and correct the example I gave in part one of this series. I had said there that if one clusters the ten numbers 0-9, into two groups, using either a k-means or hierarchical clustering procedure, the first group will contain {0 to 4} and the second will contain {5 to 9}. That’s not the problem–they will indeed do so (at least in R). The problem is the statement that this output is meaningless–that it does not imply anything about structure in the data. On thinking this through a little more, I conclude that whether it does or doesn’t depends on the specifics of the data.

It is very true that an inherent potential problem with clustering methods is the imposition of an artificial structure on the data when none actually exists. With the “k-means” algorithm this is a direct result of minimizing sums of the squared differences from the mean within groups, whereas with hierarchical methods (AGNES, etc), it results from iteratively joining the closest items (item to item or item to group), as determined by a constantly updated distance matrix after each join. Either way, the algorithms cause similarly valued items to be placed together in groups.

The problem is with the example I used, specifically, that a set of values appearing to follow a smooth gradient, such as the integers 0:9 in the example, rather than displaying definite groupings/clusters, necessarily indicates a lack of structure, and thus a meaningless clustering result. This is not necessarily the case, which can only be grasped in the context of an apriori expected distribution of values, given an observed overall mean value. These expectations are given by the Poisson and gamma distributions, for integer- and real-valued data respectively.

The most immediate question arising from that last statement should be “Why exactly should the Poisson or gamma distributions define our expectation, over any other distribution?”. The answer is pretty important, as it gets at just what we’re trying to do when we cluster values. I would strongly argue that that purpose has to be the identification of structure in the data, that is, a departure from randomness, which in turn means we need an estimate of the random condition against which to gauge possible departures. Without having this “randomness reference point”, the fact that {0:4} and {5:9} will certainly fall into two different groups when clustered is nothing more than the trivial re-statement that {5:9} are larger values than {0:4}: fully meaningless. But R (and presumably most other statistical analysis programs as well), does not tell you that–it gives you no information on just how meaningful a given clustering output is. Not good: as scientists we’re after meaningful output, not just output.

The answer to the above question is that the Poisson and gamma distributions provide those randomness reference points, and the reason why is important: they are essentially descriptors of expected distributions due to a type of sampling error alone. Both are statements that if one has some randomly distributed item–say tree locations in two dimensions in a forest, or whatever example you like–then a set of pre-defined measurement units, placed at random throughout the sampling space (geographic area in my example), will vary in the number of items contained (Poisson), and in the area encompassed by measurements from random points to the nth closest objects thereto (gamma). Not sampling error in the traditional sense, but the concept is very similar, so that’s what I’ll call it.

Therefore, departures from such expectations, in an observed data set, are necessarily evidence for structure in those data, in the sense that something other than simple sampling variation–some causative agent–is operating to cause the departure. The main goal of science is to seek and quantify these agents…but how are you supposed to seek those somethings until you have evidence that there is some something there to begin with? And similarly, of the many variables that might be important in a given system, how does one decide which of those are most strongly structured? These questions are relevant because this discussion is in the context of large, multivariate, non-experimental systems, with potentially many variables of unknown interaction. We want evidence regarding whether or not the observations are explainable by a single homogeneous process, or not.

Things get interesting, given this perspective, and I can demonstrate why using say, the set of n = 21 integers {0:20}, a perfect gradient with no evidence of aggregation.

The Poisson distribution tells us that we do not expect a distribution of {0:20} for 21 integers having a mean value of 10. We observe and expect (for a cdf at 5% quantile steps):

```   quantile observed expected
0.05        1        5
0.10        2        6
0.15        3        7
0.20        4        7
0.25        5        8
0.30        6        8
0.35        7        9
0.40        8        9
0.45        9        9
0.50       10       10
0.55       11       10
0.60       12       11
0.65       13       11
0.70       14       12
0.75       15       12
0.80       16       13
0.85       17       13
0.90       18       14
0.95       19       15
```

…as obtained by (in R):

```observed = 0:20; n = length(observed)
expected = qpois(p = seq(0, 1, length.out=n), lambda=mean(observed))
df = data.frame(quantile = seq(0, 1, length.out=n), observed, expected)[2:20,]
```

Running a chi-square test on the observed values:

```chisq.test(x=df\$observed)
```

…we get:

data: df\$observed
X-squared = 57, df = 18, p-value = .000006

There is thus only a very tiny chance of getting the observed result from sampling variation alone. Observation of the above table shows that the observed data are stretched–having longer left and right tails, relative to expectation.

But this series is about clustering and how to find evidence for it in the data…and there is no tendency toward clustering evident in these data. Indeed, the opposite is true–the observed distribution is long-tailed, “stretched out” compared to expectation. But…this result must mean that there is indeed some structuring force on the data to cause this result–some departure from the homogeneous state that the Poisson assumes. We can’t know, from just this analysis alone, just how many “real” clusters of data there are, but we do know that it must be more than one (and I hope it is apparent as to why)! More precisely, if we’ve decided to look for clusters in the data, then the chi-square test gives us evidence that more than one cluster is highly likely.

Just how many clusters is a more difficult question, but we could evaluate the first step in that direction (i.e. two clusters) by dividing the values into two roughly even sized groups defined by the midpoint (= 10) and evaluating whether Poisson-distributed values for the two groups, having means of 4.5 and 15, give a better fit to the observations:

```(exp1 = qpois(p = seq(0, 1, length.out=11), lambda=4.5)[2:10])
(exp2 = qpois(p = seq(0, 1, length.out=12), lambda=15)[2:11])
(df2 = data.frame(quantile = seq(0, 1, length.out=n)[2:20], obs=df\$observed, exp = sort(c(exp1,exp2))))
```

…which gives:

```   quant. obs. exp.
0.05   1   2
0.10   2   3
0.15   3   3
0.20   4   4
0.25   5   4
0.30   6   5
0.35   7   5
0.40   8   6
0.45   9   7
0.50  10  10
0.55  11  11
0.60  12  13
0.65  13  14
0.70  14  14
0.75  15  15
0.80  16  16
0.85  17  17
0.90  18  18
0.95  19  20
```

…and then compute the mean of the two chi-square probabilities:

```obs1=0:9; obs2=10:20
p = mean(c(chisq.test(obs1)\$p.value, chisq.test(obs2)\$p.value))
```

…which returns p = 0.362
[Edit note: I had wrong values here originally–because I mistakenly ran the chi square tests on the expected values instead of the observed. Now corrected–still a very high odds ratio, just not as high as before.]

The odds ratio of the two hypotheses (two Poisson-distributed groups having means of 4.5 and 15.0, versus just one group with a mean of 10.0) is thus 0.362 / .000006 = 60,429. Thus, clustering the observed data into these two groups would be a very highly defensible decision, even though the observed data comprise a perfect gradient having no tendency toward aggregation whatsoever!

The extension from the univariate to the multivariate case is straight-forward, involving nothing more than performing the same analysis on each variable and then averaging the resulting set of probabilities or odds ratios.

# On clustering, part three

It’s not always easy to hit all the important points in explaining an unfamiliar topic and I need to step back and mention a couple of important but omitted points.

The first of these is that we can estimate the expected distribution of individual values, from a known mean, assuming a random distribution of values, which, since the mean must be obtained from a set of individual values, means we can compare the expected and observed values, and thus evaluate randomness. The statistical distributions designed for this task are the Poisson and the gamma, for integer- and real-valued data respectively. Much of common statistical analysis is built around the normal distribution, and people are thus generally most familiar with it and prone to use it, but the normal won’t do the job here. This is primarily because it’s not designed to handle skewed distributions, which are a problem whenever data values are small or otherwise limited at one end of the distribution (most often by the value of zero).

Conversely, the Poisson and gamma have no problem with such situations: they are built for the task. This fact is interesting, given that both are defined by just one parameter (the overall mean) instead of two, as is the case for the normal (mean and standard deviation). So, they are simpler, and yet are more accurate over more situations than is the normal–not an everyday occurrence in modeling. Instead, for whatever reason, there’s historically been a lot of effort devoted to transforming skewed distributions into roughly normal ones, usually by taking logarithms or roots, as in e.g. the log-normal distribution. But this is ad-hoc methodology that brings with it other problems, including back transformation.

The second point is hopefully more obvious. This is that although it is easy to just look at a small set of univariate data and see evidence of structure (clustered or overly regular values), large sample sizes and/or multivariate data quickly overwhelm the brain’s ability to do this well, and at any rate we want to assign a probability to this non-randomness.

The third point is maybe the most important one, and relates to why the Poisson and gamma (and others, e.g. the binomial, negative binomial etc.) are very important in analyzing non-experimental data in particular. Indeed, this point relates to the issue of forward versus inverse modeling, and to issues in legitimacy of data mining approaches. I don’t know that it can be emphasized enough how radically different the experimental and non-experimental sciences are, in terms of method and approach and consequent confidence of inference. This is no small issue, constantly overlooked IMO.

If I’ve got an observed data set, originating from some imperfectly known set of processes operating over time and space, I’ve got immediate trouble on my hands in terms of causal inference. Needless to say there are many such data sets in the world. When the system is known to be complex, such that elucidating the mechanistic processes at the temporal and spatial scales of interest is likely to be difficult, it makes perfect sense to examine whether certain types of structures might exist just in the observed data themselves, structures that can provide clues as to just what is going on. The standard knock on data mining and inverse modeling approaches more generally is that of the possibility of false positive results–concluding that apparent structures in the data are explainable by some driving mechanism when in fact they are due to random processes. This is of course a real possibility, but I find this objection to be more or less completely overblown, primarily because those who conduct this type of analysis are usually quite well aware of this possibility thank you.

Overlooked in those criticisms is the fact that by first identifying real structure in the data–patterns explainable by random processes at only a very low probability–one can immediately gain important clues as to just what possible causal factors to examine more closely instead of going on a random fishing expedition. A lot of examples can be given here, but I’m thinking ecologically and in ecology there are many variables that vary in a highly discontinuous way, and this affects the way we have to consider things. This concept applies not only to biotic processes, which are inherently structured by the various aggregational processes inherent in populations and communities of organisms, but to various biophysical thresholds and inflection points as well, whose operation over large scales of space of time are often anything but well understood or documented. As just one rough but informative example, in plant ecology a large fraction of what is going on occurs underground, where all kinds of important discontinuities can occur–chemical, hydrologic, climatic, and of course biological.

So, the search for non-random patterns within observed data sets–before ever even considering the possible drivers of those patterns–is, depending on the level of apriori knowledge of the system in question, a potentially very important activity. In fact, I would argue that this is the most natural and efficient way to proceed in running down cause and effect in complex systems. And it is also one requiring a scientist to have a definite awareness of the various possible drivers of observed patterns and their scales of variation.

So, there’s a reason plant ecologists should know some physiology, some reproductive biology, some taxonomy, some soil science, some climatology, some…

# On clustering, part two

In part one of what has quickly developed into an enthralling series, I made the points that (1) at least some important software doesn’t provide a probability value for cluster outputs, and (2) that while it’s possible to cluster any data set, univariate or multivariate, into clearly distinct groups, so doing doesn’t necessarily mean anything important. Such outputs only tell us something useful if there is some actual structure in the data, and the clustering algorithm can detect it.

But just what is “structure” in the data? The univariate case is simplest because with multivariate data, structure can have two different aspects. But in either situation we can take the standard statistical stance that structure is the detectable departure from random expectation, at some defined probability criterion (p value). The Poisson and gamma distributions define this expectation, the former for count (integer valued) data, and the latter for continuous data. By “expectation” I mean the expected distribution of values across the full data. If we have a calculated overall mean value, i.e. an overall rate, the Poisson and gamma then define this distribution, assuming each value is measured over an identical sampling interval. With the Poisson, the latter takes the form of a fixed denominator, whereas with the gamma it takes the form of a fixed numerator.

Using the familiar example of density (number of objects per unit space or time), the Poisson fixes the unit space while the integer number of objects in each unit varies, whereas the gamma fixes the integer rank of the objects that will be measured to from random starting points, with the distance to each such object (and corresponding area thereof) varying. The two approaches are just flip sides of the same coin really, but with very important practical considerations related to both data collection and mathematical bias. Without getting heavily into the data collection issue here, the Poisson approach–counting the number of objects in areas of pre-defined size–can get you into real trouble in terms of time efficiency (regardless of whether density tends to be low or high). This consideration is very important in opting for distance-based sampling and the use of the gamma distribution over area-based sampling and use of the Poisson.

But returning to the original problem as discussed in part one, the two standard clustering approaches–k-means and hierarchical–are always going to return groupings that are of low probability of random occurrence, no matter what “natural structure” in the data there may actually be. The solution, it seems to me, is to instead evaluate relative probabilities: the probability of the values within each group being Poisson or gamma distributed, relative to the probability of the overall distribution of values. In each case these probabilities are determined by a goodness-of-fit test, namely a Chi-square test for count (integer) data and a Kolmogorov-Smirnov test for continuous data. If there is in fact some natural structure in the data–that is, groups of values that are overly similar (or dissimilar) to each other than that defined by the Poisson or gamma–then this relative probability (or odds ratio if you like), will be maximized at the clustering solution that most closely reflects the actual structure in the data, this solution being defined by (1) the number of groups, and (2) the membership of each. It is a maximum likelihood approach to the problem.

If there is little or no actual structure in the data, then these odds ratios computed across different numbers of final groups will show no clearly defensible maximal value, but rather a broad, flat plateau in which all the ratios are similar, varying from each other only randomly. But when there is real structure therein, there will be a ratio that is quantifiably higher than all others, a unimodal response with a peak value. The statistical significance of this maximum can be evaluated with the Likelihood Ratio test or something similar, though I haven’t thought very hard about that issue yet.

Moving from the univariate case, to the multivariate, ain’t not no big deal really, in terms of the above discussion–it just requires averaging those odds ratios over all variables. But multivariate data does introduce a second, subtle aspect into what we mean by the term “data structure”, in the following respect. It is a possible situation wherein no variable in the data shows clear evidence of structure, per the above approach, when in fact there very much is such, but of a different kind. That outcome would occur whenever particular pairs (or larger groups) of variables are correlated with each other (above random expectation), even though the values for each such variable are in fact Poisson/gamma distributed overall. That is, there is a statistically defensible relationship between variables across sample units, but no detectable variation in values within each variable, across those sample units.

Such an outcome would provide definite evidence of behavioral similarity among variables even in the absence of a structuring of those variables by some latent (unmeasured) variable. I think it would be interesting to know how often such a situation occurs in different types of ecological and other systems, and I’m pretty sure nobody’s done any such analysis. Bear in mind however that I also once thought, at about 4:30 AM on a sleep deprived week if I remember right, that it would be interesting to see if I could beat the Tahoe casinos at blackjack based on quick probability considerations.

I hope the above has made at least some sense and you have not damaged your computer by say, throwing your coffee mug through the screen, or yelled something untoward, at volume, within earshot of those who might take offense. The Institute hereby disavows any responsibility, liability or other legal or financial connection to such events, past or future.

There will be more!

# On clustering, part one

In ecology and other sciences, grouping similar objects together for further analytical purposes, or just as an end in itself, is a fundamental task, one accomplished by cluster analysis, one of the most fundamental tools in statistics. In all but the smallest sample sizes, the number of possible groupings very rapidly gets enormous, and it is necessary therefore to both (1) have some way of efficiently avoiding the vast number of clearly non-optimal clusters, and (2) choosing the best solution from among those that seem at least reasonable.

First some background. There are (at least) three basic approaches to clustering. Two of these are inherently hierarchical in nature: they either aggregate individual objects into ever-larger groups (agglomerative methods), or successively divide the entire set into ever smaller ones (divisive methods). Hierarchical methods are based on a distance matrix that defines the distance (in measurement space) between every possible pair of objects, as determined by the variables of interest (typically multivariate) and the choice of distance measure, of which there are several depending on one’s definitions of “distance”. This distance matrix increases in size as a function of (n-1)(n/2), or roughly a squared function of n, and so for large datasets these methods quickly become untenable, unless one has an enormous amount of computer memory available, which typically the average scientist does not.

The k-means clustering algorithm works differently–it doesn’t use a distance matrix. Instead it chooses a number of random cluster starting points (“centers”) and then measures the distance to all objects from those points, and agglomerates stepwise according to which objects are closest to which centers. This greatly reduces the memory requirement for large data sets, but a drawback is that the output depends on the initial choice of centers; one should thus try many different starting combinations, and even then, the best solution is not guaranteed. Furthermore, one sets the number of final clusters desired beforehand, but there is no guarantee that the optimal overall solution will in fact correspond to that choice, and so one has to repeat the process for all possible cluster numbers that one deems reasonable, with “reasonable” often being less than obvious.

When I first did a k-means cluster analysis, years ago, I did it in SPSS and I remember being surprised that the output did not include a probability value, that is, the likelihood of obtaining a given clustering by chance alone. There was thus no way to determine which among the many possible solutions was in fact the best one, which seemed to be a pretty major shortcoming, possibly inexcusable. Now I’m working in R, and I find…the same thing. In R, the two workhorse clustering algorithms, both in the main stats package are kmeans and hclust, corresponding to k-means and hierarchical clustering, respectively. In neither method is the probability of the solution given as part of the output. So, it wasn’t just SPSS–if R doesn’t provide it, then it’s quite possible that no statistical software program (SAS, S-Plus, SigmaStat, etc.) does so, although I don’t know for sure.

There is one function in R that attempts to identify what it calls the “optimal clustering”, function optCluster in the package of the same name. But that function, while definitely useful, only appears to provide a set of different metrics by which to evaluate the effectiveness of any given clustering solution, as obtained from 16 possible clustering methods, but with no actual probabilities attached to any of them. What I’m after is different, more defensible and definitely more probabilistic. It requires some careful thought regarding just what clustering should be all about in the first place.

If we talk about grouping objects together, we gotta be careful. This piece at Variance Explained gives the basic story of why, using examples from a k-means clustering. A principal point is that one can create clusters from any data set, but the result doesn’t necessarily mean anything. And I’m not just referring to the issue of relating the variable being clustered to other variables of interest in the system under study. I’m talking about inherent structure in the data, even univariate data.

This point is easy to grasp with a simple example. If I have the set of 10 numbers from 0 to 9, a k-means clustering into two groups will place 0 to 4 in one group and 5 to 9 in the other, as will most hierarchical clustering trees trimmed to two groups. Even if some clustering methods were to sometimes place say, 0 to 3 in one group and 4 to 9 in the other, or similar outcome (which they conceivably might–I haven’t tested them), the main point remains: there are no “natural” groupings in those ten numbers–they are as evenly spaced as is possible to be, a perfect gradient. No matter how you group them, the number of groups and the membership of each will be an arbitrary and trivial result. If, on the other hand, you’ve got the set {0,1,2,7,8,9} it’s quite clear that 0-2 and 7-9 define two natural groupings, since the members of each group are all within 1 unit of the means thereof, and with an obvious gap between the two.

This point is critical, as it indicates that we should seek a clustering evaluation method that is based in an algorithm capable of making this discrimination between a perfect gradient and tightly clustered data. Actually it has to do better than that–it has to be able to distinguish between perfectly spaced data, randomly spaced data, and clustered data. Randomly spaced data will have a natural degree of clustering by definition, and we need to be able to distinguish that situation from truly clustered data, which might not be so easy in practice.

There are perhaps several ways to go about doing this, but the one that is most directly obvious and relevant is based on the Poisson distribution. The Poisson defines the expected values in a set of sub-samples, given a known value determined from the entire object collection, for the variable of interest. Thus, from the mean value over all objects (no clustering), we can determine the probability that the mean values for each of the n groups resulting from a given clustering algorithm (of any method), follow the expectation defined by the Poisson distribution determined by that overall mean (the Poisson being defined by just one parameter). The lower that probability is, the more likely that the clusters returned by the algorithm do in fact represent a real feature of the data set, a natural aggregation, and not just an arbitrary partitioning of random or gradient data. Now maybe somebody’s already done this, I don’t know, but I’ve not seen it in any of the statistical software I’ve used, including R’s two workhorse packages stats and cluster.

More hideous detail to come so take cover and shield your eyes.

# Ask the experts, part n

Well we’re long overdue for another installment of “Ask the Self-Appointed Experts“, or at least for the question part. In today’s edition a follower from Two Forks Montana wrestles with the following conundrum, inviting others to help, or at least reassure him that he is not confused alone. He writes:

I know this issue is all over AM talk radio but the inmates pretty clearly run the asylum there and I’m more confused on the following issue than ever.

It is well known that, given a known rate process, the gamma distribution defines the expected values from random starting points to the nth “nearest” object, and so inversely, we can estimate unknown rates by such values. For example, in a forest of randomly distributed trees, the circular area, a, defined by the distance to the nth closest tree, will estimate tree density. But as Skellam (1952), Moore (1954) and Pollard (1971) showed analytically, these estimates are biased, in inverse magnitude to the value of n, specifically, as n/(n-1) for n > 1. Thus, the distance to, say, the 2nd closest tree will correspond to the area represented by one tree, not two. All very well and good.

Now, the mean of the integration of the gamma distribution from 0 to 1, for a known rate, should return the mean area a, but when I closely approximate the integral (in R, which can’t integrate), I seem to get bias-corrected values reflecting the rates, rather than the biased values reflecting the areas (a) to the nth object. I’m flummoxed and not a little aggravated. Do they know what they’re doing there at R headquarters, or is it me that’s got all turned about the wrong way? If I can’t even trust the values from the statistical distributions in R, then just what can I trust there? I tried taking my mind off the matter by following the Winnipeg Jets (WJPCA, 2015), but man, one can just take only so much of that and I sure as hell ain’t going to follow Edmonton. The ice fishing seems to help, at least until the alcohol wears off, but really there should be an 800 number I think. If you can put me onto some clues I would be most grateful.

My R code and results (for the n = 2 object):

```probs = seq(from = 0.000001, to = 0.999999, by = 0.000001)		# evenly spaced probability steps
mean.area = mean(qgamma(p=probs, shape=2, rate = 1, lower.tail = T))	# approximate the pdf integral, as the mean of the sampled distribution
[1] 1.999993
```

1.999993, WTF !??!

References:
Skellam, J.G. 1952. Studies in statistical ecology: I. Spatial Pattern. Biometrika 39:346-362
Moore, P.G. 1954. Spacing in plant populations. Ecology 35:222-227.
Pollard, J.H. 1971. On distance estimators of density in randomly distributed forests. Biometrics 27:991-1002
WJPCA, 2015. Your 2015-2016 Winnipeg Jets: Fast, Furious and Fun. Winnipeg Jets Promotional Coordinating Association.

Sincerely,
Stewart Stansbury, Two Forks MT (“not quite Canada but roughly as empty”), USA

p.s. Jets picture enclosed

Jets player checking imaginary opponent hard into the glass, to the delight of all

# Rate process estimation: Poisson vs gamma

This post is about estimating a rate process, particularly when certain required input data are missing or critical assumptions are unmet. Although it’s quite unrelated to recent posts (here and here) I’m actually headed somewhere definite with the collection of them, if you can bear with me. If you can’t bear with me, I can bear that.

As we know, estimating rate processes is a very common task in science, where “rate” is defined generally as a measured change in one variable per unit measure of another, usually expressed as a ratio or fraction. In the discussion that follows I’ll use object density–the number of objects of interest per unit area–to illustrate the concepts, but they are applicable to rates of any type.

Obtaining rate estimates, with rare exceptions, requires empirical sampling over the domain of interest, followed by some mathematical operation on the samples, often an averaging. Sampling, however, can take two opposing approaches, as determined by the measurement units of the denominator and the numerator. Using our density example, in the first approach we tally the objects occurring within samples of some apriori defined area (the denominator thus fixed), and then simply average the values over all samples. In the second distance sampling (DS) approach, we instead measure to the nth closest objects from random starting points, with the value of n chosen apriori, and thus with the numerator fixed and the denominator varying. Low values of n are typically chosen for convenience, and distances are converted to density using a suitable density estimator.

This DS approach has definite advantages. For one, it is time efficient, as no plot delineations are required, and decisions on which objects to tally or which have already been so, are typically easier. A second advantage is that the total sample size is independent of object density, and a third is an often superior scatter of samples through the study area, leading to a better characterization of spatial pattern.

The data from the two approaches are modeled by two well-established statistical models, the Poisson and the gamma, respectively. With the Poisson, the number of objects falling in each of a collection of plots will follow a Poisson distribution, the exact values determined by the overall density. With the second approach, the distances to the nth closest objects follow a gamma distribution.

Either approach requires that objects be randomly located throughout the study area for unbiased estimates, but there is a second potential bias source with distance sampling (DS), and hence a down-side. This bias has magnitude of (n+1)/n, and so e.g., measuring to the n = 1st closest object will bias the density estimate by a factor of 2x, to the n = 2nd closest by 3/2, etc. This bias is easily removed simply by multiplying by the inverse, n/(n+1), or equivalently, measuring to the 2nd closest objects will give the area corresponding to unit density, that is, the area at which object density equals one. Which is kind of interesting, assuming you’ve made it this far and nothing more enthralling occupies your mind at the moment.

The shape of the gamma distribution varies with n. When n = 1, the gamma assumes a negative exponential shape, and when n > 1 it is unimodal, strongly skewed at low values of n, but of decreasing skew (more normal) at higher n. The upshot is that one can diagnose n if it is unknown, because these distributions will all differ, at least somewhat, especially in their higher moments, such as the variance and skew. However, discriminatory power decreases with increasing n, and the approach also assumes the objects occur spatially randomly, which they of course may in fact not.

If our rate denominator measurement space is defined on more than one dimension–as it is in our example (area, two dimensions)–we can subdivide it into equally sized sectors and measure distances to the nth closest object within each sector. Sectors here would equate to angles emanating from the sample point, and this “angle-order” method increases the sample size at each sample point. Supposing four sectors, for a given value of n, the four measurements at a point give (four choose two) = six possible ratios between the objects, these ratios being defined as the further distance over the closer. The distributions of these six values, over a collection of sample points, are then collectively discriminatory for n.

Usually both the rate and its homogeneity are unknown, and we need the latter to get the former. If a non-random pattern exists, the non-randomness can be quantified (in several ways) if we know n, and density estimates made using suitable corrections. If we don’t know n but do know that the objects are in fact randomly arranged, we can still infer density, although not with nearly as much precision as when we do know n. Interestingly, the bias arising from a clustered non-random pattern is of the same direction as that from an under-estimate of n, both leading to density under-estimates. Similarly, a tendency toward spatial pattern regularity gives a bias in the same direction as an overestimate of n.

These last facts can be useful, because our main interest is not n or the spatial pattern, but rather the rate (density), and the effects of uncertainties in them are not additive, and can even be compensating, depending on the situation. This serves to greatly constrain the total uncertainty in the rate estimate, even when these critical pieces of input information are lacking. I’m going to try to expound on that in the next post in this utterly enthralling series.

Support for this post has been provided by the Society for Public Education of Not Very Well Recognized or Cared About Issues. Nothing in the post should be construed, or mis-construed, as in any way necessarily reflecting the views, opinions, sentiments, thoughts, conceptual leanings, quasi-conscious daydreaming or water cooler banter of said Society, or really of anyone in particular other than me, and even that is open to debate. You may now return to your regularly scheduled life.