Adjusting the various contentions of the elements

General views of the Fashioned, be it matter aggregated into the farthest stars of heaven, be it the phenomena of earthly things at hand, are not merely more attractive and elevating than the special studies which embrace particular portions of natural science; they further recommend themselves peculiarly to those who have little leisure to bestow on occupations of the latter kind. The descriptive natural sciences are mostly adapted to particular circumstances: they are not equally attractive at every season of the year, in every country, or in every district we inhabit. The immediate inspection of natural objects, which they require, we must often forego, either for long years, or always in these northern latitudes; and if our attention be limited to a determinate class of objects, the most graphic accounts of the travelling naturalist afford us little pleasure if the particular matters, which have been the special subjects of our studies, chance to be passed over without notice.

As universal history, when it succeeds in exposing the true causal connection of events, solves many enigmas in the fate of nations, and explains the varying phases of their intellectual progress—why it was now impeded, now accelerated—so must a physical history of creation, happily conceived, and executed with a due knowledge of the state of discovery, remove a portion of the contradictions which the warring forces of nature present, at first sight, in their aggregate operations. General views raise our conceptions of the dignity and grandeur of nature; and have a peculiarly enlightening and composing influence on the spirit; for they strive simultaneously to adjust the contentions of the elements by the discovery of universal laws, laws that reign in the most delicate textures which meet us on earth, no less than in the Archipelagos of thickly clustered nebulae which we see in heaven, and even in the awful depths of space—those wastes without a world.

General views accustom us to regard each organic form as a portion of a whole; to see in the plant and in the animal less the individual or dissevered kind, than the natural form, inseparably linked with the aggregate of organic forms. General views give an irresistible charm to the assurance we have from the late voyages of discovery undertaken towards either pole, and sent from the stations now fixed under almost every parallel of latitude, of the almost simultaneous occurrence of magnetic disturbances or storms, and which furnish us with a ready means of divining the connection in which the results of later observation stand to phenomena recorded as having occurred in bygone times; general views enlarge our spiritual existence, and bring us, even if we live in solitude and seclusion, into communion with the whole circle of life and activity — with the earth, with the universe.

Alexander von Humboldt, 1845
Kosmos: A General Survey of the Physical Phenomena of the Universe, Vol. I, pp. 23-24

Warner, 1833

Just one reason why I will never tire of reading history and exploration, extracted from:
Anonymous (1891). A Memorial and Biographical History of Northern California. Lewis Publishing Co., Chicago IL.

Colonel J. J. Warner, now of Los Angeles, a member of the Ewing trapping expedition, which passed north through these valleys in 1832, and back again in 1833, says:

“In the fall of 1832, there were a number of Indian villages on King’s River, between its mouth and the mountains; also on the San Joaquin River, from the base of the mountains down to and some distance below the great slough. On the Merced River, from the mountains to its junction with the San Joaquin, there were no Indian villages; but from about this point on the San Joaquin, as well as on its principal tributaries, the Indian villages were numerous, many of them containing some fifty to one hundred dwellings, built with poles and thatched with rushes. With some few exceptions, the Indians were peaceably disposed. On the Tuolumne, Stanislaus and Calaveras rivers there were no Indian villages above the mouths, as also at or near their junction with the San Joaquin. The most hostile were on the Calaveras River. The banks of the Sacramento River, in its whole course through the valley, was studded with Indian villages, the houses of which, in the spring, during the day-time, were red with the salmon the aborigines were curing.

At this time there were not, on the San Joaquin or Sacramento river, or any of their tributaries, nor within the valleys of the two rivers, any inhabitants but Indians. On no part of the continent over which I had then, or have since, traveled, was so numerous an Indian population, subsisting on the natural products of the soil and waters, as in the valleys of the San Joaquin and Sacramento. There was no cultivation of the soil by them; game, fish, nuts of the forest and seeds of the field constituted their entire food. They were experts in catching fish in many ways, and in snaring game in diverse modes.

On our return, late in the summer of 1833, we found the valleys depopulated. From the head of the Sacramento to the great bend and slough of the San Joaquin we did not see more than six or eight live Indians, while large numbers of their bodies and skulls were to be seen under almost every shade-tree near water, where the uninhabited and deserted villages had been converted into grave-yards; and on the San Joaquin River, in the immediate neighborhood of the larger class of villages, which the preceding year were the abodes of large numbers of these Indians, we found not only many graves, but the vestiges of a funeral pyre. At the mouth of King’s River we encountered the first and only village of the stricken race that we had seen after entering the great valley; this village contained a large number of Indians temporarily stopping at that place.

We were encamped near the village one night only, and during that time the death angel, passing over the camping-ground of the plague stricken fugitives, waved his wand, summoning from a little remnant of a once numerous people a score of victims to muster in the land of the Manitou; and the cries of the dying, mingling with the wails of the bereaved, made the night hideous in that veritable valley of death.

Cover of A Memorial and Biographical History of Northern California

On clustering, part five

This post constitutes a wrap-up and summary of this series of articles on clustering data.

The main point thereof is that one needs an objective method for obtaining evidence of meaningful groupings of values (clusters) in a given data set. This issue is most relevant to non-experimental science, in which one is trying to obtain evidence of whether the observed data are explainable by random processes alone, versus processes that lead to whatever structure may have been observed in the data.

But I’m still not happy with my description in part four of this series, regarding observed and expected data distributions, and what these imply for data clustering outputs. In going over Ben Bolker’s outstanding book for the zillionth time (I have literally worn the cover off of this book, parts of it freely available here), I find that he explains, better than I did, what I was trying to say, in his description of the negative binomial distribution relative to the concept of statistical over-dispersion, where he writes (p. 124):

…rather than counting the number of successes obtained in a fixed number of trials, as in a binomial distribution, the negative binomial counts the number of failures before a pre-determined number of successes occurs.

This failure-process parameterization is only occasionally useful in ecological modeling. Ecologists use the negative binomial because it is discrete, like the Poisson, but its variance can be larger than its mean (i.e. it can be over-dispersed). Thus, it’s a good phenomenological description of a patchy or clustered distribution with no intrinsic upper limit, that has more variance than the Poisson…The over-dispersion parameter measures the amount of clustering, or aggregation, or heterogeneity in the data…

Specifically, you can get a negative binomial distribution as the result of a Poisson sampling process where the rate lambda itself varies. If lambda is Gamma-distributed (p.131) with shape parameter k and mean u, and x is Poisson-distributed with mean lambda, then the distribution of x will be a negative binomial distribution with mean u and over-dispersion parameter k (May, 1978; Hilborn and Mangel, 1997). In this case, the negative binomial reflects unmeasured (“random”) variability in the population.

The relevance of this quote is that a distribution that is over-dispersed, that is, one that has longer right or left (or both) tails than expected from a Poisson distribution having a given mean, is evidence for a non-constant process structuring the data. The negative binomial distribution describes this non-constancy, in the form of an “over-dispersion parameter” (k). In that case, the process that is varying does so smoothly (as described by a gamma distribution), and the resulting distribution of observations will therefore also be smooth. In a simpler situation, say one with just two driving process states, a bimodal distribution of observations will result.
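
To make that concrete, here is a minimal simulation sketch in R (purely illustrative, not from Bolker or the earlier posts; the parameter values, a shape of 2, a mean of 10, and the two rates 4.5 and 15, are arbitrary choices). It contrasts a Poisson process whose rate varies smoothly, via a gamma-distributed lambda, with one whose rate takes just two discrete states; both are over-dispersed, but only the second is bimodal:

# Illustration only; parameter choices are arbitrary.
set.seed(1)
n = 1e4
k = 2; u = 10
lambda = rgamma(n, shape = k, rate = k/u)                 # smoothly varying rate, mean u, shape k
x.nb = rpois(n, lambda)                                   # negative binomial: mean u, over-dispersion k
x.mix = rpois(n, sample(c(4.5, 15), n, replace = TRUE))   # two discrete rate states
c(mean(x.nb), var(x.nb))                                  # variance ~ u + u^2/k = 60, far exceeding the mean
c(mean(x.mix), var(x.mix))                                # also over-dispersed
par(mfrow = c(1, 2))
hist(x.nb, breaks = 40, main = "gamma-varying rate (smooth)")
hist(x.mix, breaks = 40, main = "two rate states (bimodal)")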

Slapping a clustering algorithm on the latter will return two clusters whose distinction is truly meaningful–the two sets of values likely arose from two different generating parameters. A clustering applied to a negative binomial distribution will be arbitrary with respect to just which values get placed in which cluster, and even with respect to the final number of clusters delineated, but not with respect to the conclusion that the observations do not result from a single homogeneous process, which is a potentially important piece of information. Inspection of the data, followed by some maximum likelihood fits of negative binomial distributions, would then inform one that the driving process parameters varied smoothly, rather than discretely/bimodally.
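
For that curve-fitting step, one readily available option (a suggestion, not something named in the post) is fitdistr in the MASS package, which fits a negative binomial to count data by maximum likelihood and reports the estimated size (Bolker’s over-dispersion parameter) and mean:

# Hypothetical example; requires the MASS package.
library(MASS)
set.seed(1)
x = rpois(1e4, rgamma(1e4, shape = 2, rate = 2/10))   # over-dispersed counts, mean ~10
fit.nb = fitdistr(x, densfun = "negative binomial")
fit.pois = fitdistr(x, densfun = "Poisson")
fit.nb$estimate                        # 'size' is the over-dispersion parameter, 'mu' the mean
c(AIC(fit.pois), AIC(fit.nb))          # the negative binomial fits these data far better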

On clustering, part four

I have not written any post, or series thereof, that raised more questions or made me think more precisely about the purposes and products of statistical approaches/algorithms than this one. I’m finding this topic of statistical clustering of multivariate data to be far subtler and more interesting than I thought when I waded into it. Justifiably counter-intuitive, even; I encourage you to read through this one, as I think the root concepts are pretty important.

It’s easy to misunderstand exactly what the output of a statistical/mathematical procedure represents, and thus vital to grasp just what is going on, and why. This might seem like a fully obvious statement, but my experience has often been that we who work in mostly non-experimental fields (e.g. ecology, geology, climatology, etc) frequently do not in fact have a clear and very exact grasp of what our various quantitative methods are doing. I’m not just referring to highly complex methods either, but even to simple methods we often take for granted. This applies to me personally, and to all kinds of papers by others that I’ve read, in various fields.

The first point to be made here is to retract and correct the example I gave in part one of this series. I had said there that if one clusters the ten numbers 0-9, into two groups, using either a k-means or hierarchical clustering procedure, the first group will contain {0 to 4} and the second will contain {5 to 9}. That’s not the problem–they will indeed do so (at least in R). The problem is the statement that this output is meaningless–that it does not imply anything about structure in the data. On thinking this through a little more, I conclude that whether it does or doesn’t depends on the specifics of the data.

It is very true that an inherent potential problem with clustering methods is the imposition of an artificial structure on the data when none actually exists. With the “k-means” algorithm this is a direct result of minimizing sums of the squared differences from the mean within groups, whereas with hierarchical methods (AGNES, etc), it results from iteratively joining the closest items (item to item or item to group), as determined by a constantly updated distance matrix after each join. Either way, the algorithms cause similarly valued items to be placed together in groups.

The problem is with the example I used: specifically, with the assumption that a set of values following a smooth gradient, such as the integers 0:9 in the example, rather than displaying definite groupings/clusters, necessarily indicates a lack of structure, and thus a meaningless clustering result. That is not necessarily the case, which can only be grasped in the context of an a priori expected distribution of values, given an observed overall mean value. These expectations are given by the Poisson and gamma distributions, for integer- and real-valued data respectively.

The most immediate question arising from that last statement should be “Why exactly should the Poisson or gamma distributions define our expectation, over any other distribution?”. The answer is pretty important, as it gets at just what we’re trying to do when we cluster values. I would strongly argue that that purpose has to be the identification of structure in the data, that is, a departure from randomness, which in turn means we need an estimate of the random condition against which to gauge possible departures. Without having this “randomness reference point”, the fact that {0:4} and {5:9} will certainly fall into two different groups when clustered is nothing more than the trivial re-statement that {5:9} are larger values than {0:4}: fully meaningless. But R (and presumably most other statistical analysis programs as well) does not tell you that–it gives you no information on just how meaningful a given clustering output is. Not good: as scientists we’re after meaningful output, not just output.

The answer to the above question is that the Poisson and gamma distributions provide those randomness reference points, and the reason why is important: they are essentially descriptors of expected distributions due to a type of sampling error alone. Both are statements that if one has some randomly distributed item–say tree locations in two dimensions in a forest, or whatever example you like–then a set of pre-defined measurement units, placed at random throughout the sampling space (geographic area in my example), will vary in the number of items contained (Poisson), and in the area encompassed by measurements from random points to the nth closest objects thereto (gamma). Not sampling error in the traditional sense, but the concept is very similar, so that’s what I’ll call it.
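
As a concrete illustration of that idea (a sketch with arbitrary numbers, not part of the original post), one can scatter points completely at random over a unit square and then take both kinds of measurement just described: counts within fixed, equal-area quadrats, which should be Poisson distributed, and the areas implied by the distance from random sample points to the nearest object, which for the first-nearest object should follow a gamma of shape 1 (i.e. an exponential):

# Illustration only; the numbers of points and quadrats are arbitrary.
set.seed(2)
npts = 2000                                   # "trees" located completely at random in a unit square
tx = runif(npts); ty = runif(npts)
# Poisson view: counts within a 20 x 20 grid of equal-area quadrats
cnt = as.vector(table(cut(tx, seq(0, 1, by = 0.05)), cut(ty, seq(0, 1, by = 0.05))))
c(mean(cnt), var(cnt))                        # mean and variance nearly equal, as the Poisson predicts
# Gamma view: area pi*d^2 from random sample points to the single nearest tree
ns = 500
sx = runif(ns, 0.05, 0.95); sy = runif(ns, 0.05, 0.95)   # kept off the edges to limit boundary bias
area = pi * sapply(1:ns, function(i) min((tx - sx[i])^2 + (ty - sy[i])^2))
ks.test(area, "pexp", rate = npts)            # shape-1 gamma (exponential) with rate = point density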

Therefore, departures from such expectations, in an observed data set, are necessarily evidence for structure in those data, in the sense that something other than simple sampling variation–some causative agent–is operating to cause the departure. The main goal of science is to seek and quantify these agents…but how are you supposed to seek those somethings until you have evidence that there is some something there to begin with? And similarly, of the many variables that might be important in a given system, how does one decide which of those are most strongly structured? These questions are relevant because this discussion is in the context of large, multivariate, non-experimental systems, with potentially many variables of unknown interaction. We want evidence regarding whether the observations are explainable by a single homogeneous process, or not.

Things get interesting, given this perspective, and I can demonstrate why using, say, the set of n = 21 integers {0:20}, a perfect gradient with no evidence of aggregation.

The Poisson distribution tells us that we do not expect a distribution of {0:20} for 21 integers having a mean value of 10. We observe and expect (for a cdf at 5% quantile steps):

   quantile observed expected
     0.05        1        5
     0.10        2        6
     0.15        3        7
     0.20        4        7
     0.25        5        8
     0.30        6        8
     0.35        7        9
     0.40        8        9
     0.45        9        9
     0.50       10       10
     0.55       11       10
     0.60       12       11
     0.65       13       11
     0.70       14       12
     0.75       15       12
     0.80       16       13
     0.85       17       13
     0.90       18       14
     0.95       19       15

…as obtained by (in R):

observed = 0:20; n = length(observed)                                            # n = 21 values, mean = 10
expected = qpois(p = seq(0, 1, length.out=n), lambda=mean(observed))              # Poisson quantiles at the same steps
df = data.frame(quantile = seq(0, 1, length.out=n), observed, expected)[2:20,]    # drop the 0 and 1 quantiles

Running a chi-square test on the observed values:

chisq.test(x=df$observed)   # with a single vector, the default null is equal expected counts (here the overall mean, 10) in each of the 19 classes

…we get:

data: df$observed
X-squared = 57, df = 18, p-value = .000006

There is thus only a very tiny chance of getting the observed result from sampling variation alone. Observation of the above table shows that the observed data are stretched–having longer left and right tails, relative to expectation.

But this series is about clustering and how to find evidence for it in the data…and there is no tendency toward clustering evident in these data. Indeed, the opposite is true–the observed distribution is long-tailed, “stretched out” compared to expectation. But…this result must mean that there is indeed some structuring force on the data to cause this result–some departure from the homogeneous state that the Poisson assumes. We can’t know, from just this analysis alone, just how many “real” clusters of data there are, but we do know that it must be more than one (and I hope it is apparent as to why)! More precisely, if we’ve decided to look for clusters in the data, then the chi-square test gives us evidence that more than one cluster is highly likely.

Just how many clusters is a more difficult question, but we could evaluate the first step in that direction (i.e. two clusters) by dividing the values into two roughly even sized groups defined by the midpoint (= 10) and evaluating whether Poisson-distributed values for the two groups, having means of 4.5 and 15, give a better fit to the observations:

(exp1 = qpois(p = seq(0, 1, length.out=11), lambda=4.5)[2:10])    # expected quantiles for the lower group (mean 4.5)
(exp2 = qpois(p = seq(0, 1, length.out=12), lambda=15)[2:11])     # expected quantiles for the upper group (mean 15)
(df2 = data.frame(quantile = seq(0, 1, length.out=n)[2:20], obs=df$observed, exp = sort(c(exp1,exp2))))   # combined comparison

…which gives:

   quant. obs. exp.
     0.05   1   2
     0.10   2   3
     0.15   3   3
     0.20   4   4
     0.25   5   4
     0.30   6   5
     0.35   7   5
     0.40   8   6
     0.45   9   7
     0.50  10  10
     0.55  11  11
     0.60  12  13
     0.65  13  14
     0.70  14  14
     0.75  15  15
     0.80  16  16
     0.85  17  17
     0.90  18  18
     0.95  19  20

…and then compute the mean of the two chi-square probabilities:

obs1=0:9; obs2=10:20                                               # the two candidate groups
p = mean(c(chisq.test(obs1)$p.value, chisq.test(obs2)$p.value))    # mean of the two goodness-of-fit p-values

…which returns p = 0.362
[Edit note: I had wrong values here originally–because I mistakenly ran the chi square tests on the expected values instead of the observed. Now corrected–still a very high odds ratio, just not as high as before.]

The odds ratio of the two hypotheses (two Poisson-distributed groups having means of 4.5 and 15.0, versus just one group with a mean of 10.0) is thus 0.362 / .000006 = 60,429. Thus, clustering the observed data into these two groups would be a very highly defensible decision, even though the observed data comprise a perfect gradient having no tendency toward aggregation whatsoever!

The extension from the univariate to the multivariate case is straightforward, involving nothing more than performing the same analysis on each variable and then averaging the resulting set of probabilities or odds ratios.
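
A minimal sketch of that extension (not code from the post; it assumes integer-valued variables, mirrors the chi-square computations above, and takes the candidate grouping as given) might look like the following:

# Hypothetical helpers. 'x' is one integer-valued variable, 'groups' a vector of
# candidate group labels for the same sample units, 'X' a matrix with one column per variable.
odds.ratio.1var = function(x, groups) {
  p.one = chisq.test(x)$p.value                                        # single homogeneous group
  p.grp = mean(tapply(x, groups, function(g) chisq.test(g)$p.value))   # mean p-value over the proposed groups
  p.grp / p.one
}
odds.ratio.multi = function(X, groups) mean(apply(X, 2, odds.ratio.1var, groups = groups))

odds.ratio.1var(0:20, rep(1:2, c(10, 11)))   # the 0:9 vs 10:20 split; a large ratio, though not identical to
                                             # the 60,429 above, which tested only the 19 interior quantiles
                                             # for the one-group case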

On throwing a change up when a fastball’s your best pitch

Sports are interesting, and one of the interesting aspects about them, among many, is that the very unlikely can sometimes happen.

The Louisville Cardinals baseball team went 50-12 this year through the regular season and first round (“regional”) of the NCAA baseball playoff. Moreover, they were an astounding 36-1 at home, the only loss coming by three runs at the hands of last year’s national champion, Virginia. Over the last several years they have been one of the best teams in the country, making it to the College World Series twice, though not yet winning it. They were considered by the tournament selection committee to be the #2 team in the country, behind Florida, but many of the better computer polls had Louisville as #1.

The college baseball playoff is one of the most interesting tournaments out there, from a structural perspective. Because it’s baseball, it’s not a one-loss tournament, at any of the four levels thereof, at least since 2003. Those four levels are: (1) the sixteen regionals of four teams each, (2) the eight “super regionals” determined by the regional champs, and (3) two rounds at the College World Series in Omaha, comprised of the eight super regional champs. A team can in fact lose as many as four games total over the course of the playoff, and yet still win the national championship. It’s not easy to do though, because a loss in the first game, at either the regional level, or in round one of the CWS, requires a team to win four games to advance, instead of three. In the 13 years of this format, only Fresno State has pulled that feat off, in 2008.

In winning their regional and being one of the top eight seeds, Louisville hosted the winner of the Nashville regional, which was won in an upset over favorite Vanderbilt, by UC Santa Barbara of the Big West Conference. That conference is not as good top to bottom as is the Atlantic Coast Conference (ACC) that Louisville plays in, but neither is it any slouch, containing perennial power CSU Fullerton, and also Long Beach State, who gave third ranked Miami fits in its regional. More generally, the caliber of the baseball played on the west coast, including the PAC-12 and the Big West, is very high, though often slighted by writers and pollsters in favor of teams from the southeast (ACC and Southeast (SEC) conferences in particular). Based on the results of the regional and super regional playoff rounds, the slighting this year was serious: only two of the eight teams in the CWS are from the ACC/SEC, even though teams from the two conferences had home field advantage in fully 83 percent (20/24) of all the first and second round series. Five schools west of the Mississippi River are in, including the top three from the Big 12 conference.

In the super regional, the first team to win twice goes on to the CWS in Omaha. To make a long and interesting story short, UCSB won the first game 4-2 and thus needed just one more win to knock out Louisville and advance to the CWS for the first time in their history. Down 3-0, in the bottom of the ninth inning, they were facing one of the best closers in all of college baseball, just taken as the 27th overall pick in the MLB amateur draft by the Chicago White Sox. Coming in with 100+ mph fastballs, he got the first batter out without problem. However, the second batter singled, and then he began to lose his control and he did exactly what you shouldn’t do: walked the next two batters to load the bases. The UCSB coach decided to go to his bench to bring in a left-handed hitting pinch-hitter, a freshman with only 26 at-bats on the season, albeit with one home run among his nine hits on the year.

And the rest, as they say, is history:

(All the games from this weekend are available for replay here)

On clustering, part three

It’s not always easy to hit all the important points when explaining an unfamiliar topic, so I need to step back and mention a few important but omitted points.

The first of these is that, given a known mean and the assumption that values are randomly distributed, we can estimate the expected distribution of individual values; and since the mean must itself be obtained from a set of individual values, we can compare the expected and observed distributions and thus evaluate randomness. The statistical distributions designed for this task are the Poisson and the gamma, for integer- and real-valued data respectively. Much of common statistical analysis is built around the normal distribution, and people are thus generally most familiar with it and prone to use it, but the normal won’t do the job here. This is primarily because it’s not designed to handle skewed distributions, which are a problem whenever data values are small or otherwise limited at one end of the distribution (most often by the value of zero).

Conversely, the Poisson and gamma have no problem with such situations: they are built for the task. This fact is interesting given that the Poisson is defined by just one parameter (the overall mean), and the gamma as used here, with its shape fixed by the sampling rank, effectively by one as well, instead of the two required by the normal (mean and standard deviation). So, they are simpler, and yet more accurate over more situations than the normal–not an everyday occurrence in modeling. Instead, for whatever reason, there has historically been a lot of effort devoted to transforming skewed distributions into roughly normal ones, usually by taking logarithms or roots, as in e.g. the log-normal distribution. But this is ad hoc methodology that brings with it other problems, including back-transformation.
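
A quick numerical illustration of the skewness problem (an arbitrary example with a mean of 2, not from the post): for small counts the Poisson is clearly asymmetric and bounded at zero, whereas a normal with the same mean and variance is symmetric and puts appreciable probability on impossible negative values:

# Arbitrary example: count data with a mean of 2
m = 2
x = 0:10
round(dpois(x, lambda = m), 3)               # skewed, zero-bounded
round(dnorm(x, mean = m, sd = sqrt(m)), 3)   # the symmetric normal approximation
pnorm(0, mean = m, sd = sqrt(m))             # ~8% of the normal's probability lies below zero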

The second point is hopefully more obvious. This is that although it is easy to just look at a small set of univariate data and see evidence of structure (clustered or overly regular values), large sample sizes and/or multivariate data quickly overwhelm the brain’s ability to do this well, and at any rate we want to assign a probability to this non-randomness.

The third point is maybe the most important one, and relates to why the Poisson and gamma (and others, e.g. the binomial, negative binomial etc.) are very important in analyzing non-experimental data in particular. Indeed, this point relates to the issue of forward versus inverse modeling, and to issues in legitimacy of data mining approaches. I don’t know that it can be emphasized enough how radically different the experimental and non-experimental sciences are, in terms of method and approach and consequent confidence of inference. This is no small issue, constantly overlooked IMO.

If I’ve got an observed data set, originating from some imperfectly known set of processes operating over time and space, I’ve got immediate trouble on my hands in terms of causal inference. Needless to say there are many such data sets in the world. When the system is known to be complex, such that elucidating the mechanistic processes at the temporal and spatial scales of interest is likely to be difficult, it makes perfect sense to examine whether certain types of structures might exist just in the observed data themselves, structures that can provide clues as to just what is going on. The standard knock on data mining and inverse modeling approaches more generally is that of the possibility of false positive results–concluding that apparent structures in the data are explainable by some driving mechanism when in fact they are due to random processes. This is of course a real possibility, but I find this objection to be more or less completely overblown, primarily because those who conduct this type of analysis are usually quite well aware of this possibility thank you.

Overlooked in those criticisms is the fact that by first identifying real structure in the data–patterns explainable by random processes at only a very low probability–one can immediately gain important clues as to just what possible causal factors to examine more closely instead of going on a random fishing expedition. A lot of examples could be given here, but I’m thinking ecologically, and in ecology there are many variables that vary in a highly discontinuous way, which affects how we have to consider things. This concept applies not only to biotic processes, which are inherently structured by the various aggregational processes inherent in populations and communities of organisms, but to various biophysical thresholds and inflection points as well, whose operation over large scales of space or time is often anything but well understood or documented. As just one rough but informative example, in plant ecology a large fraction of what is going on occurs underground, where all kinds of important discontinuities can occur–chemical, hydrologic, climatic, and of course biological.

So, the search for non-random patterns within observed data sets–before ever even considering the possible drivers of those patterns–is, depending on the level of a priori knowledge of the system in question, a potentially very important activity. In fact, I would argue that this is the most natural and efficient way to proceed in running down cause and effect in complex systems. And it is also one requiring a scientist to have a definite awareness of the various possible drivers of observed patterns and their scales of variation.

So, there’s a reason plant ecologists should know some physiology, some reproductive biology, some taxonomy, some soil science, some climatology, some…

On clustering, part two

In part one of what has quickly developed into an enthralling series, I made the points that (1) at least some important software doesn’t provide a probability value for cluster outputs, and (2) that while it’s possible to cluster any data set, univariate or multivariate, into clearly distinct groups, so doing doesn’t necessarily mean anything important. Such outputs only tell us something useful if there is some actual structure in the data, and the clustering algorithm can detect it.

But just what is “structure” in the data? The univariate case is simplest because with multivariate data, structure can have two different aspects. But in either situation we can take the standard statistical stance that structure is the detectable departure from random expectation, at some defined probability criterion (p value). The Poisson and gamma distributions define this expectation, the former for count (integer valued) data, and the latter for continuous data. By “expectation” I mean the expected distribution of values across the full data. If we have a calculated overall mean value, i.e. an overall rate, the Poisson and gamma then define this distribution, assuming each value is measured over an identical sampling interval. With the Poisson, the latter takes the form of a fixed denominator, whereas with the gamma it takes the form of a fixed numerator.

Using the familiar example of density (number of objects per unit space or time), the Poisson fixes the unit space while the integer number of objects in each unit varies, whereas the gamma fixes the integer rank of the objects that will be measured to from random starting points, with the distance to each such object (and corresponding area thereof) varying. The two approaches are just flip sides of the same coin really, but with very important practical considerations related to both data collection and mathematical bias. Without getting heavily into the data collection issue here, the Poisson approach–counting the number of objects in areas of pre-defined size–can get you into real trouble in terms of time efficiency (regardless of whether density tends to be low or high). This consideration is very important in opting for distance-based sampling and the use of the gamma distribution over area-based sampling and use of the Poisson.

But returning to the original problem as discussed in part one, the two standard clustering approaches–k-means and hierarchical–are always going to return groupings that are of low probability of random occurrence, no matter what “natural structure” in the data there may actually be. The solution, it seems to me, is to instead evaluate relative probabilities: the probability of the values within each group being Poisson or gamma distributed, relative to the probability of the overall distribution of values. In each case these probabilities are determined by a goodness-of-fit test, namely a Chi-square test for count (integer) data and a Kolmogorov-Smirnov test for continuous data. If there is in fact some natural structure in the data–that is, groups of values more similar (or dissimilar) to each other than the Poisson or gamma would predict–then this relative probability (or odds ratio, if you like) will be maximized at the clustering solution that most closely reflects the actual structure in the data, that solution being defined by (1) the number of groups, and (2) the membership of each. It is a maximum likelihood approach to the problem.
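
A rough sketch of that idea for a single count variable (meant only to convey the logic; it assumes k-means supplies the candidate groupings and the chi-square test supplies the goodness-of-fit probabilities, as described above, and it ignores practical issues such as very small groups):

# Relative-probability (odds ratio) evaluation over candidate numbers of groups,
# for one integer-valued variable x. Hypothetical sketch only.
cluster.odds = function(x, kmax = 5) {
  p.whole = chisq.test(x)$p.value                        # one homogeneous group
  sapply(2:kmax, function(k) {
    groups = kmeans(x, centers = k, nstart = 25)$cluster
    p.grp = mean(tapply(x, groups, function(g) chisq.test(g)$p.value))
    p.grp / p.whole                                      # odds ratio: k groups versus one
  })
}
cluster.odds(0:20)    # ratios for 2 through 5 groups; a clear peak would suggest real structure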

If there is little or no actual structure in the data, then these odds ratios computed across different numbers of final groups will show no clearly defensible maximal value, but rather a broad, flat plateau in which all the ratios are similar, varying from each other only randomly. But when there is real structure therein, there will be a ratio that is quantifiably higher than all others, a unimodal response with a peak value. The statistical significance of this maximum can be evaluated with the Likelihood Ratio test or something similar, though I haven’t thought very hard about that issue yet.

Moving from the univariate case, to the multivariate, ain’t not no big deal really, in terms of the above discussion–it just requires averaging those odds ratios over all variables. But multivariate data does introduce a second, subtle aspect into what we mean by the term “data structure”, in the following respect. It is possible for no variable in the data to show clear evidence of structure, per the above approach, when in fact there very much is structure, but of a different kind. That outcome would occur whenever particular pairs (or larger groups) of variables are correlated with each other (above random expectation), even though the values of each such variable are in fact Poisson/gamma distributed overall. That is, there is a statistically defensible relationship between variables across sample units, but no detectable non-random structure in the values of any single variable across those units.
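
That situation is easy to construct (a hypothetical illustration using a shared-component trick with arbitrary rates, not from the post): each variable below is marginally Poisson, so neither would show any within-variable structure by the tests above, yet the two are strongly correlated:

# Two count variables with Poisson marginals but a built-in correlation.
set.seed(4)
n = 1000
shared = rpois(n, 5)            # common component
x1 = shared + rpois(n, 5)       # marginally Poisson with mean 10
x2 = shared + rpois(n, 5)       # marginally Poisson with mean 10
c(mean(x1), var(x1))            # mean and variance approximately equal, as the Poisson expects
cor(x1, x2)                     # ~0.5, far above random expectation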

Such an outcome would provide definite evidence of behavioral similarity among variables even in the absence of a structuring of those variables by some latent (unmeasured) variable. I think it would be interesting to know how often such a situation occurs in different types of ecological and other systems, and I’m pretty sure nobody’s done any such analysis. Bear in mind however that I also once thought, at about 4:30 AM on a sleep deprived week if I remember right, that it would be interesting to see if I could beat the Tahoe casinos at blackjack based on quick probability considerations.

I hope the above has made at least some sense and you have not damaged your computer by say, throwing your coffee mug through the screen, or yelled something untoward, at volume, within earshot of those who might take offense. The Institute hereby disavows any responsibility, liability or other legal or financial connection to such events, past or future.

There will be more!

On clustering, part one

In ecology and other sciences, grouping similar objects together, whether for further analytical purposes or just as an end in itself, is a fundamental task, one accomplished by cluster analysis, one of the most basic tools in statistics. In all but the smallest sample sizes, the number of possible groupings very rapidly gets enormous, and it is therefore necessary both (1) to have some way of efficiently avoiding the vast number of clearly non-optimal clusterings, and (2) to choose the best solution from among those that seem at least reasonable.

First some background. There are (at least) three basic approaches to clustering. Two of these are inherently hierarchical in nature: they either aggregate individual objects into ever-larger groups (agglomerative methods), or successively divide the entire set into ever smaller ones (divisive methods). Hierarchical methods are based on a distance matrix that defines the distance (in measurement space) between every possible pair of objects, as determined by the variables of interest (typically multivariate) and the choice of distance measure, of which there are several depending on one’s definitions of “distance”. This distance matrix grows as n(n-1)/2, i.e. roughly as the square of n, and so for large datasets these methods quickly become untenable, unless one has an enormous amount of computer memory available, which the average scientist typically does not.
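
To put rough numbers on that (back-of-the-envelope figures, assuming 8-byte double-precision distances and ignoring R's object overhead):

# Approximate memory needed just to store the lower triangle of a distance matrix
n = c(1e3, 1e4, 1e5)
round(choose(n, 2) * 8 / 2^20)   # megabytes: about 4, 381, and 38,000 respectively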

The k-means clustering algorithm works differently–it doesn’t use a distance matrix. Instead it chooses a number of random cluster starting points (“centers”), assigns each object to its nearest center, recomputes each center as the mean of the objects assigned to it, and iterates until the assignments stabilize. This greatly reduces the memory requirement for large data sets, but a drawback is that the output depends on the initial choice of centers; one should thus try many different starting combinations, and even then, the best solution is not guaranteed. Furthermore, one sets the number of final clusters desired beforehand, but there is no guarantee that the optimal overall solution will in fact correspond to that choice, and so one has to repeat the process for all possible cluster numbers that one deems reasonable, with “reasonable” often being less than obvious.
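
In R's kmeans, for example, the nstart argument handles those multiple random starts, keeping the solution with the lowest total within-group sum of squares (a small illustrative call on made-up data, not from the post):

# Illustration only: 25 random starts for a 3-cluster solution on arbitrary 2-D data
set.seed(3)
x = matrix(rnorm(200), ncol = 2)
fit = kmeans(x, centers = 3, nstart = 25)
fit$tot.withinss                 # the best (smallest) value found over the 25 starts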

When I first did a k-means cluster analysis, years ago, I did it in SPSS and I remember being surprised that the output did not include a probability value, that is, the likelihood of obtaining a given clustering by chance alone. There was thus no way to determine which among the many possible solutions was in fact the best one, which seemed to be a pretty major shortcoming, possibly inexcusable. Now I’m working in R, and I find…the same thing. In R, the two workhorse clustering functions, both in the main stats package, are kmeans and hclust, corresponding to k-means and hierarchical clustering, respectively. In neither method is the probability of the solution given as part of the output. So, it wasn’t just SPSS–if R doesn’t provide it, then it’s quite possible that no statistical software program (SAS, S-Plus, SigmaStat, etc.) does so, although I don’t know for sure.

There is one function in R that attempts to identify what it calls the “optimal clustering”, function optCluster in the package of the same name. But that function, while definitely useful, only appears to provide a set of different metrics by which to evaluate the effectiveness of any given clustering solution, as obtained from 16 possible clustering methods, but with no actual probabilities attached to any of them. What I’m after is different, more defensible and definitely more probabilistic. It requires some careful thought regarding just what clustering should be all about in the first place.

If we talk about grouping objects together, we gotta be careful. This piece at Variance Explained gives the basic story of why, using examples from a k-means clustering. A principal point is that one can create clusters from any data set, but the result doesn’t necessarily mean anything. And I’m not just referring to the issue of relating the variable being clustered to other variables of interest in the system under study. I’m talking about inherent structure in the data, even univariate data.

This point is easy to grasp with a simple example. If I have the set of 10 numbers from 0 to 9, a k-means clustering into two groups will place 0 to 4 in one group and 5 to 9 in the other, as will most hierarchical clustering trees trimmed to two groups. Even if some clustering methods were to sometimes place, say, 0 to 3 in one group and 4 to 9 in the other, or some similar outcome (which they conceivably might–I haven’t tested them), the main point remains: there are no “natural” groupings in those ten numbers–they are as evenly spaced as is possible to be, a perfect gradient. No matter how you group them, the number of groups and the membership of each will be an arbitrary and trivial result. If, on the other hand, you’ve got the set {0,1,2,7,8,9}, it’s quite clear that 0-2 and 7-9 define two natural groupings, since the members of each group are all within 1 unit of the means thereof, and with an obvious gap between the two.
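
This is easy to check directly in R (a small verification, not part of the original post, fixing the initial centers so the result is deterministic):

# Gradient data: a two-group split is forced, but it carries no information
kmeans(0:9, centers = c(2, 7))$cluster             # puts 0-4 in one group, 5-9 in the other
# Genuinely grouped data: the same call recovers the two obvious clusters
kmeans(c(0, 1, 2, 7, 8, 9), centers = c(2, 7))$cluster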

This point is critical, as it indicates that we should seek a clustering evaluation method that is based in an algorithm capable of making this discrimination between a perfect gradient and tightly clustered data. Actually it has to do better than that–it has to be able to distinguish between perfectly spaced data, randomly spaced data, and clustered data. Randomly spaced data will have a natural degree of clustering by definition, and we need to be able to distinguish that situation from truly clustered data, which might not be so easy in practice.

There are perhaps several ways to go about doing this, but the one that is most directly obvious and relevant is based on the Poisson distribution. The Poisson defines the expected values in a set of sub-samples, given a known value determined from the entire object collection, for the variable of interest. Thus, from the mean value over all objects (no clustering), we can determine the probability that the mean values for each of the n groups resulting from a given clustering algorithm (of any method), follow the expectation defined by the Poisson distribution determined by that overall mean (the Poisson being defined by just one parameter). The lower that probability is, the more likely that the clusters returned by the algorithm do in fact represent a real feature of the data set, a natural aggregation, and not just an arbitrary partitioning of random or gradient data. Now maybe somebody’s already done this, I don’t know, but I’ve not seen it in any of the statistical software I’ve used, including R’s two workhorse packages stats and cluster.

More hideous detail to come so take cover and shield your eyes.

A massive mess of old tree data

I’m going to start focusing more on science topics here, as time allows. I’ll start by focusing for a while on some forest ecology topics that I’ve been working on, and/or which are closely related to them.

I’m working on some forest dynamics questions involving historical, landscape scale forest conditions and associated fire patterns. I just got done assembling a tree demography database of about 130,000 trees collected in about 1700 plots, in the early 20th century, on the Eldorado and Stanislaus National Forests (ENF, SNF), the two National Forests that occupy the mid- to upper-elevations on the relatively gradual western slope of the central Sierra Nevada. The data were collected primarily between 1911 and 1923 as censuses of large plots (by today’s standards, each ~2 or 4 acres) as part of the first USFS timber inventories, when it was still trying to figure out just what it had on its hands, and how it would manage it over time. An enormous amount of work was involved in this effort, but only a small part of these data has apparently survived.

The data are “demographic” in that the diameter and taxon were recorded for most trees, making them useful for a number of analytical purposes in landscape, community and population ecology. They come from two datasets that I discovered between 1997 and 2001, one in the ENF headquarters building, and the other in the National Archives facility in San Bruno CA. For each, I photocopied the data at that time, and had some of it entered into a database, hoping that I would eventually get time to analyze them. For the ENF data, this was a fortunate decision, because the ENF, as I later learned, has managed in the mean time to lose the entire data set, most likely along with a bunch of other valuable stuff that was in the office housing it. I thus now have the only known backup. Anyway, that time finally came, but the data were in such a mess that I first had to spend about three months checking and cleaning them before they could be analyzed. The data will soon be submitted as a data paper to the journal Ecology, it being one of the very few journals that has adopted this new paper format. In a data paper, one simply presents and describes a data set deemed to be of value to the general scientific community. There is in fact a further mountain of data and other information beyond these, but whether they’ll ever see the light of publication is uncertain.

An example first page of one of many old field reports and data summaries involved

We, and others, are interested in these data for estimating landscape-scale forest conditions before they were heavily altered by humans, primarily via changed natural fire regimes, logging, and grazing. These changes began in earnest after about 1850, and have generally increased with time. This knowledge can help inform some important current questions involving forest restoration and general ecosystem stability, including fire and hydrologic regimes, timber production potential, biological diversity, and some spin-off topics like carbon dynamics. The data can also directly address some claims made recently, in certain papers, regarding pre-settlement fire regimes in California and elsewhere.

The data assembly was much slower and more aggravating than expected–I won’t go into it but I’ll never do it again–but the analysis is, and will be, very interesting for quite some time, as much can be done with it. Some of the summary or explanatory documentation associated with the data is entirely fascinating, as is some of the other old literature and data that I’ve been reading over as part of the project. In fact I’m easily distracted into reading more of it than is often strictly necessary, but so doing has reminded me that a qualitative, verbal description can be of much greater value than actual data, scientific situation depending. Possibly the most interesting and important aspect to this is the degree to which really important information has been either lost, completely forgotten about, or never discovered to begin with. This is not trivial–I’m talking about a really large amount of detailed data and extensive, detailed summary documentation. Early views and discussions regarding fire and forest management, and the course these should take in CA, are extensive and very revealing, as we now look back 100 years later on the effects of important decisions made then. There are also lessons in federal archiving and record keeping.

I’ll be posting various things as time allows, including discussions of methods and approaches in this type of research. I’m also applying for a grant to cover the cost of free pizza at the end, although to be honest I’ve not had great success on same in the past. You might be surprised at the application numbers and success rates on that kind of thing.

Bron Yr Aur

He says he had once wanted to be a biologist. Well, science’s loss was music’s gain.

C6 tuning: EADGBE -> CACGCE. What??!! Come on now man, give us half a chance here! There have been a few great acoustic guitarists in rock and roll, but none better IMO. Could listen to him endlessly, and indeed, have.

Bron Yr Aur, Jimmy Page.
Instructional
Physical Graffiti

Rank Stranger

I wandered again to my home in the mountains
Where in youth’s early dawn I was happy and free
I looked for my friends but I never could find them
I found they were all rank strangers to me

Everybody I met seemed to be a rank stranger
No mother nor dad, not a friend could I see
They knew not my name, and I knew not their faces
I found they were all rank strangers to me

Ralph Stanley, Rank Stranger

My take:

“Cincinnati, March 22nd, 1814.”

I am by no means whatsoever an expert on American government policies regarding Native Americans. So just where the following extract fits into the bigger picture thereof I don’t really know, but based on considerations such as date, location, and people involved, it seems to describe an important set of decisions, possibly precedent-setting. It is taken from a letter from General William Henry Harrison to the Secretary of War, during the War of 1812. Harrison had been Territorial Governor of Indiana before the war, and had served in Anthony Wayne’s army back in the 1794 campaign through western Ohio that led to the Treaty of Greenville in 1795, two very important events in establishing policies between the United States and Native Americans, generally.

Harrison may well have had a better understanding of the recent geographic history of Native American tribes–and certainly regarding their various warfare methods–in the large midwestern area centered on what is now Indiana, and its principal river (the Wabash), than any other person of the time. He was also the main actor in dealing with Tecumseh, arguably the greatest Native American strategist ever, in what must have been a fascinating real-life drama. The focus of the letter is on just which tribes had legitimate, long-standing land tenure claims, and thus, the right to negotiate and sell their lands, thereby countering the grand unification strategy of Tecumseh. The full letter is reproduced here: McAfee (1816). History of the Late War in the Western Country, pp 53-58; the [] and bolds being my edits.


Range o’ Light

Not sure I ever needed to be reading anything else really, although I have pulled some rather great historical stuff out of Google Books recently so hurray for the internet I guess. And I don’t know what hoops Stephen Whitney had to jump through to get that picture of lodgepole pine bark on his cover but man do I love it.
