I’ve discussed no baseball here yet, which is kind of surprising, given that I’ve been a big fan all my life. I played a lot growing up, through high school and even a little in college and afterwards. If I had the time, I would likely start a blog just devoted strictly to baseball (and not just analysis either), because I have a lot to say on a lot of topics. But alas…
To me, the real interest in any sport comes from actually playing the game, not watching it, and I watch very little baseball (now) because the games are just too time-consuming (though I still have a hard time refraining in October). When I do watch, I’m not obsessively analytical; that takes the fun out of it for me. It’s an athletic contest, not a statistics class; I want to see the center fielder go full speed and lay out for a catch, or a base thief challenge the pitcher or whatever, not sit there with numbers in my head. Analysis is for later, and I do like it, so I wade in at times, thereby joining the SABR-metric (or “sabermetric”) revolution of the last 3-4 decades (the Society for American Baseball Research, or SABR, initiated much of this). And baseball offers endless analytical opportunities, for (at least) two reasons.
First, there is a very large set of data of all kinds from present and past seasons to evaluate. The availability of these data is due to the invaluable efforts of a couple of volunteer projects. The first is Retrosheet, an ongoing, systematic effort to digitize old data from scorebooks and newspapers all the way back to the game’s origins in the 1870s. Baseball Reference (BR) has created a really nice online database with very fast search algorithms and a great interface, using these data. Moreover, a revolution in new, more relevant and informative data types is now occurring, involving the detailed tracking of the ball and of the movements of pitchers, batters and fielders, using cameras, radar and geometry. Since 2008, for example, the measured speed and computed trajectory of every MLB pitch has been recorded and is freely available from MLB (though as XML files, which is not a very user-friendly form). These, and batter and fielder movement data (not publicly available), are being used heavily by analysts employed by most MLB teams.
Second, there are several different strategies that can be employed in trying to win baseball games, which is a recipe for simulation analyses of many types. It is this analytical angle that interests me the most. Interestingly, however, although sabermetric analysts have been positively all over the analysis of existing data, generating countless papers, books and web articles, they have made little use of simulation as a whole. This does not surprise me too much though; many are convinced that analyzing “real data” is necessarily superior to simulation (wrong!!). By contrast, simulation is at the very core of a number of computer-based baseball games, a couple of which are very well designed and could undoubtedly be used to generate a number of important insights if used for analytical purposes, IMO. In fact, baseball simulation has a long history, popularized by the Strat-O-Matic baseball board/dice game of the 1960s, but likely going back much further still.
What’s kind of interesting on that issue is that simulation need not be all that complex; it need not rely on tens of thousands of Monte Carlo runs that only a computer can accomplish after some heavy-duty programming. In fact, important insights can be gained via some simple considerations and math, as shown by the following example.
A long-standing question regarding offensive strategy is where in the lineup you should put your “best” hitter. Typically this refers to a guy who hits home runs but is also just a good overall contact hitter: i.e. the guy most likely to drive in runners on the bases. Most teams have maybe one or two such guys, rarely three or more (e.g. this year’s Anaheim Angels). The consensus view is that runs scored will be maximized if this hitter bats in either the third or fourth lineup slot, but I’ve also seen arguments for the fifth slot, and the Angels have such a player (Mike Trout) now batting second. The answer will depend partly on the nature of the rest of the hitters in the lineup, but we can nevertheless explore some generalities with some simple, but entirely reasonable, considerations.
Imagine that a team has two or three guys who can get on base via hits and walks 35% of the time over the course of a season (on base percentage, obp = .350). These guys do not hit a lot of home runs; they are mostly good contact hitters, mostly singles but some doubles, and who are also patient and draw some walks. A not uncommon scenario. Then you have one Miguel Cabrera type of hitter, who hits everything and will drive in a lot of runs, especially if men are on base. Should he bat 3rd, 4th (or elsewhere) in the lineup, to maximize the team’s runs scored over a season?
Arguments for hitting 3rd include (1) that he is guaranteed to hit in the first inning; it’s important to score early, psychologically, and this also leaves no chance that he will lead off the second inning (with nobody on base), and (2) that over the course of a 162 game season, each lineup slot gets 162 x (1/9) = 18 fewer plate appearances (PAs) than the slot above it. [This assumes that the last PA of any given game is a random draw from the nine lineup slots (not quite true, especially in the National League where the pitcher hits, but a good first approximation).] Arguments for hitting 4th are based primarily on the greater likelihood that more men will be on base, at least for the first time through the lineup, although I think there’s also still a certain amount of “well that’s just the way you play the game” to it.
Of these, we can address especially the issue of how often this hitter will hit with men on base–an important question–with some simple considerations. Consider the following.
The first inning is the most predictable of any in terms of who will bat: we know for sure that the top three slots will; whether or not following slots will bat depends on the slot number and what all the hitters in slots ahead of them do. For a typical game in 2014 (with scoring still down), MLB teams are going through their lineups just over four times in a game on average, that is 4 x 9 = 36 PAs. Hence an analysis of situation probabilities for the first inning represents about 25% of the total offensive situations that will be encountered by our team’s best hitter, and that’s likely to be significant. [It’s actually well higher than 25%, as explained further down, but the concept is the important part for now].
So let’s examine probabilities of the number of runners likely to be on base in the first inning, and what that means relative to our slugger hitting, say, 3rd vs 4th; very straightforward. Specifically:
    ## 1. Best hitter hits in 3rd slot:
    obp1 = .350; obp2 = .350   # The on base percentages of the first two hitters, independent of each other
    # 1a. Compute probabilities that a defined number of runners will be on base when the 3rd hitter comes to the plate
    (ob0.3 = (1-obp1)*(1-obp2))      # p(0 runners on)
    (ob2.3 = obp1*obp2)              # p(2 runners on)
    (ob1.3 = 1-(ob2.3 + ob0.3))      # p(1 runner on)

    ## 2. Best hitter hits in 4th slot:
    obp1 = .350; obp2 = .350; obp3 = .350   # The on base percentages of the first three hitters, independent of each other
    # 2a. Compute probs that defined number of runners will be on when the 4th hitter comes to the plate:
    (ob0.4 = (1-obp1)*(1-obp2)*(1-obp3))    # p(0 runners on)
    (ob3.4 = obp1*obp2*obp3)                # p(3 runners on)
    (ob1.4 = obp1*(1-obp2)*(1-obp3) +       # p(1 runner on)
             obp2*(1-obp1)*(1-obp3) +
             obp3*(1-obp1)*(1-obp2))
    (ob2.4 = 1 - (ob0.4 + ob3.4 + ob1.4))   # p(2 runners on)

    ## 3. Compare:
    options(digits=3)
    results = data.frame(0:3, c(ob0.3,ob1.3,ob2.3,NA), c(ob0.4,ob1.4,ob2.4,ob3.4))
    results[,3] = round(results[,3],3)
    colnames(results)=c("Runners", "Slot 3", "Slot 4")
    results

    ## 4. Results:
      Runners Slot 3 Slot 4
    1       0  0.423  0.275
    2       1  0.455  0.444
    3       2  0.122  0.239
    4       3      0  0.043
Ah, informative results! We see that if the slugger bats 3rd, about 42% of the time he’ll come to bat with nobody on base, and about 12% of the time with two men on base. If he hits 4th, those numbers are ~ 27.5% and 24%, respectively. The frequency that one man will be on is about equal between them, and for the bases loaded situation, hitting 4th is again slightly superior. So clearly by these numbers, hitting our best hitter 4th is the way to go.
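As a quick cross-check on those numbers: if every hitter ahead of the slugger has the same obp and they are treated as independent, the number of runners on base is just a binomial count, so R's built-in dbinom() should give the same probabilities. This is only a sketch under the same simplifying assumptions as above (nobody scores or gets thrown out before the slugger bats), and the variable names are my own.

```r
obp <- 0.350   # assumed common on-base percentage of the hitters ahead of the slugger

# Number of runners on base ~ Binomial(number of hitters ahead, obp)
slot3 <- dbinom(0:2, size = 2, prob = obp)   # slugger bats 3rd: 0, 1 or 2 runners
slot4 <- dbinom(0:3, size = 3, prob = obp)   # slugger bats 4th: 0, 1, 2 or 3 runners

check <- data.frame(Runners = 0:3,
                    Slot3 = round(c(slot3, 0), 3),   # 3 runners is impossible when batting 3rd
                    Slot4 = round(slot4, 3))
print(check)
```

The binomial shortcut only works because all the obp values are equal; with unequal values the term-by-term products are needed instead.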
But as mentioned above, the lineup slot that the good hitter is placed in will matter in more than just the first inning. On average, it will matter for just under two full rotations through the lineup. This is because, in a typical nine inning game, which has eight more innings after the first, the hitter leading off any given inning after the first will (again as a first approximation) be a random draw from among the nine slots. Thus, the slot #1 hitter will lead off an inning an average of 1 + (8/9) = 1.89 times per game. This in turn lends some predictability to how often the #3 and #4 lineup slots will hit with the potential for men to be on base.
From these, we can now compute the total number of PAs in a full season that our slugger will be estimated to bat with the different possible numbers of men on base, not counting any PAs for which we cannot estimate an expected number of baserunners (which will be, roughly, 4 – 1.89 = 2.11 PAs per game). This is given simply by the table results above, x 162 x 1.89, which gives:
      Runners Slot 3 Slot 4
    1       0    129     84
    2       1    139    136
    3       2     38     73
    4       3      0     13
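For anyone who wants to reproduce that scaling, here is a rough sketch. It assumes the equal obp of .350 used above and multiplies by the rounded 1.89 rather than the exact 17/9 (using the unrounded value can move a cell by one). The variable names are mine.

```r
obp <- 0.350
games <- 162
leadoff_per_game <- round(1 + 8/9, 2)   # 1.89, rounded as in the text

# First-inning-type probabilities of 0, 1, 2, 3 runners on base
p_slot3 <- c(dbinom(0:2, size = 2, prob = obp), 0)   # 3 runners impossible when batting 3rd
p_slot4 <- dbinom(0:3, size = 3, prob = obp)

# Expected season PAs in each situation, rounded to whole PAs
season <- data.frame(Runners = 0:3,
                     Slot3 = round(p_slot3 * games * leadoff_per_game),
                     Slot4 = round(p_slot4 * games * leadoff_per_game))
print(season)
```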
If I adjust for the estimated 18 PAs lost over a full season by the #4 hitter, due to the #3 hitter making the final out of the game (162 x 1/9), the changes are small:
      Runners Slot 3 Slot 4
    1       0    129     79
    2       1    139    128
    3       2     38     69
    4       3      0     12
So, that’s 50 fewer plate appearances with nobody on base, 11 fewer with one man on, 31 more with two men on, and 12 more with the bases loaded, giving a net difference in total baserunners of -11(1) + 31(2) + 12(3) = 87, most of whom (55) will be in scoring position. With a hitter of, say, Miguel Cabrera’s ability, that’s got to make a difference in some game outcomes, and with them, playoff implications, given that a win or two can readily affect who wins division titles and wild card berths.
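The arithmetic behind the adjusted table and those differences can be sketched as follows. Two things here are my assumptions rather than anything stated above: that the 18 lost PAs are spread proportionally across the four situations, and that with k runners on under one-base-at-a-time advancement, k - 1 of them are on second or third. Both happen to reproduce the numbers quoted.

```r
slot3 <- c(129, 139, 38, 0)   # unadjusted season PAs with 0, 1, 2, 3 runners on, batting 3rd
slot4 <- c(84, 136, 73, 13)   # same, batting 4th

games <- 162
lost_pa <- games / 9          # 18 PAs lost by the #4 slot relative to the #3 slot

# Assumption: shrink the Slot 4 column proportionally to account for the lost PAs
slot4_adj <- round(slot4 * (sum(slot4) - lost_pa) / sum(slot4))

runners <- 0:3
diff_pa <- slot4_adj - slot3                      # change in PAs per situation
net_runners <- sum(diff_pa * runners)             # net extra baserunners for the #4 slot
in_scoring_pos <- sum(diff_pa * pmax(runners - 1, 0))   # assumed: k runners puts k - 1 on 2nd or 3rd

print(data.frame(Runners = runners, Slot3 = slot3, Slot4 = slot4_adj, Diff = diff_pa))
print(c(net_runners = net_runners, in_scoring_pos = in_scoring_pos))
```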
One important note here. I assumed that none of the first two (or three) hitters score before the slugger hits, i.e. that every base advancement is always just one base at a time. This is of course completely unrealistic, but for the purposes here it doesn’t really matter, because a runner who is not on base due to having already scored is not in any way a liability. However, it’s not completely unimportant either. We can imagine for example an idealized situation in which the first two (or three) batters always hit home runs. In that situation there would be nothing to be gained by hitting the slugger behind these batters, because nobody will ever be on base–it would then be better to move him down the order so that some men are sometimes on base.
Baseball reality is of course much more subtle than these scenarios. For any particular team’s roster, determining an optimal lineup ordering would require many thousands of Monte Carlo simulation runs, in which all relevant player offensive statistics (including baserunning stats) are input and many possible lineup orders permuted: not just where in the lineup to bat your best hitter, but where best to bat all of them. That kind of thing’s been done with MLB-averaged data covering many years, but never that I know of with player-specific data over shorter time frames, much less in relation to specific pitching staffs’ statistics. Like many sabermetric analyses, such generalized analysis is of limited usefulness when it comes to addressing team- or player-specific questions.
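To give a flavor of what the very simplest Monte Carlo version looks like, here is a toy sketch that simulates only the first-inning question from earlier, under the same unrealistic assumptions (equal obp of .350, independent hitters, nobody scores or is thrown out before the slugger bats). It is nowhere near the full player-specific simulation described above; the seed, the number of runs and all variable names are arbitrary choices of mine.

```r
set.seed(1)        # arbitrary seed, only so the run is repeatable
n_sims <- 100000   # number of simulated first innings
obp <- 0.350       # assumed on-base percentage of each hitter ahead of the slugger

# One row per simulated inning, one column per hitter in slots 1-3 (TRUE = reached base)
reached <- matrix(runif(n_sims * 3) < obp, ncol = 3)

runners_slot3 <- rowSums(reached[, 1:2])   # runners on when the #3 hitter comes up
runners_slot4 <- rowSums(reached)          # runners on when the #4 hitter comes up

# Simulated frequency of each situation
sim <- data.frame(Runners = 0:3,
                  Slot3 = as.numeric(table(factor(runners_slot3, levels = 0:3))) / n_sims,
                  Slot4 = as.numeric(table(factor(runners_slot4, levels = 0:3))) / n_sims)
print(round(sim, 3))
```

The simulated frequencies should land close to the exact probabilities computed earlier. A realistic version would replace the single obp with per-player outcome probabilities (singles, doubles, walks, and so on) and track outs and base advancement.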
Anyway, I’m pretty sure there will be more baseball posts in the future 🙂