On baseball (finally!)

I’ve discussed no baseball here yet, which is kind of surprising, given that I’ve been a big fan all my life. I played a lot growing up, through high school and even a little in college and afterwards. If I had the time, I would likely start a blog just devoted strictly to baseball (and not just analysis either), because I have a lot to say on a lot of topics. But alas…

To me, the real interest in any sport comes from actually playing the game, not watching it, and I watch very little baseball (now) because the games are just too time consuming (though I still have a hard time refraining in October). When I do watch, I’m not obsessively analytical–that takes the fun out of it for me. It’s an athletic contest, not a statistics class; I want to see the center fielder go full speed and lay out for a catch, or a base thief challenge the pitcher or whatever, not sit there with numbers in my head. Analysis is for later, and I do like it, so I wade in at times, thereby joining the SABR-metric (or “sabermetric”) revolution of the last 3-4 decades, much of which was initiated by the Society for American Baseball Research (SABR). And baseball offers endless analytical opportunities, for (at least) two reasons.

Billy Hamilton of the Cincinnati Reds

First, there is a very large set of data of all kinds from present and past seasons to evaluate. The availability of these data is due to the invaluable efforts of a couple of volunteer projects. The first is Retrosheet, an ongoing, systematic effort to digitize old data from scorebooks and newspapers all the way back to the game’s origins in the 1870s. Baseball Reference (BR) has created a really nice online database with very fast search algorithms and a great interface, using these data. Moreover, a revolution in new, more relevant and informative data types is now occurring, involving the detailed tracking of the ball and of the movements of pitchers, batters and fielders, using cameras, radar and geometry. Since 2008, for example, the measured speed and computed trajectory of every MLB pitch have been recorded and are freely available from MLB (though as XML files, which is not a very user-friendly form). These, and batter and fielder movement data (not publicly available), are being used heavily by analysts employed by most MLB teams.

Second, there are several different strategies that can be employed in trying to win baseball games, which is a recipe for simulation analyses of many types. It is this analytical angle that interests me the most. Interestingly, however, although sabermetricians have been positively all over the analysis of existing data, generating countless papers, books and web articles, they have made little use of simulation as a whole. This does not surprise me too much though; many are convinced that analyzing “real data” is necessarily superior to simulation (wrong!!). Conversely, simulation is at the very core of a number of computer-based baseball games, a couple of which are very well designed and could undoubtedly be used to generate a number of important insights if used for analytical purposes, IMO. In fact, baseball simulation has a long history, popularized by the Strat-O-Matic baseball board/dice game of the 1960s, but likely going back much further still.

What’s kind of interesting on that issue is that simulation need not be all that complex; it need not rely on tens of thousands of Monte Carlo runs that only a computer can accomplish after some heavy duty programming. In fact, important insights can be gained via some simple considerations and math, as shown by the following example.

A long-standing question regarding offensive strategy is where in the lineup you should put your “best” hitter. Typically this refers to a guy who hits home runs but is also just a good overall contact hitter: i.e. the guy most likely to drive in runners on the bases. Most teams have maybe one or two such guys, rarely three or more (e.g. this year’s Anaheim Angels). The consensus view is that runs scored will be maximized if this hitter bats in either the third or fourth lineup slot, but I’ve also seen arguments for the fifth slot, and the Angels have such a player (Mike Trout) now batting second. The answer will depend partly on the nature of the rest of the hitters in the lineup, but we can nevertheless explore some generalities with some simple, but entirely reasonable, considerations.

Imagine that a team has two or three guys who can get on base via hits and walks 35% of the time over the course of a season (on base percentage, obp = .350). These guys do not hit a lot of home runs; they are mostly good contact hitters who hit mostly singles but some doubles, and who are also patient and draw some walks. A not uncommon scenario. Then you have one Miguel Cabrera type of hitter, who hits everything and will drive in a lot of runs, especially if men are on base. Should he bat 3rd or 4th (or elsewhere) in the lineup, to maximize the team’s runs scored over a season?

Arguments for hitting 3rd include (1) that he is guaranteed to hit in the first inning; it’s important to score early, psychologically, and this also leaves no chance that he will lead off the second inning (with nobody on base), and (2) that over the course of a 162 game season, each lineup slot gets 162 x (1/9) = 18 fewer plate appearances (PAs) than the slot above it. [This assumes that the last PA of any given game is a random draw from the nine lineup slots (not quite true, especially in the National League where the pitcher hits, but a good first approximation).] Arguments for hitting 4th are based primarily on the greater likelihood that more men will be on base, at least for the first time through the lineup, although I think there’s also still a certain amount of “well that’s just the way you play the game” to it.
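
The plate appearance arithmetic in argument (2) can be written out as a quick sketch. It assumes, as in the bracketed note, that the final PA of a game is equally likely to fall on any of the nine slots; the variable names are just mine, for illustration.

```r
# Sketch of the season-long PA gap between adjacent lineup slots.
# Assumption (from the text): the last PA of a game is a uniform random
# draw from the nine lineup slots.
games   <- 162
n_slots <- 9

# Slot k gets one more PA than slot k+1 only in games that end on slot k,
# which under the assumption is 1/9 of all games.
pa_gap <- games * (1 / n_slots)

# Season PA deficit of each slot relative to the leadoff slot.
pa_deficit <- setNames(pa_gap * (0:(n_slots - 1)), paste0("slot", 1:n_slots))

pa_gap
pa_deficit[c("slot3", "slot4")]
```

So under this approximation the cleanup hitter gives up about 18 PAs a season to the #3 hitter, which is the cost that the men-on-base argument has to outweigh.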

Of these, we can address especially the issue of how often this hitter will hit with men on base–an important question–with some simple considerations. Consider the following.

The first inning is the most predictable of any in terms of who will bat: we know for sure that the top three slots will; whether or not following slots will bat depends on the slot number and what all the hitters in slots ahead of them do. For a typical game in 2014 (with scoring still down), MLB teams are going through their lineups just over four times in a game on average, that is 4 x 9 = 36 PAs. Hence an analysis of situation probabilities for the first inning represents about 25% of the total offensive situations that will be encountered by our team’s best hitter, and that’s likely to be significant. [It’s actually well higher than 25%, as explained further down, but the concept is the important part for now].

So let’s examine the probabilities of the number of runners likely to be on base in the first inning, and what that means relative to our slugger hitting, say, 3rd vs 4th; very straightforward. Specifically:

## 1. Best hitter hits in 3rd slot:
obp1 = .350; obp2 = .350	# The on base percentages of the first two hitters, independent of each other

# 1a. Compute probabilities that a defined number of runners will be on base when the 3rd hitter comes to the plate
(ob0.3 = (1-obp1)*(1-obp2)) 	# p(0 runners on)
(ob2.3 = obp1*obp2) 		# p(2 runners on)
(ob1.3 = 1-(ob2.3 + ob0.3)) 	# p(1 runner on)

## 2. Best hitter hits in 4th slot:
obp1 = .350; obp2 = .350; obp3 = .350		# The on base percentages of the first three hitters, independent of each other

# 2a. Compute probs that defined number of runners will be on when the 4th hitter comes to the plate:
(ob0.4 = (1-obp1)*(1-obp2)*(1-obp3)) 	 	# p(0 runners on)
(ob3.4 = obp1*obp2*obp3) 			# p(3 runners on)
(ob1.4 = obp1*(1-obp2)*(1-obp3) +  		# p(1 runner on)
  obp2*(1-obp1)*(1-obp3) + 
  obp3*(1-obp1)*(1-obp2))
(ob2.4 = 1 - (ob0.4 + ob3.4 + ob1.4)) 		# p(2 runners on)

## 3. Compare:
results = data.frame(0:3, c(ob0.3,ob1.3,ob2.3,0), c(ob0.4,ob1.4,ob2.4,ob3.4))	# the 3rd hitter can never bat with 3 runners on, hence 0
results[,2:3] = round(results[,2:3],3)
colnames(results)=c("Runners", "Slot 3", "Slot 4")

## 4. Results:
  Runners Slot 3 Slot 4
1       0  0.423  0.275
2       1  0.455  0.444
3       2  0.122  0.239
4       3      0  0.043

Ah, informative results! We see that if the slugger bats 3rd, about 42% of the time he’ll come to bat with nobody on base, and about 12% of the time with two men on base. If he hits 4th, those numbers are ~ 27.5% and 24%, respectively. The frequency that one man will be on is about equal between them, and for the bases loaded situation, hitting 4th is again slightly superior. So clearly by these numbers, hitting our best hitter 4th is the way to go.
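
Because the hitters ahead of the slugger were all given the same on-base percentage and treated as independent, the number of runners on base is just a binomial count, so the same probabilities can be had in a couple of lines with R's dbinom(). This is only a sketch under those assumptions (plus the one-base-at-a-time assumption discussed further down); runners_on is a helper name I made up for illustration.

```r
# p(k runners on) when n_ahead hitters, each with the same on-base
# percentage obp, bat independently ahead of the slugger.
runners_on <- function(n_ahead, obp = 0.350) {
  dbinom(0:n_ahead, size = n_ahead, prob = obp)
}

slot3 <- runners_on(2)   # probabilities of 0, 1, 2 runners on
slot4 <- runners_on(3)   # probabilities of 0, 1, 2, 3 runners on

# Pad slot 3 with a zero: the bases cannot be loaded with only two
# hitters ahead of the slugger.
check <- data.frame(Runners = 0:3,
                    Slot3   = round(c(slot3, 0), 3),
                    Slot4   = round(slot4, 3))
check
```

If the hitters have different on-base percentages the binomial shortcut no longer applies, and the term-by-term products used in the original code are the way to go.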

But as mentioned above, the lineup slot that the good hitter is placed in will matter in more than just the first inning. On average, it will matter for just under two full rotations through the lineup. This is because, for a typical nine inning game, having thus eight more innings after the first, the hitter leading off any given inning after the first will (again as a first approximation) be a random draw from among the nine slots. Thus, the slot #1 hitter will lead off an inning an average of 1 + (8/9) = 1.89 times per game. This in turn lends some predictability to how often the #3 and #4 lineup slots will hit with the potential for men to be on base.
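
The 1.89 figure is a simple expectation, sketched here with the same first approximation (each inning after the first is led off by a uniform random draw from the nine slots). The variable names are mine.

```r
# Expected number of innings per game that a given slot leads off.
# Assumption (from the text): after the first inning, the leadoff hitter
# of each inning is a uniform random draw from the nine lineup slots.
innings <- 9
n_slots <- 9

leadoff_slot1  <- 1 + (innings - 1) / n_slots   # slot 1 always leads off the 1st
leadoff_others <- (innings - 1) / n_slots       # any other single slot

round(c(slot1 = leadoff_slot1, others = leadoff_others), 2)
```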

From these, we can now compute the total number of PAs in a full season that our slugger will be estimated to bat with the different possible numbers of men on base, not counting any PAs for which we cannot estimate an expected number of baserunners (which will be, roughly, 4 – 1.89 = 2.11 PAs per game). This is given simply by the table results above, x 162 x 1.89, which gives:

  Runners Slot 3 Slot 4
1       0    129     84
2       1    139    136
3       2     38     73
4       3      0     13

If I adjust for the estimated 18 PAs lost over a full season by the #4 hitter, due to the #3 hitter making the final out of the game (162 x 1/9), the changes are small:

  Runners Slot 3 Slot 4
1       0    129     79
2       1    139    128
3       2     38     69
4       3      0     12
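
For anyone who wants to rebuild the two season tables, here is a minimal sketch. Two things in it are my assumptions rather than anything stated above: the leadoff rate is rounded to 1.89 before multiplying, and the 18 lost PAs are spread across the four base states in proportion to their probabilities. With those choices the numbers match the tables.

```r
# First-inning probabilities of 0, 1, 2, 3 runners on (all obp = .350).
obp     <- 0.350
p_slot3 <- c((1 - obp)^2, 2 * obp * (1 - obp), obp^2, 0)
p_slot4 <- c((1 - obp)^3, 3 * obp * (1 - obp)^2, 3 * obp^2 * (1 - obp), obp^3)

games       <- 162
leadoff_top <- round(1 + 8 / 9, 2)   # 1.89, rounded as in the text (assumption)
pa_est      <- games * leadoff_top   # PAs per season with estimable runners on

# Unadjusted table: the same number of PAs for both slots.
unadjusted <- data.frame(Runners = 0:3,
                         Slot3   = round(p_slot3 * pa_est),
                         Slot4   = round(p_slot4 * pa_est))

# Adjusted table: remove the ~18 PAs the #4 slot loses over a season,
# spread proportionally over the base states (assumption).
pa_lost  <- games / 9
adjusted <- transform(unadjusted, Slot4 = round(p_slot4 * (pa_est - pa_lost)))

unadjusted
adjusted
```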

So, that’s 50 fewer plate appearances with nobody on base, 31 more with two men on, and 12 more with the bases loaded, giving a net difference in total baserunners on of -11(1) + 31(2) + 12(3) = 87, most of whom (55) will be in scoring position. With a hitter of say, Miguel Cabrera’s ability, that’s got to make a difference in some game outcomes. And with them, playoff implications, given that a single win difference or two can readily affect who wins division titles and wild card berths.
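
The bookkeeping behind those totals, as a short sketch using the counts from the adjusted table. The scoring-position line reflects my reading of where the 55 comes from: with one-base-at-a-time advancement, k runners on means k - 1 of them are on second or third.

```r
# Counts taken from the adjusted season table.
runners <- 0:3
slot3   <- c(129, 139, 38, 0)
slot4   <- c(79, 128, 69, 12)

# Change in PAs for each base state when moving the slugger from 3rd to 4th.
diff_pa <- slot4 - slot3

# Net extra baserunners over a season.
net_runners <- sum(diff_pa * runners)

# Extra runners in scoring position (assumption: with k >= 2 runners on,
# k - 1 of them stand on second or third).
multi          <- runners >= 2
in_scoring_pos <- sum(diff_pa[multi] * (runners[multi] - 1))

diff_pa
c(net_runners = net_runners, in_scoring_pos = in_scoring_pos)
```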

One important note here. I assumed that none of the first two (or three) hitters score before the slugger hits, i.e. that every base advancement is always just one base at a time. This is of course completely unrealistic, but for the purposes here it doesn’t really matter, because a runner who is not on base due to having already scored is not in any way a liability. However, it’s not completely unimportant either. We can imagine for example an idealized situation in which the first two (or three) batters always hit home runs. In that situation there would be nothing to be gained by hitting the slugger behind these batters, because nobody will ever be on base–it would then be better to move him down the order so that some men are sometimes on base.

Baseball reality is of course much more subtle than these scenarios, and we can thus imagine that for any particular team’s roster, determining an optimal lineup ordering will require many thousands of Monte Carlo simulation runs, in which all relevant player offensive statistics (including baserunning stats) are input and many possible lineup orders permuted, not just where in the lineup to bat your best hitter, but where best to bat all of them. That kind of thing’s been done with MLB-averaged data covering many years, but never that I know of with player specific data over shorter time frames, much less in relation to specific pitching staffs’ statistics. Like many sabermetric analyses, such generalized analysis is of limited usefulness when it comes to addressing team- or player-specific questions.
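
To give a flavor of the mechanics, here is a toy Monte Carlo sketch. It is nowhere near the full lineup simulation described above: it only re-derives the first-inning probabilities by random draws, keeping the same simplifications (independent hitters, obp = .350, nobody scores or gets erased on the bases). The seed and the number of runs are arbitrary choices of mine.

```r
set.seed(42)        # arbitrary seed, only for reproducibility
n_sims <- 100000    # number of simulated first innings
obp    <- 0.350

# Each row is one simulated first inning; each column is one of the three
# hitters ahead of the cleanup slot (TRUE = reached base).
reached <- matrix(runif(n_sims * 3) < obp, ncol = 3)

# Runners on when the #3 hitter bats (first two columns) and when the
# #4 hitter bats (all three columns).
on_for_3 <- rowSums(reached[, 1:2])
on_for_4 <- rowSums(reached)

# Tabulate the share of simulated innings with 0, 1, 2, 3 runners on.
sim <- data.frame(Runners = 0:3,
                  Slot3   = tabulate(on_for_3 + 1, nbins = 4) / n_sims,
                  Slot4   = tabulate(on_for_4 + 1, nbins = 4) / n_sims)
round(sim, 3)
```

The simulated shares should land close to the analytical table; a real lineup study would replace the single coin flip per hitter with each player's full set of outcome probabilities and track outs, bases and runs.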

Anyway, I’m pretty sure there will be more baseball posts in the future 🙂


24 thoughts on “On baseball (finally!)”

  1. Wow, you are far more into the game than I. And a neat set of analyses.

    So, let’s analyze a theoretical… under today’s rules, what is the minimum number of pitches thrown (both teams) to complete a nine inning game? As a hint (if my logic is right) you can actually box score the event.

    • Clem –
      I think the answer you’re looking for is 52. An example would be 8 1/2 innings of perfect pitching, with each batter making an out on the first pitch, followed by a home run on the first pitch in the bottom of the ninth. Lots of variations possible, for example a single followed by a double play, as long as every batter sees only one pitch and nobody is left on base.

      A pedant might argue for the answer 0! The trick is to get the first 51 outs without throwing a pitch. How can this happen? If a pitcher does not throw a pitch promptly (with no runners on base), the umpire can call a ball for delay of game. Do that four times, then pick the runner off first base. The game ends when the pitcher tries to pick off the 52nd runner (who reached base the same way). Throw is wild to the first baseman, and before the ball is recovered, the runner comes around to score.

    • Harold’s already gotten it, but your question brings up an interesting story. Last year I was having an online discussion with someone regarding the minimum number of hitters that a team can send to the plate in a regulation game and still win the game. The other person consulted Baseball Reference.com, where he found that it had never happened, at least since 1914 I think it was, and furthermore just twice had a team managed to send one more than that number to the plate and win. Lo and behold, it happened for just the 3rd time two nights later, in a game between the Astros and Padres I think it was, and I discovered it as I was just sort of casually and randomly looking over box scores! Crazier still, another game the very next night came within an inning of doing it again. I was completely floored by the coincidence involved, started looking over my shoulder a lot. 🙂

    • I checked at baseball-reference.com (love that site) and found the following full games with a win and the minimum 25 plate appearances:
      Boston over Washington, 8/16/1915
      Atlanta over Pittsburgh 7/25/1992
      Houston over San Diego 6/27/2012 (your game).

      There were also two 9-inning games where the visiting team won with the minimal 28 PA.

      Baltimore beat KC 7/30/1971 with only 13 plate appearances, in a game shortened to 4 1/2 innings. That’s one of those “records which can’t be broken”, because it’s impossible to send fewer batters to the plate.

    • Ah, so it really was the minimum possible number (25), not 26! I could’ve sworn otherwise. And it was 2012, not last year, for the Houston win. Thank goodness somebody’s looking at the details around here! :). Amazing that MLB went what, almost 77 years between the first and second occurrence. With pitching up and hitting way down these days, perhaps we’ll even see it again soon.

      BR.com is awesome. Whoever designed that (Sean Forman?) is a database genius. I’m building some databases with R using the raw Retrosheet data, that can do some things that BR.com cannot, but eventually that thing will do everything.

    • Great Harold! Yep, 52 is the number I was thinking of. The ‘delay of game’ rule I was unaware of. Thanks!

      I started thinking about this last summer when I saw a pitcher get out of an inning with 4 pitches. I suppose that happens from time to time. Can the stat links tell how often a pitcher has worked a three pitch (half) inning?

    • 3-pitch innings — I don’t recall ever seeing one of those. I couldn’t find any way to search for this at baseball-reference.com, although I did learn that in 2014, of 84716 outs by batter, 9113 required only one pitch. One would guess, given the large number of 1-2-3 innings, that there would be occasional 3-pitch innings.

      So I took the starter out of the game, and went to the bullpen, which produced the save. It turns out that 3-pitch innings happen a few times each year. Baseball Almanac has a list, although they note that only the recent years (since pitch count has been logged) are comprehensive. And yes, a few of these have been by my Red Sox, proving that (a) I don’t watch every inning, or (b) I don’t have perfect recall. [As it happens, both are true.]

    • I don’t believe BR.com has pitch by pitch data available yet, although they certainly have it, because it’s in the Retrosheet data they use. Baseball Almanac–nice site that I’d sort of forgotten about. But contrary to what it says, pitch counts (and results) are in fact officially recorded and have been since 1988, which is why so many entries on that list date from that year. When I get time I’ll run a check to see how many times it’s been done since ’88 and by who.

      “One of the most interesting three pitch instances took place on August 20, 1979, when six-year veteran infielder Jerry Terrell took the mound in the ninth inning for his first career pitching appearance and joined the 3 pitch inning club!” 🙂 Gotta love that.

      And don’t they exile you from Boston if you don’t watch every inning Harold? Or just force you to watch Yankee games as punishment?

    • “Don’t they exile you from Boston if you don’t watch every inning? Or force you to watch Yankee games as punishment?”
      This season, watching the Red Sox games is punishment. They coulda been a contender.

    • Ha ha–yeah what the heck happened this year anyway? And to see Ellsbury in pinstripes just has to add insult to injury I’m pretty sure.

    • Speaking of punishment, the 19-inning Sox-Angels game on Saturday went to about 3:30 AM EDT. And they lost in the end, too. Confession: I didn’t watch from the start (9 PM EDT), picked it up sometime after midnight. But still…

    • I’d missed that one. I can commiserate: the Tigers lost to the Jays in 19 yesterday also. Rumor has it that the Tiger bullpen has applied for federal disaster relief.

      Also, the Braves-Nationals game Saturday night ended at almost 2:30 AM: 3.5 hr rain delay followed by 11 inning game!

    • At least the “right” team won in the Tigers-Blue Jays game. (Anything which reduces the Yankees’ wild-card chances is the “right” result.)

      Brought to mind the most frustrating game I ever attended – Red Sox at Angels, when Bob Ojeda walked in the winning run in the bottom of the 14th, around 1 am. And I had to go to work the next day.

      P.S. Baseball-Reference corrects me: it was the bottom of the 15th. I’d forgotten that the winning run was scored by 4 walks, 2 of them intentional. And that the Sox had tied it up in the top of the 9th, while missing a great chance to go ahead.

    • Oh man, that’s seriously painful just hearing about, let alone having to watch it in person. At least Ojeda didn’t do it in Fenway, which would have been grounds for crucifixion if I’m not mistaken.

    • Crucifixion? Hardly. I think you have a wrong impression of Red Sox fans. It was only a regular-season game. A few days in the stocks on Boston Common, that should do it.

      Now, in the post-season, though…Grady Little can never show his face in this town again. Nor Bill Buckner or Calvin Schiraldi — and that was almost 30 years ago. Red Sox fans have long memories.

    • LOL–or make him sit by himself in some corner of Fenway.

      I still remember sitting in the lower deck in left field and watching Kaline drive one into the upper deck above us to help seal a win and take the division title from the Sox in ’72. But three years later it was all Sox, the Kaline/Lolich era was over and the Tigers were tanking fast. But the very next year this curly haired kid with a Bahstin accent showed up and took the city by storm.

    • Ah yes, The Bird, one of those brilliant baseball meteors. [Speaking of which, I need to remember to spend some time with the Perseids tonight. Hope it’s not too cloudy, it was wonderfully clear last night.]

      I found this delightful anecdote in his Wiki bio:

      In one of Bill James’ baseball books, he quoted the Yankees Graig Nettles as telling about an at-bat against Fidrych, who, as usual, was talking to the ball before pitching to Nettles. Immediately Graig jumped out of the batter’s box and started talking to his bat. He reportedly said, “Never mind what he says to the ball. You just hit it over the outfield fence!” Nettles struck out. “Damn,” he said. “Japanese bat. Doesn’t understand a word of English.”

    • Oh I like that!! Hadn’t heard that one before.

      It’s difficult to describe to those not there the excitement that Mark generated. “Electric” gets over-used, but it definitely applies in this case. I still remember the ABC Monday Night Baseball telecast in June ’77 when he struck out Randolph, Munson, Jackson, Nettles and Dent over the first three innings. Pre-closer days; Mark pitched the ninth and preserved a 2-1 win. He beat the Yanks on the Monday night telecast the previous year as well. There were lots of games like this, because the Tigers didn’t have much of an offense.

      edit: not really pre-closer days, that was wrong. It’s just that Bird finished a lot of games, as did a lot of pitchers back then.

    • Also, Mark said he was not actually talking to the ball, but rather to himself, reminding himself to keep it low, follow through, extend etc. It’s just that he held the ball in such a way that it appeared he was talking to it.

    • I didn’t realize that Fidrych was actually talking to himself and not the ball. Quirky either way, but appealingly so. By the way, I ran across this new biography, which might be of interest to you.

      I wish I had known that he resided in Northborough, MA — that’s not far from where I live, and I would have liked to have met him.

    • Thanks Harold; most definitely interested in that. Seems it should sell pretty well; he was one of the modern game’s great personalities, up there with Yogi, Bill Lee, Don Stanhouse, Jim Bouton, and a few others.

  2. On second thought, there would be dozens, perhaps hundreds of different box scores that could be dreamed up with the same result…. so ignore that part.

  3. Jim Bouldin wrote, “To me, the real interest in any sport comes from actually playing the game, not watching it, and I watch very little baseball (now) because the games are just too time consuming … ”

    Most sports were created for TV and movies although neither had been invented. A great example of this is 16 Days of Glory. Take out all the dead time and weave the rest into a story with well written commentary. Highlights. Just the highlights.
