Wednesday, November 28, 2007
First, I want to summarize one more test that I ran using the Pythagorean Theorem of Baseball model of expected performance that I've used in previous posts. I started with a team with 750 runs scored and 750 runs allowed and then gave it 100 runs to add to its runs scored or subtract from its runs allowed in any combination. Then I checked to see which combination would give the highest expected performance.
I had expected beforehand (foolishly as it turns out) that somewhere around 50 additional runs scored and 50 additional runs saved would give the optimal expected winning percentage. As it turns out, you achieve the optimal result by deducting all 100 runs from runs allowed. The difference is not large, only one win over a 162-game season. This gives credence to the idea that a run saved is more valuable than a run scored, though only marginally so.
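The 100-run test can be sketched in a few lines. This is a rough illustration, not the original calculation: it assumes the textbook exponent of 2, whereas the posts use a slightly modified exponent that is never stated, and the function names are made up for the example.

```python
def pythag_win_pct(runs_scored, runs_allowed, exponent=2.0):
    """Expected winning percentage from the Pythagorean formula.

    exponent=2.0 is an assumption; the modified exponent used in the
    posts is not given.
    """
    rs = runs_scored ** exponent
    ra = runs_allowed ** exponent
    return rs / (rs + ra)


def best_split(base_rs=750, base_ra=750, budget=100, exponent=2.0):
    """Try every whole-run split of the budget; return (added, saved, W%) for the best one."""
    results = []
    for added in range(budget + 1):
        saved = budget - added
        pct = pythag_win_pct(base_rs + added, base_ra - saved, exponent)
        results.append((added, saved, pct))
    return max(results, key=lambda r: r[2])


added, saved, pct = best_split()
print(f"best split: +{added} scored, -{saved} allowed, W% = {pct:.3f}")

# Compare against spending the whole budget on offense instead.
all_offense = pythag_win_pct(850, 750)
print(f"edge over all-offense in 162 games: {(pct - all_offense) * 162:.1f} wins")
```

Because the formula depends only on the ratio of runs scored to runs allowed, and that ratio is largest when every run comes off the runs allowed total, the same split wins for any positive exponent.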
This got me thinking about why this was the case. It was only then that it occurred to me that the only way to expect to win 100% of your games was to allow zero runs. No matter how many runs a team scores, if it allows even one run over the course of a season there is a chance that it will lose a game. This is why our examination of the problem using the PToB values the run saved slightly more than the run scored: it puts your team closer to the perfect scenario.
Another way of looking at it is that increasing how many runs you score acts as inflation in the run economy of baseball: it devalues all other runs. Conversely, allowing fewer runs is deflationary: each run is now worth more. Therefore, an absolute difference of 100 runs is a lot more significant when the overall run totals are lower. It's exactly the same difference as the difference between Jane making $5,000 more than Dick in a mythical two-person economy in which there are only $50,000 total and Jane making $5,000 more than Dick in a $10,000 two-person economy. The differences are identical, but the difference is worth a lot more in the $10,000 universe. Since saving more runs decreases the total amount of runs in the baseball universe, you need a larger increase in runs scored to have the same impact on winning as a given decrease in runs allowed.
This ignores, however, a key aspect of the baseball landscape: you cannot save runs beyond zero runs allowed. On the other hand, you can continue to score runs ad infinitum. In other words, the value of saving runs is offset by the fact that it will quickly become very hard to make further gains in that area. It is possible, in any given baseball game, to pitch far less than perfectly and still achieve the perfect outcome for runs allowed: zero. Shutouts are fantastically more common than perfect games. Thus, even if you continue to improve your pitching, you should reach a point of diminishing returns where even though you are pitching better, it is not reflected in your runs allowed total. With hitting, you can theoretically keep improving it until you reach the perfect offense, one that never makes an out and therefore scores an infinite number of runs.
Now, on to my second observation. (Yes, that's right. The preceding 6,000,000,000 words are only my first point.)
I was going to turn to historical data and run an experiment to see what the impact of scoring and allowing additional runs was historically. Specifically, for each trial I was going to select a real baseball team of ages past and then randomly select a game from that team's season from which to subtract one run from their opponent's run total. I would do this 25 times for each team (allowing the same game to be picked twice, resulting in further deduction) and then measure what that team's new record would be (ties counting as 0.5 wins and 0.5 losses). I would then run a sufficiently large number of trials and see what the aggregate impact was. Then, I would repeat the process for runs scored, measuring what the impact was.
Then, another thought occurred to me that saved me from a lot of useless work. What if instead of picking games at random, I instead enumerated all possible combinations of 25 games (again, with multiplicity) for each trial?* In that case, each trial in the runs allowed test would have a corresponding trial in the runs scored test that had the exact same 25 games picked (and vice versa). And of course, the impact of deducting a run from an opponent's total in any given game is exactly the same as adding a run to your own. Therefore, if we enumerate every possible selection of games, the results for the runs saved test would be exactly the same as the results for the runs scored test! Right?
Sort of. There's only one problem with the above logic. We haven't decided what would happen when one of the selected games for the runs allowed test was already a shutout victory. Do we skip that game and then that team loses one of its runs saved? Or do we pick another game to preserve the runs saved total? There are good arguments both ways, but I think only two observations are important for this exercise.
First, by dropping the run saved instead of searching for another game, we preserve the exact one-to-one relationship with the runs scored test. This will cause the runs scored and runs allowed tests to have exactly the same results, but we now aren't dealing with identical totals of runs scored and runs saved. Secondly, if we pick another valid game from which to save a run, we preserve the runs scored and runs saved totals, but the one-to-one relationship is destroyed and the runs saved test necessarily finishes with a higher expected win total. This must be true because the runs scored test will apply some of its runs to a set of games (shutout wins) that can never increase the win total and the runs saved test will instead apply those runs saved to games that might still be won, sometimes increasing the win total.
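The two ways of handling a shutout can be sketched as follows. Everything here is illustrative: the game log is a hypothetical placeholder rather than a real team's season, and the function names and the `rng` switch between the "drop" and "redraw" rules are assumptions made for the example.

```python
import random

# HYPOTHETICAL game log of (our_runs, opponent_runs); placeholder values, not real data.
GAMES = [(5, 3), (2, 0), (1, 4), (3, 4), (0, 1), (7, 0), (4, 5), (6, 2), (2, 3), (1, 0)]


def wins(games):
    """Win total, counting a tie as half a win."""
    total = 0.0
    for ours, theirs in games:
        if ours > theirs:
            total += 1.0
        elif ours == theirs:
            total += 0.5
    return total


def add_runs_scored(games, picks):
    """Add one run to our score in each picked game (repeats allowed)."""
    out = [list(g) for g in games]
    for i in picks:
        out[i][0] += 1
    return wins(out)


def save_runs(games, picks, rng=None):
    """Remove one opponent run per picked game.

    When the opponent already has zero runs in a picked game:
      rng is None -> drop that run saved (keeps the pairing with the runs scored test);
      rng given   -> redraw a game in which a run can still be saved.
    """
    out = [list(g) for g in games]
    for i in picks:
        if out[i][1] == 0:
            if rng is None:
                continue
            candidates = [j for j, g in enumerate(out) if g[1] > 0]
            if not candidates:
                continue  # nothing left to save anywhere
            i = rng.choice(candidates)
        out[i][1] -= 1
    return wins(out)


def average_wins(trials=5000, runs=5, seed=0):
    """Average win total under each rule over many random selections of games."""
    rng = random.Random(seed)
    totals = {"scored": 0.0, "saved, drop": 0.0, "saved, redraw": 0.0}
    for _ in range(trials):
        picks = rng.choices(range(len(GAMES)), k=runs)  # sampled with multiplicity
        totals["scored"] += add_runs_scored(GAMES, picks)
        totals["saved, drop"] += save_runs(GAMES, picks)
        totals["saved, redraw"] += save_runs(GAMES, picks, rng)
    return {name: total / trials for name, total in totals.items()}


print(average_wins())
```

For the same picks, the redraw rule can never finish below the drop rule, since it only ever removes additional opponent runs. One caveat on the drop rule: with repeated picks, a 1-0 loss can be reduced to a 0-0 tie, and at that point dropping the next run saved no longer matches the runs scored result exactly.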
This brings us back to the initial point. Because of the inflationary/deflationary effect of adding and removing runs from the baseball run economy, saving a given number of runs is marginally more valuable than scoring the same additional amount because saved runs must go to games you have a chance of losing, while the additional runs scored might occur in a game you already have no chance of losing.
So what impact should this have on the question of how much of "the game" is pitching and how much is hitting (and fielding and base running)?
First, I again note that the difference between a run scored and a run saved is not large, especially when we aren't at the extremes of the two ranges.
Second, when one sets out to acquire players one does not actually acquire a given decrease in runs allowed or increase in runs scored. Rather, the players themselves generally contribute only to the individual components of run scoring and prevention. A player doesn't simply score a run (other than by hitting a home run). Rather, he hits singles, doubles, triples, and home runs, draws walks, and steals bases. A pitcher doesn't simply save runs. Rather, he throws strikes, induces groundballs, and performs other tasks that are simply components of run prevention.
This impacts our discussion because as we have noted there is a distinct lower bound on how many runs you can allow. If I have a rotation of pitchers that throw 10-hit shutouts every time out and I replace them with a rotation of pitchers that throw 81-pitch, 27-strikeout perfect games every time out, I will win exactly zero more games despite the fact that I have drastically improved my pitching. In the baseball universe, the small edge that run prevention has over run acquisition is muted by the fact that it is far easier to hit the point of diminishing returns with pitching than it is with hitting. With hitting, you never know: that 18-run outburst just might win you a game 18-16. However, the perfect game can never improve upon the results of a seven-walk, four-hit shutout.
So, in the end, I stand by my original line of thinking. Even though we have demonstrated that a run saved is marginally more valuable than a run scored, this difference is muted by the diminishing returns at the extremes for run prevention. Furthermore, this difference is not enough to push pitching over 50% of "the game," as Hank Steinbrenner would have us believe.
* Math note: keep in mind that the reason we do the whole "let's pick a bunch of games at random" thing is that it allows us to approximate the result we would get if we enumerated every possibility. Enumerating all the possibilities takes a prohibitively large amount of time, but doing 1,000,000,000 random samples of those possibilities doesn't take very long at all on today's computers and should be a sufficient approximation. However, when looking at the problem theoretically, we can still consider the set of all possible combinations and avoid introducing the approximation where it isn't needed.
Saturday, November 24, 2007
"I'm excited to see Torii Hunter as an Angel. K Law, as always, makes a good statistical argument which is all you can do at this point, but special teams that end up on top often supersede their statistical averages. For that we'll have to wait and see."
This comment elicited a raised eyebrow from me. On the one hand, I would doubt, though I only speculate, that Mr. thoyt06 has done the research to demonstrate his claim that the teams that end up on top often exceed their expected performance. On the other hand, he's right, but not for the reason he probably thinks he is.
Let's take a step back. If we have three teams with the same level of expected performance, which team will end up "on top" at the end of any series of trials? By definition, the team with the best actual performance will also be the team that exceeds its expected performance by the largest amount. This must be so, because the teams were expected to finish at the same level. In order to finish higher or lower than the other teams, that team will need to beat or fail to meet expectations.
Of course, in real life, all teams do not have the same expected level of performance. However, it is still the case that exceeding expectations will boost your chances of being "on top." For example, if an expected 95 win team under-performs by 5 wins (not at all uncommon) and an expected 87 win team over-performs by 4 wins, the expected 87 win team will be "on top," despite the fact that it is objectively not as good as the expected 95 win team.
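A quick simulation makes the selection effect concrete. This is only a sketch under a strong assumption: each game is treated as an independent coin flip with a fixed win probability equal to the team's expected winning percentage. The function names are invented for the example.

```python
import random


def simulate_wins(true_win_pct, games, rng):
    """Play out one season as independent coin flips (a simplifying assumption)."""
    return sum(rng.random() < true_win_pct for _ in range(games))


def average_excess_of_top_team(expected_wins, seasons=2000, games=162, seed=1):
    """Average (actual - expected) wins for whichever team finishes on top."""
    rng = random.Random(seed)
    total_excess = 0.0
    for _ in range(seasons):
        finishes = [(simulate_wins(e / games, games, rng), e) for e in expected_wins]
        # Team with the most actual wins; ties go to the higher expectation.
        actual, expected = max(finishes)
        total_excess += actual - expected
    return total_excess / seasons


# Three evenly matched teams: the one on top has, on average, beaten expectations.
print(average_excess_of_top_team([81, 81, 81]))

# The 95-win and 87-win teams from the example above.
print(average_excess_of_top_team([95, 87]))
```

No team in this simulation has any "special" quality, yet the team that finishes on top still shows a positive average excess over its expected win total.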
So the question isn't whether or not teams that end up "on top" tend to exceed expectations. On the contrary, they almost have to exceed expectations. The question really is: do some teams consistently outperform their expectations? In other words, can we identify discernible qualities or strategies that "special" teams employ that cause them to exceed expectations? If so, then it is possible that the Angels are a "special" organization and that Hunter is a "special" player.
But how do you demonstrate this? You can't use past results, because as we have shown here, you will almost always find that the teams that won exceeded expectations, simply because exceeding expectations increases the likelihood of being "on top." If you create an alternate model that identifies teams that are likely to exceed expectations, then all you've really done is create a better expectation. It may change your view of a team's strategy, but it won't change the basic premise: teams that are "on top" are likely to have exceeded expectations.
No, the ultimate problem here is that we, particularly as fans, look back and attribute past over-performance to something other than chance because we want to believe that our guys are "special." We make the argument that the rules of expected performance are different for "special" teams because it allows us to claim that our guys are inherently better than your guys. They didn't win because fate smiled on them. They won because they had more "heart," or "grit," or "guts," or whatever vacuous term you choose to use to explain an unexpected result in a positive light.
So should we just "wait and see" if the Angels are a "special" team? No, not really. If the Angels outperform expectations, one of two things will be true. Either they just got lucky, or they are smarter than the rest of us and have a better model of expected performance. Either way, it won't be because they have a special, magical quality to their team.
Wednesday, November 14, 2007
For clarity, I do not assert that the work here is groundbreaking or particularly original. Undoubtedly, someone has already performed the task of verifying the good old PToB. Nonetheless, for those of you who have not seen this before, this should give you plenty of food for thought.
First, let's examine just how accurate the model is. To do this, I calculated the expected winning percentage of each Major League Baseball team from 1900 through 2006. I then compared this with each team's actual record to see how many wins difference there was between the actual win total and the expected win total. This resulted in a total of 2160 team-seasons for analysis. Here are the results, in histogram form:
Here we see the frequency with which deviations from expected win total were distributed. Each bar represents the total of all team-seasons whose deviation from its expected win total was within 0.5 wins of the deviation represented by the bar. This effectively puts each team-season into a bin and then counts the number of team-seasons in each bin. In this case, we see that the +1 bin contains more than 200 occurrences, indicating that from 1900 through 2006 more than 200 teams finished with 0.5 to 1.5 more wins than their expected win total.
The important thing to take away from this is the shape of the distribution: the data are quite normally distributed about zero. This indicates that the PToB evenly distributes its error on either side of the actual win total for a given team-season. This is a reassuring result because it means that the PToB does not appear to be inherently biased towards over- or under-estimating win totals.
In fact, within this sample, the mean deviation from expected win total was -0.0359 with a standard deviation of 4.04 wins. Since the deviations are approximately normally distributed, this means that roughly 68% of team-seasons will have actual win totals within about 4 wins of their expected win total.
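The binning and the 68% figure can be sketched like this. The deviations below are simulated normal draws standing in for the real data, which is not reproduced here; only the mean, the standard deviation, and the count of 2160 team-seasons come from the post, and the function names are made up for the example.

```python
import math
import random
import statistics
from collections import Counter


def bin_deviations(deviations):
    """Count team-seasons per integer bin; bin k covers [k - 0.5, k + 0.5)."""
    return Counter(math.floor(d + 0.5) for d in deviations)


def summarize(deviations):
    """Return (mean, standard deviation, share within one SD of the mean)."""
    mean = statistics.mean(deviations)
    sd = statistics.pstdev(deviations)
    inside = sum(abs(d - mean) <= sd for d in deviations)
    return mean, sd, inside / len(deviations)


# SIMULATED stand-in for the 2160 real deviations (actual wins minus expected wins).
rng = random.Random(2007)
deviations = [rng.gauss(-0.0359, 4.04) for _ in range(2160)]

bins = bin_deviations(deviations)
mean, sd, share = summarize(deviations)
print(f"+1 bin: {bins[1]} team-seasons")
print(f"mean {mean:.3f}, sd {sd:.2f}, share within one sd {share:.1%}")
```

With real data, `deviations` would simply be the list of actual minus expected win totals, one per team-season.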
Now we know the extent to which we can trust the PToB model. However, we need to go a bit farther than that to place confidence in our previous conclusions. Specifically, we need to demonstrate that the PToB is not biased towards run scoring or run prevention. For example, if the model consistently over-estimated the win total for teams with high runs scored totals and under-estimated the win total for teams with low runs allowed totals, then this would indicate that it was not properly valuing run scoring and run prevention relative to each other.
To put it another way, if teams that allow fewer runs than other teams consistently beat their expected win total, then we would have to ask ourselves why this was the case. We would be forced to conclude that run prevention was not being properly valued in the PToB; obviously we would need to place more emphasis on run-prevention to correct for the constant under-estimation of win totals for teams that allow few runs. On the other hand, if we cannot find these patterns, then this is an excellent indication that the PToB does indeed value run scoring and run prevention correctly relative to each other. This in turn would make it an excellent tool for answering our original question: is a run scored equal to a run saved?
Let's look at some more data:
Here we see a scatter plot of runs scored versus deviation from expected win total. See a pattern? I sure don't. This is pretty much a textbook example of two data sets that are not correlated: all we have is a giant blob of points with no apparent relationship. Indeed, by doing a regression on the data, we find that a team's runs scored total can explain only 0.15% (r-squared of 0.0015) of a team's deviation from its expected win total. To say that this is not in any way significant would be an understatement.
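For anyone who wants to repeat the check, r-squared for a simple one-variable regression is just the squared correlation between the two columns. A minimal sketch, with a made-up function name and toy inputs used only as a sanity check:

```python
def r_squared(xs, ys):
    """Share of the variance in ys explained by a straight-line fit on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return (sxy * sxy) / (sxx * syy)


# Toy sanity checks, not baseball data.
print(r_squared([1, 2, 3, 4], [3, 5, 7, 9]))    # perfectly linear relationship
print(r_squared([1, 2, 3, 4], [1, -1, -1, 1]))  # no linear relationship
```

To reproduce the test described here, `xs` would be each team-season's runs scored and `ys` its deviation from expected wins.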
Let's do the same for runs allowed:
Again, we see a formless blob. Regression results are also similar: runs allowed explain only 0.39% (r-squared of 0.0039) of a team's deviation from its expected win total.
One last test: let's see if the ratio of runs scored to runs allowed shows any significant trend. If it did we could theorize that the PToB was biased towards teams with significant gaps between runs scored and runs allowed.
Same result: another formless blob. RS/RA accounts for only 0.56% (r-squared of 0.0056) of a team's deviation from its expected win total.
So what can we make of all this? Essentially, the Pythagorean Theorem of Baseball model does not show a discernible bias towards teams' runs scored and runs allowed totals. In fact, a team's skill at preventing or scoring runs tells us next to nothing about how it will deviate from its expected win total. This is strong evidence in support of the conclusions that we drew from analyzing the effect of varying runs scored and runs allowed on expected winning percentage. Since the average deviation from expected win total is centered around zero and shows no bias towards runs scored or runs allowed, the expected winning percentages that the PToB provides us are a good way to measure the effects of run scoring and run prevention on real life win totals.
Monday, November 12, 2007
The simplest way to examine this question is to use a modified version of Bill James' Pythagorean Theorem of Baseball to analyze how scoring and allowing runs influences a team's expected winning percentage. The "theorem," which gets its name from its resemblance to the Pythagorean Theorem proper, relates a team's winning percentage to its runs scored and runs allowed via the formula:
W% = RS^2 / ( RS^2 + RA^2 )
where W% is the team's expected winning percentage, RS is the number of runs the team scored, and RA is the number of runs a team allowed. Naturally, the relationship is not perfect (indeed, for this exercise I am using a slightly modified exponent for increased accuracy), but it does capture the essence of the relationship between run scoring and winning. For example, in 2007 the Yankees scored 968 runs and allowed 777. We would have expected them to win 98.3 games and lose 63.7. In reality, they won 94 and lost 68. They underperformed expectations by 4.3 wins. Historically, this is a fairly standard deviation from expectations.
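As a rough check of the Yankees example, here is the formula with the exponent left as a parameter. The function name is invented for this sketch, and the default exponent of 2 is an assumption: it gives about 98.5 expected wins, slightly different from the 98.3 quoted above, which comes from the modified exponent that is not specified.

```python
def expected_wins(runs_scored, runs_allowed, games=162, exponent=2.0):
    """Expected wins from the Pythagorean formula.

    exponent=2.0 matches the formula as written; the slightly modified
    exponent mentioned in the post is not given.
    """
    rs = runs_scored ** exponent
    ra = runs_allowed ** exponent
    return games * rs / (rs + ra)


# 2007 Yankees: 968 runs scored, 777 allowed, 94 actual wins.
expected = expected_wins(968, 777)
print(f"expected wins: {expected:.1f}")
print(f"actual minus expected: {94 - expected:.1f}")
```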
From this formula, we can examine what happens to a team's expected winning percentage when we vary runs scored and runs allowed. Let's jump into the data.
Here you can see a spreadsheet where I've calculated the expected winning percentages for all combinations of RS and RA between 500 and 1000 in increments of 25 runs. The first thing that you should notice is the line where RS is equal to RA. As it should be, this line shows us that a team that scores as much as it allows should always expect a .500 winning percentage. If you did not expect this, it may be time for a refresher on basic math.
Now then, by picking one of the cells on the spreadsheet, we can see the effect on expected winning percentage if we save an additional 25 runs by moving up one cell. We can see the effect on expected winning percentage if we score an additional 25 runs by moving right one cell.
For example, if a team scored 900 runs and allowed 800, we would expect a winning percentage of 0.553, a roughly 90 win team over the course of a season. If that team were to save 25 more runs to become a 900 RS/775 RA team, its new expected winning percentage would be 0.567, a 92 win team. If that team were to add 25 more runs to become a 925 RS/800 RA team, its new expected winning percentage would be 0.565, also a roughly 92 win team. In this case, it appears that a run scored does equal a run saved.
Let's look at it a bit differently. Here you see a similar spreadsheet, but with different data. This spreadsheet shows the ratio of 25 additional runs saved to 25 additional runs gained from the current RS/RA. Here we see that the effects of runs scored and runs saved on expected winning percentage vary depending on our baseline of RS and RA. Interestingly, teams that outscore their opponents already benefit more from saving additional runs. Teams that are outscored by their opponent benefit more from scoring additional runs.
As a caveat, I note that while the differences at the margins appear extreme (the ratio approaches 2:1 depending on which side of 0.500 you are), the largest difference between scoring or saving an additional 25 runs is only 0.008, 1.3 wins over a 162 game season. Furthermore, as with many statistical models, the extremes are where the Pythagorean model itself breaks down.
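The ratio spreadsheet can be rebuilt along these lines. This is a sketch rather than the original spreadsheet: it assumes an exponent of 2 instead of the unspecified modified exponent, so individual cells will differ slightly from the numbers quoted above, and the function names are made up.

```python
def win_pct(rs, ra, exponent=2.0):
    """Pythagorean expected winning percentage (exponent of 2 is an assumption)."""
    return rs ** exponent / (rs ** exponent + ra ** exponent)


def gain_ratio(rs, ra, step=25, exponent=2.0):
    """Gain from saving `step` runs divided by gain from scoring `step` runs."""
    base = win_pct(rs, ra, exponent)
    saved = win_pct(rs, ra - step, exponent) - base
    scored = win_pct(rs + step, ra, exponent) - base
    return saved / scored


# Every RS/RA combination from 500 to 1000 in steps of 25 runs.
levels = range(500, 1001, 25)
grid = {(rs, ra): gain_ratio(rs, ra) for rs in levels for ra in levels}

print(f"900 RS / 800 RA: {grid[(900, 800)]:.3f}")
print(f"800 RS / 900 RA: {grid[(800, 900)]:.3f}")
print(f"range of ratios: {min(grid.values()):.2f} to {max(grid.values()):.2f}")
```

A ratio above 1 means the 25 runs saved are worth more than the 25 runs scored at that baseline; a ratio below 1 means the opposite.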
From this data, it should be safe to conclude that, at least in terms of expected winning percentage, a run scored is on average equal to a run saved. Certainly, there is variation, but that variation is centered around a ratio of 1 RS to 1 RA. It would appear that Mr. Steinbrenner is incorrect.
There are definitely problems with the method. Primarily, we have examined expected winning percentage. If our expected winning percentage model does not itself capture the relationship between RS and RA, then our results will be poor. One of the ways we can examine this is to see if there is a relationship between over- or under-performing expectations and RS or RA. If there is, then this might indicate that our predictor is doing a poor job of capturing this relationship. Hopefully, I can follow up this post with an examination of this issue at a later date.
In order to address the problems of expected winning percentage versus actual winning percentage, my next view of the problem will try to answer the run scored versus run saved question from historical data. Stay tuned!
Saturday, November 10, 2007
Let me be perfectly clear about this: good scouting is absolutely essential to a well run baseball team. Scouting provides data that raw statistics will struggle to uncover. Scouts are very important and I have no quarrel with them.
Unfortunately, a disturbing trend has developed in the mainstream presentation of scouting data. Too often, analysts that are supposed to be providing the public with a scout's view of players have instead become nothing more than poor statistical analysts, justifying their use of small sample sizes with the vague notion that they are scouting. This type of analysis is not only completely useless, adding nothing to the discussion, but also damaging to scouting as a whole. Scouting should not become a tool for giving undue weight to a small sample of performances. It is supposed to supersede the small sample by providing data that cannot be gleaned from statistics alone.
I suppose an example or two is in order. Kevin Goldstein is a writer for Baseball Prospectus. In fact, he's supposed to be their scouting guru. He was brought to BPro with the hopes of expanding their coverage beyond just statistical analysis. I like Kevin's columns and I read almost everything he writes. He writes a column every Monday that offers a small blurb on ten different prospects of note. Here is an example from his latest:
RHP Jake Arrieta, Phoenix Desert Dogs (Orioles)
Arrieta is becoming an offseason Ten Pack regular, as the Orioles keep pitching him an inning at a time, and he keeps putting up zeroes. At this point it’s gone from “nice start” to “downright impressive,” as Arrieta had his best outing yet on Saturday, striking out all three batters he faced. So far, the Orioles fifth-round pick who got first-round money has put together 12 scoreless innings over 10 appearances, while allowing just six hits and striking out 13. It’s a little too early to call him a steal, and his disappointing final college season is still in the back of people’s minds, but his timetable is on the verge of getting accelerated.
It's nice to know what Arrieta is up to, but if you're looking for any useful information here, you should be sorely disappointed. There isn't any. There is not one shred of scouting data in this blurb. The only information presented to the reader is some statistical data from a sample size so absurdly small that it is totally, utterly meaningless. Arrieta might be a good prospect, but there is no reason whatsoever from this paragraph to suppose that he is. If Goldstein knows what makes Arrieta great, he hasn't included that information here, and it defeats the purpose of his presence on the BPro staff.
A small sample size is a small sample size. Unless you present compelling evidence above and beyond the data itself that it should be given significance, you cannot glean any useful information from a small sample. Let's look at another example:
RHP Daniel Bard, Honolulu Sharks (Red Sox)
Friday’s Boston prospect rankings, like any prospect list, generated a lot of email. Most of it concerned guys who didn’t make it, like Brandon Moss or Craig Hansen, but nobody asked about Daniel Bard. Twelve months ago, that wouldn’t have been the case, because last year at this time, Bard was a highly regarded first-round pick who could touch 100 mph, although he had some issues when it came to command and secondary stuff. This year, the wheels fell off. Beginning the year at High-A and then spending the majority of the year at Low-A after a demotion, Bard finished the year with a 7.08 ERA and 78 walks in 75 innings. Using the Hawaii Winter League as an opportunity to find the magic once again, the good news is that Bard has a 0.69 ERA in 13 innings while allowing just seven hits. The bad news is that he’s walked 11. It doesn’t matter how hard you throw if you have no idea where it is going.
This paragraph is only slightly better. We hear about why Bard was highly regarded, but then Goldstein uses another small sample size to explain that Bard is in serious trouble as a prospect. There's only one problem: he gives not one single ounce of scouting evidence to suggest that Bard is struggling. Again, it doesn't matter if you say you are a scout: a small sample size is a small sample size, and by definition it adds absolutely nothing to the discussion unless it can be supported by extrastatistical evidence. That is what scouting is supposed to do. Goldstein and his scouting sources may know why Bard is struggling, but until that information is presented, Goldstein's paragraph amounts to little more than saying, "Bard is in trouble. Trust me, I know because I talk to scouts." How useful is that?
Scouting analysis can be done. Let's rewrite the Bard paragraph with some fictitious, though plausible, scouting analysis. The italicized part indicates my rewrite.
RHP Daniel Bard, Honolulu Sharks (Red Sox)
Friday’s Boston prospect rankings, like any prospect list, generated a lot of email. Most of it concerned guys who didn’t make it, like Brandon Moss or Craig Hansen, but nobody asked about Daniel Bard. Twelve months ago, that wouldn’t have been the case, because last year at this time, Bard was a highly regarded first-round pick who could touch 100 mph, although he had some issues when it came to command and secondary stuff. This year, the wheels fell off. Beginning the year at High-A and then spending the majority of the year at Low-A after a demotion, Bard's mechanics deteriorated. He began shortening his stride, causing a drop in his velocity. To compensate for this, he began to "muscle up" when throwing the ball, exerting greater effort with his upper body. His left shoulder was no longer positioned properly when the ball was released, causing him to lose any semblance of command or control. It doesn’t matter how hard you throw if you have no idea where it is going.
Now, I can't really write like a scout, but that is essentially what scouting information should look like. We aren't relying on any statistics at all. There is no small sample size.
Scouting analysis can work, but it must be disciplined. The moment it deteriorates into a parade of small sample sizes, it loses all of its value. I sincerely hope that Mr. Goldstein recognizes this so that we can reap the full benefit of his scouting connections.
There is hope. The one absolutely essential scouting columnist on the Internet is The Hardball Times' Carlos Gomez. Gomez provides a true scout's view of players, including some excellent takes on Joba Chamberlain versus Phil Hughes and Ian Kennedy versus Clay Buchholz. Gomez's analysis breaks down each player physically, analyzing how they do what they do and how what they do leads to results. Assuming that Mr. Gomez isn't just talking out of his ass, every writer who wants to write about scouting should aspire to his level of work. If that happens, we'll finally start reaping the benefits of beer and tacos.
**EDIT** Fixed minor spelling mistake.
Thursday, November 8, 2007
“This game is 70 percent pitching, and even more in the postseason.”
I'm pretty sure that this is not just false, but provably false. In fact, it's so false that Hank is probably off by a factor of two.
Here's how it breaks down. To win a baseball game, one must outscore one's opponent. It doesn't matter what the score is, so winning 12-11 is as good as winning 1-0. From this, we will assume that a run scored is as valuable as a run prevented.*
From this, it follows that run prevention as a whole is as important as run acquisition (doesn't that sound cool) as a whole. Run prevention is divided neatly into two parts: action before a batting event (pitching) and action after a batting event (fielding). Since run prevention and run acquisition are equal with respect to winning ballgames, if we assume that fielding has any importance at all, then pitching must be less than 50% of "the game."
So that's my reasoning. If we assume that preventing a run increases your chances of winning by roughly the same amount as scoring a run, then this logic is nigh infallible. I really hope Hank doesn't have as much influence as his Dad did.**
* Due to the fact that you cannot fall below zero runs scored, this will not be quite true, but it is a safe assumption that the difference is negligible for this exercise. For example, if you have a game that would otherwise be tied 4-4, you can win by either preventing a run or by scoring another run. The events have equal value. Perhaps I shall test this assumption in the near future.
** I will note in Hank's defense that if he means that pitching is more valuable than the other aspects of the game, then he could be correct, if good/great pitching were appreciably more scarce than hitting, base running, and fielding. This is unlikely to be the case, and it's hard to see that this is his meaning from his choice of words.