Before I delve too deeply into what we actually have as far as modeling player performance, I wanted to take a brief interlude to muse about what we would have, if it were possible.
Remember, the goal of developing a player model is to create a system that allows us to answer questions that we have about player value, strategic team decisions, tactical game decisions, etc. The better our system performs, the better we can answer these questions.
The system has three basic parts: the model itself, the input to the model, and the output from the model. By playing with the input to the model, we can isolate variables and test hypotheses. The results of the model will provide us with data for future decision making.
The perfect model for baseball would be physically based. We would input data about a player's height, weight, arm strength, leg strength, how their muscles attached to bone, how their tendons were holding up, etc. The model would then be able to exactly simulate this player's performance. We would input the data for Mark McGwire and Randy Johnson and the model would tell us exactly how McGwire would perform against Johnson because it would be able to simulate 1,000,000 matchups between them perfectly.
Naturally, this is totally impossible (for now). We just don't have the data that we need, nor probably the computational power to be that precise. Trying to simulate synapses firing, muscle response, hand-eye coordination... It's a daunting task.
Like trying to simulate Manny Ramirez's mindset, though, it's also unnecessary. Instead, we can start our model at the moment the player begins to exert influence on the model and stop it at the moment they cease to influence it.
Instead of modeling Randy Johnson, we could model his fastball. We could measure its velocity, where it leaves his hand, the rotation on the ball, etc. How we got to that point would be largely inconsequential. We would have data that's much easier to measure and work with. We could also grossly simplify the physical simulation. Instead of running through a massive, complex simulation of Newtonian physics, we could measure what happens to baseballs that are thrown with a particular velocity, from a particular location, with a particular spin, and see what the results of the play are. Our model would map the state of the game at the moment Randy Johnson ceases to exert influence on it to all the possible results of that play (with associated likelihoods, of course).
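To make that mapping a little more concrete, here is a minimal sketch in Python of what such a lookup might look like. Everything in it is an assumption for illustration: the bin widths, the function names, the outcome labels, and above all the sample pitches, which are made up and are not real measurements of anyone's fastball. It simply groups similar pitches together and counts how often each result followed.

```python
from collections import Counter, defaultdict


def bin_pitch(velocity_mph, release_height_ft, spin_rpm):
    """Bucket a pitch so that similar pitches share a key.

    The bin widths (2 mph, 0.5 ft, 200 rpm) are arbitrary choices
    made for this sketch, not values taken from any real system.
    """
    return (
        int(velocity_mph // 2) * 2,
        int(release_height_ft // 0.5) * 0.5,
        int(spin_rpm // 200) * 200,
    )


def build_outcome_table(observations):
    """Count the observed results for each bucket of pitch characteristics.

    Each observation is (velocity_mph, release_height_ft, spin_rpm, outcome).
    """
    table = defaultdict(Counter)
    for velocity, height, spin, outcome in observations:
        table[bin_pitch(velocity, height, spin)][outcome] += 1
    return table


def outcome_likelihoods(table, velocity_mph, release_height_ft, spin_rpm):
    """Return {outcome: likelihood} for pitches like this one.

    Returns an empty dict if no similar pitch has been observed.
    """
    tally = table.get(bin_pitch(velocity_mph, release_height_ft, spin_rpm))
    if not tally:
        return {}
    total = sum(tally.values())
    return {outcome: count / total for outcome, count in tally.items()}


# Made-up observations, purely for illustration.
sample_pitches = [
    (95.1, 6.1, 2250, "strike"),
    (94.3, 6.3, 2310, "ball"),
    (95.8, 6.2, 2390, "strike"),
    (94.9, 6.4, 2205, "in_play"),
    (88.0, 6.0, 1900, "ball"),
]

table = build_outcome_table(sample_pitches)
print(outcome_likelihoods(table, 95.0, 6.2, 2300))
```

With only a handful of pitches the likelihoods are meaningless, of course. The point is the shape of the thing: measured pitch characteristics go in, a distribution over results comes out, and nothing about the pitcher's muscles or tendons is needed along the way.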
This is where we're headed. Instead of trying to deal with blunt instruments that are prone to error, like hits and walks and home runs, we want to deal with really precise measurements, like how hard the ball leaves Albert Pujols' bat or how much break Phil Hughes' curveball has. Modeling these events frees us from ambiguities introduced by other actors in the system, like Carlos Beltran robbing Pujols of a double in the gap, or an umpire calling Hughes' wicked 12-6 curve a ball when it should be a strike. When we eliminate the factors beyond our subjects' control from the equation we get closer and closer to the perfect model.
Eventually, this data will be available; I have no doubt of this. People analyzing play-by-play data are already moving in this direction. There is too much money at stake in baseball for teams not to invest heavily in collecting the most granular data they possibly can. The teams that do it first will have an incredible leg up.
We don't have this luxury, not yet. For now, we get to deal with hits and walks and home runs. Our player model will have to have ways of dealing with all the noise that goes along with them.