One of the great things about being me is that every once in a while I get to say "John Carmack is wrong."
[I actually said something more colorful at work.]
In this year's GDC programming keynote, Carmack mentioned that (game) audio could be solved today given total dedication of processing power, and that in two years it would be solved altogether. While technically true (we can simulate a lot with DSP), that statement ignores the growing trend of data as the limiting factor in games. And while graphics may be the sexiest thing in gaming right now, audio is, and will continue to be, the biggest part. At least in good games.
Here's a scary little fact about game audio. Let's assume that a game will play up to 256 mono sounds simultaneously. Given the current standard of games, let's also assume the sample rate is 44.1kHz, slightly above the Nyquist rate for human hearing, and the sample size is 16 bits (2 bytes).
One second of audio using 256 unique sounds will require (channel count) x (sample rate) x (sample size) = 256*44100*2 bytes = 22,579,200 bytes, roughly 22.5 megabytes per second.
Of course, the worst case doesn't really apply to the general case, so let's say an average gameplay second has 40 unique sounds playing at once. Now the bandwidth drops from 22.5 MB/s to 3.5 MB/s. Current hardware-supported compression schemes appropriate for games can drop that to about 1 MB/s.
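If you want to poke at the arithmetic yourself, here's a quick back-of-the-envelope sketch; the voice counts and format come straight from the numbers above, and I'm using decimal megabytes to match them:

```cpp
#include <cstdio>

// Back-of-the-envelope audio bandwidth. "Voices" here means
// simultaneously playing mono sounds, not speaker channels.
int main() {
    const double kSampleRate  = 44100.0; // Hz
    const double kSampleBytes = 2.0;     // 16-bit PCM
    const double kMegabyte    = 1.0e6;   // decimal MB, to match the text

    const int kWorstCaseVoices = 256;
    const int kTypicalVoices   = 40;

    printf("worst case: %.1f MB/s\n",
           kWorstCaseVoices * kSampleRate * kSampleBytes / kMegabyte); // ~22.6
    printf("typical:    %.1f MB/s\n",
           kTypicalVoices * kSampleRate * kSampleBytes / kMegabyte);   // ~3.5
    return 0;
}
```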
Seems pretty reasonable. Unfortunately, if we want to play more than 40 seconds' worth of sound, we're going to need a bit more data. A lot more data, in fact. Halo had a total of 2.5 GB of audio data, uncompressed, a mixture of 22kHz and 44kHz samples. Compressed, that's about 700 MB total, which works out to about 150-200 MB per level.
200 MB doesn't sound so bad. But since no one really budgets that well for sound memory, we'll need to be able to stream that data in from disk. And unless we can perfectly predict what sounds will play when (which we can't, given a dynamically satisfying environment), we'll need either a rather large sound memory cache or a really good random-access disk transfer rate. Unfortunately, most storage devices optimize for sequential access, and no one ever wants to give memory to sound, so that kind of makes us SOL. I'm overstating the problem a bit, but the common way of solving it (reducing the amount of sound data) isn't that satisfying.
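To get a feel for why random access hurts, here's a hypothetical back-of-the-envelope calculation. Every hardware number in it (seek time, transfer rate, chunk size) is a made-up stand-in, not a measurement of any real drive:

```cpp
#include <cstdio>

// How many independent mono streams can one disk sustain if every
// chunk fetch pays a seek? All hardware numbers below are invented
// for illustration.
int main() {
    const double kSeekTime      = 0.010;             // 10 ms average seek
    const double kTransferRate  = 8.0 * 1024 * 1024; // 8 MB/s sequential read
    const double kChunkBytes    = 64.0 * 1024;       // bytes fetched per seek
    const double kBytesPerVoice = 44100.0 * 2.0;     // 16-bit mono @ 44.1kHz

    // Each chunk costs one seek plus one sequential read.
    double secondsPerChunk = kSeekTime + kChunkBytes / kTransferRate;
    double chunksPerSecond = 1.0 / secondsPerChunk;

    // Each streaming voice consumes this many chunks per second.
    double chunksPerVoice = kBytesPerVoice / kChunkBytes;

    printf("max streaming voices: %.0f\n", chunksPerSecond / chunksPerVoice);
    return 0;
}
```

With those invented numbers the drive tops out around 40 uncompressed mono streams, and that assumes it does nothing else: no level data, no textures, no animation.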
Even if the content/data problem is solved, a convincing sound environment has computational costs an order of magnitude greater than those of graphics. To build one, we need reverberation, or as I put it, the echo problem.
Reverberation describes the reaction of an acoustic environment to sound. A good example is how different it sounds when you sing in the bathroom versus when you sing in the car. A more complete explanation of sound propagation can be found here.
To determine the reverberation effects of a single sound, we need to find all the paths between the sound source and the listener: the direct path plus every audible reflected path. If you really want to be accurate, you'll need to do this separately for different frequency bands of the sound. (Lower frequencies can travel farther through reflections than higher frequencies.) The equivalent problem in graphics is global illumination, which current games pre-calculate for static lighting in some fashion due to the sheer complexity of the problem. Neither can be computed in real time, even with the entirety of available processing power. Hardly something that is "basically solved."
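For the curious, here's a minimal sketch of the textbook way to enumerate reflection paths, the image-source method. It assumes an empty rectangular room and first-order reflections only; real level geometry, occlusion, and per-band absorption are exactly what make the full problem explode:

```cpp
#include <cmath>
#include <cstdio>

struct Vec3 { double x, y, z; };

double Distance(const Vec3& a, const Vec3& b) {
    double dx = a.x - b.x, dy = a.y - b.y, dz = a.z - b.z;
    return std::sqrt(dx * dx + dy * dy + dz * dz);
}

int main() {
    const double kSpeedOfSound = 343.0;      // m/s, at room temperature
    const Vec3 room     = { 5.0, 4.0, 3.0 }; // walls at 0 and these extents (m)
    const Vec3 source   = { 1.0, 1.0, 1.5 };
    const Vec3 listener = { 4.0, 3.0, 1.5 };

    // Mirror the source across each of the six walls; each image source
    // yields one first-order reflected path.
    const Vec3 images[6] = {
        { -source.x,             source.y,              source.z },
        { 2 * room.x - source.x, source.y,              source.z },
        { source.x,             -source.y,              source.z },
        { source.x,              2 * room.y - source.y, source.z },
        { source.x,              source.y,             -source.z },
        { source.x,              source.y,              2 * room.z - source.z },
    };

    double direct = Distance(source, listener);
    printf("direct: %.2f m (%.1f ms)\n",
           direct, 1000.0 * direct / kSpeedOfSound);
    for (int i = 0; i < 6; ++i) {
        double d = Distance(images[i], listener);
        printf("wall %d: %.2f m (%.1f ms)\n",
               i, d, 1000.0 * d / kSpeedOfSound);
    }
    // Second-order reflections mirror the images again; the path count
    // grows exponentially with reflection order, per frequency band.
    return 0;
}
```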
[That's not to say that we don't have a good enough approximation. The current generation of sound hardware conforms to the I3DL2 spec, so we can have at least some modicum of auditory goodness without drastically impacting performance.]
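For reference, the I3DL2 listener model reduces an entire acoustic environment to roughly this parameter set. Field names are paraphrased from the spec, the comments are my gloss, and the sample values are made up, not preset data:

```cpp
// Sketch of the I3DL2 listener properties (names paraphrased).
struct I3DL2Listener {
    int   room;              // overall room effect level, millibels
    int   roomHF;            // room effect attenuation at high frequencies (mB)
    float roomRolloffFactor; // room effect rolloff with distance
    float decayTime;         // late reverb decay time, seconds
    float decayHFRatio;      // HF decay time relative to LF decay time
    int   reflections;       // early reflections level (mB)
    float reflectionsDelay;  // delay before early reflections, seconds
    int   reverb;            // late reverb level (mB)
    float reverbDelay;       // late reverb delay after first reflections, s
    float diffusion;         // echo density of the late reverb, percent
    float density;           // modal density of the late reverb, percent
    float HFReference;       // the frequency "HF" refers to, Hz
};

// Invented values in the spirit of a small tiled room.
const I3DL2Listener kBathroomish = {
    -1000, -1200, 0.0f, 1.5f, 0.5f,
    -400, 0.01f, 500, 0.011f,
    100.0f, 60.0f, 5000.0f
};
```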
This is why I like being the sound programmer :)