Why is it apparently so hard for operating systems to have a low latency (<10 ms round trip) audio stack that isn't a massive headache?

understand that a lot of things happen when you open and play an audio file on your computer. before that, though, you should understand that audio data is different from other data on a computer. our ears are incredibly sensitive sensory pathways, able to discern very minor changes in audio. the result is that “good audio” on a computer requires a lot of data to represent. a 500 page novel that might take you weeks to finish occupies less disk space than a 3 minute high-quality song

regarding latency

10ms is one hundred thousand billion years to a computer. you are asking why that much time elapses between when you open an audio file and when you start hearing music. lots of things are happening:

if you are starting cold, you need to load the data comprising the audio file from your disk into memory. on a hard disk, a handful of milliseconds pass while the r/w head seeks to and reads from the right disk cylinders. that is a very long time. with an SSD, this is reduced to microseconds. some time is also needed for checksumming/etc
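
if you want to see this stage yourself, here is a tiny sketch that times a cold read of a file off disk. the filename is a placeholder; point it at any file you have:

```python
import time

# time how long it takes to pull a file off disk into memory.
# "song.mp3" is a made-up path; substitute any file you have.
path = "song.mp3"

start = time.perf_counter()
with open(path, "rb") as f:
    data = f.read()
elapsed_ms = (time.perf_counter() - start) * 1000

print(f"read {len(data)} bytes in {elapsed_ms:.2f} ms")
```

(note the OS will cache the file after the first read, so run it against a file you haven't touched recently to see the true cold-start cost)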

once you have been reading from disk for a bit, you hit a point where there’s enough buffered audio data to start actually playing it while ensuring that, by the time that buffer runs out, more data will be available. this idea is sort of like pipelining, but you can understand it in terms of a video stream: you are waiting for a youtube video to buffer. if you hit play immediately, you will overrun your buffer and have to wait. if you wait longer, you will have more video queued up, and more time will pass, allowing even more data to buffer in. there is a sweet spot where you wait exactly the minimum amount of time needed to ensure a continuous pipeline. the computer does this
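
here is a toy model of that sweet spot. all the rates are invented; the point is just the arithmetic: if data arrives slower than it plays back, you must pre-buffer the total shortfall before pressing play:

```python
# toy model of the buffering sweet spot. all rates here are invented.
bitrate = 320_000        # playback drain rate, bits/sec
fill_rate = 250_000      # rate data arrives from disk/network, bits/sec
duration = 180           # track length, seconds

if fill_rate >= bitrate:
    wait = 0.0           # data arrives faster than it plays: start at once
else:
    # the buffer must cover the total shortfall, (bitrate - fill_rate) *
    # duration bits, and accumulating that at fill_rate takes this long:
    wait = (bitrate - fill_rate) * duration / fill_rate

print(f"minimum wait before pressing play: {wait:.1f} s")   # ~50.4 s here
```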

also, consider the fact that a high quality audio file goes through data at a rate of (for example) 320 kilobits per second. that is a lot! the average reading speed is 200 words per minute, or a little more than 3 words per second. assuming the average word is 6 letters, you read at a “bitrate” of about 160 bits per second. the initial buffer needed to hit that sweet spot where playback can proceed seamlessly is correspondingly large
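
the comparison, spelled out as arithmetic:

```python
# the reading-speed "bitrate" from the paragraph above, spelled out
words_per_minute = 200
letters_per_word = 6
bits_per_letter = 8        # one ASCII byte per letter

reading_bitrate = words_per_minute / 60 * letters_per_word * bits_per_letter
audio_bitrate = 320_000    # a high-quality mp3, bits/sec

print(f"reading: ~{reading_bitrate:.0f} bits/sec")                 # ~160
print(f"audio is {audio_bitrate / reading_bitrate:,.0f}x faster")  # 2,000x
```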

now we get into the fun stuff

there are lots of different ways of digitally encoding audio. you have wave files, for example, that contain the uncompressed PCM values in an ordered list. you have flac files, which are functionally the same as wave files except losslessly compressed. you have mp3/ogg/whatever files, which are lossily compressed and must be decoded to get back to actual PCM values
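
you can peek at the uncompressed case with python’s standard library wave module (the filename is a placeholder):

```python
import wave

# inspect the raw PCM inside a .wav file. "example.wav" is a placeholder.
with wave.open("example.wav", "rb") as w:
    print("channels:   ", w.getnchannels())
    print("sample rate:", w.getframerate(), "samples/sec")
    print("sample size:", w.getsampwidth(), "bytes")
    print("first PCM frames:", w.readframes(4).hex())
```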

the PCM values are what you’re actually interested in: they are, in essence, a big list of gradually rising and falling numbers that describe a discrete, sampled waveform your audio hardware must interpret and attempt to reconstruct (there’s a quick sketch of what those numbers look like after the link below). i could write about 400 paragraphs about how all of this works; instead you should just watch this absolutely excellent video that explains everything about it (as well as how audio quality works w/r/t file formats):

http://xiph.org/video/vid2.shtml
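
the promised sketch of PCM from the other direction: sampling a 440 Hz sine wave at 44.1 kHz into signed 16-bit values. the parameters are the usual CD-style numbers, picked purely for illustration:

```python
import math

# sample a 440 Hz sine wave at 44.1 kHz into signed 16-bit PCM values
sample_rate = 44_100
frequency = 440.0
amplitude = 32_767   # maximum value of a signed 16-bit sample

samples = [
    round(amplitude * math.sin(2 * math.pi * frequency * n / sample_rate))
    for n in range(10)
]
print(samples)   # the first few values climb smoothly up from 0
```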

seriously, watch that video. anyway, moving on:

most people do not use .wav files, for many reasons. practically every other filetype used for encoding audio is compressed in some way (and for good reason, too). so, returning to our story, we now have that compressed audio data, read from our disk, in memory. now our CPU must load the libraries and executables needed to decompress that data into usable PCM data. afterwards, it has to actually do all the mathematical operations to glean the first few seconds of PCM data out of the compressed file. this takes another few milliseconds and fits into our previous pipelining model. if you are an audio engineer, you buy special PCIe devices to do this for you, and those milliseconds are reduced to microseconds (your bank account follows a similar pattern)
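
a cartoon of that decode stage, priming a few chunks before playback starts. decode_chunk is a made-up stand-in for a real mp3/flac decoder, and the 2 ms sleep is an invented cost:

```python
import time
from collections import deque

def decode_chunk(compressed: bytes) -> bytes:
    """stand-in for a real mp3/flac decoder."""
    time.sleep(0.002)            # pretend the math costs ~2 ms per chunk
    return compressed            # a real decoder returns raw PCM here

compressed_chunks = [bytes(4096) for _ in range(8)]   # fake file contents
pcm_queue = deque()

start = time.perf_counter()
for chunk in compressed_chunks[:3]:   # prime a few chunks before playback
    pcm_queue.append(decode_chunk(chunk))
print(f"primed {len(pcm_queue)} chunks in "
      f"{(time.perf_counter() - start) * 1000:.1f} ms")
```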

now you initialize and prime your DAC (digital to analog converter) to begin playing audio according to both the metadata found at the head of your audio file and output-hardware-specific internal stuff, like calculating the output impedance needed to drive the coils in whatever is sitting at the other end of the aux cable. the initial metadata priming/init takes nanoseconds, but the impedance matching can take a lot longer: high end headphones often have very high-resistance coils. my pair of relatively expensive beyerdynamic headphones has 600 ohms of resistance (!). impedance matching for this may take a millisecond or so as some pretty large capacitors must fill up (capacitors charge along an exponential curve that only approaches full charge asymptotically, which is not what you want when you’re trying to save time)
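
for a feel of the numbers, here is the charge-time arithmetic for an RC stage. the component values are invented for illustration (the 600 ohms echoes the headphone figure above; the capacitance is a guess):

```python
import math

# time for an RC stage to reach 99% of its final voltage.
# V(t) = V_final * (1 - e^(-t / RC))  =>  t = -R*C*ln(1 - p)
R = 600          # ohms, echoing the headphone impedance above
C = 0.47e-6      # farads; an invented coupling-capacitor value

p = 0.99         # "charged enough": 99% of final voltage
t = -R * C * math.log(1 - p)
print(f"time to 99% charge: {t * 1000:.2f} ms")   # ~1.30 ms here
```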

now you must use a PIO (programmed i/o) or DMA (direct memory access) mechanism to pass the PCM values to your DAC. again, dedicated audio hardware will do this for you if you buy it. otherwise your CPU/motherboard will be using on-board hardware for this, which is usually pretty slow. PIO especially is terrible for this. hundreds of microseconds pass while that “sweet amount” of buffered data moves from memory into your DAC’s internal scan-out buffers
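
the spirit of that hand-off is double buffering: while the DAC drains one buffer, the CPU refills the other, then they swap. a simulated sketch, with no real hardware involved:

```python
BUFFER_FRAMES = 512

def fill(buf, source):
    """CPU side: copy the next PCM samples into a buffer."""
    for i in range(len(buf)):
        buf[i] = next(source)

def drain(buf):
    """DAC side: a real device scans these values out at the sample rate."""
    pass

source = iter(range(10_000))      # stand-in PCM sample stream
front = [0] * BUFFER_FRAMES       # buffer the "DAC" is playing
back = [0] * BUFFER_FRAMES        # buffer the CPU is refilling

fill(front, source)
for _ in range(4):                # a few swap cycles
    fill(back, source)            # CPU refills while the DAC drains
    drain(front)
    front, back = back, front     # the freshly filled buffer goes live
```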

FINALLY your DAC begins parsing the discrete data points, understanding them as values relative to a voltage scale between +1.6V and -1.6V. it takes this list of points and the sample rate set during the init stage a few paragraphs back and, with a realtime processor, draws the analog waveform. this is sort of simple to understand: the sample rate determines that X samples must be played per second. the realtime chip in your DAC has a reliable idea of what a second is and can facilitate this. it determines that a fixed number of microseconds must pass between samples. at t = 0, the DAC must hold the output voltage exactly equal to whatever the first PCM sample translates to. at t = <however many microseconds>, the output voltage must be whatever the next sample translates to. between these two events, your DAC must smoothly raise or lower the output voltage to turn a list of discrete PCM data points into a continuous waveform. how it chooses to do this is the question engineers working at arcam and audioquest and other DAC manufacturers are trying to answer properly (hint: it’s not just a linear rise/fall from one point to the other). it’s very difficult. DSP is extremely difficult
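
to see why it’s not trivial, here is the naive approach that parenthetical rules out: plain linear interpolation between two adjacent samples. real DACs use proper reconstruction filters instead (the xiph video above covers this):

```python
sample_rate = 44_100
dt = 1 / sample_rate     # seconds between samples (~22.7 microseconds)

def linear_interp(v0, v1, t):
    """naive output voltage at time t (0..dt) between samples v0 and v1."""
    return v0 + (v1 - v0) * (t / dt)

# halfway between a sample at 0.25 V and the next at 0.40 V:
print(f"{linear_interp(0.25, 0.40, dt / 2):.3f} V")   # 0.325 V
```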

at this point, even in the best case scenario (lots of expensive onboard audio equipment present + other modern stuff like SSDs/multicore CPUs) at least 8 or 9 milliseconds have passed. but are we done yet? we are not!

a general rule of thumb is that for very long conductors (anything long enough to see with your naked eye), “voltage” “travels” at about half the speed of light. why i used the word voltage there, and why i put quotation marks around those words, is a question you don’t want to get into. since light travels about 1 ft per nanosecond, and high-end headphones usually have tightly coiled cables that can stretch out to (let’s say) about 40 feet, another 80 nanoseconds pass while the voltage propagates. maybe another 200 ns pass for that voltage to induce a magnetic field strong enough to move the diaphragm in your speaker/headphone enough to make an audible difference.
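
the cable arithmetic, for the record:

```python
# propagation delay down the headphone cable, per the rule of thumb above
speed_of_light = 1.0        # roughly 1 ft per nanosecond
velocity_factor = 0.5       # signal travels at about half that in a cable
cable_length = 40           # feet, a coiled cord stretched out

delay_ns = cable_length / (speed_of_light * velocity_factor)
print(f"propagation delay: {delay_ns:.0f} ns")   # 80 ns
```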

are we done yet? almost. there is one last, hilarious factor:

we have been speaking in terms of very small timeframes, appropriate for digital and analog computer hardware. we are now no longer in that realm. your music is on the final leg of its journey: it must travel from the surface of the speaker diaphragm to your eardrum, and from your eardrum to your brain

this takes a significant amount of time. how much, i don’t know, but i would guess on the scale of hundreds of microseconds to milliseconds. from your eardrum to your brain, perhaps it would be the former. i don’t know
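
for the air leg, at least, you can put rough numbers on it. the distances here are assumptions for the sake of arithmetic, not measurements:

```python
speed_of_sound = 343        # m/s in air at room temperature

headphone_gap = 0.02        # assumed ~2 cm from driver to eardrum
speaker_distance = 1.0      # assumed ~1 m from a desktop speaker

print(f"headphone: {headphone_gap / speed_of_sound * 1e6:.0f} us")    # ~58 us
print(f"speaker:   {speaker_distance / speed_of_sound * 1e3:.1f} ms") # ~2.9 ms
```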

so, to conclude, a whole lot needs to happen for your computer to play audio. you should understand everything i’ve described in this post as a necessary step your machine must take, though not necessarily in sequential order (a lot of these things may happen simultaneously). the more pragmatic answer to your question can be derived from all this; i’ve tried to work it into the post:

nobody tries to minimize their system’s audio latency as an exclusive and absolute endeavour. it is always in the pursuit of something else, usually a system capable of very high quality audio playback (lots of factors there). to achieve that, you introduce a lot of new factors that drive your latency up that would not otherwise be present. this might be why there is a seeming disparity between what should be achievable and what actually exists

that is my thought, at least