When we hear a sound, we instantly know where it’s coming from. This is because the sound interacts with our body as it travels to our ears. It bounces off, propagates through, diffracts around and is absorbed by various body parts, such as the head, torso, and the intricate structures of the outer ear. These interactions change the sound that arrives at our ears, adding important information known as binaural cues. Our brain can pick out these cues, and based on which ones are present, it can determine the direction that the sound must have come from.

Unfortunately, we can’t just measure binaural cues individually. But we can measure the entire set of cues, all mixed together, for a sound originating from a particular source point. We call these sets of cues Head-Related Transfer Functions (HRTFs). The measurement process for an HRTF is simple:

  1. Get your subject to sit very still and place little microphones in their ears. The microphones look a lot like in-ear headphones, only instead of pumping sound into your ear canal, they record the sound that would normally enter it.
  2. Place a speaker at the desired source location, pointing it towards your subject.
  3. Play some test patterns through the speaker and record what the microphones hear (a sketch of how the HRTF is recovered from this recording follows below).
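
In signal-processing terms, the recording from step 3 is simply the test pattern filtered by the subject’s HRTF, so dividing the spectrum of the recording by the spectrum of the test pattern recovers it. Here’s a minimal sketch, assuming the test signal and recording are NumPy arrays captured at the same sample rate (a real rig would average several repetitions and take more care with regularisation):

```python
import numpy as np

def estimate_hrtf(test_signal, recording, eps=1e-8):
    """Recover an HRTF by deconvolving the known test signal from
    what the in-ear microphone recorded (one ear, one source point)."""
    n = len(test_signal) + len(recording) - 1
    S = np.fft.rfft(test_signal, n)               # spectrum of what the speaker played
    R = np.fft.rfft(recording, n)                 # spectrum of what reached the ear
    H = R * np.conj(S) / (np.abs(S) ** 2 + eps)   # regularised spectral division
    hrir = np.fft.irfft(H, n)                     # time-domain impulse response (HRIR)
    return H, hrir
```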

Usually we measure HRTFs in an anechoic chamber: a room with special materials on the walls, floor and ceiling designed to trap and absorb sound, so it doesn’t bounce off and echo around the room. We pick special test patterns that allow us to see how the binaural cues affect any sound that might come from that source point. This way, we can apply the HRTF to any audio we like, so that when we play the sound back through headphones, we get the impression that it’s originating from the same source point. This is spatial audio. If we measure the HRTF at enough source points, then we can synthesise sound coming from practically any source point we like!
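
Applying the HRTF to a sound is then just convolution: filter the audio with the left-ear and right-ear impulse responses measured at the desired source point and play the result over headphones. A rough sketch, assuming you already have a mono signal and a measured HRIR pair as NumPy arrays:

```python
import numpy as np
from scipy.signal import fftconvolve

def spatialise(mono, hrir_left, hrir_right):
    """Render a mono signal so it appears to come from the direction
    the HRIR pair was measured at, for playback over headphones."""
    left = fftconvolve(mono, hrir_left)
    right = fftconvolve(mono, hrir_right)
    stereo = np.stack([left, right], axis=1)
    return stereo / np.max(np.abs(stereo))   # normalise to avoid clipping
```

In practice, renderers also interpolate between the HRIRs of neighbouring measured points so that sounds can sit anywhere in between.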

The applications for this technology are numerous. Different HRTFs can be applied to different users in a VoIP call, placing each user in a different position within a virtual soundscape. It then becomes much easier to pick out single voices when multiple people are talking at once (this utilises a phenomenon known as the Cocktail Party Effect). HRTFs can also be applied to radar blips and played into a fighter pilot’s headset. A pilot may not have time to look at their instrument panel in a tight situation, but they don’t need to if they can hear where the radar blips are coming from.

The main consumer applications for spatial audio are immersive technologies. Spatial audio enables surround sound through a pair of ordinary stereo headphones, without disturbing your neighbours. But why stop at simulating 5.1 or 7.1 surround systems when you can synthesise an infinite number of source points for a continuous soundscape! Recording audio with microphones in a dummy head can give a startlingly realistic effect (try it here). This technique is limited in that if the user rotates their head in real life, the dummy head in the recording stays put, meaning the sounds move with the user’s head. A more advanced technique, ambisonics, records a full 360 degree sound field with an array of microphones, much like a 360 degree camera. Then, software selects the components of the recording that align with the user’s head orientation during playback. Sennheiser is using ambisonics to let audio technicians test room acoustics more quickly, and to add full 360 degree audio to 360 degree video recordings. Fully synthetic spatial audio can be used to enhance immersion in both traditional video games and VR experiences.
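
To make that “select the components that align with the head” step concrete: in first-order ambisonics (B-format), a horizontal head turn only mixes the two horizontal directional channels, so following the listener’s yaw is a small rotation applied during playback. A sketch, assuming a traditional B-format convention (X forward, Y left, Z up) and a head yaw in radians; other channel orderings and sign conventions exist:

```python
import numpy as np

def rotate_bformat_yaw(w, x, y, z, yaw):
    """Counter-rotate a first-order B-format recording so the sound field
    stays fixed in the world while the listener's head turns left by `yaw`."""
    c, s = np.cos(yaw), np.sin(yaw)
    x_rot = c * x + s * y    # W (omni) and Z (vertical) are untouched
    y_rot = -s * x + c * y   # by a purely horizontal rotation
    return w, x_rot, y_rot, z
```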

But there’s a catch. Since everyone’s body is different, we’re all used to hearing different binaural cues. The way sound changes when it travels from a source point to my ears can be radically different to the way sound changes when it travels from that same source point to your ears. Essentially, everyone has a unique set of HRTFs, which means that in order to properly experience spatial audio, everyone has to go and get their HRTFs measured. But as outlined previously, this takes a lot of special equipment and requires you to sit still with tiny microphones in your ears for a fairly long time. So how can we get around this? Academic researchers have had many ideas, some of which I have summarised below.

Since binaural cues are generated by interactions with body parts, it makes sense that people with similar bodies should have similar HRTFs. Researchers have shown a correlation between various bodily dimensions and components of HRTFs. They also demonstrated that if these dimensions were added to a database of many subjects’ HRTFs, and users of spatial audio systems measured their own dimensions to find their best match in the database, they could significantly improve the accuracy and realism of spatial audio.
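
One way such a matching system could work is a weighted nearest-neighbour search over the measured dimensions. A sketch with illustrative inputs; the particular measurements and weights that matter are exactly what this research tries to establish, so treat these as placeholders:

```python
import numpy as np

def best_hrtf_match(user_dims, database_dims, weights=None):
    """Return the index of the database subject whose body measurements
    (e.g. head width, pinna height) are closest to the user's, as a proxy
    for having similar HRTFs."""
    diffs = np.asarray(database_dims) - np.asarray(user_dims)
    if weights is not None:
        diffs = diffs * np.asarray(weights)   # emphasise dimensions that matter most
    distances = np.linalg.norm(diffs, axis=1)
    return int(np.argmin(distances))
```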

Other researchers have been experimenting with computer-simulated HRTFs. Instead of using a live subject in an anechoic chamber, playing real sounds from a limited number of source points, they use computers to simulate sound waves bouncing off a computer model of a subject, allowing them to generate HRTFs for any source points they desire. The principle of acoustic reciprocity makes this much easier computationally: because swapping the source and receiver leaves the transfer function unchanged, you can place a virtual source in the model’s ear canal and read off the response at every desired source point from a single simulation, rather than running one simulation per point. The main problem with this method is the time it takes, even on modern computers, to accurately simulate sound waves bouncing off a structure as complex as the human body. Another big challenge is obtaining accurate enough models.
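
As a toy illustration of the reciprocity idea, the acoustic transfer function between two points is the same whichever one you treat as the source; the free-field case below shows the symmetry directly, and the full simulated case around a body obeys the same principle:

```python
import numpy as np

def greens_function(src, rec, k):
    """Free-field acoustic transfer function between two points
    for wavenumber k (a toy stand-in for a full simulation)."""
    r = np.linalg.norm(np.asarray(src, float) - np.asarray(rec, float))
    return np.exp(1j * k * r) / (4 * np.pi * r)

# Reciprocity: swapping source and receiver gives the same result, so one
# simulation with the "source" in the ear covers every listening direction.
ear, far_point = [0.0, 0.07, 0.0], [1.0, 2.0, 3.0]
assert np.isclose(greens_function(ear, far_point, k=5.0),
                  greens_function(far_point, ear, k=5.0))
```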

Where do we go from here? How to conveniently and accurately measure, generate or match HRTFs is still very much an open question. Could the answer lie in a combination approach? Recently, researchers used off-the-shelf DSLR cameras to capture a 3D model of a human head and torso, which was then used to simulate an approximate HRTF on a standard desktop computer in a matter of minutes. The approximation proved valid only for low frequencies, but perhaps this data could enable finer-grained selection in a system that matches users to HRTFs based on body dimensions. Perhaps we’ll see apparatus capable of measuring HRTFs much more quickly than the traditional process: audiologists placing a sphere of microphones around a subject, with impulse generators in their ears, using the principle of acoustic reciprocity to measure HRTFs for every desired sample point at once.