Do you hear what I hear? The science of everyday sounds.

I became a professor last year, which is quite a big deal here. On April 17th, I gave my Inaugural lecture, which is a talk on my subject area to the general public. I tried to make it as interesting as possible, with sound effects, videos, a live experiment and even a bit of physical comedy. Here’s the video, and below I have a (sort of) transcript.

The Start

 

What did you just hear? What’s the weather like outside? Did that sound like a powerful, wet storm with rain, wind and thunder, or did it sound fake, as if something was not quite right? All you had were simple, nearly identical signals from each speaker, and you received only two simple, nearly identical signals, one to each ear. Yet somehow you were able to interpret all the rich details, know what it was and assess the quality.

Over the next hour or so, we’ll investigate the research that links deep understanding of sound and sound perception to wonderful new audio technologies. We’ll look at how market needs in the commercial world are addressed by basic scientific advances. We will explore fundamental challenges about how we interact with the auditory world around us, and see how this leads to new creative artworks and disruptive innovations.

Sound effect synthesis

But first, let’s get back to the storm sounds you heard. It’s an example of a sound effect, like what might be used in a film. Very few of the sounds that you hear in film or TV, and more and more frequently in music too, are recorded live on set or on stage.

Such sounds are sometimes created by what is known as Foley, named after Jack Foley, a sound designer working in film and radio from the late 1920s all the way to the early 1960s. In its simplest form, Foley is basically banging pots and pans together and sticking a microphone next to it. It also involves building mechanical contraptions to create all sorts of sounds. Foley sound designers are true artists, but it’s not easy; it’s expensive and time-consuming. And the Foley studio today looks almost exactly the same as it did 60 years ago. The biggest difference is that the photos of the Foley studios are now in colour.

[Images: Foley in the past; Foley today]

But most sound effects come from sample libraries. These consist of tens or hundreds of thousands of high quality recordings. But they are still someone else’s vision of the sounds you might need. They’re never quite right. So sound designers either ‘make do’ with what’s there, or expend effort trying to shape them towards some desired sound. The designer doesn’t have the opportunity to do creative sound design. Reliance on pre-recorded sounds has dictated the workflow. The industry hasn’t evolved, we’re simply adapting old ways to new problems.

In contrast, digital video effects have reached a stunning level of realism, and they don’t rely on hundreds of thousands of stock photos, like the sound designers do with sample libraries. And animation is frequently created by specifying the scene and action to some rendering engine, without designers having to manipulate every little detail.

There might be opportunities for better and more creative sound design. Instead of a sound effect as a chunk of bits played out in sequence, conceptualise the sound generating mechanism, a procedure or recipe that when implemented, produces the desired sound. One can change the procedure slightly, shaping the sound. This is the idea behind sound synthesis. No samples need be stored. Instead, realistic and desired sounds can be generated from algorithms.

This has a lot of advantages. Synthesis can produce a whole range of sounds, like walking and running at any speed on any surface, whereas a sound effect library has only a finite number of predetermined samples. Synthesised sounds can play for any amount of time, but samples have a fixed duration. Synthesis can have intuitive controls, like the enthusiasm of an applauding audience. And synthesis can create unreal or imaginary sounds that never existed in nature, a roaring dragon for instance, or Jedi knights fighting with light sabres.

Give this to sound designers, and they can take control, shape sounds to what they want. Working with samples is like buying microwave meals, cheap and easy, but they taste awful and there’s no satisfaction. Synthesis on the other hand, is like a home-cooked meal, you choose the ingredients and cook it the way you wish. Maybe you aren’t a fine chef, but there’s definitely satisfaction in knowing you made it.

This represents a disruptive innovation, changing the marketplace and changing how we do things. And it matters; not just to professional sound designers, but to amateurs and to the consumers, when they’re watching a film and especially, since we’re talking about sound, when they are listening to music, which we’ll come to later in the talk.

That’s the industry need, but there is some deep research required to address it. How do you synthesise sounds? They’re complex, with lots of nuances that we don’t fully understand. A few are easy, like these-

I just played that last one to get rid of the troublemakers in the audience.

But many of those are artificial or simple mechanical sounds. And the rest?

Almost no research is done in isolation, and there’s a community of researchers devising sound synthesis methods. Many approaches are intended for electronic music, going back to the work of Daphne Oram and Delia Derbyshire at the BBC Radiophonics Workshop, or the French Musique Concrete movement. But they don’t need a high level of realism. Speech synthesis is very advanced, but tailored for speech of course, and doesn’t apply to things like the sound of a slamming door. Other methods concentrate on simulating a particular sound with incredible accuracy. They construct a physical model of the whole system that creates the sound, and the sound is an almost incidental output of simulating the system. But this is very computational and inflexible.

And this is where we are today. The researchers are doing fantastic work on new methods to create sounds, but it’s not addressing the needs of sound designers.

Well, that’s not entirely true.

The games community has been interested in procedural audio for quite some time. Procedural audio embodies the idea of sound as a procedure, and involves looking at lightweight interactive sound synthesis models for use in a game. Start with some basic ingredients; noise, pulses, simple tones. Stir them together with the right amount of each, bake them with filters that bring out various pitches, add some spice and you start to get something that sounds like wind, or an engine or a hand clap. That’s the procedural audio approach.
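To make the recipe a little more concrete, here is a minimal sketch in Python. It is not one of the models from the talk. It is an illustrative assumption of what the simplest wind-like recipe might look like: white noise as the ingredient, a two-pole resonant filter to bring out a band of pitch, and a slow "gust" control that sweeps the filter and the level. The function name, the parameter names and every numeric value are hypothetical.

```python
import math
import random


def synth_wind(duration_s=2.0, sample_rate=16000, gust_rate_hz=0.3, seed=0):
    """Return a list of samples in [-1, 1] that loosely resemble wind.

    Illustrative only: all constants below are assumptions, not tuned values.
    """
    rng = random.Random(seed)
    n_samples = int(duration_s * sample_rate)
    r = 0.98  # pole radius: closer to 1 gives a narrower, more whistling band
    y1 = y2 = 0.0  # previous two filter outputs
    out = []
    for n in range(n_samples):
        t = n / sample_rate
        # Slow "gust" control in [0, 1] that drives both pitch and loudness.
        gust = 0.5 * (1.0 + math.sin(2.0 * math.pi * gust_rate_hz * t))
        centre_hz = 300.0 + 500.0 * gust
        w = 2.0 * math.pi * centre_hz / sample_rate
        x = rng.uniform(-1.0, 1.0)  # basic ingredient: white noise
        # Two-pole resonator that emphasises frequencies near centre_hz.
        y = x + 2.0 * r * math.cos(w) * y1 - r * r * y2
        y2, y1 = y1, y
        out.append((0.3 + 0.7 * gust) * y)
    # Normalise so the loudest sample has magnitude 1.
    peak = max(abs(s) for s in out) or 1.0
    return [s / peak for s in out]
```

Calling `synth_wind(duration_s=5.0, gust_rate_hz=0.1)` gives a longer, slower-breathing sound, which is the point of the approach: the control changes the procedure rather than selecting a different recording.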

A few tools have seen commercial use, but they’re specialised and integration of new technology in a game engine is extremely difficult. Such niche tools will supplement but not replace the sample libraries.

A few years ago, my research team demonstrated a sound synthesis model for engine and motor sounds. We showed that this simple software tool could be used by a sound designer to create a diverse range of sounds, and it could match those in the BBC sound effect library, everything from a handheld electric drill to a large boat motor.

 

This is the key. Designed right, one synthesis model can create a huge, diverse range of sounds. And this approach can be extended to simulate an entire effects library using only a small number of versatile models.

That’s what you’ve been hearing. Every sound sample you’ve heard in this talk was synthesised. Artificial sounds created and shaped in real-time. And they can be controlled and rendered in the same way that computer animation is performed. Watch this example, where the synthesised propeller sounds are driven by the scene in just the same way as the animation was.

It still needs work of course. You could hear lots of little mistakes, and the models missed details. And what we’ve achieved so far doesn’t scale. We can create hundreds of sounds that one might want, but not yet thousands or tens of thousands.

But we know the way forward. We have a precious resource, the sound effect libraries themselves. Vast quantities of high quality recordings, tried and tested over decades. We can feed these into machine learning systems to uncover the features associated with every type of sound effect, and then train our models to find settings that match recorded samples.

We can go further, and use this approach to learn about sound itself. What makes a rain storm sound different from a shower? Is there something in common with all sounds that startle us, or all sounds that calm us? The same approach that hands creativity back to sound designers, resulting in wonderful new sonic experiences, can also tell us so much about sound perception.

Hot versus cold

I pause, say “I’m thirsty”. I have an empty jug and pretend to pour

Pretend to throw it at the audience.

Just kidding. That’s another synthesised sound. It’s a good example of this hidden richness in sounds. You knew it was pouring because the gesture helped, and there is an interesting interplay between our visual and auditory senses. You also heard bubbles, splashes, the ring of the container that it’s poured into. But do you hear more?

I’m going to run a little experiment. I have two sound samples, hot water being poured and cold water being poured. I want you to guess which is which.

Listen and try it yourself at our previous blog entry on the sound of hot and cold water.

I think it’s fascinating that we can hear temperature. There must be some physical phenomenon affecting the sound, which we’ve learned to associate with heat. But what’s really interesting is what I found when I looked online. Lots of people have discussed this. One argument goes ‘Cold water is more viscous or sticky, and so it gives high pitched sticky splashes.’ That makes sense. But another argument states ‘There are more bubbles in a hot liquid, and they produce high frequency sounds.’

Wait, they can’t both be right. So we analysed recordings of hot and cold water being poured, and it turns out they’re both wrong! The same tones are there in both recordings, so essentially the same pitch. But the strengths of the tones are subtly different. Some sonic aspect is always present, but its loudness is a function of temperature. We’re currently doing analysis to find out why.

And no one noticed! In all the discussion, no one bothered to do a little critical analysis or an experiment. It’s an example of a faulty assumption, that because you can come up with a solution that makes sense, it should be the right one. And it demonstrates the scientific method; nothing is known until it is tested and confirmed, repeatedly.

Intelligent Music Production

It’s amazing what such subtle changes can do, how they can indicate elements that one never associates with hearing. Audio production thrives on such subtle changes and there is a rich tradition of manipulating them to great effect. Music is created not just by the composer and performers. The sound engineer mixes and edits it towards some artistic vision. But phrasing the work of a mixing engineer as an art form is a double-edged sword: we aren’t doing justice to the technical challenges. The sound engineer is, after all, an engineer.

In audio production, whether for broadcast, live sound, games, film or music, one typically has many sources. They each need to be heard simultaneously, but can all be created in different ways, in different environments and with different attributes. Some may mask each other, some may be too loud or too quiet. The final mix should have all sources sound distinct yet contribute to a nice clean blend of the sounds. To achieve this is very labour intensive and requires a professional engineer. Modern audio production systems help, but they’re incredibly complex and all require manual manipulation. As technology has grown, it has become more functional but not simpler for the user.

In contrast, image and video processing has become automated. The modern digital camera comes with a wide range of intelligent features to assist the user; face, scene and motion detection, autofocus and red eye removal. Yet an audio recording or editing device has none of this. It is essentially deaf; it doesn’t listen to the incoming audio and has no knowledge of the sound scene or of its intended use. There is no autofocus for audio!

Instead, the user is forced to accept poor sound quality or do a significant amount of manual editing.

But perhaps intelligent systems could analyse all the incoming signals and determine how they should be modified and combined. This has the potential to revolutionise music production, in effect putting a robot sound engineer inside every recording device, mixing console or audio workstation. Could this be achieved? This question gets to the heart of what is art and what is science, what is the role of the music producer and why we prefer one mix over another.

But unlike replacing sound effect libraries, this is not a big data problem. Ideally, we would get lots of raw recordings and the produced content that results. Then extract features from each track and the final mix in order to establish rules for how audio should be mixed. But we don’t have the data. It’s not difficult to access produced content. But the initial multitrack recordings are some of the most highly guarded copyright material. This is the content that recording companies can use over and over, to create remixes and remastered versions. Even if we had the data, we don’t know the features to use and we don’t know how to manipulate those features to create a good mix. And mixing is a skilled craft. Machine learning systems are still flawed if they don’t use expert knowledge.

There’s a myth that as long as we get enough data, we can solve almost any problem. But lots of problems can’t be tackled this way. I thought weather prediction was done by taking all today’s measurements of temperature, humidity, wind speed, pressure… Then tomorrow’s weather could be guessed by seeing what happened the day after there were similar conditions in the past. But a meteorologist told me that’s not how it works. Even with all the data we have, it’s not enough. So instead we have a weather model, based on how clouds interact, how pressure fronts collide, why hurricanes form, and so on. We’re always running this physical model, and just tweaking parameters and refining the model as new data comes in. This is far more accurate than relying on mining big data.

You might think this would involve traditional signal processing, established techniques to remove noise or interference in recordings. It’s true that some of what the sound engineer does is to correct artifacts due to issues in the recording process. And there are techniques like echo cancellation, source separation and noise reduction that can address this. But this is only a niche part of what the sound engineer does, and even then the techniques have rarely been optimised for real world applications.

There’s also multichannel signal processing, where one usually attempts to extract information regarding signals that were mixed together, like acquiring a GPS signal buried in noise. But in our case, we’re concerned with how to mix the sources together in the first place. This opens up a new field which involves creating ways to manipulate signals to achieve a desired output. We need to identify multitrack audio features, related to the relationships between musical signals, and develop audio effects where the processing on any sound is dependent on the other sounds in the mix.
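As a rough illustration of processing that depends on the other sounds in the mix, here is a minimal Python sketch of a cross-adaptive fader. It is an assumption for illustration, not the method used in the research described here: it measures the RMS level of every track and chooses each gain so that all tracks end up at the same level, namely the geometric mean of the measured levels. The function names are hypothetical, and real systems would use perceptual loudness rather than raw RMS.

```python
import math


def rms(track):
    """Root-mean-square level of one track (a non-empty list of samples)."""
    return math.sqrt(sum(s * s for s in track) / len(track))


def cross_adaptive_gains(tracks, floor=1e-9):
    """Return one gain per track so that all tracks reach a common RMS level.

    Each gain depends on every track, because the target level is the
    geometric mean (the average in decibels) of all measured levels.
    The floor avoids dividing by zero for a silent track.
    """
    levels = [max(rms(t), floor) for t in tracks]
    target = math.exp(sum(math.log(level) for level in levels) / len(levels))
    return [target / level for level in levels]


def apply_gain(track, gain):
    """Scale every sample of a track by the given gain."""
    return [gain * s for s in track]
```

For example, given a loud track and a quiet one, `cross_adaptive_gains` turns the loud one down and the quiet one up; adding a third track changes the gains of the first two, which is what distinguishes this from a conventional effect applied to one signal at a time.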

And there is little understanding of how we perceive audio mixes. Almost all studies have been restricted to lab conditions; like measuring the perceived level of a tone in the presence of background noise. This tells us very little about real world cases. It doesn’t say how well one can hear lead vocals when there are guitar, bass and drums.

Finally, best practices are not understood. We don’t know what makes a good mix and why one production will sound dull while another makes you laugh and cry, even though both are on the same piece of music, performed by competent sound engineers. So we need to establish what is good production, how to translate it into rules and exploit it within algorithms. We need to step back and explore more fundamental questions, filling gaps in our understanding of production and perception. We don’t know where the rules will be found, so multiple approaches need to be taken.

The first approach is one of the earliest machine learning methods, knowledge engineering. It’s so old school that it’s gone out of fashion. It assumes experts have already figured things out; they are experts after all. So let’s look at the sound engineering literature and work with experts to formalise their approach. Capture best practices as a set of rules and processes. But this is no easy task. Most sound engineers don’t know what they did. Ask a famous producer what he or she did on a hit song and you often get an answer like ‘I turned the knob up to 11 to make it sound phat.’ How do you turn that into a mathematical equation? Or worse, they say it was magic and can’t be put into words.

To give you an idea, we had a technique to prevent acoustic feedback, that high pitched squeal you sometimes hear when a singer first approaches a microphone. We thought we had captured techniques that sound engineers often use, and turned it into an algorithm. To verify this, I was talking to an experienced live sound engineer and asked when was the last time he had feedback at one of the gigs where he ran the sound. ‘Oh, that never happens for me,’ he said. That seemed strange. I knew it was a common problem. ‘Really, never ever?’ ‘No, I know what I’m doing. It doesn’t happen.’ ‘Not even once?’ ‘Hmm, maybe once but it’s extremely rare.’ ‘Tell me about it.’ ‘Well, it was at the show I did last night…’! See, it’s a tricky situation. The sound engineer does have invaluable knowledge, but also has to protect their reputation as being one of a select few that know the secrets of the trade.

So we’re working with domain experts, generating hypotheses and formulating theories. We’ve been systematically testing all the assumptions about best practices and supplementing them with lots of listening tests. These studies help us understand how people perceive complex sound mixtures and identify attributes necessary for a good sounding mix. And we know the data will help. So we’re also curating multitrack audio, with detailed information about how it was recorded, often with multiple mixes and evaluations of those mixes.

By combining these approaches, my team have developed intelligent systems that automate much of the audio and music production process. Prototypes analyse all incoming sounds and manipulate them in much the same way a professional operates the controls at a mixing desk.

I didn’t realise at first the importance of this research. But I remember giving a talk once at a convention in a room that had panel windows all around. The academic talks are usually half full. But this time it was packed, and I could see faces outside all pressed up against the windows. They all wanted to find out about this idea of automatic mixing. It’s a unique opportunity for academic research to have transformational impact on an entire industry. It addresses the fact that music production technologies are often not fit for purpose. Intelligent mixing systems automate the technical and mundane, allowing sound engineers to work more productively and creatively, opening up new opportunities. Audio quality could be improved, amateur musicians can create high quality mixes of their content, small venues can put on live events without needing a professional engineer, time and preparation for soundchecks could be drastically reduced, and large venues and broadcasters could significantly cut manpower costs.

It’s controversial. We once entered an automatic mix in a student recording competition as a sort of Turing Test. Technically, we were cheating, because all the mixes were supposed to be made by students, but in our case it was made by an ‘artificial intelligence’ created by a student. We didn’t win of course, but afterwards I asked the judges what they thought of the mix, and then told them how it was done. The first two were surprised and curious. The third judge had offered useful comments when he thought it was a student mix, but when I told him that it was an ‘automatic mix’, he suddenly switched and said it was rubbish and he could tell all along.

Mixing is a creative process where stylistic decisions are made. Is this taking away creativity, is it taking away jobs? Will it result in music sounding more the same? Such questions come up time and time again with new technologies, going back to 19th century protests by the Luddites, textile workers who feared that time spent on their skills and craft would be wasted as machines could replace their role in industry.

These are valid concerns, but it’s important to see other perspectives. A tremendous amount of audio production work is technical, and audio quality would be improved by addressing these problems. As the graffiti artist Banksy said:

“All artists are willing to suffer for their work. But why are so few prepared to learn to draw?” – Banksy

[Image: Girl with a Balloon, by Banksy]

Creativity still requires technical skills. To achieve something wonderful when mixing music, you first have to achieve something pretty good and address issues with masking, microphone placement, level balancing and so on.

The real benefit is not replacing sound engineers. It’s dealing with all those situations when a talented engineer is not available; the band practicing in the garage, the small pub or restaurant venue that does not provide any support, or game audio, where dozens of incoming sounds need to be mixed and there is no miniature sound guy living inside the games console.

High resolution audio

The history of audio production is one of continual innovation. New technologies arise to make the work easier, but artists also figure out how to use that technology in new creative ways. And the artistry is not the only element music producers care about. They’re interested, some would say obsessed, with fidelity. They want the music consumed at home to be as close as possible to the experience of hearing it live. But we consume digital audio. Sound waves are transformed into bits and then transformed back to sound when we listen. We sample sound many times a second and render each sample with so many bits. Luckily, there is a very established theory on how to do the sampling.

We only hear frequencies up to about 20 kHz. That’s a wave which repeats 20,000 times a second. There’s a famous theorem by Claude Shannon and Harry Nyquist which states that you need at least twice that number of samples a second to fully represent a signal up to 20 kHz, so sample at 40,000 samples a second or more, that is, at least 40 kHz. So the standard music format, 16 bit samples and 44.1 kHz sampling rate, should be good enough.
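The flip side of the theorem is aliasing: a tone above half the sampling rate produces exactly the same samples as some lower tone, so the two cannot be told apart afterwards. The Python sketch below illustrates this with a pure cosine. The function names are hypothetical and chosen only for this example.

```python
import math


def sample_tone(freq_hz, sample_rate_hz, n_samples):
    """Sample a unit-amplitude cosine of the given frequency."""
    return [math.cos(2.0 * math.pi * freq_hz * n / sample_rate_hz)
            for n in range(n_samples)]


def alias_frequency(freq_hz, sample_rate_hz):
    """Frequency in [0, sample_rate/2] that yields the same cosine samples."""
    folded = freq_hz % sample_rate_hz
    return min(folded, sample_rate_hz - folded)


# A 20 kHz tone is below half of 44.1 kHz, so it is represented as itself.
in_band = alias_frequency(20000, 44100)

# A 30 kHz tone is above half of 44.1 kHz and folds down into the audible band.
folded_down = alias_frequency(30000, 44100)

# The samples of the 30 kHz tone and of its alias are the same numbers.
high = sample_tone(30000, 44100, 64)
low = sample_tone(folded_down, 44100, 64)
largest_gap = max(abs(a - b) for a, b in zip(high, low))
```

Here `largest_gap` is zero up to floating point rounding, which is why content above half the sampling rate has to be filtered out before conversion rather than afterwards.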


But most music producers want to work with higher quality formats and audio companies make equipment for recording and playing back audio in these high resolution formats. Some people swear they hear a difference, others say it’s a myth and people are fooling themselves. What’s going on? Is the sampling theorem, which underpins all signal processing, fundamentally wrong? Have we underestimated the ability of our own ears and in which case the whole field of audiology is flawed? Or could it be that the music producers and audiophiles, many of whom are renowned for their knowledge and artistry, are deluded?

Around the time I was wondering about this, I went to a dinner party and was sat across from a PhD student. His PhD was in meta-analysis, and he explained that it was when you gather all the data from previous studies on a question and do formal statistical analysis to come up with more definitive results than the original studies. It’s a major research method in evidence-based medicine, and every few weeks a meta-analysis makes headlines because it shows the effectiveness or lack of effectiveness of treatments.

So I set out to do a meta-analysis. I tried to find every study that ever looked at perception of high resolution audio, and get their data. I scoured every place they could have been published and asked everyone in the field, all around the world. One author literally found his old data tucked away in the back of a filing cabinet. Another couldn’t get permission to provide the raw data, but told me enough about it for me to write a little program that ran through all possible results until it found the details that would reproduce the summary data as well. In the end, I found 18 relevant studies and could get data from all of them except one. That was strange, since it was the most famous study. But the authors had ‘lost’ the data, and got angry with me when I asked them for details about the experiment.
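The idea of running through all possible results until the summary data is reproduced can be sketched as follows. This is a hedged illustration in Python and not the program actually used: it assumes, hypothetically, that each listener has an integer score out of a known number of trials and that the published summary is a mean and a population standard deviation rounded to two decimals. The function name and all parameters are invented for the example.

```python
from itertools import combinations_with_replacement
from statistics import mean, pstdev


def candidate_datasets(n_listeners, n_trials, reported_mean, reported_sd,
                       decimals=2):
    """Yield every multiset of per-listener scores matching the summary.

    Scores are integers from 0 to n_trials. A candidate matches when its
    mean and population standard deviation, rounded to `decimals` places,
    equal the reported values.
    """
    possible_scores = range(n_trials + 1)
    for scores in combinations_with_replacement(possible_scores, n_listeners):
        if (round(mean(scores), decimals) == round(reported_mean, decimals)
                and round(pstdev(scores), decimals) == round(reported_sd, decimals)):
            yield scores


# Hypothetical example: 4 listeners, 10 trials each, and a reported
# mean of 6.00 correct with a standard deviation of 1.87.
found = list(candidate_datasets(4, 10, 6.00, 1.87))
```

A search like this can return more than one candidate, so extra details about the experiment are usually needed to narrow the result down. Exhaustive enumeration is also only practical for small studies, since the number of combinations grows quickly with the number of listeners.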

The results of the meta-analysis were fascinating, and not at all what I expected. There were researchers who thought their data had or hadn’t shown an effect, but when you apply formal analysis, it’s the opposite. And a few experiments had major flaws. For instance, in one experiment many of the high resolution recordings were actually standard quality, which means there never was a difference to be perceived. In another, test subjects were given many versions of the same audio, including a direct live feed, and asked which sounds closer to live. People actually ranked the live feed as sounding least close to live, indicating they just didn’t know what to listen for.

As for the one study where the authors lost their data? Well, they had published some of it, but it basically went like this. 55 participants listened to many recordings many times and could not discriminate between high resolution and standard formats. But men discriminated more than women, older far more than younger listeners, audiophiles far more than nonexperts. Yet only 3 people ever guessed right more than 6 times out of 10. The chance of all this happening by luck if there really was no difference is less likely than winning the lottery. It’s extremely unlikely even if there was a difference to be heard. Conclusion: they faked their data.

And this was the study which gave the most evidence that people couldn’t hear anything extra in high resolution recordings. In fact the studies with the most flaws were those that didn’t show an effect. Those that found an effect were generally more rigorous and took extra care in their design, set-up and analysis. This was counterintuitive. People are always looking for a new cure or a new effect. But in this case, there was a bias towards not finding a result. It seems researchers wanted to show that the claims of hearing a difference are false.

The biggest factor was training. Studies where subjects, even those experienced working with audio, just came in and were asked to state when two versions of a song were the same, rarely performed better than chance. But if they were told what to listen for, given examples, were told when they got it right or wrong, and then came back and did it under blind controlled conditions, they performed far better. All studies where participants were given training gave higher results than all studies where there was no training. So it seems we can hear a difference between standard and high resolution formats, we just don’t know what to listen for. We listen to music every day, but we do it passively and rarely focus on recording quality. We don’t sit around listening for subtle differences in formats, but they are there and they can be perceived. To audiophiles, that’s a big deal.

In 2016 I published this meta-analysis in the Journal of the Audio Engineering Society, and it created a big splash. I had a lot of interviews in the press, and it was discussed on social media and internet forums. And that’s when I found out, people on the internet are crazy! I was accused of being a liar, a fraud, paid by the audio industry, writing press releases, working the system and pushing an agenda. These criticisms came from all sides, since differences were found which some didn’t think existed, but they also weren’t as strong as others wanted them to be. I was also accused of cherry-picking the studies, even though one of the goals of the paper was to avoid exactly that, which is why I included every study I could find.

But my favorite comment was when someone called me an ‘intellectually dishonest placebophile apologist’. Whoever wrote that clearly spent time and effort coming up with a convoluted insult.

It wasn’t just people online who were crazy. At an Audio Engineering Society convention, two people were discussing the paper. One was a multi-Grammy award winning mixing engineer and inventor, the other had a distinguished career as chief scientist at a major audio company.

What started as discussion escalated to heated argument, then shouting, then pushing and shoving. It was finally broken up when a famous mastering engineer intervened. I guess I should be proud of this.

I learned what most people already know, how very hard it is to change people’s minds once an opinion has been formed. And people rarely look at the source. Instead, they rely on biased opinions discussing that source. But for those interested in the issue whose minds were not already made up, I think the paper was useful.

I’m trying to figure out why we hear this difference. It’s not due to problems with the high resolution audio equipment; that was checked in every study that found a difference. There’s no evidence that people have super hearing or that the sampling theorem is violated. But we need to remove all the high frequencies in a signal before we convert it to digital, even if we don’t hear them. That brings up another famous theorem, the uncertainty principle. In quantum mechanics, it tells us that we can’t resolve a particle’s position and momentum at the same time. In signal processing, it tells us that restricting a signal’s frequency content will make us less certain about its temporal aspects. When we remove those inaudible high frequencies, we smear out the signal. It’s a small effect, but this spreading of the sound by a tiny amount may be audible.
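To get a feel for the size of this smearing, here is a minimal Python sketch under a strong simplifying assumption: the filter is an ideal brick-wall low-pass with its cutoff at half the sampling rate. Real converters use other filter designs, so the numbers are only indicative, and the function names are hypothetical. An ideal low-pass filter turns an instantaneous click into a sinc-shaped ripple whose central lobe, measured between its first zeros, is 1/cutoff seconds wide.

```python
import math


def ideal_lowpass_impulse(t_s, cutoff_hz):
    """Impulse response of an ideal brick-wall low-pass filter at time t_s."""
    x = 2.0 * cutoff_hz * t_s
    if x == 0.0:
        return 2.0 * cutoff_hz
    return 2.0 * cutoff_hz * math.sin(math.pi * x) / (math.pi * x)


def main_lobe_width_us(sample_rate_hz):
    """Width of the central lobe in microseconds, cutoff at half the rate.

    The first zeros of the sinc sit at +/- 1 / (2 * cutoff), so the lobe
    between them is 1 / cutoff seconds wide.
    """
    cutoff_hz = sample_rate_hz / 2.0
    return 1e6 / cutoff_hz


# Lobe widths for a standard rate and two common high resolution rates.
widths = {rate: main_lobe_width_us(rate) for rate in (44100, 96000, 192000)}
```

Under these assumptions a click is spread over roughly 45 microseconds at 44.1 kHz and roughly 21 microseconds at 96 kHz. The sketch only shows that a lower cutoff means more spreading in time; it says nothing about whether spreading of that size can be heard.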

The End

The sounds around us shape our perception of the world. We saw that in films, games, music and virtual reality, we recreate those sounds or create unreal sounds to evoke emotions and capture the imagination. But there is a world of fascinating phenomena related to sound and perception that is not yet understood. Can we create an auditory reality without relying on recorded samples? Could a robot replace the sound engineer, should it? Investigating such questions has led to a deeper understanding of auditory perception, and has the potential to revolutionise sound design and music production.

What are the limits of human hearing? Do we make far greater use of auditory information than simple models can account for? And if so, can we feed this back for better audio production and sound design?


To answer these questions, we need to look at the human auditory system. Sound waves are transferred to the inner ear, which contains one of the most amazing organs in the human body, the cochlea. 3,500 inner hair cells line the cochlea, and resonate in response to frequencies across the audible range. These hair cells connect to a nerve string containing 30,000 neurons which can fire 600 pulses a second. So the brainstem receives up to 18 million pulses per second. Hence the cochlea is a very high resolution frequency analyser with digital outputs. Audio engineers would pay good money for that sort of thing, and we have two of them, free, inside our heads!

The pulses carry frequency and temporal information about sounds. This is sent to the brain’s auditory cortex, where hearing sensations are stored as aural activity images. They’re compared with previous aural activity images, other sensory images and overall context to get an aural scene representing the meaning of hearing sensations. This scene is made available to other processes in the brain, including thought processes such as audio assessment. It’s all part of 100 billion brain cells with 500 trillion connections, a massively powerful machine to manage body functions, memory and thinking.

These connections can be rewired based on experiences and stimuli. We have the power to learn new ways to process sounds. The perception is up to us. Like we saw with hot and cold water sounds, with perception of sound effects and music production, with high resolution audio, we have the power to train ourselves to perceive the subtlest aspects. Nothing is stopping us from shaping and appreciating a better auditory world.

Credits

All synthesised sounds created using FXive.

Sound design by Dave Moffat.

Synthesised sounds by Thomas Vassallo, Parham Bahadoran, Adan Benito and Jake Lee.

Videos by Enrique Perez Gonzalez (automatic mixing) and Rod Selfridge (animation).

Special thanks to all my current and former students and researchers, collaborators and colleagues. See the video for the full list.

And thanks to my lovely wife Sabrina and daughter Eliza.


Digging the didgeridoo

The Ig Nobel prizes are tongue-in-cheek awards given every year to celebrate unusual or trivial achievements in science. Named as a play on the Nobel prize and the word ignoble, they are intended to “honor achievements that first make people laugh, and then make them think.” Previously, when discussing graphene-based headphones, I mentioned Andre Geim, the only scientist to have won both a Nobel and an Ig Nobel prize.

I only recently noticed that the 2017 Ig Nobel Peace Prize went to an international team that demonstrated that playing a didgeridoo is an effective treatment for obstructive sleep apnoea and snoring. Here’s a photo of one of the authors of the study playing the didge at the award ceremony.


My own nominees for Ig Nobel prizes, from audio-related research published this past year, would include ‘Influence of Audience Noises on the Classical Music Perception on the Example of Anti-cough Candies Unwrapping Noise’, which we discussed in our preview of the 143rd Audio Engineering Society Convention, and ‘The DFA Fader: Exploring the Power of Suggestion in Loudness Judgments’, for which we had the blog entry ‘What the f*** are DFA faders’.

But let’s return to didgeridoo research. It’s a fascinating Aboriginal Australian instrument, with a rich history and interesting acoustics, and produces an eerie drone-like sound.

A search on Google Scholar, after removing patents and citations, shows only 38 research papers with didgeridoo in the title. That’s great news if you want to be an expert on research in the subject. The work of Neville H. Fletcher over about a thirty year period beginning in the early 1980s is probably the main starting point.

The passive acoustics of the didgeridoo are well understood. It’s a long truncated conical horn where the player’s lips at the smaller end form a pressure-controlled valve. Knowing the length and diameters involved, it’s not too difficult to determine the fundamental frequencies (often around 50-100 Hz) and modes excited, and their strengths, in much the same way as can be done for many woodwind instruments.
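To get a feel for the numbers, here is a minimal sketch in Python of that kind of calculation. It treats the bore as an ideal truncated cone, closed at the lip end and open at the bell, and ignores end corrections, wall losses and the player’s vocal tract, so it is an idealisation rather than Fletcher’s full treatment. Under those assumptions the resonances satisfy tan(kL) = -k r1, where r1 is the distance from the missing apex of the cone to the lip end. The function name and the example dimensions are made up for illustration.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, approximate value for air at room temperature


def frustum_fundamental(length_m, d_lip_m, d_bell_m, c=SPEED_OF_SOUND):
    """Lowest resonance (Hz) of an ideal truncated cone, closed at the
    narrow (lip) end and open at the wide (bell) end.

    Solves tan(kL) = -k * r1 by bisection, where r1 is the distance from
    the cone's missing apex to the lip end.
    """
    if not 0.0 < d_lip_m < d_bell_m:
        raise ValueError("expected 0 < d_lip_m < d_bell_m")
    # Similar triangles give r1 / L = d_lip / (d_bell - d_lip).
    r1_over_length = d_lip_m / (d_bell_m - d_lip_m)

    def residual(theta):
        # theta = k * L; the root lies between pi/2 (cylinder) and pi (full cone).
        return math.sin(theta) + r1_over_length * theta * math.cos(theta)

    low, high = math.pi / 2.0, math.pi
    for _ in range(60):
        mid = 0.5 * (low + high)
        if residual(mid) > 0.0:
            low = mid
        else:
            high = mid
    theta = 0.5 * (low + high)
    return theta * c / (2.0 * math.pi * length_m)


if __name__ == "__main__":
    # Hypothetical instrument: 1.3 m long, 3 cm bore at the lips, 8 cm at the bell.
    print(round(frustum_fundamental(1.3, 0.03, 0.08), 1), "Hz")
```

The answer always falls between the closed cylinder value c/(4L) and the complete cone value c/(2L), which is a handy sanity check on any measured instrument.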

But that’s just the passive acoustics. Fletcher pointed out that traditional, solo didgeridoo players don’t pay much attention to the resonant frequencies, and they’re mainly important when it’s played in Western music and needs to fit with the rest of an ensemble.

Things start getting really interesting when one considers the sounding mechanism. Players make heavy use of circular breathing, breathing in through the nose while breathing out through the mouth, even more so, and more rhythmically, than is typical in performing Western brass instruments like trumpets and tubas. Changes in lip motion and vocal tract shape are then used to control the formants, allowing the manipulation of very rich timbres.

It’s these aspects of didgeridoo playing that intrigued the authors of the sleep apnoea study. Like the DFA and cough drop wrapper studies mentioned above, this was a serious study on a seemingly not so serious subject. Circular breathing and training of respiratory muscles may go a long way towards improving nighttime breathing, and hence reducing snoring and sleep disturbances. The study was controlled and randomised. But it’s incredibly difficult in these sorts of studies to eliminate or control for all the other variables, and very hard to identify which aspect of the didgeridoo playing was responsible for the better sleep. The authors quite rightly highlighted what I think is one of the biggest question marks in the study:

A limitation is that those in the control group were simply put on a waiting list because a sham intervention for didgeridoo playing would be difficult. A control intervention such as playing a recorder would have been an option, but we would not be able to exclude effects on the upper airways and compliance might be poor.

In that respect, drug trials are somewhat easier to interpret than practice-based interventions. But the effect was abundantly clear and quite strong. One certainly should not dismiss the results because of limitations in the study (the limitations give rise to question marks, but they’re not mistakes).

 

The cavity tone……

In September 2017, I attended the 20th International Conference on Digital Audio Effects in Edinburgh. At this conference, I presented my work on a real-time physically derived model of a cavity tone. The cavity tone is one of the fundamental aeroacoustic sounds, similar to the previously described Aeolian tone. The cavity tone commonly occurs in aircraft when opening bomb bay doors or in the cavities left when the landing gear is extended. Another example of the cavity tone can be heard when swinging a sword with a grooved profile.

The physics of operation can be a little complicated. To try and keep it simple, air flows over the cavity and comes into contact with air at a different velocity within the cavity. The movement of air at one speed over air at another causes what’s known as a shear layer between the two. The shear layer is unstable and flaps against the trailing edge of the cavity, causing a pressure pulse. The pressure pulse travels back upstream to the leading edge and reinforces the instability. This causes a feedback loop which will occur at set frequencies. Away from the cavity the pressure pulse will be heard as an acoustic tone – the cavity tone!

A diagram of this is shown below:

Like the previously described Aeolian tone, there are equations to derive the frequency of the cavity tone, based on the length of the cavity and the airspeed. There are a number of modes of operation, usually numbered from 1 to 4. The acoustic intensity has also been defined, based on the airspeed, the position of the listener and the geometry of the cavity.
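The equations themselves aren’t given in this post, so the sketch below assumes the widely used semi-empirical Rossiter formula, f_m = (U/L)(m - α)/(M + 1/κ), where U is the airspeed, L the cavity length, M the Mach number and m the mode number. The constants α ≈ 0.25 and κ ≈ 0.57 are commonly quoted values, and the function name, airspeed and cavity length are hypothetical, so the model in the paper may well differ in detail.

```python
SPEED_OF_SOUND = 343.0  # m/s, approximate value for air at room temperature


def rossiter_frequencies(airspeed, cavity_length, modes=(1, 2, 3, 4),
                         alpha=0.25, kappa=0.57, c=SPEED_OF_SOUND):
    """Cavity tone frequencies (Hz) from Rossiter's semi-empirical formula.

    alpha and kappa are empirical constants; the defaults are commonly
    quoted values, not ones taken from this post.
    """
    mach = airspeed / c
    return [(airspeed / cavity_length) * (m - alpha) / (mach + 1.0 / kappa)
            for m in modes]


if __name__ == "__main__":
    # Hypothetical case: 100 m/s flow over a 10 cm long cavity.
    for mode, freq in enumerate(rossiter_frequencies(100.0, 0.1), start=1):
        print(f"mode {mode}: {freq:.0f} Hz")
```

Note that the modes are not harmonically related: the ratio of mode 2 to mode 1 is 1.75/0.75, not 2, which is part of what gives the cavity tone its character.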

The implementation of a single-mode cavity tone is shown in the figure below. The Reynolds number is a dimensionless measure of the ratio of inertial to viscous forces in the flow, and Q relates to the bandwidth of the passband of the bandpass filter.
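As a rough illustration of those two quantities, the sketch below computes a Reynolds number from the airspeed and a characteristic length, and converts a bandpass filter’s centre frequency and Q into a bandwidth. The kinematic viscosity is an approximate room-temperature value for air, and which length the model actually uses for the Reynolds number is an assumption here, as are the function names.

```python
AIR_KINEMATIC_VISCOSITY = 1.5e-5  # m^2/s, approximate value for air near 20 C


def reynolds_number(airspeed, length, kinematic_viscosity=AIR_KINEMATIC_VISCOSITY):
    """Re = U * L / nu: the ratio of inertial to viscous forces in the flow."""
    return airspeed * length / kinematic_viscosity


def bandwidth_hz(centre_hz, q):
    """Bandwidth of a bandpass filter from its centre frequency and Q.

    Uses the usual definition Q = centre frequency / bandwidth, so a higher
    Q means a narrower, more tonal passband.
    """
    return centre_hz / q


if __name__ == "__main__":
    # Hypothetical values: 30 m/s flow, 5 cm cavity, 1 kHz tone with Q = 10.
    print(reynolds_number(30.0, 0.05))
    print(bandwidth_hz(1000.0, 10.0))
```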

Comparing our model’s average frequency prediction to published results, we found it was 0.3% lower than theoretical frequencies, 2.0% lower than computed frequencies and 6.4% lower than measured frequencies. A copy of the Pure Data synthesis model can be downloaded here.

 

The final whistle blows

Previously, we discussed screams, applause, bouncing and pouring water. Continuing our examination of everyday sounds, we bring you… the whistle.

This one is a little challenging though. To name just a few, there are pea whistles, tin whistles, steam whistles, dog whistles and of course, human whistling. Covering all of these would take a lot more than a single blog entry. So let’s stick to the standard pea whistle or pellet whistle (or ‘escargot’ or barrel whistle because of its snail-like shape), which is the basis for a lot of the whistles that you’ve heard.


 

Typical metal pea whistle, featuring a mouthpiece, a bevelled edge and sound hole where air can escape, and a barrel-shaped air chamber with a pellet inside.

 

Whistles are the oldest known type of flute. They have a stopped lower end and a flue that directs the player’s breath from the mouth hole at the upper end against the edge of a hole cut in the whistle wall, causing the enclosed air to vibrate. Most whistle instruments have no finger holes and sound only one pitch.

A whistle produces sound from a stream of gas, most commonly air, typically supplied by someone blowing or, in a steam whistle, by pressurised steam. The conversion of energy to sound comes from an interaction between the gas stream and a solid material.

In a pea whistle, the air stream enters through the mouthpiece. It hits the bevel (sloped edge for the opening) and splits, outwards into the air and inwards filling the air chamber. It continues to swirl around and fill the chamber until the air pressure inside is so great that it pops out of the sound hole (a small opening next to the bevel), making room for the process to start over again. The dominant pitch of the whistle is determined by the rate at which air packs and unpacks the air chamber. The movement of air forces the pea or pellet inside the chamber to move around and around. This sometimes interrupts the flow of air and creates a warble to the whistle sound.

The size of the whistle cavity determines the volume of air contained in the whistle and the pitch of the sound produced. The air fills and empties from the chamber so many times per second, which gives the fundamental frequency of the sound.

The whistle construction and the design of the mouthpiece also have a dramatic effect on sound. A whistle made from a thick metal will produce a brighter sound compared to the more resonant mellow sound if thinner metal is used. Modern whistles are produced using different types of plastic, which increases the range of tones and sounds now available. The design of the mouthpiece can also dramatically alter the sound. Even a few thousandths of an inch difference in the airway, angle of the blade, size or width of the entry hole, can make a drastic difference as far as volume, tone, and chiff (breathiness or solidness of the sound) are concerned. And according to the whistle Wiki page, which might be changed by the time you read this, ‘One characteristic of a whistle is that it creates a pure, or nearly pure, tone.’

Well, is all of that correct? When we looked at the sounds of pouring hot and cold water we found that the simple explanations were not correct. In explaining the whistle, can we go a bit further than a bit of handwaving about the pea causing a warble? Do the different whistles differ a lot in sound?

Let’s start with some whistle sounds. Here’s a great video where you get to hear a dozen referee’s whistles.

Looking at the spectrogram below, you can see that all the whistles produce dominant frequencies somewhere between 2200 and 4400 Hz. Some other features are also apparent. There seems to be some second and even third harmonic content. And it doesn’t seem to be just one frequency and its overtones. Rather, there are two or three closely spaced frequencies whenever the whistle is blown.

Referee Whistles

But this sound sample is all fairly short whistle blows, which could be why the pitches are not constant. And one should never rely on just one sample or one audio file (as the authors did here). So let’s look at just one long whistle sound.

Spectrogram and waveform of one long whistle blow

You can see that it remains fairly constant, and the harmonics are clearly present, though I can’t say if they are partly due to dynamic range compression or any other processing. However, there are semi-periodic dips or disruptions in the fundamental pitch. You can see this more clearly in the waveform, and this is almost certainly due to the pea temporarily blocking the sound hole and weakening the sound.

The same general behaviour appears with other whistles, though with some variation in the dips and their rate of occurrence, and in the frequencies and their strengths.

Once I started writing this blog entry, I was pointed to the fact that Perry Cook had already discussed synthesizing whistle sounds in his wonderful book Real Sound Synthesis for Interactive Applications. In building up part of a model of a police/referee whistle, he wrote:

 ‘Experiments and spectrograms using real police/referee whistles showed that when the pea is in the immediate region of the jet oscillator, there is a decrease in pitch (about 7%), an increase in amplitude (about 6 dB), and a small increase in the noise component (about 2 dB)… The oscillator exhibits three significant harmonics: f, 2f and 3f at 0 dB, -10 dB and -25 dB, respectively…’

With the exception of the increase in amplitude due to the pea (was that a typo?), my results are all in rough agreement with his. So depending on whether I’m a glass half empty / glass half full kind of person, I could either be disappointed that I’m just repeating what he did, or glad that my results are independently confirmed.
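To make those figures concrete, here is a minimal additive synthesis sketch in Python that uses only the numbers in the quote: three harmonics at 0, -10 and -25 dB, and a 7% pitch drop with a 6 dB level increase when the pea is near the jet. The fundamental frequency, the rate at which the pea passes the jet and the smooth raised-cosine shape of its influence are all hypothetical choices, and the small noise component is left out. It is an illustration of the idea, not Cook’s model.

```python
import math

# Harmonic levels from the quote: f, 2f and 3f at 0, -10 and -25 dB.
HARMONIC_LEVELS_DB = (0.0, -10.0, -25.0)


def db_to_gain(level_db):
    """Convert a level in decibels to a linear amplitude factor."""
    return 10.0 ** (level_db / 20.0)


def whistle_tone(f0=2800.0, duration=0.5, sample_rate=44100,
                 pea_rate_hz=25.0, pitch_drop=0.07, pea_boost_db=6.0):
    """Return samples in [-1, 1] for a pea whistle-like tone.

    f0 and pea_rate_hz are illustrative guesses; pitch_drop and
    pea_boost_db are the figures quoted from Cook.
    """
    gains = [db_to_gain(db) for db in HARMONIC_LEVELS_DB]
    peak = sum(gains) * db_to_gain(pea_boost_db)  # worst-case amplitude
    samples = []
    phase = 0.0
    for n in range(int(duration * sample_rate)):
        t = n / sample_rate
        # pea is 0 when the pea is far from the jet and 1 when it is closest.
        pea = 0.5 * (1.0 - math.cos(2.0 * math.pi * pea_rate_hz * t))
        freq = f0 * (1.0 - pitch_drop * pea)
        amp = db_to_gain(pea_boost_db * pea)
        # Accumulate phase so the pitch can change without clicks.
        phase += 2.0 * math.pi * freq / sample_rate
        value = sum(g * math.sin((k + 1) * phase) for k, g in enumerate(gains))
        samples.append(amp * value / peak)
    return samples


if __name__ == "__main__":
    tone = whistle_tone()
    print(len(tone), "samples, peak", max(abs(s) for s in tone))
```

Swapping the sign of pea_boost_db gives the weakening of the sound seen in the recordings above, which makes it easy to compare the two behaviours by ear.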

This information from a few whistle recordings should be good enough to characterise the behaviour and come up with a simple, controllable synthesis. Jiawei Liu took a different approach. In his Master’s thesis, he simulated whistles using computational fluid dynamics and acoustic finite element simulation. It was very interesting work, as was a related approach by Shi, but they’re both a bit like using a sledgehammer to kill a fly. Massive effort and lots of computation, when a model that probably sounds just as good could have been derived using semi-empirical equations that model aeroacoustic sounds directly, as discussed in our previous blog entries on sound synthesis of an Aeolian harp, a propeller, sword sounds, swinging objects or Aeolian tones.

There’s been some research into automatic identification of referee whistle sounds, for instance, initial work of Shirley and Oldfield in 2011 and then a more advanced algorithm a few years later. But these are either standard machine learning techniques, or based on the most basic aspects of the whistle sound, like its fundamental frequency. In either case, they don’t use much understanding of the nature of the sound. But I suppose that’s fine. They work, they enable intelligent production techniques for sports broadcasts, and they don’t need to delve into the physical or perceptual aspects.

I said I’d stick to pellet whistles, but I can’t resist mentioning a truly fascinating and unusual synthesis of another whistle sound. Steam locomotives were equipped with train whistles for warning and signalling. To generate the sound, the train driver pulls a cord in the driver’s cabin, thereby opening a valve, so that steam shoots out of a gap and against the sharp edge of a bell. This makes the bell vibrate rapidly, which creates a whistling sound. In 1972, Herbert Chaudiere created an incredibly detailed sound system for model trains. This analogue electronic system generated all the memorable sounds of the steam locomotive: the bark of exhausting steam, the rhythmic toll of the bell, and the wail of the chime whistle, and reproduced these sounds from a loudspeaker carried in the model locomotive.

The preparation of this blog entry also illustrates some of the problems with crowdsourced metadata and user generated tagging. When trying to find some good sound examples, I searched the world’s most popular sound effects archive, freesound, for ‘pea whistle’. It came up with only one hit, a recording of steam and liquid escaping from a pot of boiling black-eyed peas!

References:

  • Chaudiere, H. T. (1972). Model Railroad Sound System. Journal of the Audio Engineering Society, 20(8), 650-655.
  • Liu, J. (2012). Simulation of whistle noise using computational fluid dynamics and acoustic finite element simulation. MSc Thesis, U. Kentucky.
  • Shi, Y., Da Silva, A., & Scavone, G. (2014). Numerical Simulation of Whistles Using Lattice Boltzmann Methods. ISMA, Le Mans, France.
  • Cook, P. R. (2002). Real sound synthesis for interactive applications. CRC Press.
  • Oldfield, R. G., & Shirley, B. G. (2011, May). Automatic mixing and tracking of on-pitch football action for television broadcasts. In Audio Engineering Society Convention 130.
  • Oldfield, R., Shirley, B., & Satongar, D. (2015, October). Application of object-based audio for automated mixing of live football broadcast. In Audio Engineering Society Convention 139.

Audio Research Year in Review - Part 2, the Headlines

Last week featured the first part of our ‘Audio research year in review.’ It focused on our own achievements. This week is the second, concluding part, with a few news stories related to the topics of this blog (music production, psychoacoustics, sound synthesis and everything in between) for each month of the year.

Browsing through the list, some interesting things pop up. Several news stories relate to speech intelligibility in broadcast TV, which has been a recurring story the last few years. The effect of noise pollution on wildlife is also a theme in this year’s audio research headlines. And quite a few of the psychological studies are telling us what we already know. The fact that musicians (who are trained in a task that involves quick response to stimuli) have faster reaction times than non-musicians (who may not be trained in such a task) is not a surprise. Nor is the fact that if you hear the cork popping from a wine bottle, you may think it tastes better, although that’s a wonderful example of the placebo effect. But studies that end up confirming assumptions are still worth doing.

January

February

March

April

May


June

July

August

September

October

November

December

Applied Sciences Journal Article

We are delighted to announce the publication of our article, Sound Synthesis of Objects Swinging through Air Using Physical Models, in the Applied Sciences Special Issue on Sound and Music Computing.

 

The journal article is a revised and extended version of our paper, which won a best paper award at the 14th Sound and Music Computing Conference, held in Espoo, Finland in July 2017. The initial paper presented a physically derived synthesis model used to replicate the sound of sword swings using equations obtained from fluid dynamics, which we discussed in a previous blog entry. In the article we extend the listening tests to include sound effects of metal swords, wooden swords, golf clubs, baseball bats and broom handles, and add a cavity tone synthesis model to replicate grooves in the sword profiles. Further tests were carried out to see if participants could identify which object our model was replicating when swinging a Wii Controller.
The properties exposed by the sound effects model could be automatically adjusted by a physics engine giving a wide corpus of sounds from one simple model, all based on fundamental fluid dynamics principles. An example of the sword sound linked to the Unity game engine is shown in this video.
 

 

Abstract:
A real-time physically-derived sound synthesis model is presented that replicates the sounds generated as an object swings through the air. Equations obtained from fluid dynamics are used to determine the sounds generated while exposing practical parameters for a user or game engine to vary. Listening tests reveal that for the majority of objects modelled, participants rated the sounds from our model as plausible as actual recordings. The sword sound effect performed worse than others, and it is speculated that one cause may be linked to the difference between expectations of a sound and the actual sound for a given object.
The Applied Sciences journal is open access and a copy of our article can be downloaded here.

Bounce, bounce, bounce . . .

bounce

Another in our continuing exploration of everyday sounds (Screams, Applause, Pouring water) is the bouncing ball. It’s a nice one for a blog entry since there are only a small number of papers focused on bouncing, which means we can give a good overview of the field. It’s also one of those sounds that we can identify very clearly; we all know it when we hear it. It has two components that can be treated separately; the sound of a single bounce and the timing between bounces.

Let’s consider the second aspect. If we drop a ball from a certain height and ignore any drag, the time it takes to hit the ground is completely determined by gravity. When it hits the ground, some energy is absorbed on impact. And so it may be traveling downwards with a velocity v1 just before impact, and after impact travels upwards with velocity v2. The ratio v2/v1 is called the coefficient of restitution (COR). A high COR means that the ball travels back up almost to its original height, and a low COR means that most energy is absorbed and it only travels up a short distance.

Knowing the COR, one can use simple equations of motion to determine the time between each bounce. And since the sum of the times between bounces is a convergent series, one can find the maximum time until it stops bouncing. Conversely, measuring the coefficient of restitution from times between bounces is literally a tabletop physics experiment (Aguiar 2003, Farkas 2006, Schwarz 2013). And kinetic energy depends on the square of the velocity, so we know how much energy is lost with each bounce, which also gives an idea of how the sound levels of successive bounces should decrease.

[The derivation of all this has been left to the reader 😊. But again, it’s a straightforward application of the equations of motion that give time dependence of position and velocity under constant acceleration]
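For readers who would rather check the result numerically than derive it, here is a minimal sketch in Python. It assumes no air drag and a constant COR, and the function names and example values are made up. The flight time after the n-th impact is 2 eⁿ v₀ / g, where e is the COR and v₀ the first impact speed, so the total time is a geometric series that sums to √(2h/g) (1 + e)/(1 − e). The level drop per bounce assumes that the radiated sound energy simply scales with kinetic energy, which is only a rough guide.

```python
import math

G = 9.81  # m/s^2, gravitational acceleration


def impact_times(height, cor, n_impacts, g=G):
    """Times (s) of the first n_impacts impacts for a ball dropped from rest."""
    speed = math.sqrt(2.0 * g * height)       # speed at the first impact
    times = [math.sqrt(2.0 * height / g)]     # time of the first impact
    for _ in range(n_impacts - 1):
        speed *= cor                          # rebound speed after this impact
        times.append(times[-1] + 2.0 * speed / g)
    return times


def total_bounce_time(height, cor, g=G):
    """Time (s) until bouncing stops: the sum of the geometric series."""
    return math.sqrt(2.0 * height / g) * (1.0 + cor) / (1.0 - cor)


def level_drop_db(cor):
    """Change in level per bounce (dB), assuming level tracks kinetic energy."""
    return 20.0 * math.log10(cor)


def cor_from_impact_times(times):
    """Estimate the COR from the ratio of successive intervals between impacts."""
    intervals = [b - a for a, b in zip(times, times[1:])]
    return [later / earlier for earlier, later in zip(intervals, intervals[1:])]


if __name__ == "__main__":
    times = impact_times(height=1.0, cor=0.8, n_impacts=6)
    print([round(t, 3) for t in times])
    print(round(total_bounce_time(1.0, 0.8), 3), "s until the ball comes to rest")
    print([round(e, 3) for e in cor_from_impact_times(times)])
```

The last function is essentially the tabletop experiment in reverse: record the impact times with a microphone, and the ratio of successive intervals gives the COR.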

It’s not that hard to extend this approach, for instance by including air drag or sloped surfaces. But if you put the ball on a vibrating platform, all sorts of wonderful nonlinear behaviour can be observed: chaos, locking and chattering (Luck 1993).

For instance, have a look at the following video, which shows some interesting behaviour where bouncing balls all seem to organise onto one side of a partition.

So much for the timing of bounces, but what about the sound of a single bounce? Well, Nagurka (2004) modelled the bounce as a mass-spring-damper system, giving the time of contact for each bounce. It provides a little more realism by capturing some aspects of the bounce sound. Stoelinga (2007) did a detailed analysis of bouncing and rolling sounds. That work has a wealth of useful information, and deep insights into both the physics and perception of bouncing, but stops short of describing how to synthesize a bounce.
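As a sketch of what a mass-spring-damper model gives you, the Python below uses the textbook solution for a linear, underdamped system and ignores gravity during contact. Under those assumptions the ball stays in contact for half a period of the damped oscillation, and the COR follows from how much the velocity has decayed over that time. Whether this matches the exact formulation in Nagurka’s paper is an assumption, and the function name and parameter values are hypothetical.

```python
import math


def contact_time_and_cor(mass, stiffness, damping):
    """Contact time (s) and COR for a linear mass-spring-damper bounce.

    mass in kg, stiffness in N/m, damping in N s/m. Gravity during contact
    is ignored, and contact is taken to end after half a damped period.
    """
    omega_n = math.sqrt(stiffness / mass)                # undamped natural frequency
    zeta = damping / (2.0 * math.sqrt(stiffness * mass)) # damping ratio
    if zeta >= 1.0:
        raise ValueError("overdamped: the ball would not rebound")
    omega_d = omega_n * math.sqrt(1.0 - zeta ** 2)       # damped frequency
    contact_time = math.pi / omega_d
    cor = math.exp(-zeta * math.pi / math.sqrt(1.0 - zeta ** 2))
    return contact_time, cor


if __name__ == "__main__":
    # Hypothetical ball: 50 g, contact stiffness 20 kN/m, damping 5 N s/m.
    duration, cor = contact_time_and_cor(0.05, 2.0e4, 5.0)
    print(f"contact time {duration * 1000:.2f} ms, COR {cor:.2f}")
```

A shorter contact time means a sharper impact with more high frequency content, which is one reason a hard ball on a hard floor sounds brighter than a soft one.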

To really capture the sound of a bounce, something like modal synthesis should be used. That is, one should identify the modes that are excited for impact of a given ball on a given surface, and their decay rates. Farnell measured these modes for some materials, and used those values to synthesize bounces in Designing Sound. But perhaps the most detailed analysis and generation of such sounds, at least as far as I’m aware, is in the work of Davide Rocchesso and his colleagues, leaders in the field of sound synthesis and sound design. They have produced a wealth of useful work in the area, but an excellent starting point is The Sounding Object.
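In its simplest form, modal synthesis of an impact is just a sum of exponentially decaying sinusoids, one per mode. The Python sketch below shows the idea. The mode frequencies, amplitudes and decay rates are invented for illustration and are not Farnell’s or Rocchesso’s measured values.

```python
import math

# Hypothetical modes: (frequency in Hz, relative amplitude, decay rate in 1/s).
EXAMPLE_MODES = [
    (180.0, 1.0, 30.0),
    (410.0, 0.6, 45.0),
    (950.0, 0.3, 80.0),
]


def modal_impact(modes, duration=0.3, sample_rate=44100, strength=1.0):
    """Synthesize one impact as a sum of exponentially decaying sinusoids."""
    samples = []
    for n in range(int(duration * sample_rate)):
        t = n / sample_rate
        value = sum(amp * math.exp(-decay * t) * math.sin(2.0 * math.pi * freq * t)
                    for freq, amp, decay in modes)
        samples.append(strength * value)
    return samples


if __name__ == "__main__":
    hit = modal_impact(EXAMPLE_MODES)
    print(len(hit), "samples, peak", round(max(abs(s) for s in hit), 3))
```

To build a full bouncing sound, one could trigger an impact like this at each bounce time and scale the strength by the impact velocity.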

Are you aware of any other interesting research about the sound of bouncing? Let us know.

Next week, I’ll continue talking about bouncing sounds with discussion of ‘the audiovisual bounce-inducing effect.’

References

  • Aguiar CE, Laudares F. Listening to the coefficient of restitution and the gravitational acceleration of a bouncing ball. American Journal of Physics. 2003 May;71(5):499-501.
  • Farkas N, Ramsier RD. Measurement of coefficient of restitution made easy. Physics Education. 2006 Jan;41(1):73.
  • Luck JM, Mehta A. Bouncing ball with a finite restitution: chattering, locking, and chaos. Physical Review E. 1993;48(5):3988.
  • Nagurka M, Huang S. A mass-spring-damper model of a bouncing ball. American Control Conference, 2004. Vol. 1. IEEE, 2004.
  • Schwarz O, Vogt P, Kuhn J. Acoustic measurements of bouncing balls and the determination of gravitational acceleration. The Physics Teacher. 2013 May;51(5):312-3.
  • Stoelinga C, Chaigne A. Time-domain modeling and simulation of rolling objects. Acta Acustica united with Acustica. 2007 Mar 1;93(2):290-304.