How does this sound? Evaluating audio technologies

The audio engineering team here have done a lot of work on audio evaluation, both in collaboration with companies and as an essential part of our research. Some challenges come up time and time again, not just in terms of formal approaches, but also in terms of just establishing a methodology that works. I’m aware of cases where a company has put a lot of effort into evaluating the technologies that they create, only for it to make absolutely no difference in the product. So here are some ideas about how to do it, especially from an informal industry perspective.

– When you are tasked with evaluating a technology, you should always maintain a dialogue with the developer. More than anyone else, he or she knows what the tool is supposed to do, how it all works, what content might be best to use and has suggestions on how to evaluate it.

subjective evaluation details

– Developers should always have some test audio content that they use during development. They work with this content all the time to check that the algorithm is modifying or analysing the audio correctly. We’ll come back to this.

– The first stage of evaluation is documentation. Each tool should have some form of user guide, tester guide and developer guide. The idea is that if the technology remains unused for a period of time and those who worked on it have moved on, a new person can read the guides and have a good idea how to use it and test it, and a new developer should be able to understand the algorithm and the source code. Documentation should also include test audio content, preferably both input and output files with information on how the tool should be used with this content.

– The next stage of evaluation is duplication. You should be able run the tool as suggested in the guide and get the expected results with the test audio. If anything in the documentation is incorrect or incomplete, get in touch with the developers for more information.

– Then we have the collection stage. You need test content to evaluate the tool. The most important content is that which shows off exactly what the tool is intended to do. You should also gather content that tests challenging cases, or content where you need to ensure that the effect doesn’t make things worse.

– The preparation stage is next, though this may be performed in tandem with collection. With the test content, you may need to edit it, in order that its ready to use in testing. You may also want to create manually create target content, demonstrating ideal results, or at least of similar sound quality to expected results.

– Next is informal perceptual evaluation. This is lots of listening and playing around with the tool. The goal is to identify problems, find out when it works best, identify interesting cases, problematic or preferred parameter settings.


– Now on to semi-formal evaluation. Have focused questions that you need to find the answer to and procedures and methodologies to answer them. Be sure to document your findings, so that you can say what content causes what problem, how and why, etc. This needs to be done so that the problem can be exactly replicated by developers, and so that you can see if the problem still exists in the next iteration.

– Now comes the all-important listening tests. Be sure that the technology is at a level such that the test will give meaningful results. You don’t want to ask a bunch of people to listen and evaluate if the tool still has major known bugs. You also want to make sure that the test is structured in such a way so that it gives really useful information. This is very important, and often overlooked. Finding out that people preferred implementation A over implementation B is nice, but its much better to find out why, and how much, and if listeners would have preferred something else. You also want to do this test with lots of content. If, for instance only one piece of content is used in a listening test, then you’ve only found out that people prefer A over B for one example. So, generally, listening tests should involve lots of questions, lots of content, and everything should be randomised to prevent bias. You may not have time to do everything, but its definitely worth putting significant time and effort into listening test design.

Keeping Score for the Team

We’ve developed the Web Audio Evaluation Toolbox, designed to make listening test design and implementation straightforward and high quality.

– And there is the feedback stage. Evaluation counts for very little unless all the useful information gets back to developers (and possibly others), and influences further development. All this feedback needs to be prepared and stored, so that people can always refer back to it.

– Finally, there is revisiting and reiteration. If we identify a problem, or a place for improvement, we need to perform the same evaluation on the next iteration of the tool to ensure that the problem has indeed been fixed. Otherwise, issues perpetuate and we never actually know if the tool is improving and problems are resolved and closed.

By the way, I highly recommend the book Perceptual Audio Evaluation by Bech and Zacharov, which is the bible on this subject.


The Mix Evaluation Dataset

Still at the upcoming International Conference on Digital Audio Effects in Edinburgh, 5-8 September, our group’s Brecht De Man will be presenting a paper on his Mix Evaluation Dataset (a pre-release of which can be read here).
It is a collection of mixes and evaluations of these mixes, amassed over the course of his PhD research, that has already been the subject of several studies on best practices and perception of mix engineering processes.
With over 180 mixes of 18 different songs, and evaluations from 150 subjects totalling close to 13k statements (like ‘snare drum too dry’ and ‘good vocal presence’), the dataset is certainly the largest and most diverse of its kind.

Unlike the bulk of previous research in this topic, the data collection methodology presented here has maximally preserved ecological validity by allowing participating mix engineers to use representative, professional tools in their preferred environment. Mild constraints on software, such as the agreement to use the DAW’s native plug-ins, means that mixes can be recreated completely and analysed in depth from the DAW session files, which are also shared.

The listening test experiments offered a unique opportunity for the participating mix engineers to receive anonymous feedback from peers, and helped create a large body of ratings and free-field text comments. Annotation and analysis of these comments further helped understand the relative importance of various music production aspects, as well as correlate perceptual constructs (such as reverberation amount) with objective features.

Proportional representation of processors in subjective comments

An interface to browse the songs, audition the mixes, and dissect the comments is provided at, from where the audio (insofar the source is licensed under Creative Commons, or copyrighted but available online) and perceptual evaluation data can be downloaded as well.

The Mix Evaluation Dataset browsing interface

What the f*** are DFA faders?

I’ve been meaning to write this blog entry for a while, and I’ve finally gotten around to it. At the 142nd AES Convention, there were two papers that really stood out which weren’t discussed in our convention preview or convention wrap-up. One was about Acoustic Energy Harvesting, which we discussed a few weeks ago, and the other was titled ‘The DFA Fader: Exploring the Power of Suggestion in Loudness The DFA Fader: Exploring the Power of Suggestion in Loudness Judgments.’ When I mentioned this paper to others, their response was always the same, “What’s a DFA Fader?” . Well, the answer is hinted at in the title of this blog entry.

The basic idea is that musicians often give instructions to the sound engineer that he or she can’t or doesn’t want to follow. For instance, a vocalist might say “Turn me up” in a soundcheck, but the sound engineer knows that the vocals are at a nice level already and any more amplification might cause feedback. Sometimes, this sort of thing can be communicated back to the musician in a nice way. But there’s also the fallback option; a fader on the mixing console that “Does F*** All”, aka DFA. The engineer can slide the fader or twiddle an unconnected dial, smile back and say ‘Ok, does this sound a bit better?’.

A couple of companies have had fun with this idea. Funk Logic’s Palindrometer, shown below, is nothing more than a filler for empty rack space. Its an interface that looks like it might do something, but at best, it just flashes some LEDs when one toggles switches and turns the knobs.


RANE have the PI 14 Pseudoacoustic Infector . Its worth checking out the full description, complete with product review and data sheets. I especially like the schematic, copied below.


And in 2014, our own Brecht De Man  released The Wire, a freely available VST and AudioUnit plug-in that emulates a gold-plated, balanced, 100% lossless audio connector.


Anyway, the authors of this paper had the bright idea of doing legitimate subjective evaluation of DFA faders. They didn’t make jokes in the paper, not even to explain the DFA acronym. They took 22 participants and divided them into an 11 person control group and an 11 person test group. In the control group, each subject participated in twenty trials where two identical musical excerpts were presented and the subject had to rate the difference in loudness of vocals between the two excerpts. Only ten excerpts were used, so each pair was used in two trials. In the test group, a sound engineer was present and he made scripted suggestions that he was adjusting the levels in each trial. He could be seen, but participants couldn’t see his hands moving on the console.

Not surprisingly, most trials showed a statistically significant difference between test and control groups, confirming the effectiveness of verbal suggestions associated with the DFA fader. And the authors picked up on an interesting point; results were far more significant for stimuli where vocals were masked by other instruments. This links the work to psychoacoustic studies. Not only is our perception of loudness and timbre influenced by the presence of a masker, but we have a more difficult time judging loudness and hence are more likely to accept the suggestion from an expert.

The authors did an excellent job of critiquing their results. But unfortunately, the full data was not made available with the paper. So we are left with a lot of questions. What were these scripted suggestions? It could make a big difference if the engineer said “I’m going to turn the vocals way up” versus “Let me try something. Does it sound any different now?” And were some participants immune to the suggestions? And because participants couldn’t see a fader being adjusted (interviews with sound engineers had stressed the importance of verbal suggestions), we don’t know how that could influence results.

There is something else that’s very interesting about this. It’s a ‘false experiment’. The whole listening test is a trick since for all participants and in all trials, there was never any loudness differences between the two presented stimuli. So indirectly, it looks at an ‘auditory placebo effect’ that is more fundamental than DFA faders. What were the ratings for loudness differences that participants gave? For the control group especially, did they judge these differences to be small because they trusted their ears, or large because they knew that loudness judging is the nature of the test? Perhaps there is a natural uncertainty in loudness perception regardless of bias. How much weaker does a listener’s judgment become when repeatedly asked to make very subtle choices in a listening test? There’s been some prior work tackling some of these questions, but I think this DFA Faders paper opened up a lot of avenues of interesting research.

AES Berlin 2017: Keynotes from the technical program


The 142nd AES Convention was held last month in the creative heart of Berlin. The four-day program and its more than 2000 attendees covered several workshops, tutorials, technical tours and special events, all related to the latest trends and developments in audio research. But as much as scale, it’s attention to detail that makes AES special. There’s an emphasis on the research side of audio topics as much as the side of panels of experts discussing a range of provocative and practical topics.

It can be said that 3D Audio: Recording and Reproduction, Binaural Listening and Audio for VR were the most popular topics among workshops, tutorial, papers and engineering briefs. However, a significant portion of the program was also devoted to common audio topics such as digital filter design, live audio, loudspeaker design, recording, audio encoding, microphones, and music production techniques just to name a few.

For this reason, here at the Audio Engineering research team within C4DM, we bring you what we believe were the highlights, the key talks or the most relevant topics that took place during the convention.

The future of mastering

What better way to start AES than with a mastering experts’ workshop discussing about the future of the field?  Jonathan Wyner (iZotope) introduced us to the current challenges that this discipline faces.  This related to the demographic, economic and target formatting issues that are constantly evolving and changing due to advances in the music technology industry and its consumers.

When discussing the future of mastering, the panel was reluctant to a fully automated future. But pointed out that the main challenge of assistive tools is to understand artistry intentions and genre-based decisions without the need of the expert knowledge of the mastering engineer. Concluding that research efforts should go towards the development of an intelligent assistant, able to function as an smart preset that provides master engineers a starting point.

Virtual analog modeling of dynamic range compression systems

This paper described a method to digitally model an analogue dynamic range compression. Based on the analysis of processed and unprocessed audio waveforms, a generic model of dynamic range compression is proposed and its parameters are derived from iterative optimization techniques.

Audio samples were reproduced and the quality of the audio produced by the digital model was demonstrated. However, it should be noted that the parameters of the digital compressor can not be changed, thus, this could be an interesting future work path, as well as the inclusion of other audio effects such as equalizers or delay lines.

Evaluation of alternative audio mixing interfaces

In the paperFormal Usability Evaluation of Audio Track Widget Graphical Representation for Two-Dimensional Stage Audio Mixing Interface‘  an evaluation of different graphical track visualization styles is proposed. Multitrack visualizations included text only, different colour conventions for circles containing text or icons related to the type of instruments, circles with opacity assigned to audio features and also a traditional channel strip mixing interface.

Efficiency was tested and it was concluded that subjects preferred instrument icons as well as the traditional mixing interface. In this way, taking into account several works and proposals on alternative mixing interfaces (2D and 3D), there is still a lot of scope to explore on how to build an intuitive, efficient and simple interface capable of replacing the good known channel strip.

Perceptually motivated filter design with application to loudspeaker-room equalization

This tutorial, was based on the engineering briefQuantization Noise of Warped and Parallel Filters Using Floating Point Arithmetic’  where warped parallel filters are proposed, which aim to have the frequency resolution of the human ear.

Thus, via Matlab, we explored various approaches for achieving this goal, including warped FIR and IIR, Kautz, and fixed-pole parallel filters. Providing in this way a very useful tool that can be used for various applications such as room EQ, physical modelling synthesis and perhaps to improve existing intelligent music production systems.

Source Separation in Action: Demixing the Beatles at the Hollywood Bowl

Abbey Road’s James Clarke presented a great poster with the actual algorithm that was used for the remixed, remastered and expanded version of The Beatles’ album Live at the Hollywood Bowl. The method achieved to isolate the crowd noise, allowing to separate into clean tracks everything that Paul McCartney, John Lennon, Ringo Starr and George Harrison played live in 1964.

The results speak for themselves (audio comparison). Thus, based on a Non-negative Matrix Factorization (NMF) algorithm, this work provides a great research tool for source separation and reverse-engineer of mixes.

Other keynotes worth to mention:

Close Miking Empirical Practice Verification: A Source Separation Approach

Analysis of the Subgrouping Practices of Professional Mix Engineers

New Developments in Listening Test Design

Data-Driven Granular Synthesis

A Study on Audio Signal Processed by “Instant Mastering” Services

The rest of the paper proceedings are available in the AES E-library.

The Benefits of Browser-Based Listening Tests

Listening tests, or subjective evaluation of audio, are an essential tool in almost any form of audio and music related research, from data compression codecs over loudspeaker design to realism of sound effects. Sadly, because of the time and effort required to carefully design a test and convince a sufficient number of participants, it is also quite an expensive process.

The advent of web technologies like the Web Audio API, enabling elaborate audio applications within a web page, offers the opportunity to develop browser-based listening tests which mitigate some of the difficulties associated with perceptual evaluation of audio. Researchers at the Centre for Digital Music and Birmingham City University’s Digital Media Technology Lab have developed the Web Audio Evaluation Tool [1] to facilitate listening test design for any experimenter regardless of their programming experience, operating system, test paradigm, interface layout, and location of their test subjects.

APE interface - Web Audio Evaluation Tool

Web Audio Evaluation Tool: An example single-axis, multiple stimuli listening test with comment fields.

Here we cover some of the reasons why you would want to carry out a listening test in the browser, using the Web Audio Evaluation Tool as a case study.

Remote tests

The first and most obvious reason to use a browser-based listening test platform. If you want to conduct a perceptual evaluation study online, i.e. host a website where participants can take the test, then there are no two ways about it: you need a listening test that works within the browser, i.e. one that is based on HTML and JavaScript.

A downloadable application is rarely an elegant solution, and only the most determined participants will end up taking the test if they can get it to work. A website, however, amounts to a very low-threshold participation.


  • Low effort A remote test means no booking and setting up of a listening room, showing the participant into the building, …
  • Scales easily If you can conduct the test once, you can conduct it a virtually unlimited number of times, as long as you find the participants. Amazon Turk or similar services could be helpful with this.
  • Different locations/cultures/languages within reach For some types of research, it is necessary to include (a high number of) participants with certain geographical locations, cultural backgrounds and/or native languages. When these are scarce nearby, and you cannot find the time or funds to fly around the world, a remote listening test can be helpful.


  • Limited programming languages For the implementation of the test, you are basically constrained to using web technologies such as JavaScript. For someone used to using e.g. MATLAB or C++, this can be off-putting. This is one of the reasons we aim to offer a tool that doesn’t involve any coding for most of the use cases.
  • Loss of control A truly remote test means that you are not present to talk to the participant and answer questions, or notice they misunderstand the instructions. You also have little information on their playback system (make and model, how it is set up, …) and you often know less about their background.

Depending on the type of test and research, you may or may not want to go ‘remote’ or ‘local’.
However, it has been shown for certain tasks that there is no significant difference between results from local and remote tests [2,3].

Furthermore, a tool like the Web Audio Evaluation Tool has many safeguards to compensate this loss of control. Examples of these features include

  • Extensive metrics Timestamps corresponding with playback and movement events can be automatically visualised to show when participants auditioned which samples and for how long; when they moved which slider from where to where; and so on.
  • Post-test checks Upon submissions, optional dialogs can remind the participant of certain instructions, e.g. to listen to all fragments; to move all sliders at least once; to rate at least one stimulus below 20% or at exactly 100%; …
  • Audiometric test and calibration of the playback system An optional series of sliders shown at the start of a test, to be set by the participant so that sine waves an octave apart are all equally loud.
  • Survey questions Most relevant background information on the participant’s background and playback system can be captured by well-phrased survey questions, which can be incorporated at the start or end of the test.
The Web Audio Evaluation Tool - an example test interface inspired by the popular MUSHRA standard, typical for the evaluation of audio codecs.

Web Audio Evaluation Tool – an example test interface inspired by the popular MUSHRA standard, typical for the evaluation of audio codecs.

Cross-platform, no third-party software needed

Source: Sewell Support

Listening test interfaces can be hard to design, with many factors to take into account. On top of that it may not always be possible to use your personal machine for (all) your listening tests, even when all your tests are ‘local’.

When your interface requires a third party, proprietary tool like MATLAB or Max to be set up, this can pose a problem as this may not be available where the test is to take place. Furthermore, upgrades to newer versions of this third party software has been known to ‘break’ listening test software, meaning many more hours of updating and patching.

This is a much bigger problem when the test is to take place at different locations, with different computers and potentially different versions of operating systems or other software.

This has been the single most important driving factor behind the development of the Web Audio Evaluation Tool, even for projects where all tests were controlled, i.e. not hosted on a web server, with ‘internet strangers’ as participants, but in a dedicated listening room with known, skilled participants. Because these listening rooms can have very different computers, operating systems, and geographical locations, using a standalone test or a third party application such as MATLAB is often very tedious or even impossible.

In contrast, a browser-based tool typically works on any machine and operating system that supports the browsers it was designed for. In the case of the Web Audio Evaluation Tool, this means Firefox, Chrome, Edge, Safari, … Essentially every browser which supports the Web Audio API.

Multiple machines with centralised results collection

Central server

Another benefit of a browser-based listening test, again regardless of whether your test takes place ‘locally’ or ‘remotely’, is the possibility of easy, centralised collection of results of these tests. Not only is this more elegant than fetching every test result with a USB drive (from any number of computers you are using), but it is also much safer to save the result to your own server straight away. If you are more paranoid (which is encouraged in the case of listening tests), you can then back up this server continually for redundancy.

In the case of the Web Audio Evaluation Tool, you just put the test on a (local or remote) web server, and the results will be stored to this server by default.
Others have put the test on a regular file server (not web server) and run the included Python server emulator script python/ from the test computer. The results are then stored to the file server, which can be your personal machine on the same network.
Intermediate versions of the results are stored as well, so that an outage of the test computer means the results are not lost in the event of a computer crash, a human error or a forgotten dentist appointment. The test can be resumed at any point.

Multiple participants using the Web Audio Evaluation Tool at the same time, at Queen Mary University of London

Leveraging other web technologies

Source: LinkedIn

Finally, any listening test which is essentially a website, can be integrated within other sites or enhanced with any kind of web technologies. We have already seen clever use of YouTube videos as instructions or HTML index pages tracking progression through a series of tests.

The Web Audio Evaluation Tool seeks to facilitate this by providing the optional returnURL attribute, which specifies the page the participant is redirected to upon completion of the test. This page can be anything from a Doodle to schedule the next test session, an Amazon voucher, a reward cat video, to a secret Eventbrite page for a test participant party.

Are there any other benefits to using a browser-based tool for your listening tests? Please let us know!

[1] N. Jillings, B. De Man, D. Moffat and J. D. Reiss, “Web Audio Evaluation Tool: A browser-based listening test environment,” 12th Sound and Music Computing Conference, 2015.

[2] M. Cartwright, B. Pardo, G. Mysore and M. Hoffman, “Fast and Easy Crowdsourced Perceptual Audio Evaluation,” IEEE International Conference on Acoustics, Speech and Signal Processing, 2016.

[3] M. Schoeffler, F.-R. Stöter, H. Bayerlein, B. Edler and J. Herre, “An Experiment about Estimating the Number of Instruments in Polyphonic Music: A Comparison Between Internet and Laboratory Results,” 14th International Society for Music Information Retrieval Conference, 2013.

This post originally appeared in modified form on the Web Audio Evaluation Tool Github wiki