The audio engineering team here have done a lot of work on audio evaluation, both in collaboration with companies and as an essential part of our research. Some challenges come up time and time again, not just in terms of formal approaches, but also in terms of just establishing a methodology that works. I’m aware of cases where a company has put a lot of effort into evaluating the technologies that they create, only for it to make absolutely no difference in the product. So here are some ideas about how to do it, especially from an informal industry perspective.
– When you are tasked with evaluating a technology, you should always maintain a dialogue with the developer. More than anyone else, he or she knows what the tool is supposed to do, how it all works, what content might be best to use and has suggestions on how to evaluate it.
– Developers should always have some test audio content that they use during development. They work with this content all the time to check that the algorithm is modifying or analysing the audio correctly. We’ll come back to this.
– The first stage of evaluation is documentation. Each tool should have some form of user guide, tester guide and developer guide. The idea is that if the technology remains unused for a period of time and those who worked on it have moved on, a new person can read the guides and have a good idea how to use it and test it, and a new developer should be able to understand the algorithm and the source code. Documentation should also include test audio content, preferably both input and output files with information on how the tool should be used with this content.
– The next stage of evaluation is duplication. You should be able run the tool as suggested in the guide and get the expected results with the test audio. If anything in the documentation is incorrect or incomplete, get in touch with the developers for more information.
– Then we have the collection stage. You need test content to evaluate the tool. The most important content is that which shows off exactly what the tool is intended to do. You should also gather content that tests challenging cases, or content where you need to ensure that the effect doesn’t make things worse.
– The preparation stage is next, though this may be performed in tandem with collection. With the test content, you may need to edit it, in order that its ready to use in testing. You may also want to create manually create target content, demonstrating ideal results, or at least of similar sound quality to expected results.
– Next is informal perceptual evaluation. This is lots of listening and playing around with the tool. The goal is to identify problems, find out when it works best, identify interesting cases, problematic or preferred parameter settings.
– Now on to semi-formal evaluation. Have focused questions that you need to find the answer to and procedures and methodologies to answer them. Be sure to document your findings, so that you can say what content causes what problem, how and why, etc. This needs to be done so that the problem can be exactly replicated by developers, and so that you can see if the problem still exists in the next iteration.
– Now comes the all-important listening tests. Be sure that the technology is at a level such that the test will give meaningful results. You don’t want to ask a bunch of people to listen and evaluate if the tool still has major known bugs. You also want to make sure that the test is structured in such a way so that it gives really useful information. This is very important, and often overlooked. Finding out that people preferred implementation A over implementation B is nice, but its much better to find out why, and how much, and if listeners would have preferred something else. You also want to do this test with lots of content. If, for instance only one piece of content is used in a listening test, then you’ve only found out that people prefer A over B for one example. So, generally, listening tests should involve lots of questions, lots of content, and everything should be randomised to prevent bias. You may not have time to do everything, but its definitely worth putting significant time and effort into listening test design.
We’ve developed the Web Audio Evaluation Toolbox, designed to make listening test design and implementation straightforward and high quality.
– And there is the feedback stage. Evaluation counts for very little unless all the useful information gets back to developers (and possibly others), and influences further development. All this feedback needs to be prepared and stored, so that people can always refer back to it.
– Finally, there is revisiting and reiteration. If we identify a problem, or a place for improvement, we need to perform the same evaluation on the next iteration of the tool to ensure that the problem has indeed been fixed. Otherwise, issues perpetuate and we never actually know if the tool is improving and problems are resolved and closed.
By the way, I highly recommend the book Perceptual Audio Evaluation by Bech and Zacharov, which is the bible on this subject.