What Can and Can't Reviewers Hear? Devising Added Context for Reviews

Tom Martin -- Sat, 01/02/2010 - 09:52

There are numerous critiques of the basic testing methodology we use for TAS, HiFi+, Playback and TPV in the audio realm. While these critiques can be, and have been, critiqued in turn, I think it would be interesting to try some additional testing methods.
 
Key point: the purpose of this is NOT to replace our current review methods, but to add some context to our methods.
 
After all, no test method is perfect, and no test method can tell us everything a user might want to know. In particular, there are concerns on the part of some users about the limitations of our approach.
 
"We report what we hear" is a simple summary of our current approach. In expanded form: trained listeners ("we") can observe the performance of components in systems with a good degree of accuracy ("what we hear") and they can describe this in useful ways to consumers ("report"). Training means "you can train yourself to identify (hear) specific kinds of distortion artifacts in the audio chain, and use source material specifically selected to highlight those artifacts". Reporting means that you work to learn and devise language that conveys the sound of equipment to users.
 
The goal of this approach is to provide objective information that helps users determine whether a piece of equipment might be of interest (might go on a short list for purchase). Our recommended approach is that interested users then further investigate by actually listening to equipment on their short list at a dealer and/or at home. Thus we try to leave the subjective bits, systems integration, and the value equation to the user.
 
One issue that has been raised about this approach is something like "open trial listening observations (what we normally do) may simply reveal a reviewer bias, not actual differences between components". A related assertion is "many of the reviews of components tested using observational listening (again, what we normally do) describe differences between components that simply do not exist."
 
So, could we devise a testable hypothesis and method to address this? The idea would be to do this once (or every so often) to see if there is evidence of this bias and to see if there is evidence that reviewers can discriminate among sounds the way they say they can. Thus, this is not a change in the way we test everything; it is context for the current review approach.

 
For example, we might state the testable hypothesis as "in a blind test, reviewers will not be able to distinguish between two components A and B with a 95% confidence level".

 
We would then do a test of, say, 20 trials with a reviewer. An assistant would have two test components: A and B. While the reviewer is out of the room, the assistant would flip a coin. If it comes up heads, the assistant inserts A into the test rig; if it comes up tails, the assistant inserts B. The assistant records the trial number and whether A or B was in circuit, then leaves the room. The reviewer enters the room and listens to the rig. The reviewer can listen as long as he/she wishes to any music. When he/she has identified A or B, he/she records this next to the trial number. The reviewer then leaves the room. The assistant enters the room and flips the coin, etc. At the end, the number of correct and incorrect identifications of A or B is determined. If 15 or more correct answers are provided, the products under test are considered distinguishable at the 95% confidence level. (By pure guessing, 15 or more correct out of 20 happens only about 2% of the time, whereas 14 or more happens nearly 6% of the time, which falls just short of the 95% level.) In that case, the hypothesis would be considered incorrect.
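As a rough check on the scoring rule, here is a minimal Python sketch of the procedure described above. It is an illustration only, not an agreed protocol: the function names are made up for this example, the coin flip is replaced by a software random choice, and the scoring assumes a one-sided exact binomial test in which a reviewer who cannot hear a difference is right half the time. Under those assumptions the arithmetic puts the 95% cutoff at 15 correct out of 20, since 14 or more correct would occur by guessing a little under 6% of the time.

```python
import random
from math import comb


def make_schedule(n_trials, seed=None):
    """Stand-in for the assistant's coin flips: one 'A' or 'B' per trial."""
    rng = random.Random(seed)
    return [rng.choice("AB") for _ in range(n_trials)]


def count_correct(schedule, answers):
    """Compare the assistant's record with the reviewer's answer sheet."""
    return sum(1 for actual, guess in zip(schedule, answers) if actual == guess)


def tail_probability(n_trials, n_correct):
    """Chance of getting at least n_correct right by pure guessing."""
    favourable = sum(comb(n_trials, k) for k in range(n_correct, n_trials + 1))
    return favourable / 2 ** n_trials


def min_correct_for_significance(n_trials, alpha=0.05):
    """Smallest score whose guessing probability is at or below alpha."""
    for k in range(n_trials + 1):
        if tail_probability(n_trials, k) <= alpha:
            return k
    return None  # too few trials to ever reach this confidence level


if __name__ == "__main__":
    trials = 20
    print("Cutoff for 95% confidence:", min_correct_for_significance(trials))
    for score in (13, 14, 15, 16):
        print(score, "or more correct by guessing:",
              round(tail_probability(trials, score), 4))
```

The same functions can be used to see how the cutoff moves if we run more or fewer than 20 trials. With very few trials, no score can reach the 95% level at all, which is one argument against shortening the test.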

 
I don't know if this is the best hypothesis or the right test method. I'm asking for suggested improvements. I chose what is basically a single-blind approach because I can't see how to make this double-blind without introducing another piece of equipment that a) we don't have and b) could introduce additional problems into the system. But I'm asking for input because I'm not an experimenter, and science may have easy answers to these questions.
 
So, my question: can knowledgeable, thoughtful people help us craft a practical contextual test or tests?

 
Thanks for any help you can provide.
 
 
 
 
 

Tom Martin -- Sat, 01/02/2010 - 10:05

One AVGuide user, ScottB, pointed out in another thread that there is a logical problem here: if a test method has low discrimination capability, its results may seem to reveal something about the equipment or the listener when they are in fact revealing something about the test itself. His suggestion, if I understood it correctly, was to add some instrumented measurements to the blind test proposed above. The idea here is that we would then have three tests:
 
1. Observational listening to components "A" and "B" (in a system, of course)
 
2. Blind ABX listening
 
3. Instrumented testing of A and B
 
I think ScottB's thought was that if test 1 and test 3 indicate differences between A and B, but test 2 does not, then we have some suggestion that test 2 may lack discrimination. Similarly, if test 1 and test 2 indicate a difference, but test 3 does not, then we have the wrong measurement, or so it would seem. 
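To put a rough number on the worry that test 2 may lack discrimination, here is a small Python sketch. It rests on assumptions that are not part of the proposal: a 20-trial session scored with a one-sided exact binomial test at the 95% level, and a hypothetical reviewer who genuinely hears the difference but identifies the component correctly only 70% of the time. The function names are invented for illustration.

```python
from math import comb


def prob_at_least(n_trials, n_correct, p_correct):
    """Chance of at least n_correct right answers when each trial is
    answered correctly with probability p_correct."""
    return sum(
        comb(n_trials, k) * p_correct ** k * (1 - p_correct) ** (n_trials - k)
        for k in range(n_correct, n_trials + 1)
    )


def pass_threshold(n_trials, alpha=0.05):
    """Smallest score a pure guesser reaches with probability <= alpha."""
    for k in range(n_trials + 1):
        if prob_at_least(n_trials, k, 0.5) <= alpha:
            return k
    return None


def power(n_trials, true_p, alpha=0.05):
    """Chance that a reviewer with hit rate true_p reaches the threshold."""
    k = pass_threshold(n_trials, alpha)
    return 0.0 if k is None else prob_at_least(n_trials, k, true_p)


if __name__ == "__main__":
    # 0.7 is a hypothetical hit rate chosen only for illustration.
    for n in (20, 50, 100):
        print(n, "trials: chance a 70% listener passes =",
              round(power(n, 0.7), 3))
```

Under these assumptions, such a reviewer would pass a 20-trial session less than half the time, so a single failed session says little by itself. Longer sessions raise the odds considerably, at the cost of more listening time and possible fatigue.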
 
My thanks to ScottB for this suggestion.

CEO and Editorial Director, Nextscreen LLC
