Thoughts on Testing Methods

Tom Martin -- Wed, 12/23/2009 - 12:28

There are numerous critiques of the basic testing methodology we use for TAS, HiFi+, Playback and TPV in the audio realm. While these critiques can be and have been critiqued, I think it would be interesting to try some additional testing methods. After all, no test method is perfect. There is a logical problem here: a test method with low discrimination capability may seem to reveal something about the equipment or the listener when it is in fact revealing something about the test. With that caveat in mind, reasonable people should be able to look at any results not as definitive but as interesting input. As one user said "this is a hobby, it is supposed to be fun." Well, let's have some fun.
 
So, my question: can knowledgeable, thoughtful people help us craft a practical alternative test?
 
My understanding is that the core issue of interest is something like "open trial listening observations (what we normally do) may simply reveal a reviewer bias, not actual differences between components". A related idea is "many of the component tests using observational listening (again, what we normally do) describe differences between components that simply do not exist." So, could we devise a testable hypothesis and method to address this?
 
For example, we might state the testable hypothesis as "in a blind test, reviewers will not be able to distinguish between two components A and B with a 95% confidence level".
 
We would then do a test of, say, 20 trials with a reviewer. An assistant would have two test components -- A and B. The assistant would flip a coin, and if it comes up heads, insert A into the test setup while the reviewer is out of the room. When the coin comes up tails, the assistant would insert B into the rig. The assistant records the trial number and whether A or B was in circuit. The assistant leaves the room. The reviewer enters the room and listens to the rig. The reviewer can listen as long as he/she wishes to any music. When he/she has identified A or B, he/she records this next to the trial number. The reviewer then leaves the room. The assistant enters the room and flips the coin, etc. At the end, the number of correct and incorrect identifications of A or B is determined. If 15 or more correct answers are provided, the products under test are considered distinguishable at the 95% confidence level (14 of 20 falls just short, since pure guessing produces 14 or more correct about 5.8% of the time). In that case, the hypothesis would be considered incorrect.
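
As a rough check on the pass mark for a 20-trial test, here is a minimal sketch of the arithmetic. It assumes that under the hypothesis each trial is an independent 50/50 guess and that the test is one-sided (only a high number of correct answers counts as evidence). The function names are made up for illustration. Under those assumptions, guessing yields 14 or more correct about 5.8% of the time and 15 or more about 2.1% of the time, so 15 is the smallest score that clears the 95% level.

```python
from math import comb


def chance_probability(correct: int, trials: int) -> float:
    """Probability of at least `correct` right answers in `trials` by pure guessing (p = 0.5)."""
    favourable = sum(comb(trials, k) for k in range(correct, trials + 1))
    return favourable / 2 ** trials


def passing_score(trials: int, alpha: float = 0.05) -> int:
    """Smallest number of correct answers whose chance probability is at most `alpha`."""
    for correct in range(trials + 1):
        if chance_probability(correct, trials) <= alpha:
            return correct
    raise ValueError("no score reaches the requested confidence level")


if __name__ == "__main__":
    for score in (13, 14, 15, 16):
        print(score, "correct:", round(chance_probability(score, 20), 4))
    print("pass mark for 20 trials:", passing_score(20))
```

The same functions can be used to see how the pass mark moves if the number of trials changes, for example `passing_score(30)`.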
 
I don't know if this is the best hypothesis or the right test method. I'm asking for suggested improvements. I chose what is basically a single-blind approach because I can't see how to make this double-blind without introducing another piece of equipment that a) we don't have and b) could introduce additional problems into the system. But, I'm asking for input because I'm not an experimenter and science may have easy answers to these questions.
 
Thanks for any help you can provide.

ScottB (not verified) -- Wed, 12/23/2009 - 16:05

 Before getting into testing methodology per se, it's important to identify what it is you hope to do better than current testing methodologies. Otherwise, what's the point?

You've identified one deficiency of your current testing methodology, which is that it's subject to various kinds of unconscious bias. The "objectivist" answer to this is classic double blind testing, where all extra-aural cues are removed. For me, the problem with double blind testing, as applied to music reproduction systems, is that DBT can't distinguish between a negative result due to true inaudibility, and a negative result due to contextual factors - listener inexperience, unfamiliar source material, source material which doesn't highlight differences, and, especially, the vicissitudes of aural memory. As Bob Stuart of Meridian points out, once you've heard something in a recording, it's very difficult to forget it. The ear/brain system, unlike a test instrument, does not return to the same baseline following each test, and is thus unreliable when used in the same manner as a test instrument. That's why source material like pink noise, which minimizes the impact of short-term aural memory, is most sensitive in DBT tests.

Beyond the above issues, I think much (if not most) of the skepticism about what reviewers actually hear comes from assumptions about what can or cannot cause physically significant differences in sound reproduction. Components that seem to measure essentially identically under simplified test conditions are assumed to sound essentially identical, with perceived differences chalked up to reviewer bias. Devising technical measurements which more closely model component performance under real-life conditions is not especially difficult (see the speaker cable tests in this link: http://www.audiodesignline.com/howto/showArticle.jhtml?articleID=201807390). But such tests are rarely done anyway, resulting in lost credibility for reviewers and missed opportunities for correlating design characteristics to subjective results.

I get the impression from your post, Tom, that you're looking for ways to do this cheaply and easily. I'm afraid I doubt you're going to find a methodology which is simultaneously sensitive, credible, cheap, and easy.

Tom Martin -- Wed, 12/23/2009 - 18:30

Thanks for the reply.

To clarify, I wasn't actually hoping to do anything better than current testing methodologies beyond addressing the claim that much of what reviewers hear is observer bias and not real. That is, I'm not looking for a new methodology to replace what we do, since for our goals I think our approach provides the best benefit/cost ratio. I'm looking to add context to the current approach via an experiment; context that addresses some perceived issues with our approach.

[Another way of looking at this is that I think there is something like a Heisenberg uncertainty principle that applies to audio evaluation. The methods that are most like actual listening and which yield the most understandable data (high meaning) are subject to the lowest repeatability (or so it is said). As we drive for greater repeatability, we give up realistic situations and understandable results. That suggests that multiple approaches could usefully be combined -- hence an experiment to add context. For whatever reason, much of the argument about methods ignores this concept, though I note that most manufacturers in fact recognize the Heisenberg idea (you can't have both high meaning and high repeatability) in their approaches.]

So, in proposing some additional trials, the issue I'm trying to address is what I understand as the "objectivist" claim that some form of blind testing is practical (conceivable, executable, low cost, doable by our staff) and from time to time could help to answer certain questions (described above). If a test that is practical can't work at all, then that's another matter. But as I said, if there are simply some problems with DBT, why not see what happens anyway? My premise is that there are problems with all methodologies (a problem the methodology zealots want to sweep under the rug), but also that many methodologies produce useful information. If something like DBT isn't practical, well then the claim is wrong and I think we can just ignore it.

I believe you have another hypothesis, which is that perception is essentially useless in evaluating audio equipment, at least if forced to function as a test device. That is also something we might take on in another experiment. If I understand, this requires measurement, the practicality of which seems much tougher to me:

1. We have to create a complete theory of the measurements that are needed to characterize each piece of equipment in the audio chain.

2. We need a complete theory of how these measurements are weighted, how they interact, and how components interact.

3. We need the test equipment, test environment and knowledge to perform these tests.

4. We need a theory of how to explain the tests to readers in a way that they can relate to what they hear.

I am not aware of anyone who has developed and published 1, 2 and 4. We don't have 3. So, as you say, this isn't so easy. Lacking 1-4, we could do some ~random measurements to add context to what we do.

I didn't propose this initially because I confess I was at a loss as to why our "failure" to present random essentially uninterpretable quantitative information somehow damages our credibility. But re-conceiving this in the spirit of fun, I can see the virtues of the approach.

CEO and Editorial Director, Nextscreen LLC

ScottB (not verified) -- Wed, 12/23/2009 - 19:26

Thanks for the response, Tom.

I didn't intend to imply that perception is essentially useless in evaluating audio equipment - in fact, quite the opposite. I wouldn't have a system comprised of such components as the Berkeley AlphaDAC, Spectral amplifiers, and MIT cables if I didn't find my own perceptions to correlate well with those of TAS reviewers. What I was trying to say, not well I suppose, is that A/B testing may not be a good way of measuring the complex phenomenon of perception as it relates to listening to music.

Regarding objective measurement, I don't think you can develop the theories listed in your points 1, 2, and 4 without first just doing some measurements of the system as a whole, observing results, and beginning to posit reasons for what you see. And I don't know why you think I'm proposing that you present "random essentially uninterpretable quantitative information". Look at those simple speaker cable tests I posted in the link above (which required nothing more than standard signal generation and spectrum analysis equipment). From looking at those first three sets of graphs, you can determine the following:

1. The amplifier and speaker cables under test exhibited linear transfer functions while driving an 8 ohm test load. In other words, they behaved as essentially perfect "black boxes" when tested under standard simplified conditions.
2. When the amplifier and speaker cables under test were connected to a real loudspeaker load, they exhibited non-linear transfer functions. In other words, both the amplifier and the speaker cables created distortion - of a magnitude we would expect to be potentially audible - when connected to real loudspeakers.
3. The amplifier and speaker cable distortions both varied with the particular speaker cable under test.

It seems to me that a test like this would actually go quite some way in undermining overly-simplistic objectivist assumptions about the behavior of amps and cables in a real system, even before we had any way of correlating the results to subjective impressions. And curiosity would logically drive you to begin to find those correlations over time.

I have to say, I find the argument that "we don't have the right equipment" to be, well, exactly the kind of excuse which makes the objectivist crowd (rightly) suspicious. You're an audio publication, the equipment is not that expensive, your editor-in-chief is a recording engineer. I have to think if you're serious about this, you can make it happen. And somebody needs to make it happen, IMO.

Tom Martin -- Wed, 12/23/2009 - 22:06

Sorry, let me clarify. If the objective were to replace our test methods with measurement (one possible/likely implication of my apparently erroneous interpretation -- that perception is useless --  of your point), then we don't have the tools. This latter point is simply a statement of fact, not an excuse. In this situation, one would need a complete set of equipment to measure amps, preamps, speakers, digital devices, cables etc. I assumed this wouldn't be financially easy, because you said as much and based on the fact that we invested $40k just in test equipment for one type of component: displays. Of course, I don't know what the theories are, so I don't really know what test gear is required. In addition I don't know what knowledge is required to operate the gear (another expense).
 
OTOH, if we were only to test cables to address a specific hypothesis, I can see that that would be easier. However, if the tests have already been done, we could simply do a story about that. Seems like a good idea either way. I think the hypothesis you are working on is something like "Readers view the claim that reviewers can hear differences in cables as suspect because there are theories that cables do not differ electrically in any meaningful way and therefore they must perform essentially identically." The test would then be to show distortion measurements between cables under certain seemingly relevant conditions. If they differ significantly, hypothesis refuted. If not, hypothesis=? (because we don't know if the right measurement is being used).
 
That said, I'm not sure this addresses the issue with reviewers' observation bias, though it at least suggests there is another explanation and it seems useful as context. I'm also not sure there is really an issue with trying some blind testing (as you say, the issue with false negatives is there, but if you know that, you don't overinterpret the results).
 
So all that seems like progress, and I thank you for the idea.
 
But I don't think I'm with you on another front. I don't think our goal is to develop theories of how audio works. We aren't designing audio equipment. And I don't think our readers are trying to design audio equipment. So, if we need to develop complete theories of how audio design works, it would be because a) our current approach is pretty much useless (for the goal of helping readers come up with a short list of products to consider) and/or b) because measurement is a more efficient way of delivering the needed information to our readers. I don't believe a) is true, because reader experience says it isn't true. I'm skeptical that b) is true because I think we'd have a heck of a time coming up with the published theories needed, and if they haven't been published I imagine it is because they are really hard to create (they certainly would be for us). In addition, I think the theories would be stunningly complex, and so my point 4 would be a big problem. However, if it would be easy to come up with all the theories in published form, someone should be able to direct us to the source(s).
 
My point about "random, essentially uninterpretable quantitative information" is my rendition of what I think happens when we and readers don't have the complete theories I'm pointing to above. You said "such tests are rarely done anyway, resulting in lost credibility for reviewers". I agree that many people think that by not publishing some pieces of measurement data, we lose credibility. But I don't understand the logic, and I don't favor pandering to bad logic. If we (reviewers and readers) don't have the complete theory, we are then presenting bits of data from the partial theories we have and then trying to correlate bits with what we hear. Maybe that could work, and I'm interested in an explanation of how (I would then understand why what I thought was bad logic wasn't).
 
My experience is that people are exceedingly good at seeing correlations when they are weak or non-existent. My experience is that people confuse (bad) correlations with causation. This happens most often when the data presented are a small fraction of the data that any reasonable theory would require (e.g. simplification for "ease of reading"). It is amplified when parts of the theory are unknown, undeveloped or undiscovered. If the latter is the case, which I believe it is per my point above about published theory, then we're just presenting random data that the reader (and we) can't really interpret reliably.
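
To make the worry about weak or non-existent correlations concrete, here is a small simulation sketch. Everything in it is synthetic and hypothetical: it invents random "listening scores" for 8 components and 20 unrelated random "measurements", then reports the strongest correlation found by chance. For 8 data points the conventional 5% cutoff for a correlation is about 0.71, so with 20 unrelated measurements there is roughly a 64% chance (1 - 0.95^20) that at least one of them clears that bar purely by accident.

```python
import random


def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def strongest_chance_correlation(n_components=8, n_measurements=20, seed=0):
    """Largest |r| between random listening scores and unrelated random measurements."""
    rng = random.Random(seed)
    # Hypothetical listening scores, one per component (pure noise).
    listening = [rng.gauss(0, 1) for _ in range(n_components)]
    best = 0.0
    for _ in range(n_measurements):
        # Each "measurement" is also pure noise, unrelated to the scores.
        measurement = [rng.gauss(0, 1) for _ in range(n_components)]
        best = max(best, abs(pearson(listening, measurement)))
    return best


if __name__ == "__main__":
    for seed in range(5):
        print("seed", seed, "strongest |r|:", round(strongest_chance_correlation(seed=seed), 2))
```

The point of the sketch is only that the more measurements one tries against a small set of components, the more likely it is that one of them lines up with the listening impressions by luck.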
 
Despite that general concern, it is likely that there is some measured data that is simple and useful. We simply don't know what that is. Maybe this thread can help unearth that.
 
I'm further concerned that we're in a kind of logical loop. We want a few measurements that correlate with what we hear. Well, why do we need the measurements then? We could just report what we hear (which, no surprise, is what we do). My impression from listening to many readers is that we want explanation and validation. Fine, but then we don't need measurements for credibility, we need measurements to satisfy our souls.
 
Or we don't trust what we hear, but then why are we trying to find measurements that correlate with what we hear? If this is the case, we just need a theory of measurement, which I guess would be based on technical aesthetics? I don't know about you, but history says to me that that's a scary path.


ScottB (not verified) -- Thu, 12/24/2009 - 13:22

Tom,

I think we're talking past each other somewhat here, but a few points and clarifications:

1. It is not necessary, in fact it is exactly the wrong thing to do, to develop a bunch of theories about how things work before doing measurements. The whole point of measurements is that they do not require theories about the inner workings of the system, they only measure the system's behavior.

2. Measuring components in isolation is the approach usually taken, but that approach misses the complexity of systemic interaction. That was the point of the speaker cable measurement example, above - measuring cables in the context of a real world system revealed behavior that was hidden in the context of more isolated measurements. Note, there is no theory presented as to the reasons for differences between speaker cables, merely a validation that significant quantifiable differences exist. That is, as you say, useful in itself.

3. A/B tests are simply another form of measurement. I find this passage from your last post interesting:

"I'm further concerned that we're in a kind of logical loop. We want a few measurements that correlate with what we hear. Well, why do we need the measurements then? We could just report what we hear (which, no surprise, is what we do). My impression from listening to many readers is that we want explanation and validation. Fine, but then we don't need measurements for credibility, we need measurements to satisfy our souls."

Yes, of course, you can just assert that "we hear what we hear", and be done with it. Why bother with A/B tests then? I think the distinction between credibility and the need for validation is a distinction without a difference, in this case.

I think one of the reasons I'm pushing technical measurements is that I'm operating under an assumption that you are not, at least not yet. That assumption, based upon admittedly unscientific impressions garnered from reading about other, similar A/B testing of the kind you propose, is that those tests will come up negative (no statistically significant differences demonstrated) for most comparisons of non-electromechanical components. If that assumption is correct, you will then face a couple of choices: either you accept that many of the differences you report on are, in fact, imaginary, or you conclude that the testing methodology is insufficiently sensitive, and try to enhance and/or augment it. I'm already in the "enhance/augment" mode.

But who's to say I'm correct in my assumption? Please do try it and find out. And I should have said from the beginning, I truly applaud the effort to do something like this in the first place.

ScottB (not verified) -- Thu, 12/24/2009 - 13:26

And before I allow another misunderstanding to creep in, technical measurements are only one way to enhance/augment A/B testing regimens. You can, just for example, train yourself to identify (hear) specific kinds of distortion artifacts in the audio chain, and use source material specifically selected to highlight those artifacts. That's how Meridian designed the new digital filter in the 808.2 CD player RH liked so much.

Tom Martin -- Thu, 12/24/2009 - 16:33

Scott: thanks again for the reply.
 
Thanks also for the clarification on theories. At the most basic level, by requesting "theories" I am assuming one must have some guidance about what measurements matter. I'm calling that guidance a theory (or, if you prefer, a hypothesis). We could for example test the behavior of the system aerodynamically and determine its Cd. We could drop the component under test and determine its response to g forces. I'm guessing that you and others wouldn't suggest those tests, and I think the reason for that is that you have a theory (hypothesis) of what behaviors matter.
 
I would agree that we could set out to use intelligent hypotheses to develop a complete theory of how measurements can be related to what we hear. I'm just struggling to find the logic of doing so. My understanding was that if we wanted to do this, we would perform measurements based on our hypotheses. We would have trained listeners listen to the test systems. We would try to find measurements that correlate with what the trained observers hear. We would then have, I think, a set of tools that can be used by many people more expeditiously than the long and arduous effort required of trained listeners. I think you are also suggesting that we'd have a more repeatable process than using trained listeners (at least as far as the measurements are predictive). Those are both interesting goals for our work primarily if the time and effort to develop the complete theory is fairly low (i.e. practical). Or if the complete theory already exists (which is a distinct case from "develop a bunch of theories about how things work before doing measurements" because all that would already be done). For us to immediately use measurements to give readers useful information requires that the complete theory exist and be practically implementable.
 
I take it that you agree that the complete theory either doesn't exist or at least isn't published. So, you'd like us to try to create it. My hesitation is simply that we need a connection between the rumblin', stumblin' world of hypothesis testing and publishing. If readers would find that ugly, two-steps-forward, one-step-back activity interesting, that might work. Otherwise, as I said, we're in the land of random, uninterpretable data (at least for a while).
 
You suggest that I assert that "we hear what we hear". Not exactly. "We report what we hear" is a summary of our current position: trained listeners ("we") can observe the performance of components in systems with a good degree of accuracy ("what we hear") and they can describe this in useful ways to consumers ("report"). A good rough description of the first part is "You can train yourself to identify (hear) specific kinds of distortion artifacts in the audio chain, and use source material specifically selected to highlight those artifacts". The latter part (report) is surprisingly difficult, an area we spend a lot of time on, and one that rarely comes up in the discussion by readers because of the assumed primacy of issues with listener reliability, subjectivity etc.

The reason for an A/B test (a one time trial) is to address the issue of whether we just make up what we write. A/B might address the issue, it might not. I'm at a point now where it seems worth a swing despite the problems (the comments to the effect that "you're charlatans, you're on the take, you're on drugs, you're idiots, you're a jackass, you're a joke, you're lying, you're pimps" etc get old for the staff after a while). I also plan to implement a button that lets users insert "You're a greedy, slimy, charlatan, jackass, unethical, unprincipled, drug-addled, brain-dead, deaf dumb and blind April Fool's joke" in the toolbar in the editor for comments. That way certain readers can save time.

When I point to the logical loop, I am outlining the logic of your proposal (as I understand it) and pointing out that it seems to have as a benchmark the method we already use. We then develop another method to imperfectly correlate with that benchmark. Either I've got this wrong (entirely possible) or it is worth asking "why?" (as in "why would we work hard to create an imperfect way to do what we already do?"). As I said, efficiency could be one reason. Repeatability could be another. Fine reasons, if they are real and applicable to us.
 
Validation could be a sort of third reason. As I outline below, validation is a distinctive reason, because it is really a reason to do some measurement and to work toward a theory but doesn't require a complete theory (shared by reviewers and readers) ever.
 
In this latter case, validation is not credibility by a different name -- an imperfect method if understood should not be more credible than the benchmark. It could appear more credible because people are confused, but it is hard for me to spend money on maintaining that confusion to our benefit. Or, as I suggest based on reader feedback, it could be that people want something "hard" to help justify the accuracy of the description they've read. That is, they're human and they need a story to tell about why component A is good. The story doesn't have to be right, it just has to be believable. Measurements, in this case, provide the basis for a believable story -- that's what I'm calling validation.
 
You may be under the impression that I'm arguing against measurements. I'm not. I'm not arguing for them either. I don't care about measurements per se. I'm arguing for identifying a problem that we can understand and care about, that our current approach doesn't address and, in this case,  that measurements solve.
 
Finally, I didn't understand this sentence: "I think one of the reasons I'm pushing technical measurements is that I'm operating under an assumption that you are not, at least not yet." Could you try it with different words? Thanks.


ScottB (not verified) -- Thu, 12/24/2009 - 17:01

Tom,
 
I'm reminded of the old cliche about Britons and Americans being "divided by a common language", as you seem to think I'm implying all kinds of things I'm not trying to imply, and I seem to be making it worse. Perhaps it's best for me to just leave this thread to other commenters at this point.
 
To your question: "Finally, I didn't understand this sentence: "I think one of the reasons I'm pushing technical measurements is that I'm operating under an assumption that you are not, at least not yet." Could you try it with different words? Thanks."
 
What I meant is that I assume your A/B tests, at least as proposed, will not back up your reviewers' subjective impressions of audible differences between components in the electronics chain, in particular. You don't seem similarly skeptical. Technical measurements are just one way to augment those A/B tests, to show that differences of audible magnitude might exist, even though the A/B test didn't confirm them.

Tom Martin -- Thu, 12/24/2009 - 17:38

Actually, I'm trying to express my understanding of what you are saying or its implications. If I've got it wrong, feel free to correct me. This isn't the normal forum style of "attack the other guy relentlessly". My intent is to have a discussion to identify possible context enhancing things we could add to our test regime. In that spirit, I'm asking if you're implying (or if your logic implies) certain things.

To simplify the long post just above: "why do measurements?" I propose some reasons, but these might not be the best ones, and goals make all the difference. Upon further reflection, a fourth reason could simply be "measurements add more information."

As for the A/B test, I share your skepticism, I think. I don't think an A/B test at least as I outlined it will do much to get us a description or impression of the components under test. I figured we could do an A/B test just to see if we can tell the difference (identification rather than description) between two items (cables as you said seem to be a good choice). The false negative problem means it might not even do that, of course.
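
The false negative problem can be put in rough numbers. The sketch below is only an illustration under stated assumptions: a 20-trial test with a pass mark of 15 correct (the smallest score that pure guessing reaches less than 5% of the time), and a listener whose chance of being right on any single trial is a fixed, hypothetical "true hit rate". Under those assumptions, a listener who really does pick the right component 70% of the time would reach the pass mark only a little over 40% of the time, so a failed test would not show much.

```python
from math import comb


def detection_rate(true_hit_rate: float, trials: int = 20, passing_score: int = 15) -> float:
    """Chance that a listener who is right with probability `true_hit_rate` on each
    independent trial gets at least `passing_score` of `trials` correct."""
    return sum(
        comb(trials, k) * true_hit_rate ** k * (1 - true_hit_rate) ** (trials - k)
        for k in range(passing_score, trials + 1)
    )


if __name__ == "__main__":
    for hit_rate in (0.5, 0.6, 0.7, 0.8, 0.9):
        passes = detection_rate(hit_rate)
        print(f"true hit rate {hit_rate:.0%}: passes {passes:.0%} of the time, "
              f"false negative {1 - passes:.0%}")
```

Raising the number of trials (the `trials` and `passing_score` arguments) is the usual way to shrink the false negative rate, at the cost of a longer and more tiring session.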


MJW (not verified) -- Mon, 12/28/2009 - 14:36

First of all I commend Tom Martin for raising this issue. I feel however that discussions on how to improve 'testing' methods are pointless until everyone involved understands the purpose of the test. That is, we need a very clear definition of the question before we start to debate how to get the best answer. What exactly is the reviewer trying to do when they review, say, a power amplifier? What aspects of the sound are they considering: clarity, instrument timbre, imaging, soundfield, etc.? We need to recognize that no 2 people will hear exactly the same thing, nor can we really describe in a standardized way what we ourselves hear so that others fully comprehend. On top of that we all have personal biases as to which parameters are the most important to us. This last point is the one that drives all the arguments about objective vs subjective testing, double blinding etc. It is important to understand that ANY test that is based on human beings assessing sound is, by definition, a subjective test. The questions of test methodology then relate to the removal of bias (recognized or not) and the ability to convey conclusions in a repeatable, generally understood way. These two points I think are crucial when the number of people conducting reviews of a specific piece of equipment is low, usually just one!

What I find lacking in the review environment is any consistent audio standard for comparison, and virtually no attempt to quantify the effects described. For example, in loudspeaker reviews does 'a slight emphasis on the upper octaves vs the midrange' mean the same (quantitative) difference in a $100,000 pair of speakers as it does for a pair costing $1,000? What was the reference used by the reviewer in making the statement?

There are many scoring, or scaling, methods available for attempting to quantify subjective parameters: for example, the word descriptions of levels of pain on a 1-10 scale displayed in hospital emergency rooms, or a straight line with a smiley face at one end and a sad face at the other end. The subject simply makes a mark on the line indicating their response to the particular sensory challenge being evaluated. Such methods are usually used with groups of people to establish their validity. They can however be used for a single person, if that person is calibrated by evaluating a standard reference at various time points. So a reviewer could 'score' a power amp for several criteria. They would be calibrated by scoring a reference amp against the same criteria every, say, 3 months. Only if the reference scores were consistent over several scoring sessions would the scoring of a 'test' amplifier be judged valid.
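
The calibration idea above could be sketched roughly as follows. This is only an illustration: the criteria names, the scores, the minimum of three sessions, and the consistency cutoff (a standard deviation of at most 0.5 scale points) are all made-up assumptions, not an established standard. The sketch accepts a test score for a criterion only if the reviewer's repeated scores for the reference amp on that criterion were consistent, and then reports the test score relative to the reference average.

```python
from statistics import mean, pstdev


def reference_is_consistent(reference_scores, max_spread=0.5, min_sessions=3):
    """True if repeated scores of the reference amp vary by no more than `max_spread`
    (population standard deviation, in scale points). Both limits are arbitrary."""
    if len(reference_scores) < min_sessions:
        return False  # too few calibration sessions to judge consistency
    return pstdev(reference_scores) <= max_spread


def score_against_reference(reference_sessions, test_scores):
    """For each criterion, return the test score minus the reference average,
    or None when the reviewer's reference scores were not consistent."""
    result = {}
    for criterion, test_score in test_scores.items():
        history = reference_sessions.get(criterion, [])
        if reference_is_consistent(history):
            result[criterion] = test_score - mean(history)
        else:
            result[criterion] = None  # scoring judged not valid for this criterion
    return result


if __name__ == "__main__":
    # Hypothetical reference-amp scores from four quarterly sessions (1-10 scale).
    reference_sessions = {"clarity": [7, 7.5, 7, 7], "imaging": [4, 8, 6, 5]}
    test_scores = {"clarity": 8, "imaging": 7}
    print(score_against_reference(reference_sessions, test_scores))
```

In this made-up example the clarity scores for the reference are steady, so the test amp's clarity score is reported relative to them, while the imaging scores wander too much for the comparison to be treated as valid.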

I consider an attempt to collect objective performance on any audio item to be essential. However the extent to which that data should be collected by the reviewer is debatable. I do not believe that everything a person hears can be defined by quantitative, objective data, but if we don't make more and more objective measurements then the hifi world will not improve for any of us.

To summarize I see very restrictive practical limitations on the process of evaluating audio gear: a limited number of credible, experienced reviewers; a limited amount of money available for testing methodology of any type; a lack of absolute standards against which to judge new equipment; and an absence of generally accepted numerical assessment tools to bring more rigor to the process.

I think that a practical way forward would be to pick product categories for which there would be a limited number of evaluation parameters (say cables or power amps) and then discuss what methods might be useful to address each of those parameters. Such an approach would not preclude reviewers providing an holistic evaluation of the product, but would provide a more substantial basis for that assessment.
