The difference between medical DBT's and audio DBT's

Keith_W -- Wed, 08/20/2008 - 10:41

First post! I got here after reading Robert Harley's invitation to participate in a recent issue of TAS.

I hear a lot from objectivists that medical DBT's are the standard of evidence, therefore we audiophiles should submit to DBT's as the best way to evaluate equipment. As a physician, it is my job to read and critique journals, so I am VERY familiar with the ins and outs of medical DBT's.

While I have no sympathy for the most hardcore objectivists, the very same "partisan hacks" Mr. Harley was ranting about, I believe that DBT's need to be applied intelligently before the results can be interpreted. But the way most audio DBT's are conducted, they are about as unscientific as the subjectivists they are trying to debunk.

As pointed out in another post by Jonathan Valin - medical DBT's are massive. They involve thousands of patients with strict entry criteria (the disease being studied is strictly defined, you cannot have other medical conditions which may interfere with data interpretation, you must be of a certain age, and so on). These studies are carefully designed, take months or years to complete, months to analyze, and then months for the peer review process and finally publication.

Audio DBT's are not. We never know how sophisticated the listening panel is, whether they know what to look for, and whether individual variations in hearing, perception, and chronic diseases which may affect hearing have been identified and controlled for. We do not know if the test material (music) being played is familiar to the listener. We don't know if non-verbal cues (which can be used to lead AND mislead) are present. And finally, the evaluation period is all too brief. We all know that it can sometimes take weeks of listening to material we are familiar with, on systems we are familiar with, before we get to know the effect of a particular change. How are we expected to identify the changes in such a short period of time, and in an unfamiliar system?

The lack of identification of potential sources of bias and the lack of scientific rigour expose most DBT proponents as sham merchants keen to offer scientific window dressing on a testing methodology which is of limited use.

The next difference is this - in a medical DBT, we know what we are looking for. The trials are designed to demonstrate the primary or secondary endpoint. For example - if a drug is purported to reduce the incidence of stroke, the primary endpoint would be the number of new strokes per year in the treatment and control groups. The study design would identify other stroke-reducing drugs in both groups, and specify what is permissible and what isn't. We don't just take a new drug, give it to 5,000 patients and give placebo to 5,000 controls, and look at both populations to see what happens.

In an audio DBT, what is the primary endpoint? Does the testing panel even know what they are supposed to be looking for? Is there a score sheet that says "image width was xxx meters" or "frequency response: skewed to bass or treble"? Well, there isn't. You are expected to notice a difference, whatever it is, and then use that as a basis for comparison.

Another point concerns the size of the sample. In medical DBT's, we often do a power calculation before we start recruiting. A power calculation tells us how many subjects we need to recruit before the DBT becomes statistically meaningful. For example, a dose of 10,000 grays of radiation needs only a small sample size to demonstrate its harmful effect. But what about much smaller doses, like the radiation you get from a chest X-ray? You need to know how many people to study before you even begin the study.
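As an illustration of the power calculation described above, here is a minimal sketch using the standard normal-approximation formula for comparing two proportions; the stroke-incidence figures are invented for illustration, not taken from any actual trial:

```python
# Sketch: subjects needed per arm to detect a drop in annual stroke
# incidence from 4% (control) to 3% (treatment) with 80% power at a
# two-sided alpha of 0.05. The incidence figures are assumptions.
from scipy.stats import norm

def sample_size_two_proportions(p1, p2, alpha=0.05, power=0.80):
    """Normal-approximation sample size per group for two proportions."""
    z_a = norm.ppf(1 - alpha / 2)   # critical value for a two-sided test
    z_b = norm.ppf(power)           # quantile for the desired power
    p_bar = (p1 + p2) / 2
    num = (z_a * (2 * p_bar * (1 - p_bar)) ** 0.5
           + z_b * (p1 * (1 - p1) + p2 * (1 - p2)) ** 0.5) ** 2
    return int(num / (p1 - p2) ** 2) + 1

print(sample_size_two_proportions(0.04, 0.03))  # roughly 5,300 per arm
```

A 1% absolute risk reduction already demands thousands of patients per arm, which is exactly why these trials are massive.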

So what sample size is needed to demonstrate a relatively subtle audio tweak, such as the effect of various interconnects? Nearly everyone except the most naive listener can hear differences between loudspeakers, so you can get away with a small sample size. How many people do you need to test to demonstrate the difference between 0.01% THD and 0.02% THD? Most audio DBT's involve maybe 5-10 listeners. Is this enough to demonstrate the difference?
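To put numbers on that question, here is a minimal sketch under assumed figures (a panel of 10, one forced choice each, and a subtle difference that 70% of listeners can genuinely hear):

```python
# Why a small panel biases toward the null: 10 listeners, one forced
# A/B choice each (chance = 50%); assume 70% can really hear the
# difference. Both figures are illustrative assumptions.
from scipy.stats import binom

n, p_chance, p_true = 10, 0.5, 0.7

# Smallest number of correct answers significant at p < 0.05:
k_crit = next(k for k in range(n + 1) if binom.sf(k - 1, n, p_chance) < 0.05)
power = binom.sf(k_crit - 1, n, p_true)  # chance the panel detects the effect
print(k_crit, round(power, 2))           # 9 correct needed; power is ~0.15
```

Under these assumptions the panel needs 9 of 10 correct to reach significance, and it would miss a real, 70%-audible difference about 85% of the time.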

Despite the medical DBT being held up as the gold standard, I can tell you that many of them are either uninterpretable or poorly generalizable because of various failings in study design, study sample, and so on. Many of them tweak the statistics to make the differences seem more impressive than actually measured. Of course, it is possible to tweak the stats in the opposite direction, for example to minimize the number of adverse outcomes. Simply redefine your endpoint, and there you go. I also know enough about these academics to know that their motives are not always pure and that there may be all kinds of conflicts of interest.

I should also say: in audio DBT's, there is a strong bias towards the null hypothesis (that intervention X made no difference). Medical DBT's would be the same - if they were as poorly designed as audio DBT's. In fact, there have been a number of medical DBT's that have shown the null hypothesis when all of us in clinical practice know, anecdotally, that the intervention makes a difference with our patients. In such cases, I ignore the study and tell people that I know that xxx works. Eventually another DBT may come along that changes the conclusion. It happens all the time!

In the end, you may call me a "super-objectivist". I think that audio DBT's have their place, but only well designed ones. Most DBT's I read about are absolutely pathetic.

Robert Harley -- Wed, 08/20/2008 - 13:13

That was an extremely thoughtful and insightful observation that provides a fresh perspective on the debate.

I'd like to add that in a medical DBT, the subject isn't asked to perform any tasks; he just goes about his daily life. In an audio DBT, the subject must perform a task under conditions that are very different from what he's used to.

Thanks so much for contributing to the discussion.

Keith_W -- Thu, 10/23/2008 - 03:53

You're welcome Robert.

Here are a few more thoughts. In medical DBT's, the outcomes measured are objective. Some examples of things that can be measured directly:

- Blood pressure and heart rate
- Kidney function (creatinine)
- Hormone levels
- Expression of tumour markers (e.g. PSA, CEA, CA 19-9, β-hCG)
- Anemia, white cell count, platelet count
- etc.

Or, you could measure outcomes:

- Survival - does this drug make you live longer? Or does one study group have fewer deaths at 5 years than the other group?
- Does the study group have fewer heart attacks or strokes than the control group?
- Does the study group have fewer patients who progress to dialysis than the control group?
- etc.

Studies of subjective phenomena are usually vague and difficult to interpret. Psychometrics is the field concerned with the measurement of psychological and psychiatric phenomena, and it is a minefield. It is easy enough to measure some phenomena, such as memory - but even then, such measurements may be subject to daily variation. How do you measure happiness or depression? Or psychosis? Can you correlate one person's experience with another's?

As it turns out, you can - but these measurements rely on self-reporting and are themselves validated against other tools. For example, here are a few questions from the PHQ-9 depression score:

Quote: Score 0 ("Not at all") to 3 ("Nearly every day") for the frequency of the following symptoms over the last two weeks:

- Little interest or pleasure in doing things (0 1 2 3)
- Feeling depressed or hopeless (0 1 2 3)
- Insomnia or hypersomnia (0 1 2 3)
- Feeling tired or lacking energy (0 1 2 3)
- etc
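The scoring itself is mechanical, which is the point: the subjectivity is confined to the answers, not the arithmetic. A minimal sketch of this kind of scoring (the severity bands are the standard published PHQ-9 cut-offs; the example answers are invented):

```python
# PHQ-9 style scoring: nine items, each 0-3, summed and mapped to the
# standard severity bands (0-4 minimal, 5-9 mild, 10-14 moderate,
# 15-19 moderately severe, 20-27 severe).
def phq9_severity(item_scores):
    assert len(item_scores) == 9 and all(0 <= s <= 3 for s in item_scores)
    total = sum(item_scores)
    bands = [(4, "minimal"), (9, "mild"), (14, "moderate"),
             (19, "moderately severe"), (27, "severe")]
    return total, next(label for cutoff, label in bands if total <= cutoff)

print(phq9_severity([1, 2, 1, 2, 0, 1, 1, 0, 0]))  # (8, 'mild')
```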

If you are an old hand at reading medical DBT's, you will find that psychiatric papers have wider confidence intervals and are generally more difficult to interpret than papers which measure objective phenomena.

One point I should make is that if the tools are not properly validated, the study will exhibit a strong bias towards the null hypothesis.
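Psychometrics has a standard formula for this effect, Spearman's correction for attenuation: unreliable measurement shrinks observed correlations toward zero,

$$ r_{\text{observed}} = r_{\text{true}} \sqrt{r_{xx}\, r_{yy}} $$

where $r_{xx}$ and $r_{yy}$ are the reliabilities of the two measures. Assuming, for illustration, a rating tool with test-retest reliability of 0.5 and a perfectly reliable criterion, a true correlation of 0.6 would be observed as only about 0.6 × √0.5 ≈ 0.42, pushed toward the null.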

Needless to say, audio DBT's are subjective tests. They are nowhere close to even the least sophisticated psychiatric DBT (or, to use medical speak, RCT = Randomized Controlled Trial). At least those RCT's attempt to break down and quantify the subjective phenomena being measured. At least you know whether you have less pain now than you did before you took the study medication. Even then - are you sure?

What is the audio equivalent of the medical psychometric test? Are we going to develop tools that will allow us to quantify the phenomena we are hearing?

My point is this - DBT's are no more objective than unblinded listening tests. The only difference is that the bias is different - one method produces a strong negative bias, the other a strong positive bias.

I wish that the DBT advocates would just realize this and shut up. Lose the pretense and the smug, self-satisfied, morally superior attitude. Or, if they want to do REAL science, then maybe they can start developing tools (along the lines of psychometric tests) and start measuring subjective phenomena objectively.

Robert Harley -- Thu, 10/23/2008 - 13:21

Thanks again for the interesting post. Of particular interest was the observation that if the test has poor controls, there's a bias toward the null hypothesis.

In my editorial in Issue 183, I mentioned the Audio Engineering Society paper that concluded listeners couldn't hear the difference between high-resolution digital audio and CD-quality digital audio. The paper was written by two long-time critics of audiophiles, Brad Meyer and David Moran.

I've been listening to high-resolution (176.4kHz/24-bit) versions of some Reference Recordings titles (off a server) and comparing them with the CDs of the same titles. How anyone could conclude that these two sound "indistinguishable" is beyond belief.

My full report appears in the January issue.

Thanks again, Keith, for sharing your unique insight and perspective on this debate.

Keith_W -- Sat, 10/25/2008 - 20:14

Thanks Robert - but I should clarify what I said. If the test is poorly validated it may have insufficient distinguishing power to differentiate between the test and the control - THUS exhibiting a bias towards the null hypothesis.

Earlier this year, a study was published in the Proceedings of the National Academy of Sciences (PNAS) involving a double-blind test of wine with functional MRI (fMRI) of the brain. Read the article yourself, but essentially they took 20 volunteers, fed them wine via a straw, and did not let them know whether they were tasting a $10 wine or a $90 wine. When they were told they were drinking the $90 wine, the fMRI recorded higher levels of activity in the part of the brain associated with pleasure. This happened regardless of which wine they actually drank.

Some have taken this as proof of the human brain's propensity for suggestibility. I am in no way disputing that this is true, but you only need to look at the "methods" section of that paper to see the problems.

The volunteers were university students and staff who "enjoyed wine" or "had some experience drinking wine". Because MRI machines require you to lie supine with no metal objects nearby, the researchers fabricated a drink delivery system consisting of polypropylene tubing which delivered 1 mL of wine per dose. Not to mention that the tasters had to taste the wine lying supine (thereby changing the normal tasting process, which also involves smell).

As a friend of mine said - "if you are an unsophisticated wine drinker lying in a culvert, being fed minute quantities of polypropylene tainted wine in an unnatural drinking position, and with people on the other side of a glass screen growling at you through a loudspeaker - you will exhibit a preference for the wine you were told is better. Big surprise".

This is an extreme case of the test having insufficient distinguishing power. I think DBT's in audio are possible, as long as we approach audio DBT's with the same exacting standards that we do in any branch of science. If the test fails to demonstrate any difference, we should quite rightly question the test as much as we question the results. THIS is how science is done, something the DBT crowd seems to miss.

Tom Martin -- Sun, 10/26/2008 - 21:03

Keith -- thanks for the multiple interesting perspectives on this subject.

I'd like to suggest another interpretation of what you've said: audio DBTs are sufficiently hard to do well that, in practice, audio magazines will probably not do them in a meaningful way. I don't know what it costs to run a clinical medical trial or how long it takes, but my impression is that the cost is vastly higher than what we spend testing a piece of audio gear ($2000-$3000), and that it takes far longer than the time we have (1 week to 3 months).

Given that, I would add that I don't think the circumstances or the purpose of audio testing match the intent of DBTs as I understand the proposal for audio.

1. Our (TAS/Playback/HiFi+) intent is to objectively describe what a very experienced listener hears when a piece of equipment is used primarily in one pretty well-understood system. How does this work for the consumer? If the attributes sound desirable within the context of the consumer's objectives for his/her stereo, then we suggest that the consumer listen to that equipment at a qualified dealer or try it at home. If he/she hears what we heard, and it seems valuable, he/she can choose to buy the equipment in question. We're not dealing with cancer here, where you have to pick one course of treatment and it had better work to the maximum degree possible. We're dealing with an ultimately subjective, system-dependent experience which the consumer can test for himself.

2. There seems to be a concern about whether reviewers actually hear what they say they hear (or, I guess, whether they're hallucinating). If that is a concern, the consumer can simply log the descriptions of reviewers and then listen to the equipment in question. This isn't any different from what you'd do if a friend recommended a car or a restaurant. If you don't hear what the reviewer heard (felt while driving, tasted when eating), then you don't have to follow his/her advice. This could happen because the reviewer is wrong, or because you don't hear the same way, or because circumstances are different. But there is a practical way to deal with the issue that doesn't require DBTs and all their complications. And, we think, the practical way is better anyway when you take into account the impossibility of doing DBTs on a statistically valid, representative sample of systems.

3. I think from what I've read that there is a political agenda surrounding audio DBTs. If one believes that expensive gear is a problem (for whatever reason -- there are many), then one might wish to use DBTs to prove that differences described between expensive and inexpensive gear are not real. But practically (for the consumer) if one believes expensive gear is problematic, a simpler solution is just to avoid it. There is plenty of inexpensive gear that is very satisfying (particularly if you hold the belief that there is no difference).

I do think audio DBTs would be desirable for consumers in one practical circumstance: for those consumers who do not (readily) hear the differences between pieces of gear, but wish to allay the anxiety of not having "optimal" gear, DBTs might provide the evidence needed to choose one piece of gear over another (and to be happy with it). A related need could be for those consumers who do hear differences (whether or not they correlate with reviewers' writings) but who have a sort of epistemological problem with direct experience. That is, some readers may not trust human experience without the imprimatur of scientific methodology to back it up.

In addition to DBTs, quantitative measurements could serve these needs as well. If enough readers felt one of these offered real value, we would consider it (still might not be affordable). We don't have any religion about DBTs or quant measurements. We do have logical and practical objections, so I could go on at length about why this consumer need (if it is one) is hard to warm up to, but we could figure out a way -- it is our job.

Tom

CEO and Editorial Director, Nextscreen LLC

Keith_W -- Sat, 11/01/2008 - 09:06

Thanks for your reply, Tom.

The cost of medical randomized control trials is stupendous, especially for a multi-center trial attempting to recruit thousands of patients. The "gold standard" type of trial - a prospective, placebo-controlled, double-blind, randomized controlled trial - takes years to complete, costs several hundred thousand dollars (if not millions), and involves an army of researchers, statisticians, pharmacists, doctors, trial nurses, etc.

This is but one reason why audio DBT's cannot compare with medical DBT's. And one reason why I think people who say that the two are equivalent are being somewhat disingenuous.

You are dead right when you say that the agenda is political. I moderate and partly own an audio forum (Stereo Net Australia), and I have seen these discussions take place there, as on countless other audio forums. The proponents are the same as anywhere else in the world - they accuse high-enders of being elitists and try to drag them down by "proving" that their $100 secondhand CD player is indistinguishable from a $20k CD player. If you assemble the right crowd and put them in front of the right system under double-blind conditions, this can be done.

However, this is not to say that I do not believe that an audio DBT can be designed and conducted in a meaningful way. To start, we would have to develop some tools, and then validate these tools to determine the resolving power. We would have to take into account the hundreds of thousands of different systems out there, and the ability of their owners to discern and report the changes.

It is a minefield. And personally, I don't care. After all, a hobby is a hobby and I have no intention to make it more like work :)

Arny Krueger (not verified) -- Fri, 12/26/2008 - 08:01

Keith_W wrote:
"The proponents are the same as anywhere else in the world - they accuse high-enders of being elitists and try to drag them down by "proving" that their $100 secondhand CD player is indistinguishable from a $20k CD player. If you assemble the right crowd and put them in front of the right system under double-blind conditions, this can be done."
The above is a clear case of presenting supposition as fact.
One need not be waging class warfare to find the question of price versus performance of audio components to be intellectually stimulating.
Obviously, hand-picking a panel of dolts listening via a boom-box would be very counter-productive to answering such a question, and would completely defeat the hoped-for intellectual stimulation.
A little logic and reason, please?
 

Tom Martin -- Fri, 12/26/2008 - 11:45

Arny -- I believe there is ample reason being applied here (read this thread and others in this forum), though feel free to highlight any gaps in our thinking.
 
All methodologies have problems -- DBTs and others. To describe in a useful way the differences in performance between equipment at different price points can be done with various approaches, and is certainly worthwhile (we have worked hard over the last several years to review more lower priced gear to help address this issue -- including launching a magazine devoted solely to more affordable equipment).  That said, certainly there is room for improvement in how we get at price vs. performance (or really performance vs. performance when price differs).
 
But there is a stream of assertion in some threads on this forum to the effect that DBTs must be used for this endeavor, and we are trying to understand what the driving force is behind that assertion. Since DBTs have some troubling flaws, it would appear that the driving force behind using them might be:
 
a. for some people the simple assumption that direct comparison of equipment is a superior method (intuitively appealing, but actually problematic)
 
b. that reviewers have a bias toward more expensive gear and that bias causes them to hear phenomena that aren't there (an assertion, but possible; DBTs would seem to address this but are costly and thus we'd like to avoid them if this issue isn't real)
 
c. the desire to exploit the limitations of DBTs to make a point about expensive gear being an emperor sans clothes
 
Much of the writing in this forum by proponents of DBTs at some point mentions issue b or issue c (or both). Since those issues seem to derive from an agenda re: expensive gear (though clearly if you read the posts, "expensive" is defined by the writer not by some external standard), our observation that there is a political agenda here is pretty logical (it is a hypothesis, but it is empirically inspired not random).
 
Why does this seem to be an agenda to those of us who work in this field every day?  Among other reasons, our findings using professional listeners doing objective observations are:
 
1. Some lower priced pieces of gear perform as well as or better than higher priced pieces of gear. In general, the best lower priced equipment outperforms at least some higher priced equipment, including equipment that costs 3X or 5X what the lower priced gear does. Some of this is because not all engineers are equally good, and some of it is because some engineers (and companies) are not trying to deliver the kind of performance we seek.
 
2. If you have a budget of $X, and you give one of our reviewers $2X, $5X or $10X as a budget, that reviewer can almost always find a piece of gear that outperforms anything you can buy for $X (this gets harder as X gets big). This seems reasonable: we live in a technologically constrained world that poses well-known tradeoffs, and the industry is full of good engineers; if one good engineer faces fewer constraints (has a bigger cost budget) than another good engineer, chances are he/she can create a better product.
 
3. The differences between gear at $X and gear at say $5X are easily noticed but often not dramatic (i.e. subtle). At the same time subtle differences are often meaningful to music lovers.
 
These findings make sense to us, given what we know about the world and our desire to describe for readers what different equipment does. The assertion that DBTs must be used seems to come with the assertion that these findings are wrong (particularly point 2). Since the findings have been repeated so many times and they make sense in the real world, a reasonable hypothesis is that there is an agenda at work unrelated to accurately describing what equipment does.
 
There are other possibilities as well, so this short note is simply to point out the effort that has been applied to reasoning through this issue.

CEO and Editorial Director, Nextscreen LLC

Arny Krueger (not verified) -- Sat, 12/27/2008 - 12:43

"Arny -- I believe there is ample reason being applied here"
If the amount and type of reason were ample, then reason would converge on a single answer and stimulate positive action, not just more talk.

"All methodologies have problems -- DBTs and others."
That's a truism. This is typical of the weak forms of reason that are all too common in this kind of discussion.
In fact all methodologies have problems, but some methodologies are far more effective than others. Cutting to the chase, sighted evaluations have acute problems with false positives when differences are less than obvious. Blind evaluations must always address the possibility that they are generating false negatives and therefore are ineffective. The false positive problem with sighted evaluations is known to be due to the inherent properties of human perception. The false negative problem with blind evaluations has been effectively addressed for over a decade.

The other example of weak reasoning is the overall approach embodied in the post that I'm replying to. What I see is an attempt to divide and conquer, except that in the end nothing gets conquered. So it is just another wasted effort to divide without a positive outcome.

"To describe in a useful way the differences in performance between equipment at different price pointscan be done with various approaches"

All of the effective descriptions will agree about the basic outcome of the evaluation.

"and is certainly worthwhile (we have worked hard over the last several years to review more lower priced gear to help address this issue -- including launching a magazine devoted solely to more affordable equipment). That said, certainly there is room for improvement in how we get at price vs. performance (or really performance vs. performance when price differs)."

AFAIK, any magazine that relies solely on classic journalistic means (merely conveying news, descriptive material, and comment, as opposed to also doing objective technical evaluations) to do product evaluations has been out of date for going on 50 years.

For example, Stereophile and its predecessor The Audio League Reports, historic advocates of subjective evaluation of audio products, have been doing objective measurement-based technical product evaluations for over 50 years.
IOW, in the 21st century there is now enough scientific enlightenment in the audience for mainstream and high-end entertainment technology products that publishing articles based on just subjective evaluations, with their inherently questionable reliability, will result in a large number of skeptical responses and ultimately a declining audience.

“But there is a stream of assertion in some threads on this forum to the effect that DBTs must be used for this endeavor, and we are trying to understand what the driving force is behind that assertion.”

The driving force for DBT technology as applied to subjective evaluations has always been a scientifically-based understanding of human perception. One thing that has changed in the last 40 years that I’ve been involved with audio DBTs is the increasing degree to which human perception, and how it applies to audio, has become understood in the audio product marketplace. The world educational system, including both formal educational institutions and the popular press, is promoting scientific understandings of most aspects of life, including human perception, that are inherently critical of many high-end audio publications.

“Since DBTs have some troubling flaws,”

I thought you said that “All methodologies have problems”. Why do you not also admit that sighted evaluations have debilitating flaws that have never been effectively addressed or corrected by any high-end audio publication? Many people in your marketplace seem to be telling you that they know this to be true.

“it would appear that the driving force behind using them might be:”

Note that no item in this list concedes the well-known truth that sighted evaluations have severe flaws that are neither addressed nor corrected in any high-end audio publication to date.

“a. for some people the simple assumption that direct comparison of equipment is a superior method (intuitively appealing, but actually problematic).”

Ignores the fact that a proper test always includes reliable comparison to a generally-accepted standard. In the case of audio electronics, the generally-accepted standard of performance (since the writings of Villchur in the late 1950s) has been the “straight wire with gain”. That standard can be readily applied to most audio components other than loudspeakers.

Audio has been strangling for decades on the mistaken idea that evaluating audio gear is like tasting wine. In fact, audio equipment evaluation is like picking bottles or wine glasses. The musical performance is the wine; the equipment is the glass.

“b. that reviewers have a bias toward more expensive gear and that bias causes them to hear phenomena that aren't there (an assertion, but possible; DBTs would seem to address this but are costly and thus we'd like to avoid them if this issue isn't real)”

Ah, John Atkinson’s lame defense that proper DBTs are too expensive for cash-strapped high end audio magazines to ever use makes its appearance on TAS’s web site. It is out of character for a communications service that claims to be able to effectively judge high end audio products to plead poverty when it comes to evaluation techniques.

“c. the desire to exploit the limitations of DBTs to make a point about expensive gear being an emperor sans clothes”
There’s no question that the continued one-sided coverage of obsolete ideas about the alleged limitations of DBTs, and the unmitigated glorification of the inherently-flawed sighted procedures that the high-end audio press still worships, are an irritant to people with contemporary educations in science.

“Much of the writing in this forum by proponents of DBTs at some point mentions issue b or issue c (or both). Since those issues seem to derive from an agenda re: expensive gear (though clearly if you read the posts, "expensive" is defined by the writer not by some external standard), our observation that there is a political agenda here is pretty logical (it is a hypothesis, but it is empirically inspired not random).”

I think it is time for high-end audio writers to realize that writing about class warfare from the perspective of “haves” living in luxury isn’t very politically correct in the 21st century.

These days luxury is about performance and reliability, not wasteful opulence and spending money for the sake of spending money. Not many people are interested in luxury cars that get passed on hills by econo-boxes. Clearly there exists a belief in many people’s minds that audio’s “luxury saloons” can be passed or at least kept up with by econo-boxes. Right now, the only thing that people are being provided with that relates to this topic by the high end press is poetry and unsupported assertions.

“Why does this seem to be an agenda to those of us who work in this field every day? Among other reasons, our findings using professional listeners doing objective observations are:”

“1. Some lower priced pieces of gear perform as well as or better than higher priced pieces of gear. In general, the best lower priced equipment outperforms at least some higher priced equipment, including equipment that costs 3X or 5X what the lower priced gear does. Some of this is because not all engineers are equally good, and some of it is because some engineers (and companies) are not trying to deliver the kind of performance we seek.”
This ignores a relevant truth, which is that there is no way to obtain improved sonic accuracy once sonic transparency has been achieved. If a $35 optical player plays CDs with perfect sonic transparency that is indistinguishable from the proverbial straight wire, then no more expensive product can give you more accurate sound. Obviously the issue of whether or not the inexpensive product’s performance can be distinguished from the ideal must be settled by credible means, not just more bluster and hand-waving.

“If you have a budget of $X, and you give one of our reviewers $2X, $5X or $10X as a budget, that reviewer can almost always find a piece of gear that outperforms anything you can buy for $X (this gets harder as X gets big).”

I don’t see any evidence that your reviewers have any special abilities when it comes to finding performance for value.

“This seems reasonable: we live in a technologically constrained world that poses well-known tradeoffs, and the industry is full of good engineers; if one good engineer faces fewer constraints (has a bigger cost budget) than another good engineer, chances are he/she can create a better product.”

I just deconstructed this. IOW, the law of diminishing returns has not been repealed. Once effective sonic transparency is achieved, no amount of additional money can be spent to obtain improved sound quality. There’s no way to buy more of something you already have as much of as you can ever have!

“3. The differences between gear at $X and gear at say $5X are easily noticed but often not dramatic (i.e. subtle). At the same time subtle differences are often meaningful to music lovers.”

Whether or not there are any audible differences at all between gear at $X and gear at say $5X is the fundamental question. You’re still making the fatal mistake of presuming that this question has been convincingly answered for all kinds of audio components, not just loudspeakers.

“These findings make sense to us, given what we know about the world and our desire to describe for readers what different equipment does.”

That would be a serious problem for you.

“The assertion that DBTs must be used seems to come with the assertion that these findings are wrong (particularly point 2).”

The means by which your reviewers typically obtain “findings” are not universally credible, to say the least. That’s why people keep talking about DBTs. You would do well to read every request for DBTs as a serious expression of doubt in the credibility of your publications.

“Since the findings have been repeated so many times and they make sense in the real world…”

Except they don’t. Many of the so-called findings of TAS, and of high-end publications in general, are ridiculed and reviled by many consumers and audio professionals. Probably at least half.

“a reasonable hypothesis is that there is an agenda at work unrelated to accurately describing what equipment does.”

Anybody who has a modern and reasonably complete education in electronics will find reading TAS publications to be a trip to an alternative universe where the laws of physics and electronics as they are generally known and taught, have been bent beyond recognition.

For every Boyk there are at least 10,000 PhDs who will teach that stranded speaker and interconnect wire operating at audio frequencies differs from solid wire only in its mechanical properties. That’s if a student would even dare to ask the question. By the end of the sophomore year he should be able to work the answer out for himself.

Tom Martin -- Sat, 12/27/2008 - 13:29

Arny -- thanks for the reply.

CEO and Editorial Director, Nextscreen LLC

Curtis -- Wed, 03/18/2009 - 22:55

Parts of Keith_W's Wed, 08/20/2008 - 10:41 comments:

"I believe that DBT's need to be applied intelligently before the results can be interpreted. As pointed out in another post by Jonathan Valin - medical DBT's are massive. They involve thousands of patients with strict entry criteria (the disease being studied is strictly defined, you cannot have other medical conditions which may interfere with data interpretation, you must be of a certain age, and so on). These studies are carefully designed, take months or years to complete, months to analyze, and then months for the peer review process and finally publication. Audio DBT's are not. We never know how sophisticated the listening panel is, whether they know what to look for, and whether individual variations in hearing, perception, and chronic diseases which may affect hearing have been identified and controlled for. We do not know if the test material (music) being played is familiar to the listener. We don't know if non-verbal cues (which can be used to lead AND mislead) are present. And finally, the evaluation period is all too brief. We all know that it can sometimes take weeks of listening to material we are familiar with, on systems we are familiar with, before we get to know the effect of a particular change. How are we expected to identify the changes in such a short period of time, and in an unfamiliar system? The lack of identification of potential sources of bias and the lack of scientific rigour expose most DBT proponents as sham merchants keen to offer scientific window dressing on a testing methodology which is of limited use."
 
CJL: Maybe that is OK for audio testing... that is, having an 'unknown' group of listeners provides a better indication of whether there are differences or not. If a manufacturer cannot tell quickly [i.e. in hours to a month] whether there are significant changes, and changes for the better, he moves on... If 9 out of 10 random listeners hear differences, and the differences they hear are very close to the same response... great! Maybe a loose criterion, but we are not trying to cure someone!
- - -
"The next difference is this - in a medical DBT, we know what we are looking for. We don't just take a new drug, give it to 5,000 patients and give placebo to 5,000 controls, and look at both populations to see what happens. In an audio DBT, what is the primary endpoint? Does the testing panel even know what they are supposed to be looking for?"
 
CJL: Wished we had 5,000 listeners! That is true, we do not know what we are looking for! And we do not want the listeners to seek out or try to find differences; we want their immediate first impressions... listening too long is detrimental to the audio testing.
- - -
"Is there a score sheet that says "image width was xxx meters" or "frequency response: skewed to bass or treble"? Well, there isn't. You are expected to notice a difference, whatever it is, and then use that as a basis for comparison."
 
CJL: Well, we used a listening chart that rated differences 1, 2, 3, 4, 5, with details like: vocal details, air-space, soundstage differences in feet, timbre of instruments, etc. And we were [are] able to correlate what the listeners heard to our measurements. After a while, our measurements made it possible for us to become predictive as to what should be heard!
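A minimal sketch of the correlation step described above, with all numbers invented for illustration (scipy's pearsonr returns the correlation coefficient and its p-value):

```python
# Hypothetical correlation of mean panel ratings (1-5 scale) with one
# measured parameter; every value here is invented for illustration.
from scipy.stats import pearsonr

mean_panel_rating = [2.1, 2.8, 3.0, 3.9, 4.4]  # one mean rating per device
measured_param    = [0.9, 0.7, 0.6, 0.3, 0.2]  # e.g. a distortion figure

r, p = pearsonr(mean_panel_rating, measured_param)
print(f"r = {r:.2f}, p = {p:.3f}")  # strongly negative in this toy data
```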
- - -
"So what sample size is needed to demonstrate a relatively subtle audio tweak, such as the effect of various interconnects?"
 
CJL: 9 out of 10 for initial testing...
- - -
"How many people do you need to test to demonstrate the difference between 0.01% THD and 0.02% THD? Most audio DBT's involve maybe 5-10 listeners. Is this enough to demonstrate the difference?"
 
CJL: Very few general audiophiles would be able to hear such small differences... 0.01-0.02%. For, as you stated before, what is the 'quality' of the listeners? Yes, 9-10 listeners of mixed 'value' is ample to point out or prove there are differences... good or bad.
- - -
"I should also say: in audio DBT's, there is a strong bias towards the null hypothesis (that intervention X made no difference)."
 
CJL: In all the years of testing power cords and interconnects, the null bias was not noticed, since the listeners were expecting to hear some differences...
- - -
"Medical DBT's would be the same - if they were as poorly designed as audio DBT's."
CJL: We are all grateful they are not...
- - -
"I think that audio DBT's have their place, but only well designed ones. Most DBT's I read about are absolutely pathetic."

CJL: Hopefully we, the audio world, will come to some starting point on how to conduct a more precise DBT. But I see no reason for now to be as strict in the 'quantity' or 'quality' of the listeners. The correlation process, though, needs to be updated, for we have very useful electronic test systems that allow us to measure many parameters, and we can learn how to correlate what is heard with what is measured.
- - -
Curtis

LarryB -- Wed, 04/08/2009 - 13:04

Having read and participated in these discussions for some time now, I am continually struck by the fact that, in my estimation, most participants are not genuinely interested in determining the truth, presumably because they are uncomfortable with the notion that carefully constructed tests might show them to be wrong. This is of course antithetical to the advancement of science.
 
And for the record, I too am guilty of this.
 
Larry

"Digital finishes what the transistor began" James Boyk

ScottB (not verified) -- Wed, 04/08/2009 - 20:06

 I remember reading, not too long ago, a DBT study which concluded that at least most listeners could not tell the difference between lossy and lossless digital music formats of 16/44.1 origin. Sorry, I can't find the link to the study itself now. But the study also found, if I recall correctly, that self-described audiophiles or experienced listeners were more likely to generate statistically significant outcomes than those who did not so self-describe.
 
To any experienced audiophile or recording engineer, this is a completely unremarkable finding. Of course, as time goes on and we have more experience listening closely to live and reproduced music, our ability to discern subtle differences in sound increases, right? But why should that be so? More to the point, why should experience play a role in our ability to detect audible differences in signal quality? And, if it does, what does that say about what we're actually measuring in an audio DBT?
 
Of course, it's possible that audiophiles are audiophiles just because they can hear better. Because hearing tests of DBT listening panels seem rarely to be done, that's a possibility. But given the age and gender of most audiophiles, it would seem more likely to me that audiophiles actually hear worse than the general population, at least in standard perceptual threshold tests. That would leave the explanation to cognition and learning: our ability to focus on, and objectify, the kinds of small differences our ears detect as sensory input is a learned (and therefore, learnable) skill. Once again, an unremarkable observation: a classical violinist can detect immediately whether it's the Strad or the ordinary instrument; a real novice might have some trouble hearing the difference between a violin and a viola playing in the same key.
 
Now, a bit of thought suggests that this perceptual/cognitive duality has profound consequences for DBT. For one thing, the results of DBT are often represented as though the test was measuring audio signal differences against the standard of perceptual acuity - that is, a null result indicates that the differences can't be heard. Certainly, this is one interpretation of a null result that can never be ruled out. But it should also be understood that a DBT might be substantially a test of aural cognition - the ability to quickly learn the characteristics of an unfamiliar piece of music, and an unfamiliar system and room, sufficiently that the subtle differences between two audio technologies can in turn be learned and committed to memory well enough to differentiate them against a control. In that sense, a DBT is indeed "rigorous" - but that's not the sense usually intended. And I don't believe I've ever seen the results of a null-result DBT characterized as "the differences were not large enough to be learned by the listening panel under the conditions of this test" - but that clearly is a scientifically supportable, and often likely, conclusion for a DBT with null results.
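That "learnability" reading of a null result can be made concrete with a small sketch, under assumed accuracies (a standard 16-trial ABX run scored at p < 0.05):

```python
# Same 16-trial ABX test, two listeners: one who has not yet "learned"
# the difference (60% accuracy) and one who has (90%). The accuracy
# figures are illustrative assumptions, not measured values.
from scipy.stats import binom

n = 16
k_crit = next(k for k in range(n + 1) if binom.sf(k - 1, n, 0.5) < 0.05)
for p_listener in (0.6, 0.9):
    power = binom.sf(k_crit - 1, n, p_listener)
    print(p_listener, k_crit, round(power, 2))
# Power is ~0.17 for the novice vs ~0.98 for the trained listener: the
# same physical difference yields a null or a clear positive, depending
# only on the listener's state of learning.
```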
 
I really don't have a big dog in this hunt. I'm a graduate-degreed ME, and have spent my 25-year career on the technical side of the software industry, including 6 years as CTO of an F500 software company. I hardly fit the stereotype of a technically illiterate, credulous audiophile with more money than sense. Sound is a physical phenomenon; differences in sound must have physical causes. But I've learned to be far more humble through the years about pre-judging what can and can't have physically significant impact, and that's because I've so often been surprised by what actually does sound different, and how different it sounds. When you live with an audio system for thousands of hours of playback time, and particular recordings for hundreds of replays, and you've developed a certain self-confidence in your own judgment, you really don't give a damn what somebody else (or some listening panel of somebody elses) tells you you can't hear. On the other hand, trying to construct a repeatable framework for listening tests that actually accommodates the role of experience and learning is clearly a worthy scientific endeavor for those of us interested in continuous improvement in sound reproduction.

Norman Varney -- Fri, 04/10/2009 - 17:53


It might be of interest to some reading this thread to hear about an experiment I conducted about ten years ago while working as a Sr. Acoustical Engineer at the Science & Technology Center at Owens Corning. I'll make it short.

We built two identical rooms for the purpose of performing acoustic characterization studies between a room with interior acoustic treatments and one without. The rooms were optimized for room mode distribution, speaker/listener positions, and noise control. They were outfitted with typical home furnishings and entry-level high-end 5.1-system electronics. The entire electro-acoustic chain was precisely calibrated and the same in both rooms. The only difference between the rooms was that one included interior acoustic treatments and the other did not. The treatment was 1.5" deep and covered by a stretch panel system, making it invisible and the two rooms appear identical. During the long period of tests, I was often asked to explain to VIPs the research going on. I began to notice that these people, who had no real interest in audio, would make comments in the treated room like "I'd love to have this at home". I noticed that "regular" folks enjoyed the treated room most. They didn't really understand or care, but they noticed. This interested me. How could I capture it scientifically?
We had our own medical facility and staff onsite, and I got them involved in a study. The approximately 600 employees had to have an annual hearing test, so I asked medical to give me a list of individuals who met hearing criteria I established. From those candidates, I asked for volunteers, and we then conducted biofeedback tests on about two dozen participants. I played a seven-minute clip from "Das Boot". With the exception of one person (who was flat in both rooms), all experienced elevated heart rate, blood pressure, etc. in the acoustically treated room.
A similar story is of a group of teenage boys to whom I was giving a tour of the lab. We ended in the A/B rooms, playing a few audio and video pieces. In the untreated room, they acted like typical teenagers, joking around, etc. In the treated room, they were motionless and focused, as if someone had flipped a switch.
The conclusion: acoustics controlled their emotional response. People become more involved in the experience when it is presented in an optimized acoustic setting. The performance value was elevated due to acoustic control. It wasn't just me, someone who understands and appreciates acoustics, who got it. Anyone exposed to it will get it.

Norman Varney
A/V RoomService, Ltd.

Robert Harley -- Mon, 04/13/2009 - 10:28

That's a very interesting experience and insight. I've found the same thing on a casual, anecdotal basis in my acoustically treated room (by Norman Varney, incidentally). Although I didn't have identical rooms except for the treatment, I reviewed products, and had visitors, in my untreated room for five years, and then for another five years after the treatment. The level of listener involvement (my own included) increased dramatically after the treatment.  The room before treatment had freestanding acoustic products, and I had chosen the dimensional ratios, meaning that it started out quite good.

Brutta Figura (not verified) -- Mon, 08/10/2009 - 15:04

I have been an on-again, off-again audiophile since the early 70s - currently on - and I am amazed at the vitriol of the A/B testing people. An A/B test is akin to a badly designed pilot study, and a negative finding is not surprising.
It is gratifying that others think along the lines of applying rigorous human testing to the issue of audible differences.
The analogy to clinical testing is an apt one, and one which I believe in. I do not believe that a large study is needed if we adopt an adaptive clinical design. This study design selects for responders, but it is only one means of "patient enrichment" in clinical research. Genomics/proteomics/diagnostics may one day give us personalised medicine, and maybe save those who cannot or do not want to hear a difference from spending money on audio nirvana.

regreene (not verified) -- Sun, 12/20/2009 - 12:58

 
There is no doubt that DBT has been abused in audio by the "everything sounds the same" crew. My favorite was a speaker cable test where, in order to include a lot of participants, the sound of the speakers was not provided to the participants directly; rather, the speakers were miked and the picked-up sound was sent out over a public address system. (I am not making this up; this really happened.)
 
And then there was an AES test that purported to prove that two amplifiers were the same in perceived sound even though one of them differed from the other in frequency response by many times the scientifically known thresholds for frequency response audibility.
 
All that being admitted, it still seems to me that knowing the brand and price of the product being reviewed can surely introduce a little subconscious prejudice into the review process. We all try to be impartial, but no one is really in control of their subconscious mind. As TM said, DBT is expensive and difficult. But sending out electronic components in black boxes that obscure their identity without altering their sonic performance might not be that hard.
 
Incidentally, if you are a long-time TAS reader, you have read some blind test reviews, albeit without knowing it. The late and much-missed Ann Turner used to do reviews by having her husband Roman Zajcew fasten up something or other in the morning before he left for work (Ann would not know what). She would listen and take notes on her listening impressions without looking at what was in the system. Then when the day was over, she would check what it was she had been taking notes on. Or so I understand what she did. (Roman can correct me if I am wrong here.) What one read was definitely what she heard without prejudice. This is not easy to do with speakers, but with other things, it was doable.
 
Would it have been a lot different if she had not done this? Perhaps not. She was a great reviewer and she was, as we all try to be, fair and unbiased. But in her case, one was SURE that there was no bias, just a true listening report.
 

Steve S (not verified) -- Fri, 01/27/2012 - 18:20

"There is no doubt that DBT has been abused in audio by the "everything sounds the same" crew. My favorite was a speaker cable test where in order to include a lot of participants, the sound of the speakers was not provided to the participants directly but rather the speakers were miked and the picked up sound was sent out over a public address system.(I am not making this up, this really happened).
 
And then there was an AES test that purported to prove that two amplifiers were the same in perceived sound even though one of them differed from the other in frequency response by many times the scientifically known thresholds for frequency response audibility."
(I am a meter reader, so nothing against measurements in the least.) I have read the above as well. By the way, there have been numerous witnesses (including a federal investigator) who have seen falsified data/graphs, tests altered in order to discredit products, attacks on competitors of friends, attempts to cover up falsified data (not once but twice), etc., by several "scientists" and "engineers" who regularly post on forums and push DBT/ABX testing.
Concerning Keith_W's comments and medical science in general, I have yet to see a DBT post or a pro-DBT website that addresses problems other than removing "sight" and not knowing the "manufacturer". As such, I appreciate his posts/explanations.

As a meter man, I have to deplore some unethical conduct. Sorry if this post is negative.
