Blind Listening Tests are Flawed: An Editorial

Robert Harley -- Wed, 05/28/2008 - 16:18

The following is my editorial from The Absolute Sound Issue 183 (not yet published) on blind listening tests.

The Blind (Mis-) Leading the Blind

Every few years, the results of some blind listening test are announced that purportedly “prove” an absurd conclusion. These tests, ironically, say more about the flaws inherent in blind listening tests than about the phenomena in question.

The latest in this long history is a double-blind test that, the authors conclude, demonstrates that 44.1kHz/16-bit digital audio is indistinguishable from high-resolution digital. Note the word “indistinguishable.” The authors aren’t saying that high-res digital might sound a little different from Red Book CD but is no better. Or that high-res digital is only slightly better and not worth the additional cost. Rather, they reached the startling conclusion that CD-quality audio sounds exactly the same as 96kHz/24-bit PCM and DSD, the encoding scheme used in SACD. That is, under double-blind test conditions, 60 expert listeners over 554 trials couldn’t hear any differences between CD, SACD, and 96/24. The study was published in the September, 2007 Journal of the Audio Engineering Society.

I contend that such tests are an indictment of blind listening tests in general because of the patently absurd conclusions to which they lead. A notable example is the blind listening test conducted by Stereo Review that concluded that a pair of Mark Levinson monoblocks, an output-transformerless tubed amplifier, and a $220 Pioneer receiver were all sonically identical. (“Do All Amplifiers Sound the Same?” published in the January, 1987 issue.)

Most such tests, including this new CD vs. high-res comparison, are performed not by disinterested experimenters on a quest for the truth but by partisan hacks on a mission to discredit audiophiles. But blind listening tests lead to the wrong conclusions even when the experimenters’ motives are pure. A good example is the listening tests conducted by Swedish Radio (analogous to the BBC) to decide whether one of the low-bit-rate codecs under consideration by the European Broadcast Union was good enough to replace FM broadcasting in Europe.

Swedish Radio developed an elaborate listening methodology called “double-blind, triple-stimulus, hidden-reference.” A “subject” (listener) would hear three “objects” (musical presentations); presentation A was always the unprocessed signal, with the listener required to identify if presentation B or C had been processed through the codec.
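In schematic terms, a single trial of that protocol looks something like the following rough Python sketch (an illustration of my own, with invented names; it is not Swedish Radio's actual test software):

import random

def triple_stimulus_trial(listener, reference_audio, codec_audio):
    # One double-blind, triple-stimulus, hidden-reference trial: slot A always
    # holds the unprocessed reference; exactly one of slots B and C holds the
    # codec-processed version, chosen at random so nobody scoring the trial knows.
    processed_slot = random.choice(["B", "C"])
    slots = {"A": reference_audio,
             "B": codec_audio if processed_slot == "B" else reference_audio,
             "C": codec_audio if processed_slot == "C" else reference_audio}
    answer = listener(slots)          # the listener hears A, B, and C, then answers "B" or "C"
    return answer == processed_slot   # True if the processed presentation was identified

# A listener who can only guess lands near 50% over many trials -- the chance
# level against which the real panel's scores were judged.
guesser = lambda slots: random.choice(["B", "C"])
print(sum(triple_stimulus_trial(guesser, "clean", "coded") for _ in range(1000)) / 1000)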

The test involved 60 “expert” listeners spanning 20,000 evaluations over a period of two years. Swedish Radio announced in 1991 that it had narrowed the field to two codecs, and that “both codecs have now reached a level of performance where they fulfill the EBU requirements for a distribution codec.” In other words, Swedish Radio said the codec was good enough to replace analog FM broadcasts in Europe. This decision was based on data gathered during the 20,000 “double-blind, triple-stimulus, hidden-reference” listening trials. (The listening-test methodology and statistical analysis are documented in detail in “Subjective Assessments on Low Bit-Rate Audio Codecs,” by C. Grewin and T. Rydén, published in the proceedings of the 10th International Audio Engineering Society Conference, “Images of Audio.”)

After announcing its decision, Swedish Radio sent a tape of music processed by the selected codec to the late Bart Locanthi, an acknowledged expert in digital audio and chairman of an ad hoc committee formed to independently evaluate low-bit-rate codecs. Using the same non-blind observational-listening techniques that audiophiles routinely use to evaluate sound quality, Locanthi instantly identified an artifact of the codec. After Locanthi informed Swedish Radio of the artifact (an idle tone at 1.5kHz), listeners at Swedish Radio also instantly heard the distortion. (Locanthi’s account of the episode is documented in an audio recording played at a workshop on low-bit-rate codecs at the 91st AES convention.)

How is it possible that a single listener, using non-blind observational listening techniques, was able to discover—in less than ten minutes—a distortion that escaped the scrutiny of 60 expert listeners, 20,000 trials conducted over a two-year period, elaborate “double-blind, triple-stimulus, hidden-reference” methodology, and sophisticated statistical analysis?

The answer is that blind listening tests fundamentally distort the listening process and are worthless in determining the audibility of a certain phenomenon.

As exemplified by yet another reader letter published in this issue, many people naively assume that blind listening tests are somehow more rigorous and honest than the “single-presentation” observational listening protocols practiced in product reviewing. There’s a common misperception that the undeniable value of blind studies of new drugs, for example, automatically confers utility on blind listening tests.

I’ve thought quite a bit about this subject, and written what I hope is a fairly reasoned and in-depth analysis of why blind listening tests are flawed. This analysis is part of a larger statement on critical listening and the conflict between audio “subjectivists” and “objectivists,” which I presented in a paper to the Audio Engineering Society entitled “The Role of Critical Listening in Evaluating Audio Equipment Quality.” You can read the entire paper here: http://www.avguide.com/news/2008/05/28/the-role-of-critical-listening-in-evaluating-audio-equipment-quality/. I invite readers to comment on the paper, and discuss blind listening tests, on a special new Forum on AVguide.com. The Forum, called “Evaluation, Testing, Measurement, and Perception,” will explore how to evaluate products, how to report on that evaluation, and how to link that evaluation to real experience and value. I look forward to hearing your opinions and ideas.

Robert Harley

Steven Stone -- Wed, 05/28/2008 - 19:49

in a manner that even an objectivist should be able to understand. :shock:

Steven Stone
Contributor to The Absolute Sound, EnjoytheMusic.com, Vintage Guitar Magazine, and other fine publications

fkrausz -- Fri, 05/30/2008 - 15:30

The Swedish experiment highlights what I think is a common flaw in double-blind tests: confounding the percentage of times a listener can perceive a difference in a single sound sample with the percentage of samples that reveal a perceptible difference. Imagine that 70% of the test samples had sufficient masking sound at 1.5 kHz, so that the untrained ear wouldn't have picked up the idle tone in them. Now even if the listeners caught the difference in the remaining samples some 90% of the time, they would probably have gotten the right answer in only about (70% * 50%) + (30% * 90%) = 35% + 27% = 62% of the comparisons, which might well be within the statistical variance for the experiment. Getting around this problem requires knowing which sound samples are audibly different via the systems under test, before conducting the test. And good luck with that.
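To make that arithmetic explicit, here it is as a few lines of Python (the percentages are, of course, just my hypothetical numbers, not data from the actual experiment):

masked_fraction   = 0.70   # samples where masking hides the 1.5 kHz idle tone
unmasked_fraction = 0.30   # samples where the artifact is potentially audible
p_guess  = 0.50            # correct-by-chance rate on the masked samples
p_detect = 0.90            # assumed hit rate on the unmasked samples

expected_correct = masked_fraction * p_guess + unmasked_fraction * p_detect
print(f"Expected overall correct rate: {expected_correct:.0%}")   # prints 62%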

Of course, I don't really know if masking was the problem with the Swedish Radio experiment. But the next time somebody describes a blind A/B listening test, ask yourself whether their procedure would detect such problems as dynamic compression, or a loss of the top or bottom octave -- problems that won't show up in every music sample used. Seems to me that, with the samples chosen randomly and treated equally, such tests are doomed to failure.

Robert Harley -- Sat, 05/31/2008 - 09:48

That's an interesting hypothesis, fkrausz. A related factor is the low-quality playback systems used in such tests, which tend to obscure differences and contribute to the phenomenon you cite. The experimenters often come from a position that all CD players, amplifiers, and cables sound the same, so why not use the cheapest possible equipment?

During a visit to a very famous acoustic testing laboratory and blind-listening evaluation room I saw their playback system: mass-market integrated amplifier and $99 CD player, all connected with the "patch cords" that came in the box.

For another example, see my description of the double-blind cable tests at the 91st AES convention that appears in my AES paper "The Role of Critical Listening in Evaluating Audio Equipment Quality."

gmgraves -- Wed, 07/09/2008 - 12:11

I read your editorial with great interest as I have recently been party to a number of carefully set-up double-blind tests.

First of all, you are correct: "indistinguishable" does, indeed, mean that NO audible difference exists between 16-bit, 44.1kHz digital and the high-resolution formats. I have been doing a lot of recording lately and have had ample opportunity to test 16-bit, 44.1kHz digital quantization against both 24-bit, 96kHz and 32-bit (fp), 192kHz quantization. Playback directly from the computer (through the same DAC/ADC with which the recording was made) yielded no audible difference to a listening panel of 5 audiophiles. Also, no audible difference was detected between the high-resolution recording and the SAME recording output as 16-bit, 44.1kHz Red Book and burned to CD. The CD was also played back through the same DAC/ADC used to make the recording, in order to keep variables to a minimum.

Secondly, you are wrong to characterize those who value double-blind or ABX testing as being mostly supported by "...partisan hacks, bent on discrediting audiophiles." I too was skeptical of so-called "double-blind" tests until I started to participate in some of them. My first encounter was with a group of audiophiles using a home-made ABX setup to audition two very high-end power amplifiers. It was bad enough when the tests revealed that no one could reliably tell when the controller (sitting in another room and switching, or not switching, amps every 30 seconds by the clock while a number of different types of music were played) had switched, or to which amplifier she had switched. (She didn't know which was which either; sometimes she would switch at the appointed time, sometimes not. We listeners never knew.) The end result was that, statistically, the 6 listeners got it right about 50% of the time (give or take a few percentage points). In other words, blind chance. When the same experiment was repeated using one of the expensive amps and a much cheaper amp of similar power, and the same results were obtained, it did seem that all modern amps sound the same. This brings me to point three.

Scientists use ABX and double-blind testing for things other than audio (you mentioned pharmaceuticals) for a good reason. If such tests reveal no differences, it's because there are no differences, whether the device under test is a pair of amplifiers, a set of cables, a 16-bit, 44.1kHz digital recording vs. a 24-bit, 192kHz digital recording of the same performance, or a new "wonder drug." If the test is properly set up and the results are that there is no NOTICEABLE difference between the two units in question (or the test subject and the control), then it stands to reason that a difference which makes no difference is no difference at all. But something here strikes me as even more important. If the differences we are talking about are so subtle that a direct A-B comparison does not illuminate them, does this not say that we are engaged in counting angels on the head of a pin? (I leave it to you to decide whether these are "recording angels." ;-) )

I don't pretend to be able to explain your European radio codec story, but obviously something was wrong if no one in a listening test heard a 1.5kHz noise in the program material. Perhaps the panels weren't given more than a few seconds to hear each sample, I don't know, but it would seem to me that anyone would be able to hear a non-correlated 1.5kHz tone in a musical performance if given ample time (more than a few seconds) to focus in on it.

George Graves

Steve S (not verified) -- Sat, 01/21/2012 - 17:43

George, I really hate posting this, but up to a dozen "engineer"/"scientist" proponents have been witnessed, some by a Federal investigator, falsifying data presented to the public, altering at least two tests to discredit them, attacking a competitor of a friend (one traveled 5k miles to help the friend in a video) who owned a business, etc. As a design engineer I hate posting this, but it is happening.
If too negative, delete this post.

Robert Harley -- Thu, 07/10/2008 - 10:00

Thanks for the thoughtful reply to my editorial. I recently had a conversation with Peter McGrath (one of the world's greatest recording engineers of classical music) about the sound quality differences between high-res and 16-bit/44.1kHz. He did the same experiment as you, listening to high-res directly from the computer and then listening to the same file downconverted to 16-bit/44.1kHz. He described the sound-quality difference as "like throwing a light switch."

It's not as though the superiority of high-res digital (compared with 16-bit/44.1kHz) flies in the face of theory. Bob Stuart has shown in a series of AES papers why 16-bit/44.1kHz is insufficient to encode all the information humans can hear. The positions argued in the papers are not idle speculation; Stuart has an extensive scientific background in psychoacoustics, and cites previously published literature regarding human auditory capability.

You can read the details of the Swedish Radio story in the AES paper "Subjective Assessments on Low Bit-Rate Audio Codecs" by C. Grewin and T. Rydén, published in the proceedings of the 10th International AES Conference, "Images of Audio."

michkhol -- Thu, 07/10/2008 - 11:07

Dear Robert,

I just read your editorial and I have a few questions, because not everything in your article is clear to me.

I quote: "blind listening tests fundamentally distort the listening process and are worthless in determining the audibility of a certain phenomenon."

1. In the first test you mentioned, the "certain phenomenon" apparently was the indistinguishability of 44.1/16 and hi-res playback. This phenomenon was confirmed by the test. Yet you call this one of the "patently absurd conclusions." Could you please elaborate on this?

2. To support your point of view you bring up another, totally unrelated, 20-year-old test with similar "absurd conclusions." How can the validity of one test affect the validity of another if the conditions and time were totally different?

3. In the third test, "the certain phenomenon" was the musical difference between the unprocessed signal and the compressed one. It was not a test of the quality of the codec. Again, the test confirmed that there was no difference. You contend that the codec had a flaw and that therefore the test could not be valid. How is it possible to question the validity of a test on the basis of something that was not its purpose?

Sincerely,
Mike

P.S. Regarding your conversation with Peter McGrath, what converters did he use?

gmgraves -- Thu, 07/10/2008 - 11:49

Quote: Thanks for the thoughtful reply to my editorial. I recently had a conversation with Peter McGrath (one of the world's greatest recording engineers of classical music) about the sound quality differences between high-res and 16-bit/44.1kHz. He did the same experiment as you, listening to high-res directly from the computer and then listening to the same file downconverted to 16-bit/44.1kHz. He described the sound-quality difference as "like throwing a light switch."

I find that very interesting. I use an Apogee ADC with my Macintosh as my DAW to record high-res audio. My equipment is very good - perhaps not top-notch, but still, the recordings I get are better than most commercial releases in terms of dynamic range, imaging, etc. While I, of course, respect Peter McGrath's opinion, I simply cannot understand what he is hearing that is so stunningly apparent as to be likened to "throwing a light switch" (in a dark room, one would presume). I use 32-bit floating point to record simply because most of my recording is done live, and the greater bit depth affords me more headroom than does a 16-bit system. This is crucial for live recording, where one may not always have the luxury of carefully setting record levels in advance of the actual event. I usually use 96kHz as the sampling rate, and I play these recordings back using the same Apogee ADC/DAC. I simply hear no difference between that and a CD output from the same recording. I'd love to know what he is hearing. I don't think that I'm that deaf. I can, after all, routinely tell which of my microphones, in a multi-mike setup, I'm "potting up" just by their sound.

George Graves

Robert Harley -- Fri, 07/11/2008 - 15:30

To address Mike's questions, he asserts that the test in question "confirmed" that 16-bit/44.1kHz digital audio is sonically identical to 24-bit/192kHz digital audio as well as to DSD. My point is that CD-quality audio and high-resolution digital audio sound so different that any test that purports to prove that they are indistinguishable says more about the test methodology than about the phenomenon in question.

I cite the 20-year-old Stereo Review double-blind test (that "confirmed" that all power amplifiers sound identical) as further evidence that blind testing of audio products is flawed. The age of the test makes no difference; the methodology is the same. I also cite more recent double-blind tests that "confirm" equally absurd conclusions.

The purpose of the Swedish Radio tests was, in fact, to judge the quality of the low-bit-rate codec. Swedish Radio's charter was to determine if the codec was good enough to replace FM broadcasting in Europe. The fact that 60 "expert" listeners underwent 20,000 trials over a two-year period with elaborate double-blind methodology and failed to hear an artifact of the codec that was instantly recognizable by a single listener after a few minutes of non-blind observational listening surely should ring some alarm bells about the validity of double-blind listening tests.

Finally, I don't know what converters Peter McGrath is listening to at the moment, but I expect them to be near-state-of-the-art.

Robert Harley -- Fri, 07/11/2008 - 15:35

I should add that I'm not a lone voice questioning the validity of blind listening tests.

Here's a footnote from my AES paper:

Many respected academic researchers also question the validity of blind A/B testing. Michael Gerzon stated in his paper “Limitations of Double-Blind A/B Listening Tests,” presented at the 91st AES convention, “It would be a disaster if we had protocols that didn’t reveal subjective differences that the average consumer would notice in five years’ time. I want to indicate possible areas in which normal double-blind A/B protocols may not be adequate to reveal faults that may be audible to even unsophisticated end listeners. I’m going to do this with possible models of how we hear.” Gerzon encouraged other researchers to look beyond double-blind testing and “to develop experimental methodology matched to the phenomenon which is being tested,” and to “not believe that one simple protocol—double-blind or A/B or ABX—is the answer to all kinds of measurement problems.”

Similarly, AES Fellow J. Robert Stuart stated in his landmark papers “Estimating the Significance of Errors in Audio Systems” and “Predicting the Audibility, Detectability and Loudness of Errors in Audio Systems” that “A/B testing using program material and particularly musical program is fraught with difficulties. . .the author sets out some reasons why the ‘objective’ approaches of A/B listening and null-testing may be flawed.”

You can read the entire paper here:

http://www.avguide.com/news/2008/05/28/the-role-of-critical-listening-in-evaluating-audio-equipment-quality/

gmgraves -- Fri, 07/11/2008 - 16:37

robert_harley6 wrote:I should add that I'm not a lone voice questioning the validity of blind listening tests.

Here's a footnote from my AES paper:

Many respected academic researchers also question the validity of blind A/B testing. Michael Gerzon stated in his paper “Limitations of Double-Blind A/B Listening Tests,” presented at the 91st AES convention, “It would be a disaster if we had protocols that didn’t reveal subjective differences that the average consumer would notice in five years’ time. I want to indicate possible areas in which normal double-blind A/B protocols may not be adequate to reveal faults that may be audible to even unsophisticated end listeners. I’m going to do this with possible models of how we hear.” Gerzon encouraged other researchers to look beyond double-blind testing and “to develop experimental methodology matched to the phenomenon which is being tested,” and to “not believe that one simple protocol—double-blind or A/B or ABX—is the answer to all kinds of measurement problems.”

Similarly, AES Fellow J. Robert Stuart stated in his landmark papers “Estimating the Significance of Errors in Audio Systems” and “Predicting the Audibility, Detectability and Loudness of Errors in Audio Systems” that “A/B testing using program material and particularly musical program is fraught with difficulties. . .the author sets out some reasons why the ‘objective’ approaches of A/B listening and null-testing may be flawed.”

You can read the entire paper here:

http://www.avguide.com/news/2008/05/28/the-role-of-critical-listening-in-evaluating-audio-equipment-quality/

I'm certainly not arguing that double-blind testing - or indeed any kind of "subjective" (listening as opposed to measuring) test methodology - isn't fraught with problems; it is. And I don't doubt that double-blind testing is only effective for illuminating gross differences between devices under test, and if gross differences aren't there, then we get a statistical result suggesting that there are NO differences. Hardly conclusive. Blind testing has another problem with regard to reviewers - most of us can't use it. We are one person reviewing one piece of equipment. While I might find double-blind evaluation to be a useful tool (when used in conjunction with other tools), the fact remains that most of us rarely have access to it.

But I am convinced that when a double-blind test returns a verdict of no differences, what it is really telling us is that the differences really aren't worth the splitting of hairs over unless one is really anal about this subject. I find that unless a piece of equipment is markedly worse than something else, when the music starts, these tiny differences become extremely academic and fade into the distant landscape. Don't know about you, but I'm in it to listen to the music, not the equipment.

George Graves

Robert Harley -- Sat, 07/12/2008 - 10:56

My experience is antithetically opposed to your view that

" . . . when the music starts, these tiny differences become extremely academic and fade into the distant landscape. Don't know about you, but I'm it to listen to the music, not the equipment."

I also listen to the music rather than to the equipment, which is why improvements in sound quality are so meaningful. Moreover, there's not a linear relationship between differences heard in the presentation when listening analytically for those differences, and the musical consequences of those differences. That is, what appears to be a small difference when listening analytically can have profound musical consequences. In addition, much of musical expression lives in the subtleties of dynamics and inflection, characteristics that are not focused on when listening analytically. Finally, timbral realism is often a function of how well a component reproduces the inner micro-detail of an instrument's sound.

I find that rather than "fading into the distant landscape," small differences become magnified when listening for pleasure.

Perhaps we just hear differently.

Ro (not verified) -- Tue, 12/23/2008 - 17:43

"I also listen to the music rather than to the equipment, which is why improvements in sound quality are so meaningful."

Robert, that is exactly why blind tests (and any "scientific" audio tests) are meaningless. It will always come down to the expertise of the person - either capable of appreciating the transmission of *music* or not. There are people who listen to music, and there are people who hear sound playback. Unfortunately, a generation brought up with CDs and MP3 files and 10kHz-range computer speakers has forgotten how to listen to recorded music. It'll take a few days - if not weeks - for someone used to 44.1kHz to start hearing how alive and how moving music is in higher-resolution formats, digital or analogue. Simple things, though, like quick dynamics and accurate high-frequency imaging, are obvious immediately. Someone at another website described the difference as "movement and space": 96/24 audio has movement in it, whereas CDs sound canned and still.

gmgraves -- Sat, 07/12/2008 - 13:03

I think that we just put different weights on these things. When I review a piece of electronics (let's leave speakers and other transducers out of this, as there is nothing subtle or ambiguous about the differences between those components) I notice the differences between them and my reference components and I note them. When I start listening to music for pleasure, though, I find myself refocused on the performance and not the equipment. If the component I'm listening to is significantly better than my reference, I tend not to notice the improvement when listening for pleasure; I simply enjoy it. If the equipment is not as good as my reference, I tend to, after a few minutes, "forget" that too, and again, just enjoy the music. I could shift my focus back into "review" mode and hear all of the anomalies I heard when seriously reviewing the component, but then I would be letting my subsidiary awareness take precedence over my focal awareness, wouldn't I? ;-)

Robert Harley -- Mon, 07/14/2008 - 10:28

My point is that the better-sounding piece of equipment more easily allows you to forget the equipment and focus on the music. Electronic artifacts are a constant reminder that we're listening to a re-creation, rather than to music itself.

Greater musical enjoyment comes not from listening in analytical mode for specific sonic attributes, but through lower awareness of the electro-mechanical system between listener and performer. The listener might not be aware of why he feels a closer connection to the music with better equipment. One doesn't have to be consciously aware of a reduction in artifacts to appreciate the deeper musical involvement the reduction in those artifacts engenders.

michkhol -- Mon, 07/14/2008 - 12:56

robert_harley6 wrote:To address Mike's questions, he asserts that the test in question "confirmed" that 16-bit/44.1kHz digital audio is sonically identical to 24-bit/192kHz digital audio as well as to DSD. My point is that CD-quality audio and high-resolution digital audio sound so different that any test that purports to prove that they are indistinguishable says more about the test methodology than about the phenomenon in question.

Robert,

Unfortunately what sounds obvious to you is not necessarily obvious to others. Do you have any arguments (examples, tests) other than "critical listening" to prove your point?
Quote:
I cite the 20-year-old Stereo Review double-blind test (that "confirmed" that all power amplifiers sound identical) as further evidence that blind testing of audio products is flawed. The age of the test makes no difference; the methodology is the same. I also cite more recent double-blind tests that "confirm" equally absurd conclusions.

If this were true we would not have many new drugs. The methodology is the same (drug, placebo, control group), but if a test gave absurd results with some chemical 20 years ago, it does not mean that the testing method itself was flawed.
Quote:
The purpose of the Swedish Radio tests was, in fact, to judge the quality of the low-bit-rate codec. Swedish Radio's charter was to determine if the codec was good enough to replace FM broadcasting in Europe. The fact that 60 "expert" listeners underwent 20,000 trials over a two-year period with elaborate double-blind methodology and failed to hear an artifact of the codec that was instantly recognizable by a single listener after a few minutes of non-blind observational listening surely should ring some alarm bells about the validity of double-blind listening tests.

I quote: "Swedish Radio's charter was to determine if the codec was good enough to replace FM broadcasting in Europe." I don't see anything here about the technical quality of the codec. The test proved what it was supposed to prove: that the codec was "good enough." Any further speculation takes us beyond the purpose of the test. If the purpose of the test had been to look for codec flaws, the test would have revealed them (and this was confirmed later by the same expert group). Again, I don't see any grounds to question the validity of the test.
Quote:
Finally, I don't know what converters Peter McGrath is listening to at the moment, but I expect them to be near-state-of-the-art.
Then it means that Peter McGrath's words do not confirm or deny your point of view. The same effect can easily be observed (and explained) with low-quality converters.

Jonathan Valin -- Mon, 07/14/2008 - 17:57

Quote:If this was true we would not have many new drugs. Methodology is the same (drug, placebo, control group)

I'm amused by the way some of you conflate double- or triple-blind drug trials with blinded or ABX listening tests. As I have participated in several drug trials, let me point out:

1) Drug trials don't take an hour or so (at most) and don't involve a handful of randomly assembled "subjects" who may or may not be qualified for the trial. The ones I've been involved in take years and thousands (often tens or hundreds of thousands) of highly pre-qualified subjects, who generally have to come in to a clinic to be tested weekly (at first), then bi-weekly (after a year or so), then monthly (after another six-to-twelve months), and then at longer intervals over a period of many more years. The amount of data collected is huge; it has to be to be statistically significant.

2) Out of curiosity, which is the control group in an ABX or blinded listening test? And which is the experimental group?

3) Most importantly, at no point in any of the drug trials I've been involved in, has anyone asked me: "So did you notice any difference in how you were feeling this week?" And, depending on how I answered, then put a big checkmark in a "Yes" or "No" box and told me I was done. Short of reporting persistent pain (or its alleviation), there is little to no place for subjective perceptual responses in a drug trial. They take your blood, they take your urine, they run specific complex tests on both, and (in a true double-blind or triple-blind test) you don't know the results of those tests (and neither do the people who administer them) until the entire trial is over and the drug is either approved or disapproved.

How, in anybody's wildest imaginings, is asking someone if they can hear a difference between A/B/C, on the basis of a few blind listens at relatively quick intervals, "the same methodology" as a drug trial? Indeed, how is this any different than sitting someone down, for the first time, in a situation they've had little to no experience with, in front of equipment they've had little to no previous experience with, then rapidly changing stimuli and asking them to judge, more or less instantly, whether they can hear...well, what? Any differences? Specific differences? Their own preconceptions? Their friends' and neighbors' preconceptions?

Sounds more like Abu Ghraib than a drug trial.

Jonathan Valin -- Mon, 07/14/2008 - 19:08

Quote: Swedish Radio's charter was to determine if the codec was good enough to replace FM broadcasting in Europe. I don't see anything here about technical quality of the codec. The test proved what it was supposed to prove: it was "good enough". Any further speculation will put us beyond the purpose of the test. If the purpose of the test was to look for codec flaws, the test would reveal them (and it was confirmed later by the same expert group). Again I don't see any grounds to question validity of the test.

I don't get your logic. The trial didn't prove what it was supposed to prove: that the new codec was "good enough" to replace FM broadcasting in Europe. If it had, why would Swedish Radio have given a damn about the serious error that was discovered by a (non-blinded) professional listener, and why wouldn't they have used that original blind-testing-panel-approved codec without any qualification or modification to replace FM broadcasts? And what makes you think that if the test had been "designed to look for codec flaws," these same blind listeners would've found them? Wouldn't detecting such audible flaws be considered part of the blind panel's original mandate?

John Mitchell -- Tue, 07/15/2008 - 00:32

Robert Harley's paper presents some valid criticisms of blind testing of audio components, such as the effects of emotional stress on the listener's perceptions. Some might conclude that, given these difficulties, blind testing is unreliable and should be rejected. It seems to me that a more useful approach is to apply criticisms such as Mr. Harley's to the design of more reliable blind tests. For example, in testing power cords, interconnects, or power conditioners, it seems quite feasible to design blind tests that would allow the reviewer to listen to each component in the usual setting (e.g., the reviewer's home, using his or her reference equipment), over a fairly long period of time, and in the same manner that is used in "observational listening". Such testing could be used to confirm the results of observational listening tests (assuming an honest and open-minded reviewer).

While I agree with Mr. Harley's claim that the "objectivists" are often unscientific in their refusal to recognize or account for the complexities of human perception and psychology, it is equally unscientific to discount the influence of psychological factors (e.g., our expectations, manufacturers' reputations, and the opinions of authorities) on our perception of a component's quality in non-blind tests. Even if a reviewer is convinced that he or she is immune to such psychological effects, others have no reason to trust in such self-proclaimed immunity in the absence of any objective verification.

John Mitchell

Robert Harley -- Tue, 07/15/2008 - 10:22

I must disagree with Michkhol's reasoning on all three points.

First, my assertion that audible differences exist between 44.1kHz/16-bit, 96kHz/24-bit, and DSD is based on my own listening experiences. This includes hearing a live microphone feed and comparing it to standard resolution PCM, high-resolution PCM, and DSD. Have you performed the same test and reached a different conclusion? Or are you content to base your belief that they all sound identical on the results of a published test? (A “test” devised and conducted, by the way, by two individuals with a long history of attempting to discredit audiophiles.) Are you suggesting that I should reject my own direct experience and conclude that I was simply deluded? Is everyone who hears a difference between standard- and high-resolution digital audio similarly deluded—people like Meridian’s Bob Stuart, Keith Johnson, dCS founder Mike Storey, Peter McGrath, and other credentialed and respected leaders in the field who have decades of academic research and hands-on experience with the subject?

Second, Swedish Radio was, in fact, attempting to discover audible artifacts in the codecs. Read the paper (“Subjective Assessments on Low Bit-Rate Audio Codecs” by C. Grewin and T. Rydén, published in the Proceedings of the 10th International AES Conference). I’ve read the paper, and attended its presentation at an AES conference in London.

More to the point, you’re trying to dodge the essential facts of the Swedish Radio affair; 60 listeners over 20,000 double-blind trials failed to hear an artifact of the codec that was obvious to a single listener using non-blind techniques. No amount of parsing the language of Swedish Radio's mandate gets you around that fact.

Finally, your attempt to dismiss Peter McGrath’s experience merely by the fact that I don’t know the specific D/A converter he was listening through is grasping at straws.

michkhol -- Fri, 07/18/2008 - 14:52

jvalin wrote:

How, in anybody's wildest imaginings, is asking someone if they can hear a difference between A/B/C, on the basis of a few blind listens at relatively quick intervals, "the same methodology" as a drug trial? Indeed, how is this any different than sitting someone down, for the first time, in a situation they've had little to no experience with, in front of equipment they've had little to no previous experience with, then rapidly changing stimuli and asking them to judge, more or less instantly, whether they can hear...well, what? Any differences? Specific differences? Their own preconceptions? Their friends' and neighbors' preconceptions?

In a word, you say that an ABX listening test is much more subjective than scientific. How then can you talk about the validity of one subjective test ("critical listening") against the other (ABX)? If the results of two subjective tests do not match, it does not prove anything.

michkhol -- Fri, 07/18/2008 - 15:11

jvalin wrote:

I don't get your logic. The trial didn't prove what it was supposed to prove: that the new codec was "good enough" to replace FM broadcasting in Europe. If it had, why would Swedish Radio have given a damn about the serious error that was discovered by a (non-blinded) professional listener, and why wouldn't they have used that original blind-testing-panel-approved codec without any qualification or modification to replace FM broadcasts? And what makes you think that if the test had been "designed to look for codec flaws," these same blind listeners would've found them? Wouldn't detecting such audible flaws be considered part of the blind panel's original mandate?
Swedish Radio just got lucky. If not for this professional listener, it would have approved the codec. To non-professional listeners it probably would have sounded the same (I haven't heard the codec, so I cannot say much here). How many of the MP3 generation would give a damn about the codec's flaws anyway?
I cannot make any assumptions about what should be considered the panel's original mandate; I'm not Swedish Radio. However, audibility is subjective. People do not hear the same. It's not because some of them are "professionals" and some of them are not. Everyone has his own "taste" for frequency ranges and transient behavior. And it depends on mood, fatigue, etc.; that is, it is totally subjective. That's why judging equipment by its reviews does not make sense.

michkhol -- Fri, 07/18/2008 - 16:43

robert_harley6 wrote:
First, my assertion that audible differences exist between 44.1kHz/16-bit, 96kHz/24-bit, and DSD is based on my own listening experiences. This includes hearing a live microphone feed and comparing it to standard resolution PCM, high-resolution PCM, and DSD. Have you performed the same test and reached a different conclusion? Or are you content to base your belief that they all sound identical on the results of a published test?
(A “test” devised and conducted, by the way, by two individuals with a long history of attempting to discredit audiophiles.)

Now I'm confused. What test are we talking about? Is it this one:
http://www.aes.org/e-lib/browse.cfm?elib=14195 ?

Quote:
More to the point, you’re trying to dodge the essential facts of the Swedish Radio affair; 60 listeners over 20,000 double-blind trials failed to hear an artifact of the codec that was obvious to a single listener using non-blind techniques. No amount of parsing the language of Swedish Radio's mandate gets you around that fact.

I wish I could hear the audio from that test as a non-professional. So far I don't know how noticeable it was.

Quote:
Finally, your attempt to dismiss Peter McGrath’s experience merely by the fact that I don’t know the specific D/A converter he was listening through is grasping at straws.
Not at all. I respect Peter McGrath's experience but consider it only as a personal (subjective) opinion. You and I may have heard it differently in his room.

If you like, you could read a paper by another audio professional, Dan Lavry: www.lavryengineering.com/documents/Sampling_Theory.pdf.
Quote:
Nyquist pointed out that the sampling rate needs only to exceed twice the signal bandwidth. What is the audio bandwidth? Research shows that musical instruments may produce energy above 20 KHz, but there is little sound energy at above 40KHz. Most microphones do not pick up sound at much over 20KHz. Human hearing rarely exceeds 20KHz, and certainly does not reach 40KHz. The above suggests that 88.2 or 96KHz would be overkill. In fact all the objections regarding audio sampling at 44.1KHz, (including the arguments relating to pre ringing of an FIR filter) are long gone by increasing sampling to about 60KHz.

Judging by how many respectable studios in the world use his equipment, I think that the words of a renowned D/A and A/D designer have some weight.
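For anyone who wants to see the Nyquist criterion itself in action, here is a small numpy sketch of my own (nothing to do with Lavry's paper): a tone above half the sampling rate folds back to a lower frequency, which is exactly the aliasing that the anti-aliasing filter must prevent.

import numpy as np

def dominant_frequency(tone_hz, fs, seconds=1.0):
    # Sample a sine of tone_hz at rate fs and return the frequency of the
    # largest FFT bin, i.e. where the energy actually ends up after sampling.
    n = int(fs * seconds)
    t = np.arange(n) / fs
    x = np.sin(2 * np.pi * tone_hz * t)
    spectrum = np.abs(np.fft.rfft(x))
    freqs = np.fft.rfftfreq(n, d=1.0 / fs)
    return freqs[np.argmax(spectrum)]

# A 25 kHz tone is above the 22.05 kHz Nyquist limit of CD-rate sampling,
# so it folds back (aliases); at 96 kHz it is captured where it belongs.
print(dominant_frequency(25_000, 44_100))   # about 19100 Hz
print(dominant_frequency(25_000, 96_000))   # about 25000 Hz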

Robert Harley -- Sat, 07/19/2008 - 16:53

You might want to read some of the papers by Bob Stuart (a Fellow of the AES, who also holds a degree in psychoacoustics) in which he presents the case that 44.1kHz/16-bit digital audio is insufficient to encode the entire range of sounds humans can hear. He bases this thesis on models of human hearing that are generally accepted in the psychoacoustic literature.

Secondly, although 44.1kHz sampling is perfect in theory for encoding a signal with a bandwidth of 20kHz, in practice 44.1kHz is too slow because of the requirements it puts on filter design. The anti-aliasing filter needs to have no attenuation at 20kHz and more than 100dB of attenuation at 22.05kHz. Such steep filters introduce time-domain distortions. See the AES papers of the early-to-mid 1990s by Mike Storey.
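To get a feel for what that specification demands, here is a short Python sketch using scipy's Kaiser-window estimate (one design method among many, so the exact tap counts are only illustrative); the far longer filter required at 44.1kHz implies a correspondingly longer impulse response, which is the time-domain side of that trade-off.

from scipy.signal import kaiserord

def taps_needed(fs, passband_hz, stopband_hz, stop_atten_db=100):
    # Kaiser-window estimate of the FIR length for a lowpass filter that is
    # still flat at passband_hz and stop_atten_db down by stopband_hz.
    transition = (stopband_hz - passband_hz) / (fs / 2.0)   # as a fraction of Nyquist
    numtaps, _beta = kaiserord(stop_atten_db, transition)
    return numtaps

# CD-rate anti-aliasing/reconstruction filter: flat to 20 kHz, ~100 dB down by 22.05 kHz.
print(taps_needed(44_100, 20_000, 22_050))
# At 96 kHz the transition band can be far wider (say, 20 kHz to 40 kHz) for the same attenuation.
print(taps_needed(96_000, 20_000, 40_000))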

Third, there's been a suggestion that although we can't hear sinewaves above 20kHz, we can detect the steepness of transient signals that implies a bandwidth greater than 20kHz. We use these steep transients in localizing sounds. Part of the HDCD process (the patent application makes for fascinating reading) encodes, in the hidden information channel, indicators that a signal's transient leading edge is less steep on the 44.1kHz/16-bit signal compared with the high-res original from which the HDCD-encoded compatible signal is derived. A conjugate process in the decoder restores the transient's original rise time.

michkhol -- Sun, 07/20/2008 - 11:42

robert_harley6 wrote:You might want to read some of the papers by Bob Stuart (a Fellow of the AES, who also holds a degree in psychoacoustics) in which he presents the case that 44.1kHz/16-bit digital audio is insufficient to encode the entire range of sounds humans can hear. He bases this thesis on models of human hearing that are generally accepted in the psychoacoustic literature.

At this point I think it is time to decide which way we are going. We can go the scientific way (meaning models, assumptions, approximations, and statistics) or the subjective way of "critical listening." So far, mixing scientific and subjective arguments looks, for lack of a better word, subjective.

Quote:
Secondly, although 44.1kHz sampling is perfect in theory for encoding a signal with a bandwidth of 20kHz, in practice 44.1kHz is too slow because of the requirements it puts on filter design. The anti-aliasing filter needs to have no attenuation at 20kHz and more than 100dB of attenuation at 22.05kHz. Such steep filters introduce time-domain distortions. See the AES papers of the early-to-mid 1990s by Mike Storey.

That's exactly what Dan Lavry talks about, along with ways to overcome it. I would appreciate it, though, if you could tell me how to obtain these papers without paying $20 for each of them on the AES web site.

Quote:
Third, there's been a suggestion that although we can't hear sinewaves above 20kHz, we can detect the steepness of transient signals that implies a bandwidth greater than 20kHz. We use these steep transients in localizing sounds. Part of the HDCD process (the patent application makes for fascinating reading) encodes, in the hidden information channel, indicators that a signal's transient leading edge is less steep on the 44.1kHz/16-bit signal compared with the high-res original from which the HDCD-encoded compatible signal is derived. A conjugate process in the decoder restores the transient's original rise time.
That's very interesting. I did in fact read the patent, but I don't recall the transient restoration grabbing my attention. The first question, however, is: was it implemented, and to what degree? Are there any test measurements published?

Speaking of the audibility test, if you were talking about the test I referred to in the AES paper, then there is no argument here. You heard a difference between a live feed and the differently encoded versions; the test compared the differently encoded signal with the same signal run through an A/D/A loop. If it was a different test you were talking about, please let me know.

rmcdo -- Tue, 07/22/2008 - 16:26

I have read Robert Harley’s article “The Role of Critical Listening in Evaluating Audio Equipment Quality” in its entirety. I thought it was extremely well written and thought out, and I feel that I learned a lot from it. However, instead of proving the worthlessness of double-blind testing in evaluating the differences in audio equipment, I believe the content of this article strengthens the arguments for the importance of properly conducted blind listening tests in the evaluation of audio equipment.

At this time I am willing to concede the following:
1. There are factors that affect the sound of audio equipment for which measurements have not yet been devised.
2. There are people who through training, experience or natural ability are able to discern subtle differences between different pieces of audio equipment that most people would not notice.
3. Most double blind tests are severely flawed and thus their results are worthless.

I do believe however, that the only way to substantiate the claim that one piece of equipment sounds different and/or better than another piece of equipment is through properly conducted double-blind testing. There are too many biases that will adversely affect the results of any non-blind test. First is the fact that nearly anyone who pays a large sum of money for a new piece of equipment is going to be strongly motivated to believe that it sounds better than the old piece of equipment. Second, audio journalists are probably going to be biased toward equipment that is manufactured by people whom they admire or who are their friends. Finally, there is going to be pressure to find the more expensive piece of equipment to sound better, just because of its high price. I would call this the “Emperor’s New Clothes syndrome” because obviously only a person with a highly developed listening ability will be able to tell that the more expensive equipment sounds better.

So how should a proper double blind listening test be set up? The article tells how. Here’s what is needed:
1. The participants should be those people with highly developed hearing ability who know what to listen for when evaluating audio equipment.
2. The associated equipment that is not being evaluated should be of a high enough quality to allow differences in the equipment being evaluated to be revealed.
3. The equipment being evaluated should be of a sufficiently high quality to make conducting the test worthwhile.
4. The music recordings used for the test should have a wide frequency and dynamic range and should be chosen by each listener.
5. The listener should control the frequency and duration of the switching between equipment A and equipment B.
6. The number of trials should be sufficient to get a valid result, but not so many that the listener becomes fatigued.
7. There should be a sufficient number of qualified listeners used to ensure valid results.

The test should consist of two parts. First the test should determine if there is any difference between A and B and second, if so, which sounds better. For the first part of the test, the objective part, the machine controlling the test would randomly compare A to A, A to B, B to A, or B to B. The listener would switch back and forth as many times as needed and then decide yes, there is a difference, or no, there is no difference. The machine would then randomly reset the choices and the listener would choose again. There would need to be a sufficient number of trials per listener to determine if there were any discernible differences between the two pieces of equipment.
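To make the mechanics of that first, objective part concrete, here is a rough Python sketch of how the controlling machine might randomize the presentations and score the answers (a toy illustration only; the function name and the stand-in "listener" are my own invention, not part of any existing test rig):

import random

def run_same_different_trials(listener, n_trials=40):
    # Part one of the proposed test: on each trial the machine randomly presents
    # A/A, A/B, B/A, or B/B; the listener answers "same" or "different"; we
    # simply count how many answers were right.
    correct = 0
    for _ in range(n_trials):
        pair = (random.choice("AB"), random.choice("AB"))
        truth = "same" if pair[0] == pair[1] else "different"
        if listener(pair) == truth:
            correct += 1
    return correct, n_trials

# A stand-in "listener" who is purely guessing, for demonstration only;
# over 40 trials the score hovers around 20, i.e. chance.
guesser = lambda pair: random.choice(["same", "different"])
print(run_same_different_trials(guesser))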

If the first part of the test determined that the listeners could not tell a difference between A and B, then there would be no need for the second part and it would be concluded that there was no difference between the sound of A and B that could be reliably heard by listeners of these products. The recommendation would be to purchase the cheaper of the two products. If the listeners were able to consistently determine that there was a difference between the sound of A and B, then the second, subjective part of the test would be conducted.

For this part, the machine would randomly compare A to B or B to A. The listener would switch back and forth a sufficient number of times and then record which sounded better, A or B. The machine would randomly reset the comparison and the listener would choose again. There would again need to be a sufficient number of trials to produce a statistically valid result. The other listeners would do the same thing and the results would be compiled. The conclusions drawn from the second part of the test would not be as clear cut as those from the first part.

There would be three different possible results from each listener. The listener would either consistently prefer the sound of A, consistently prefer the sound of B, or not be able to consistently express a preference. The results could be completely different for each participant. If all or most listeners consistently preferred A to B or B to A, then the conclusion would be that this preference would be the recommended choice for purchase. Of course, “sounds better” is still subjective, but if the participants were well versed in the sound of live music, then their recommendation would carry a lot of weight. If there was no consistency among the participants, or if most participants were unable to consistently pick one or the other as sounding better even though they do sound different, then the conclusion would be that there was very little difference between the sound of A and B. Neither A nor B could be recommended for purchase over the other. It would be a matter of personal choice or budget constraints if one cost more than the other.

My conclusion is that a properly conducted double blind test is the only way to prove without a doubt that one component sounds different from another. This type of test would be especially important when evaluating passive items such as cables. If detectable differences between components actually exist, then it should be no problem for trained listeners to hear them in properly set up blind listening tests. By doing so, they would prove once and for all that high end audio products are truly worth their cost.

peterlonz (not verified) -- Fri, 06/12/2009 - 00:26

I think "rmcdo" has pretty much got this issue nailed.
I have as little faith in poorly conducted blind tests as I have in the multitude of expert witnesses who with great authority claim to be able to pick small differences & describe them.
But I would add a couple of points not fully addressed in rmcdo's conditions list 1 to 7:
1) It must be possible to switch quickly from A to B etc. This reduces the need to carry one's listening perceptions in memory. An easily understood example might be trying to hold a complex picture in one's head whilst attempting to compare a very similar picture; you would lose the detail of the first view quite soon.
2) The value of the equipment under test is irrelevant. If someone wishes to test & pay for the tests that's all that matters.
Finally to all those who have contributed to the view that blind testing is of no or little value I issue a challenge:
At home, set up your system with any audio media you choose, & arrange for a reasonably quick changeover so that two pieces of equipment can be compared. Teach your wife, GF, or buddy how to quickly make the change whilst you are blindfolded. You call the changes (time is your only constraint) & have your friend simply record whether you are right or wrong. To show yourself that you have the skills you claim, you would need to be correct on average 2 out of 3 times over a minimum sample of 20 tests. Good luck.
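For anyone wondering how stiff that criterion is, a few lines of Python (plain binomial arithmetic with the standard math module; the function name is just mine) show how often pure guessing would meet it:

from math import comb

def chance_of_passing(n_trials=20, required_correct=14, p_guess=0.5):
    # Probability that blind guessing gets at least required_correct of
    # n_trials right (14 of 20 is the "2 out of 3" threshold, rounded up).
    return sum(comb(n_trials, k) * p_guess**k * (1 - p_guess)**(n_trials - k)
               for k in range(required_correct, n_trials + 1))

print(round(chance_of_passing(), 3))   # roughly 0.06, so luck alone rarely passes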
 
peterlonz

Robert Harley -- Tue, 07/22/2008 - 22:25

Thanks for that thoughtful reply. I agree in principle that such a test as you describe might be useful. No blind listening tests have been conducted with even a fraction of the consideration for the listener that you describe. Nearly all of them have been designed and run by those with an axe to grind against audiophiles.

Is the test you propose a practical reality, or is it a thought experiment?

I disagree, however, with your assertions that reviewers are biased toward products that are more expensive, made by manufacturers with whom they have a relationship, and other factors. This is purely conjecture on your part.

Here's an anecdote that you might find interesting. I received for review two power amplifiers, a Parasound unit designed by the great John Curl, and the McCormack DNA-1, the first product from Steve McCormack (he had previously been in the tweaking business). I listened to the two amplifiers, and the DNA-1 sounded considerably better. If I were biased, I would have expected the John-Curl designed amplifier (which had a more elaborate design and construction) to sound better. Not only that, I learned during the review period that John Curl's house burned to the ground in the Berkeley fire of the early 1990s. He lost everything, including the parts needed to make his Vendetta Research phono preamp. Curl received a royalty on every amplifier sold, and needed the money. I had (and have) the greatest respect for him as a designer; he's one of the three or four best solid-state designers in the history of the high-end. I also liked him as a person.

Nonetheless, I wrote the review and told it like I heard it; the Parasound amplifier was a disappointment. It sounded thick, grainy, and hard.

Curl then examined one of the production amplifiers and discovered, to his horror, that the factory in Taiwan that made the amplifier under contract had replaced his discrete input stage with one based on an op-amp. There were a number of other changes to his design.

He insisted that production be halted, and that the amplifier be produced exactly to his design. Parasound (who had not known about the circuit changes) immediately instituted strict safeguards to prevent factories from substituting even a single part. Curl became involved in overseeing the production of the amplifier and all subsequent products bearing his name.

The version of the amplifier that was built to Curl's original specification sounded significantly better. The amplifier went on to become a best seller for its great sound and high value. Both Curl and Parasound learned from the experience, and now keep tight controls over production.

Had I been influenced by the biases that you suggest are present in all reviewers, I would have given the first sample of the Parasound a positive review. But I knew what I heard in the listening room, and nothing can change that.

rmcdo -- Wed, 07/23/2008 - 06:00

I didn't mean to imply that all audio reviewers are biased toward high end products, but that blind listening tests would eliminate any perception of bias on the part of the review reader. As a retired CPA I can tell you that independence and the appearance of independence are both important factors when auditing a business. Not only must the CPA be truly independent of the business being audited, but also the CPA must own no stock in the business, have no family member employed by the business, or have any other relationship with the business which might cause a user of the audit to doubt the independence of the CPA. If an audio reviewer truly did not know what product he was evaluating, then there would be no perception of bias on the part of the reader.

Jonathan Valin -- Wed, 07/23/2008 - 13:16

Quote:In a word you say that ABX listening test is much more subjective than scientific. How then can you talk about validity of one subjective test ("critical listening") against the other (ABX)? If results of two subjective tests do not match it does not prove anything

In a word (or several), I am saying that blind listening tests are more "subjective" than standard observational listening tests. You are literally asking people to make guesses in the dark about differences that experienced folks take days, weeks, and months of close listening on familiar systems with familiar source materials to sort out.

BTW, you didn't answer my point about your comparison of blind listening tests to drug trials. I will repeat: How could any reasonable person say these two procedures are the same?

And why, may I ask, does the fact that two subjective tests don’t give you the same results make both of them invalid? Let's say I sit you down in a room and bring in two healthy, well-groomed horses of the same height and weight (well, within 0.1 gram)--one will be a pure white Lipizzan and the other a black plough horse. You will be allowed to view each horse for as long as you want. I will ask you if they differ and you will tell me, on the basis of your observations, how (color, breeding, form, etc.). And then I will ask you to pick the horse you think is best in all observed categories—breeding, color, form, etc. We will call the winner "A/X" and the loser “B.”

Then I will blindfold you to eliminate all "psychological factors" and "unconscious biases." And at short intervals, “A/X” and “B” will be trotted out in front of you, as many times as you want. You must decide by touch, smell, and sound alone (tasting is optional) which one is “A/X.” How do you think you'll fare? (And consider yourself lucky that I didn't propose this test with a garden snake and a Black Mamba.)

And here’s the kicker: if you are unable to pick A/X in this AB/X test with “statistically significant” certainty, does that invalidate the fact that A/X is white and B is black, that A/X is a Lipizzan and B is a plough horse, or that you thought A/X had better form and breeding than B? According to you (“If results of two subjective tests do not match it does not prove anything”), it does.

jfm -- Wed, 07/23/2008 - 20:37

While I don't think blind testing is the final answer either, I'll take the opposite tack and take observational reviewing to task. My main critique revolves around the use of language.

I'll focus on one word for now: "incredible" and its variant, "incredibly."

I can pull out any TAS issue and find that word in a review, even in several reviews depending on the issue. "Incredible soundstage," "incredibly transparent," and so on. It's just a slightly longer word, used just as lazily as the ubiquitous "very."

That use of language hardly makes for a scientific observation. It would be better to describe the soundstage or whatever criterion is being observed.

Much like HP used to do:
- "when the players are sitting at an oblique angles to the front of the stage, the [amp] can make you hear it."
- "you can hear each string's individual harmonic signature at different points in space"

Both samples taken from the amplifier survey in Issue 36, still a landmark for the use of language to describe audio.

Jonathan Valin -- Thu, 07/24/2008 - 13:57

You have a point about the use of intensifiers, jfm. Still, we do try to be as specific as we can in our reviews. And with an audience that parses every adjective or adverb for nuance, even intensifiers are not used without deliberation and purpose.

michkhol -- Thu, 07/24/2008 - 15:21

jvalin wrote:

In a word (or several), I am saying that blind listening tests are more "subjective" than standard observational listening tests. You are literally asking people to make guesses in the dark about differences that experienced folks take days, weeks, and months of close listening on familiar systems with familiar source materials to sort out.

That's the idea of unbiased listening. The "standard observational listening tests" are inherently biased because all testers involved are interested persons. No harm done if they do it between themselves, but they publish "reviews" which say little more about the product than the ad on the next page, but in 100 different ways. Not every reader has the courage, as Mr Harley does, to defy a "professional" authority and make his own choice.

Quote:
BTW, you didn't answer my point about your comparison of blind listening tests to drug trials. I will repeat: How could any reasonable person say these two procedures are the same?

Actually, I talked about methodology, which is the way you set up the procedure and interpret the results. And my point was that if the procedure was not set up properly, that did not invalidate the whole test. Ah, I almost forgot: no reasonable person can say these two procedures are the same. That is my answer.

Quote:
And why, may I ask, does the fact that two subjective tests don’t give you the same results make both of them invalid?

I don't know. Does it?

Quote:
Let's say I sit you down in a room and bring in two healthy, well-groomed horses of the same height and weight (well, within 0.1 gram)--one will be a pure white Lipizzan and the other a black plough horse. You will be allowed to view each horse for as long as you want. I will ask you if they differ and you will tell me, on the basis of your observations, how (color, breeding, form, etc.). And then I will ask you to pick the horse you think is best in all observed categories—breeding, color, form, etc. We will call the winner "A/X" and the loser “B.”

Then I will blindfold you to eliminate all "psychological factors" and "unconscious biases." And at short intervals, “A/X” and “B” will be trotted out in front of you, as many times as you want. You must decide by touch, smell, and sound alone (tasting is optional) which one is “A/X.” How do you think you'll fare? (And consider yourself lucky that I didn't propose this test with a garden snake and a Black Mamba.)

Are you saying that your "observational listening tests" are based more on observation than listening?

Quote:
And here’s the kicker: if you are unable to pick A/X in this AB/X test with “statistically significant” certainty, does that invalidate the fact that A/X is white and B is black, that A/X is a Lipizzan and B is a plough horse, or that you thought A/X had better form and breeding than B? According to you (“If results of two subjective tests do not match it does not prove anything”), it does.
I will try to be more clear. If the results of two subjective tests do match, that does not prove anything either. Test validity seems to be a very sensitive issue for you and Mr Harley, as if dismissing one test would make another one more valid. I never questioned the validity of either one. If it is a subjective test, there is no validity, nothing to validate against. Attempts to "validate" a test by bringing in something that is beyond its scope are poor practice.

Jonathan Valin -- Fri, 07/25/2008 - 02:51

I must say that your reply is, as your other replies have been, bewildering, as you keep shifting the grounds of the argument and dodging points. However, I’ll play some more, although we both know that neither of us is going to come to an agreement.

Quote:If it is a subjective test, there is no validity, nothing to validate against.

If you grant the subjectivity of double-blind listening tests (even, in your case, if you only grant it in a limited way), how can you then argue that they are more "legitimate" than observational reviews? Moreover, if there is nothing to validate against, how is it that so many reviewers and impartial listeners tend to agree on the sound of individual components, whether they prefer them to others they like or not?

Quote:The "standard observational listening tests" are inherently biased because all testers involved are interested persons. No harm done if they do it between themselves, but they publish "reviews" which say little more about the product than the ad on the next page, but in 100 different ways

Ah! Now we get down to the gist of this whole mishigas, which, IMO, really isn't about a reviewer's methodology but his candor and his judgment. It would be nice if audio were a winner-take-all sport (particularly if the winner was priced right and you happen to already own it), but, as I have already had cause to point out on another thread, it isn't, which is one good reason why reducing judgments about audio gear to a checkmark on a true/false test is idiotic. Audio is not just about good, better, best; it is about good, better, best for me. In short, it is about preference and taste. There, I've said it.

As reviewers, we try to clarify our preferences and taste (what we call our listening biases) in at least 100 different ways. We are upfront about this. I, for instance, prefer classical music (although not exclusively) and small-scale classical to large. These "biases" undeniably play a role in which equipment I prefer and which I don't. If your biases jibe with mine, then I may be able to point you toward some worthwhile gear. But that's all I (or any reviewer) can do—point in a direction and explain why I’m pointing.

As for being an “interested person,” well, sure. All reviewers, from movies to books to cars to wine to you-name-it, are. But as for being biased towards advertisers (if that’s what you’re implying—and if you’re not, others have), let's look, for example, at some of my recent picks in, oh, preamplifiers. The three I've liked best over the last two years are the hybrid-tube Audio Research Reference 3 linestage, the 300-B-based Audio Source Reference Two linestage, and the solid-state Parasound JC-2 linestage. What's interesting about these picks, at least for this argument, is that all three of these products sound much more alike than different, although each also has its own particular set of strengths and weaknesses. What is also interesting, at least for this argument, is that two out of the three preamps that I extolled are made by companies that don't advertise in our magazine. And the third is made by a company that once took out a small ad but hasn't been in our pages recently. My “interest” in them hasn’t been dictated by commercial concerns or advertising lit; I genuinely prefer them to other linestage preamps on the basis of close listening and comparison to what I consider the sound of the real thing.

Whether you or I or anyone can tell the differences between these three preamps—or any three preamps—in a blind-listening test with the same confidence with which we can distinguish among them in observational listening on familiar reference systems with discs we know well isn’t important. No one’s going to be listening “blinded” when he or she listens to music in his or her home, and no one is going to be listening in two- or three-minute bursts, followed by two- or three-minute bursts of another component’s sound. I am confident that, if a reader shares my biases, he or she will hear what I review much the way I do. I’ve had too many people, too many times, tell me that this is the case. And if this isn't some kind of objective validation that I and others at TAS are doing what we do well, then I'll settle for it until something better comes along.

michkhol -- Fri, 07/25/2008 - 13:12

Disclaimer: the whole argument started when Mr Harley questioned the validity of the double-blind test in which a hi-res signal was compared with the same signal passed through an A/D/A loop. As an argument he brought up a different test in which a live feed was compared with a digitized one at different resolutions. I think we both have to agree that those tests are quite different.

jvalin wrote:I must say that your reply is, as your other replies have been, bewildering, as you keep shifting the grounds of the argument and dodging points. However, I’ll play some more, although we both know that neither of us is going to come to an agreement.

I have to say that you don't answer all my questions either. What you may interpret as shifting grounds and dodging points may be partly misunderstanding and partly that I was not clear enough. People never see/hear things the same.

Quote:
If you grant the subjectivity of double-blind listening tests (even, in your case, if you only grant it in a limited way), how can you then argue that they are more "legitimate" than observational reviews? Moreover, if there is nothing to validate against, how is it that so many reviewers and impartial listeners tend to agree on the sound of individual components, whether they prefer them to others they like or not?

I was not the one who insisted that the double-blind test was subjective. I only pointed out that if we admit it is, it is no better or worse than any other subjective test. Hence I see no reason why I should give more credibility to one particular test in this case. In my personal opinion, however, the double-blind test eliminates most of the human bias and only requires a calibrated system to give adequate, if not necessarily likable, results. Regarding validation, the only information you have is a few (statistically speaking) personal opinions. What if a component sounds similar in 8 systems and totally different in 2 out of 10? Out of 1000? If it is not possible to guarantee how a component will sound in a particular system, how much is this information worth?

Quote:
In short, it is about preference and taste. There, I've said it.

Quote:
If your biases jibe with mine, then I may be able to point you toward some worthwhile gear. But that's all I (or any reviewer) can do—point in a direction and explain why I’m pointing.

Even if my biases jibe with yours (and they do) I will hardly own the same equipment you use for testing. So if I'm in the market for a new component I still have no idea how it would sound in my system. Your reviews usually are very interesting, and the latest review of the Naim 5i by Mr Harley is very educational indeed. But frankly speaking, I could not care less how it would sound in somebody else's system. Speaking of agreement, if something sounds too bright and harsh to one listener, it could be definitive and clear to another. The same sound could be dull or relaxed depending on who listens to it, what mood he is in, etc. But you know all that. The question is how to come to an agreement (i.e., a reference point) when describing sound?

Quote:
As for being an “interested person,” well, sure. All reviewers, from movies to books to cars to wine to you-name-it, are. But as for being biased towards advertisers (if that’s what you’re implying—and if you’re not, others have),

It's not that I'm concerned about bias towards advertisers; it is the lack of additional information which would help me make an educated choice. If I'm interested in a particular component, I obviously cannot make a choice based solely on the reviewer's taste. So I look for implementation details. I think you will agree that a transformer MC input will not sound the same as a transformerless one, and a passive RIAA stage will differ from one based on feedback. Likewise, a transformer volume control, a stepped shunt attenuator, and a regular variable resistor will most likely sound different. The same goes for coupling capacitors, transformers, and feedback in amplifiers, not to mention operating class. These are the criteria I use when I try to assess how a component will sound. Some manufacturers are very tight-lipped about implementation details, but more and more include them in their ads. And you know, sometimes I get more information from an ad than from a review.

Quote:
I am confident that, if a reader shares my biases, he or she will hear what I review much the way I do.
If it were true, our (my) life would be much easier. Unfortunately it can be true only under certain conditions. First, our music perception would have to be very close. Second, our sound systems would have to behave very similarly. I'm sure there are enough people who satisfy both conditions, but unfortunately that does not help the others much, even those with the same musical bias.

OK, now back to testing. Let me give you a dilemma: I have two different recordings made by the same label in the same place with the same performers. One recording is considerably darker than the other (because of different microphones used). Which one should I take as a reference? If I tried to judge my system by this dark recording, I would think that my system was dark-sounding. If I happen to play only bright records my system would sound bright. I can hear that transients are squashed on some recordings and overemphasized on others. How can I tell whether it is my system or the way it was recorded?

Steven Stone -- Fri, 07/25/2008 - 13:21

[ If I happen to play only bright records my system would sound bright. I can hear that transients are squashed on some recordings and overemphasized on others. How can I tell whether it is my system or the way it was recorded?]

Finally a question that I can answer...

The way to judge your system accurately is to make your own recordings.

For less than the cost of a top D/A you can have an entire on-location recording rig.

If you make the original recording you KNOW how it should sound.

That's what I do, so I can confidently say, "I don't need no stinkin A/B/X box."

Now back to your regular programming...

:roll:

Steven Stone
Contributor to The Absolute Sound, EnjoytheMusic.com, Vintage Guitar Magazine, and other fine publications

Jonathan Valin -- Fri, 07/25/2008 - 16:49

Quote:Even if my biases jibe with yours (and they do) I will hardly own the same equipment you use for testing. So if I'm in the market for a new component I still have no idea how it would sound in my system.

Here we can agree, perhaps for the first time, 100%. I have never said and will never say: "Buy this because I like it." Hell, I wouldn't buy something I recommended solely because I recommended it. I would have to hear it first--in my own system or, at the very very least, in a system that I was very familiar with and I considered to be very high quality and then at length and with my own music. And if this can't be done, then I would pass. Period. NO ONE SHOULD EVER PURCHASE AUDIO GEAR SOLELY ON THE BASIS OF A REVIEW, EVEN A REVIEW BY SOMEONE WHOSE JUDGMENT HE TRUSTS AND WHOSE BIASES ARE SIMILAR TO HIS OWN. When I said I was confident that people with my biases would hear a component the way I do, I did not mean to imply that they would like it as well as I do or run out and buy it because I liked it (although, to my horror, this has happened)--just that they would recognize its sound from my descriptions and perhaps give it a listen as something potentially worthwhile.

Quote: I have two different recordings made by the same label in the same place with the same performers. One recording is considerably darker than the other (because of different microphones used). Which one should I take as a reference?

Why both, of course! And the one that is darkish in balance should sound darkish, and the one that is brighter in balance should sound brighter. I get the feeling you already know this, but that is what we (or I) call "transparency to the source," and is indeed one of the observational "tests" I customarily make (and part of what I meant by using "known sources").

Quote:The way to judge your system accurately is to make your own recordings.
For less than the cost of a top D/A you can have an entire on-location recording rig.
If you make the original recording you KNOW how it should sound.

I agree with this with a couple of caveats. I've made my own recordings and I've been to Telarc recording sessions here in Cincinnati. The problem for me is this. You say you know how "it" should sound, Steve. But what is "it"? The sound of the real thing as you remember it? The sound of the "mike feed" (and, if so, played back through which mikes, which cables, what chain of electronics, and what speakers or cans?) The sound of the mastertape (and, once again, if so, played back through which electronics and which monitors and, if mixed, how EQ'd, potted, panned, etc.)? Or the sound of a finished CD and, if so, how processed and produced, etc.?

My point is that the "it" is a bit of a moving target. Nonetheless, it is still (or it can be) a somewhat "more knowable" source than most commercial media. However, this does raise the very interesting question of whether a stereo system should be "accurate" in the sense of sounding like what you remember a mastertape sounding like (albeit a mastertape played back through a chain of electronics and transducers) or whether a stereo system should sound "realistic" like what you remember the orchestra (you were recording) sounded like, unmediated by a chain of microphones, cabling, electronics, and monitors. We like to think that the one (fidelity to the absolute sound) will follow from the other (accuracy to the source), more or less proportionately, but I will always remember the great Doug Sax saying, in a colloquium I participated in in this magazine, that he preferred the sound of vinyl to the sound of mastertapes! Why? Because the "distortions" added by vinyl cutting and playback come closer to the sound of live music, of real instruments played in real venues.

But this is a different topic for a different day.

Steven Stone -- Fri, 07/25/2008 - 23:42

Jonathan, you are quite correct when you say,

Quote:

"My point is the the "it" is a bit of a moving target"

But the advantages of using recordings where the reviewer is intimately familiar with the live sound of the hall from the best seat, the live mike feed (from the best location in the hall), the A/D, and other finer points of the recording process make accurate sonic evaluation on a new or alien system far easier.

The question of whether a recording is better when it mimics the distortions of a live event is better left for another lengthy discussion.

:roll:

Steven Stone
Contributor to The Absolute Sound, EnjoytheMusic.com, Vintage Guitar Magazine, and other fine publications

Robert Harley -- Tue, 07/29/2008 - 10:13

Your editorial in Issue 183 ("The Blind (Mis-) Leading the Blind") is another apparent effort to discredit the role of double-blind evaluation of audio equipment, despite the fact (which, to your credit, you acknowledge) that similar methods are accepted, and even insisted upon, in scientific settings when testing claims of difference.

I have two basic issues with your opposition to double-blind testing methods. First, the quality of the reasons given for DBT's inadequacy and the quality of the argumentation, which largely assumes away the problem that DBT and similar methods are designed to deal with. Second, the implicit point that reviewers of audio equipment must choose one approach at the expense of eliminating the other. Both are faulty premises.

Your basic reason for dismissing DBT is not substantively different from John Atkinson's. It can be summarized as "DBT is obviously useless because it yields results saying that there is no discernible difference between things that I, subjectively, know to be different." However, many scientists have arrived at using DBT-like methods over many years because they understand the limitations imposed by human bias. Your response is, yes, but these are "golden ear" experts, and thus you imply that the problem of human bias is eliminated, or at least can be ignored. In essence, "trust us." But human bias cannot be so easily overcome in any other field of endeavor, and the opponents of DBT have not presented any factual data supporting a belief that it can be overcome here.

In addition to presenting no compelling reasons why "audio is different" from the many other areas in which DBT has proven to be useful or even essential to eliminate human bias, the anti-DBT arguments are marked by unsupported statements and not a little ad hominem rhetoric (which doesn't exactly quell concerns about human bias).

For example, you simply state that "blind listening tests fundamentally distort the listening process" as if there is some objective benchmark out there that establishes what the "listening process" is. As another example, you characterize DBT-users as "partisan hacks bent on discrediting audiophiles." Yet you do not explain why these individuals would be "partisan," nor do you explain why those who cash paychecks from publications like TAS and Stereophile, which derive a substantial amount of revenue from equipment manufacturers, should be considered any less "partisan." Moreover, you fail to explain why these "hacks" would wish to "discredit audiophiles", and, even more, fail to explain why the amorphous and largely unknowable group known as "audiophiles" is even worth taking the time to discredit. Perhaps if we changed the word "audiophile" to "high end audio journalist," your statement might have a bit more explanatory power. In any event, I'm simply pointing out the large number of confounding human variables and emotions that methods like DBT are designed to minimize.

Another hallmark of the "subjectivist" arguments is that you do not use factual data to discredit the proponents of DBT, but rather the same kind of subjective, qualitative claims that are at issue in the debate over DBT.

For instance, in your editorial, you state, with evident incredulity, that listeners in the DBT study published recently in JAES couldn't discern any difference between Red Book and various high-resolution digital formats. But why should one be so incredulous over this result? It is hardly an established fact that high-resolution digital sounds inherently "better" than Red Book CD. I do not need to rehearse the confounding factors for the relatively well-informed readership of this magazine, but they include improvements in recording, improvements in digital mastering and, not least and lest we forget, the very theory of human hearing, which holds that 44.1kHz is a high enough sampling rate to faithfully reproduce the vast majority, if not all, of the information capable of being retrieved by the human ear.
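
A minimal Python sketch may help readers see that sampling-rate point concretely (this is an editorial illustration, not part of the letter; the 18kHz tone and 20ms window are arbitrary choices). A tone below the 22.05kHz Nyquist limit, sampled at 44.1kHz, can be rebuilt essentially exactly by sinc interpolation:

    import numpy as np

    fs = 44_100.0          # Red Book sampling rate, Hz
    f0 = 18_000.0          # test tone, safely below the 22.05kHz Nyquist limit
    dur = 0.02             # 20ms segment

    # A densely sampled stand-in for the "continuous" signal
    t_fine = np.arange(0.0, dur, 1.0 / (32 * fs))
    x_fine = np.sin(2 * np.pi * f0 * t_fine)

    # The same tone sampled at 44.1kHz
    n = np.arange(0.0, dur, 1.0 / fs)
    x_n = np.sin(2 * np.pi * f0 * n)

    # Whittaker-Shannon (sinc) reconstruction back onto the dense grid
    x_rec = np.array([np.sum(x_n * np.sinc(fs * (t - n))) for t in t_fine])

    # Compare away from the segment edges, where the truncated sinc sum is accurate
    mid = (t_fine > 0.005) & (t_fine < 0.015)
    print("max reconstruction error:", np.max(np.abs(x_rec[mid] - x_fine[mid])))

The reported error shrinks toward zero as the window grows; the sketch says nothing about 16-bit quantization, converters, or filters, only about the sampling rate itself.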

My point isn't to say that high resolution is not better than Red Book, but simply to say that it is not remotely justified to act as if this is so established a fact that anyone questioning it should be subject to ridicule. Ditto for your point about expensive high-end monoblocks vs. "a $220 Pioneer receiver."

You come closer to using facts with your anecdote about the Swedish Radio codec, but isn't this a rather extreme example to which to have to resort to prove your point? Highly lossy, low-bit-rate compression vs. uncompressed audio is one thing; how about two different solid-state amps? Two different speaker cables? Two different power cables? Cables on the floor vs. on risers? These are the types of claims that are routinely made in your magazine. "Proving" your point with the low-bit-rate codec example is a bit of a bait and switch, isn't it? In any event, you don't give enough information about the circumstances of this Swedish Radio test to draw any real conclusions -- but a two-year, 20,000-trial study suggests a lot is going on.

Finally, your arguments against DBT implicitly rest on the assumption that DBT would fully displace non-blind, qualitative listening tests. In other words, you paint a highly reductionist "doomsday scenario" about what advocates of a scientific method would like to achieve. That's just a straw man.

I don't think anyone is advocating that we dispense with your qualitative listening tests and the reporting of your considered impressions, which your readership (including me) obviously values. But why not use DBT as an additional data point? It would be interesting to know, after you were through telling us about the latest and greatest power amp that has "lifted veils" and edged ever closer to "the Absolute," that in a DBT, listeners did no better than chance in distinguishing it from a 30-year-old Crown (or even a 10-week-old Onkyo). Or whether listeners in a DBT expressed no statistically significant preference for the amp under review over the reference(s).
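
As a rough illustration of what "statistically significant" means for such a forced-choice comparison, here is a minimal sketch (the 16-trial scores are invented examples, not data from any real test): it computes the chance that pure guessing would do at least as well.

    from math import comb

    def abx_p_value(correct: int, trials: int) -> float:
        # Probability that pure guessing (1/2 chance per ABX trial)
        # scores at least `correct` out of `trials`.
        return sum(comb(trials, k) for k in range(correct, trials + 1)) / 2**trials

    print(abx_p_value(12, 16))  # ~0.038: conventionally "significant" at the 5% level
    print(abx_p_value(9, 16))   # ~0.40: consistent with guessing

A score like the second one is the "no better than chance" outcome described above; it does not prove the two amps are identical, only that this particular test failed to distinguish them.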

DBT information should not replace reviews, but supplement them. In fact, retaining subjective reporting is necessary, as DBT does have limitations as applied to high-end audio. Because you cannot practically design tests that will cover the vast range of possible system configurations and listening modes that might be relevant, there will always be a role for the avuncular, experienced subjective impressions of an audio reviewer. But DBT as a reality check? Ignoring vendor-relationship considerations, no reason not to do it.

One thing we agree on is that the buyer should always make his or her own decision after personal listening. But auditioning audio equipment is expensive and time-consuming, so arming the buyer with as much of the relevant data as possible should be the objective of your magazine. Using some DBT would also likely have a desirable disciplining effect on your reviewers.

Sincerely,

Lawrence S. Makow

michkhol -- Wed, 07/30/2008 - 10:55

StevenStone1 wrote:

If you make the original recording you KNOW how it should sound.
:roll:
I make my own two-channel recordings of a symphonic orchestra, and I can assure you that the live sound in the hall, the direct mic feed, the sound from my monitors after the recording, and the sound from my home stereo system are all different. Not better or worse, but different. I can tweak the sound in my DAW and make it sound to your liking or anybody else's (just like mastering engineers do). So what should I use as a reference?

michkhol -- Wed, 07/30/2008 - 12:23

jvalin wrote:
Here we can agree, perhaps for the first time, 100%. I have never said and will never say: "Buy this because I like it." Hell, I wouldn't buy something I recommended solely because I recommended it. I would have to hear it first--in my own system or, at the very very least, in a system that I was very familiar with and I considered to be very high quality and then at length and with my own music. And if this can't be done, then I would pass. Period.

I am glad that we agree on this 100%. But I would also appreciate it if TAS advocated more for in-home auditions. So far only one dealer in my area allows in-home auditions (for a weekend) but unfortunately it does not carry everything I would like to audition. All others say "come and listen" or "it has great reviews what else do you need?" or "we allow in-home auditions only if you commit to buy". In this situation I have to "pass" almost every time. A few Internet dealers have good return policy, but again they don't cover everything.

I have been reading TAS for over four years straight, and all the reviews (except maybe a couple) say how good component X sounds, with slight variations. Even if I tried to compare reviews of components X and Y from the same reviewer, it is not always apparent what particular reference equipment was used to play a particular reference piece of music (provided the reviewer uses a predefined set at all). I trust your judgment (as well as that of any other professional reviewer), but as a whole the TAS review system is not calibrated, or rather there is no system at all. Without a system (any system) it is just a bunch of expanded ads with a personal touch. Now if I see something new reviewed in TAS and consider it for purchase, I read only the technical part. For the sound, I try to find negative customer reviews on the Internet or elsewhere. Usually they are exaggerated, but they pinpoint the weak spots well enough. And the final test, of course, is the in-home audition.

Quote:
Why both, of course! And the one that is darkish in balance should sound darkish, and the one that is brighter in balance should sound brighter. I get the feeling you already know this, but that is what we (or I) call "transparency to the source," and is indeed one of the observational "tests" I customarily make (and part of what I meant by using "known sources").

But how do you know how it should sound in the first place? Statistically if something sounds bright on N systems and dark on an N+1th system it does not reveal any information about the "true" tonality of the sound, only differences between system reproduction. In any case it makes no sense to talk about a system sounding "transparent to the source" (which one by the way?). What I read in every article is comparisons to the "ideal" sound but what is ideal for you is not necessarily ideal for me.

Quote:
...
but I will always remember the great Doug Sax saying, in a colloquium I participated in in this magazine, that he preferred the sound of vinyl to the sound of mastertapes! Why? Because the "distortions" added by vinyl cutting and playback come closer to the sound of live music, of real instruments played in real venues.

I see here a paradox. A digital recording is technically closer to the source (the master media) than equalized and compressed vinyl. And yet the vinyl sound is considered more real, and no one seems to care that it came from the same source.

Jonathan Valin -- Wed, 07/30/2008 - 14:41

Quote:So far only one dealer in my area allows in-home auditions (for a weekend) but unfortunately it does not carry everything I would like to audition. All others say "come and listen" or "it has great reviews what else do you need?" or "we allow in-home auditions only if you commit to buy". In this situation I have to "pass" almost every time. A few Internet dealers have good return policy, but again they don't cover everything.

What can I say? You’re doing the right thing, and those dealers who don’t permit home auditions aren’t.

Quote:But how do you know how it should sound in the first place? Statistically if something sounds bright on N systems and dark on an N+1th system it does not reveal any information about the "true" tonality of the sound, only differences between system reproduction.

In theory, of course, you are correct. We don't know how a record or CD "should" sound in the first place. But in practice, as I'm sure you already know, if a given record sounds "dark" in overall balance on N systems and then sounds "bright" or markedly less dark on N+1, I would conclude that N+1 is adding a brightish coloration to the disc. Of course, this brightness will (or should) be apparent on all discs played back through N+1. And in practice it always is. I can't think of an instance where a given component selectively darkened or brightened previously bright or dark sounding discs, although I can think of components that have a kind of two-tone balance (i.e., they sound dark and bright simultaneously), and I can think of plenty of components that sound marginally less dark or marginally less bright than other components on basically dark or bright source material (i.e., that seem to bring them closer to neutral in balance without changing their fundamental character).

Quote:In any case it makes no sense to talk about a system sounding "transparent to the source" (which one by the way?). What I read in every article is comparisons to the "ideal" sound but what is ideal for you is not necessarily ideal for me.

You're mixing apples and oranges—again. (And maybe we are too).

“Transparency to the source” (as I use the term) means that a component is playing back an LP or CD without adding much opacity or coloration of its own. You will say—in fact, you already have—how do you know how that source sounds (or is supposed to sound); ergo, how do you know what's being added or subtracted precisely. And I concede that I don't know precisely, but I know approximately (after scores or hundreds of listens to the same discs on scores or hundreds of pieces of gear and systems). I know, for instance, that the intervals of the ostinatos played by the cellos and basses on the Mercury recording of The Firebird are clearly audible on certain systems (and not audible or as audible on others); I know that Joni Mitchell sings the words "a dark cocoon" in the penultimate lines of "The Last Time I Saw Richard" from Blue and that this very-difficult-to-decipher line is clear on a very few systems and components and not clear or inaudible on many others; I know that the bassist counts time sotto voce at the very start of Sunny (Mapleshade) and that very few components or systems reproduce this without making you strain to hear it (even if you know it's there). I know literally thousands of things about the pitches, timbres, intensities, and durations of the music on given recordings (and so, in spite of your skepticism, do you). And these things—pitches, timbres, intensities, durations—aren’t approximate; you can find them clearly written in scores and you can compare scores to recordings (which I do). This doesn’t mean that a performer or conductor doesn’t take liberties with the “text” or that the instrument, the manner in which it is being played, how cold or warmed up it is, what hall it is being played in, which mike it is being recorded by, etc. don’t affect timbre, but it does mean that the harp pizzicatos that are adding just the slightest gorgeous bit of color and vibrato to the doublebasses at the start of the Passacaglia of Lutoslawski’s great Concerto for Orchestra ought to be audible (and they aren’t always—or often). Taken together, these little things—heard, not heard, changed, and (very rarely) revealed for the first time—added to the big things like tonal balance and dynamic range and transient response add up to a pretty reliable idea of whether a given component or a given system is "transparent to sources."

Now, this comparison to the "ideal" (to the absolute sound, which is to say to the sound of live music as you or I hear it) is not the same thing as "transparency to sources," although transparency to sources definitely plays a key part in it. The first is a matter of fidelity to what was recorded; the second is a judgment that reviewers and listeners make about the relative realism of what was recorded and played back, and that depends on the quality of the source, the transparency and, for lack of a better word, personality of the hi-fi system, and your own idea of what constitutes the absolute sound. “Realism” is both relative and absolute—relative in the sense that we blind men all have slightly different perspectives on that elephant, the absolute sound, but absolute in the sense that we instantly know “real” (as opposed to ersatz) when we hear it—we know that “that” is a real piano and not a recorded one playing in our living room, regardless of inevitable differences in the instrument’s timbre and the performer’s touch. Even animals catch on to when something is fake. I fondly remember, years ago, my wife making a recording of her voice on a little tape deck in which she called out our dogs’ names (we had three of them) and told them to “Come.” The very first time she did this, the dogs (who had never heard such a thing before) ran over to the table where the tape recorder was playing. They were “fooled” by their first experience of hi-fi. However, they were never fooled again; no matter how many times she played the tape, they ignored it. Of course, they didn’t know it was a tape, but, after their initial astonishment, they knew it wasn’t “really” Kathy calling to them.

Quote:I see here a paradox. A digital recording is technically closer to the source (the master media) than equalized and compressed vinyl. And yet the vinyl sound is considered more real, and no one seems to care that it came from the same source.

It is a paradox (although not all LPs are “equalized and compressed”—consider direct-to-disc LPs, for example—and in any event they aren’t all equalized or compressed to the same degree). I would conclude that perceived sonic “realism” is, to some undefined (as yet) extent, an attribute of recording and playback media. I would also conclude from experience that perceived realism is also relative to the listener.

michkhol -- Thu, 08/07/2008 - 14:46

Sorry for the delay in replying; I cannot visit this forum as often as I would like.

jvalin wrote:
In theory, of course, you are correct. We don't know how a record or CD "should" sound in the first place. But in practice, as I'm sure you already know, if a given record sounds "dark" in overall balance on N systems and then sounds "bright" or markedly less dark on N+1, I would conclude that N+1 is adding a brightish coloration to the disc. Of course, this brightness will (or should) be apparent on all discs played back through N+1. And in practice it always is. I can't think of an instance where a given component selectively darkened or brightened previously bright or dark sounding discs, although I can think of components that have a kind of two-tone balance (i.e., they sound dark and bright simultaneously), and I can think of plenty of components that sound marginally less dark or marginally less bright than other components on basically dark or bright source material (i.e., that seem to bring them closer to neutral in balance without changing their fundamental character).

That's what I'm talking about. There is no absolute reference. You can compare numerous systems with the same source material, but you still have no information about how this source material "should" sound. Even if you bring the CD to the mastering studio, it will sound different compared to the hi-res master it was cut from (due to dithering). My point is that it is plainly incorrect to say "this system/component sounds more/less realistic" because there is no such thing. Granted, I have seen only a few such remarks in your magazine lately, but each of them disorients the reader.
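
For readers curious what that word-length reduction involves, here is a minimal sketch (the 1kHz tone at -6dBFS is an arbitrary example, and TPDF dither is only one common choice) of reducing a signal to 16 bits with dither and measuring the error the step adds:

    import numpy as np

    rng = np.random.default_rng(0)
    fs = 44_100
    t = np.arange(fs) / fs                  # one second
    x = 0.5 * np.sin(2 * np.pi * 1000 * t)  # "hi-res" 1kHz tone at -6dBFS

    # Reduce to 16 bits with TPDF dither (sum of two uniform +/-0.5 LSB sources)
    lsb = 1.0 / 2**15                       # one 16-bit LSB on a +/-1.0 full scale
    dither = (rng.random(x.size) - 0.5 + rng.random(x.size) - 0.5) * lsb
    x16 = np.round((x + dither) / lsb) * lsb

    err = x16 - x
    rms_db = 20 * np.log10(np.sqrt(np.mean(err**2)))
    print("error added by dithered 16-bit reduction:", round(rms_db, 1), "dBFS")  # about -96

Whether an error floor that far down is audible is precisely what this thread is arguing about; the sketch shows the mechanism, not the verdict.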

Quote:
“Transparency to the source” (as I use the term) means that a component is playing back an LP or CD without adding much opacity or coloration of its own. You will say—in fact, you already have—how do you know how that source sounds (or is supposed to sound); ergo, how do you know what's being added or subtracted precisely. And I concede that I don't know precisely, but I know approximately (after scores or hundreds of listens to the same discs on scores or hundreds of pieces of gear and systems).

Thank you for the clarification. I agree that the more transparent the component path is, the more detail you will hear from the source. But this difference can also be partly due to interaction between the component and the rest of the system. The transparency of a component can vary with the system it works with. Basically this means that if you claim a component is transparent to a certain degree, the claim will be valid only for the system(s) you tested it with. Will it be as transparent in a random system? In a similar system, most likely, but somehow I don't think that most TAS readers have systems similar to the ones you use. Me, I definitely don't, and I remember how disappointed I was with an amplifier which had all kinds of awards and rave reviews.
Another question is, were the counting of time, the transients, the sounds, etc. meant to be heard in the first place? Maybe the counting was not suppressed enough at mastering, or the transients were not compressed as they should have been, or the maestro was not good enough to soften the orchestra at that point, or there was no good take for that part of the song. I mean, you may end up using mastering/recording flaws as criteria to judge the quality of a playback system.

Quote:
“Realism” is both relative and absolute—relative in the sense that we blind men all have slightly different perspectives on that elephant, the absolute sound, but absolute in the sense that we instantly know “real” (as opposed to ersatz) when we hear it—we know that “that” is a real piano and not a recorded one playing in our living room, regardless of inevitable differences in the instrument’s timbre and the performer’s touch.

I don't think that the comparison is valid. In both cases the sound waves are a real physical phenomenon. The difference is that the sound waves from a real piano are in no way the same as those coming from a speaker, and that's what we hear. It is physically and theoretically impossible to hear "real" sound that is not coming from real instruments. And even if it were, our ears normally are not where the microphones are. That's why no recording can sound objectively "real". As for our knowledge of the real piano sound, we know it as dogs know a smell. It's not that the dogs were not fooled by the recording; it is that the record player did not smell like your wife :)
Quote:
It is a paradox (although not all LPs are “equalized and compressed”—consider direct-to-disc LPs, for example—and in any event they aren’t all equalized or compressed to the same degree). I would conclude that perceived sonic “realism” is, to some undefined (as yet) extent, an attribute of recording and playback media. I would also conclude from experience that perceived realism is also relative to the listener.
Regardless of the recording method, there is always a natural compression coming from the way the sound is picked up from the vinyl. And RIAA equalization after that. Both produce harmonics not present in the source signal. By pure luck, it appears that those harmonics sound pleasant to the human ear. So this perceived sonic "realism" is no more than ear candy. Nevertheless, I like it.

Andy Simpson -- Sat, 09/20/2008 - 08:27

gmgraves wrote:I don't pretend to be able to explain your European radio CODEC story, but obviously something was wrong if no one in a listening test heard a 1.5 KHz noise in the program material. Perhaps the panels weren't given more than a few seconds to hear each sample, I don't know, but it would seem to me that anyone would be able to hear a non-correlated 1.5 KHz tone in a musical performance if given ample time (no more than a few seconds) to focus in on it.

I'd also suspect auditory masking and/or (equal-)loudness-related issues.

For the golden-ear'd fellow who noticed the tone after the 'extensive' test, I'd be surprised to learn that he didn't crank up the volume in order to notice this.

Andy

Andy Simpson -- Sat, 09/20/2008 - 09:27

jvalin wrote:
“Transparency to the source” (as I use the term) means that a component is playing back an LP or CD without adding much opacity or coloration of its own.

The big problem with this idea (aside from those outlined elsewhere in the thread) is that simple loudness, according to the well-known 'equal loudness effect', can have a drastic effect on spectral perception.

This spectral perception in turn has a drastic effect on many other factors - balance, reverb levels, timbre, auditory masking (depth), etc.

In other words, where a recording has been mixed/mastered for 'ideal sound' at a specific listening level, at any other listening level the spectral balance is likely to deviate from 'ideal' (how could it not?).

Almost none of the recording industry is actually listening at performance SPL, so whatever spectral choices they make (almost all subjective decisions - processing, equipment choices, balances, etc.) are made according to the loudness level at which they listen.

In this way, the 'absolute sound' is absolutely bound to the equal loudness effect.

Andy

rwortman -- Sat, 11/01/2008 - 23:08

All very interesting stuff here. It took me a few months to get around to registering. First, I admit to objectivist leanings. I don't think cable elevators, magic pebbles, little bits of shiny tape, CD demagnetizers, and a whole host of other tweaks do anything but lighten the wallet of the buyer.

I don't believe that DBT is infallible. I agree that certain characteristics of an audio system that are not readily apparent when listening to snippets of music can become painfully apparent in longer listening sessions. I don't believe that measurements tell us the whole story. I think we can do more and better measurements, but by and large the audiophile community thinks all measurements are rubbish, so no one outside of a design lab really wants to bother. I also believe that the human ear/brain isn't a very reliable measuring device either. I have read examples of A-A comparison tests where listening panels reliably heard differences between an audio component, cable, or tweak and itself when they were told that something had been changed. Our brains are a big box of biases and preconceptions that filter everything we see, hear, or think. There is simply no way to stop this from happening, no matter how noble our intentions. Knowing this, I don't see how we can dismiss any testing methodology merely because we don't like the conclusion. Things that are established scientific theory today were dismissed as absurd when first proposed. RH would probably say any test that "proved" that a decent bit of shielded coax and a pair of $5 RCAs sounded the same as a $1000 interconnect was absurd. I, on the other hand, think that any subjective listening test that finds that a $1000 interconnect's effect on a system is "nothing short of amazing" is just as absurd. Neither of us really has a right to reject the test results because they fail to agree with our preconceptions.

I think a bit of ABX would be an interesting addition to any review staff's arsenal. I have read way too many reviews where some audio gadget that, from an engineering perspective, should have had a minuscule effect or no effect on the sound (equipment feet, power cords, connectors) is characterized as an astounding sonic revelation. At the very least, a reviewer-friendly DBT should be able to separate the higher-order improvements from the barely audible ones.

If regimented comparison tests are demonstrably unreliable, sighted listening tests aren't that damn reliable either, and measurements can fail to correlate with either, what does that leave us with? It leaves us with a magazine chock full of unsubstantiated opinion. Isn't that why we buy the magazine? Or any other enthusiast magazine? I read Motorcyclist to find out what experienced riders think of riding the latest machinery, not to read the spec sheets. I read TAS to find out what RH and the rest of the staff think about the equipment they review. I do find the measurements that their competitor does interesting, just to see how and in what ways they do or don't correlate with the listening impressions, but that is just a bit of seasoning on the real meat and potatoes. I also know that because all of the listening impressions are sighted and uncontrolled, all manner of biases and subjectivity are coloring the reviewers' opinions. So what? The same thing happens to me too.

The only thing I really worry about is that this community's insistence that people accept as fact all the tweakier aspects of modern-day audio (or be just as ridiculed from the inside as the tweakers are from the outside) is pushing people away. I hope that we aren't causing this market to shrink to the point where affordable high-performance audio gear will cease to exist. When I buy a new piece of gear today, I know exactly one person I can talk to who shares my enthusiasm. Twenty-five years ago, almost everybody I knew would have wanted to come over and have a listen.

Tom Martin -- Mon, 11/03/2008 - 11:15

As discussed above, I don't think we at TAS are trying to dismiss any testing methodology. It really doesn't matter to us in some conceptual sense what approach we use.

There are a few things we're trying to point out:

1. As rwortman says, we shouldn't be rejecting testing approaches because they provide results that challenge our preconceptions. I think we at TAS are simply trying to point out that there is a preconception about DBT that shouldn't be automatically accepted: that DBT is clearly the definitive and superior method for characterizing audio equipment performance. There are sound reasons not to accept this. Rejecting the preconception is not the same as dismissing DBT.

2. I think we have a practical argument against DBT. We probably should have stated this more clearly, but the practical argument is this: DBT is much harder to do well than the approach we use, and so it is a much less efficient way of providing information to our readers. Again, that is not an effort to dismiss DBT as a methodology per se, but rather an explanation of why it may not be useful for a magazine. This same logic applies to quantitative measurements.

3. We think it is logical to take an open-minded stance about phenomena. That's because we're focused on finding ways for our readers to enhance their musical pleasure. Too often over the last 40 years, something that "an engineering perspective" indicated was not plausible, but was audible via our methods, turned out to have a solid physics-based explanation. This happens in many other fields as well (cf. Thomas Kuhn's masterwork The Structure of Scientific Revolutions). We care more about the result than the explanation. Thus, I don't think RH would assume that a test that showed $5 coax sounds the same as a $1000 interconnect was absurd (we've published a test that indicated Home Depot extension cords were excellent speaker cables). Proving it is a different matter (see the null hypothesis discussion above), but in any event a proof wouldn't be absurd, just really hard to do. In the end, though, proof isn't our worry.

4. We would agree that non-blind, long-term listening tests have plenty of issues. We have enumerated these, and will continue to work on them.

rwortman makes an interesting point about pushing people away. I just don't know that TAS or Playback or AVGuide are doing this by being tweaky (probably he/she was referring to the audio community in general). I do think there is a special language and a tendency not to give much context to the reader. That would seem to be a problem for attracting new players and might get old after a while for those who were involved 25 years ago. We've been trying to address that, but maybe we can do more. In the end, though, we suspect that people who were interested in audio 25 years ago have a host of reasons for losing interest. That would be an appropriate conversation in the section "High End Audio Industry".

CEO and Editorial Director, Nextscreen LLC

rwortman -- Mon, 11/03/2008 - 12:08

I am a 51-year-old "he". Yes, I was referring to the audiophile community in general and certain online fora in particular. There are places where you can't mention skepticism about any far-out tweak without being excoriated by the resident "experts." This seems like a good place for serious, civil discourse about a pastime we all love. It seems I was hanging out in the wrong neighborhood.

Tom Martin -- Mon, 11/03/2008 - 12:41

Well, I wouldn't say there are no "excoriators" here, but the intent is to have a civil and thoughtful dialog!

One of the things you may not like, but may enjoy with some background on its purpose, is that -- as far as our published material goes -- we are trying to give the benefit of the doubt to equipment ideas and tweaks we don't understand. The logic is that in a rapidly changing technological world, a reviewer should balance skepticism with open-mindedness so as not to become either dogmatic or goofy. I have, for example, asked how it could be that power cables make a difference. My first reason for doing this is that sometimes there is a good explanation that I (despite having an EE degree) have missed. But even if there is no explanation, you still have to try to listen for what happens. We can't always explain things at first, but once the phenomenon is heard, others figure it out. I believe that the concept of "jitter" in consumer media coverage of digital went through this cycle.

There is, then, an interesting epistemological question: for one to experience a phenomenon, does one have to have an explanation for it? We have tried to stand on the side of the answer being "no", but some of us are coming to believe that the answer just flat out varies from person to person. I don't think we understand yet how to provide the necessary explanatory framework (at least with any rigor).

CEO and Editorial Director, Nextscreen LLC

blue2blue (not verified) -- Mon, 12/22/2008 - 22:48

It's well observed that a scientific test such as a double-blind preference test should be carefully analyzed not only for methodology and rigor, but also with an eye toward a cautious and careful understanding of what can be learned from the results. It's important not to draw invalid conclusions from valid results.
That said, there is a certain clarity of understanding when one confronts the reality that one cannot tell A from B in certain circumstances. One should be careful about the conclusions one draws, to be sure. But there is a certain clarity.
