
January 17, 2012


I guess I've been sufficiently indoctrinated by the psychological perspective that it makes sense to me to differentiate between validity and reliability, giving greater weight to validity. Either that, or I'm not understanding your argument for change, because it seems almost tautological to me that, if a test isn't valid for a given purpose, then it doesn't matter if it's reliable.

@ Rachel:

I think you are about 90 percent there. Look at it this way. There is only one idea, and that is VALIDITY - the demonstration that a test has been used successfully for an intended purpose. That's it - end of story. To the extent that there is enough documented history of a test being valid so that you can expect it to perform as a valid test the next time you use it, then it is reliable. Reliability is the expectation of validity.

"...[I]f a test isn't valid for a given purpose, then it doesn't matter if it's reliable." Therein lies the rub. Not only does it not matter, it was a logical and philosophical (not to mention scientific) mistake to think a test could be reliable but not valid. The early experimental and testing psychologists did not understand what their colleagues were saying. The other scientists, in my opinion, did not expound further because they assumed that psychologists knew that they were talking about instruments that were already valid. To them, reliability was consistency of being valid, not consistency of scores without purpose. Psychology just didn't understand this.

There is only validity.
Validity is the only thing.
There is nothing but validity.

Ordinary language would seem to support the psychologists. As always, OED:

reliable, adj. and n. Statistics. Originally: accurate; free from error. In later use: (of a method or technique of measurement) that yields consistent results when repeated under identical conditions.
valid, adj. (and n.) Of arguments, proofs, assertions, etc.: Well founded and fully applicable to the particular matter or circumstances; sound and to the point; against which no objection can fairly be brought.

Of course, the question then becomes whether ordinary language, reliably construed as we may assume it is here in OED, is a valid measure of the scientific merit of the concepts it denotes. I read in your account, Norm, a confusion of the meaning of the statistical use of the word "reliable" with its more general meanings:

Of a person, information, etc.: able to be trusted; in which reliance or confidence may be placed; trustworthy, safe, sure.


orig. U.S. Of a product, service, etc.: consistently good in quality or performance; dependable.

The U.S. torpedoes were unreliable in this general sense. They produced unhappy results, they performed badly. But the consistency of results points out their nearly perfect reliability as an indication that something was amiss.

I don't read the Standards excerpts as requiring independence of intended purpose. From the definition, the purpose of the scoring procedure is to "enable the examiner to quantify, evaluate, and interpret the behavior or work samples." Consistent results suggest the procedure is reliable. Likewise, the converse. Validity addresses whether the test results are well suited to the intended purpose. Are these scores assigned to the examinee's behavior a sound basis for interpreting that behavior? That seems to me to be a question largely distinct from the question of reliability.

@ Dean:

Thanks for taking the time to read and do a thoughtful analysis. You are hitting on all the substantive issues.

I just got home, and put away the groceries. Your comment deserves a thoughtful comment in return. See ya' later, after supper.

Great post. From my wife's work with questionnaires about people's responses to work stress, I suspect that some of these are designed to be "reliable" in coming up with expected results for the writing of research papers and yielding nice Excel graphs.

@ Dean:

"Ordinary language would seem to support the psychologists [regarding the definition of reliability as consistency of observed results.]"

Yes, that is true. The interesting thing is why it supports the view of psychology that is embedded in the Standards. That is because it comes directly from the development of statistics from Galton to the pioneers of psychological statistics (Pearson, Spearman, Richardson, Thorndike, Thurstone, Gulliksen, Guilford, Guttman, and others) to the present time. That is why it reads "reliable, adj. and n. Statistics.... In later use: (of a method or technique of measurement) that yields consistent results when repeated under identical conditions [emphasis mine]." This just goes to show how influential the development of statistics for the social sciences has been.

The Standards, and every textbook on psychological testing in the world, is very clear about reliability being consistency of observed results, completely independent of purpose. A measurement instrument can be reliable, even if it is not valid. A given measurement instrument may be valid for several intended purposes, and not valid for many others, but reliability is a constant once it is determined. Keep in mind that all of this is true because it is defined as true. However, no physicist, or chemist, or electrical engineer, or weapons developer, or radar operator would agree with this.

When we examine the functioning of the torpedoes, we find that a torpedo which, as an instrument of war, explodes on contact with a ship's hull is a valid instrument. If this is demonstrated once, we have established its validity: successful use for an intended purpose. However, validity is not sufficient for adoption as a weapon system by the U.S. Submarine Service. It must also be reliable; that is, it must perform successfully every time it is used. So a performance requirement is stated by the U.S. Navy that specifies the consistent results that will be evidence of reliability. For example, it might state that successful detonation must be expected 95 percent of the time a torpedo strikes a ship's hull.
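The performance-requirement idea can be sketched numerically. This is a minimal illustration, treating each hull strike as a Bernoulli trial; the trial counts are invented for the example, and only the 95 percent figure is carried over from the text:

```python
# Minimal sketch: checking an observed detonation rate against a
# stated performance requirement. The counts below are invented.

def observed_rate(detonations, strikes):
    """Fraction of hull strikes that produced a detonation."""
    return detonations / strikes

def meets_requirement(detonations, strikes, required=0.95):
    """True if the observed rate satisfies the performance requirement."""
    return observed_rate(detonations, strikes) >= required

# A batch with 88 detonations in 100 strikes fails a 95% requirement;
# one with 97 in 100 passes.
print(meets_requirement(88, 100))  # False
print(meets_requirement(97, 100))  # True
```

The point of the sketch is that the requirement only makes sense for a weapon already known to be valid; the rate quantifies how consistently that validity is demonstrated.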

Psychological science separates the concepts of validity and reliability, explicitly, in the abstract and in everyday use. Other sciences tend to put them together, in most usages, and call the whole thing reliable. Thus, reliability means consistent performance of a valid tool. This is why the early psychologists understood reliability in other sciences to be consistency of results. This is one of the reasons why the concept of reliability was able to develop independently of validity. Other reasons include the development of test theory based upon the concept of true score, and the concept of parallel tests. Finally, the use of the correlation coefficient to compute reliability sealed the deal. All of these supported the idea of reliability as consistency of results independent of validity.

However, the foundation of test theory (classical test theory) was built upon two tautologies, true score and parallel tests. All of reliability theory was constructed upon a tautological foundation that was devised to develop reliability as the early psychologists understood it. The fact that the correlation coefficient could compute an index of reliability was the icing on the cake of a perpetual delusion. Imagine, for a moment, how things would have been different if they had understood, clearly, what other scientists were telling them about RELIABILITY AS CONSISTENCY OF RESULTS FOR A VALID INSTRUMENT. I am going to talk about the foundation of test theory in another article.

The U.S. Navy had a valid system (torpedoes with exploding war heads) that was unreliable. The German saboteurs had a valid system (torpedoes with inoperable detonators) that was very reliable. Reliability is the expectation of validity. The system has to be valid before we can make an assessment of reliability.

Your last paragraph is addressing a fundamental issue. Your last sentence leads us to the real question: "That [a test is valid] seems to me to be a question largely distinct from the question of [psychology's] reliability [as consistency of observed data]." Not only is it distinct, it is distinct because reliability in psychological test theory doesn't exist. It is an illusion. THERE IS ONLY VALIDITY! Reliability, for the rest of science, is observing repeated demonstrations of validity.

Norm's latest comment prompted me to explore further. I quickly found a recent article in the criminology literature that addresses reliability and validity in the context of prisoner self-reporting. It largely echoes what you describe in the psychology context. I was pleased to see that the first reference in the discussion of reliability was to Babbie's widely used Practice of Social Research, which I enjoyed reading back in the late '80s. I've turned to a later edition courtesy of Google Books, on page 151 of which begins a pertinent discussion. Before addressing it, however, I want to state what probably goes without saying. Reliability and validity are analytical constructs for data collection and study. They can only be true in accordance with what might seem to be arbitrary determinations of what passes for true, such as a performance standard set by the military. If officials raise the standard for successful torpedoes to 97%, then it is possible that they will have deemed unsuccessful some number of formerly successful torpedoes. There's nothing wrong with this sort of tautology.

Might a reason the other sciences combine the concepts be that they are engaged in enterprises that have an obvious and direct effect on the material world? Two of your examples suggest as much. Engineering and the military use tools to alter the material configuration of the world. Either the alteration succeeds or it doesn't. There's little need to query explicitly the validity of an exploding torpedo as a weapon. (As I described above, however, the measure of success is less obvious, and so it demands the mediation of a proxy standard.) Again, I think the torpedo analogy is also clouding the illustration. The two distinct meanings of "reliable" are in play. In psychology and the social sciences, validity and reliability are conventional predicates, parameters developed in disciplines to characterize grounds for authority of procedures designed to represent states of affairs in the world that don't have readily apparent material manifestations. A torpedo and a test are both tools, but the work they do is distinct, and amenable in different degrees to measures of success. A torpedo succeeds when it strikes and destroys its target; a test succeeds when, after careful analysis, the examiner can accurately predict such-and-such or diagnose such-and-such respecting the examinee. Consequently, assuming reliability of the data gleaned from the test, the examiner must embark further on the additional task (analytically speaking) of making sure the reliable data may validly be applied to the prediction or the diagnosis. The examinee in effect asks, "Will I strike (or have I struck) the target?"

The passages beginning on p. 151 of Babbie (unfortunately, in the middle of his treatment of "reliability") are admirably straightforward discussions of these concepts. It's hard for me to see not that reliability and validity are ultimately two sides of a coin--a notion I get--but that it is fatally wrong to distinguish them operationally and analytically, if only to assure that problems presented by them in study design are avoided. Look at Babbie's elegant target(!) illustration, fig. 5-2 on p. 155. Keep in mind that the analogy, like all analogies, only works so far. He is talking about collecting data.

@ Dean:

Originally, I was going to produce a very large work for a technical audience and technical journal. I decided, and I am glad I did, to publish it in small installments in a less technical format, and accessible to anyone who has an interest in the subject. I can't tell you how flattered and appreciative I am for your reading and commentary. Yours and the comments of others are helping me develop a better technical presentation and work out the kinks in my arguments.

"It's hard for me to see[,] not that reliability and validity are ultimately two sides of a coin--a notion I get--but that it is fatally wrong to distinguish them operationally and analytically [as is done in the Standards,] if only to assure that problems presented by them in study design are avoided."

Again, you are zeroing in on the crux of the problem. In a later article, I will deal with the flaw for psychology, and show that it is fatal. Here is a preview: Let us say that you have a psychological test for a mental disorder, and that it has been shown to be valid - it is successful in diagnosing a mental disorder and classifying those people for specific therapies. It has a validity coefficient of r = 0.53. Of course, this valid test is not perfect (r = 1.00), and there will be errors in diagnosis and in treatment classification. There will be false positives and false negatives. So, we want to improve the validity and reduce the errors of false positives and false negatives.

In psychological test theory you can improve validity by either improving validity directly, or by improving reliability directly. In test theory validity is contingent upon reliability. So when you improve reliability the validity coefficient improves. I will show two things: 1. When you do all the things that can be done to improve validity, directly, without addressing reliability, there is nothing else that can be done to improve validity, indirectly, by improving reliability. 2. It is impossible to improve only a reliability coefficient, directly or indirectly, without making any reference to validity. It cannot be done. For example, if you have a test of unknown validity, but known reliability, you cannot improve reliability (consistency of scores) without addressing validity. If I can demonstrate this, then it supports the idea that there is only validity, and that the concept of reliability as consistency of scores without reference to purpose is an illusion. It doesn't exist.
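For readers unfamiliar with the textbook machinery being criticized here, a minimal sketch of it: in classical test theory the validity coefficient is bounded above by the square root of the reliability, and the Spearman-Brown formula projects how reliability rises when a test is lengthened. The 0.70 reliability figure below is an invented example, not a number from the article:

```python
import math

def spearman_brown(reliability, k):
    """Projected reliability when test length is multiplied by k
    (the classical Spearman-Brown prophecy formula)."""
    return k * reliability / (1 + (k - 1) * reliability)

def validity_ceiling(rel_x, rel_y=1.0):
    """CTT upper bound on a validity coefficient: r_xy <= sqrt(r_xx * r_yy)."""
    return math.sqrt(rel_x * rel_y)

# With test reliability 0.70, doubling the test length projects a
# reliability of ~0.82, and validity can never exceed ~0.84 -- the sense
# in which "improving reliability" is said to improve validity.
print(round(spearman_brown(0.70, 2), 3))   # 0.824
print(round(validity_ceiling(0.70), 3))    # 0.837
```

These formulas are exactly the reliability-to-validity linkage the argument targets: both are derived from the true-score and parallel-tests assumptions described above.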

How would you address those situations which produce extremely consistent patterns of results (i.e., statistical reliability) with uncertain validity due to, for example, being unable to identify what factors impact those results?

I don't think I'm framing my question well. Essentially, do you see a value in non-contextual reliability if it is used as a tool to help discover validity (e.g., helping understand what the results mean or actually indicate)?

@ Rachel:

Thanks for reading and formulating these questions.

"How would you address those situations which produce extremely consistent patterns of results (i.e., statistical reliability) with uncertain validity due to, for example, being unable to identify what factors impact those results?"

There is only one way to address the scenario you pose. Forget about statistical reliability as being anything worth considering, and then focus on validity. My thesis is that statistical reliability is irrelevant and will NOT lead you anywhere. If you have uncertain validity (meaning no validity studies or inconsistent validity results), then focus on how to get evidence of validity and forget about reliability. Consistent statistical reliability, in the absence of validity results, tells you nothing. You do not need to consider reliability in investigating validity. This is a difficult transition for psychologists to make. Reliability is not a path to understanding validity or producing better validity.

Take a look at the last paragraph of my last reply to Dean. This describes the issue, and I will cover it in another article.

"Essentially, do you see a value in non-contextual reliability if it is used as a tool to help discover validity (e.g., helping understand what the results mean or actually indicate)?"

Your question, again, is hitting at the central issue. If I can rephrase it, "Gee, I don't know what purpose this test could serve, but since it has statistical reliability, shouldn't I be able to find a purpose (validity) for it?"

Statistical reliability tells you nothing, and you would be better served to examine the item content to find a clue as to possible validity. Statistical reliability simply suggests that these items seem to hang together. Are they measuring something meaningful that might be interesting? Maybe yes, maybe no. Without validity data, statistical reliability could be an artifact.

Your two questions are highlighting a major problem for psychological science in general, and quantitative psychology in particular. The following is from "Quantitative methods in psychology: inevitable and useless" by Aaro Toomela, Institute of Psychology, Tallinn University, Tallinn, Estonia. The issue is: what drives science, questions or methodology? Starting from statistical reliability with uncertain validity, and then trying to find meaning, is to put method before question. Here's what Toomela says:

"Science begins with the question, what do I want to know? Science becomes science, however, only when this question is justified and the appropriate methodology is chosen for answering the research question. [Scientific] Research question[s] should precede the other questions; methods should be chosen according to the research question and not vice versa. [However] Modern quantitative psychology has [got it backwards and] accepted method as primary; research questions are adjusted to the methods." I'll talk more about Toomela's ideas in subsequent articles.


Norm, an excellent article. I think you ought to point out to Rachel the context in which you are referring to the validity and reliability of tests and measurements - one where psychology aspires to be treated as science.

@ Ruchira:

Good observation. I'm going to repeat this issue of psychology wanting to be recognized, by the larger scientific community, as a science on a par with all other sciences every chance I get.

@ Moin:

"On Facebook you made the following observation: "Great piece. I never knew about the torpedoes, but very familiar with Bell Labs' serendipitous discovery of the CMBR.

"I thought you hit the nail on the head with this statement: "The reliability of a testing instrument as consistency of results, independent of purpose, does not exist anywhere else in science." And your (Costa's) definition of reliability. My only comments/questions at this time are as follows:

"1) When we are referring to validity, is one referring to internal validity, external validity, or both?

"2) What about those cases where an external agent becomes the arbiter of reliability, as in intra- and inter-rater reliability? Is this really "reliability" or something else?

"Would like to hear your thoughts on this."

As with other comments from other readers on my posts, you understand what I am saying, and then pose a question that addresses a related issue that appears to be unclear or contradictory in light of my article. This means that your thought process is going where it ought to go. I am going to defer for a bit, because I have to bring out a couple of other issues before I can answer your specific questions about internal vs. external validity, and intra- and inter-rater reliability. The problem for classical test reliability theory, which I will address, is that it is impossible to distinguish, unambiguously, between reliability and validity. However, I will solve this problem with my definition of reliability.

Norman, I suppose my naive, from-a-distance question will have a vaguely anti-intellectual feel to it. It's something like 'if it ain't broke.' Validity? Here's my intuitive proxy for validity: if you examine the people a test picks out, does it roughly seem to be doing the right thing? Do people scoring high for Asperger's behave in similar ways? Do people scoring high for anti-social personality disorder behave more, well, anti-socially? Do people populating national academies of science tend to be far out in the IQ tails? Do the 'big five' personality tests pick out psychologically salient categories? By my lights, the most common tests on the market (aptitude, personality, psychological disorder, etc.) don't *seem* to be fundamentally/irreparably broken. To me, a sensible way of making tests more valid in this intuitive sense would be to try to increase their explanatory power - increase the correlation between scores and outcomes. So if for some reason it seems that Germans who score high on psychopathy don't do as much serial killing as you'd expect them to, then that concrete fact is a reason to tweak tests, etc.

@ Prasad:

Your entire last comment is a summary of where I am going. Put differently, given everything you said, what does reliability as consistency of scores without purpose add to the development of a test? NOTHING! You and others sit there scratching your heads and saying, "Gee, in my thinking it's a whole lot simpler to focus on validity. But maybe I'm being naive." Would that psychologists would scratch their heads and say the same thing.
