Psychological Science: The Theory of Test Reliability – Correcting 100-Year-Old Mistakes
By
Norman Costa, Ph.D.
The Submarine Torpedoes That Would Not Explode
Non-exploding torpedoes may seem a strange way to open a discussion about the reliability of educational and psychological tests. This true story from the U.S. Navy in World War II, however, is a perfect example of how to understand the concept of reliability. If submarine torpedoes seem an odd comparison to psychological tests, keep in mind that both are tools or instruments that are expected to perform in particular ways. We expect our tools and instruments to be reliable. So let's see what we mean by reliability.
Following the Japanese attack on Pearl Harbor in the Hawaiian Islands on December 7, 1941, the United States entered the Second World War. After the U.S. declared war on Japan, Germany declared war on the United States. With the declaration of war on Japan, the U.S. Navy's fleet of submarines was authorized to hunt and sink enemy warships and enemy ships of commerce. The submarine service immediately went on the offensive and prowled the waters for prey. The crews did a superb job of tracking down their targets and firing their torpedoes at the hulls of enemy ships with great accuracy. For all of their stealth, bravery, and accuracy, none of the torpedoes exploded.
The failure of virtually all torpedoes in the submarine fleet at the outset of the war was kept a tightly guarded secret during and after the war. Today, few people know about the non-exploding torpedoes. Almost no one is aware of the cause of the problem. It was sabotage. German spies and agents had infiltrated the factories that produced the torpedoes for the U.S. Navy, even before the U.S. entered the war. In 1967, I met Richie Harris, a former worker at one of the torpedo factories. He told me that he was a maintenance worker and was sweeping the floor of the office of one of the managers. A man who looked like any other factory worker came into the office, flashed a badge identifying himself as an FBI agent, and told him to stay in the office and not come out until told that he could do so. The FBI agent closed the door and locked Richie Harris in the office. That was the opening of a massive arrest of saboteurs in the factory.
About six months later a huge picture of an Italian battleship was hung on the wall of the factory floor. That battleship was sunk by one of their torpedoes.
I always loved that story, but now I have to get back to a discussion of reliability. If I were to ask someone to make a statement about the reliability of the torpedoes, it's a good bet that the torpedoes would be described as unreliable. It seems obvious that they were unreliable, and that with the arrest of the saboteurs the torpedoes became reliable. From the perspective of the U.S. Navy, the ship-sinking weapon system did not perform according to requirements for submarine torpedoes. They did not function according to expectation; therefore, they were unreliable. From the perspective of the German saboteurs, the torpedoes (the same torpedoes that were unreliable for the U.S. Navy) performed according to requirements. The torpedoes functioned as a defensive system to protect Axis shipping. They performed according to requirements. They functioned according to expectation; therefore, they were reliable.
The major lesson of this example is that reliability of a tool or instrument is determined by its purpose, or statement of performance requirements, or fulfilling a specific function. The reliability of the torpedoes was dependent upon the function they were to fulfill. The reliability of the torpedoes is undetermined until we know what they were supposed to do, and then observe their performance.
Pigeon Poop and the Noisy Antenna
In the 1960s, two radio astronomers, Arno Penzias and Robert Woodrow Wilson, were using the Horn Antenna at Bell Laboratories in Holmdel, New Jersey, to study radio waves that were bounced off Echo balloon satellites. It was named the Horn Antenna because it looked like a horn. In order to do their research properly, they had to get as pure a signal as possible from the radio waves bouncing off the balloons. This meant they had to eliminate, or filter out, extraneous signals coming into the Horn Antenna. Extraneous signals could come from commercial radio or television broadcasts, military communications, malfunctioning transmitters in the neighborhood, airplane traffic, faulty components in the Horn Antenna, and the like.
After a period of time, and great frustration, Penzias and Wilson were unable to get the pure signal they desired with the Horn Antenna. There was still an unacceptable amount of extraneous signal. At one point, they thought the source was the accumulation of pigeon droppings inside the horn of the antenna. After unsuccessful attempts to clean the poop and banish the pigeons, Penzias got his shotgun and dispatched the pigeons, forthwith. It wasn't the pigeon poop. There was no decrease in the remaining extraneous signal.
The failure of the Horn Antenna to provide the pure radio signal they needed was more than offset when they realized that the extraneous signal was the Cosmic Microwave Background Radiation (CMBR). They had discovered a type of signal that was predicted to be a tell-tale remnant of the formation of the very early universe, and support for the Big Bang theory. They were awarded the 1978 Nobel Prize in Physics for this incredible discovery.
Let's get back to the concept of reliability and leave pigeon poop behind. This example is especially helpful in understanding reliability. The Horn Antenna was not a reliable instrument for the purpose of studying pure radio waves bouncing off Echo balloons – or at least not as reliable as they required. The Horn Antenna at Bell Laboratories did not perform according to the stated requirements of Penzias and Wilson; therefore, it was not a reliable tool or instrument for their needs. However, when they realized they had detected CMBR, the Horn Antenna was found to be a reliable instrument for a different purpose. Reliability of a test, or instrument, or tool is dependent upon its purpose, or statement of performance requirements, or fulfilling a specific function.
There is another way of looking at the reliability of the Horn Antenna. The CMBR was considered to be noise when the purpose of the antenna was to produce pure radio waves bounced off Echo balloons. The noise had to be eliminated. When the purpose of the antenna changed to measuring CMBR, what was once noise became the signal of interest. All other signals that used to be of interest were now noise. It is meaningless to speak of the reliability of an instrument without reference to purpose, or requirements, or function.
The Theory of Test Reliability for Educational and Psychological Testing
The following is the definition of test reliability from the 1999 Standards for Educational and Psychological Testing. The first two sentences define a test, and the third and last sentence defines test reliability.
“A test, broadly defined, is a set of tasks designed to elicit or a scale to describe examinee behavior in a specified domain, or a system for collecting samples of an individual's work in a particular area. Coupled with the device is a scoring procedure that enables the examiner to quantify, evaluate, and interpret the behavior or work samples. Reliability refers to the consistency of such measurements when the testing procedure is repeated on a population of individuals or groups [emphasis mine]." p.25.
The first thing to note is that the concept of measurement is defined as using a scoring procedure. There is nothing here about a comparison to a standard, as there is in all other science. It is consistent with the view of S. S. Stevens (1946) that measurement is defined by the procedure or rule that is used to assign numeric values to events or objects. There is no understanding in any official sense in the Standards, or in psychology, as to what constitutes a rule or procedure in any scientific sense. I discussed this in my prior articles, Psychological Science: Mathematical Argument and the Quest for Scientific Respectability – Part 1, and Part 2.
The second thing to note is that the reliability of a test is simply the consistency of assigning numbers. There is nothing in this definition of test reliability that requires a statement of the purpose to which the testing activity is aimed. The Standards begins with an implication that there is some intended purpose for the test. However, after the first two sentences, the definition of test reliability requires only that a testing and scoring procedure be repeated to produce one or more additional sets of scores. If the scores are the same for each individual, then you have a reliable test. The purpose of the test is irrelevant to the definition of reliability.
The reliability of a testing instrument as consistency of results, independent of purpose, does not exist anywhere else in science. Textbooks in psychology and published papers on psychological testing, for 100 years, ALL say that reliability is completely independent of the purpose, or validity, of a test. If we extend this idea of reliability as consistency of observed results, independent of purpose, to the examples of the non-exploding torpedoes and Bell Labs' Horn Antenna, we reach some interesting, if bizarre, conclusions. According to psychological science, the weapon system of non-exploding torpedoes is a very reliable system. In fact, it is equally reliable for the U.S. Submarine Service as it is for the German saboteurs. This makes sense for psychology, but for no other science in the world.
The same interpretation applies to the Horn Antenna. The measurements provided by the antenna were consistent whether used, without success, for producing pure radio signals bounced off an echo balloon, or used, successfully, to detect CMBR. It was useless for the original intended purpose, and successful for the later purpose. According to psychological science, the instrument was reliable when it failed its intended purpose, and reliable when it was successful for another purpose.
Every time I discuss this with another scientist or engineer, the definition of reliability from psychological science makes no sense to them. It makes even less sense when I describe how psychological science justifies its view of instrument reliability. It goes like this:
- Reliability is independent of the intended purpose of the instrument producing the result.
- An instrument that produces results can be reliable, even if it doesn't function according to its intended purpose.
- Finally, and this is the humdinger, an instrument that produces results that are consistent with its intended purpose is, by definition, a reliable instrument. Reliability does not assure results as intended, but you cannot get results for an intended purpose unless the instrument is reliable. In the parlance of educational and psychological testing: Test validity – performing as per an intended purpose – is contingent upon reliability. You cannot have a test that is valid without it being reliable.
I am not going to go into the logical and philosophical crimes that are committed by the definition of reliability as consistency of observed results, without a reference to purpose. The most difficult thing for psychologists to reject is the notion that you cannot have a valid test if it is not, first, reliable. What is difficult to accept is that this statement is true ONLY BECAUSE IT HAS BEEN DEFINED AS TRUE. There is, seemingly, an initial appeal to common sense in that statement. That the appeal is more apparent than real will make the task even more difficult for psychology and psychologists. Sooner or later, psychological science must come to terms with the fact that test reliability, as described in the Standards, is an illusion. And, that is only the beginning of the problems with test reliability in psychological testing. I will discuss these later.
So, What Is Reliability?
Here we need to get our definitions in order if psychological science wants to deal in scientific concepts that are consistent with the rest of the scientific world. There are two principal concepts: validity, and reliability. In addition, there are related ideas of accuracy, precision, and error, but these will fall into place very nicely once we've understood validity and reliability. The Standards understands validity very, very well. The following is from the first three paragraphs of the chapter on validity:
“Validity refers to the degree to which evidence and theory support the interpretations of test scores for proposed uses of tests. Validity is, therefore, the most fundamental consideration in developing and evaluating tests. The process of validation involves accumulating evidence to provide a sound scientific basis for the proposed score interpretations. It is the interpretations of test scores required by proposed uses that are evaluated, not the test itself [my emphasis]. When test scores are used or interpreted in more than one way, each intended interpretation must be validated.
“Validation logically begins with an explicit statement of the proposed interpretation of test scores, along with a rationale for the relevance of the interpretation to the proposed use. The proposed interpretation refers to the construct the test is intended to measure....
“....To support test development, the proposed interpretation is elaborated by describing its scope and extent and by delineating the aspects of the construct that are to be represented. The detailed description provides a conceptual framework for the test, delineating the knowledge, skills, abilities, processes, or characteristics to be assessed....” page 9.
In short, validity is the demonstration that a measurement instrument was used, successfully, for an intended purpose. At the end of the chapter on validity, there are 24 distinct requirements, called Standards, for demonstrating successful use of a test for an intended purpose. They are explicit and demanding, as well they should be. A failure to document successful use for an intended purpose means a test cannot be considered valid.
Now, what about reliability? As I described in my prior articles on this subject, the pioneers of scientific psychology learned from their colleagues that a test instrument, or any tool, is reliable if it produces the same results every time. That is what they heard, and it was translated into consistency of obtained results, independent of purpose. What those early psychologists did not hear, or did not understand, was something that was implicit in the understanding of reliability among other scientists. What those scientists were saying was that an instrument had to produce the same VALID results every time it is used for its intended purpose. Reliability is not independent of validity. A valid instrument must produce consistent results when used for its intended purpose under identical circumstances. Reliability is the documentation of repeated, successful use for an intended purpose, AND the expectation of the same in the future. Validity is successful use for an intended purpose. Reliability is the expectation of validity based upon a history of repeated use. Reliability is dependent upon validity. You cannot have a test that is reliable without it, first, being valid. For 100 years, psychological science got it backwards.
Let's take a look at some examples of definitions of reliability in the more successful sciences. First, other sciences do not use the term 'validity,' or not as frequently as psychology. Instead, they refer directly to the purpose of an instrument, or a system, and the demonstration of success. This is the same as validity for psychological science. Other sciences describe purpose in terms of performing to stated requirements, or usefulness in fulfilling a function, or achieving a desired goal. Here are a few typical definitions from a variety of disciplines:
Discipline: Engineering
Definition of Reliability: The expectation, from history, that an item (component, assembly, plant, or process) performed its intended function, without failure, for the required time duration when installed and operated correctly in a specified environment.
How Reliability is Measured: Compute the probability that a device, system, or process performed its prescribed duty without failure for a given time when operated correctly in a specified environment.
Discipline: Life science
Definition of Reliability: The expectation, from history, of the survival of organisms.
How Reliability is Measured: Compute the probability that an organism survived after a given interval of time, or survived to a given age.
Discipline: Military
Definition of Reliability: The expectation of failure-free performance under stated conditions. The expectation that an item can perform its intended function for a specified interval under stated conditions.
How Reliability is Measured: Compute the duration or probability that an item performed its prescribed duty without failure for a given time when operated correctly in a specified environment.
Discipline: Psychology, OLD 1999 Standards
Definition of Reliability: Consistency of observed results.
How Reliability is Measured: Compute correlation coefficient on repeated measures from alternate test forms.
Discipline: Psychology, NEW (Norman Costa)
Definition of Reliability: The expectation of validity, successful use for an intended purpose, from the documented history of repeated use.
How Reliability is Measured: Compute the probabilities of ranges of validity coefficients that an item or test produced when used for an intended purpose.
Discipline: Mathematical Reliability Theory
Definition of Reliability: “Generally speaking, it is a body of ideas, mathematical models, and methods directed toward the solution of problems in predicting, estimating, or optimizing the probability of survival, mean life, or, more generally, life distribution of components or systems: other problems considered in reliability theory are those involving the probability of proper functioning of the system at either a specified or an arbitrary time, or the proportion of time the system is functioning properly. [I]n a large class of reliability situations, maintenance, such as replacement, repair, or inspection, may be performed, so that the solution of the reliability problem may influence decisions concerning maintenance policies to be followed. ...[R]eliability theory is mainly concerned with probabilities, mean values, probability distributions...[.]”
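To make the contrast in the definitions above concrete, here is a minimal sketch of the two computations. The scores and trial outcomes are made up for illustration: the "old" psychological reliability is a correlation between scores from two parallel test forms (purpose never enters), while an engineering-style reliability is the proportion of trials in which the instrument performed its intended function without failure.

```python
import statistics

def pearson_r(x, y):
    """Pearson correlation coefficient between two lists of scores."""
    mx, my = statistics.mean(x), statistics.mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

# "Old" psychological reliability: correlate the same examinees' scores
# on two parallel test forms. Consistency alone makes the test "reliable".
form_a = [10, 12, 15, 11, 14, 9, 13]
form_b = [11, 12, 14, 10, 15, 9, 12]
psych_reliability = pearson_r(form_a, form_b)

# Engineering-style reliability: fraction of trials in which the device
# performed its intended function without failure.
trials = [True, True, False, True, True, True, True, True, False, True]
eng_reliability = sum(trials) / len(trials)

print(round(psych_reliability, 2))  # → 0.91 (high score consistency)
print(eng_reliability)              # → 0.8 (8 of 10 successful trials)
```

Note how the second computation is meaningless without first saying what counts as success, which is exactly the dependence on purpose that the article argues the first computation lacks.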
Ideas of validity and reliability are essentially the same for all other sciences. There are subtle differences in usage, however, compared to psychology. As mentioned above, most other disciplines refer directly to success for an intended purpose, without using the term 'validity.' Psychological science tends to think of validity and reliability as continua, ranging from perfect validity or reliability to none at all. Other disciplines see reliability, as an example, in binary terms. A tool that has a history of performing to requirements, and is expected to do the same in the future, is reliable. Anything short of that is unreliable.
What's Next?
In my next articles on the theory of test reliability, I will discuss:
- The terms accuracy, precision, and errors of measurement and how they fit into the ideas of validity and reliability;
- How the 100 year old mistakes were made, who made them, and why they persist to this day;
- That the present definition of reliability in the Standards makes it impossible to distinguish, unambiguously, between reliability and validity;
- That numerous anomalies arise from the present definition of reliability;
- The final, absolute, positive proof that reliability in the Standards is an illusion;
- The implications of throwing out the present definition and adopting the new definition that I propose;
- And, so much more.
Thank you for reading. Please contribute to the comments. Bouquets and brickbats welcome.
I guess I've been sufficiently indoctrinated by the psychological perspective that it makes sense to me to differentiate between validity and reliability, giving greater weight to validity. Either that, or I'm not understanding your argument for change, because it seems almost tautological to me that, if a test isn't valid for a given purpose, then it doesn't matter if it's reliable.
Posted by: Rachel Heslin | January 17, 2012 at 01:38 AM
@ Rachel:
I think you are about 90 percent there. Look at it this way. There is only one idea, and that is VALIDITY - the demonstration that a test has been used successfully for an intended purpose. That's it - end of story. To the extent that there is enough documented history of a test being valid so that you can expect it to perform as a valid test the next time you use it, then it is reliable. Reliability is the expectation of validity.
"...[I]f a test isn't valid for a given purpose, then it doesn't matter if it's reliable." Therein lies the rub. Not only does it not matter, it was a logical and philosophical (not to mention scientific) mistake to think a test could be reliable but not valid. The early experimental and testing psychologists did not understand what their colleagues were saying. The other scientists, in my opinion, did not expound further because they assumed that psychologists knew that they were talking about instruments that were already valid. To them, reliability was consistency of being valid, not consistency of scores without purpose. Psychology just didn't understand this.
There is only validity.
Validity is the only thing.
There is nothing but validity.
Posted by: Norman Costa | January 17, 2012 at 02:17 AM
Ordinary language would seem to support the psychologists. As always, OED:
"reliable, adj. and n. Statistics.... In later use: (of a method or technique of measurement) that yields consistent results when repeated under identical conditions."
Of course, the question then becomes whether ordinary language, reliably construed as we may assume it is here in OED, is a valid measure of the scientific merit of the concepts it denotes. I read in your account, Norm, a confusion of the meaning of the statistical use of the word "reliable" with its more general meanings:
The U.S. torpedoes were unreliable in this general sense. They produced unhappy results, they performed badly. But the consistency of results points out their nearly perfect reliability as an indication that something was amiss.
I don't read the Standards excerpts as requiring independence of intended purpose. From the definition, the purpose of the scoring procedure is to "enable the examiner to quantify, evaluate, and interpret the behavior or work samples." Consistent results suggest the procedure is reliable. Likewise, the converse. Validity addresses whether the test results are well suited to the intended purpose. Are these scores assigned to the examinee's behavior a sound basis for interpreting that behavior? That seems to me to be a question largely distinct from the question of reliability.
Posted by: Dean C. Rowan | January 17, 2012 at 03:34 PM
@ Dean:
Thanks for taking the time to read and do a thoughtful analysis. You are hitting on all the substantive issues.
I just got home, and put away the groceries. Your comment deserves a thoughtful comment in return. See ya' later, after supper.
Posted by: Norman Costa | January 17, 2012 at 05:27 PM
Great post. From my wife's work with questionnaires about people's responses to work stress, I suspect that some of these are designed to be "reliable" in coming up with expected results for writing of research papers and yielding nice Excel graphs.
Posted by: Shimon Ein-Gal | January 18, 2012 at 01:21 AM
@ Dean:
"Ordinary language would seem to support the psychologists [regarding the definition of reliability as consistency of observed results.]"
Yes, that is true. The interesting thing is why it supports the view of psychology that is embedded in the Standards. That is because it comes directly from the development of statistics, from Galton to the pioneers of psychological statistics (Pearson, Spearman, Richardson, Thorndike, Thurstone, Gulliksen, Guilford, Guttman, and others) to the present time. That is why it reads "reliable, adj. and n. Statistics.... In later use: (of a method or technique of measurement) that yields consistent results when repeated under identical conditions" [emphasis mine]. This just goes to show how influential the development of statistics for the social sciences has been.
The Standards, and every textbook on psychological testing in the world, are very clear about reliability being consistency of observed results, completely independent of purpose. A measurement instrument can be reliable, even if it is not valid. A given measurement instrument may be valid for several intended purposes, and not valid for many others, but reliability is a constant once it is determined. Keep in mind that all of this is true because it is defined as true. However, no physicist, or chemist, or electrical engineer, or weapons developer, or radar operator would agree with this.
When we examine the functioning of the torpedoes we find that a torpedo which explodes on contact with a ship's hull is, as an instrument of war, a valid instrument. If this is demonstrated once, we have established its validity: successful use for an intended purpose. However, validity is not sufficient for adoption as a weapon system by the U.S. Submarine Service. It must be reliable, that is, perform successfully every time it is used. So the U.S. Navy issues a performance requirement specifying the consistent results that will count as evidence of reliability. For example, it might state that successful detonation must be expected 95 percent of the time a torpedo strikes a ship's hull.
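A sketch of how such an acceptance requirement might be checked in practice. The trial counts and the 95 percent threshold below are illustrative, not historical; the point is that the observed success rate alone can look acceptable while a conservative lower bound on the true rate does not.

```python
# Hypothetical acceptance trial: 100 test firings, 97 detonations,
# against an (assumed) requirement of detonation on >= 95% of strikes.
firings = 100
detonations = 97

observed_rate = detonations / firings

# Normal-approximation one-sided 95% lower confidence bound on the
# true detonation rate, guarding against a lucky sample.
p = observed_rate
se = (p * (1 - p) / firings) ** 0.5
lower_bound = p - 1.645 * se

meets_requirement = lower_bound >= 0.95
print(round(observed_rate, 2))   # → 0.97
print(round(lower_bound, 3))     # → 0.942
print(meets_requirement)         # → False
```

Here the point estimate (97%) clears the bar, but the lower confidence bound (about 94.2%) does not, so a cautious evaluator would demand more trials before declaring the weapon reliable.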
Psychological science separates the concepts of validity and reliability, explicitly, in the abstract and in everyday use. Other sciences tend to put them together, in most usages, and call the whole thing reliable. Thus, reliability means consistent performance of a valid tool. This is why the early psychologists understood reliability in other sciences to be consistency of results. This is one of the reasons why the concept of reliability was able to develop independently of validity. Other reasons include the development of test theory based upon the concept of true score, and the concept of parallel tests. Finally, the use of the correlation coefficient to compute reliability sealed the deal. All of these supported the idea of reliability as consistency of results independent of validity.
However, the foundation of test theory (classical test theory) was built upon two tautologies, true score and parallel tests. All of reliability theory was constructed upon a tautological foundation that was devised to develop reliability as the early psychologists understood it. The fact that the correlation coefficient could compute an index of reliability was the icing on the cake of a perpetual delusion. Imagine, for a moment, how things would have been different if they had understood, clearly, what other scientists were telling them about RELIABILITY AS CONSISTENCY OF RESULTS FOR A VALID INSTRUMENT. I am going to talk about the foundation of test theory in another article.
The U.S. Navy had a valid system (torpedoes with exploding war heads) that was unreliable. The German saboteurs had a valid system (torpedoes with inoperable detonators) that was very reliable. Reliability is the expectation of validity. The system has to be valid before we can make an assessment of reliability.
Your last paragraph is addressing a fundamental issue. Your last sentence leads us to the real question. "That [a test is valid] seems to me to be a question largely distinct from the question of [psychology's] reliability [as consistency of observed data]." Not only is it distinct, it is distinct because reliability in psychological test theory doesn't exist. It is an illusion. THERE IS ONLY VALIDITY! Reliability, for the rest of science, is observing repeated demonstrations of validity.
Posted by: Norman Costa | January 18, 2012 at 02:28 PM
Norm's latest comment prompted me to explore further. I quickly found a recent article in the criminology literature that addresses reliability and validity in the context of prisoner self-reporting. It largely echoes what you describe in the psychology context. I was pleased to see that the first reference in the discussion of reliability was to Babbie's widely used Practice of Social Research, which I enjoyed reading back in the late '80s. I've turned to a later edition courtesy of Google Books, on page 151 of which begins a pertinent discussion. Before addressing it, however, I want to state what probably goes without saying. Reliability and validity are analytical constructs for data collection and study. They can only be true in accordance with what might seem to be arbitrary determinations of what passes for true, such as a performance standard set by the military. If officials raise the standard for successful torpedoes to 97%, then it is possible that they will have deemed unsuccessful some number of formerly successful torpedoes. There's nothing wrong with this sort of tautology.
Might a reason the other sciences combine the concepts be that they are engaged in enterprises that have an obvious and direct effect on the material world? Two of your examples suggest as much. Engineering and the military use tools to alter the material configuration of the world. Either the alteration succeeds or it doesn't. There's little need to query explicitly the validity of an exploding torpedo as a weapon. (As I described above, however, the measure of success is less obvious, and so it demands the mediation of a proxy standard.) Again, I think the torpedo analogy is also clouding the illustration. The two distinct meanings of "reliable" are in play. In psychology and the social sciences, validity and reliability are conventional predicates, parameters developed in disciplines to characterize grounds for authority of procedures designed to represent states of affairs in the world that don't have readily apparent material manifestations. A torpedo and a test are both tools, but the work they do is distinct, and amenable in different degrees to measures of success. A torpedo succeeds when it strikes and destroys its target; a test succeeds when, after careful analysis, the examiner can accurately predict such-and-such or diagnose such-and-such respecting the examinee. Consequently, assuming reliability of the data gleaned from the test, the examiner must embark further on the additional task (analytically speaking) of making sure the reliable data may validly be applied to the prediction or the diagnosis. The examinee in effect asks, "Will I strike (or have I struck) the target?"
The passages beginning on p. 151 of Babbie (unfortunately, in the middle of his treatment of "reliability") are admirably straightforward discussions of these concepts. It's hard for me to see not that reliability and validity are ultimately two sides of a coin--a notion I get--but that it is fatally wrong to distinguish them operationally and analytically, if only to assure that problems presented by them in study design are avoided. Look at Babbie's elegant target(!) illustration, fig. 5-2 on p. 155. Keep in mind that the analogy, like all analogies, only works so far. He is talking about collecting data.
Posted by: Dean C. Rowan | January 18, 2012 at 05:18 PM
@ Dean:
Originally, I was going to produce a very large work for a technical audience and technical journal. I decided, and I am glad I did, to publish it in small installments in a less technical format, and accessible to anyone who has an interest in the subject. I can't tell you how flattered and appreciative I am for your reading and commentary. Yours and the comments of others are helping me develop a better technical presentation and work out the kinks in my arguments.
"It's hard for me to see[,] not that reliability and validity are ultimately two sides of a coin--a notion I get--but that it is fatally wrong to distinguish them operationally and analytically [as is done in the Standards,] if only to assure that problems presented by them in study design are avoided."
Again, you are zeroing in on the crux of the problem. In a later article, I will deal with the flaw for psychology, and show that it is fatal. Here is a preview: Suppose you have a psychological test for a mental disorder, and it has been shown to be valid - it is successful in diagnosing the disorder and classifying people for specific therapies. It has a validity coefficient of r = 0.53. Of course, this valid test is not perfect (r = 1.00), and there will be errors in diagnosis and in treatment classification. There will be false positives and false negatives. So, we want to improve the validity and reduce the false positives and false negatives.
In psychological test theory you can improve validity either directly, or indirectly by improving reliability. In test theory, validity is contingent upon reliability, so when you improve reliability the validity coefficient improves. I will show two things: 1. When you have done everything that can be done to improve validity directly, without addressing reliability, there is nothing more to be gained indirectly by improving reliability. 2. It is impossible to improve only a reliability coefficient, directly or indirectly, without making any reference to validity. It cannot be done. For example, if you have a test of unknown validity but known reliability, you cannot improve reliability (consistency of scores) without addressing validity. If I can demonstrate this, then it supports the idea that there is only validity, and that the concept of reliability as consistency of scores without reference to purpose is an illusion. It doesn't exist.
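The dependence of validity on reliability described above can be sketched numerically. The formulas below are the standard ones from classical test theory (the attenuation ceiling r_xy <= sqrt(r_xx * r_yy) and the Spearman-Brown prophecy formula), not formulas given in this article, and the reliability figure of 0.70 is hypothetical, chosen only for illustration:

```python
import math

def spearman_brown(rel, k):
    """Projected reliability when a test is lengthened k-fold
    (Spearman-Brown prophecy formula): k*r / (1 + (k-1)*r)."""
    return k * rel / (1 + (k - 1) * rel)

def validity_ceiling(rel_test, rel_criterion=1.0):
    """Classical test theory upper bound on the validity coefficient:
    r_xy <= sqrt(r_xx * r_yy)."""
    return math.sqrt(rel_test * rel_criterion)

# Hypothetical test with reliability 0.70, criterion reliability 1.0.
rel = 0.70
print(round(validity_ceiling(rel), 3))                      # ≈ 0.837
print(round(spearman_brown(rel, 2), 3))                     # ≈ 0.824
print(round(validity_ceiling(spearman_brown(rel, 2)), 3))   # ≈ 0.907
```

This is the sense in which test theory says improving reliability "improves" validity: it raises the ceiling under which the validity coefficient can sit, which is exactly the indirect route the argument above claims is empty once direct improvements to validity are exhausted.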
Posted by: Norman Costa | January 18, 2012 at 06:32 PM
How would you address those situations which produce extremely consistent patterns of results (i.e., statistical reliability) with uncertain validity due to, for example, being unable to identify what factors impact those results?
I don't think I'm framing my question well. Essentially, do you see a value in non-contextual reliability if it is used as a tool to help discover validity (e.g., helping understand what the results mean or actually indicate)?
Posted by: Rachel Heslin | January 18, 2012 at 11:45 PM
@ Rachel:
Thanks for reading and formulating these questions.
"How would you address those situations which produce extremely consistent patterns of results (i.e., statistical reliability) with uncertain validity due to, for example, being unable to identify what factors impact those results?"
There is only one way to address the scenario you pose. Forget about statistical reliability as being anything worth considering, and focus on validity. My thesis is that statistical reliability is irrelevant and will NOT lead you anywhere. If you have uncertain validity (meaning no validity studies, or inconsistent validity results), then focus on how to get evidence of validity and forget about reliability. Consistent statistical reliability, in the absence of validity results, tells you nothing. You do not need to consider reliability in investigating validity. This is a difficult transition for psychologists to make. Reliability is not a path to understanding validity or producing better validity.
Take a look at the last paragraph of my last reply to Dean. This describes the issue, and I will cover it in another article.
"Essentially, do you see a value in non-contextual reliability if it is used as a tool to help discover validity (e.g., helping understand what the results mean or actually indicate)?"
Your question, again, is hitting at the central issue. If I can rephrase it, "Gee, I don't know what purpose this test could serve, but since it has statistical reliability, shouldn't I be able to find a purpose (validity) for it?"
Statistical reliability tells you nothing, and you would be better served to examine the item content for a clue as to possible validity. Statistical reliability simply suggests that the items seem to hang together. Are they measuring something meaningful that might be interesting? Maybe yes, maybe no. Without validity data, statistical reliability could be an artifact.
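To make concrete what "hanging together" statistically means, here is a minimal sketch of Cronbach's alpha, the most common index of internal-consistency reliability. The code and the examinee responses are my own illustration, not from the article; a high alpha says only that the items covary, not that they measure anything valid:

```python
def cronbach_alpha(items):
    """Cronbach's alpha from per-item score lists.
    items: one inner list per test item, each of equal length
    (one entry per examinee)."""
    k = len(items)                      # number of items
    n = len(items[0])                   # number of examinees

    def variance(xs):                   # population variance
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / len(xs)

    sum_item_vars = sum(variance(it) for it in items)
    totals = [sum(items[i][j] for i in range(k)) for j in range(n)]
    total_var = variance(totals)
    return (k / (k - 1)) * (1 - sum_item_vars / total_var)

# Hypothetical responses from five examinees on three items that
# covary strongly. Alpha is high, yet nothing here tells us whether
# the items measure anything meaningful.
items = [
    [1, 2, 3, 4, 5],
    [2, 2, 3, 4, 5],
    [1, 3, 3, 5, 5],
]
print(round(cronbach_alpha(items), 2))  # → 0.97
```

The made-up items could just as well be three rephrasings of the same meaningless question; the coefficient would be equally high, which is the point about artifact made above.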
Your two questions highlight a major problem for psychological science in general, and quantitative psychology in particular. The following is from "Quantitative methods in psychology: inevitable and useless" by Aaro Toomela, Institute of Psychology, Tallinn University, Tallinn, Estonia. The issue is: what drives science, questions or methodology? Starting from statistical reliability with uncertain validity, and then trying to find meaning, is to put method before question. Here's what Toomela says:
"Science begins with the question, what do I want to know? Science becomes science, however, only when this question is justified and the appropriate methodology is chosen for answering the research question. [Scientific] Research question[s] should precede the other questions; methods should be chosen according to the research question and not vice versa. [However] Modern quantitative psychology has [got it backwards and] accepted method as primary; research questions are adjusted to the methods." I'll talk more about Toomela's ideas in subsequent articles.
Posted by: Norman Costa | January 19, 2012 at 12:56 AM
Norm, an excellent article. I think you ought to point out to Rachel the context in which you are referring to the validity and reliability of tests and measurements - one where psychology aspires to be treated as a science.
Posted by: Ruchira | January 19, 2012 at 07:06 PM
@ Ruchira:
Good observation. I'm going to repeat, every chance I get, this issue of psychology wanting to be recognized by the larger scientific community as a science on a par with all other sciences.
Posted by: Norman Costa | January 19, 2012 at 09:08 PM
@ Moin:
"On Facebook you made the following observation: "Great piece. I never knew about the torpedoes, but very familiar with Bell Lab's serendipitous discovery of the CMBR.
"I thought you hit the nail on the head with this statement: "The reliability of a testing instrument as consistency of results, independent of purpose, does not exist anywhere else in science." And your (Costa's) definition of reliability. My only comments/questions at this time are as follows:
"1) When we are referring to validity, is one referring to internal or external validity, or both?
"2) What about those cases where an external agent becomes the arbiter of reliability, as in intra- and inter-rater reliability. Is this really "reliability" or something else?
"Would like to hear your thoughts on this."
As with other comments from other readers on my posts, you understand what I am saying, and then pose a question that addresses a related issue that appears to be unclear or contradictory, in light of my article. This means that your thought process is going where it ought to go. I am going to demur for a bit, because I have to bring out a couple of other issues before I can answer your specific questions about internal vs. external validity, and intra- and inter-rater reliability. The problem for classical test reliability theory, which I will address, is that it is impossible to distinguish, unambiguously, between reliability and validity. However, I will solve this problem with my definition of reliability.
Posted by: Norman Costa | January 20, 2012 at 09:48 PM
Norman, I suppose my naive, from-a-distance question will have a vaguely anti-intellectual feel to it. It's something like 'if it ain't broke.' Validity? Here's my intuitive proxy for validity: if you examine the people a test picks out, does it roughly seem to be doing the right thing? Do people scoring high for Asperger's behave in similar ways? Do people scoring high for anti-social personality disorder behave more, well, anti-socially? Do people populating national academies of science tend to be far out in the IQ tails? Do the 'big five' personality tests pick out psychologically salient categories? By my lights, the most common tests on the market (aptitude, personality, psychological disorder, etc.) don't *seem* to be fundamentally/irreparably broken. To me, a sensible way of making tests more valid in this intuitive sense would be to try to increase their explanatory power - increase the correlation between scores and outcomes. So if for some reason it seems that Germans who score high on psychopathy don't do as much serial killing as you'd expect them to, then that concrete fact is a reason to tweak tests, etc.
Posted by: prasad | January 21, 2012 at 11:18 AM
@ Prasad:
Your entire last comment is a summary of where I am going. Put differently, given everything you said, what does reliability as consistency of scores without purpose add to the development of a test? NOTHING! You and others sit there scratching your heads and saying, "Gee, in my thinking it's a whole lot simpler to focus on validity. But maybe I'm being naive." Would that psychologists would scratch their heads and say the same thing.
Posted by: Norman Costa | January 24, 2012 at 04:36 AM