Psychological Science: The Theory of Test Reliability – Correcting 100-Year-Old Mistakes – Part 2 (Norman Costa)
By Norman Costa, PhD
What is Real, and What is Not? – Zeno's Paradoxes
The Greek philosopher Aristotle (384 BCE, Stageira, Chalcidice – 322 BCE, Euboea), in his Physics, describes a number of paradoxes attributed to the Greek philosopher Zeno of Elea (ca. 490–430 BCE). Zeno was trying to illustrate the idea that everything we see and experience, however varied it may appear, is unitary and unchanging. Experiences and perceptions of change, motion, and time are really illusions. His paradoxes have fascinated philosophers, scientists, mathematicians, and theologians through the centuries to this very day. Here are three of his more popular paradoxes:
Achilles pursuing the tortoise paradox
“In a race, the quickest runner can never overtake the slowest, since the pursuer must first reach the point whence the pursued started, so that the slower must always hold a lead.” --Aristotle, Physics VI:9, 239b10
A fast runner gives a challenger, who is a slow runner, a head start of 100 meters. The fast runner expects to catch up to and then overtake the slow runner on the way to a winning finish. According to Zeno this is an impossibility. The swift of foot must first run to the point at which the slow of foot started the race – 100 meters down the course. By the time the swift runner arrives there, the slow runner has moved on to a new spot, so the swift runner must next advance to that spot. By the time he arrives, the slow runner has again moved a little farther ahead, and the swift runner must advance once more. The swift runner keeps repeating this sequence of advancing to the spot previously occupied by the slow runner, as both head for the finish line. According to Zeno, there is an infinite progression of the swift runner approaching the slow runner, catching up to his prior spot, and then going on to the next spot the slow runner has since occupied farther down the course. Since the fast runner must advance to an infinite number of spots previously occupied by the slow runner, it stands to reason that the fast runner will never overtake the slow runner.
The dichotomy paradox
“That which is in locomotion must arrive at the half-way stage before it arrives at the goal.” --Aristotle, Physics VI:9, 239b10
I would like to walk across a room, from one side to the other. As I begin, I must first arrive halfway across the room. From that point I must reach the next halfway point to the other side, and so on for an infinite number of halfway points. Since I must traverse an infinite number of halfway points, it stands to reason that I will never reach the other side of the room.
The arrow paradox
“If everything, when it occupies an equal space is at rest, and if that which is in locomotion is always occupying such a space at any moment, the flying arrow is therefore motionless.”
In order for motion to occur, an object must occupy a different space at different times. However, at any moment in time, an object occupies a specific space and is at rest at that moment. Since for every one of an infinite number of moments an object is always at rest, it stands to reason that motion is impossible.
Zeno's conclusions are very clear. A swift runner, in pursuit, will never overtake a slower runner. I will never reach the other side of the room. An arrow in flight will never hit its target, because it never moves. It is curious why so many people have been fascinated with the paradoxes and why they have attempted to explain or resolve them. After all, I could just get up and walk in the direction of the opposite wall, and it is a certainty that I will reach it. I won't get into why Archimedes, Thomas Aquinas, Galileo, and Bertrand Russell all tried to solve Zeno's paradoxes, except to say that they were trying to explain the nature of reality, vis-à-vis the paradoxes.
What does this have to do with the theory of test reliability? The issue for test reliability is, in fact, the nature of reality – at least as far as test theory is concerned. In Zeno's three paradoxes we find the concept of infinity: an approaching infinity like the runner in pursuit, and walking from one side of the room to the other; and an infinity of unlimited partitions of time for the duration of the arrow's flight. For our purposes, the reason why infinity does not hinder the race, touching the opposite wall, or the arrow in flight is because infinity is a concept, not a reality. Infinity is a thought, an idea, an abstraction, a concept. Infinity does not exist. It is not real. However, the concept of infinity can be enormously useful in mathematics, physics, philosophy, theology, and playing games of paradox. Thinking of an abstraction as a reality is a problem for more than one issue in psychological test theory. It is very hard not to imbue abstractions with reality, especially when the abstraction is very useful and 'feels' so obviously concrete.
What is Real, and What is Not? – The Tautology
A tautology is a self-contained logical system. In a tautology, every statement is true. Why are all statements true? They are true not in a philosophical, scientific, or moral sense. They are true only in an abstract, symbolic sense. They are true because the tautology says they are true. A tautology is also known as circular reasoning: one thing is defined in terms of another, and the other thing is defined in terms of the first. Here is a very good and very simple example.
Who made you?
God made me.
Who is God?
God is the infinitely Supreme Being Who made all things.
What makes this a tautology? God is defined in terms of creation, and creation is defined in terms of God. God is the one who makes things, and I am one of those things that was made. Or, I was made by God, and God is the one who makes things. Within the circular, self-contained logic of the tautology, all statements are true, always. In this case, there are two statements: 1. God made me. 2. God is the infinitely Supreme Being Who made all things.
While it may not be apparent, tautologies can be useful, especially in pure mathematics. Sometimes mathematics derived from tautologies can help scientists solve problems in their areas of investigation. The temptation, however, is to see the mathematics or the tautology as accurately depicting the real world being studied in the science. This happens all the time. Occasionally, the tautologies and their derived mathematics are applied inappropriately. The danger is that the results they produce can look so good and so right that a careful critique of the application is discarded altogether. This is what happened to psychological test theory, as I will discuss below.
The Nobel physicist Richard Feynman talked about this in the second lecture of his series, The Character of Physical Law (Messenger Lectures at Cornell University, published 1965). The second lecture, The Relation of Mathematics to Physics, is as relevant for psychological science as it is for physics. Feynman could not emphasize enough that mathematics is not physics, and physics is not mathematics. This is especially a problem, he said, for those who come over to theoretical physics from mathematics. There are more than a few physicists and mathematicians who entertain the idea that the universe IS a differential equation. In psychological science we have the same problem among those in quantitative psychology. I will talk more about this later in this article.
The Theory of Test Reliability – A Foundation of Sand
Test theory is a mathematical rationale for describing the behavior of tests, their items, and the scores that are generated from them. It has been this way since the foundations of test theory were laid down by the work of Charles Spearman, published from 1904 through 1913. Of course, Spearman's ideas were influenced by those who came before him, as well as by his contemporaries. Separate from the theory's mathematical rationale are the following: how tests are used in the real world, why they are used, what testing applications will be developed, and the choice of statistics that are used for interpretation and decision making. Test theory is an abstraction before it is anything else. It is developed as a quantitative theory in mathematical terms before applications are found, before tests are developed, and before decisions are made from test results. In science, this is business as usual and there is nothing remarkable about it.
Psychological test theory comes in two flavors: Classical Test Theory (CTT) and Item Response Theory (IRT). For the most part they are complementary approaches to psychological testing, but they do have some substantial differences. I am pointing out the two approaches in order to say that I am focusing on CTT, and will not discuss IRT in this series of articles on reliability theory. IRT will, eventually, be an important part of the discussion on reliability, but that will have to wait.
The Standards for Educational and Psychological Testing (1999) is clear about the preeminent position of validity in the use of tests. “Validity refers to the degree to which evidence and theory support the interpretations of test scores for proposed uses of tests. Validity is, therefore, the most fundamental [emphasis mine] consideration in developing and evaluating tests.” However, Classical Test Theory (CTT) is largely a theory of reliability. In fact, reliability is the core of the theory. We will see that reliability is defined, within the mathematical rationale of test theory, as the foundation of validity. The things that influence the reliability of a test will be seen to influence, in turn, the test's validity. So when we talk about CTT, we are talking about a theory of test reliability.
CTT begins and ends with the understanding that reliability in all other successful sciences is defined by consistency of observed results, without reference to purpose. CTT is the result of an attempt to emulate the ideas and processes of science in other fields. To this end, two fundamental concepts were invented by psychological science: parallel tests and true score. Each concept will define test reliability and, together, will be considered two sides of the same reliability coin.
Parallel Tests
Consistency of test results could be measured by having someone retake a test. It seemed a matter of common sense that if the second administration yielded the same score as the first administration, for the same individual, then the test produced consistent results. It was, therefore, a reliable test. Repeating test administration for a group of test takers seemed a reasonable way to generalize the reliability of a test to more than one individual. Experience showed that with the best of tests and the best of test takers, not everyone would get the exact same score from the first to the second administration. Even if an individual got the same test score on both administrations, there might be differences in which items they answered correctly. Also, the average score for a group of individuals might differ from the first to the second administration. If these differences were few, or not large, then one could conclude that the test was more or less reliable. If differences between administrations were large and variable, then the test might not be reliable.
Spearman proposed using one of the recently developed correlation formulas to measure the relationship between scores on the first and second administrations of a test, for the same group of test takers. A correlation formula produced a single statistic, r, called a correlation coefficient. The absolute value of r ranged from 0.00 to 1.00. If r = 0.00, there was no relationship. If r = 1.00, there was a perfect relationship. A correlation coefficient of r = 0.93 is a high value, and indicates a strong relationship between test scores from the first to the second administration. A correlation coefficient of r = 0.26 is a low value, and there is little to write home about in the relationship of scores from the first administration to the second. A high value of r means a strong relationship between scores on both test administrations, and is interpreted as high consistency of results, or high reliability. A low value of r means low reliability. A high value of r means that individuals who ranked higher than others on the first test tended to rank higher than those same others on the second test. Likewise, someone who scored lower than others on the first test would tend to score lower on the second test.
Test reliability was consistency of scores. Consistency was defined in terms of the rank order of test takers from one administration to the next. Consistency of rank order from one test to another was computed as a single value from a correlation formula. Therefore, test reliability was defined as the correlation of test scores within the same group from one administration to the next. No account is taken of any consistent change in the average of group scores from one test to the other. No account is taken of the purpose, or validity, of the test.
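To make this concrete, here is a minimal sketch in Python (the ten scores are invented for illustration, and the numpy library is assumed to be available). It shows that the correlation coefficient registers only the ordering of the test takers: a uniform 12-point gain from the first administration to the second leaves r at a perfect 1.00, even though not a single score is repeated.

    import numpy as np

    # Hypothetical scores for the same ten test takers on two administrations.
    # Every score rises by exactly 12 points on the second administration.
    first = np.array([55, 60, 62, 68, 70, 74, 79, 83, 88, 95], dtype=float)
    second = first + 12

    r = np.corrcoef(first, second)[0, 1]
    print(f"Pearson r between administrations: {r:.2f}")                         # 1.00
    print(f"Change in group average: {second.mean() - first.mean():.1f} points")  # 12.0

    # r = 1.00, "perfectly reliable" by this definition, even though no one
    # received the same score twice and the group average moved by 12 points.

By this definition the test is perfectly reliable, yet nothing in the coefficient records the 12-point shift in the group average.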
There were some practical problems that arose. Using the exact same test from first to second administration produced correlation coefficients that seemed too high. It was reasoned that the apparent high consistency could be a function of remembering from one administration to the next. In theory, the different test administrations had to be independent of each other. Remembering what you answered the first time made the second administration dependent upon the first, and made the correlation coefficient artificially high. The solution to the problem was to use different but equivalent forms of the same test for each administration. A test of eighth grade arithmetic could be produced, for example, in several equivalent forms. Use test form A on the first administration, and test form B on the second. This would reduce, it was thought, the familiarity or memory factor that would influence the results on the second test. The process came to be known as parallel tests. Parallel tests are the same as equivalent forms of the same test.
Reliability is defined in terms of correlation coefficients between parallel tests. Another seemingly important use has been found for one of the greatest statistical inventions in all of social science: the correlation coefficient. Now, all that has to be done is to define parallel tests, and the theory of test reliability (as well as psychological science) is one step closer to scientific respectability. Parallel tests are defined as equivalent test forms that are highly correlated with each other. Parallel tests are reliable tests.
This is sounding a lot like God and Her creation. Reliability is defined in terms of parallel tests. Parallel tests are defined in terms of reliability. All statements within the tautology are always true.
True Score
It doesn't get any better when we get to true score. Please come back for Part 3.
"Reliability is defined in terms of parallel tests. Parallel tests are defined in terms of reliability."
Isn't this too strong? We typically have independently strong reasons for knowing when two questions are parallel. If I ask in one test for the sum of angles in a ten-sided figure and in another for that sum for a nine-sided one, I have good reason to suppose that the two questions are answered correctly for the right reasons by basically the same set of people. In this case reliability bootstraps upon my ability to perform irrelevant modifications upon test questions, and there is no circularity. I suppose you could say we don't know a priori which differences are 'irrelevant', but A) it seems clear enough that we often can, and B) if that were the case we'd have bigger conceptual problems than testing!
Posted by: prasad | January 30, 2012 at 10:02 AM
@ Prasad,
Thanks for reading and commenting. Let's separate the common-sense idea of equivalence from the technical definition of the reliability of a test as the correlation between equivalent tests. You are constructing equivalent items that will go into constructing equivalent tests. You are a teacher, and you don't want students passing test items to another class. So you have equivalent test forms. There is no tautology here. Equivalent forms are equivalent forms.
Psychological Test Theory says that in order to measure reliability-as-consistent-scores you can give a test to your students, and then give them an equivalent test form the following week, and then compute the test-retest correlation coefficient. If the correlation coefficient is high, then you have high reliability, and the tests are bona fide parallel tests. If the correlation coefficient is low, then you have low reliability, and the tests are not parallel tests. Keep in mind that CTT is saying that reliability-as-consistent-scores is the same as reliability-as-consistent-rank-orders.
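Here is a sketch of that procedure in Python (the twelve pairs of scores are invented, and the 0.80 cutoff is my own assumption for illustration; CTT itself gives no principled threshold):

    import numpy as np

    def test_retest_reliability(form_a_scores, form_b_scores, cutoff=0.80):
        """Correlate scores on two 'equivalent' forms given a week apart.

        The cutoff is arbitrary and assumed here only to make the logic explicit.
        """
        r = np.corrcoef(form_a_scores, form_b_scores)[0, 1]
        return r, r >= cutoff

    # Invented scores for twelve students on form A, then form B a week later.
    form_a = np.array([62, 55, 71, 80, 66, 90, 58, 74, 85, 69, 77, 63], dtype=float)
    form_b = np.array([65, 52, 70, 84, 63, 92, 60, 71, 88, 72, 75, 61], dtype=float)

    r, parallel = test_retest_reliability(form_a, form_b)
    print(f"r = {r:.2f}; forms declared parallel: {parallel}")

    # Notice the circle: the forms are called parallel because r is high,
    # and r is called a reliability coefficient because the forms are parallel.

Nothing in the computation asks what either form is for.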
Now, there are several problems with this. One is that it is a tautology. That alone is not a show stopper. Mathematicians use them all the time. Sometimes they can be useful in finding helpful techniques for scientists. The criterion is one of utility. Did it work? Did it help? Is there anything better? Did it wind up confusing things and sending scientists down the wrong path?
A second problem is that it makes it impossible to distinguish, unambiguously, between reliability and validity. In the test-retest example I gave, the correlation coefficient was determined to be a measure of reliability. Now, if you gave the first test before the class on adding fractions, gave the same (or an equivalent) test after the class, and then computed the correlation coefficient, what would you have? Is the correlation coefficient a measure of reliability, or a measure of the predictive validity of the test?
...continued on next comment
Posted by: Norman Costa | January 30, 2012 at 06:52 PM
...continued from prior comment
A third problem is that CTT says that an unreliable test can never be valid for anything. I am going to show that a test that has zero reliability, according to CTT, can still be a valid test (successful when deployed for a purpose). Consider a test of Java programming. It has 50 questions in a multiple choice format. The purpose of the test is to measure knowledge of Java, and to demonstrate the effectiveness of a particular computer-based instruction module. The class is made up of 50 students who don't know squat about programming.
The test is administered before the first class. An analysis of the results shows that the students answered as if they were randomly choosing among the multiple choices. For this group, the average score indicates no better than random correct answers. These students don't know anything. At the end of the course the same, or an equivalent (parallel), test is administered. The average score is quite high, with a very small standard deviation.
Are the pre-test and post-test reliable tests? According to CTT, they are not reliable as measured by split-half correlations (the poor man's parallel tests) or test-retest correlations. On the pre-test the scores are random noise, so there is no stable rank order to correlate; on the post-test nearly everyone scores near the ceiling, so there is almost no variance left to correlate. Neither test is a valid measure of Java knowledge, according to CTT. Yet, we can do this for 35 classes and demonstrate successful learning in every case. I will lay this out in more detail in the next and following articles.
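Here is a quick simulation sketch of the scenario in Python (the 25 percent guessing rate and the 95 percent post-course success rate are figures I am assuming purely for illustration):

    import numpy as np

    rng = np.random.default_rng(0)
    n_students, n_items = 50, 50

    # Pre-test: nobody knows Java, so each 4-choice item is a 25% guess.
    pre_items = rng.random((n_students, n_items)) < 0.25

    # Post-test: after the course nearly everyone answers nearly every item
    # correctly (95% per item, with only chance variation among students).
    post_items = rng.random((n_students, n_items)) < 0.95

    pre_scores = pre_items.sum(axis=1)
    post_scores = post_items.sum(axis=1)

    def split_half_r(items):
        """Poor man's parallel tests: correlate odd-item and even-item half scores."""
        return np.corrcoef(items[:, 0::2].sum(axis=1), items[:, 1::2].sum(axis=1))[0, 1]

    print(f"pre-test mean  {pre_scores.mean():4.1f}/50, split-half r = {split_half_r(pre_items):+.2f}")
    print(f"post-test mean {post_scores.mean():4.1f}/50, split-half r = {split_half_r(post_items):+.2f}")
    print(f"test-retest r (pre vs post) = {np.corrcoef(pre_scores, post_scores)[0, 1]:+.2f}")

    # With no true individual differences on either administration, all three
    # coefficients land near zero, yet the jump from roughly 12/50 to roughly
    # 47/50 is exactly what a valid measure of learning should show.

By the CTT yardstick both administrations are worthless; by the only yardstick that matters, the purpose, the pair of tests did its job.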
Now, let's go back to your comment, which lays out a common-sense example of equivalence. Your TEST A is successful in detecting no knowledge before class, and a high level of knowledge after class. The success is documented for 9 separate classes. You create TEST B as equivalent in content to TEST A. What is the reliability of TEST B? If reliability is the expectation of validity, then TEST B has high reliability, based on the successful performance of TEST A in the past, all things being equal. Bringing in the CTT definition of reliability as correlation can now be seen as absurd.
We go back to my main thesis. There is only validity. Reliability is the expectation that a test will or will not be valid tomorrow. Consistency of observed results void of purpose, parallel tests, and correlation coefficients to measure reliability become meaningless.
Posted by: Norman Costa | January 30, 2012 at 06:53 PM
@ Prasad,
You keep going down a path of making things simpler. And that is exactly what will happen if we drop the CTT definition of reliability. I haven't even started on all the anomalies.
Posted by: Norman Costa | January 30, 2012 at 07:03 PM
Hi Norman, thanks for the very detailed response. I think I understand your line of argument better now. If I may provide a very lossy compression, reliability as compatibility-of-repeated-scores-on-parallel-tests a. says nothing about what the test measures or how well the test measures what it measures, and b. isn't a perfect test even of reliability itself since scores may vary for different reasons. In my line of work I think we have similar conceptual considerations.
Posted by: prasad | February 01, 2012 at 12:46 PM
@ Prasad:
Thanks. In the end I would like to be in agreement on the definition of reliability with the larger scientific community, and hopefully other scientists will agree. Put another way, if I say that my definition agrees with the rest of science, then other scientists should be able to say Yes or No.
As I mentioned at another time, there are differences in usage. Engineers, for example, put validity and reliability together. I like to think of validity in the engineering world as proof of concept. Yes, explosives on the end of a torpedo launched from a submarine is a workable means of sinking enemy shipping. The development of ever more detailed specifications for successful performance is a refinement of the definition of a torpedo. Only when tests show an acceptable level of valid performance will it be considered reliable.
Another difference is that the error term in the social sciences is usually a lot greater than what most other sciences would consider reasonable. Most scientists don't know this, but the size of the error term in econometrics is even greater than in psychology. So we vary on the size of the error term that we usually encounter. However, science is science is science. Reliability is reliability is reliability - except for psychology.
Eventually, I will show that when the chapters and sections on reliability are removed from all books and standards on psychological testing, there will be no adverse impact upon validity. In fact, the validity of tests will improve.
Posted by: Norman Costa | February 01, 2012 at 10:31 PM