Psychological Science: The Theory of Test Reliability – Correcting 100 Year Old Mistakes – Part 2 (Norman Costa)
Norman Costa, PhD
What is Real, and What is Not? – Zeno's Paradoxes
The Greek philosopher Aristotle (384 BCE, Stageira, Chalcidice – 322 BCE, Euboea), in his Physics, describes a number of paradoxes attributed to the Greek philosopher Zeno of Elea (ca. 490 BCE – 430 BCE). Zeno was trying to illustrate the idea that everything we see and experience, however varied it may appear, is unitary and unchanging. Experiences and perceptions of change, motion, and time are really illusions. His paradoxes have fascinated philosophers, scientists, mathematicians, and theologians through the centuries to this very day. Here are three of his more popular paradoxes:
Achilles pursuing the tortoise paradox
“In a race, the quickest runner can never overtake the slowest, since the pursuer must first reach the point whence the pursued started, so that the slower must always hold a lead.” --Aristotle, Physics VI:9, 239b15
A fast runner gives a challenger, who is a slow runner, a head start of 100 meters. The fast runner expects to catch up to and then overtake the slow runner on the way to the finish line. According to Zeno this is an impossibility. The swift of foot must first run to the point at which the slow of foot started the race – 100 meters down the course. Then the swift athlete must run to the next spot where the slow athlete was at the time that the swift runner arrived at the position where the slow runner began the race. Next, the swift runner advances to the next spot occupied by the slow runner at the time the swift runner arrived at the prior starting point. The swift runner keeps repeating this sequence of advancing to the next spot that was previously occupied by the slow runner, as both are heading for the finish line. According to Zeno, there is an infinite progression of the swift runner approaching the slow runner, catching up to his prior spot, and then going on to the next spot that was occupied by the slow runner down the course. Since the fast runner must advance to an infinite number of spots previously occupied by the slow runner, it stands to reason that the fast runner will never overtake the slow runner.
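The stages described above can be tallied with exact arithmetic. The sketch below is only an illustration, assuming hypothetical speeds of 10 m/s for the swift runner and 1 m/s for the slow one (only the 100-meter head start comes from the text); the function name achilles_stages is also made up for this example. It shows that the running total of the swift runner's stages keeps growing but never passes a fixed distance, the point at which ordinary arithmetic says the two runners are level.

```python
from fractions import Fraction


def achilles_stages(head_start, fast_speed, slow_speed, stages):
    """Return the swift runner's cumulative distance after each of Zeno's stages."""
    gap = Fraction(head_start)   # distance still separating the two runners
    covered = Fraction(0)        # total distance run by the swift runner
    totals = []
    for _ in range(stages):
        # The swift runner covers the current gap; this takes gap / fast_speed.
        time_taken = gap / fast_speed
        covered += gap
        # In that same time the slow runner opens a new, smaller gap.
        gap = slow_speed * time_taken
        totals.append(covered)
    return totals


# Assumed speeds (hypothetical): 10 m/s and 1 m/s. The 100 m head start is from the text.
totals = achilles_stages(head_start=100, fast_speed=10, slow_speed=1, stages=20)

# Closed-form meeting point: head_start * v_fast / (v_fast - v_slow).
catch_up = Fraction(100 * 10, 10 - 1)

print([float(t) for t in totals[:3]])
print(float(catch_up))
```

With these assumed speeds the first three totals are 100, 110, and 111 meters, and every later total stays below 1000/9 (about 111.1) meters, which is where the swift runner draws level.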
The dichotomy paradox
“That which is in locomotion must arrive at the half-way stage before it arrives at the goal.” --Aristotle, Physics VI:9, 239b10
I would like to walk across a room, from one side to the other. As I begin, I must first arrive halfway across the room. From that point I must reach the next halfway point to the other side, and so on for an infinite number of halfway points. Since I must traverse an infinite number of halfway points, it stands to reason that I will never reach the other side of the room.
The arrow paradox
“If everything, when it occupies an equal space is at rest, and if that which is in locomotion is always occupying such a space at any moment, the flying arrow is therefore motionless.” --Aristotle, Physics VI:9, 239b5
In order for motion to occur, an object must occupy a different space at different times. However, at any moment in time, an object is occupying a specific space and is at rest at that moment. Since for every one of an infinite number of moments an object is always at rest, it stands to reason that motion is impossible.
Zeno's conclusions are very clear. A swift runner, in pursuit, will never overtake a slower runner. I will never reach the other side of the room. An arrow in flight will never hit its target, because it never moves. It is curious why so many people have been fascinated with the paradoxes and why they have attempted to explain or resolve them. After all, I could just get up and walk in the direction of the opposite wall, and it is a certainty I will reach it. I won't get into why Archimedes, Thomas Aquinas, Galileo, and Bertrand Russell all tried to solve Zeno's paradoxes, except to say that they were trying to explain the nature of reality, vis-à-vis the paradoxes.
What does this have to do with the theory of test reliability? The issue for test reliability is, in fact, the nature of reality – at least as far as test theory is concerned. In Zeno's three paradoxes we find the concept of infinity: an infinity that is approached, as with the runner in pursuit and the walk from one side of the room to the other; and an infinity of unlimited partitions of time for the duration of the arrow's flight. For our purposes, the reason infinity does not hinder the race, touching the opposite wall, or the arrow in flight is that infinity is a concept, not a reality. Infinity is a thought, an idea, an abstraction, a concept. Infinity does not exist. It is not real. However, the concept of infinity can be enormously useful in mathematics, physics, philosophy, theology, and playing games of paradox. Thinking of an abstraction as a reality is a problem for more than one issue in psychological test theory. It is very hard not to imbue abstractions with reality, especially when the abstraction is very useful and 'feels' so obviously concrete.
What is Real, and What is Not? – The Tautology
A tautology is a self-contained logical system. In a tautology, every statement is true. Why are all statements true? They are true, not in a philosophical, scientific, or moral sense. They are true only in an abstract, symbolic sense. They are true because the tautology says they are true. A tautology is also known as circular reasoning. One thing is defined in terms of another. The other thing is defined in terms of the first. Here is a very good and very simple example.
Who made you?
God made me.
Who is God?
God is the infinitely Supreme Being Who made all things.
What makes this a tautology? God is defined in terms of creation, and creation is defined in terms of God. God is the one who makes things, and I am one of those things that was made. Or, I was made by God, and God is the one who makes things. Within the circular, self-contained logic of the tautology, all statements are true, always. In this case, there are two statements: 1. God made me. 2. God is the infinitely Supreme Being Who made all things.
While it may not be apparent, tautologies can be useful, especially in pure mathematics. Sometimes mathematics derived from tautologies can help scientists solve problems in their areas of investigation. The temptation, however, is to see the mathematics or the tautology as accurately depicting the real world being studied in the science. This happens all the time. Occasionally, the tautologies and their derived mathematics can be applied inappropriately. The danger is that the results they produce can look so good and so right that a careful critique of the application is discarded altogether. This is what happened to psychological test theory, as I will discuss below.
The Nobel physicist, Richard Feynman, talked about this in the second lecture of his series on The Character of Physical Law (Messenger Lectures at Cornell University, delivered in 1964 and published in 1965). The second lecture, The Relation of Mathematics to Physics, is as relevant for psychological science as it is for physics. Feynman could not emphasize enough that mathematics is not physics, and physics is not mathematics. This is especially a problem, he said, for those who come over to theoretical physics from mathematics. There are more than a few physicists and mathematicians who entertain the idea that the universe IS a differential equation. In psychological science we have the same problem for those in quantitative psychology. I will talk more about this later in this article.
The Theory of Test Reliability – A Foundation of Sand
Test theory is a mathematical rationale for describing the behavior of tests, their items, and the scores that are generated from them. It has been this way since the foundations of test theory were laid down by the work of Charles Spearman, published from 1904 through 1913. Of course, Spearman's ideas were influenced by those who came before him, as well as by his contemporaries. Separate from the theory's mathematical rationale are the following: how tests are used in the real world, why they are used, what testing applications will be developed, and the choice of statistics that are used for interpretation and making decisions. Test theory is an abstraction before it is anything else. It is developed as a quantitative theory in mathematical terms before applications are found, before tests are developed, and before decisions are made from test results. In science, this is business as usual and there is nothing remarkable about it.
Psychological test theory comes in two flavors: Classical Test Theory (CTT) and Item Response Theory (IRT). For the most part they are complementary approaches to psychological testing, but they do have some substantial differences. I am pointing out the two approaches in order to say that I am focusing on CTT, and will not discuss IRT in this series of articles on reliability theory. IRT will, eventually, be an important part of the discussion on reliability, but that will have to wait.
The Standards for Educational and Psychological Testing (1999) is clear about the preeminent position of validity in the use of tests. “Validity refers to the degree to which evidence and theory support the interpretations of test scores for proposed uses of tests. Validity is, therefore, the most fundamental [emphasis mine] consideration in developing and evaluating tests.” However, Classical Test Theory (CTT) is largely a theory of reliability. In fact, reliability is the core of test theory. We will see that reliability is defined, within the mathematical rationale of test theory, as the foundation of validity. The things that influence the reliability of a test will be seen to influence, in turn, the test's validity. So when we talk about CTT, we are talking about a theory of test reliability.
CTT begins and ends with the understanding that reliability in all other successful sciences is defined by consistency of observed results, without reference to purpose. CTT is the result of an attempt to emulate the ideas and processes of science in other fields. To this end, two fundamental concepts were invented by psychological science: parallel tests and true score. Each concept will define test reliability and, together, will be considered two sides of the same reliability coin.
Consistency of test results could be measured by having someone retake a test. It seemed a matter of common sense that if the second administration yielded the same score as the first administration, for the same individual, then the test produced consistent results. It was, therefore, a reliable test. Repeating test administration for a group of test takers seemed a reasonable way to generalize the reliability of a test to more than one individual. Experience showed that with the best of tests and the best of test takers, not everyone would get the exact same score from the first to the second administration. Even if an individual got the same test score on both administrations, there might be differences in which items they answered correctly. Also, the average score for a group of individuals might differ from the first to the second administration. If these differences were few, or not large, then one could conclude that the test was more or less reliable. If differences between administrations were large and variable, then the test might not be reliable.
Spearman proposed using one of the recently developed correlation formulas to measure the relationship between scores on the first and second administrations of a test, for the same group of test takers. A correlation formula produced a single statistic, r, called a correlation coefficient. The absolute value of r ranged from 0.00 to 1.00. If r = 0.00, there was no relationship. If r = 1.00, there was a perfect relationship. A correlation coefficient of r = 0.93 is a high value, and indicates a high relationship between test scores from the first to the second administration. A correlation coefficient of r = 0.26 is a low value, and the relationship of scores from the first administration to the second is nothing to write home about. A high value of r means a high relationship between scores on both test administrations, and is interpreted as high consistency of results, or high reliability. A low value of r means low reliability. A high value of r means that individuals who ranked higher than others on the first test tended to be ranked higher than the same group of others on the second test. Likewise, if someone was lower than others on the first test, then that person would tend to be lower on the second test.
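To make the statistic concrete, here is a minimal sketch of the Pearson product-moment formula applied to a test and a retest. The scores are invented for illustration and are not data from any study; the function name pearson_r and the variable names are likewise hypothetical. One retest nearly preserves the rank order of the six test takers, and the other scrambles it.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    dev_x = [a - mean_x for a in x]   # deviations from the first-test mean
    dev_y = [b - mean_y for b in y]   # deviations from the retest mean
    covariation = sum(a * b for a, b in zip(dev_x, dev_y))
    return covariation / sqrt(sum(a * a for a in dev_x) * sum(b * b for b in dev_y))


# Hypothetical scores for six test takers (illustration only).
first = [52, 61, 70, 74, 83, 90]
consistent_retest = [55, 60, 72, 71, 85, 88]   # rank order nearly preserved
scrambled_retest = [72, 55, 88, 85, 60, 71]    # rank order largely lost

print(round(pearson_r(first, consistent_retest), 2))
print(round(pearson_r(first, scrambled_retest), 2))
```

Under the reasoning in the text, the first pairing would be read as a reliable test and the second as an unreliable one, even though both retests contain exactly the same set of scores.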
Test reliability was consistency of scores. Consistency was defined in terms of rank order of test takers from one administration to the next. Consistency of rank order from one test to another was computed as a single value from a correlation formula. Therefore, test reliability was defined as the correlation of test scores within the same group from one administration to the next. No account is taken of any consistent changes in the average of group scores from one test to the other. No account is taken of the purpose, or validity, of the test.
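The point about group averages can be checked directly. In the sketch below, which uses invented scores and a hypothetical helper named pearson_r, every test taker gains exactly ten points on the retest. The group average moves, but the correlation does not register the change.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical first-administration scores (illustration only).
first = [52, 61, 70, 74, 83, 90]

# Assumed retest: everyone scores exactly ten points higher.
retest_shifted = [score + 10 for score in first]

mean_gain = sum(b - a for a, b in zip(first, retest_shifted)) / len(first)
r = pearson_r(first, retest_shifted)

print(mean_gain)
print(round(r, 6))
```

The average rises by ten points, yet r stays at 1 (up to floating-point rounding), because the rank order and the relative spacing of the scores are unchanged.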
There were some practical problems that arose. Using the exact same test from first to second administration produced correlation coefficients that seemed too high. It was reasoned that the apparent high consistency could be a function of remembering from one administration to the next. In theory, the different test administrations had to be independent of each other. Remembering what you answered the first time made the second administration dependent upon the first, and made the correlation coefficient artificially high. The solution to the problem was to use different but equivalent forms of the same test for each administration. A test of eighth grade arithmetic could be produced, for example, in several equivalent forms. Use test form A on the first administration, and test form B on the second. This would reduce, it was thought, the familiarity or memory factor that would influence the results on the second test. The process came to be known as parallel tests. Parallel tests are the same as equivalent forms of the same test.
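The memory argument can be illustrated with a toy simulation. Everything in it is an assumption made for this sketch rather than part of the historical account: each score is modeled as a stable ability plus occasion-specific noise, and a hypothetical carryover parameter sets how much of the first occasion's noise reappears on the second. The function names simulate_retest and pearson_r are also invented here.

```python
import random
from math import sqrt


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def simulate_retest(carryover, n=2000, seed=0):
    """Correlate two administrations when part of the first-test noise carries over."""
    rng = random.Random(seed)   # fixed seed so both conditions use the same draws
    first, second = [], []
    for _ in range(n):
        ability = rng.gauss(0, 1)   # stable trait, identical on both occasions
        noise_1 = rng.gauss(0, 1)   # occasion-specific noise on the first test
        fresh = rng.gauss(0, 1)     # new noise available on the second test
        # carryover = 0 gives independent noise; larger values stand in for memory.
        noise_2 = carryover * noise_1 + sqrt(1 - carryover ** 2) * fresh
        first.append(ability + noise_1)
        second.append(ability + noise_2)
    return pearson_r(first, second)


independent = simulate_retest(carryover=0.0)   # administrations independent
remembered = simulate_retest(carryover=0.6)    # assumed memory effect

print(round(independent, 2), round(remembered, 2))
```

In this model the expected correlation is (1 + carryover) / 2, so it rises from about 0.5 to about 0.8 when carryover goes from 0 to 0.6, although the underlying ability is identical in both runs. That is the sense in which remembering can make a coefficient artificially high.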
Reliability is defined in terms of correlation coefficients between parallel tests. Another seemingly important use has been found for one of the greatest statistical inventions in all of social science: the correlation coefficient. Now, all that has to be done is define parallel tests, and the theory of test reliability (as well as psychological science) is one step closer to scientific respectability. Parallel tests are defined as equivalent test forms that are highly correlated with each other. Parallel tests are reliable tests.
This is sounding a lot like God and Her creation. Reliability is defined in terms of parallel tests. Parallel tests are defined in terms of reliability. All statements within the tautology are always true.
It doesn't get any better when we get to true score. Please come back for Part 3.