
January 24, 2012

Comments

"Reliability is defined in terms of parallel tests. Parallel tests are defined in terms of reliability."

Isn't this too strong? We typically have independently strong reasons for knowing when two questions are parallel. If I ask in one test for the sum of angles in a ten-sided figure and in another for that sum for a nine-sided one, I have good reason to suppose that the two questions are answered correctly, for the right reasons, by basically the same set of people. In this case reliability bootstraps on my ability to perform irrelevant modifications to test questions, and there is no circularity. I suppose you could say we don't know a priori which differences are 'irrelevant', but (A) it seems clear enough that we often can, and (B) if that were the case we'd have bigger conceptual problems than testing!
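(For reference, both items rest on the same formula: the interior-angle sum of an n-sided polygon is (n − 2) × 180°, i.e. 1440° for ten sides and 1260° for nine, which is why the modification is plausibly irrelevant.)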

@ Prasad,

Thanks for reading and commenting. Let's separate the common-sense idea of equivalence from the technical definition of the reliability of a test as the correlation between equivalent tests. You are constructing equivalent items that will go into equivalent test forms. You are a teacher, and you don't want students passing on test items to another class, so you have equivalent test forms. There is no tautology here. Equivalent forms are equivalent forms.

Classical Test Theory (CTT) says that in order to measure reliability-as-consistent-scores you can give a test to your students, give them an equivalent test form the following week, and then compute the test-retest correlation coefficient. If the correlation coefficient is high, you have high reliability and the tests are bona fide parallel tests. If the correlation coefficient is low, you have low reliability and the tests are not parallel tests. Keep in mind that CTT is saying that reliability-as-consistent-scores is the same as reliability-as-consistent-rank-orders.
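As a concrete sketch of that test-retest computation (the scores below are invented for illustration, not data from any real class):

```python
# Test-retest reliability as CTT defines it: the correlation between
# scores on two equivalent forms given a week apart.
# The score lists are hypothetical, for illustration only.
from scipy.stats import pearsonr

form_week1 = [72, 85, 90, 64, 78, 88, 55, 95, 81, 70]
form_week2 = [70, 88, 87, 66, 75, 90, 58, 93, 84, 72]

r, _ = pearsonr(form_week1, form_week2)
print(f"test-retest correlation: r = {r:.2f}")
# Under CTT, a high r certifies the two forms as parallel and the test
# as reliable; a low r says the forms are not parallel / not reliable.
```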

Now, there are several problems with this. One is that it is a tautology. That alone is not a show-stopper; mathematicians use tautologies all the time, and sometimes they turn out to be useful in finding helpful techniques for scientists. The criterion is one of utility. Did it work? Did it help? Is there anything better? Did it wind up confusing things and sending scientists down the wrong path?

A second problem is that it makes it impossible to distinguish, unambiguously, between reliability and validity. In the test-retest example I gave, the correlation coefficient was taken to be a measure of reliability. Now, if you gave the first test before the class on adding fractions, gave the same (or an equivalent) test after the class, and then computed the correlation coefficient, what would you have? Is the correlation coefficient a measure of reliability, or a measure of the predictive validity of the test?
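To make the ambiguity concrete: the before/after design runs exactly the same computation, and nothing in the arithmetic says which label the result deserves (scores again hypothetical):

```python
# Same computation as above, different design: one administration before
# the class on adding fractions, one after. Hypothetical scores.
from scipy.stats import pearsonr

pre_class  = [40, 55, 62, 35, 48, 58]
post_class = [75, 88, 92, 70, 80, 90]

r, _ = pearsonr(pre_class, post_class)
# Is r the test's reliability, or its predictive validity? The number
# itself gives CTT no way to tell the two apart.
print(f"r = {r:.2f}")
```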

A third problem is that CTT says that an unreliable test can never be valid for anything. I am going to show that a test that has zero reliability, according to CTT, can still be a valid test (successful when deployed for a purpose). Consider a test of Java programming. It has 50 questions in a multiple-choice format. The purpose of the test is to measure knowledge of Java and to demonstrate the effectiveness of a particular computer-based instruction module. The class is made up of 50 students who don't know squat about programming.

The test is administered before the first class. An analysis of the results shows that the students answered as if they were choosing randomly among the alternatives; the average score is no better than chance. These students don't know anything. At the end of the course the same, or an equivalent (parallel), test is administered. The average score is quite high, with a very small standard deviation.

Are the pre-test and post-test reliable tests? According to CTT, they are not reliable as measured by split-half correlations (the poor man's parallel tests) or test-retest correlations, so neither test can be a valid measure of Java knowledge. Yet we can repeat this for 35 classes and demonstrate successful learning in every case. I will lay this out in more detail in the next and following articles.
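A minimal simulation of this scenario (everything here is an assumption for illustration: four answer choices, a 95% post-course success rate, and randomly generated students):

```python
# Sketch of the Java-class example: 50 students guess at random on a
# 50-item multiple-choice pre-test, then answer nearly everything
# correctly after the course. Parameters are assumptions, not data.
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(0)
n_students, n_items, n_choices = 50, 50, 4  # 4 choices is an assumption

# Pre-test: pure guessing, so each item is correct with p = 1/4.
pre = rng.binomial(1, 1 / n_choices, size=(n_students, n_items))

# Split-half correlation on the pre-test (odd- vs even-numbered items):
# the "poor man's parallel tests".
half1 = pre[:, 0::2].sum(axis=1)
half2 = pre[:, 1::2].sum(axis=1)
r_split, _ = pearsonr(half1, half2)
print(f"pre-test split-half r = {r_split:.2f}  # near zero: 'unreliable'")

# Post-test: high mastery, so each item is correct with p = 0.95.
post = rng.binomial(1, 0.95, size=(n_students, n_items))
print(f"pre mean:  {pre.sum(axis=1).mean():.1f} / {n_items} "
      f"(chance = {n_items / n_choices:.1f})")
print(f"post mean: {post.sum(axis=1).mean():.1f} / {n_items}")
# The pre/post jump documents successful learning - the test did its
# job - even though CTT's reliability coefficients here are near zero.
```

Note that the post-test comes out 'unreliable' by the same measures for the opposite reason: with nearly everyone scoring near the ceiling, there is almost no between-student variance for a correlation to pick up.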

Now, let's go back to your comment, which lays out a common-sense example of equivalence. Your TEST A is successful in detecting no knowledge before class and a high level of knowledge after class, and the success is documented for 9 separate classes. You create TEST B as equivalent in content to TEST A. What is the reliability of TEST B? If reliability is the expectation of validity, then TEST B has high reliability, based on the successful past performance of TEST A, all else being equal. Bringing in the CTT definition of reliability-as-correlation can now be seen as absurd.

We go back to my main thesis: there is only validity. Reliability is the expectation that a test will or will not be valid tomorrow. Consistency of observed results divorced from purpose, parallel tests, and correlation coefficients as measures of reliability all become meaningless.

@ Prasad,

You keep going down a path of making things simpler. And that is exactly what will happen if we drop the CTT definition of reliability. I haven't even started on all the anomalies.

Hi Norman, thanks for the very detailed response. I think I understand your line of argument better now. If I may provide a very lossy compression: reliability as compatibility-of-repeated-scores-on-parallel-tests (a) says nothing about what the test measures or how well it measures it, and (b) isn't a perfect test even of reliability itself, since scores may vary for different reasons. In my line of work I think we have similar conceptual considerations.

@ Prasad:

Thanks. In the end I would like to be in agreement with the larger scientific community on the definition of reliability, and hopefully other scientists will agree. Put another way, if I say that my definition agrees with the rest of science, then other scientists should be able to say Yes or No.

As I mentioned at another time, there are differences in usage. Engineers, for example, put validity and reliability together. I like to think of validity in the engineering world as proof of concept: yes, explosives on the end of a torpedo launched from a submarine are a workable means of sinking enemy shipping. The development of ever more detailed specifications for successful performance is a refinement of the definition of a torpedo. Only when tests show an acceptable level of valid performance will it be considered reliable.

Another difference is that the error term in the social sciences is usually a lot greater than what other fields consider reasonable. Most scientists don't know this, but the error term in econometrics is even greater than in psychology. So we vary in the size of the error term we usually encounter. However, science is science is science. Reliability is reliability is reliability - except in psychology.

Eventually, I will show that when the chapters and sections on reliability are removed from all books and standards on psychological testing, there will be no adverse impact upon validity. In fact, the validity of tests will improve.
