Psychology's Quest for Scientific Respectability
Norman Costa Ph.D.
(Note: This article was originally published in two parts in January and February of 2012
under the titles "Psychological Science: Mathematical Argument and the Quest for
Scientific Respectability - Part 1 and 2." The reason for combining the two was so that
it could be submitted for the 3QuarksDaily prize in Science Writing.)
Part 1 - Mathematical Argument
We are reminded by Carl Sagan in his book, Cosmos, that the underpinning of modern
science with mathematics goes back to Pythagoras. In the search for truths in nature,
however, we no longer look for them in Pythagoras' mystical, even magical, power of
numbers. Today, mathematics is indispensable for science as method, and science as
content. We count, measure, perform basic operations (add, subtract, multiply, and
divide), compute values, solve equations, use visual display to communicate quantitative
information, conduct statistical tests, and represent things and ideas with symbols and
relationships.
The history of psychological science, even to the present day, has been a quest for scientific
respectability. Few things have been as important to this quest as the development of
mathematical argument for the science of psychology. Nothing has been more important,
or as far reaching, for mathematical argument in psychology, than the development of the
correlation coefficient. Because much of psychology (and the social sciences in general)
has been the examination of individual differences, it was inevitable that tools be developed
to express relationships and dependencies among different traits, capabilities, and just
about anything that could be measured and recorded about people.
The rapid-fire discoveries, in the 19th century, of fundamental laws of nature in physics,
chemistry, and life sciences created an air of expectation, pride, and optimism. Some
held the view that the final discovery of all laws of physical nature would be concluded
in the early part of the new century. Psychology envisioned its own role in this great leap
forward in knowledge and science. The development of mathematical argument was
about to elevate psychology to a level that was on par with the more successful physical
and life sciences – or so it was hoped.
It is difficult to appreciate, today, how exciting it was for scientific psychology in the late
19th and early 20th centuries. The development of the correlation coefficient became the
Royal Road to scientific respectability, at least in the minds of the pioneers of
psychological science. Statistical correlation formulas provided powerful tools that could
be applied to a myriad of problems in the budding social and economic sciences. The
correlation coefficient led to the development of other powerful tools like multiple
correlation, canonical correlation, regression, and factor analysis. It gave impetus and
support to the development of other tools for mathematical argument, particularly the
concept of true score, and statistical tests.
A. The Correlation Coefficient, r
The idea of using mathematics to demonstrate a dependency or association between
two factors began with the work of Auguste Bravais (23 August 1811, Annonay, Ardèche
– 30 March 1863, Le Chesnay, France). One is more likely to read that Sir Francis Galton
FRS (16 February 1822, Birmingham, England – 17 January 1911, Haslemere, Surrey,
England) was the originator of mathematical correlation. This is not true, though Galton,
considered the father of modern psychometrics, was a genius developer of the descriptive
statistics that we use to this day: standard deviation, regression analysis, and the properties
of the bivariate normal distribution. He saw a need to quantify the relationship between
different variables in biometric studies, census and population data, psycho-physical data,
and in hereditary and eugenic research. He wanted to express relation and degree of relation
in his research.
Galton turned to the work of Auguste Bravais. Building upon Bravais' work he developed some
approaches and early indexes of association. Galton did not invent the correlation coefficient,
but he was the first person to apply correlation to data that he collected in the field. Famously,
Galton was the first person to give the correlation coefficient a single symbol, r. Karl Pearson
FRS (27 March 1857, Islington, London, England – 27 April 1936, Coldharbour, Surrey,
England), a student of Galton, wrote of the contribution of his mentor, and the origin of
mathematical correlation in Philosophical Transactions of the Royal Society of London, 1897,
“The fundamental theorems of correlation were for the first time and almost exhaustively
discussed by [Auguste] Bravais ('Analyse mathématique sur les probabilités des erreurs
de situation d'un point.' Mémoires par divers Savans, T. IX., Paris, 1846, pp. 255-332)
nearly half a century ago. He deals completely with the correlation of two and three
variables. Forty years later Mr. J. D. Hamilton Dickson ('Proc. Roy. Soc.,' 1886, p. 63) dealt
with a special problem proposed to him by Mr. Galton, and reached on a somewhat narrow
basis* some of Bravais' results for correlation of two variables. Mr. Galton at the same
time introduced an improved notation which may be summed up in the 'Galton function' or
'coefficient of correlation.' This indeed appears in Bravais' work, but a single symbol is
not used for it.”
There was great enthusiasm for measuring association, but none of the early approaches and
indexes was wholly satisfactory. Not until Galton took Karl Pearson under his wing as a
protege did the mathematics and statistics of association become firmly developed.
Karl Pearson was a brilliant mathematician and mathematical statistician who contributed to
the work of Galton and his students in developing statistical tools to measure association
(what we know as relatedness or correlation). The single most important statistic of Pearson,
especially regarding the development of modern theory of psychological and educational
testing, was Pearson's Product Moment Correlation Coefficient.
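For reference, the product-moment coefficient is usually written as below. The notation is the modern textbook form, not Pearson's own: it assumes n paired observations (x_i, y_i) with sample means \bar{x} and \bar{y}.

```latex
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\;\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
```

The value runs from -1 (perfect inverse linear relationship) through 0 (no linear relationship) to +1 (perfect direct linear relationship).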
Karl Pearson's work overlapped with that of Galton's other students, notably Charles Spearman FRS
(10 September 1863, London, England – 17 September 1945, London, England) in the United
Kingdom. Spearman was to develop his own measure of relationship, which paralleled
Pearson in concept, but used a different computational approach. We know it as Spearman's
Rank Order Correlation Coefficient. In time, Karl Pearson and Charles Spearman would have
differences on various aspects and uses of correlation formulas. Spearman was the most
successful in finding applications for statistical computations of association. He was the most
articulate and insightful statistician and psychologist when it came to applying correlation
analysis to the premier research problem of the day – the study of human intelligence and its
composition. He developed the techniques of factor analysis – derived and extended from
correlation analysis – that would be directed toward answering questions about the elemental
nature of human intelligence.
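The relationship between the two coefficients can be sketched in a few lines of Python. This is a minimal illustration, not a reconstruction of either man's original computations: it treats Spearman's coefficient as Pearson's formula applied to ranks, and the function names and the sample data are hypothetical.

```python
from math import sqrt

def pearson_r(x, y):
    """Product-moment correlation of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products and sums of squares of deviations from the means
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def ranks(values):
    """Rank from 1 (smallest); tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of a run of tied values
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def spearman_rho(x, y):
    """Rank-order correlation: Pearson's formula applied to the ranks."""
    return pearson_r(ranks(x), ranks(y))

# Hypothetical data: the second variable rises with the first, but not linearly.
heights = [150, 160, 170, 180, 190]
scores = [1, 8, 27, 64, 125]

r = pearson_r(heights, scores)       # below 1: the relationship is not a straight line
rho = spearman_rho(heights, scores)  # 1: the rank orders agree perfectly
```

The contrast between the two results shows the difference in approach: the product-moment coefficient is sensitive to the spacing of the values, while the rank-order coefficient looks only at relative standing.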
Edward L. Thorndike (August 31, 1874, Williamsburg, Massachusetts, U.S. – August 9, 1949,
Montrose, New York, U.S.) was pivotal in introducing Spearman's concepts to American
psychologists, because he, too, was looking for measures of association, and a way to
measure test reliability. Spearman's two important publications in 1904 had to be published
in America since the British Journal of Psychology had not yet been inaugurated. They were:
““General intelligence,” objectively determined and measured.” American Journal of
Psychology. 15, 201-293; and “The proof and measurement of association between two
things.” American Journal of Psychology. 15, 72-101. This was fortuitously helpful to
Thorndike in the development of his own views on mental and social measurement. Working
from Columbia University in New York City, he put Spearman's work into his very influential
text, An introduction to the theory of mental and social measurement (New York: Teachers
College, Columbia University; 1904, 1913, 1922).
The period for development of concepts, and statistical formulations, for measures of
association – what we call correlation – was an exciting time among early
psychologists and psychometricians. Psychology, virtually alone, was in the
forefront of applied statistical development when it came to measures of
association. Not only would it do wonders for applied problems in psychology
and education, they believed it would bolster the image and credentials for
psychology as a science, and lead to the recognition of psychological science
from their contemporaries in other successful sciences.
B. Psychological Test Theory - The True Score and Reliability
Spearman was the first person to articulate the concepts of true score, and error score
(observed score minus true score), and the idea of errors of measurement. The invention of
the true score, a mathematical construct, is one of the most important events in the
history and development of psychological and educational testing. It is literally true, that
from this single invention was spawned a world-wide, multi-billion dollar industry, as
ubiquitous and powerful, today, as anytime in its history.
When it came to test theory, Spearman was in the forefront, as well. After developing his
concept of true score, and applying it to the study of tests and testing, he came to propose
correlation as a measure of the reliability of a test. This was another extremely exciting
moment for psychology and education. It is impossible to overstate the professional pride
and collective sense of achievement among psychologists in America and Europe
(England in particular), in what they perceived to be the elevation of psychology to the
status of legitimate science.
The rationale for the use of the correlation coefficient to measure reliability proceeded in
the following way. First, the principal researchers in the field had been querying their
colleagues in the more successful sciences about the nature of measurement and
scientific instrumentation. They saw direct parallels to measurement and tests in
psychology. The scientific psychologists wanted to emulate the same processes, and
develop analogous concepts that would be recognized and understood by their
colleagues in those sciences.
In the manner of anecdote, in varied notes and writings, they talked about the counsel
received from their scientific colleagues in other disciplines. They learned, so they
thought, what was regarded as an essential element in scientific observation and
measurement. A measurement instrument was only as good as it was reliable. It had
to produce the same measurement score or results every time it was used, all things
being equal. If repeated measurements gave different scores, when the same results
were expected, then the measurement instrument, or test, was unreliable and of no
use to science. These informal accounts of encounters with scientists in more
successful fields never mention any extended discussion of the definition of reliability,
or examples of measurement. The only thing they remember and report is that
reliability is consistency of observed scores. There is no indication that they
interrogated their colleagues in other fields with any desire to understand the concept
of reliability any further.
Spearman, and all others in the early years of scientific psychology, interpreted this
counsel from non-psychological scientists in a very literal manner. Reliability was
consistency of measurements. This literal translation of the concept of reliability would
be a huge mistake. I shall cover this in a later article.
Second, Spearman believed that correlation analysis would measure consistency of
scores. The technique of correlation was straightforward. Administer a test, and then
re-administer the same test to the same sample of people. If the test was reliable, and
would produce consistent scores upon re-administration, then he should observe the
same relative standing of people on both test administrations. The people who tended
to be higher than the others on the first administration, should tend to be higher than
the others on the second administration. The same would be true for those who tended
to score lower than others on the first administration. They would tend, also, to be
lower on the second administration. Correlation coefficients, then as now, if they do
anything, indicate whether relative standing on one test, is related to relative standing
on another test.
The question never answered, because it was never asked, is why consistency of score
should be inferred from rank order of scores. Nothing in the definition of reliability says
anything about absolute values of individual scores, or mean performance of the
samples. This is the origin of the inability to distinguish, unambiguously, between
reliability and validity. I shall cover this in a later article.
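The point about relative standing can be illustrated with a small test-retest sketch in Python. The scores are invented for illustration, and the helper function is a plain product-moment correlation, assumed here as the reliability index being discussed.

```python
from math import sqrt

def pearson_r(x, y):
    """Product-moment correlation of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Hypothetical scores for five people on the first administration.
first = [52, 61, 70, 84, 93]

# Retest A: everyone scores exactly 10 points higher.
retest_shifted = [s + 10 for s in first]

# Retest B: the scores are stretched (doubled, then lowered by 40 points).
retest_stretched = [2 * s - 40 for s in first]

r_shifted = pearson_r(first, retest_shifted)
r_stretched = pearson_r(first, retest_stretched)

# No individual score is the same on retest A, and the group mean has moved.
mean_gain = sum(retest_shifted) / len(first) - sum(first) / len(first)
unchanged = sum(1 for a, b in zip(first, retest_shifted) if a == b)
```

In both retests the correlation is 1 to within rounding, although no person obtained the same score twice. The coefficient registers that rank order and relative spacing were preserved, and says nothing about whether the absolute values were reproduced.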
Finally, from Spearman's conceptualization of true score, he reasoned that he could
parse the relative size of true variance from total observed variance, and use a
correlation coefficient as a means to estimate the ratio of one to the other. The ratio
would be a measure of consistency of scores, or reliability. The excitement over
progress in the field of mental and social measurement was validated for those
pioneering psychologists by Spearman's rationale for the use of the correlation
coefficient for measuring reliability. After all, the correlation coefficient was one of the
great inventions of the new sciences of statistics and psychology. I will discuss in a
later article that reliability as consistency of measurement scores was an incomplete
interpretation of what they heard from their non-psychologist colleagues. Also, the
concept of true score was a hypothetical abstraction and an assumption at that,
though it was treated as an axiom. We are going to find that this led to a
fundamental mistake. The error was in treating the abstract concept of
true score as if it were an actual reality.
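The reasoning in this paragraph is usually summarized, in later textbook notation rather than Spearman's own, as follows. The symbols are assumptions introduced here: X is an observed score, T the true score, E the error, and X' a second administration of the same test. The result also assumes that error is uncorrelated with true score, that errors on the two administrations are uncorrelated with each other, and that the error variance is the same on both occasions.

```latex
X = T + E, \qquad
\sigma_X^2 = \sigma_T^2 + \sigma_E^2, \qquad
\rho_{XX'} = \frac{\sigma_T^2}{\sigma_X^2}
```

Under those assumptions the correlation between the two administrations equals the proportion of observed variance that is true variance. Note that T never appears in any data set; the ratio is inferred entirely from the assumptions.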
Part 2 - The Problem of Measurement and the Greatest Scientific
Side-Step in the History of Psychology
The recognition of psychology as a proper science was a major goal of the pioneers
in experimental psychology, and those interested in the study of individual differences.
They understood that the basic function of science was to describe the properties of
things through observation and the recording of data. Key to the process of observation,
and essential for mathematical argument, is the concept of measurement.
The early psychologists saw, clearly, the parallel between measurement tools in
psychology (tests, scales, experimental apparatuses), and measurement instruments
in the more successful sciences. In anecdotes, they told of conversations they had
with other scientists on the reliability of test instruments. They were told, so they
reported, that a measurement tool must yield the exact same results each time it is
used, assuming identical circumstances, if it is going to be considered reliable.
They translated their understanding of test reliability into what is formulated today in
the Standards for Educational and Psychological Testing – reliability is consistency
of observed results. In this definition there is no reference to the purpose or function
of the test instrument. To be reliable, it is only necessary for a test to be consistent
in yielding observed scores, independent of the purpose of the test. In another article,
I will show that the concept of reliability, without a statement of purpose, is an
illusion. It was a fundamental mistake for psychological science, and was completely
out of sync with the concept of test reliability in all other sciences.
They believed they addressed the issue of mathematical argument with the development
of descriptive statistics (beginning with Galton and Pearson), the development of the
correlation coefficient to express relationship (Galton, Pearson, Spearman and others),
and in the invention and development of factor analysis for research on the structure of
human intelligence (Spearman, Thurstone, and others). The body of statistical tools,
tests, scales, experimental apparatus, and text books (Thorndike, Spearman,
Thurstone, Guilford, Guttman, and Gulliksen) were crowning achievements and
sources of pride for scientific psychology. Of great importance and satisfaction to
psychology was the development of the correlation coefficient, and its application to
assessing the reliability of measurement instruments. Eventually, we will see that
using the correlation coefficient to measure reliability was a fundamental mistake. The
mistake was brought about by an overvaluing of the correlation coefficient, and an
over-eagerness to develop mathematical argument for psychological science.
Recognition from the more successful sciences was not forthcoming, however. The
scientific community never looked closely at the psychologists' definition of reliability
that was without a statement of purpose, nor at the use of the correlation coefficient to
measure reliability. What they did focus on was psychology's understanding, or lack
of understanding, of the concept of measurement in science.
A. The Ferguson Committee
To understand the failure of recognition from other sciences, it is necessary to go back
to the work of the Ferguson Committee, established by the British Association for the
Advancement of Science in 1932. The purpose of the committee was to determine
whether or not real scientific measurement was a possibility for the social sciences.
In other words: Could the field of psychology aspire to the level of a real science or
not? The Ferguson Committee, dominated by N. R. Campbell, an important figure in
the philosophy of science for the physical sciences, answered the question with a
resounding, “No!” The official report put its response in a highly technical treatment,
but the issues for psychological science come down to three elementary points:
1. The definition of measurement;
2. Establishing measurement standards (units of measurement) through research;
3. And physical and structural additivity.
During the time that the Ferguson Committee was seated, psychological science did
not address these issues, and, for the most part, did not understand them. Today,
psychological science has no definition of measurement, no research on establishing
measurement units, and no demonstration of physical or structural additivity.
In all of science, measurement is one of the simplest ideas to understand.
Measurement is a comparison to a standard. The standard for a second of time is
9,192,631,770 periods of the radiation from a specified transition of the cesium-133
atom. The standard for length is the metre, and is the distance that light travels, in a vacuum,
in 1 ⁄ 299,792,458 of a second. The standard for mass is the kilogram, originally
conceived as the mass of one litre of liquid water. Within each science or engineering
discipline there are many measurement
standards, all of them requiring research to establish their properties and uses.
Electrical engineering uses measurement standards like ampere, volt, ohm, and so on.
Astronomy uses a measurement standard known as a Standard Candle, a class of
astronomical objects whose members have known luminosities.
N. R. Campbell's theory of scientific measurement was based on the concept of
additivity, both physical and structural. Physical additivity was akin to taking many
one-foot rulers and laying them end-to-end alongside a much longer object to be
measured. Add up the number of rulers, and you have a measure, in feet, of the
length of the object. Structural additivity was a set of mathematical axioms
developed by Otto Hölder, and published in 1901. We can understand these axioms
with no more than our first course in algebra. For example,
1. a is equal to b ( a = b ) or not equal ( a < b; a > b ).
2. For any lengths a and b, a + b > a.
3. Order of operation doesn't matter, a + b = b + a.
4. Additive relation is indifferent for compound operations, a + ( b + c ) = ( a + b ) + c.
This meant that psychologists had to conduct experiments to demonstrate the
properties (or conceptual analogs) of physical and structural additivity in
psycho-physical, psycho-social, educational, and psychological measurement.
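The four properties listed above can be checked mechanically for any set of numbers that are claimed to behave like lengths. The sketch below is only an illustration under that assumption: the function name and the sample values are hypothetical, and exact fractions are used so that rounding does not disturb the equality tests.

```python
from fractions import Fraction
from itertools import product

def satisfies_additivity(lengths):
    """Check the four listed properties for every combination of the given values."""
    for a, b in product(lengths, repeat=2):
        # 1. Exactly one of a = b, a < b, a > b holds.
        if [a == b, a < b, a > b].count(True) != 1:
            return False
        # 2. Adding any length to a gives something greater than a.
        if not a + b > a:
            return False
        # 3. Order of addition does not matter.
        if a + b != b + a:
            return False
    for a, b, c in product(lengths, repeat=3):
        # 4. Grouping of additions does not matter.
        if a + (b + c) != (a + b) + c:
            return False
    return True

# Hypothetical rod lengths, in feet.
rod_lengths = [Fraction(1, 2), Fraction(3, 4), Fraction(5, 3), Fraction(2)]
lengths_ok = satisfies_additivity(rod_lengths)

# A "length" of zero breaks property 2, because a + 0 is not greater than a.
with_zero = [Fraction(0), Fraction(1)]
zero_ok = satisfies_additivity(with_zero)
```

Numbers pass such a check trivially. The experimental burden described in the text is a different matter: showing that the attribute being measured, when two instances are combined, actually behaves the way the numbers do.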
The experimental psychologist, Stanley Smith Stevens of Harvard University, served
with the committee. His response to the Ferguson Committee was to ignore their final
report. In short, he dismissed the matter entirely and felt that they simply got it wrong.
He did not bother to answer the call for research that would address the issue of
measurement standards and the properties of their units. Additivity was not a concern
for him, because he did not accept the definition of measurement as a comparison to
a standard. Scientific psychology did as Stevens did, and ignored the Ferguson
Committee. In the main, it ignored the whole subject of whether measurement in
psychology qualified the discipline as a science on a par with all other successful
sciences. One notable exception, a voice crying in the wilderness, is the work of
psychologist, Joel Michell of the University of Sydney, Australia. He captures the idea
of comparison to a standard very nicely: Measurement is the numerical estimation and
expression of the magnitude of one quantity relative to another.
B. The Greatest Scientific Side-Step in the History of Psychology
The consequences to psychology as a science were significant, if not profound, but of
little concern to most psychologists. As of today, scientific psychology has NO coherent
definition of measurement. What scientific psychology has is a ridiculous definition,
frequently cited from a 1946 paper by Stevens. It reads: “...[M]easurement, in the broadest
sense, is defined as the assignment of numerals to objects or events according to rules.”
Stevens adapted his definition from N. R. Campbell. According to Campbell,
measurement is the assignment of numerals to an attribute according to scientific laws.
When one reads further into Campbell's definition, it prescribes standards for comparison,
and research into units of measurements. That is what he meant by the phrase,
"...according to scientific laws."
Stevens went on to develop his theory of measurement scales, and it is well known
to all students of psychological research methods. He asserted, correctly, that
different types of measurement scales (nominal, ordinal, interval, and ratio scales)
are derived from different measurement operations that we use to produce them.
This is of utmost importance to psychological science because, depending upon
the type of measurement scale a researcher is using, different decisions must be made
about how to analyze the data.
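How the scale type steers an analysis decision can be sketched as a simple lookup. The pairing of scale type with a summary statistic below follows the conventional textbook reading of Stevens' scheme; the dictionary, the function name, and the data are hypothetical and serve only as an illustration.

```python
from statistics import mean, median, mode

# Hypothetical lookup: the "typical value" conventionally regarded as
# appropriate for each of the four scale types.
TYPICAL_VALUE = {
    "nominal": mode,     # categories only: the most frequent label
    "ordinal": median,   # ordered positions: the middle value
    "interval": mean,    # equal units: the arithmetic mean
    "ratio": mean,       # equal units plus a true zero
}

def typical_value(values, scale):
    """Summarize values with the statistic matched to the scale type."""
    try:
        summarize = TYPICAL_VALUE[scale]
    except KeyError:
        raise ValueError(f"unknown scale type: {scale!r}") from None
    return summarize(values)

eye_colors = ["brown", "blue", "brown", "green"]   # nominal
finishing_places = [1, 2, 2, 5]                     # ordinal
temperatures_c = [10, 20, 30]                       # interval

most_common_color = typical_value(eye_colors, "nominal")
middle_place = typical_value(finishing_places, "ordinal")
average_temperature = typical_value(temperatures_c, "interval")
```

Averaging the eye-color labels would be meaningless, and averaging finishing places would assume equal gaps between positions, which an ordinal scale does not guarantee.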
Stevens thought he was developing a theory of measurement. He thought he could
produce a definition of measurement that was based upon the operations required to
produce the measurements. That is why he substituted the phrase, "...according to
rules," in his own definition. He was greatly influenced by the concept of
operationalism in the work of fellow Harvard faculty member, Percy Bridgman, a Nobel
Laureate in Physics. A close examination of his paper, however, shows that his theory
of measurement is really a self-contained, mathematical description of the properties of
different numerical scales, not a theory nor definition of measurement. He constrained
himself to the confines of the internal mathematics involved, and never ventured to
examine the relationship of a fundamental or derived measure to a standard. He was
stuck on the fact that the differing operations [different 'rules'] that were applied, would
impute different properties to the assigned numerals. He was correct, as far as it went,
but scientific psychology still does not know what measurement is.
In the end, S. S. Stevens' discarding of the Ferguson Committee's final report, his
substitution of a description of measurement scales for a theory of measurement,
his abandonment of comparison to a standard in favor of an operationalism-only
description of measurement, and his dismissal of the counsel of the best minds in the
science, philosophy and mathematics of measurement all add up to the greatest
scientific side-step in the history of psychology.
C. It Doesn't Have To Be This Way
It should be noted that the understanding of measurement as a comparison to a
standard was discussed by the renowned psychologist, L. L. Thurstone, in 1929.
He was trying to conceptualize the subjective responses from a person on a perceptual
scale as resulting from a process of comparison. The specific technique he was using
is referred to as pairwise comparisons. This technique would imbue such data with
properties that made them amenable to mathematical representation and argument.
“A very important consideration in rendering objective a value-increment that is
essentially subjective is that as long as each of these psychological values is
considered separately, it can never register objectively. Every scientific
observation is, in fact, a comparative situation and neither of the terms of a
percept can be objectively recorded in isolation. This is true of cognitive as well
as of affective comparison.”
His views were encapsulated into his “Law of Comparative Judgment.” Unfortunately,
the idea of measurement as comparison to a standard did not find its way into
psychology nor into the "Standards for Educational and Psychological Testing," 1999.
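The pairwise-comparison technique mentioned above can be sketched as follows. This is a simplified illustration and rests on an assumption not stated in the article: it uses the equal-variance simplification commonly called Case V of the Law of Comparative Judgment, in which each observed preference proportion is converted to a standard normal deviate. The function name and the proportions are hypothetical.

```python
from statistics import NormalDist

def case_v_scale(p):
    """Scale values from a matrix of preference proportions.

    p[i][j] is the proportion of judgments in which stimulus i was judged
    greater than stimulus j. Every off-diagonal proportion must lie strictly
    between 0 and 1.
    """
    n = len(p)
    to_z = NormalDist().inv_cdf  # proportion -> standard normal deviate
    # Average the deviates in each row; the diagonal (i compared with itself)
    # contributes a deviate of zero, so it is skipped but counted in n.
    raw = [sum(to_z(p[i][j]) for j in range(n) if j != i) / n for i in range(n)]
    lowest = min(raw)
    # Shift so the lowest stimulus sits at zero; only differences are meaningful.
    return [value - lowest for value in raw]

# Hypothetical proportions for three stimuli A, B, C.
proportions = [
    [0.5, 0.7, 0.9],  # A judged greater than A, B, C
    [0.3, 0.5, 0.7],  # B judged greater than A, B, C
    [0.1, 0.3, 0.5],  # C judged greater than A, B, C
]

scale = case_v_scale(proportions)
```

Each scale value exists only relative to the other stimuli in the set, which is the sense in which every observation is a comparative situation.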