
Cognitive bias in the McGrew and McFall experiment

Review of “A scientific inquiry into the validity of astrology”

This article has been peer reviewed and published by ISAR International Astrologer, 41, No. 1, pp. 31-37. Copyright © 2014 by Kenneth McRitchie.


Abstract. The McGrew and McFall experiment attempted to resolve a weakness that the authors identified in Shawn Carlson’s 1985 double-blind astrological chart matching and self-selection experiment. Both the astrologers and the test subjects in Carlson’s experiment might have failed to make correct selections because of the same non-astrological problem. The authors performed a replication yet they introduced their own problems and failed to acknowledge how cognitive biases influenced their results. One of these biases was the “birthday paradox” that the authors implemented in reverse as a counter-intuitive illusion based on a Poisson distribution. This illusion acted to raise astrologers’ confidence in their abilities. Another was the known tendency for people to have overly-positive illusions about themselves that the authors implemented by using a non-standard, open-ended questionnaire. The authors also neglected to test the self-selection ability of their experimental test subjects, thereby ignoring their own criteria of validity and the justification for their experiment.

The double-blind experiment, “A scientific inquiry into the validity of astrology” by psychology professors John H. McGrew and Richard M. McFall (1990), herein referred to as “the authors,” has long stood as one of the definitive tests against astrology. The Dutch researcher Rob Nanninga (1996) gave the experiment additional weight by his successful replication. Australian statistician Geoffrey Dean and Canadian psychologist Ivan Kelly described the experiment and its replication in their influential study “Is Astrology Relevant to Consciousness and Psi?” (2003). For them, the McGrew and McFall experiment was convincing evidence that they used to support their arguments against astrology.

The McGrew and McFall experiment was intended to cover what its authors regarded as a “methodological inadequacy” in Shawn Carlson’s (1985) famous double-blind test of astrology published in Nature. Carlson had tested whether reputable astrologers (N=29) could accurately identify the California Psychological Inventory (CPI) profiles of test subjects (N=100+, mostly University of California at Berkeley students). For each subject’s natal chart, the astrologers were given the real CPI profile and two others randomly chosen from the other subjects. The astrologers were asked to rate the individual sections of the three CPIs compared to the natal chart and then to rank their first, second and third choice CPI. By using the same rating and ranking procedure, Carlson tested whether each test subject could identify their own natal chart profile, written by the astrologers, which they had to pick out from two others. Also by the same procedure Carlson asked the test subjects to identify their own CPI profiles from two others. (1)


In his analysis, Carlson found that the astrologers did not perform their tasks any better than chance expectancy. For the test subjects’ tasks, Carlson became suspicious of the rating task and discarded the data. The astrologers never learned how well they performed on the individual sections of their written profiles. Although the results for the test subjects’ ranking task (first, second, and third choice) were unusual, Carlson accepted that data. That task had a control group whose members were not given their real astrological profiles, yet the control group chose the pre-selected profiles at a rate that was statistically significant against chance (p<.01, where the threshold for significance is p<.05), whereas the actual test subjects chose their profiles at chance expectancy. Carlson attributed the surprising result for the control group to a “statistical fluctuation.” Carlson also found that the test subjects could not identify their own CPI profiles any better than would be expected by chance.


Despite these negative results for both the astrologers and the test subjects, the discarded data, and the statistical anomaly, Carlson concluded, “We are now in a position to argue a surprisingly strong case against natal astrology as practiced by reputable astrologers” (Carlson, 425).

The Carlson experiment has been controversial and its strengths and weaknesses have been discussed in various papers (Currey, 2011; McRitchie, 2011; Ertel, 2009; Vidmar, 2008; McGrew & McFall, 1990). For their part, McGrew and McFall argued that because Carlson’s test subjects had failed to identify their own CPI profiles, the astrologers might have failed to match natal charts to CPIs for the same non-astrological problem. Both the astrologers and the test subjects might have had difficulty in understanding the terminology and the graphical scales that the CPI used to describe personality and traits. The evidence required to validate the Carlson experiment, the authors argued, was inconclusive (M&M, 76).


More recently than McGrew and McFall’s 1990 paper, German psychologist Suitbert Ertel (2009) published a critical review of Carlson’s experiment in which he raised the serious issue that Carlson did not actually test his hypothesis but had incorrectly calculated his analysis. Ertel tested the stated hypothesis using Carlson’s data and, in a remarkable turnabout, the evidence showed that the astrologers had successfully matched the CPI profiles to natal charts in their two tasks at a statistically significant probability (p=.054 marginal, and p=.037). To this date, Ertel’s reassessment and the discovered evidence have remained unchallenged. As an intriguing example of scientific reversal, the Carlson experiment has since become one of the leading scientific studies in support of astrology.


Avoiding some biases but not others


The changed fortunes of the Carlson experiment occurred years after McGrew and McFall conducted their experiment and the authors could not know what would happen. McGrew and McFall were concerned with a specific weakness in the Carlson experiment. They designed their own independent research that stands as a separate inquiry into whether astrology is scientifically valid. The transformation of Carlson’s evidence did not directly affect the McGrew and McFall research and this is why their experiment deserves critical review.

To study the weakness they identified, McGrew and McFall, like Carlson, recruited a group of astrologers (N=6) and a group of test subjects (N=23). However, unlike Carlson, they tested the identification abilities of the astrologers only. Although the authors could have done so, they did not test whether their test subjects could identify their own natal chart profiles that the astrologers could have written. Carlson had understood that such a test would be close to what astrologers actually do in practice and this would ensure the validity of his experiment. It is therefore disappointing that McGrew and McFall did not replicate this part of Carlson’s experiment, all the more so because Carlson had rejected this part of his data, other than a result that contained a large anomaly. Because the authors’ experiment did not test a method that was close to what astrologers do in practice and they did not try to resolve the anomaly, this represented a bias against astrology. McGrew and McFall gave no reasons for not testing their own test subjects.


McGrew and McFall developed their experimental protocol with the participation and approval of the six participating astrologers, all members of the Indiana Federation of Astrologers. Each astrologer would be asked to match the birth charts of the 23 test subjects to 23 packages of information, which included face photographs, for the subjects. To eliminate age clues, all 23 subjects were aged 30 or 31. This narrow age range meant that there was some similarity in the subjects’ astrological charts, which would make them difficult to differentiate. The information package for each subject was extensive, including answers to 61 personal, open-ended questions that the authors had asked the astrologers to create. Besides the photographs and the 61 questions, the package included important life events and the results of the two standardized psychological tests. The authors called their resultant information package the Personal Characteristics and Life History Summary (PCLHS).


The 61 questions in the PCLHS asked about the kinds of personal and lifestyle characteristics that astrologers may be concerned with, including “hobbies, interests, religious beliefs, physical characteristics, personal talents and achievements, family background, dates of parent or sibling deaths, dates of moves across the country, health problems, attitudes toward authority, sex and commitment, pet peeves, favorite colors, punctuality, dependability, and variations in the personal energy cycle” (M&M, 77).


In the McGrew and McFall experiment or any double-blind test of astrology, care must be taken to ensure that preconceptions regarding astrology do not bias the results. This precaution applies to the experimenters themselves as well as to the test subjects. However, unlike Carlson, who isolated his own influence and carefully screened out subjects who had strong opinions about astrology, McGrew and McFall did not follow a similar test protocol. They did not try to avoid their own sampling bias when they selected a sample of test subjects from the respondents to their newspaper advertisement.


Instead of screening their test subjects for bias, McGrew and McFall relied on a cover story, but this method was not robust and provided leading clues. The authors told their test subjects that the research was about the possible effects of hormone levels associated with the diurnal cycle during birth and the subsequent development of children (M&M, 79). Each test subject had to provide certified documentation of the precise date, time, and place of their birth. The 61 questions in the PCLHS asked the subjects to describe very personal information and life events of the sort that would have been familiar from popular astrology columns. Referring to these questions, the authors state, “Neither the CPI nor any other standard psychological instrument contains this type of information” (M&M, 77). When asked after the test, two of the 23 subjects said they had guessed that the experiment was about astrology (M&M, 79). Evidently, the experiment’s cover story did not provide a reliable screen for potential biases.


A statistical illusion


In addition to the above weaknesses, an even more serious problem was that the design of the experiment included a statistical peculiarity that can bias an experiment, whether the content is astrology or anything else. In the Carlson experiment and its forerunners, participants tried to match each natal chart against a set of only two (Clark, 1961) or three (Marbell, 1981; Carlson, 1985) personality descriptions. The chance expectancy for each choice in these tests was always the same and did not diminish. The matching protocol that McGrew and McFall used in their experiment employed a known cognitive bias, a mathematical illusion.

By selecting specifically 23 test subjects, the authors seem to have been aware of the counter-intuitive effect known as the “birthday problem” or “birthday paradox” (Ma, 2010). Due to a cognitive bias, we do not expect that out of 365 days in a year there is at least a 50% chance of finding matching birthdays in any group of only 23 people. However, if we go in the opposite direction, it seems intuitively easy to confidently match at least 50% from one group of 23 to another group of 23 where we know that all members in the two groups have matches. The actual probability of making half of the matches is not 50% but nearly zero. The authors reinforced this illusion of overconfidence by stating that there were only “23 possibilities” in their experiment (M&M, 82).
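The 50% figure for a group of 23 can be checked with a few lines of Python. This is a minimal sketch, assuming 365 equally likely birthdays and ignoring leap years; the function name is illustrative and is not taken from any of the cited sources.

```python
def shared_birthday_probability(people: int, days: int = 365) -> float:
    """Chance that at least two of `people` share a birthday.

    Assumes every day of the year is equally likely (an idealization).
    """
    p_all_distinct = 1.0
    for already_seated in range(people):
        # The next person must avoid every birthday already taken.
        p_all_distinct *= (days - already_seated) / days
    return 1.0 - p_all_distinct


if __name__ == "__main__":
    for group_size in (10, 22, 23, 30):
        p = shared_birthday_probability(group_size)
        print(f"{group_size:2d} people: {p:.3f}")
```

Under these assumptions the probability first passes one half at 23 people, which is the counter-intuitive threshold that the birthday problem is known for.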


The reason for the illusion is that in matching problems, where each attempt removes a member from each group, the number of correct matches closely follows a Poisson distribution with a mean of one. Counter-intuitively, the chance of finding a given number of matches converges quickly to very small, very similar probabilities regardless of the number of pairs to be matched, whether it is 10, 23, or 200. The probability of making exactly 1 match is approximately .37, of 2 matches is .18, of 3 matches is .06, of 4 matches is .015, and of 10 matches is .0000001 (Ma, 2010). There is a high sensitivity to error that quickly escalates with each attempt. The probability of matching all 23 pairs is vanishingly small. For the results to reach the level of statistical significance, assuming significance at p<.05, the astrologers needed to match an average of slightly more than three charts. The authors did not suggest that they knew their method created an illusion of overconfidence and they did not warn the astrologers. They gave no reasons for changing the test design that Carlson and others had used, where this illusion was not possible.
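The probabilities quoted from Ma (2010) can be reproduced directly. The sketch below assumes a purely random one-to-one assignment of 23 charts to 23 information packages, counts the assignments with exactly k correct pairs by using derangements, and compares the result with a Poisson distribution whose mean is one. The function names are illustrative and the model is an idealization of the task, not the authors’ own analysis.

```python
from math import comb, exp, factorial


def derangements(m: int) -> int:
    """Number of ways to arrange m items so that none is in its own place."""
    if m == 0:
        return 1
    if m == 1:
        return 0
    two_back, one_back = 1, 0  # D(0), D(1)
    for i in range(2, m + 1):
        two_back, one_back = one_back, (i - 1) * (one_back + two_back)
    return one_back


def prob_exactly(n: int, k: int) -> float:
    """Chance of exactly k correct pairs when n pairs are matched at random."""
    return comb(n, k) * derangements(n - k) / factorial(n)


def prob_at_least(n: int, k: int) -> float:
    """Chance of k or more correct pairs when n pairs are matched at random."""
    return sum(prob_exactly(n, j) for j in range(k, n + 1))


def poisson_mean_one(k: int) -> float:
    """Poisson probability of k events when the mean is one."""
    return exp(-1) / factorial(k)


if __name__ == "__main__":
    n = 23
    for k in (0, 1, 2, 3, 4, 10):
        print(f"k={k:2d}  exact={prob_exactly(n, k):.7f}  "
              f"poisson={poisson_mean_one(k):.7f}")
    print(f"3 or more: {prob_at_least(n, 3):.3f}")
    print(f"4 or more: {prob_at_least(n, 4):.3f}")
    print(f"12 or more: {prob_at_least(n, 12):.2e}")
```

Under this model a single random guesser has a chance of about .08 of making three or more matches and about .02 of making four or more, while the chance of matching half of the 23 pairs is below one in a billion. The exact and Poisson columns agree to many decimal places for 23 pairs, which is why the figures barely change whether 10, 23, or 200 pairs are matched.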


Idiosyncratic strategies


Each astrologer worked alone to match each of the 23 charts with the corresponding 23 PCLHSs. The authors did not publish any tables or graphs of their test data and it is not possible to scrutinize the mean values of correct matches. The authors reported that correct matches ranged from zero to three with a median value of one match and none of the astrologers performed better than chance (M&M, 80). The astrologers had rated their confidence at a mean value of 73.5%, which implied making at least six correct matches. The correlation between their accurate matches and their confidence was non-significant (Pearson correlation r=.03). The authors found that the results were inconsistent among the astrologers in both their correct and incorrect matches, with a mean value of only 1.4 agreements for the 23 test cases, which was not significant (M&M, 81). Importantly, as the authors pointed out, the astrologers had adopted idiosyncratic strategies, as evidenced by the hodgepodge of questions they provided for the PCLHS and their lack of agreement in making matches (M&M, 81).

This observation of idiosyncratic strategies is crucial to understanding the results. One must ask why the astrologers made the unusual departure from their normal practice. Astrology texts contain descriptions of personality and potential development for the different natal chart configurations. These are fairly standard in agreement and astrologers normally apply these descriptions in their chart interpretations. However, McGrew and McFall did not ask the astrologers to interpret any natal charts but only to match them. If the astrologers did not interpret charts, then whose personality interpretation skills were being tested?


It is normal for astrologers to simply tell their clients what their personality, character development, achievements, and other potentials can be, based on what the astrological literature says about natal charts. These areas of personal potential may or may not have been acted upon, and it is up to the client to recognize their patterns of behavior and lifestyle through the consultative process. It is not normal for astrologers to ask clients to describe their own potentials. It is not normal for astrologers to use a questionnaire to ask clients about their potentials. Applied astrology does not assume that clients know their potentials. The McGrew and McFall experiment went against normal astrology and this represented a bias against the astrologers.


The authors’ experimental design reversed the astrologer and client roles. It placed the interpretive discipline on the wrong party. The test subjects had to describe their own potentials (normally done by the astrologer) by answering an ad hoc questionnaire that McGrew and McFall required the astrologers to create. The astrologers then had to judge the accuracy and usefulness of the descriptions they received (normally done by the client).


Astrology is concerned with providing descriptions of one’s personal potentials and how to make the best choices at different stages in life. This is not the same sort of information that is generated by psychological tests, which typically only measure the dimensions of personality traits. The astrologers in the McGrew and McFall experiment might have had the best intentions but they were given an enormous task. In their attempt to create a questionnaire that would cover the entire spectrum of human potential, the astrologers tended to adopt idiosyncratic strategies and they resorted to open-ended questions, perhaps hoping that the test subjects could provide enough insight into themselves through their own self-descriptions and narratives.


The problem with this approach is that it introduced an additional cognitive bias that astrological chart readings normally prevent. Psychological studies have shown that people tend to hold unrealistically positive illusions about themselves (Taylor and Brown, 1988). For example, tests have consistently shown that almost 80% of drivers perceive themselves as being in the top 50% in terms of driving skills (McCormick, Walkey and Green, 1986). Of course this is not mathematically possible. Positive illusions of self image in virtually all areas of life are not what astrologers would want to hear, but by asking non-standard open-ended questions about personal potential and interests, these were very likely the types of responses the astrologers got. The authors’ research methodology implemented a cognitive bias that worked against the astrologers.


Discussion


Because the astrologers accepted McGrew and McFall’s suggestion of creating questions for an ad hoc questionnaire, they accepted a flawed methodology and by their participation they committed themselves to the authors’ test design. McGrew and McFall did not suggest that they knew their questionnaire was open to the bias of positive self-illusion, and they did not warn the astrologers. This bias and the Poisson mathematical effect were cognitive biases or non-intuitive illusions that the authors introduced. These biases did not exist in the designs of prior double-blind astrological experiments, including the Carlson experiment that the authors were replicating.

Unlike the idiosyncratic, open-ended questionnaire created for the McGrew and McFall experiment, the Carlson experiment had used only the standardized CPI questionnaire. Ertel’s reassessment of Carlson’s experiment showed that the astrologers were able to use the CPI profiles to identify natal charts at a statistically significant probability (Ertel, 2009). Although astrologers are not in the habit of using standard psychological questionnaires, the positive results of the Carlson experiment suggest that McGrew and McFall’s astrologers might have fared better if they had restricted their evaluations to the information from the two psychological tests included in the PCLHS and ignored their own questionnaire. Standardized multiple-choice questionnaires force respondents to make specific choices and thereby reduce illusions of self image. Judging by Ertel’s reassessment of Carlson’s test, it is conceivable that targeted testing programs might show correlations between some astrological chart patterns and profiles from standardized personality questionnaires.


Although the astrologers used the information from the two psychological questionnaires to help identify charts, McGrew and McFall did not ask their test subjects to identify their own psychological profiles, as had been done in prior double-blind experiments. This missing psychological validation protocol presents a serious problem for the authors because it is uncertain how much the astrologers relied, or should have relied, on these psychological profiles. This uncertainty raises the same “methodological inadequacy” question that the authors identified in the Carlson experiment. The astrologers might have failed in their task for the same non-astrological reasons as before. Remarkably, by failing to test the test subjects, the authors did not try to resolve the psychological validation problem that they used to justify their experiment! Consequently, by their own reasoning the authors would have to judge their own experiment as equally inconclusive as Carlson’s.


Lessons to be learned


McGrew and McFall conclude their article with a sweeping rationalization. “Because each individual is unique, in practice an astrologer must use the birth information to ‘select’ the one correct interpretation that uniquely matches that individual from nearly countless possibilities, not just from 23 possibilities. Thus, our task can be seen as a simplification of the task that astrologers routinely undertake as a part of their daily practice” (M&M, 81-82).

This claim reverses the complexity of the astrologers’ normal practice compared to their tasks in this experiment. The claim that astrologers in practice must “select” a unique hit from countless possibilities of combined chart features is a misrepresentation. Astrologers read natal charts in much the same way as one would read any other type of map that has clear reference points, desired destinations, and indicators of opportunities and hazards. As anyone can understand, there is more than one way to read a map and reach a goal. Matching 23 pairs was not a simplified task and McGrew and McFall made a misleading claim. There were only 23 possibilities provided each match was performed correctly. The number of possible mismatches was staggering and cognitively incredible.


A replication of the McGrew and McFall experiment was performed in 1996. Dutch researcher Rob Nanninga modeled his “Astrotest” double-blind experiment directly on the McGrew and McFall experiment and it contained all the same problems. Nanninga challenged 50 Dutch astrologers to correctly match seven natal charts to seven sets of personal information. In similar fashion to the McGrew and McFall experiment, Nanninga developed his questionnaire of non-standardized open-ended questions from ideas gathered from the participating astrologers. The questionnaire covered personal interests and background such as education, vocation, hobbies, interests, main goals, personality, relationships, health, religion, and so on, plus dates of important life events. To these, Nanninga added 24 multiple-choice questions taken from a standard personality test (Nanninga, 1996/97). Needless to say, the astrologers did not succeed in matching the charts any better than in the McGrew and McFall experiment.


Astrologers, students, researchers, and critical thinkers can learn from the McGrew and McFall experiment. The authors appeared to follow a strict scientific methodology by presenting an impressive analysis of their data. Yet, the authors failed to implement basic scientific protocols against biases, which they introduced through their test subject selection process, a Poisson matching process, and an ad hoc questionnaire of open-ended questions. The authors failed to evaluate the validity of their psychological test methodology, the very same problem that they had identified as the “methodological inadequacy” in the Carlson experiment that they used to justify their research. For these reasons the McGrew and McFall experiment can be regarded as inconclusive and might even qualify as a notable example of cognitive bias in a scientific experiment.


In retrospect, it is enlightening to read the authors’ account of how they worked their way through a “protracted negotiation period” to gradually gain entry and eventually win the trust of the initially skeptical astrologers. “The astrologers, understandably, were wary of becoming involved with research that might be biased against them or that would provide no opportunity for success” (M&M, 77). At least the authors were understanding towards the astrologers.


Acknowledgments


I am grateful to David Cochrane and Mark Urban-Lurain for their help on the birthday problem and Poisson distributions. I wish to thank International Astrologer for critical peer review, which provided valuable clarifications and suggestions.

Drafts of this article were sent to Professor Ivan Kelly and to Professor Christopher French for comment, but there were no replies.


Notes

1. The “profile self-selection” experiment authored by Neil Marbell in 1981, a forerunner leading to the Carlson experiment, attempted to methodically standardize the astrological interpretations presented to the test subjects in a way that the Carlson experiment did not do.
“The personality profiles were composed by individual astrologers from birth data alone, using all of the basic Ptolemaic factors of chart interpretation. Each profile was then revised by a committee of five astrologers, also blind to the subjects. This revision was necessary to review the interpretations and to make the profiles uniform in style, content, and overall presentation.” (Marbell, 1981, p. 4). 
Marbell claimed his experiment to be definitive in its successful outcomes, with high percentages of the subjects selecting their own chart interpretations from three presented. Despite the high percentages, the statistical probabilities of two of the tests were not significant (assuming significance at p<.05) due at least in part to the very small numbers of test subjects (N=5 or 6). Test 1 (using rigorous profiles in a laboratory setting): N=5, with 100% correct responses, and p<.000001. Test 2 (using less detailed profiles, mailed to subjects’ homes): N=6, with 66-2/3% correct responses and p=.1. Test 3 (biorhythm cover story, using both rigorous and less detailed profile items, conducted in subjects’ workplaces): N=5, with 75% correct responses, and p=.111. The Marbell experiment was notable for its cross-disciplinary participation, involving the design and review assistance of leading astrologers, notable academics, prominent skeptics, and even U.S. congressional representatives.
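How small samples limit significance in a three-choice selection test can be illustrated with an exact binomial tail. This is a sketch only: it assumes each subject independently picks one profile out of three, so that the chance of a correct pick is 1/3, and it uses a one-sided test. The note does not state which test Marbell used, and the function name is illustrative.

```python
from math import comb


def binomial_tail(n: int, k: int, p: float = 1 / 3) -> float:
    """Chance of k or more correct picks out of n when each pick succeeds with chance p."""
    return sum(comb(n, j) * p**j * (1 - p) ** (n - j) for j in range(k, n + 1))


if __name__ == "__main__":
    # Four correct picks out of six (66-2/3%), as in Test 2.
    print(f"{binomial_tail(6, 4):.3f}")
```

Under those assumptions, four correct picks out of six gives a probability of about .100, in line with the p=.1 reported for Test 2, which shows how a two-thirds success rate can still fall short of p<.05 when N is this small.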

References
Carlson, Shawn. (1985). A double-blind test of astrology. Nature, (318, December), 419-425.
Clark, Vernon. (1961). Experimental astrology. In Search, (Winter/Spring), 102-112.
Clark, Vernon. (1970). An investigation of the validity and reliability of the astrological technique. Aquarian Agent, 1 (October), 2-3.
Currey, Robert. (2011). Shawn Carlson’s Double-Blind Astrology Experiment: U-Turn in Carlson’s Astrology Test? Correlation. 27(2), 7-33.
Currey, Robert. (2013). Research Sceptical of Astrology: McGrew & McFall, “A Scientific Inquiry into the Validity of Astrology”. Viewed 2013-13-08.
Dean, Geoffrey & Ivan W. Kelly (2003). Is Astrology Relevant to Consciousness and Psi? Journal of Consciousness Studies. 10 (6-7), 175-198.
Dobyns, Zipporah P. (1976). Results of the Vernon Clark experiment. Astrology Now, 1 (65, April), 81.
Ertel, Suitbert. (2009). Appraisal of Shawn Carlson’s Renowned Astrology Tests. Journal of Scientific Exploration, 23(2), 125-137.
Joseph, R. A. (1975). A Vernon Clark model experiment distinguishing exceptionally gifted high performance from profoundly retarded low performance children. Journal of Geocosmic Research, 1, 55-72.
Klayman, Joshua, & Young-Won Ha. (1987). Confirmation, disconfirmation, and information in hypothesis testing. Psychological Review, 94(2), 211-28.
Marbell, Neil. (1981). Profile Self-selection: A test of astrological effect on human personality. Synapse Foundation monograph. Republished (1986-87) NCGR Journal, (Winter), 29-44.
Ma, Dan. (2010). More about the matching problem. A Blog on Probability and Statistics. Posted May 2, viewed Sep. 18, 2013.
Mannes, Albert & Don Moore. (2013). I know I’m right! A behavioural view of overconfidence. Significance, (August), 10-14.
McGrew, John H. & Richard M. McFall. (1990). A Scientific Inquiry Into the Validity of Astrology. Journal of Scientific Exploration, 4(1), 75-83.
McRitchie, Kenneth. (2011). Support for astrology from the Carlson double-blind experiment. ISAR International Astrologer, 40(2), 33-38.
McRitchie, Kenneth. (2006). Astrology and the Social Sciences: Looking inside the black box of astrology theory. Correlation, 24(1), 5-20.
Nanninga, Rob. (1996/97). The Astrotest: A tough match for astrologers. Correlation, 15(2) (Northern Winter), 14-20.
Nickerson, Raymond S. (1998). Confirmation bias: A ubiquitous phenomenon in many guises. Review of General Psychology, 2(2), 175-270.
Rogers, Paul and Janice Soule. (2009). Cross-cultural differences in the acceptance of Barnum profiles supposedly derived from Western versus Chinese astrology. Journal of Cross-Cultural Psychology, 40(4), 381-389.
Taylor, Shelley E. & Jonathan D. Brown. (1988). Illusion and well-being: A social psychological perspective on mental health. Psychological Bulletin, 103, 193-210.
Taylor, Shelley E. & Jonathan D. Brown. (1994). Positive Illusions and Well-Being Revisited. Psychological Bulletin, Vol. 116(1) (July), 21-27.

© 2014 Kenneth McRitchie

Support for astrology from the Carlson double-blind experiment

Review of "A double-blind test of astrology" 

This article has been peer reviewed and published by ISAR International Astrologer, 40, No. 2, pp. 33-38. Copyright © 2009 by Kenneth McRitchie.

Abstract. The Carlson double-blind study, published in 1985 in Nature (one of the world’s leading scientific publications) has long been regarded as one of the most definitive indictments against astrology. Although the study might appear to be fair to uncritical readers, it contains serious flaws, which when they are known, cast a very different light on the study. These flaws include: no disclosure of similar scientific studies, unfairly skewed design, disregard for its own stated criteria of evaluation, irrelevant groupings of data, rejection of unexpected results, and an illogical conclusion based on the null hypothesis. Yet, when the stated measurement criteria are applied and the data is evaluated according to normal social science, the two tests performed by the participating astrologers provide evidence that is consistent with astrology (p = .054 with ES = .15, and p = .037 with ES = .10). These extraordinary results give further testimony to the power of data ranking and rating methods, which have been successfully used in previous astrological experiments. A critical discussion on follow-up studies by McGrew and McFall (1990), Nanninga (1996/97), and Wyman and Vyse (2008) is also included.

The research experiment conducted by Shawn Carlson, “A double blind test of astrology,” published in the science journal Nature in 1985 as an indictment of astrology, is one of the most frequently cited scientific studies to have claimed to refute astrology. A Google search for the title as a quoted string returns over 6,600 links.(1) Although the Carlson study drew initial criticism for numerous flaws when it was published, a more recent examination has found that despite the flaws, the data from the study actually supports the claims of the participating astrologers. This support lends further credence to the effectiveness of ranking and rating methods, which have been used in other, lesser known astrological experiments.

The Carlson astrology experiment was conducted between 1981 and 1983 when Carlson was an undergraduate physics student at the University of California at Berkeley under the mentorship of Professor Richard Muller. The flaws that have been uncovered in the Nature article include not only the omission of the review of literature on similar studies that is expected in all academic papers, but also more serious irregularities such as skewed test design, disregard for its own criteria of evaluation, irrelevant groupings of data, removal of unexpected results, and an illogical conclusion based on the null hypothesis.

In concept and design, the Carlson experiment was not original. It was modeled after the landmark double-blind matching test of astrology by Vernon Clark (Clark, 1961). In that test astrologers were asked to distinguish between each of ten pairs of natal charts. One chart of each pair belonged to a subject with cerebral palsy and the other belonged to a subject with high intelligence. Another influential study was the “Profile Self-selection” double-blind experiment, which was led by the late astrologer Neil Marbell and privately distributed among contributors in 1981 before its eventual publication (Marbell, 1986-87). In that test, participating volunteers were asked to select their own personality interpretations, both long and short versions in separate tests, out of three that were presented.

In both of these prior studies, the participants performed well above significance in support of the astrological hypothesis as compared to chance. The Marbell study was extraordinarily qualified as it involved extensive input and review from astrologers, scientists, statisticians, and prominent skeptics. Carlson neglected to provide any review of these scientific studies that supported astrology or any other previous related experiments. 

The stated purpose of Carlson’s research was to scientifically determine whether the participating astrologers (members of the astrology research organization NCGR and others) could match natal charts to California Psychological Inventory (CPI) profiles (18 personality scales generated from 480 questionnaire items). Additionally, Carlson would determine whether participating volunteers (undergraduate and graduate students, and others) could match astrological interpretations, written by the participating astrologers, to themselves. These assessments, Carlson asserts, would test the “fundamental thesis of astrology” (Carlson, 1985: 419). 

From the time of its release, the Carlson study has been criticized for the extraordinary demands it placed on the participating astrologers, which would be regarded as unfair in normal social science. As with any controversial study, all references to Carlson’s experiments should include the scientific discourse that followed it, particularly the points of criticism that show weaknesses in the design and analysis. Notable among recent critics has been University of Göttingen emeritus professor of psychology Suitbert Ertel, who is an expert in statistical methods and is known for his criticism of research on both sides of the astrological divide. Ertel published a detailed review in a 2009 article, “Appraisal of Shawn Carlson’s Renowned Astrology Tests” (Ertel, 2009).

From a careful reading of Carlson’s article in light of the ensuing body of discourse, we can appreciate that the design of the experiment was intentionally skewed in favor of the null hypothesis (no astrological effect), which Carlson refers to, somewhat misleadingly, as the “scientific hypothesis.” Some of the controversial features of the design are as follows:
  • The astrologers were not supplied with the gender identities of the CPI owners, even though the CPI creates different profiles for men and women (Eysenck, 1986: 8; Hamilton, 1986: 10).
  • Participants were not provided with sufficiently dissimilar choices of interpretations, as the Vernon Clark study had done, but instead were given randomly selected choices. This may give the impression of a fair method, but given the narrow demographics of the sample, there is an elevated likelihood of receiving similar items from which to choose, which makes it unfair (Hamilton, 1986: 12; Ertel, 2009: 128).
  • The easier to discriminate and more powerful two-choice format, which had been used in the Vernon Clark study, was replaced with a less powerful three-choice format, which further elevated the chances of receiving similar items (Ertel, 2009: 128). No reasons are given for this unconventional format, although it can be surmised that Carlson was well aware of the complexities of a three-choice format from his familiarity with the Three-Card Monte (“Follow the Lady”) sleight-of-hand confidence game, which he had often played as a street psychic and magician (Vidmar, 2008).
  • The requirement for rejecting the “scientific hypothesis” was elevated to 2.5 standard deviations above chance (p = .006). In the social sciences, the conventional threshold of significance is 1.64 standard deviations with probability less than p = .05 (Ertel, 2009: 135).
  • Failure to consider the astrologers’ methodological suggestions or give an account of their objections. Carlson credits astrologer Teresa Hamilton with giving “valuable suggestions,” yet Hamilton complained later that “Carlson followed none of my suggestions. I was never satisfied that the experiment was a fair test of astrology” (Hamilton, 1986: 9).
Given this skewed design, the irregularities of which are not obvious to the casual reader, Carlson directs our attention to the various safeguards he used to assure us that no unintended bias would influence the experiment. He describes in detail the precautions used to screen volunteers against negative views of astrology, how the samples were carefully numbered and guarded to ensure they were blind, and the contents of the sealed envelopes provided to test participants.
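The two significance thresholds cited in the design criticisms above can be checked with a few lines of code. The following is a minimal sketch, assuming a one-tailed test against a standard normal distribution (the function name `one_tailed_p` is illustrative, not taken from any of the cited papers):

```python
from statistics import NormalDist


def one_tailed_p(z: float) -> float:
    """Upper-tail probability of a standard normal deviate z."""
    return 1.0 - NormalDist().cdf(z)


# 1.64 SD: conventional social science threshold; 2.5 SD: Carlson's threshold
for z in (1.64, 2.5):
    print(f"z = {z}: one-tailed p = {one_tailed_p(z):.4f}")
```

Under this assumption, 1.64 standard deviations corresponds to roughly p = .05 and 2.5 standard deviations to roughly p = .006, which agrees with the figures quoted from Ertel.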

The experiment consisted of several separate tests. The astrologers performed two tests, a CPI ranking test and a CPI rating test. The volunteer students performed three tests, a natal chart interpretation ranking test, a natal chart interpretation component rating test, and a CPI ranking test.

In the CPI ranking test, astrologers were given, for each single natal chart, three CPI profiles, one of which was genuine, and asked to make first and second choices. There were 28 participating astrologers who matched 116 natal charts with CPIs. Success, Carlson states, would be evaluated by the frequency of combined first and second choices, which is the correct protocol for this unconventional format. He states, “Before the data had been analyzed, we had decided to test to see if the astrologers could select the correct CPI profile as either their first or second choice at a higher than expected rate” (Carlson, 1985: 425).

In addition to this ranking test, the astrologers were tested for their ability to rate the same CPIs according to a scale of accuracy. This task allowed for finer discrimination within a greater range of choices. Each astrologer “also rated each CPI on a 1-10 scale (10 being the highest) as to how closely its description of the subject’s personality matched the personality description derived from the natal chart” (Carlson, 1985: 420).

As to the results of the astrologers’ three-choice ranking test, Carlson first directs our attention to the frequency of the individual first, second, and third CPI choices made by the astrologers, each of which he found to be consistent with chance within a specified confidence interval. This observation is scarcely relevant, given the stated success criteria of the first and second choice frequencies combined. Then, to determine whether the astrologers were successful, Carlson directs our attention to the rate for the third place choices, which, as already noted, was consistent with chance. Thus he declares that the combined first two choices were not chosen at a significant frequency.

“Since the rate at which the astrologers chose the correct CPI as their third place choice was consistent with chance, we conclude that the astrologers were unable to chose [sic] the correct CPI as their first or second choices at a significant level” (Carlson, 1985: 425). This conclusion, however, ignores the stated success criteria and is in fact untrue. The calculation for significance shows that the combined first two choices were chosen at a success rate that is marginally significant (p = .054) (Ertel, 2009: 129).
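The stated success criterion (correct CPI as first or second choice out of three) implies a chance rate of 2/3 per chart. The following is a minimal sketch of how such a combined-choice result could be tested with an exact one-tailed binomial calculation. The sample size of 116 charts comes from the text above; the hit counts in the loop are purely illustrative, since this article does not report the raw count, and the exact procedure Ertel used is not stated here.

```python
from math import comb


def binomial_upper_tail(hits: int, n: int, p_chance: float) -> float:
    """Exact P(X >= hits) for X ~ Binomial(n, p_chance)."""
    return sum(
        comb(n, k) * p_chance**k * (1.0 - p_chance) ** (n - k)
        for k in range(hits, n + 1)
    )


N_CHARTS = 116    # number of natal charts matched (from the article)
P_CHANCE = 2 / 3  # genuine CPI falls in the first two of three picks by chance

# Hypothetical hit counts for illustration only, NOT Carlson's data
for hits in (78, 82, 86, 90):
    p = binomial_upper_tail(hits, N_CHARTS, P_CHANCE)
    print(f"{hits} of {N_CHARTS} correct in first two choices: p = {p:.3f}")
```

The point of the sketch is that the test compares one combined frequency against the 2/3 baseline, rather than inspecting the first, second, and third choice frequencies separately.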

As to the results of the astrologers’ rating test (10-point rating of three CPIs against each chart), Carlson demonstrates that the astrologers’ ratings were no better than chance within the first, second, and third place choices made in the three-choice test. He shows a weighted histogram and a best linear fit graph to illustrate each of these three groups of ratings. Carlson directs our attention to the first choice graph as support for his conclusion for this test. The slope of this graph is “consistent with the scientific prediction of zero slope” (Carlson, 1985: 424). The slope is actually slightly downward. The graphs for the other two choices are not remarked upon, but show slightly positive slopes.

The notable problem with Carlson’s analysis of the 10-point rating test, however, is that this test had no dependency on the three-choice ranking test and even used a different sample size of CPIs.(2) According to the written instructions supplied to the astrologers, this rating test was actually to be performed before the three-choice ranking test (Ertel, 2009: 135). These 10-point ratings should not be grouped as though they were quantitatively related to the later three-choice test. Confirmation bias from the claimed “result” of the three-choice test, which Carlson presents earlier in his paper, suggests acceptance of irrelevant groupings in this 10-point rating test, presented later. When the totals of the ratings are considered without reference to the choices made in the subsequent test, a positive slope is seen, which shows that the astrologers actually performed at an even higher level of significance (p = .037) than the three-choice test (Ertel, 2009: 131).

The other part of Carlson’s experiment tested 83 student volunteers to see if they could correctly choose their own natal chart interpretations written by the astrologers. Volunteers were divided into a test group and a control group. Members of the test group were each given three choices, all of the same Sun sign, one of which was interpreted from their natal chart (Carlson, 1985: 421). Similarly, each member of the control group received three choices, all of the same Sun Sign, except none of the choices was interpreted from their natal charts, although one choice was randomly selected as “correct” for the purpose of the test.

For the results of this test, Carlson shows a comparison of the frequencies of the correct chart as first, second, and third choices for the test group and the control group (again ignoring his stated protocol to combine the frequencies of the first two choices). He finds that the results for the test group are “all consistent with the scientific hypothesis” (Carlson, 1985: 424). However, he does note an unexpected result for the control group, which was able to choose the correct chart at a very high frequency. He calculates this to be at 2.34 standard deviations above chance (p = .01). Yet, because this result occurred in the control group, which was not given their own interpretations, Carlson interprets this as a “statistical fluctuation.”

Yet the size of this statistical fluctuation is so unusual as to attract skepticism, particularly in light of Carlson’s other results. It is reasonable to think that the astrologers could write good quality chart interpretations after having successfully matched charts with CPI profiles. Yet, according to Carlson’s classification, the test group tended to avoid the astrologers’ correct interpretations and choose the two random interpretations, while the control group tended to choose the selected “correct” interpretations by a wide margin, as if they, the controls, had been the actual test subjects (Ertel, 2009: 132). This raises suspicion that the data might have been switched, perhaps inadvertently, but this is unverifiable speculation (Vidmar, 2008).

Like the participating astrologers, the student volunteers were also given a rating test; in this case for the sample chart interpretations they were given. They were asked to rate, on a scale of 1 to 10, the accuracy of each subsection of the natal chart interpretations written by the astrologers. “The specific categories which astrologers were required to address were: (1) personality/temperment [sic]; (2) relationships; (3) education; (4) career/goals; and (5) current situation” (Carlson, 1985: 422). This test would potentially have high interest to astrologers because of the distinction it made between personality and current situation, which is a distinction that is not typically covered in personality tests. Also, the higher sensitivity of a rating test could provide insight, at least as confirmation or denial, into the extraordinary statistical fluctuation seen in the three-choice ranking test.

However, based on a few unexpected results, Carlson decided that there was no guarantee that the participants had followed his instructions for this test. “When the first few data envelopes were opened, we noticed that on any interpretation selected as a subject’s first choice, nearly all the subsections were also rated as first choice” (Carlson, 1985: 424). On the basis of this unanticipated consistency, Carlson rejected the volunteers’ rating test without reporting the results.

As an additional test in this part of the experiment, the student volunteers were asked to choose from among three CPI profiles the one that was based on the results of their completed CPI questionnaire. The other two profiles offered were taken from other student volunteers and randomly added. Of the 83 volunteers who completed the natal chart interpretation choices, only 56 completed this task. As usual, Carlson compared the results of the three choices for the test and control groups taken individually (instead of the frequency of the first two choices taken together). Furthermore, in contravention to the logic of control group design, Carlson compares the two groups against chance instead of against each other (Ertel, 2009: 132). He found no significant difference from chance for the two groups.

There are plausible reasons that could explain why the test group was unable to correctly select their own CPI profiles, even though the astrologers were able, to a significant extent as we have seen, to match CPI profiles with the students’ charts. The disappointing number of students who completed this task, despite having endured the 480-question CPI questionnaire, suggests that the students might have been much less motivated than the astrologers, for whom the stakes were higher (Ertel, 2009: 133). The CPI matching tasks, for both the volunteers and the astrologers, were especially challenging because of the three-choice format. The random selections of CPIs made within the narrow demographics of the sample population of students would have elevated the likelihood of receiving at least two CPI profiles that were too similar to make a discriminating choice and this would have had a negative impact on motivation.

In the conclusion of his study, Carlson claims: “We are now in a position to argue a surprisingly strong case against astrology as practiced by reputable astrologers” (Carlson, 1985: 425). However, this conclusion defies rationality. Ertel points out the logical flaw that such a conclusion cannot be drawn even if the tests had shown an insignificant result. “Not being able to reject a null hypothesis does not justify the claim that the alternate hypothesis is wrong” (Ertel, 2009: 134).

Despite its numerous flaws and unfair challenges, the Carlson experiment nevertheless demonstrates that the astrologers, in their two tests, were able to match natal charts with CPI profiles significantly better than chance according to the criteria normally accepted by the social sciences. Thus the null hypothesis must be rejected. As such, the Carlson experiment demonstrates the power of ranking and rating methods to detect astrological effects, and indeed helps to raise the bar for effect size in astrological studies. The benchmark effect size that had been attained by the late astrological researcher Michel Gauquelin was merely .03 to .07. Although these were small effects, they were statistically very significant due to large sample sizes (N = 500-1000 or more natal data) and had to be taken seriously (Gauquelin, 1988). In Carlson’s experiment, which applied sensitive ranking controls, the effect size of the three-choice matching test with p = .054 is ES = .15, and the effect size of the 10-point rating test with p = .037 is ES = .10 (Ertel, 2009: 134).
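For readers who want to see how a p-value and a sample size relate to an effect size, here is a minimal sketch. It assumes one common convention, ES = z / √N, where z is the standard normal deviate for the one-tailed p-value. This article does not state which formula Ertel used, so the convention and the function name `effect_size_from_p` are assumptions made for illustration only.

```python
from math import sqrt
from statistics import NormalDist


def effect_size_from_p(p_one_tailed: float, n: int) -> float:
    """Effect size under the assumed convention ES = z / sqrt(N)."""
    z = NormalDist().inv_cdf(1.0 - p_one_tailed)  # deviate for the given p
    return z / sqrt(n)


# Three-choice matching test: p = .054 with N = 116 charts (from the article)
es = effect_size_from_p(0.054, 116)
print(f"ES = {es:.2f}")
```

Under this assumed convention the three-choice test comes out near ES = .15, which is consistent with the reported figure. The same check is not attempted for the rating test because the number of ratings that entered that calculation is not given in this article.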

Follow-up studies

Other experiments have attempted to address the earlier documented criticisms of the Carlson test. However, these experiments, each of which claims to confirm that astrological choices are made at no better than chance levels, have drawn criticism from astrologer Robert Currey (2011) and others as having fatal flaws. Each falls short of the Carlson study. Included here are the studies by McGrew and McFall (1990), Nanninga (1996/97), and Wyman and Vyse (2008).

The McGrew and McFall (1990) experiment was intended to include personal information of the sort typically used by astrologers but not found in standard personality profiles. Six “expert” astrologers, all members of the Indiana Federation of Astrologers but none of whom claimed professional accreditation, participated. Each astrologer was asked to match the birth charts of a sample of 23 volunteers to an extremely broad range of information gathered for each volunteer. This information included photo portraits, results from two standardized psychology tests, and written descriptions of personality and life events generated by 61 questions that were developed from input that the authors gleaned from the astrologers.

The use of photos in the McGrew and McFall study meant that special restrictions were imposed on the experiment to avoid age clues from the photos. The authors recruited volunteers who ranged from only 30 to 31 years of age. This narrow demographic, where natal charts would share numerous similarities, and the large amount of non-uniform information supplied for each volunteer, elevated the difficulty of the matching task. The Carlson study is regarded as unnecessarily complex because the astrologers were asked to choose the genuine CPI from among three. In the McGrew and McFall study, however, astrologers were given the virtually impossible task of choosing each genuine set of personal descriptions and information from among no less than 23 sets! It is little wonder that this follow-up research was rejected for publication in Nature, which is an interesting story in its own right (Currey, 2011). The authors argue that the astrologers’ experimental task was a “simplification” of their ordinary business (McGrew and McFall, 1990: 82). On the contrary, it was much more complex and far more difficult than even Carlson’s tasks. The reasons that the two authors provide for their judgment against astrology are not at all convincing.

The Nanninga (1996/97) experiment was modeled on the McGrew and McFall experiment and contained the same sorts of flaws. It was intended to settle a dispute argued in the local newspapers as to whether astrologers can or cannot predict. Through the newspapers, Nanninga offered a large cash prize to anyone who could match seven natal charts to seven sets of personality information. He attracted an unexpectedly large number of “astrologers,” from which he chose 50 based on their claimed astrological experience. The test subjects for the study were volunteers, all born “around 1958.” A test questionnaire for the volunteers, developed by Nanninga from ideas solicited from the astrologers, covered a very wide range of interests and background such as education, vocation, hobbies, interests, main goals, personality, relationships, health, religion, and so on, plus dates of important life events. To these Nanninga added 24 multiple choice questions taken from a standard personality test.

Like the McGrew and McFall experiment, Nanninga’s experiment used a very narrow demographic of volunteer subjects, making them difficult to astrologically differentiate, and he likewise presented a very large amount of non-uniform personal data written by the seven volunteers for the astrologers to sort through. Although Nanninga’s task involved seven matches instead of 23 and was therefore somewhat less complex than the McGrew and McFall task, it was nonetheless considerably more complex than the Carlson task, which has been criticized as being more complex than necessary. Nanninga’s study was not an improvement over the Carlson experiment and does not convincingly support his claims that astrology is in conflict with science and that astrologers increasingly confine themselves to statements that cannot be falsified (Nanninga, 1996/97: 20).
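To give a sense of the chance baseline in these larger matching tasks, the following sketch computes the probability of getting exactly k correct matches when n charts are paired one-to-one with n information sets purely at random. The one-to-one pairing is an assumption made for illustration; the exact matching procedures are described in the original papers, not here. The function names are hypothetical.

```python
from fractions import Fraction
from math import comb, factorial


def derangements(n: int) -> int:
    """Number of permutations of n items with no item in its own place."""
    if n == 0:
        return 1
    prev, curr = 1, 0  # D(0), D(1)
    for k in range(2, n + 1):
        prev, curr = curr, (k - 1) * (prev + curr)
    return curr


def prob_exact_matches(n: int, k: int) -> Fraction:
    """P(exactly k correct) when n items are matched one-to-one at random."""
    return Fraction(comb(n, k) * derangements(n - k), factorial(n))


for n in (7, 23):  # Nanninga's seven sets; McGrew and McFall's 23 sets
    expected = sum(k * prob_exact_matches(n, k) for k in range(n + 1))
    p_none = float(prob_exact_matches(n, 0))
    print(f"n = {n}: expected correct by chance = {expected}, P(none correct) = {p_none:.3f}")
```

Under random one-to-one matching, the expected number of correct matches is one regardless of n, so each individual match succeeds by chance with probability 1/n: 1 in 7 for Nanninga's task and 1 in 23 for the McGrew and McFall task, compared with 1 in 3 for a first choice in Carlson's format.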

The Wyman and Vyse (2008) experiment was a low-budget classroom study modeled on the Carlson experiment but without the astrologers. In this experiment it was hypothesized that the use of a very transparent self-assessment questionnaire (the NEO Five-Factor Inventory) would enable volunteer participants to better identify their own profile scores than the CPI used by Carlson. Examples from this questionnaire include, “I try to be courteous to everyone I meet” (which contributes to A, Agreeableness in the resultant profile), and “I like to be where the action is” (which contributes to E, Extraversion). The authors asked 52 volunteers (introductory psychology class members and others) to identify their genuine five-factor personality profile from a bogus one and to identify their genuine astrological description from a bogus one. The astrological descriptions were created from the output of a commercial natal chart interpretation program, modified to remove all planetary, sign, and house clues and further simplified by the removal of all aspect information to provide 29 one- to four-sentence personality descriptions. The students succeeded at the personality profile task but failed at the natal chart description task.

Criticisms of the Wyman and Vyse experiment include: 

1. No test of astrologers’ skills and performance. 

2. The false assumption that both natal chart interpretations and psychology profiles “share a common purpose - to provide a description of the respondent's personality” (Wyman and Vyse, 2008: 287). Natal charts provide their value as descriptions of potential. 

3. The tender age of the volunteers (mean age of 19.3 years) whose life potential would be largely unrealized and somewhat idealized. 

4. Small sample size of natal charts (N = 52, where a sample of 100 would have been better). 

5. The exclusion of aspects from the astrological descriptions, arguably the most important component. 

6. Lack of synthesis of the chart components and a holistic approach. 

7. The unbalanced tasks of identifying an easy five-factor profile that parrots the subject’s input compared to the complexity of identifying a 29-factor partial astrological description of life potential. 

8. The false assumption that the positive and negative polarities of the signs mean “favorable” and “unfavorable” respectively and the listing (twice) of the sign Aquarius as both favorable and unfavorable. 

9. Incomplete disclosure of result details. Statistical inferences were drawn based on belief in astrology, but how many students in this small sample would dare, even anonymously, to declare belief in astrology in an experiment presided over by a professor, Stuart Vyse, who is a prominent astrology skeptic? Was it more than one? 

10. Students’ fear for their academic safety is a high stakes issue and could easily bias such a study as this one.

These errors and inadequacies in the Wyman and Vyse experiment arouse suspicions as to the accuracy of the modified astrological descriptions. Together, these flaws place the experiment well below the level of the Carlson experiment and raise serious doubts as to the authors’ conclusions. The study does nothing to fix the Carlson results. Although the simple five-factor personality profiles were identifiable by the students at a significant rate, the authors’ claim that the simplified astrological descriptions they devised should be equally identifiable is not convincing.

Discussion

The evidence provided by the Carlson experiment, when considered together with the scientific discourse that followed its publication, is extraordinary. Given the unfairly skewed experimental design, it is extraordinary that the participating astrologers managed to provide significant results. Given the irregularities of method and analysis, which had somehow remained unnoticed for 25 years, it is extraordinary that investigators have managed to scientifically assess the evidence and bring it into the full light of day. Now that the irregularities have been pointed out, it is easy to see and appreciate what Carlson actually found.

However, because of the unfairness and flaws in the Carlson experiment, this line of research needs to be replicated and extended in more stringent research programs that use adequate sample sizes of natal charts. The research done in the follow-up studies by McGrew and McFall (1990), Nanninga (1996/97), and Wyman and Vyse (2008) was on the whole better executed with regard to method and analysis than the Carlson experiment. Nonetheless, first-rate methods and analysis do not magically transform an experiment with faulty assumptions and design into first-rate science. These are the relatively routine parts of a research study that can often be rescued from their own problems, as we have seen with the Carlson study. With hindsight, it is evident that the editors of the science and psychology journals who published these studies failed to realize that astrology is a complex discipline with many variables, limitations, and pitfalls. Ultimately, it is important that would-be researchers learn from criticism and avoid fundamental blunders and misjudgments such as those outlined in this article. Astrological expertise should always be included in the peer review stage prior to publication.

There is much to be learned from the Carlson experiment. If natal charts can be successfully compared with self-assessment tests by the use of rating and ranking methods, as the Carlson experiment indicates, then astrological features might be easier to evaluate than was previously believed. New questions must now be raised. What would the results be in a fair test? Why did the astrologers choose and rate the CPIs as they did? Which chart features should be compared against which CPI features? Could more focused personality tests provide sharper insights and analysis? The door between astrology and psychology has been opened by just a crack and we have caught a glimpse of hitherto unknown connections between the two disciplines.

Notes

1. By comparison, a Google query of some other peer reviewed journal articles on astrology, searched as quoted strings, returns the following results:
  • “Is Astrology Relevant to Consciousness and Psi?” (Dean and Kelly, 2003) 8800 results.
  • “Are Investors Moonstruck?-Lunar Phases and Stock Returns” (Yuan et al, 2006) 3700 results.
  • “Objections to Astrology: A Statement by 186 Leading Scientists” (The Humanist, 1975) 3500 results.
  • “A Scientific Inquiry Into the Validity of Astrology” (McGrew and McFall, 1990) 2160 results.
  • “Raising the Hurdle for the Athletes’ Mars Effect” (Ertel, 1988) 1350 results.
  • “The Astrotest” (Nanninga, 1996) 970 results.
  • “Is There Really a Mars Effect?” (Gauquelin, 1988) 630 results.
  • “Science versus the Stars: A Double-Blind Test of the Validity of the NEO Five-Factor Inventory and Computer-Generated Astrological Natal Charts” (Wyman and Vyse, 2008) 265 results.
2. Carlson presents the 10-point rating test as a finer discrimination of the 3-choice ranking test, but the sample size is not the same. A sample of 116 natal charts is used in the 3-choice test (Carlson, 1985: 421, 423) and a different sample size is used for the 10-point rating test, which adds to the discrepancies already mentioned between these two tests and further emphasizes that they cannot be considered as a single test. Carlson does not give the sample size for the 10-point test, but it can be determined by measurement of the first, second, and third choice histograms in his article (Carlson, 1985: 421, 424). Each natal chart had to be the “correct” choice in one of these three “choices.”  By adding up these “correct hits,” Ertel shows 99 charts (Ertel, 2009: 130, Table 3). A more exacting scrutiny of the histograms by Robert Currey (in a forthcoming article) determines 100 charts.

© 2009 Kenneth McRitchie

References

Carlson, Shawn (1985). “A double-blind test of astrology.” Nature, (318), 419-425.
Clark, Vernon (1961). “Experimental astrology,” In Search, (Winter/Spring), 102-112.
Currey, Robert (2011). “Research Sceptical of Astrology: McGrew & McFall, ‘A Scientific Inquiry into the Validity of Astrology’ 1990.” Retrieved on 2011-07-02.
Currey, Robert (2011). “Research Sceptical of Astrology: Wyman & Vyse Double Blind Test of Astrology.” Retrieved on 2011-07-02.
Ertel, Suitbert (1988). “Raising the Hurdle for the Athletes’ Mars Effect: Association Co-varies with Eminence.” Journal of Scientific Exploration, 2(1), 53-82.
Ertel, Suitbert (2009). “Appraisal of Shawn Carlson’s Renowned Astrology Tests.” Journal of Scientific Exploration, 23(2), 125-137.
Eysenck, H.J. (1986). “A critique of ‘A double-blind test of astrology’.” Astropsychological Problems, 1(1), 27-29. 
Gauquelin, Michel (1988). “Is there Really a Mars Effect?” Above & Below: Journal of Astrological Studies, Fall, 4-7.
Hamilton, Teressa (1986). “Critique of the Carlson study.” Astropsychological Problems, 3, 9-12.
Marbell, Neil (1986-87). “Profile Self-selection: A Test of Astrological Effect on Human Personality.” NCGR Journal, (Winter), 29-44.
McGrew, John H. and Richard M. McFall (1990). “A Scientific Inquiry Into the Validity of Astrology.” Journal of Scientific Exploration, 4(1), 75-83.
Nanninga, Rob. (1996/97). “The Astrotest: A tough match for astrologers.” Correlation, Northern Winter, 15(2), 14-20.
Vidmar, Joseph (2008). “A Comprehensive Review of the Carlson Astrology Experiments.” Retrieved on 2010-08-01.
Wyman, Alyssa Jayne and Stuart Vyse (2008). “Science Versus the Stars: A Double-Blind Test of the Validity of the NEO Five-Factor Inventory and Computer-Generated Astrological Natal Charts.” The Journal of General Psychology, 135(3), 287-300.