
Confirmation bias in the Wyman and Vyse experiment

Review of “Science versus the stars: A double-blind test of the validity of the NEO Five-Factor Inventory and computer-generated astrological natal charts”

This article has been peer reviewed and published in Correlation, 29, No. 2, pp. 26-40. Copyright © 2014 by Kenneth McRitchie.

Abstract. Psychologists Alyssa Jayne Wyman and Stuart Vyse designed their replication of Shawn Carlson’s double-blind astrological experiment to resolve the problem that the participants could not identify their own California Psychological Inventory profiles any better than their own astrological profiles, as written by reputable astrologers. To simplify these self-identification tasks and ascertain the validity of each of these profile types, the authors used the NEO-FFI psychological self-test versus computer-generated astrological profiles. No astrologers participated. The authors claimed that their test subjects could identify their NEO-FFI profiles at a significant rate but not the astrological profiles. However, a scrutiny of the experimental methodology shows evidence that the claimed findings were due to biases and inefficiencies introduced, perhaps unintentionally, by the authors. The authors’ selective sampling of student subjects, their use and modification of computer-generated astrological profiles, their requirement of a signed astrological knowledge and beliefs questionnaire, the institutionalized bias of their college, and the authority that the authors held over the students all provided opportunities that the authors used to circumvent the double-blind test protocols and sway the results to confirm their own biased beliefs. 

The 2008 experiment by psychologists Alyssa Jayne Wyman and Stuart Vyse (herein referred to as “the authors”) intended to replicate the influential Shawn Carlson double-blind astrological experiment published in Nature (1985). The part of Carlson’s study that most concerned the authors was an inconclusive result that raised questions on the usefulness of self-identification to validate psychological testing. Carlson’s test subjects could not identify their own psychological profiles any better than their astrological profiles (W&V, 288). This result was evidence that neither the astrological nor the psychological tests could be personally validated. 

To resolve this problem, Wyman and Vyse designed an experiment to make personal validation easier but also to study identifications that are falsely personal due to biases. One of these biases is that people generally hold an unrealistically enhanced or positive view of themselves, identified by Taylor and Brown (1988). Another, known as the P.T. Barnum effect, is the tendency for people to accept ambiguous descriptions as being unique to themselves, even though they are generally true for everyone, identified by Forer (1949). Finally, there are biases based on uncritical belief in astrology or in psychology. To prevent their own bias in their experiment, Wyman and Vyse needed to ensure that they did not design test methodologies that would make the self-identification tasks so obvious that they would pose little risk of disconfirming their validation hypotheses.

In his study of mainly University of California student test subjects (N=100+), Carlson claimed that the participating astrologers (N=29) could not accurately match California Psychological Inventory (CPI) profiles to the test subjects’ natal charts any better than would be expected by chance. For this test, the astrologers were given one real and two randomly selected bogus CPIs for each natal chart. They were asked to rate the individual sections of the three CPIs and then rank the three according to what they thought would be the best match to the natal chart.

Similarly, Carlson asked his student test subjects to identify their own astrological profiles that the astrologers had written. Given their real astrological profile and two others randomly selected from the other students, the subjects had to rate the individual sections in each of the three astrological profiles and then rank the three according to what they thought best matched themselves. By using the same procedure, the subjects were also required to identify their own CPI profiles from two others, randomly selected from the other subjects.

In his analysis, Carlson became suspicious of the data from the students’ astrological rating task and discarded it. However, Carlson accepted the data of the students’ ranking task even though it was unusual. That task had a control group whose members were not given their real astrological profiles, yet the control group successfully chose the pre-selected profiles at a rate that was statistically significant against chance (p<0.01, where the threshold for significance is p<0.05), whereas the actual test subjects chose their profiles at chance expectancy. Carlson attributed the surprising result for the control group to a “statistical fluctuation.” Carlson also found that the test subjects could not identify their own CPI profiles at a rate better than chance expectancy.

Despite these negative results, the discarded data, and the statistical anomaly, Carlson concluded, “We are now in a position to argue a surprisingly strong case against natal astrology as practiced by reputable astrologers” (Carlson, 425). 

At the time of their own experiment, Wyman and Vyse were unaware that Carlson’s data actually supported astrology. By following up on questions raised by American psychologist Joseph Vidmar (2008), German psychologist Suitbert Ertel (2009) published a reassessment of Carlson’s data. It turned out that Carlson had not followed his hypothesis and his analysis was incorrect. The astrologers had successfully matched the CPI profiles to natal charts at a statistically significant rate in both of their tasks. In consideration of this success, Ertel agreed with Vidmar’s suggestion that the unlikely performance of the control group was suspect. The statistical anomaly weakened Carlson’s claim because of the possibility that the data for the controls and the test subjects might have been switched, perhaps inadvertently.

Hypothesis formation

In an earlier replication of Carlson’s experiment, American psychologists John McGrew and Richard McFall (1990) had argued that both the astrologers and the test subjects might have failed to identify the CPIs for the same non-astrological reason. The CPI is a sophisticated test instrument that requires training to interpret. All of Carlson’s participants might have had difficulty in understanding the terms and the graphical scales that the CPI used to describe personality traits (M&M, 76). Although Wyman and Vyse did not cite McGrew and McFall, this was the problem that they wanted to resolve.

In their own replication, Wyman and Vyse did not try to test the skills of astrologers as Carlson had done, or as McGrew and McFall had done. Instead, they tested whether a very basic psychological test could be personally validated compared to very basic astrological interpretations. They looked for simplicity, ease of execution, replicable methods, and results that would explain differences.

Wyman and Vyse had found several two-choice CPI identification studies, with a choice between one real and one bogus CPI profile, which had been performed after Carlson’s three-choice experiment. These studies had provided some evidence for personal validation of the CPI. The results were in the right direction but less than statistically significant. A two-choice format appeared to be easier for the subjects to discriminate than a three-choice format. (W&V, 288) (1) 

As a further improvement, the authors used the newly developed NEO Five-Factor Inventory (NEO-FFI) of personality. This questionnaire is based on the Big Five personality theory, which assesses five broad domains or dimensions of personality: neuroticism (N), extraversion (E), openness (O), agreeableness (A), and conscientiousness (C). These dimensions are easy to grasp. For example, “I try to be courteous to everyone I meet” contributes to the Agreeableness score and “I like to be where the action is” contributes to the Extraversion score. The authors chose this test because, “The NEO-FFI contains only five dimensions, all of which are easily understood and well embedded in the vernacular language of personality description, whereas the CPI contains 18 dimensions, a number of which may be difficult for many people to evaluate” (W&V, 297).

For the psychological part of their experiment, Wyman and Vyse hypothesized that by using the easier NEO-FFI and the two-choice format, their test subjects (N=52) would be able to identify their own personality profiles at a significant rate (W&V, 289). Evidence that supported this hypothesis would confirm the usefulness of the personal identification methodology to validate psychological testing, which the Carlson experiment had failed to do.

For the astrological part of their experiment, Wyman and Vyse asked whether their test subjects could identify their own natal charts by reading computer-generated astrological profiles (herein referred to as CGAPs). Each subject was given their real CGAP and one that was randomly selected from the other test subjects. However, unlike their NEO-FFI strategy, the authors found themselves unable to form a hypothesis for their CGAP strategy. “Because no previous study had used a two-choice test with astrological profiles, we were unable to make a hypothesis in this case” (W&V, 289). Either the authors had failed to research the literature or failed to acknowledge that they were performing a Vernon Clark type experiment. The Vernon Clark experiment (1961) was the first, and for many years the best known, double-blind chart-matching test of astrology. It had used a two-choice format.

Aside from the Vernon Clark oversight, the authors’ argument for their inability to form an astrological hypothesis raises a critical methodological issue. Wyman and Vyse rationalized their test strategy against Karl Popper’s (1963) argument for falsifiable scientific predictions. The purpose of hypothesis testing is not to confirm existing conjectures, theories, or beliefs but to seriously challenge, refute, and falsify them.
“It is easy to obtain confirmation, or verifications, for nearly every theory—if we look for confirmations. Confirmations should count only if they are the result of risky predictions; that is to say, if, unenlightened by the theory in question, we should have expected an event which was incompatible with the theory—an event which would have refuted the theory.” (Popper, 47-48)
Bias leads one to look for strategies that will validate one’s favorite theories. The CPI self-test had proved to be unexpectedly difficult to validate through a self-identification methodology. Wyman and Vyse chose the NEO-FFI self-test specifically because they predicted it would be easier and thus less risky. At the same time, the authors found that they were unable to make an astrological hypothesis, which would in fact have been a much riskier prediction because the hypothesized event would be incompatible with current scientific thought. The authors’ claimed inability to form an astrological hypothesis went against Popper’s argument for testing risky and falsifiable conjectures.

For any scientific evaluation, the default position of statistical inference is the null hypothesis. The null hypothesis holds that there is no relationship between two measured phenomena. In the authors’ test strategy, the measured phenomena were the test subjects’ natal charts and their CGAPs. If the authors’ test results confirmed the astrological null hypothesis it would necessarily disconfirm or falsify the inferred hypothesis of an empirical relationship, at least for the experimental method used. This was the unspoken hypothesis that the authors claimed they were unable to make yet which they were willing to test. If the test results rejected the astrological null hypothesis, then it would give support to this unspoken astrological hypothesis and the associated astrological theory.
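As a hedged illustration, the unspoken hypothesis can be written out for the two-choice CGAP task with the 52 test subjects. The notation is an assumption of this sketch and does not come from Wyman and Vyse: X is the number of subjects who pick their real CGAP, p is the probability that any one subject does so, and k is the observed count.

```latex
H_0:\; p = \tfrac{1}{2}
\qquad\text{versus}\qquad
H_1:\; p > \tfrac{1}{2},
\qquad
P(X \ge k \mid H_0) \;=\; \sum_{i=k}^{52} \binom{52}{i} \left(\tfrac{1}{2}\right)^{52}
```

Under this framing, the null hypothesis would be rejected at the conventional level only if the tail probability for the observed count fell below 0.05.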

Seeing what they were looking for

Falsifiable hypotheses, whether spoken or assumed, are necessary to test scientific conjectures and theory, yet they do not insure against confirmation bias. Psychologist Raymond Nickerson (1998) described confirmation bias as instances where “One selectively gathers, or gives undue weight to, evidence that supports one’s position while neglecting to gather, or discounting, evidence that would tell against it” (Nickerson, 175). Nickerson explains that confirmation bias can take the form of conscious and deliberate case-building, as illustrated by the practices of attorneys and debaters, where the bias is fairly obvious. But in its usual psychological sense, confirmation bias occurs by engaging in case-building unwittingly, without intending or even being aware of biased selectivity in the acquisition and use of evidence. 

A critical review of the experimental methodology used by Wyman and Vyse should determine whether their sample of test subjects was selected to favor their hypothesis, whether external factors were used to influence a favorable result, whether unequal methodologies were applied to the different hypotheses, and whether selective assumptions replaced diligent research in areas of expertise.

If Wyman and Vyse intended to be pragmatic and avoid errors in a subject not familiar to them, their unspoken astrological hypothesis nevertheless rested upon unexamined assumptions. The authors stated, “Astrologers’ natal charts and psychologists’ personality profiles share a common purpose—to provide a description of the respondent’s personality.” They stated that both psychology and astrology provide a “personality assessment” (W&V, 287). This comparison was not as simple and direct as the authors suggested. 

Even among psychologically minded astrologers, distinctions are made between psychological and astrological assessments. American psychiatrist Bernard Rosenblum (1983), who uses and teaches astrology, has argued that each of these two disciplines offers “different vantage points on the prism of the self.” Psychology was developed from a medical model and focuses on healing disturbed states, whereas astrology emphasizes the inner meaning of impediments to freedom, the cyclic patterns of life stages, and the identification of positive potential (Rosenblum, 13).

A consideration of these different vantage points suggests a sort of psychological uncertainty principle regarding the extent of personality information that one could expect to reliably evaluate from a questionnaire compared to the extended period of an individual’s life stages and personal development. Self-test questionnaires like the NEO-FFI or CPI may give a reliable point-in-time snapshot of personality traits but this information is prone to change as the person matures. American psychologists Brent Roberts and Daniel Mroczek (2008) found that personality traits change in adulthood. “In terms of individual differences in personality change, people demonstrate unique patterns of development at all stages of the life course, and these patterns appear to be the result of specific life experiences that pertain to a person’s stage of life” (Roberts & Mroczek, 31).

By contrast, natal chart interpretations, without the snapshot overlay of the current planetary alignments, must be understood as general assessments of personal potential or destiny developed over an entire lifetime. Instead of capturing a moment in time, the reliability of natal chart descriptions must be considered over the longer term. This is why astrologers have argued for testing mature test subjects. Previous double-blind astrological experiments, such as those by Vernon Clark (1961), Neil Marbell (1981), McGrew and McFall (1990), and Michael Shermer (c. 1999) understood the importance of using mature test subjects. Mature individuals have better knowledge of their own potentials and development. (2)

Other differences between psychology and astrology are diversity and context. Although many people, including many astrologers, have tried to draw comparisons between these two disciplines, astrological theory is highly nuanced in practically all aspects of life, from personal development to cyclic global phenomena. There is no convincing evidence to suggest that the diverse areas of applied astrology can be understood within current psychological models or that psychological theory can encompass astrological theory. Wyman and Vyse themselves commented on important differences of context between the two disciplines. In astrological theory, the relational system of critical moments is determined by the arrangement of celestial bodies, whereas in trait psychology theory, the causes of personality are determined by the individual’s genetic profile (W&V, 287-8).

These different contexts represent a dichotomy of perspectives. The principle “as above, so below” conveys the concept that astrological properties reflect the cycles of the celestial bodies in the macro-environment and that these properties are used and developed by everyone within that environment. This context applies to a global perspective and the resilience of shared properties and needs. To many scientists, such as influential British biologist Richard Dawkins, the psychological properties selected and expressed by genes pertain to individualism, competition, reproduction, and selfishness. Each of these different contexts sustains a distinctly different set of purposes and assessments. The authors did not instruct the test subjects to judge the astrological and psychological profiles independently within their respective contexts, but instead suggested direct comparisons. This approach represented a bias against astrology. (3)

Substantive modifications

Wyman and Vyse incorporated “substantive modifications” into their experiment for changes that had occurred since the Carlson experiment (W&V, 289). The authors used a positive test strategy to improve their results for personal validation. A positive test strategy is described by psychologists Joshua Klayman & Young-Won Ha (1987) as “a tendency to test cases that are expected (or known) to have the property of interest rather than those expected (or known) to lack that property.” Although this strategy is not necessarily equivalent to confirmation bias, Klayman & Ha warn “It can, however, lead to systematic errors or inefficiencies” (Klayman & Ha, 211). 

Presumably, Wyman and Vyse intended to incorporate the latest standards and tools for the two disciplines to demonstrate, by equally easy and unbiased methods, either the presence or the absence of personal validation for both the psychological and astrological tasks in their experiment. The authors chose the NEO-FFI questionnaire and a two-choice format as a positive test strategy to make it easier for their test subjects to identify their own psychological profiles. However, the authors’ use of a CGAP to provide astrological profiles may not have been a positive test strategy.  

In his experiment, Carlson had asked reputable astrologers to write natal chart profiles for the test subjects. Skilled astrologers should be able to weigh the properties in natal charts and identify those properties that are of interest. Instead of astrologer-written profiles, Wyman and Vyse substituted a “sophisticated” CGAP, generated by the Solar Fire version 5.0.19 software program (W&V, 289). The Solar Fire software has a variety of chart-comparing and analytic features, and it is capable of accurately plotting the positions of planets and asteroids over many centuries, but it also has an interpretive option that lists the basic chart properties. To justify their substitution, the authors cited an advertisement for the software and a reference from an astrological organization, the National Council for Geocosmic Research, that recommended the software for research (W&V, 291).

Wyman and Vyse did not give reasons as to why they thought their use of the Solar Fire interpretive feature was an improvement over Carlson’s use of astrologer-written profiles. They did not mention Carlson’s astrologers and avoided this issue. In some respects, however, the authors’ choices of the NEO-FFI and a CGAP might appear to be steps towards a convergence that would justify their use together. The NEO-FFI does not discriminate gender identity and is less of a diagnostic tool for mental health disturbances than the CPI questionnaire. Most astrologers would probably agree that these factors bring the NEO-FFI somewhat closer to astrology than the CPI. The Solar Fire CGAP option might appear to be justified because it was created by astrologers and seemed to present standard interpretations based on the astrological literature. Stable, standardized test instruments are what psychologists would look for. Psychologists did not write the psychological profiles, so why should astrologers have written the astrological profiles?

The problem with this argument is that astrology and psychology are different disciplines and each places a different burden of complexity and comprehension on the test subjects. Each NEO-FFI self-test result presented only five briefly described personality dimensions and an evaluation as to whether the respondent was high, moderate, or low in each of the five traits. By contrast, each CGAP, after editing by the authors, consisted of 29 one-to-four sentence descriptions of potential that each subject had to comprehend in terms of their entire life and development. The authors even conceded the greater difficulty of their subjects’ astrological task, “The computer-generated astrological reports in the present study contained many more personality descriptions—29 separate personality statements—that may have made the task more difficult than the task with the NEO-FFI” (W&V, 297). 

No one, least of all astrologers, would seriously argue that a computer program could equal the chart interpreting skills of experienced astrologers. The Solar Fire program was incapable of weighing and integrating the many pieces of interpretation as a skilled astrologer would have done. To avoid bias, the authors’ research methodology needed to equalize the complexity of the subjects’ astrological and psychological tasks. Their substitution of the CGAPs introduced greater complexity and required greater effort to understand than the astrologer-written profiles used by Carlson and this biased the experiment against their astrological hypothesis.

A sanity check

Given the use of the CGAPs, the type of edits that Wyman and Vyse made to the CGAPs needs to be examined. Astrology is a discipline with which Wyman and Vyse had little familiarity and yet which they felt confident to modify as they saw fit. The authors removed all references to astrological signs, planets, and houses from the test subjects’ CGAPs. These modifications were necessary to avoid bias due to astrological knowledge and, if those edits were done correctly, this would not have been a problem. However, the authors also removed all information related to planetary aspects. Their explanation for these edits was that some natal charts had more planetary aspects than others and this could bias the tests in favor of the longer profiles. This was not a valid reason for removing all of the aspects.

Among astrologers, the planetary aspects are probably the least controversial feature of natal charts. There are various house and sign systems in use today, but there is nearly universal acceptance of the five traditional Ptolemaic aspects, although additional aspects are sometimes also used. A natal chart is considered to be an integrated system and aspects represent the potential conflicts and resolutions between the different parts of the system and thus can be regarded as an important integrative feature. In astrological practice, aspects are not optional.

To retain aspects in the astrological profiles, the authors could have included an equal number of the most important aspects in each CGAP. However, this would assume that Wyman and Vyse would recognize which aspects were the most important. To avoid errors, the authors could have consulted expert astrologers for a sanity test or “smoke test” of what edits would have been acceptable, assuming that the CGAP substitution itself would have been acceptable. Wyman and Vyse gave no reasons for not consulting with subject matter experts (SMEs) for any of the modifications they made.

A sanity check would also have allowed astrological SMEs to strike down an obvious misrepresentation. Wyman and Vyse misunderstood the polarities of the odd-numbered (positive, masculine) signs as “favorable” and the even-numbered (negative, feminine) signs as “unfavorable.” It appears they mistook “positive” to mean favorable and “negative” to mean unfavorable in a literal and fundamentalist sense. In astrology, favorability is contingent upon function and it is incorrect to assign sign favorability in the absolute sense that the authors did.

The authors referenced other studies by non-astrologers who claimed to have tested this presumed favorable versus unfavorable determination and they devoted a section of their article to its analysis. To further compound their mistake, the authors twice described, and presumably evaluated, the sign Aquarius as being both odd-numbered and favorable and even-numbered and unfavorable (W&V, 289, 294). The authors did not provide an astrological source for their favorable vs. unfavorable sign theory and it appears to have derived from the folklore of pseudoskeptical inquiry.

The authors’ substantive modifications to their replication of the Carlson experiment had made the psychological task easier and more efficient but made the astrological task more difficult and inefficient. These design differences in the complexity of tasks and the authors’ unverified astrological edits favored the authors’ psychological hypothesis even before the testing began.

Selecting participants with the right stuff

Students are often used as experimental test subjects at colleges and universities, yet for experiments where the students have a vested interest in the outcome, objectivity cannot be expected. Typically, students are under financial and career pressures and will avoid risks that would jeopardize the considerable investment they put into their education. Wyman and Vyse recruited students (N=52) mainly from an introductory psychology course at Connecticut College. The author Stuart Vyse was a psychology professor at that college specializing in “irrational behavior, superstition, and belief in the paranormal” (W&V, 299). The students enrolled in the psychology course received a course credit for their participation in the experiment. That incentive to earn a credit represented a bias. However tenuous, that small stake in determining the success of the experiment had potential to influence how the students performed their tasks. There were other problems as well.

The test participants were between the ages of 18 and 22 years (M age = 19.3 years). Teenagers, even those in their late teens, lead relatively sheltered lives and are not suitable subjects for the authors’ test of astrology. Suitable subjects would have a solid grasp of self-knowledge and potentials through their own self-sustaining life experiences. Only mature individuals can be expected to have these necessary qualities. For young people, concepts of self in personal relationships, place in society, and sense of life purpose are still malleable and easily influenced. The use of teenagers as test subjects, especially if the authors were in positions of authority over them, would represent a bias that would favor the authors’ hypotheses.

The students in the experiment were in college to learn psychology, not astrology. Colleges and universities in Western societies do not teach courses in astrology. This institutional bias must be considered. To ensure their academic success, the students might expect that they would need to compare astrology, which would be unfamiliar to them and their professors, with psychology, which was the mutually agreed upon discourse with their professors. The authors’ assertion that psychology and astrology “share a common purpose” would only have reinforced the primacy of psychology as the chief purpose of the experiment. These institutional and primacy biases favored the authors’ psychological hypothesis.

In their article, Wyman and Vyse devote much space to the analysis of various biases based on data gathered from their astrological knowledge and beliefs questionnaire. Three weeks before the NEO-FFI and CGAP identification tests, the students were required to provide their birth date, time, and location. They were also required to complete and sign a questionnaire that disclosed their personal astrological knowledge and beliefs. This disclosure would have signaled that the intention of the experiment was not simply to evaluate samples of astrology and psychology but also to assess personal beliefs. The signed disclosures could be seen by someone who might make judgments of “irrational behavior, superstition, and belief in the paranormal.” Wyman and Vyse did not explain how they would protect the students against the potentially frightening implications that their personal beliefs might have on their educational investments. (4)

The authors warned the students that providing inaccurate birth information or inaccurate questionnaire responses would be a violation of the honor code of the college. This warning of possible expulsion from the college could cause the students to feel vulnerable and hesitant to make disclosures that might be unacceptable to those in authority. Belief in controversial knowledge like astrology at an institution where it is not taught would not be an academic asset but could very well be a liability. If the students sensed that the experiment had raised the stakes on their openness to unconventional beliefs through the knowledge and beliefs questionnaire, then the students had the opportunity to lower the stakes by making personal rationalizations that would restore their academic safety. Without intending to influence the students or even being aware of doing so, the authors may have unwittingly created biases through their disclosure questionnaire that affected student responses not only on the questionnaire but also on the later identification tasks. (5)

In total, the authors held a powerful influence over the students by means of the course credit bias, the learning primacy bias, the institutional exclusion bias, and the beliefs liability bias. These biases would have acted in concert as psychological pressures that overwhelmed not only the experiment’s beliefs disclosure but also its double-blind methodologies to the point of irrelevance. The normal facility of teenagers to rationalize all cognitive dissonance to their own immediate interests would have swayed the students’ natural curiosity for different concepts like astrology and fostered an uncritical indoctrination that served to confirm the authors’ own biases and interests. This was not a rational environment in which to conduct a scientific evaluation.

Testing and results

The students were given four profiles or summaries: their real NEO-FFI profile and a bogus NEO-FFI selected at random from the other participants, and their real CGAP and a bogus CGAP also selected at random. “(Participants rated) each statement of all four personality summaries on a 1-9 point scale. In addition, participants provided a single overall accuracy rating for each summary, and we asked them to identify which of the two NEO-FFI reports they believed was their own and which of the two astrological summaries was their own. Last, participants considered all four of the personality reports and identified the one that they thought was the more accurate description of their personality” (W&V, 292). 

Because of the number and weight of the biases in play, the experimental results for astrology were almost a foregone conclusion and cannot be regarded as conclusive. The students identified their real NEO-FFI profiles at a rate of 78.8% (p<.001) and they identified their real CGAPs at a rate even lower than the bogus CGAPs at 46.2%. When the students rated the “most accurate” of the four profiles, the results were: real NEO-FFI 54.9% (p<.001), real CGAP 19.6%, bogus CGAP 15.7%, and bogus NEO-FFI 9.8%. A P.T. Barnum effect was found in all four personality profiles. There was no effect for the so-called “favorable versus unfavorable” sun signs. Students who knew their sun sign gave higher accuracy scores to the sun sign statements over the non-sun-sign statements. Greater knowledge of astrology produced significantly lower scores for the bogus CGAPs but had no effect on the real CGAPs. There was no effect based on belief in astrology. In similar fashion to the astrological results, belief in psychology produced higher scores for the bogus NEO-FFI profiles, but had no effect on the real NEO-FFIs.
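As a rough check on the two-choice identification rates reported above, the following sketch runs an exact one-sided binomial test against a chance rate of 50%. It rests on assumptions that are not stated in the results themselves: a sample of 52 participants (the total given in the notes), counts of correct identifications recovered by rounding the reported percentages, and the binomial test itself, since the procedure that Wyman and Vyse used to obtain p<.001 is not described here. The function name is illustrative only.

```python
from math import comb


def binom_upper_tail(k, n, p=0.5):
    """Exact probability of k or more successes in n independent trials."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


N = 52  # ASSUMPTION: total sample size, taken from the notes

# ASSUMPTION: counts are back-calculated from the reported percentages.
neo_correct = round(0.788 * N)   # real NEO-FFI profile identified
cgap_correct = round(0.462 * N)  # real CGAP identified

p_neo = binom_upper_tail(neo_correct, N)
p_cgap = binom_upper_tail(cgap_correct, N)

print(f"NEO-FFI: {neo_correct}/{N} correct, one-sided p = {p_neo:.6f}")
print(f"CGAP:    {cgap_correct}/{N} correct, one-sided p = {p_cgap:.3f}")
```

Under these assumptions the NEO-FFI rate falls well below p = .001, in line with the reported figure, while the CGAP rate is indistinguishable from chance. The calculation describes only the size of the difference and says nothing about the biases that may have produced it.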

Discussion

Lessons can be learned from the Wyman and Vyse experiment for the benefit of future Vernon Clark type matching tests. Self-test psychological profiles like the NEO-FFI and natal chart profiles do not apply to the same perspectives on life. The NEO-FFI profile can be useful for assessing groups of people, but what is its value as a tool for self-understanding if its results can be so easily identified (79%)? What deeper insights can an individual learn from it that they do not already know? By contrast, a natal chart interpretation purports to describe the native’s life potentials and their adaptations to social and everyday changes. These are more reflective concerns that are not easily understood or identified in a few minutes or hours, although mature people are much better at understanding these concerns than young people. One of the authors’ own statements stresses a useful criterion in this regard, “Still, a measure of the profile’s ability to predict future behavior would be more convincing” (W&V, 299). Why then did the authors test the natal chart profiles of young subjects who would be less certain about their future development than mature subjects who would already have experienced a greater part of their development and who could provide a more convincing measurement?

What was the purpose of the Wyman and Vyse experiment? What did the authors expect to learn? Simply that the participants could identify and thus confirm the “validity” of their psychological and astrological profiles? The participants could have done that without the disclosures they made on the astrological knowledge and beliefs questionnaire. What Wyman and Vyse really wanted was to determine, or at least suggest, the causes of astrology. Specifically, they wanted to know whether astrology was due to the appeal of an unrealistically favorable self-image or whether it was due to a falsely personalized P.T. Barnum effect. 

Despite all their discussion, the authors’ analysis failed to demonstrate these or any other psychological causes. The authors concluded, “However, neither the present study nor Wunder’s result contradicts the basic premise that the favorableness of a personality description affects its acceptability because neither study directly measured favorability or unfavorability of the profiles” (W&V, 298). In other words, the authors’ experiment did not contradict the favorable self-image premise as a cause of astrology because they did not evaluate it. This unevaluated, uncontradicted “basic premise” in the authors’ experiment is an assumption and an example of the rational fallacy known as argument from ignorance. No one really knows, so let it be true.

What about the Barnum effect? “In the present study, we found evidence of the P.T. Barnum effect in participants’ evaluations of both the astrological reports and the NEO-FFI profiles” (W&V, 297). Hence the P.T. Barnum effect could not account for astrology any more than psychology in the experiment. Furthermore, “Those who correctly identified their astrological profile did not differ in their degree of belief in astrology” (W&V, 298). Hence the authors’ experiment with the astrological knowledge and beliefs questionnaire did nothing to explain acceptance of the CGAP or belief in astrology. This leaves open the possibility that the authors failed to find causes for astrology because they looked for the wrong causes.

It is disappointing that the authors had so little familiarity with the phenomenon that they studied. They might have developed a better experiment. Wyman and Vyse did not describe the existing literature of Vernon Clark tests other than the Carlson experiment nor did they describe the abstract principles of astrology relevant to the concrete details of their experiment. They were unappreciative of the learned skills, the methods, and the discourse used in astrology. They did not consult astrological subject matter experts who could have advised them on whether their hypotheses and methods were in the ballpark. Their test of the so-called favorable versus unfavorable signs was an avoidable blunder. They failed to distinguish between the expressed traits of personality that manifest at a specific point in time and the emergent potential of character that develops over a lifetime. Instead, they suggested similarities in what amounts to a category mistake. The authors presumed they needed no familiarity with astrology because they assumed they already knew its causes and had the answers.

Although there are many deserved criticisms of the Shawn Carlson double-blind experiment, Carlson’s methodology was exemplary in some respects. As a shared flaw, Carlson, like Wyman and Vyse, did not use a sample of mature test subjects. But Carlson recruited reputable astrologers and they provided data that supported Carlson’s astrological hypothesis, as assessed by Ertel (2009). Carlson did not presume to look for any hidden psychological causes of astrology and thus he avoided the various biases that Wyman and Vyse introduced in their experiment. Because Wyman and Vyse were so concerned with causes, they did not evaluate the specific sign, house, and aspect components of the Solar Fire profiles for accuracy. Such an analysis might have provided more promising astrological insights.

Causes are nice to have in science because they enable easy prediction, but causes are not necessary for empirical observations or for new knowledge to be applied to good purpose. For the past 100 years, statistical inference has led the way in scientific research. It is not scientifically or epistemologically efficient to expect to know causes first before evaluating the evidence of correlations and relationships. In today’s science, the single-minded expectation of understanding causes and its attendant demand for a mechanism is an irrational argument and it prevents much of the astrological criticism by skeptics from being taken seriously within the scientific community.

Many thinkers, and Wyman and Vyse may be among them, recognize that individual personality is not entirely the manifestation of genes but that personality is also shaped by environment. Yet these environmental effects are most often simply acknowledged and then dropped because no one seems to know how to comparatively evaluate them for individuals with any reliability. Environmental factors remain a largely inaccessible problem of personality assessment, yet it is seldom recognized that astrology is explicitly structured as a study of how the native interacts with the environment in intimate detail.

To benefit from double-blind chart matching tests like the Wyman and Vyse experiment, it would be of interest to see which astrologers are best able to perform the matching tasks and then formulate these practices into structured concepts and theories that can be compared and methodically evaluated. Which astrologers have the best working models of aspects, signs, or houses? What parts of traditional astrological texts need to be examined and possibly updated? The informal qualitative research that astrologers typically do among themselves with case studies can be augmented by the more disciplined quantitative testing methods that Shawn Carlson implemented and Wyman and Vyse tried to improve. Future experiments should enable their participants—competent astrologers and mature test subjects—to give their best efforts in fair tests. The next experiment, which should be a true collaboration of researchers and knowledgeable astrologers, could be named “Science and Astrology: A double-blind test without bias.”

Notes

1. The choice between two-choice and three-choice tests becomes an issue in experiments where astrologers are asked to identify natal charts. Two-choice tests are more suitable for testing with heterogeneous samples. For example, the Vernon Clark double-blind experiment used a two-choice format to test whether astrologers could distinguish between ten sets of charts of people with cerebral palsy versus people with high intelligence. Three-choice tests, with the first two choices evaluated together, are more suitable for testing a relatively homogeneous sample that has a higher likelihood that two charts out of any given three might have insufficiently distinguishing features. For example, the Shawn Carlson double-blind experiment used a three-choice test because the sample consisted of students who were close in age and attended the same university.

The test subjects in the Wyman and Vyse experiment were even more homogeneous than Carlson’s. Their sample consisted mainly of students in the same introductory psychology course within a narrower age range. Wyman and Vyse state, “To maximize the likelihood of correct identification, we used a simple two-choice task” (W&V, 289). However, it is more likely that the two-choice format used by the authors did not maximize the correct identifications.

2. A videoed test (c. 1999) by American researcher Michael Shermer is worth special mention because the video has only recently reappeared online on YouTube after having been taken down for over a year and could be taken down again. In the late 1990s Shermer was the publisher of Skeptic Magazine and had a TV show called Exploring the Unknown. On one of the shows, Shermer challenged American Vedic astrologer Jeffrey Armstrong to a double-blind test. Armstrong could not see or talk to the participants. He was given only the birth location, date, time, and gender of the nine participants and his three-minute readings of each participant were recorded. The readings were played to the participants and Shermer scored the accuracy of Armstrong’s statements while Armstrong watched from another room and made notes.

The scores for the nine participants were: 69%, 63%, 89%, 71%, 74%, 75%, 66%, 38%, and 21%. Unbeknownst to the participants or Armstrong, the readings for the last two participants had been switched. When these participants heard their real readings, the accuracy of these readings changed from 38% to 94% and from 21% to 92%. In total, the participants agreed with 105 out of 137 of Armstrong’s statements for an overall score of 77%. In the end, Shermer questioned the plausibility of the typical skeptical explanations. How should Armstrong have been able to make accurate statements 77% of the time just on generalities, logical guesses, or blind luck?

3. The argument that genes fundamentally cause personality is incomplete because it does not account for how genes themselves change. Recent research into identical twins has shown that genomic structures are not fixed but mutate and change such that twins are not as genetically identical as was once thought (Brogaard & Marlow, 2012). This explains why identical twins develop different personalities and traits. If genes are the cause of personality and yet mutate, then one needs to consider how environmental factors might empirically correlate with genetic changes and expressions of personality. Any single instance of genetic mutation could spontaneously emerge from the quantum state without an empirical cause, yet still be probable. The empirical tendencies of an organism’s quantum genetic mutations could be indeterminate and non-causal and yet be statistically measurable.

4. Judging by the figures in the authors’ intercorrelations table (W&V, 296) there were an estimated two or three participants out of the total of 52 participants who claimed astrological beliefs. These special participants may have been recruited outside of the classroom through “fliers posted around the campus” (W&V, 290) but this was not a large enough sample to be representative. According to a Gallup report by David Moore (2005), 25% of Americans believe in astrology.

5. American psychologist Bertram Forer’s influential 1949 classroom experiment into gullibility and personal validation used a deliberate confirmation bias to evoke a type of unwitting confirmation bias now commonly known as the P.T. Barnum effect. The Barnum effect is the tendency for people to regard a personality description as accurate when it appears to be unique to them, even though the description is written to be so vague that it applies to a wide range of people. 

Forer created a questionnaire that he called the DIB personality test, and administered the test to his students. A week later, immediately before a scheduled quiz, Forer asked the students to rate each description in the resultant profiles for accuracy to themselves and to hand in their signed results. Unbeknownst to his students, Forer had given each student the same profile. This profile was based on a newsstand astrology book from which Forer had intentionally selected statements to be vague and generally true for everyone. Analysis of the students’ signed results demonstrated that the students had rated their professor’s DIB test as being highly accurate. This result should come as no surprise because Forer had intentionally biased his test to confirm the effect he was looking for and because students, apart from any Barnum effect, have a known tendency to please their teachers when it counts and are perhaps even more receptive to the tendency when they are just about to write a quiz.

Similar tests have been performed countless times with various groups such as students, soldiers, and even astrological skeptics, with similar results, and the Barnum effect is regarded as a robust phenomenon (Rogers & Soule, 382). Many “rational skeptics” have argued that astrology operates by “cognitive bias,” rationalizing along the following lines: 1) The P.T. Barnum effect is the false assumption of information unique to oneself. 2) Astrology is information unique to oneself. 3) Therefore astrology is a P.T. Barnum effect. Consistent with this illogical reasoning, any information unique to anyone would necessarily be false and a Barnum effect. The assumption of a P.T. Barnum effect does not falsify astrological theory.

Acknowledgments

I am very grateful for critical input from Correlation’s peer review panel and for the advice of my associates in the astrological community. Special thanks to Anita Puronto for her editorial review for my final draft. 

Efforts were made to contact Professor Stuart Vyse for samples of the test materials and for discussion, but there was no reply. Drafts of this article were sent to Professor Christopher French and Professor Ivan Kelly for comment, but there were no replies.

© 2014 Kenneth McRitchie

References

Big Five Personality Test: Take the Test! Psychology Today. Viewed on 2013-12-07.
Brogaard, Berit & Kristian Marlow (2012). Identical Twins Are Not Genetically Identical: Potential consequences for the Minnesota Twin Study. Psychology Today, November 25.
Carlson, Shawn. (1985). A double-blind test of astrology. Nature, (318), December, 419-25.
Clark, Vernon (1970). An investigation of the validity and reliability of the astrological technique, Aquarian Agent, October, 1, 2-3.
Clark, Vernon (1961). Experimental astrology. In Search, (Winter/Spring), 102-112.
Currey, Robert. (2011). Shawn Carlson’s Double-Blind Astrology Experiment: U-Turn in Carlson’s Astrology Test? Correlation. 27(2), 7-33.
Currey, Robert (2011). Research Sceptical of Astrology: Wyman & Vyse Double Blind Test of Astrology. www.astrologer.com. Viewed on 2011-07-02.
Dean, Geoffrey & Arthur Mather (1977). Recent Advances in Natal Astrology: A Critical Review 1900-1976, Bromley & Kent, UK, Astrological Association. ISBN 0140223975 
Ertel, Suitbert. (2009). Appraisal of Shawn Carlson’s Renowned Astrology Tests. Journal of Scientific Exploration, 23(2), 125-137.
Forer, Bertram R. (1949). The fallacy of personal validation: A classroom demonstration of gullibility. The Journal of Abnormal Psychology, 44, 118-121.
Klayman, Joshua, & Young-Won Ha. (1987). Confirmation, disconfirmation, and information in hypothesis testing. Psychological Review, 94(2), 211-28.
Marbell, Neil. (1981). Profile Self-selection: A test of astrological effect on human personality. Synapse Foundation monograph. Republished (1986-87, Winter) NCGR Journal, 29-44.
McGrew, John H. & Richard M. McFall. (1990). A Scientific Inquiry Into the Validity of Astrology. Journal of Scientific Exploration, 4(1), 75-83.
McKie, Robin. (2013). Why do identical twins end up having such different lives? The Guardian, June 2.
McRitchie, Kenneth. (2011). Support for astrology from the Carlson double-blind experiment. ISAR International Astrologer, 40(2), 33-38.
McRitchie, Kenneth. (2006). Astrology and the Social Sciences: Looking inside the black box of astrology theory. Correlation, 24(1), 5-20.
McRitchie, Kenneth. (2004). Environmental Cosmology: Principles and Theory of Natal Astrology. Cognizance Books. ISBN 0-9736242-0-5.
Moore, David W. (2005). Three in Four Americans Believe in Paranormal. Gallup News Service, June 16.
Nanninga, Rob. (1996/97). The Astrotest: A tough match for astrologers. Correlation, Northern Winter, 15(2), 14-20.
Nickerson, Raymond S. (1998). Confirmation bias: A ubiquitous phenomenon in many guises. Review of General Psychology, 2(2), 175-220.
Popper, Karl. (1959). The logic of scientific discovery. New York: Basic Books. ISBN 0-203-99462-0.
Roberts, B. W. and Mroczek, D. (2008). Personality Trait Change in Adulthood. Current Directions in Psychological Science. 17 (1): 31–35.
Rogers, Paul and Janice Soule. (2009). Cross-cultural differences in the acceptance of Barnum profiles supposedly derived from Western versus Chinese astrology. Journal of Cross-Cultural Psychology, 40(4), 381-389.
Rosenblum, Bernard. (1983). The Astrologer’s Guide to Counseling: Astrology's Role in the Helping Professions. CRCS Publications. ISBN 0-916360-14-8.
Saucier, Gerard. (1998). Replicable Item-Cluster Subcomponents in the NEO Five-Factor Inventory. Journal of Personality Assessment. 70(2), 263-276.
Schäfer, Lothar. (2013) Infinite Potential: What Quantum Physics Reveals About How We Should Live. Random House. ISBN-10: 0307985954.
Shermer, Michael. (c. 1999). Michael Shermer debunked by Astrologer Jeffrey Armstrong on his own show. Exploring the Unknown. YouTube. Viewed on 2013-13-10. 
Taylor, Shelley E. & Jonathan D. Brown. (1988). Illusion and well being: A social psychological perspective on mental health. Psychological Bulletin, 103, 193-210.
Taylor, Shelley E. & Jonathan D. Brown. (1994). Positive Illusions and Well-Being Revisited. Psychological Bulletin, Vol. 116(1) (July), 21-27.
Vidmar, Joseph (2009). A Comprehensive Review of the Carlson Astrology Experiments. Correlation, 26(1).
Wyman, Alyssa Jayne & Stuart Vyse. (2008). Science Versus the Stars: A Double-Blind Test of the Validity of the NEO Five-Factor Inventory and Computer-Generated Astrological Natal Charts. The Journal of General Psychology, 135(3), 287-300.

Support for astrology from the Carlson double-blind experiment

Review of "A double-blind test of astrology" 

This article has been peer reviewed and published by ISAR International Astrologer, 40, No. 2, pp. 33-38. Copyright © 2009 by Kenneth McRitchie. [Download PDF]

Abstract. The Carlson double-blind study, published in 1985 in Nature (one of the world’s leading scientific publications) has long been regarded as one of the most definitive indictments against astrology. Although the study might appear to be fair to uncritical readers, it contains serious flaws, which when they are known, cast a very different light on the study. These flaws include: no disclosure of similar scientific studies, unfairly skewed design, disregard for its own stated criteria of evaluation, irrelevant groupings of data, rejection of unexpected results, and an illogical conclusion based on the null hypothesis. Yet, when the stated measurement criteria are applied and the data is evaluated according to normal social science, the two tests performed by the participating astrologers provide evidence that is consistent with astrology (p = .054 with ES = .15, and p = .037 with ES = .10). These extraordinary results give further testimony to the power of data ranking and rating methods, which have been successfully used in previous astrological experiments. A critical discussion on follow-up studies by McGrew and McFall (1990), Nanninga (1996/97), and Wyman and Vyse (2008) is also included.

The research experiment conducted by Shawn Carlson, “A double blind test of astrology,” published in the science journal Nature in 1985 as an indictment of astrology, is one of the most frequently cited scientific studies to have claimed to refute astrology. A Google search for the title as a quoted string returns over 6,600 links.(1) Although the Carlson study drew initial criticism for numerous flaws when it was published, a more recent examination has found that despite the flaws, the data from the study actually supports the claims of the participating astrologers. This support lends further credence to the effectiveness of ranking and rating methods, which have been used in other, lesser known astrological experiments.

The Carlson astrology experiment was conducted between 1981 and 1983 when Carlson was an undergraduate physics student at the University of California at Berkeley under the mentorship of Professor Richard Muller. The flaws that have been uncovered in the Nature article include not only the omission of a review of the literature on similar studies, which all academic papers are expected to provide, but more serious irregularities such as skewed test design, disregard for its own criteria of evaluation, irrelevant groupings of data, removal of unexpected results, and an illogical conclusion based on the null hypothesis.

In concept and design, the Carlson experiment was not original. It was modeled after the landmark double-blind matching test of astrology by Vernon Clark (Clark, 1961). In that test astrologers were asked to distinguish between each of ten pairs of natal charts. One chart of each pair belonged to a subject with cerebral palsy and the other belonged to a subject with high intelligence. Another influential study was the “Profile Self-selection” double-blind experiment, which was led by the late astrologer Neil Marbell and privately distributed among contributors in 1981 before its eventual publication (Marbell, 1986-87). In that test, participating volunteers were asked to select their own personality interpretations, both long and short versions in separate tests, out of three that were presented.

In both of these prior studies, the participants performed well above significance in support of the astrological hypothesis as compared to chance. The Marbell study was extraordinarily qualified as it involved extensive input and review from astrologers, scientists, statisticians, and prominent skeptics. Carlson neglected to provide any review of these scientific studies that supported astrology or any other previous related experiments. 

The stated purpose of Carlson’s research was to scientifically determine whether the participating astrologers (members of the astrology research organization NCGR and others) could match natal charts to California Psychological Inventory (CPI) profiles (18 personality scales generated from 480 questionnaire items). Additionally, Carlson would determine whether participating volunteers (undergraduate and graduate students, and others) could match astrological interpretations, written by the participating astrologers, to themselves. These assessments, Carlson asserts, would test the “fundamental thesis of astrology” (Carlson, 1985: 419). 

From the time of its release, the Carlson study has been criticized for the extraordinary demands it placed on the participating astrologers, which would be regarded as unfair in normal social science. As with any controversial study, all references to Carlson’s experiments should include the scientific discourse that followed it, particularly the points of criticism that show weaknesses in the design and analysis. Notable among recent critics has been University of Göttingen emeritus professor of psychology Suitbert Ertel, who is an expert in statistical methods and is known for his criticism of research on both sides of the astrological divide. Ertel published a detailed review in a 2009 article, “Appraisal of Shawn Carlson’s Renowned Astrology Tests” (Ertel, 2009).

From a careful reading of Carlson’s article in light of the ensuing body of discourse, we can appreciate that the design of the experiment was intentionally skewed in favor of the null hypothesis (no astrological effect), which Carlson refers to, somewhat misleadingly, as the “scientific hypothesis.” Some of the controversial features of the design are as follows:
  • The astrologers were not supplied with the gender identities of the CPI owners, even though the CPI creates different profiles for men and women (Eysenck, 1986: 8; Hamilton, 1986: 10).
  • Participants were not provided with sufficiently dissimilar choices of interpretations, as the Vernon Clark study had done, but instead were given randomly selected choices. This may give the impression of a fair method, but given the narrow demographics of the sample, there is an elevated likelihood of receiving similar items from which to choose, which makes it unfair (Hamilton, 1986: 12; Ertel, 2009: 128).
  • The easier to discriminate and more powerful two-choice format, which had been used in the Vernon Clark study, was replaced with a less powerful three-choice format, which further elevated the chances of receiving similar items (Ertel, 2009: 128). No reasons are given for this unconventional format, although it can be surmised that Carlson was well aware of the complexities of a three-choice format from his familiarity with the Three-Card Monte (“Follow the Lady”) sleight-of-hand confidence game, which he had often played as a street psychic and magician (Vidmar, 2008).
  • The requirement for rejecting the “scientific hypothesis” was elevated to 2.5 standard deviations above chance (p = .006). In the social sciences, the conventional threshold of significance is 1.64 standard deviations with probability less than p = .05 (Ertel, 2009: 135).
  • Carlson did not consider the astrologers’ methodological suggestions or give an account of their objections. Carlson credits astrologer Teresa Hamilton with giving “valuable suggestions,” yet Hamilton complained later that “Carlson followed none of my suggestions. I was never satisfied that the experiment was a fair test of astrology” (Hamilton, 1986: 9).
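The two significance thresholds compared in the list above can be checked with a short calculation. The sketch below converts a number of standard deviations above chance into a one-tailed probability under a standard normal distribution. The one-tailed reading is an assumption, chosen because it reproduces the quoted figures, and the function name is illustrative rather than taken from any of the cited papers.

```python
from math import erfc, sqrt


def one_tailed_p(z):
    """Upper-tail probability of a standard normal variable exceeding z."""
    return 0.5 * erfc(z / sqrt(2))


# Carlson's elevated requirement versus the conventional threshold.
for label, z in [("Carlson's requirement", 2.5), ("conventional threshold", 1.64)]:
    print(f"{label}: {z} SD above chance, one-tailed p = {one_tailed_p(z):.4f}")
```

On this reading, 2.5 standard deviations corresponds to a probability of about .006 and 1.64 standard deviations to about .05, so the requirement placed on the astrologers was roughly eight times stricter than the conventional one.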
Given this skewed design, the irregularities of which are not obvious to the casual reader, Carlson directs our attention to the various safeguards he used to assure us that no unintended bias would influence the experiment. He describes in detail the precautions used to screen volunteers against negative views of astrology, how the samples were carefully numbered and guarded to ensure they were blind, and the contents of the sealed envelopes provided to test participants.

The experiment consisted of several separate tests. The astrologers performed two tests, a CPI ranking test and a CPI rating test. The volunteer students performed three tests, a natal chart interpretation ranking test, a natal chart interpretation component rating test, and a CPI ranking test.

In the CPI ranking test, astrologers were given, for each single natal chart, three CPI profiles, one of which was genuine, and asked to make first and second choices. There were 28 participating astrologers who matched 116 natal charts with CPIs. Success, Carlson states, would be evaluated by the frequency of combined first and second choices, which is the correct protocol for this unconventional format. He states, “Before the data had been analyzed, we had decided to test to see if the astrologers could select the correct CPI profile as either their first or second choice at a higher than expected rate” (Carlson, 1985: 425).

In addition to this ranking test, the astrologers were tested for their ability to rate the same CPIs according to a scale of accuracy. This task allowed for finer discrimination within a greater range of choices. Each astrologer “also rated each CPI on a 1-10 scale (10 being the highest) as to how closely its description of the subject’s personality matched the personality description derived from the natal chart” (Carlson, 1985: 420).

As to the results of the astrologers’ three-choice ranking test, Carlson first directs our attention to the frequency of the individual first, second, and third CPI choices made by the astrologers, each of which he found to be consistent with chance within a specified confidence interval. This observation is scarcely relevant, given the stated success criteria of the first and second choice frequencies combined. Then, to determine whether the astrologers were successful, Carlson directs our attention to the rate for the third place choices, which, as already noted, was consistent with chance. Thus he declares that the combined first two choices were not chosen at a significant frequency.

“Since the rate at which the astrologers chose the correct CPI as their third place choice was consistent with chance, we conclude that the astrologers were unable to chose [sic] the correct CPI as their first or second choices at a significant level” (Carlson, 1985: 425). This conclusion, however, ignores the stated success criteria and is in fact untrue. The calculation for significance shows that the combined first two choices were chosen at a success rate that is marginally significant (p = .054) (Ertel, 2009: 129).

As to the results of the astrologers’ rating test (10-point rating of three CPIs against each chart), Carlson demonstrates that the astrologers’ ratings were no better than chance within the first, second, and third place choices made in the three-choice test. He shows a weighted histogram and a best linear fit graph to illustrate each of these three groups of ratings. Carlson directs our attention to the first choice graph as support for his conclusion for this test. The slope of this graph is “consistent with the scientific prediction of zero slope” (Carlson, 1985: 424). The slope is actually slightly downward. The graphs for the other two choices are not remarked upon, but show slightly positive slopes.

The notable problem with Carlson’s analysis of the 10-point rating test, however, is that this test had no dependency on the three-choice ranking test and even used a different sample size of CPIs.(2) According to the written instructions supplied to the astrologers, this rating test was actually to be performed before the three-choice ranking test (Ertel, 2009: 135). These 10-point ratings should not be grouped as though they were quantitatively related to the later three-choice test. Confirmation bias from the claimed “result” of the three-choice test, which Carlson presents earlier in his paper, suggests acceptance of irrelevant groupings in this 10-point rating test, presented later. When the totals of the ratings are considered without reference to the choices made in the subsequent test, a positive slope is seen, which shows that the astrologers actually performed at an even higher level of significance (p = .037) than the three-choice test (Ertel, 2009: 131).

The other part of Carlson’s experiment tested 83 student volunteers to see if they could correctly choose their own natal chart interpretations written by the astrologers. Volunteers were divided into a test group and a control group. Members of the test group were each given three choices, all of the same Sun sign, one of which was interpreted from their natal chart (Carlson, 1985: 421). Similarly, each member of the control group received three choices, all of the same Sun sign, except that none of the choices was interpreted from their natal charts, although one choice was randomly selected as “correct” for the purpose of the test.

For the results of this test, Carlson shows a comparison of the frequencies of the correct chart as first, second, and third choices for the test group and the control group (again ignoring his stated protocol to combine the frequencies of the first two choices). He finds that the results for the test group are “all consistent with the scientific hypothesis” (Carlson, 1985: 424). However, he does note an unexpected result for the control group, which was able to choose the correct chart at a very high frequency. He calculates this to be at 2.34 standard deviations above chance (p = .01). Yet, because this result occurred in the control group, which was not given their own interpretations, Carlson interprets this as a “statistical fluctuation.”
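The correspondence between 2.34 standard deviations and p = .01 can be checked with a short calculation. The sketch below assumes a one-tailed test against a standard normal distribution, which the text does not state explicitly; the function name is illustrative only.

```python
import math

def upper_tail_p(z: float) -> float:
    """One-tailed probability that a standard normal value is at least z."""
    return 0.5 * math.erfc(z / math.sqrt(2))

# 2.34 standard deviations above chance, as reported for the control group.
p = upper_tail_p(2.34)
print(round(p, 2))  # 0.01
```

Under a two-tailed reading the probability would be twice as large, so the reported figure is consistent only with the one-tailed assumption made here.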

Yet the size of this statistical fluctuation is so unusual as to attract skepticism, particularly in light of Carlson’s other results. It is reasonable to think that the astrologers could write good quality chart interpretations after having successfully matched charts with CPI profiles. Yet, according to Carlson’s classification, the test group tended to avoid the astrologers’ correct interpretations and choose the two random interpretations, while the control group tended to choose the selected “correct” interpretations by a wide margin, as if they, the controls, had been the actual test subjects (Ertel, 2009: 132). This raises suspicion that the data might have been switched, perhaps inadvertently, but this is unverifiable speculation (Vidmar, 2008).

Like the participating astrologers, the student volunteers were also given a rating test; in this case for the sample chart interpretations they were given. They were asked to rate, on a scale of 1 to 10, the accuracy of each subsection of the natal chart interpretations written by the astrologers. “The specific categories which astrologers were required to address were: (1) personality/temperment [sic]; (2) relationships; (3) education; (4) career/goals; and (5) current situation” (Carlson, 1985: 422). This test would potentially have high interest to astrologers because of the distinction it made between personality and current situation, which is a distinction that is not typically covered in personality tests. Also, the higher sensitivity of a rating test could provide insight, at least as confirmation or denial, into the extraordinary statistical fluctuation seen in the three-choice ranking test.

However, based on a few unexpected results, Carlson decided that there was no guarantee that the participants had followed his instructions for this test. “When the first few data envelopes were opened, we noticed that on any interpretation selected as a subject’s first choice, nearly all the subsections were also rated as first choice” (Carlson, 1985: 424). On the basis of this unanticipated consistency, Carlson rejected the volunteers’ rating test without reporting the results.

As an additional test in this part of the experiment, the student volunteers were asked to choose from among three CPI profiles the one that was based on the results of their completed CPI questionnaire. The other two profiles offered were taken from other student volunteers and randomly added. Of the 83 volunteers who completed the natal chart interpretation choices, only 56 completed this task. As usual, Carlson compared the results of the three choices for the test and control groups taken individually (instead of the frequency of the first two choices taken together). Furthermore, in contravention of the logic of control group design, Carlson compares the two groups against chance instead of against each other (Ertel, 2009: 132). He found no significant difference from chance for the two groups.

There are plausible reasons that could explain why the test group was unable to correctly select their own CPI profiles, even though the astrologers were able, to a significant extent as we have seen, to match CPI profiles with the students’ charts. The disappointing number of students who completed this task, despite having endured the 480-question CPI questionnaire, suggests that the students might have been much less motivated than the astrologers, for whom the stakes were higher (Ertel, 2009: 133). The CPI matching tasks, for both the volunteers and the astrologers, were especially challenging because of the three-choice format. The random selections of CPIs made within the narrow demographics of the sample population of students would have elevated the likelihood of receiving at least two CPI profiles that were too similar to make a discriminating choice, and this would have had a negative impact on motivation.

In the conclusion of his study, Carlson claims: “We are now in a position to argue a surprisingly strong case against astrology as practiced by reputable astrologers” (Carlson, 1985: 425). However, this conclusion defies rationality. Ertel points out the logical flaw that such a conclusion cannot be drawn even if the tests had shown an insignificant result. “Not being able to reject a null hypothesis does not justify the claim that the alternate hypothesis is wrong” (Ertel, 2009: 134).

Despite its numerous flaws and unfair challenges, the Carlson experiment nevertheless demonstrates that the astrologers, in their two tests, were able to match natal charts with CPI profiles significantly better than chance according to the criteria normally accepted by the social sciences. Thus the null hypothesis must be rejected. As such, the Carlson experiment demonstrates the power of ranking and rating methods to detect astrological effects, and indeed helps to raise the bar for effect size in astrological studies. The benchmark effect size that had been attained by the late astrological researcher Michel Gauquelin was merely .03 to .07. Although these were small effects, they were statistically very significant due to large sample sizes (N = 500-1000 or more natal data) and had to be taken seriously (Gauquelin, 1988). In Carlson’s experiment, which applied sensitive ranking controls, the effect size of the three-choice matching test with p = .054 is ES = .15, and the effect size of the 10-point rating test with p = .037 is ES = .10 (Ertel, 2009: 134).
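One way to see how a p-value and a sample size relate to an effect size is the common conversion ES = z / √N, where z is the standard normal score for a one-tailed p. The sketch below applies it to the three-choice test, using N = 116 from Note 2. This conversion is an assumption made for illustration; the article does not state which formula Ertel used, and the function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def effect_size_from_p(p_one_tailed: float, n: int) -> float:
    """Convert a one-tailed p-value and sample size to ES = z / sqrt(N).

    Assumed conversion for illustration; not confirmed as Ertel's method.
    """
    z = NormalDist().inv_cdf(1.0 - p_one_tailed)
    return z / sqrt(n)

# Three-choice matching test: p = .054, N = 116 natal charts (Note 2).
es_three_choice = effect_size_from_p(0.054, 116)
print(round(es_three_choice, 2))  # 0.15
```

Under this assumption the result agrees with the reported ES = .15. The same calculation is not shown for the 10-point rating test, because the article does not state the number of observations on which that effect size is based.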

Follow-up studies

Other experiments have attempted to address the earlier documented criticisms of the Carlson test. However, these experiments, each of which claims to confirm that astrological choices are made at no better than chance levels, have drawn criticism from astrologer Robert Currey (2011) and others as having fatal flaws. Each falls short of the Carlson study. Included here are the studies by McGrew and McFall (1990), Nanninga (1996/97), and Wyman and Vyse (2008).

The McGrew and McFall (1990) experiment was intended to include personal information of the sort typically used by astrologers but not found in standard personality profiles. Six “expert” astrologers, all members of the Indiana Federation of Astrologers but none of whom claimed professional accreditation, participated. Each astrologer was asked to match the birth charts of a sample of 23 volunteers to an extremely broad range of information gathered for each volunteer. This information included photo portraits, results from two standardized psychology tests, and written descriptions of personality and life events generated by 61 questions that were developed from input that the authors gleaned from the astrologers.

The use of photos in the McGrew and McFall study meant that special restrictions were imposed on the experiment to avoid age clues from the photos. The authors recruited volunteers who ranged from only 30 to 31 years of age. This narrow demographic, where natal charts would share numerous similarities, and the large amount of non-uniform information supplied for each volunteer, elevated the difficulty of the matching task. The Carlson study is regarded as unnecessarily complex because the astrologers were asked to choose the genuine CPI from among three. In the McGrew and McFall study, however, astrologers were given the virtually impossible task of choosing each genuine set of personal descriptions and information from among no less than 23 sets! It is little wonder that this follow-up research was rejected for publication in Nature, which is an interesting story in its own right (Currey, 2011). The authors argue that the astrologers’ experimental task was a “simplification” of their ordinary business (McGrew and McFall, 1990: 82). On the contrary, it was much more complex and far more difficult than even Carlson’s tasks. The reasons that the two authors provide for their judgment against astrology are not at all convincing.

The Nanninga (1996/97) experiment was modeled on the McGrew and McFall experiment and contained the same sorts of flaws. It was intended to settle a dispute argued in the local newspapers as to whether astrologers can or cannot predict. Through the newspapers, Nanninga offered a large cash prize to anyone who could match seven natal charts to seven sets of personality information. He attracted an unexpectedly large number of “astrologers,” from which he chose 50 based on their claimed astrological experience. The test subjects for the study were volunteers, all born “around 1958.” A test questionnaire for the volunteers, developed by Nanninga from ideas solicited from the astrologers, covered a very wide range of interests and background such as education, vocation, hobbies, interests, main goals, personality, relationships, health, religion, and so on, plus dates of important life events. To these Nanninga added 24 multiple choice questions taken from a standard personality test.

Like the McGrew and McFall experiment, Nanninga’s experiment used a very narrow demographic of volunteer subjects, making them difficult to astrologically differentiate, and he likewise presented a very large amount of non-uniform personal data written by the seven volunteers for the astrologers to sort through. Although Nanninga’s task involved seven matches instead of 23 and was therefore somewhat less complex than the McGrew and McFall task, it was nonetheless considerably more complex than the Carlson task, which has been criticized as being more complex than necessary. Nanninga’s study was not an improvement over the Carlson experiment and does not convincingly support his claims that astrology is in conflict with science and that astrologers increasingly confine themselves to statements that cannot be falsified (Nanninga, 1996/97: 20).

The Wyman and Vyse (2008) experiment was a low-budget classroom study modeled on the Carlson experiment but without the astrologers. In this experiment it was hypothesized that the use of a very transparent self-assessment questionnaire (the NEO Five-Factor Inventory) would enable volunteer participants to better identify their own profile scores than the CPI used by Carlson. Examples from this questionnaire include, “I try to be courteous to everyone I meet” (which contributes to A, Agreeableness in the resultant profile), and “I like to be where the action is” (which contributes to E, Extraversion). The authors asked 52 volunteers (introductory psychology class members and others) to identify their genuine five-factor personality profile from a bogus one and to identify their genuine astrological description from a bogus one. The astrological descriptions were created from the output of a commercial natal chart interpretation program, modified to remove all planetary, sign, and house clues and further simplified by the removal of all aspect information to provide 29 one- to four-sentence personality descriptions. The students succeeded at the personality profile task but failed at the natal chart description task.

Criticisms of the Wyman and Vyse experiment include: 

1. No test of astrologers’ skills and performance. 

2. The false assumption that both natal chart interpretations and psychology profiles “share a common purpose - to provide a description of the respondent's personality” (Wyman and Vyse, 2008: 287). Natal charts provide their value as descriptions of potential. 

3. The tender age of the volunteers (mean age of 19.3 years) whose life potential would be largely unrealized and somewhat idealized. 

4. Small sample size of natal charts (N = 52, where a sample of 100 would have been better). 

5. The exclusion of aspects from the astrological descriptions, arguably the most important component. 

6. Lack of synthesis of the chart components and a holistic approach. 

7. The unbalanced tasks of identifying an easy five-factor profile that parrots the subject’s input compared to the complexity of identifying a 29-factor partial astrological description of life potential. 

8. The false assumption that the positive and negative polarities of the signs mean “favorable” and “unfavorable” respectively and the listing (twice) of the sign Aquarius as both favorable and unfavorable. 

9. Incomplete disclosure of result details. Statistical inferences were drawn based on belief in astrology, but how many students in this small sample would dare, even anonymously, to declare belief in astrology in an experiment presided over by a professor, Stuart Vyse, who is a prominent astrology skeptic? Was it more than one? 

10. Students’ fear for their academic safety is a high stakes issue and could easily bias such a study as this one.

These errors and inadequacies in the Wyman and Vyse experiment arouse suspicions as to the accuracy of the modified astrological descriptions. Together, these flaws place the experiment well below the level of the Carlson experiment and raise serious doubts as to the authors’ conclusions. The study does nothing to fix the Carlson results. Although the simple five-factor personality profiles were identifiable by the students at a significant rate, the authors’ claim that the simplified astrological descriptions they devised should be equally identifiable is not convincing.

Discussion

The evidence provided by the Carlson experiment, when considered together with the scientific discourse that followed its publication, is extraordinary. Given the unfairly skewed experimental design, it is extraordinary that the participating astrologers managed to provide significant results. Given the irregularities of method and analysis, which had somehow gone unnoticed for 25 years, it is extraordinary that investigators have managed to scientifically assess the evidence and bring it into the full light of day. Now that the irregularities have been pointed out, it is easy to see and appreciate what Carlson actually found.

However, because of the unfairness and flaws in the Carlson experiment, this line of research needs to be replicated and extended in more stringent research programs that use adequate sample sizes of natal charts. The research done in the follow-up studies by McGrew and McFall (1990), Nanninga (1996/97), and Wyman and Vyse (2008) was on the whole better executed with regard to method and analysis than the Carlson experiment. Nonetheless, first-rate methods and analysis do not magically transform an experiment with faulty assumptions and design into first-rate science. These are the relatively routine parts of a research study that can often be rescued from their own problems, as we have seen with the Carlson study. With hindsight, it is evident that the editors of the science and psychology journals who published these studies failed to realize that astrology is a complex discipline with many variables, limitations, and pitfalls. Ultimately, it is important that would-be researchers learn from criticism and avoid fundamental blunders and misjudgments such as those outlined in this article. Astrological expertise should always be included in the peer review stage prior to publication.

There is much to be learned from the Carlson experiment. If natal charts can be successfully compared with self-assessment tests by the use of rating and ranking methods, as the Carlson experiment indicates, then astrological features might be easier to evaluate than was previously believed. New questions must now be raised. What would the results be in a fair test? Why did the astrologers choose and rate the CPIs as they did? Which chart features should be compared against which CPI features? Could more focused personality tests provide sharper insights and analysis? The door between astrology and psychology has been opened by just a crack and we have caught a glimpse of hitherto unknown connections between the two disciplines.

Notes

1. By comparison, a Google query of some other peer reviewed journal articles on astrology, searched as quoted strings, returns the following results:
  • “Is Astrology Relevant to Consciousness and Psi?” (Dean and Kelly, 2003) 8800 results.
  • “Are Investors Moonstruck?-Lunar Phases and Stock Returns” (Yuan et al., 2006) 3700 results.
  • “Objections to Astrology: A Statement by 186 Leading Scientists” (The Humanist, 1975) 3500 results.
  • “A Scientific Inquiry Into the Validity of Astrology” (McGrew and McFall, 1990) 2160 results.
  • “Raising the Hurdle for the Athletes’ Mars Effect” (Ertel, 1988) 1350 results.
  • “The Astrotest” (Nanninga, 1996) 970 results.
  • “Is There Really a Mars Effect?” (Gauquelin, 1988) 630 results.
  • “Science versus the Stars: A Double-Blind Test of the Validity of the NEO Five-Factor Inventory and Computer-Generated Astrological Natal Charts” (Wyman and Vyse, 2008) 265 results.
2. Carlson presents the 10-point rating test as a finer discrimination of the 3-choice ranking test, but the sample size is not the same. A sample of 116 natal charts is used in the 3-choice test (Carlson, 1985: 421, 423) and a different sample size is used for the 10-point rating test, which adds to the discrepancies already mentioned between these two tests and further emphasizes that they cannot be considered as a single test. Carlson does not give the sample size for the 10-point test, but it can be determined by measurement of the first, second, and third choice histograms in his article (Carlson, 1985: 421, 424). Each natal chart had to be the “correct” choice in one of these three “choices.”  By adding up these “correct hits,” Ertel shows 99 charts (Ertel, 2009: 130, Table 3). A more exacting scrutiny of the histograms by Robert Currey (in a forthcoming article) determines 100 charts.

© 2009 Kenneth McRitchie

References

Carlson, Shawn (1985). “A double-blind test of astrology.” Nature, (318), 419-425.
Clark, Vernon (1961). “Experimental astrology.” In Search, (Winter/Spring), 102-112.
Currey, Robert (2011). “Research Sceptical of Astrology: McGrew & McFall, ‘A Scientific Inquiry into the Validity of Astrology’ 1990.” Retrieved on 2011-07-02.
Currey, Robert (2011). “Research Sceptical of Astrology: Wyman & Vyse Double Blind Test of Astrology.” Retrieved on 2011-07-02.
Ertel, Suitbert (1988). “Raising the Hurdle for the Athletes’ Mars Effect: Association Co-varies with Eminence.” Journal of Scientific Exploration, 2(1), 53-82.
Ertel, Suitbert (2009). “Appraisal of Shawn Carlson’s Renowned Astrology Tests.” Journal of Scientific Exploration, 23(2), 125-137.
Eysenck, H.J. (1986). “A critique of ‘A double-blind test of astrology’.” Astropsychological Problems, 1(1), 27-29. 
Gauquelin, Michel (1988). “Is there Really a Mars Effect?” Above & Below: Journal of Astrological Studies, Fall, 4-7.
Hamilton, Teressa (1986). “Critique of the Carlson study.” Astropsychological Problems, 3, 9-12.
Marbell, Neil (1986-87). “Profile Self-selection: A Test of Astrological Effect on Human Personality.” NCGR Journal, (Winter), 29-44.
McGrew, John H. and Richard M. McFall (1990). “A Scientific Inquiry Into the Validity of Astrology.” Journal of Scientific Exploration, 4(1), 75-83.
Nanninga, Rob. (1996/97). “The Astrotest: A tough match for astrologers.” Correlation, Northern Winter, 15(2), 14-20.
Vidmar, Joseph (2008). “A Comprehensive Review of the Carlson Astrology Experiments.” Retrieved on 2010-08-01.
Wyman, Alyssa Jayne and Stuart Vyse (2008). “Science Versus the Stars: A Double-Blind Test of the Validity of the NEO Five-Factor Inventory and Computer-Generated Astrological Natal Charts.” The Journal of General Psychology, 135(3), 287-300.