On Fagan and Holland’s Culture Fair Tests of Intelligence

This article concerns a form of intelligence test devised in the 2000s which supposedly measures intelligence but lacks the cultural bias normally contained within IQ tests and, as a result, shows that American races do not actually differ in mean intelligence or that, to the degree they do, this is due to cultural factors rather than genetic ones.

The basic idea behind the tests is this: due to culture, races have been exposed to differing sorts of information, so to get an unbiased test we should base the test on information neither race has previously been exposed to and give participants a brief exposure period to it in a controlled environment, so that we can ensure each race is given an equal opportunity to learn said information. For instance, we could allow people of each race to study the definitions of words they’ve never heard before and then give them a vocabulary test on those words after their study period. When we do this, advocates of the theory say, we find that races do not differ in their ability to acquire new information and so are equally intelligent.

There are many problems with this approach. Firstly, unless participants are made to acquire the novel information in a complex way, or to perform a complex operation on the information in order to score well on a test, this will be a test of rote memory, which is a pretty simple cognitive ability. Standard IQ tests include a test called a “forward digit span”, which consists of reading participants a list of numbers and then seeing how many they can repeat back. Forward digit span tests seemingly measure the same cognitive ability that these “culture free” tests do, and we already know that black and white Americans either do not differ in forward digit span score or differ in score only slightly. By contrast, if you have people repeat back the list of numbers in the opposite of the order in which they heard them, the so-called backward digit span test, you see significant racial gaps. This has been shown numerous times.

Study               N (white / black)   Forward digit span gap   Backward digit span gap
Dalliard (2013A)    864 / 407           0.05 (ns)                0.41
Dalliard (2013B)    1609 / 265          0.01 (ns)                0.43
Dalliard (2013B)    2904 / 481          0.04 (ns)                0.41
Dalliard (2013B)    3085 / 582          0.16                     0.36
Jensen (1975)       669 / 621           0.25                     0.57

Of course, both digit span tests concern the same information, so the difference in how each race performs on the two tests cannot be due to a difference in the information required. Rather, the difference reflects the fact that backward digit span tests require test takers to perform a slightly complex operation on the information they’ve been given, and the races primarily differ in their ability to perform complex operations on information rather than in their ability to learn simple new facts. Thus, these culture fair tests may cause racial gaps to disappear because they remove complexity from IQ tests.

Another problem with this research is that the samples used all range from so small as to be useless to moderately inadequate. No study, including meta-analyses done with this research, contains even 200 black participants. Most research in this area has utilized samples of fewer than 50 blacks.

Troublingly, when samples are large enough to merely be moderately inadequate, Fagan and Holland consistently fail to actually report what the mean racial differences in their culture fair test scores are. Instead, once the tests were slightly improved to add a small degree of complexity to them and larger samples began to be utilized, Fagan and Holland only ever reported racial differences in their test after controlling for SAT scores.

The single exception to this involves a sample which is provably unrepresentative with respect to racial gaps in intelligence. And this brings up another problem: the representativeness of the samples utilized is usually suspect and in some cases demonstrably invalid. There is an instance where Fagan and Holland try to show that their samples are representative, but it relies on faulty statistical reasoning.

Another problem with this research is that it cannot test whether racial differences in information exposure are due to racial differences in behavior which itself may be caused by genes (e.g. the propensity to read) or some ultimately environmental factor. For this reason, it is obviously not a valid test of the hereditarian hypothesis.

With those general criticisms laid out, let’s turn to analyzing this literature one paper at a time (there are, to my knowledge, only four papers in this literature and they’re all done by the same two people and this actually constitutes another notable problem with this line of evidence).

Fagan and Holland (2002)

This line of work began with Fagan and Holland (2002). In this paper, participants in the trained group were given time to study the definitions either of rare words they were unlikely to have heard before or of words taken from the PPVT, a common test of verbal intelligence. They were then given a test to see how many words they remembered the definition of. Participants in the “untrained” group were either tested using PPVT words with no training or were tested on normal vocabulary words, again with no special studying of the definitions beforehand.

As can be seen, significant racial gaps occurred in the untrained condition. (A “racial gap” usually means whites versus blacks, but the authors sometimes combine other groups such as Hispanics into a “minority” category and minorities such as Asians into a “majority” category; gaps are in SD units.) By contrast, in the trained condition racial gaps were consistently statistically insignificant.

[Table: racial gaps by condition, from Fagan and Holland (2002)]

Obviously, this lack of significance shouldn’t be taken to mean much of anything given the sample sizes we are dealing with. In any case, these results are consistent with the trained condition being a test of simple memory ability.

To determine whether their samples were representative, Fagan et al. administered the PPVT-R to a new sample of 97 students (67 whites and 26 blacks). They found that “In the present study, the IQs of the 67 Whites averaged 98.1 (S.D. 12.4, range 67 to 131) and the IQs of the 26 Blacks averaged 82.6 (S.D. 13.7, range 62 to 112), a difference of 15.5 IQ points (t = 5.3, df 91, P < .0001).” These group means are a bit low, but the racial gap is indeed representative of the general population. This finding cannot, however, be taken to demonstrate that the previous four experiments utilized a representative sample, because you cannot determine the representativeness of these populations using sample sizes of 26 and 67. Even if this specific sample was representative, the sample is so small that the difference between this sample and one which was significantly unrepresentative would obviously not itself be statistically significant. Consequently, we lack justification for thinking that the first four experiments used a representative sample and, given that the participants all attended the same colleges, we have some reason to doubt their representativeness until better evidence is provided.
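To make the sample size point concrete, here is a minimal sketch that recomputes the reported t statistic from the quoted means, SDs, and group sizes, and then puts an approximate 95% confidence interval around the 15.5 point gap. The pooled-variance t test and the normal approximation for the interval are my assumptions; the paper does not say exactly how its test was computed.

```python
import math
from statistics import NormalDist

def pooled_t(mean1, sd1, n1, mean2, sd2, n2):
    """Student's t for two independent groups, from summary statistics."""
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / df
    se = math.sqrt(pooled_var * (1 / n1 + 1 / n2))  # standard error of the gap
    return (mean1 - mean2) / se, se, df

# Summary statistics quoted from Fagan and Holland (2002)
t, se, df = pooled_t(98.1, 12.4, 67, 82.6, 13.7, 26)

# Approximate 95% interval (normal approximation, an assumption of this sketch)
z = NormalDist().inv_cdf(0.975)
gap = 98.1 - 82.6
ci_low, ci_high = gap - z * se, gap + z * se

print(f"t({df}) = {t:.2f}")
print(f"gap = {gap:.1f} IQ points, approx. 95% CI = [{ci_low:.1f}, {ci_high:.1f}]")
```

Under these assumptions the interval runs from roughly 10 to roughly 21 IQ points, so a sample of this size cannot distinguish a typical gap from one that is substantially smaller or larger.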

Clearly, all of my general criticisms concerning sample size, sample representativeness, the complexity of the tests, and the inability to actually test hereditarianism, apply to this paper.

In their paper, Fagan and Holland also try to evidence that IQ tests are racially biased. By biased, I mean that the same IQ score does not indicate the same underlying level of intelligence for blacks and whites. Fagan et al. report that they compared 57 blacks and 87 whites who had roughly the same scores on the general vocabulary test and found that, in this same group of individuals, blacks scored higher than whites on the new word vocabulary test. This is not a compelling demonstration that IQ tests are racially biased for at least three reasons:

  1. The sample size is too small to justify any conclusion.
  2. Neither test is an adequate measure of IQ or general intelligence.
  3. Ignoring 1 and 2, this would evidence that lack of exposure to information is a cause of the black-white IQ gap rather than evidencing that vocabulary tests don’t measure the same thing in both groups.

Fagan and Holland (2007)

Let’s turn now to the various experiments reported on in Fagan and Holland (2007).

In experiment one (n=77 college students), participants were either tested on their ability to recognize faces to which they had recently been exposed for the first time or tested on their knowledge of vocabulary words taken from the PPVT. The black-white gap in PPVT score was 1.1 SD. Surprisingly, we are not told what the racial gap in simple memory ability is. Instead we are just told that a regression model using race and native language status to predict memory ability is statistically insignificant, while a model using PPVT scores achieves significant predictive validity. No regression output is displayed, so we cannot see what the beta coefficients are. Instead, the authors give the following verbal report of the analysis: “Specifically, a stepwise multiple regression in which the independent variables of race, native language, and PPVT-R scores were used to predict recognition memory ability revealed that knowledge (PPVT-R scores) was the only significant predictor of processing ability (recognition memory) with a Multiple R of .36 (F(1/71)=10.4, P<.002) and a beta value of .36 (t=3.2, P<.002). If PPVT-R scores are omitted from the analysis, and race and native language are the only independent variables employed to predict information processing, no significant variance is obtained (F(2/70)= 1.9, P>.15).” Given the sample size of 77, this is obviously totally inappropriate. In sum, no attempt was made to show that this sample was representative, the sample is too small to test anything useful, and this is very obviously just a simple memory test and not a test of general intelligence.

In experiment two, Fagan et al. gave 65 community college students one of two tests: one measured whether they understood various sayings requiring specific knowledge based on past exposure to specific information, and the other measured whether they understood sayings which required only general knowledge. They give knowing the meaning of “an apple a day keeps the doctor away” as an example of general knowledge and knowing that “home of the bean and the cod” refers to Boston as an example of specific knowledge.

Unlike in experiment one, this time we are given the results of each test by race: “comprehension based on specific knowledge on the part of the Whites at 65% correct (S.D. 20.7) was superior, t (35)=2.7, P<.01, to the comprehension of the African Americans at 48% (S.D. 10.0). Thus, when opportunity for exposure to information is allowed to vary, Whites are more apt to know the meanings of sayings than are African-Americans. But the same was not true when opportunity for information about the meanings of the sayings was generally available. Specifically, The performance of the Whites at 72% correct (S.D. 16.5) was, if anything, somewhat inferior to that of the African Americans at 80% correct (S.D. 13.1), although not significantly so, t(26)=1.4.”

Speaking of their second experiment from their 2007 paper, they write that “Whites and African-Americans, also matched for PPVT-R scores, differed significantly in general comprehension, t(16)=2.0, P<.03, one-tailed test. The African-Americans at 85.4% (S.D. 8.2, N=9) were superior to the Whites at 75.7% (S.D. 12.4, N=9).” Of course, a sample size of 9 per group makes this worthless. A one-sided significance test was presumably used because the difference wouldn’t have been significant otherwise. And if we ignored all that, we’d just conclude that African Americans score better on their test of “general comprehension” than they do on IQ tests, which is something we already knew.
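The point about the one-sided test can be checked directly from the quoted summary statistics. The sketch below recomputes t with a pooled-variance formula (my assumption about how the authors computed it) and obtains the tail probability of the t distribution by numerically integrating its density, so no external libraries are needed.

```python
import math

def pooled_t(mean1, sd1, n1, mean2, sd2, n2):
    """Student's t for two independent groups, from summary statistics."""
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / df
    se = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (mean1 - mean2) / se, df

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)

def upper_tail(t, df, steps=2000):
    """P(T > t) for t >= 0, using Simpson's rule on [0, t] (steps must be even)."""
    h = t / steps
    total = t_pdf(0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 - total * h / 3

# Summary statistics quoted from experiment two of Fagan and Holland (2007)
t, df = pooled_t(85.4, 8.2, 9, 75.7, 12.4, 9)
p_one = upper_tail(t, df)
p_two = 2 * p_one

print(f"t({df}) = {t:.2f}, one-tailed p = {p_one:.3f}, two-tailed p = {p_two:.3f}")
```

With 16 degrees of freedom the two-tailed 5% critical value is about 2.12, so a t of roughly 2.0 clears the conventional threshold only when the test is one-tailed.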

This experiment was largely repeated in experiment three, where 68 community college students took a test on both sorts of sayings. They found that “comprehension based on specific knowledge on the part of Whites (a mean of 16.2 items correct, S.D. 3.4) was superior, t(84)=2.6, P<.01) to that of Blacks (13.9 items, S.D. 3.3) while general comprehension was equal for Whites and Blacks at 17.8, S.D. 3.1 and 17.1, S.D. 3.3, respectively, t(84)=0.8.”

Again, the sample size makes this work largely worthless. And understanding a saying is generally no more complicated than understanding a word, meaning this measure still lacks any cognitive complexity. If we set the sample size issue aside, we might say that this is evidence that blacks and whites are equally likely to learn the meanings of sayings that are common to their environments, but that is not the same thing as saying that they are equally intelligent when given equal opportunities to become smart, because knowing sayings is not the same as intelligence.

Fagan and Holland actually (seemingly unknowingly) prove this point. They claim that both of their sayings metrics correlate at 0.53 with IQ. This means they share about 28% of their variance with IQ (0.53 squared) and are primarily a measure of something other than the variety of cognitive abilities, and general intelligence, measured by IQ tests.
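The shared variance figure is just the squared correlation. A minimal sketch of the arithmetic, along with the correlation two measures would need in order to share half of their variance:

```python
import math

def shared_variance(r):
    """Proportion of variance two measures share, given their correlation r."""
    return r ** 2

r_sayings_iq = 0.53  # correlation reported by Fagan and Holland (2007)
print(f"r = {r_sayings_iq}: shared variance = {shared_variance(r_sayings_iq):.1%}")

# Correlation required for two measures to share a majority of their variance
r_needed = math.sqrt(0.5)
print(f"r needed for 50% shared variance = {r_needed:.3f}")
```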

From their third experiment: “A sample of 16 African-Americans and 17 Whites were closely matched on their comprehension of sayings requiring specific knowledge with mean scores of 60% (13.9 items correct out of 22, S.D. 3.3) and 62% (13.6 items, S.D. 2.9), respectively. Again, as in the second experiment, on general comprehension, the African-Americans were superior to the Whites, t(31)=2.1, P<.05, with 77.3% correct (17.1 items correct, S.D. 3.3) for the African-Americans and 66% correct (14.6 items, S.D. 3.4) for the Whites.” Again, the sample size is a joke. It is also obvious that if you have two correlated tests, and whites do better on one while both races do the same on the other, then comparing people from each race who have the same score on the former will produce a gap favoring blacks on the latter. This will be true, all else being equal, for any two such tests and is not evidence of test bias. (Even more obviously, it is not evidence that IQ tests are biased, since IQ tests are not being utilized in this experiment!)
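The matching artifact described above can be illustrated with a small simulation. Everything in this sketch is hypothetical: two groups labeled A and B, a within-group correlation of 0.5 between tests X and Y, a 0.7 SD gap on X, no gap on Y, and a matching band of 0.1 to 0.6 on X. None of these values come from the paper; they are chosen only to show the direction of the effect.

```python
import random
from statistics import mean

random.seed(0)
R = 0.5        # assumed within-group correlation between tests X and Y
GAP_X = 0.7    # assumed group gap on test X (SD units); test Y has no gap
N = 20000      # simulated people per group

def simulate(mu_x):
    """Return (x, y) pairs; y depends only on the within-group part of x."""
    people = []
    for _ in range(N):
        dev = random.gauss(0, 1)
        y = R * dev + (1 - R**2) ** 0.5 * random.gauss(0, 1)
        people.append((dev + mu_x, y))
    return people

group_a = simulate(GAP_X)  # higher mean on X, same mean on Y
group_b = simulate(0.0)

def matched_y(people, lo=0.1, hi=0.6):
    """Y scores of people whose X score falls inside the matching band."""
    return [y for x, y in people if lo <= x <= hi]

overall_gap_y = mean(y for _, y in group_b) - mean(y for _, y in group_a)
matched_gap_y = mean(matched_y(group_b)) - mean(matched_y(group_a))

print(f"Y gap (B minus A), everyone:     {overall_gap_y:+.2f}")
print(f"Y gap (B minus A), matched on X: {matched_gap_y:+.2f}")
```

In this setup the two groups have equal Y means overall, yet among people matched on X the group with the lower X mean comes out ahead on Y. Matched members of group A sit below their own group's X average and matched members of group B sit above theirs, and Y tracks that within-group position.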

Moving on, in experiment four, 223 college students (130 from a private university and 88 from a community college) were given tests of their understanding of sayings, similarities, and analogies. To test knowledge of sayings, participants were given multiple choice questions directly about what sayings meant. In the similarities test, participants were given two words and then asked to select what they shared in common from several possible answers. For analogies, participants were given questions of the form “X is to Y as Z is to what?” along with several possible answers to choose from.

Of the 223 participants, 179 were given these tests in the normal way and 44 were given the test based on nonsense words which they learned the meaning of just prior to the testing. In the group tested in the normal way, a 0.75 SD gap emerged favoring whites. In the group working with novel words, there was a statistically insignificant 0.20 SD gap favoring whites.

Much can be said of this experiment. First, the move to a sample largely from a private university makes it virtually certain that this sample is not representative with respect to IQ. Secondly, because only 44 participants were in the novel word condition, this finding is basically worthless. Thirdly, the gap that was found in the novel word condition, 0.20 SD, is practically significant. Because of the small sample size, it did not differ in a statistically significant way from zero, but it also doesn’t differ in a statistically significant way from a much larger effect size than 0.20. To say this is evidence of racial equality on this measure is to simply misunderstand the relevant statistics.
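To put rough numbers on this, here is a sketch of an approximate 95% confidence interval for a standardized gap of 0.20 with 44 participants. The paper's per-group counts for this condition are not given here, so the even 22/22 split is purely a hypothetical assumption, and the standard error uses the common large-sample approximation for Cohen's d.

```python
import math
from statistics import NormalDist

def d_confidence_interval(d, n1, n2, level=0.95):
    """Approximate CI for a standardized mean difference (large-sample formula)."""
    se = math.sqrt((n1 + n2) / (n1 * n2) + d**2 / (2 * (n1 + n2)))
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return d - z * se, d + z * se

# 0.20 SD gap reported for the novel-word condition; 22/22 split is hypothetical
low, high = d_confidence_interval(0.20, 22, 22)
print(f"d = 0.20, approx. 95% CI = [{low:.2f}, {high:.2f}]")
```

Under these assumptions the interval covers zero but also covers the 0.75 SD gap found in the normally tested group. An uneven split between the two groups would make the interval wider still.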

This experiment does improve on past designs in that the analogies test introduced a small degree of cognitive complexity into the test. Of course, this test also included a good deal of simple memory testing. It may be that the gap was significantly larger on, for instance, the analogies test than the sayings test. Unfortunately, the results are not given to us broken down by ability so we can’t tell.

Participants in experiments 3 and 4 were also given various other cognitive tests from which a G factor score was extracted. To determine their relation to general intelligence, Fagan and Holland’s tests should have been included in the set of tests from which G was extracted. We could then see the G loading of each of their tests. For unknown reasons, this is not what was done.

Instead, for each of the two experiments, they built a model predicting G scores using race, age, sex, educational attainment, family size, birth order, general knowledge, and specific knowledge. The total model explained 38% of the variance in G scores in both experiments. General knowledge had (standardized) beta values of 0.38 and 0.30 in these models while specific knowledge had beta values of .24 and .37. So in this model both general and specific knowledge didn’t share the vast majority of their variance with G.

In neither model was race a significant predictor of G, but that doesn’t tell us anything useful given all the other variables included in the model and the small sample sizes involved.

The fourth experiment also involved another attempt to evidence test bias. They write “a sample of 45 African-Americans and 49 Whites were closely matched for their knowledge of sayings, similarities, and analogies based on specific information with mean z scores of −.08, S.D. 0.71 and mean z=−.06, S.D. .82, respectively, t(92)=−0.1. Given such equality, the African-Americans were now greater than the Whites, t(94)=5.0, P<.0001, in knowledge based on general information with a mean z for the African-Americans of .29 (S.D. .81) and a mean z for the Whites of −.62 (S.D. .92). This z difference of .91 would translate into an IQ advantage of about 13.7 points for these particular African-Americans over these particular Whites.”

This comparison lacked an adequate sample, in terms of both size and representativeness. It just tells us that if you control for a large group gap, a small group gap in a correlated variable flips direction, which is obvious. And since neither measure correlated especially strongly with G as measured by normal intelligence tests, this of course tells us nothing about whether IQ tests are biased.

Fagan (2008)

Let’s now turn to Fagan (2008), a paper which reported on two relevant experiments.

In experiment 1, the sample consisted of 484 students, most of whom attended a private university. These participants were given three tests measuring their ability to learn the meanings of new words, sayings, and analogies.

With respect to new word meanings, participants were first allowed to infer the meaning of these words through seeing them used in sentences. Importantly, these sentences made the meanings of the terms very clear and there is no indication that this inference was ever meant to be difficult. For this reason, we don’t have good grounds for supposing that there is much difference between this and simply giving them the definitions of the words. An example can be seen below.

After studying the meaning of the new words, participants were given a simple test measuring how many definitions they had been able to memorize in the study time given. On its face, this looks like a test of rote memorization.

To learn the meanings of sayings participants were simply given explicit definitions of sayings to study.

They were then tested, via multiple choice questions, to see how many sayings they could remember the meaning of. This is obviously mostly a test of rote memorization.

In the case of analogies, participants were first given the definitions of sets of new words to study.

Then they were tested to see whether they could discern the analogy between these two words and two familiar words. This test is partly a function of simple memory ability, but also probably measures the ability to think in terms of analogy as well.

In the sample, the race gap on Fagan’s “new knowledge” tests was insignificant. Moreover, this new knowledge test correlated at 0.58 with SAT-V scores, and at 0.72 after correcting for measurement error. Granted, verbal intelligence is only one aspect of full scale IQ, so this test probably does not correlate highly enough with FSIQ (about 0.70 or more) to share the majority of its variance with it. But this test is clearly a better measure of intelligence than were some of the simpler tests Fagan was using earlier, probably because it is partly a measure of thinking in analogies rather than just simple memory ability.
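For readers unfamiliar with “correcting for measurement error”, the sketch below shows the standard Spearman correction for attenuation, which I am assuming is the correction used. The reliabilities of the two tests are not given here, so rather than invent them the code backs out the product of reliabilities implied by the two reported correlations.

```python
import math

def disattenuate(r_observed, rel_x, rel_y):
    """Spearman's correction: estimated correlation between error-free scores."""
    return r_observed / math.sqrt(rel_x * rel_y)

r_obs, r_corr = 0.58, 0.72  # observed and corrected values reported by Fagan (2008)

# Product of the two reliabilities implied by the reported pair of correlations
implied_rel_product = (r_obs / r_corr) ** 2

# Round trip: applying the correction with that product recovers 0.72
check = disattenuate(r_obs, implied_rel_product, 1.0)

print(f"implied reliability product = {implied_rel_product:.3f}")
print(f"corrected r = {check:.2f}, shared variance = {r_corr ** 2:.1%}")
```

Even after correction, a correlation of 0.72 corresponds to only a little over half of the variance being shared with SAT-V.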

However, in this same sample the racial SAT score gap was only around 0.33 SD, meaning this sample was not representative of the normal cognitive gap seen between races. This finding can be explained entirely by the combination of the test still partly being a function of simple memory and the elite nature of the sample, in which normal racial gaps in IQ are not present to begin with.

In experiment two, the sample consisted of 696 students, of whom 153 were minorities. These students mostly went to community college, meaning they are a less elite sample than the one used in experiment one. The tests given were the same as in experiment one. Once again, this test correlated quite well with SAT-V scores (0.63 / 0.79 after correction).

As for the results, now that we are using a less elite (but still not necessarily representative) sample with a test that highly correlates with IQ… we are not given the racial difference in any test given. Instead, we are told that race was not predictive of new learning ability once SAT scores were controlled for: “The regression analysis yielded a multiple R of .63, F (2,690) = 231.3, p < .001, with Beta values of .03 (t = 0.9, n.s.) and .63 (t = 21.0, p < .001) for minority status and brief SAT scores, respectively, for the prediction of new learning.”

Of course, this is hardly surprising since the SAT-V test and the new knowledge test share most of their variance! No theory about racial IQ gaps predicts that there will be large racial gaps in one cognitive test after controlling for another cognitive test that highly correlates with it. In fact, there is evidence suggesting that if you control for general intelligence, minorities have roughly the same working memory scores as whites and score higher than whites on tests of long term memory (Frisby et al., 2017).

For all these reasons, this paper gives us no good reasons to doubt a genetic model of group differences in IQ.

Fagan and Holland (2009)

The final paper to consider in this literature is Fagan and Holland (2009).

The sample consisted of 633 college students who were enrolled in undergraduate psychology courses. This included only 121 minorities. Also, it is worth noting that Asians were considered to be part of the “majority” even though they are literal minorities.

As in previous experiments, the participants were tested on their understanding of a novel set of words, sayings, and similarities. For the tests on sayings and similarities, participants were given definitions of new words or sayings to study and then later asked multiple choice questions about what the sayings meant or what two words had in common. In the case of novel words, participants studied sentences that clearly demonstrated the word’s definitions without being given explicit definitions and were then tested on their understanding of the word’s meanings. Scores on all three measures were then combined into a single measure of “new learning ability”.

Participants were also again given a subset of the verbal section of the SAT. Their new learning score correlated at 0.66 with this SAT-V score prior to correcting for measurement error and 0.83 after said correction.

Regarding the relationship between race and new learning ability, we once again are not told anything useful or interesting. As Fagan did in his 2008 paper, they just say that race doesn’t predict new learning ability once SAT scores are controlled for: “Past knowledge (i.e. brief SAT scores), but not race, was expected to be related to the ability to process new information. The results were as predicted. A regression analysis yielded a multiple R of .65, F (2/625) = 230.9 P<.0001, with Beta values of .04 (t= 1.3, P>.18) and .65 (t=21.0, P<.0001) for majority–minority status and brief SAT scores, respectively, for the prediction of new learning.” As I’ve already explained, results like this give us no reason to doubt any theory of racial IQ differences.

Final Thoughts

From the outset, this line of research has obvious theoretical problems stemming from the fact that it can’t test what the causes of racial differences in information exposure are. As I’ve tried to show, a close look at the research also reveals many problems in terms of sample size, sample representativeness, and the strategy of data analysis undertaken. As a whole, this line of research cannot justify thinking that IQ tests are racially biased or that information exposure is an important cause of racial IQ differences.

One thought on “On Fagan and Holland’s Culture Fair Tests of Intelligence”

  1. One thing I think you should have covered in this essay is the fact that no attempt was made to make these studies’ results psychometrically interpretable. What I’m talking about is the fact that they didn’t employ Measurement Invariance testing or Latent Variable Modeling at all. They just made a (very bad) test, gave it to some people, and then concluded that because there were no racial differences in their 100% unbiased test, the IQ test must be biased somehow. It isn’t in the statistically relevant and real sense of the term “bias” (Borsboom 2006). Other than that, you did an excellent analysis here.

