Friday, 5 October 2007

Sex and IQ

So, if I accept that genetically driven differences in IQ are possible, why do I not trust the evidence?

Well I'm not a psychometrician, but I have looked into the data a few times, and been far from impressed. Let's take a study which received some considerable press:

"A study to be published later this year in the British Journal of Psychology says that men are on average five points ahead on IQ tests.
...
Their research was based on IQ tests given to 80,000 people and a further study of 20,000 students."
Naturally the paper hadn't yet been published, but I deliberately sought it out once it was. It is the "study of 20,000 students":

Irwing & Lynn (2005). British Journal of Psychology 96(4): 505-24.

A meta-analysis is presented of 22 studies of sex differences in university students of means and variances on the Progressive Matrices. The results disconfirm the frequent assertion that there is no sex difference in the mean but that males have greater variability. To the contrary, the results showed that males obtained a higher mean than females by between .22d and .33d, the equivalent of 3.3 and 5.0 IQ conventional points, respectively. In the 8 studies of the SPM for which standard deviations were available, females showed significantly greater variability (F(882,656)=1.20, p<.02), whilst in the 10 studies of the APM there was no significant difference in variability (F(3344,5660)=1.00, p>.05).
Richard Lynn is a stalwart of the IQ and race/sex field, and the credibility of his work goes to the heart of the matter.

This paper has been critiqued elsewhere, but there are a few fundamental aspects of the paper that really strike a scientist coming to this from outside the field, and I want to talk about a few of them.

Where does the "five points ahead on IQ tests" figure come from? Well it comes from taking the standard deviation normalised IQ difference between males and females determined by the meta-analysis (0.31), and multiplying it by what is generally taken as the population IQ standard deviation (15 points) to estimate a 4.65 point difference between men and women.

It is always a bit dodgy taking a difference detected in your study sample and then extrapolating your effect size out into the general population. There are good reasons to think that the standard deviation in the sample is less than the general population (since university students are selected to some degree), but we can get an idea of how bad that idea is by looking at the papers involved in the study where the standard deviations are reported (note the scores given are on the progressive matrices, not IQ scores, my understanding is that you need to approximately double the values to get the IQ difference) - one study reports actual IQ standard deviations of around 10 points, and since the largest study doesn't have standard deviations reported, looking at the next two largest studies, these also have values around 10.

So how did they get this 0.31 figure? Well the first thing they do is exclude half of all the subjects as an 'outlier'. We can see in the list of studies that as well as a large number of small studies, there is a large Mexican study with 45% of the total number of subjects in the whole meta-analysis. But it only showed a male advantage of .06 of a standard deviation (that's about 1 IQ point assuming standard deviation of 15). It isn't quite clear how this study can be an outlier if it contains half of all the subjects.

Ok, we've now upped our estimate of the mean difference between men and women from .14 (95% CI .11-.27) including the Mexican study (which we've moved from a 1.4 IQ point difference to a 2.1 point difference by assuming a standard deviation of 15 rather than 10 points), to .21 (95% CI .18-.28) by excluding the Mexican study. But that isn't 5 IQ points yet, we've only got to 3!
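To make the arithmetic explicit, here is a minimal sketch of the conversion from a standardised effect size to IQ points (points = effect size x standard deviation). The effect sizes are the ones quoted in this post; the two standard deviations are the conventional population value of 15 and the rough value of 10 seen in the student samples. The function and variable names are my own.

```python
# Convert a standardised effect size (d) into IQ points: points = d * SD.
# Effect sizes are those quoted in the post; the SDs are assumptions
# (15 = conventional population SD, 10 = rough SD in the student samples).

EFFECT_SIZES = {
    "mean, incl. Mexico": 0.14,
    "mean, excl. Mexico": 0.21,
    "median": 0.31,
}
ASSUMED_SDS = (15.0, 10.0)


def d_to_iq_points(d: float, sd: float) -> float:
    """Scale a standardised difference by the assumed IQ standard deviation."""
    return d * sd


for label, d in EFFECT_SIZES.items():
    for sd in ASSUMED_SDS:
        points = d_to_iq_points(d, sd)
        print(f"{label:20s} d={d:.2f}  SD={sd:.0f}  ->  {points:.2f} IQ points")
```

The same effect size yields a headline figure half as large again purely through the choice of standard deviation, before any question of which studies to include.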

So now we need to do something really bad, instead of weighting the studies by sample size (because, you know, tiny studies are crapper, have much higher variance, and are much more likely to be positive and have a larger effect size due to publication bias, and because that is just how you estimate overall effect size when combining together results from studies of differing sample size) we'll just look at median effect size. That's right, instead of weighting all the results by how many subjects there were in each study we're going to line up all the studies in order of effect size, don't worry about how many subjects were in each one, and find the study in the middle - that's our effect size. I don't think we need to justify this approach at all, let's just do it and report all our results in that form. Way hey, that gives us .31 of a standard deviation difference, that's 4.65 IQ points if we assume standard deviation of 15, that's practically 5 IQ points - go men!
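The gap between a size-weighted mean and an unweighted median can be sketched as follows. The study sizes and effect sizes below are HYPOTHETICAL, invented only to mimic the pattern of one huge study with a small effect plus many small studies with large effects; they are not the paper's data. Weighting by raw sample size is also a simplification of the usual inverse-variance weighting.

```python
from statistics import median

# HYPOTHETICAL (sample size, effect size d) pairs, for illustration only.
# These are NOT the values from Irwing & Lynn's table.
studies = [
    (9000, 0.06),
    (1300, 0.10),
    (850, 0.15),
    (300, 0.30),
    (170, 0.35),
    (125, 0.45),
    (110, 0.55),
]


def weighted_mean_effect(pairs):
    """Mean effect size with each study weighted by its sample size."""
    total_n = sum(n for n, _ in pairs)
    return sum(n * d for n, d in pairs) / total_n


def median_effect(pairs):
    """Median effect size, ignoring how many subjects each study had."""
    return median(d for _, d in pairs)


print(f"weighted mean d = {weighted_mean_effect(studies):.3f}")
print(f"median d        = {median_effect(studies):.3f}")
```

In this made-up example the median is more than three times the weighted mean, because the median gives the study of 110 people the same vote as the study of 9,000.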

As Steve Blinkhorn points out:

"The ten studies with estimated differences above the median cover a total of only 2,591 participants, whereas the ten studies with differences below the median account for 15,735 participants — the four largest differences come from samples of 111, 173, 124 and 300, the four smallest from samples of 844, 172, 9,048 and 1,316. Choosing to use the median is a flawed and suspect tactic."
Now we need to put the icing on the cake, let's make an outrageous claim that is contradicted by our own data:

"These results are clearly contrary to the assertions of a number of authorities including Eysenck (1981), Court (1983), Mackintosh (1996, 1998a, 1998b) and Anderson (2004, p. 829). These authorities have asserted that there is no difference between the means obtained by men and women on the Progressive Matrices. Thus, the tests 'give equal scores to boys and girls, men and women' (Eysenck, 1981, p. 41); 'there appears to be no difference in general intelligence' (Mackintosh, 1998a, ); and 'the evidence that there is no sex difference in general ability is overwhelming' (Anderson, 2004, p. 829). Mackintosh in his extensive writings on this question has sometimes been more cautious, e.g. 'If I was thus overconfident in my assertion that there was no sex difference… if general intelligence is defined as Cattell's Gf, best measured by tests such as Raven's Matrices… then the sex difference in general intelligence among young adults today …is trivially small, surely no more than 1-2 IQ points either way' 1998b, p. 538). Contrary to these assertions, our meta-analyses show that the sex difference on the Progressive Matrices is neither non-existent nor 'trivially small' and certainly not '1-2 IQ points either way', that is, in favour of men or women. Our results showing a 4.6 to 5 IQ point advantage for men is testimony to the value of meta-analysis as compared with impressions gained from two or three studies."
That's right, even though the correct analysis of our data shows a 1.4 IQ point advantage for men let's claim that anyone suggesting a difference of '1-2 IQ points either way' is totally wrong and that only our completely dodgy analysis is the correct interpretation. One in the eye for you hairy lesbian feminists!

Naturally, no attempt is made to estimate publication bias; it is just asserted that there cannot be a file drawer effect because none of the studies was directly comparing male and female IQ in their primary study design. Blinkhorn again:

"My own file drawer turned out to contain an analysis of data from...the advanced matrices...This yielded an advantage of 0.07 standard deviations for females. The sample is larger than all but five of those found by Irwing and Lynn."

In my own research, if I don't detect a difference between men and women (you ought to check) then I probably wouldn't report the data split by gender - but of course this immediately introduces a publication bias - as only those studies where a difference has been found will have data suitable for including in a meta-analysis - and thus any effect will be overestimated.
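A rough simulation of this selective reporting effect is sketched below. Everything in it is an assumption chosen for illustration: a small true effect of 0.1, 50 subjects per group, a normal approximation to the sampling distribution of the effect size, and a rule that a study only reports data split by sex when the difference is nominally significant.

```python
import random


def simulate_selective_reporting(true_d=0.1, n_per_group=50,
                                 n_studies=5000, seed=0):
    """Compare the mean effect over all studies with the mean over only
    those studies that found a 'significant' sex difference.

    All parameters are illustrative assumptions, not estimates from
    any real literature.
    """
    rng = random.Random(seed)
    # Approximate standard error of d for two equal-sized groups.
    se = (2.0 / n_per_group) ** 0.5
    observed = [rng.gauss(true_d, se) for _ in range(n_studies)]
    # Only studies with |d| beyond the nominal 5% threshold report by sex.
    reported = [d for d in observed if abs(d) > 1.96 * se]
    mean_all = sum(observed) / len(observed)
    mean_reported = sum(reported) / len(reported)
    return mean_all, mean_reported, len(reported)


mean_all, mean_reported, n_reported = simulate_selective_reporting()
print(f"mean d over all studies:      {mean_all:.3f}")
print(f"mean d over reported studies: {mean_reported:.3f} ({n_reported} studies)")
```

Under these assumptions the studies that end up usable in a meta-analysis overstate the true effect several-fold, even though no individual researcher did anything wrong.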

If these are the kind of shenanigans people like Lynn can get up to right in front of our eyes, then what's going on behind the scenes? I cannot trust the data of these people because I do not respect them as scientists.

[interestingly, this study did not support the claim that men have a higher standard deviation in IQ scores than women - which is often posited to contend that while men and women may have equal mean IQs, there are more men in the extremes of the distribution, thus making more 'geniuses' men. Presumably they'd have tried harder to push that result if they hadn't been so happy with their "five points ahead on IQ tests" figure.]


UPDATE
Irwing and Lynn have replied to Blinkhorn's criticisms, and consequently many of mine, here, and Blinkhorn replies (thanks to potentilla in the comments). I'm unimpressed by their excuses but I'll repeat them here. They start off by saying:

"We believe that the principal error of Blinkhorn’s criticism is that he does not consider our result in the context of several other studies showing that adult males have an IQ advantage of around 4–6 IQ points."
That is disingenuous to say the least. Blinkhorn criticises the study's methodology; you cannot resort to the results of other studies to support your conclusions. The study must stand or fall on its own merits.

They go on to say:

"Blinkhorn criticizes us for not adopting the principle of weighting results by sample size, and for excluding the very large study from Mexico. This misses a central point of meta-analysis. We carried out a number of tests for moderator variables (factors that cause under- or overestimates of the sex difference) and found strong evidence for two: these were the type of test and the tendency of some universities selectively to recruit either brighter men or brighter women. In the presence of strong moderators, many of the studies in the sample provide biased estimates of the sex difference in IQ score. It is clear from the box plot (Fig. 1) that the Mexico results conform to estimates from the most male-biased samples, which provide substantial underestimates of the sex difference in IQ. Given the strong probability of bias in this sample, to weight it by its sample size (9,048) would risk a serious underestimate of the population sex difference in IQ. For this reason, we followed the advice of a definitive article on meta-analysis (ref. 10) and took the median of estimates, including Mexico, which equated to 4.6 IQ points." [PJ - note that the median is unaffected by inclusion or exclusion of the Mexico study since it is nearer the low extreme values and only the value of the middle study affects the median]
Now I confess, I don't quite see what they are saying here. The talk of sex-selection refers to studies which found either higher variance in males or females - they hypothesise that these differences in variance are the result of greater selection for males or females (lower variance, smaller standard deviations, means that the gender was more highly selected at that university, and thus shows less variance). But, and this is a fucking great big but, the Mexican study doesn't report variance by gender - what Irwing & Lynn are saying is that if you look at the figure the effect size for the Mexico study is more like the studies with a 'pro-female' selection bias (this is difficult to figure out as an analysis because this paper lists 10 studies, whereas only 6 studies were identified as pro-female in the original paper***) i.e. they are asserting that it had pro-female bias without any evidence that this was actually so. As Blinkhorn says:

"Their argument here is circular: the sex difference is vanishingly small compared with their sample of smaller, less representative groups, so therefore there must be a bias."
But, of course, they aren't just excluding the Mexico study (note no other studies are excluded as outliers), sticking with means still gives a lower estimate of effect size than Irwing and Lynn report even without the Mexico study - so taking the median makes a big difference. They claim this is necessary due to heterogeneity in the sample but this is inappropriate given the range of study sizes from 30 subjects to nearly 10,000.

Looking at my plot of log transformed variance ratios* against effect sizes suggests that there may be a relationship between effect size and the difference in standard deviations between men and women (interpreted by Irwing & Lynn as due to differing selectivity of universities), although the regression is only significant at alpha=.05 if you exclude the far right value**. They could easily have incorporated this relationship into their analysis and run a regression model if they were worried about this effect. Look at the point of no difference in variance (when standard deviations are the same, log ratio 0.0): the effect size is about .2. We can see from this that their overall sample is actually biased towards male-selective studies (reflected in their estimate of 6 female-selective, and 13 male-selective), so by Irwing & Lynn's own interpretation their overall sample is biased in favour of studies where men are over selected and thus biased against women! This regression line doesn't take into account study size and only includes studies with available variance data, but if differential variance was the only effect at play here the effect size would still be smaller than the median estimate (and more like the mean exc. Mexico), as we can see from the estimated effect size (about .2, 2-3 IQ points depending on population variance) when there is zero difference in variance between men and women. This rather highlights why the median does not trump the mean when data is heterogeneous. Instead you have to explore the effect of modifier variables, as Rosenthal (who Irwing & Lynn reference as justification for using medians) says:
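The kind of regression model suggested above can be sketched as a least-squares fit of effect size on log variance ratio, weighted by study size, with the intercept read off as the estimated effect size when the variances are equal. The data points below are HYPOTHETICAL placeholders, not the values from the paper's table, and the ratio is assumed to be female variance over male variance.

```python
import math


def weighted_linear_fit(x, y, w):
    """Weighted least-squares fit of y = intercept + slope * x."""
    sw = sum(w)
    x_bar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    y_bar = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxy = sum(wi * (xi - x_bar) * (yi - y_bar) for wi, xi, yi in zip(w, x, y))
    sxx = sum(wi * (xi - x_bar) ** 2 for wi, xi in zip(w, x))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


# HYPOTHETICAL studies: (sample size, female/male variance ratio, effect size d).
# Invented for illustration; substitute the real table to repeat the analysis.
hypothetical = [
    (120, 0.80, 0.05),
    (300, 0.95, 0.15),
    (850, 1.00, 0.20),
    (170, 1.20, 0.30),
    (110, 1.50, 0.45),
]

x = [math.log10(ratio) for _, ratio, _ in hypothetical]
y = [d for _, _, d in hypothetical]
w = [n for n, _, _ in hypothetical]

intercept, slope = weighted_linear_fit(x, y, w)
print(f"effect size at equal variances (log ratio 0): {intercept:.2f}")
print(f"change in d per unit log10 variance ratio:    {slope:.2f}")
```

The intercept is the quantity of interest: it is the effect size you would expect from a study in which neither sex had been more heavily selected, on Irwing & Lynn's own reading of the variance ratios.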

"When several approaches to central tendency yield different results, the
reasons for such differences need to be explored."
But we know why they are discrepant, because so many studies were tiny, and by having more pro-male studies the median is almost guaranteed to fall in the high end of estimates because that is where the mid-point will fall.

With reference to generalising the findings to the population at large, they say:

"Many of Blinkhorn’s difficulties stem from his assumption that our focus was on university students. This makes little sense, because the IQ difference in students is dependent on which population is considered, whereas the sex difference in the general population, our actual focus of interest, is highly stable."

But, of course, this is no explanation. Their study was of university students, whatever their self-proclaimed focus, and they should at least have reported the IQ differences they actually found before generalising to a population for which they had no evidence.

* You need to log transform ratios to make them symmetrical - take women with a 10x greater variance than men, the same as men, or men having 10x the variance of women - straight ratios give you figures of 10, 1, and 0.1 but log (base 10) ratios give you 1, 0, -1. Think about how they'll look when plotted on a line, the raw ratio data will crowd the points where men have higher variance than women in between 0 and 1, while when women have a higher variance than men it will stretch from 1 to 100, or 1000, or infinity. But when the data is log transformed men having a higher variance is treated just the same as women having a higher variance, but with a negative sign - i.e. it is symmetrical.
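The symmetry described in this footnote is easy to check numerically. This is a minimal sketch using base 10 logs and the three ratios from the example (female variance divided by male variance); the variable and function names are my own.

```python
import math

# Female/male variance ratios from the footnote's example.
variance_ratios = [10.0, 1.0, 0.1]
log_ratios = [math.log10(r) for r in variance_ratios]

for ratio, log_ratio in zip(variance_ratios, log_ratios):
    print(f"ratio {ratio:>5}  ->  log10 ratio {log_ratio:+.1f}")


def log_ratio_is_symmetric(a: float, b: float) -> bool:
    """Swapping numerator and denominator only flips the sign of the log ratio."""
    return math.isclose(math.log10(a / b), -math.log10(b / a))


print(log_ratio_is_symmetric(3.0, 7.0))
```

On the raw scale the two unequal cases sit 9 and 0.9 away from "no difference"; on the log scale they sit the same distance either side of zero.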

** I've substituted a regression line that takes into account study size, and it looks a lot nicer than the original fit too. But just look at how the studies are skewed to the pro-male side, suggesting that there may be serious over estimation of the effect size - I don't think that was what Irwing & Lynn wanted us to conclude!

*** There's something funny going on here but I'm not sure what. How can the original paper have found 13 pro-male (I assume they exclude the study where the difference in standard deviation is only at the second decimal place), and 6 pro-female studies, consistent with the data in the table, yet their reply to Blinkhorn has 10 pro-male and 10 pro-female studies? It just makes me even more concerned about their methodology.



UPDATE 2
A reader points out the following from Lynn & Irwing (2004):

"The second kind of poor quality study consists of those with small sample sizes that are liable to produce anomalously large chance effect sizes that obscure the true relationship. Some meta-analysts ignore differences in sample sizes and accord all studies equal weight irrespective of sample size. This is reasonable for certain data sets where all the studies have about the same sample sizes. Where this is not the case, some meta-analysts deal with this problem by ignoring studies with samples below a certain size, while others weight the studies by the sample sizes. These two solutions amount to much the same thing because weighting by sample size dilutes and may effectively eliminate the contribution of studies with small samples. Where the meta-analyst has a number of large samples, the simplest procedure is to ignore small samples and confine the analysis to studies where sample sizes are considered acceptable." [my emphasis]

Hoist by their own petard methinks.

8 comments:

potentilla said...

And here is Blinkhorn's response to Irwing and Lynn's counter to his critique!

It could be that the race question is more difficult to dispose of, though.

potentilla said...

Oh, and their counter explains why they dealt with the Mexican study as they did and why they used the median.

(I am not trying to support either side here, just wandering bemused in the thickets of statistical analysis).

pj said...

Cheers, hadn't seen the Irwing and Lynn response (I always dislike these incredibly slow exchanges you get in science).

I'm not impressed by Irwing & Lynn: "We believe that the principal error of Blinkhorn’s criticism is that he does not consider our result in the context of several other studies showing that adult males have an IQ advantage of around 4–6 IQ points."

That is not a reply to the criticism, just a claim that they are right!

I'll update the post.

pj said...

potentilla, I actually haven't been trying to disprove the claim that women have lower IQ than men - merely to point out some of the methodological limitations that can lie behind these claims.

The race question is slightly different in that I think most people accept that there are differences in IQ scores between ethnic groups - the question is what is driving this.

I'm far from convinced at the moment that anyone has managed to control for social and economic differences adequately (linear regression really does have its limitations).

potentilla said...

Thank you, this is really interesting. I wish I had learned stats properly when I had the chance many years ago. I can often follow the logic of statistical arguments when someone sets them out, and sometimes put my finger on what seems to me to be a problematic argument, but I can't do the analysis myself ab initio. The reason I like GNXP is because they are very data-focussed.

The present govt (well, anyway, the Blair govt) drives me mad because it makes so little attempt to support its policy interventions with evidence. Stephen Law is currently engaged in doing much the same thing in his argument that private schools should be banned.

pj said...

Ben Goldacre had a piece in the guardian on evidence based social policy.

I've never understood why government is so resistant to using and generating evidence to base policy decisions on. My understanding, from talking to analysts working in government, is that policies are decided on first (for political and ideological reasons) and then analysts are tasked with finding evidence, or rather distorting evidence, to back that up.

Are you saying that Stephen Law is engaged in evidence free argument, or in evidence based argument?

potentilla said...

Evidence-free argument (or rather, to be fair, evidence-lite argument; he does cite something from the Sutton Trust IIRC, but it's much too little to support his contention that (a) children of rich parents are over-represented (statistically) in "top jobs" and (b) that a significant cause of this is the existence of the option of private schooling for the rich). I mean, both of those things may be true, but he has satisfactorily demonstrated neither of them.

I think one reason (apart from, as you suggest, ideological imperatives) that govts are resistant to evidence-based social policy is sheer ignorance. Few people learn about statistics at school, and even the ones who do probably mostly don't have its application to social policy emphasized to them. Even people with scientific training, dare I say it, don't necessarily make the connection.

It's one of the many things wrong with the current curriculum that people can end up with no idea why statistics are important (and how easily they can be manipulated).

potentilla said...

And thanks for the Ben Goldacre quote; I remember reading at the time, but had forgotten that he was actually suggesting using prospective RCTs for social policy interventions. Someone called btbLondon makes some good points in the comments about that. But too much policy-making, as far as I can see, isn't even based on a serious attempt to tease out the different variables involved and distinguish correlation from causation.

See, for instance, the recent MTAS fiasco, during which the govt changed to a new recruitment methodology for junior doctors based on practically no evidence as to its suitability versus the old method.