Pyjamas in Bananas: education

So, if I accept that genetically driven differences in IQ are possible, why do I not trust the evidence?

Well I'm not a psychometrician, but I have looked into the data a few times, and been far from impressed. Let's take a study which received some considerable press:

"A study to be published later this year in the British Journal of Psychology says that men are on average five points ahead on IQ tests.
...
Their research was based on IQ tests given to 80,000 people and a further study of 20,000 students."

Naturally the paper hadn't yet been published, but I deliberately sought it out when it was. It is the "study of 20,000 students":

Irwing & Lynn (2005). British Journal of Psychology 96(4): 505-24.

A meta-analysis is presented of 22 studies of sex differences in university students of means and variances on the Progressive Matrices. The results disconfirm the frequent assertion that there is no sex difference in the mean but that males have greater variability. To the contrary, the results showed that males obtained a higher mean than females by between .22d and .33d, the equivalent of 3.3 and 5.0 IQ conventional points, respectively. In the 8 studies of the SPM for which standard deviations were available, females showed significantly greater variability (F(882,656)=1.20, p<.02), whilst in the 10 studies of the APM there was no significant difference in variability (F(3344,5660)=1.00, p>.05).

Richard Lynn is a stalwart of the IQ and race/sex field, and the credibility of his work goes to the heart of the matter.

This paper has been critiqued elsewhere, but there are a few fundamental aspects of the paper that really strike a scientist coming to this from outside the field, and I want to talk about a few of them.

Where does the "five points ahead on IQ tests" figure come from? Well it comes from taking the standard deviation normalised IQ difference between males and females determined by the meta-analysis (0.31), and multiplying it by what is generally taken as the population IQ standard deviation (15 points) to estimate a 4.65 point difference between men and women.

It is always a bit dodgy taking a difference detected in your study sample and then extrapolating your effect size out into the general population. There are good reasons to think that the standard deviation in the sample is smaller than in the general population (since university students are selected to some degree), but we can get an idea of how bad that idea is by looking at the papers in the meta-analysis where standard deviations are reported. (Note that the scores given are on the Progressive Matrices, not IQ scores; my understanding is that you need to approximately double the values to get the IQ difference.) One study reports actual IQ standard deviations of around 10 points, and since the largest study doesn't report standard deviations, I looked at the next two largest studies, which also have values around 10.

So how did they get this 0.31 figure? Well, the first thing they do is exclude nearly half of all the subjects as an 'outlier'. We can see in the list of studies that, as well as a large number of small studies, there is a large Mexican study with 45% of the total number of subjects in the whole meta-analysis. But it only showed a male advantage of .06 of a standard deviation (that's about 1 IQ point assuming a standard deviation of 15). It isn't quite clear how this study can be an outlier if it contains nearly half of all the subjects.

Ok, we've now upped our estimate of the mean difference between men and women from .14 (95% CI .11-.27) including the Mexican study (which we've already moved from a 1.4 IQ point difference to a 2.1 point difference by assuming a standard deviation of 15 rather than 10 points), to .21 (95% CI .18-.28) by excluding the Mexican study. But that isn't 5 IQ points yet, we've only got to 3!
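To make the arithmetic explicit, here is a minimal sketch of the conversion from a standardised difference (d) to IQ points. The effect sizes are the ones quoted in this post; the function name is my own, and the standard deviation of 10 is only the rough figure I read off the student samples, so treat it as an assumption.

```python
def d_to_iq_points(d, sd):
    """Convert a standardised difference d into IQ points for an assumed SD."""
    return d * sd

# Effect sizes quoted in this post
effects = {
    "mean incl. Mexico": 0.14,
    "mean excl. Mexico": 0.21,
    "median": 0.31,
}

# SD of 15 is the conventional population value; SD of 10 is an assumption
# based on the student samples that report standard deviations.
for label, d in effects.items():
    for sd in (10, 15):
        points = d_to_iq_points(d, sd)
        print(f"{label}: d={d:.2f}, SD={sd} -> {points:.2f} IQ points")
```

The same d looks half as big again simply by swapping the assumed standard deviation from 10 to 15.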

So now we need to do something really bad, instead of weighting the studies by sample size (because, you know, tiny studies are crapper, have much higher variance, and are much more likely to be positive and have a larger effect size due to publication bias, and because that is just how you estimate overall effect size when combining together results from studies of differing sample size) we'll just look at median effect size. That's right, instead of weighting all the results by how many subjects there were in each study we're going to line up all the studies in order of effect size, don't worry about how many subjects were in each one, and find the study in the middle - that's our effect size. I don't think we need to justify this approach at all, let's just do it and report all our results in that form. Way hey, that gives us .31 of a standard deviation difference, that's 4.65 IQ points if we assume standard deviation of 15, that's practically 5 IQ points - go men!
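A small sketch of why this matters. The numbers below are made up for illustration and are NOT the Irwing & Lynn data; they just mimic the pattern of a few big studies with small effects and several tiny studies with large effects. The weighting is by raw sample size, as described above (a full meta-analysis would more usually weight by inverse variance, which I am not attempting here).

```python
from statistics import median

# (sample size, effect size d): hypothetical values for illustration only
studies = [
    (9000, 0.05),
    (1300, 0.10),
    (850, 0.15),
    (300, 0.30),
    (170, 0.35),
    (120, 0.45),
    (110, 0.50),
]

def weighted_mean_effect(studies):
    """Mean effect size, each study weighted by its number of subjects."""
    total_n = sum(n for n, _ in studies)
    return sum(n * d for n, d in studies) / total_n

def median_effect(studies):
    """Middle effect size when studies are lined up, ignoring sample size."""
    return median(d for _, d in studies)

print(f"sample-size weighted mean: {weighted_mean_effect(studies):.3f}")
print(f"unweighted median:         {median_effect(studies):.3f}")
```

With these invented numbers the median lands on one of the small studies and comes out several times larger than the weighted mean, even though most of the subjects sit in the studies with small effects.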

As Steve Blinkhorn points out:

"The ten studies with estimated differences above the median cover a total of only 2,591 participants, whereas the ten studies with differences below the median account for 15,735 participants — the four largest differences come from samples of 111, 173, 124 and 300, the four smallest from samples of 844, 172, 9,048 and 1,316. Choosing to use the median is a flawed and suspect tactic."

Now we need to put the icing on the cake, let's make an outrageous claim that is contradicted by our own data:

"These results are clearly contrary to the assertions of a number of authorities including Eysenck (1981), Court (1983), Mackintosh (1996, 1998a, 1998b) and Anderson (2004, p. 829). These authorities have asserted that there is no difference between the means obtained by men and women on the Progressive Matrices. Thus, the tests 'give equal scores to boys and girls, men and women' (Eysenck, 1981, p. 41); 'there appears to be no difference in general intelligence' (Mackintosh, 1998a, ); and 'the evidence that there is no sex difference in general ability is overwhelming' (Anderson, 2004, p. 829). Mackintosh in his extensive writings on this question has sometimes been more cautious, e.g. 'If I was thus overconfident in my assertion that there was no sex difference… if general intelligence is defined as Cattell's Gf, best measured by tests such as Raven's Matrices… then the sex difference in general intelligence among young adults today …is trivially small, surely no more than 1-2 IQ points either way' 1998b, p. 538). Contrary to these assertions, our meta-analyses show that the sex difference on the Progressive Matrices is neither non-existent nor 'trivially small' and certainly not '1-2 IQ points either way', that is, in favour of men or women. Our results showing a 4.6 to 5 IQ point advantage for men is testimony to the value of meta-analysis as compared with impressions gained from two or three studies."

That's right, even though the correct analysis of our data shows a 1.4 IQ point advantage for men let's claim that anyone suggesting a difference of '1-2 IQ points either way' is totally wrong and that only our completely dodgy analysis is the correct interpretation. One in the eye for you hairy lesbian feminists!

Naturally, no attempt is made to estimate publication bias; it is just asserted that there cannot be a file drawer effect because none of the studies was directly comparing male and female IQ in its primary study design. Blinkhorn again:

"My own file drawer turned out to contain an analysis of data from...the advanced matrices...This yielded an advantage of 0.07 standard deviations for females. The sample is larger than all but five of those found by Irwing and Lynn."

In my own research, if I don't detect a difference between men and women (you ought to check) then I probably wouldn't report the data split by gender - but of course this immediately introduces a publication bias - as only those studies where a difference has been found will have data suitable for including in a meta-analysis - and thus any effect will be overestimated.
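This selective-reporting effect is easy to simulate. The sketch below is a toy model under my own assumptions, not an analysis of any real studies: every study has the same small true effect, the same group size, and only reports a sex split when the difference is "significant" in either direction. The true effect, group size and seed are arbitrary illustrative choices.

```python
import random
from statistics import mean

def simulate_reported_effects(true_d=0.1, n_per_group=50, n_studies=2000, seed=0):
    """Simulate observed effect sizes and keep only the 'significant' ones."""
    rng = random.Random(seed)
    # Approximate standard error of d for two equal groups (assumption: small d)
    se = (2 / n_per_group) ** 0.5
    observed = [rng.gauss(true_d, se) for _ in range(n_studies)]
    # A study only reports data split by sex if |d| clears the 5% threshold
    reported = [d for d in observed if abs(d) > 1.96 * se]
    return observed, reported

observed, reported = simulate_reported_effects()
print(f"studies run:              {len(observed)}")
print(f"studies reporting a split: {len(reported)}")
print(f"mean d, all studies:      {mean(observed):.2f}")
print(f"mean d, reported studies: {mean(reported):.2f}")
```

A meta-analysis that can only see the reported studies will recover something well above the true effect, which is exactly the overestimate described above.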

If these are the kind of shenanigans people like Lynn can get up to right in front of our eyes, then what's going on behind the scenes? I cannot trust the data of these people because I do not respect them as scientists.

[interestingly, this study did not support the claim that men have a higher standard deviation in IQ scores than women - which is often posited to contend that while men and women may have equal mean IQs, there are more men in the extremes of the distribution, thus making more 'geniuses' men. Presumably they'd have tried harder to push that result if they hadn't been so happy with their "five points ahead on IQ tests" figure.]

UPDATE
Irwing and Lynn have replied to Blinkhorn's criticisms, and consequently many of mine, here, and Blinkhorn replies (thanks to potentilla in the comments). I'm unimpressed by their excuses but I'll repeat them here. They start off by saying:

"We believe that the principal error of Blinkhorn’s criticism is that he does not consider our result in the context of several other studies showing that adult males have an IQ advantage of around 4–6 IQ points."

That is disingenuous to say the least: Blinkhorn criticises the study's methodology, and you cannot resort to the results of other studies to support your conclusions; the study must stand or fall on its own merits.

They go on to say:

"Blinkhorn criticizes us for not adopting the principle of weighting results by sample size, and for excluding the very large study from Mexico. This misses a central point of meta-analysis. We carried out a number of tests for moderator variables (factors that cause under- or overestimates of the sex difference) and found strong evidence for two: these were the type of test and the tendency of some universities selectively to recruit either brighter men or brighter women. In the presence of strong moderators, many of the studies in the sample provide biased estimates of the sex difference in IQ score. It is clear from the box plot (Fig. 1) that the Mexico results conform to estimates from the most male-biased samples, which provide substantial underestimates of the sex difference in IQ. Given the strong probability of bias in this sample, to weight it by its sample size (9,048) would risk a serious underestimate of the population sex difference in IQ. For this reason, we followed the advice of a definitive article on meta-analysis [10] and took the median of estimates, including Mexico, which equated to 4.6 IQ points." [PJ - note that the median is unaffected by inclusion or exclusion of the Mexico study since it is nearer the low extreme values and only the value of the middle study affects the median]

Now I confess, I don't quite see what they are saying here. The talk of sex-selection refers to studies which found either higher variance in males or females - they hypothesise that these differences in variance are the result of greater selection for males or females (lower variance, smaller standard deviations, means that the gender was more highly selected at that university, and thus shows less variance). But, and this is a fucking great big but, the Mexican study doesn't report variance by gender - what Irwing & Lynn are saying is that if you look at the figure the effect size for the Mexico study is more like the studies with a 'pro-female' selection bias (this is difficult to figure out as an analysis because this paper lists 10 studies, whereas only 6 studies were identified as pro-female in the original paper***) i.e. they are asserting that it had pro-female bias without any evidence that this was actually so.

As Blinkhorn says:

"Their argument here is circular: the sex difference is vanishingly small compared with their sample of smaller, less representative groups, so therefore there must be a bias."

But, of course, they aren't just excluding the Mexico study (note no other studies are excluded as outliers); sticking with means still gives a lower estimate of effect size than Irwing and Lynn report, even without the Mexico study, so taking the median makes a big difference. They claim this is necessary due to heterogeneity in the sample, but this is inappropriate given the range of study sizes from 30 subjects to nearly 10,000.

Looking at my plot of log transformed variance ratios* against effect sizes suggests that there may be a relationship between the difference in standard deviations between men and women (interpreted by Irwing & Lynn as due to differing selectivity of universities), ~~although the regression is only significant at alpha=.05 if you exclude the far right value~~**. They could easily have incorporated this relationship into their analysis and run a regression model if they were worried about this effect. Look at the point of no difference between variance (when standard deviations are the same, log ratio 0.0), the effect size is about .2, we can see from this that their overall sample is actually biased towards male-selective studies (reflected in their estimate of 6 female-selective, and 13 male-selective), and by Irwing & Lynn's interpretation their overall sample is biased in favour of studies where men are over selected and thus biased against women! This regression line ~~doesn't take into account study size and~~ only includes studies with available variance data, but if differential variance was the only effect at play here the effect size would still be smaller than the median estimate (and more like the mean exc. Mexico) as we can see from the estimated effect size (about .2, 2-3 IQ points depending on population variance) when there is zero difference in variance between men and women. This rather highlights why the median does not trump the mean when data is heterogeneous, instead you have to explore the effect of modifier variables, as Rosenthal (who Irwing & Lynn reference as justification for using medians) says:

"When several approaches to central tendency yield different results, the
reasons for such differences need to be explored."

But we know why they are discrepant: so many studies were tiny, and with more pro-male studies the median is almost guaranteed to fall in the high end of estimates, because that is where the mid-point lies.
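For anyone wanting to try the regression approach, here is a minimal sketch of a sample-size weighted straight-line fit, reading off the intercept as the estimated effect size when the log variance ratio is zero. The data points are hypothetical placeholders, not the values from the paper's table, and the function name and sign convention of the ratio are my own assumptions.

```python
def weighted_line_fit(x, y, w):
    """Weighted least-squares fit of y = intercept + slope * x."""
    total_w = sum(w)
    x_bar = sum(wi * xi for wi, xi in zip(w, x)) / total_w
    y_bar = sum(wi * yi for wi, yi in zip(w, y)) / total_w
    s_xy = sum(wi * (xi - x_bar) * (yi - y_bar) for wi, xi, yi in zip(w, x, y))
    s_xx = sum(wi * (xi - x_bar) ** 2 for wi, xi in zip(w, x))
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar
    return intercept, slope

# Hypothetical studies: log variance ratio, effect size d, sample size
log_var_ratio = [-0.10, -0.05, 0.00, 0.05, 0.15]
effect_size = [0.10, 0.18, 0.22, 0.25, 0.40]
sample_size = [800, 300, 1200, 150, 90]

intercept, slope = weighted_line_fit(log_var_ratio, effect_size, sample_size)
# The intercept is the fitted effect size where the variances are equal
print(f"effect size at log ratio 0: {intercept:.2f} (slope {slope:.2f})")
```

The point of fitting a line rather than taking a median is that the moderator is modelled directly, and large studies count for more than tiny ones.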

With reference to generalising the findings to the population at large, they say:

"Many of Blinkhorn’s difficulties stem from his assumption that our focus was on university students. This makes little sense, because the IQ difference in students is dependent on which population is considered, whereas the sex difference in the general population, our actual focus of interest, is highly stable."

But, of course, this is no explanation. Their study was of university students, whatever their self-proclaimed focus was, and they should at least have reported what the IQ differences they actually found were, before then generalising to a population for which they had no evidence.

* You need to log transform ratios to make them symmetrical - take women with a 10x greater variance than men, the same as men, or men having 10x the variance of women - straight ratios give you figures of 10, 1, and 0.1 but log ratios give you 1, 0, -1. Think about how they'll look when plotted on a line: the raw ratio data will crowd the points where men have higher variance than women in between 0 and 1, while when women have a higher variance than men it will stretch from 1 to 100, or 1000, or infinity. But when the data is log transformed men having a higher variance is treated just the same as women having a higher variance, but with a negative sign - i.e. it is symmetrical.
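The three cases in this footnote can be checked in a couple of lines. This sketch assumes base-10 logs and a women-over-men ratio, which matches the 10, 1, 0.1 example above; the function name is mine.

```python
import math

def log_variance_ratio(var_women, var_men):
    """Base-10 log of the women/men variance ratio (0 means equal variance)."""
    return math.log10(var_women / var_men)

# Women 10x the variance, equal variance, men 10x the variance
for var_women, var_men in [(10, 1), (1, 1), (1, 10)]:
    raw = var_women / var_men
    logged = log_variance_ratio(var_women, var_men)
    print(f"raw ratio {raw:>4}: log ratio {logged:+.1f}")
```

Swapping which sex is on top of the ratio just flips the sign of the log ratio, which is the symmetry you want when plotting.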

** I've substituted a regression line that takes into account study size, and it looks a lot nicer than the original fit too. But just look at how the studies are skewed to the pro-male side, suggesting that there may be serious over estimation of the effect size - I don't think that was what Irwing & Lynn wanted us to conclude!

*** There's something funny going on here but I'm not sure what. How can the original paper have found 13 pro-male (I assume they exclude the study where the difference in standard deviation is only at the second decimal place), and 6 pro-female studies, consistent with the data in the table, yet their reply to Blinkhorn has 10 pro-male and 10 pro-female studies? It just makes me even more concerned about their methodology.

UPDATE 2
A reader points out the following from Lynn & Irwing (2004):

"The second kind of poor quality study consists of those with small sample sizes that are liable to produce anomalously large chance effect sizes that obscure the true relationship. Some meta-analysts ignore differences in sample sizes and accord all studies equal weight irrespective of sample size. This is reasonable for certain data sets where all the studies have about the same sample sizes. Where this is not the case, some meta-analysts deal with this problem by ignoring studies with samples below a certain size, while others weight the studies by the sample sizes. These two solutions amount to much the same thing because weighting by sample size dilutes and may effectively eliminate the contribution of studies with small samples. Where the meta-analyst has a number of large samples, the simplest procedure is to ignore small samples and confine the analysis to studies where sample sizes are considered acceptable." [my emphasis]

Hoist by their own petard methinks.

I've been interested by the recent coverage of this report from UNICEF which has been widely reported in the UK media as showing that we have the worst childhoods in the industrial world. Now I can see that childhood here is not necessarily a bed of roses, but I was somewhat dubious that it was likely to be worse than say Poland, or Russia, but everyone seems to have been commenting on it as accurately reflecting the state of the world. Now I'm sure there's some truth in it, but I was sufficiently intrigued to have a closer look.

Looking at the report methodology reveals an amusing methodological conceit - they z-score their indicators -

A common scale

Throughout this Report Card, a country’s overall score for each dimension of child well-being has been calculated by averaging its score for the three components chosen to represent that dimension. If more than one indicator has been used to assess a component, indicator scores have been averaged. This gives an equal weighting to the components that make up each dimension, and to the indicators that make up each component. Equal weighting is the standard approach used in the absence of any compelling reason to apply different weightings and is not intended to imply that all elements used are considered of equal significance.
In all cases, scores have been calculated by the ‘z scores’ method – i.e. by using a common scale whose upper and lower limits are defined by all the countries in the group. The advantage of this method is that it reveals how far a country falls above or below the average for the group as a whole. The unit of measurement used on this scale is the standard deviation (the average deviation from the average). In other words a score of +1.5 means that a country’s score is 1.5 times the average deviation from the average. To ease interpretation, the scores for each dimension are presented on a scale with a mean of 100 and a standard deviation of 10.

So however close the raw scores are on an indicator scale, they will be stretched onto a common scale with the same mean and standard deviation, with one country right at the bottom and one at the top, and this distribution will be given the same sort of weight as another distribution with massive disparities. That is, a country that is at the top of the distribution for, say, immunisation rates, even if these rates are all very similar (a range of 80-100%), but scores badly on, say, infant mortality (2-16/1000), where there is a wide range of outcomes, will come out the same as a country where the converse is the case. Eyeballing it, this is the case for Russia or Poland versus Austria. If we look at this section, the Health & Safety of Children measure shows the Netherlands (#2) at 112-113 and Ireland (#19) at 91; this represents Infant Mortality Rates of 4.9 and 5/1000, Low Birth Weight of 5.3% and 5%, Immunisation Rates of 96% and 81%, and Deaths from Accidents of 9 and 14/100,000. Now obviously Ireland is worse than the Netherlands, but the difference in ranking (#2 vs #19) does not seem to convey the message that the Irish have a similar rate of low birth weight, similar infant mortality, immunisation rates 15 percentage points lower, and an accidental death rate about 50% higher in the under 19s. It makes it look like children in Ireland are a diseased subclass (rather than my first thought, which is that all their kids are dying in road accidents due to the stupid provisional licence system, and the generally unsafe roads).
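Here is a rough sketch of what z-scoring does to an indicator, using invented numbers for three imaginary countries rather than anything from the report. I am assuming a population standard deviation and the mean 100, SD 10 presentation scale described in the quoted methodology; the report may differ in detail.

```python
from statistics import mean, pstdev

def z_scores(values):
    """Standardise values to mean 0 and (population) standard deviation 1."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / sigma for v in values]

def report_card_scale(values):
    """Rescale z-scores to a mean of 100 and a standard deviation of 10."""
    return [100 + 10 * z for z in z_scores(values)]

# Hypothetical indicators for three countries (higher is better in both)
narrow_indicator = [98, 99, 100]   # countries almost identical
wide_indicator = [80, 90, 100]     # countries very different

print([round(s, 1) for s in report_card_scale(narrow_indicator)])
print([round(s, 1) for s in report_card_scale(wide_indicator)])
```

Both lists come out with the same scaled scores, so a one-point gap on the narrow indicator ends up counting exactly as much as a ten-point gap on the wide one once the indicators are averaged.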

There are also some elements that seem a bit unwise, taking immunisation rates as measuring

the comprehensiveness of preventative health services for children. Immunization levels also serve as a measure of national commitment to primary health care for all children

they note that

Vaccination is cheap, effective, safe, and offers protection against several of the most common and serious diseases of childhood (and failure to reach high levels of immunization can mean that ‘herd immunity’ for certain diseases will not be achieved and that many more children will fall victim to disease).

but the very phenomenon of herd immunity means that percentage immunisation should not be considered a uniform linear scale, where a 10-point gain in coverage always means the same in terms of outcome wherever on the scale it happens - so if you need 90% coverage for herd immunity then an improvement from 90 to 95% coverage is not as significant as an improvement from 85 to 90%. An additional factor is that, as an indirect measure of health services for children, immunisation is a lousy measure in countries (such as the UK) with a strong recent history of anti-immunisation campaigns (so the UK vaccination rate peaked at 90%+ after MMR was introduced before dipping again to <85% after the MMR controversy around 1998).

More later, perhaps I will get JP to discuss why the UK figures for the "Percentage of 15-19 year-olds not in education, training or employment" are actually a feature of poor population statistics, and how reliable subjective survey reports are when compared between countries.

Update: Doesn't look like JP is going to post - but basically the figures for 15-19yr olds not in education are calculated by subtracting the number in education from the latest population estimate. You may remember the controversy over the 'missing' young men in the last census; the ONS was forced to fiddle the figures and just arbitrarily add in thousands more of them. So, of course, by doing this they've suddenly massively upped the number not in education (since these extra young men are purely nominal and have no further evidence for their existence it is hardly likely they'd be enrolled in school!).

Friday, 5 October 2007

Sex and IQ

Saturday, 25 August 2007

The effect of shifting socioeconomic stucture on GCSE results

Thursday, 15 February 2007

Worst Childhood On Earth 2

Worst Childhood On Earth!!!
