**Summary**

Carrying out a meta-analysis using raw Hamilton Rating Scale for Depression (HRSD) change scores derived from the Kirsch et al 2008 PLoS Medicine paper, I found that the effect size was larger than that reported by Kirsch et al. For paroxetine and venlafaxine, this effect size exceeded the NICE criterion for 'clinical significance' (a difference of 3 points on the HRSD). This suggests that the findings of Kirsch et al depend on their particular method of analysis and cannot be generalised to all the antidepressants included in their analysis.

**Results**

The weighted mean difference (WMD) with a random effects model (see right) shows an overall effect size of 2.7 (95% CI 2.0-3.4), with paroxetine, fluoxetine, nefazodone, and venlafaxine at 3.4, 2.1 (non-sig), 1.7, and 3.5 respectively. Paroxetine and venlafaxine therefore both exceeded the NICE criterion for 'clinical significance' of 3 HRSD points.

Analysing the standardised mean difference gave an overall effect size of .32 (95% CI .24-.41), with paroxetine, fluoxetine, nefazodone, and venlafaxine having effect sizes of .41, .24, .21, and .40 respectively.

There was minimal difference between the fixed and random effects models. Excluding the study of mild depression gives an overall effect size of 2.81, with fluoxetine statistically significant at 2.85.

**Methods***

The PLoS Medicine paper gives the data for individual studies in Table 1. It reports that the measure of effect size 'd' is the change score divided by the standard deviation (SD) of the change score, so the SD can be derived from the change score and 'd'.
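As a minimal sketch of that derivation, assuming 'd' really is the change score divided by its SD: the function name `sd_from_effect_size` and the example numbers are hypothetical and are not taken from Table 1.

```python
def sd_from_effect_size(change_score: float, d: float) -> float:
    """Recover the SD of the change score, assuming d = change_score / SD."""
    if d == 0:
        raise ValueError("d must be non-zero to recover the SD")
    return abs(change_score / d)

# Hypothetical values for illustration only (not from Table 1).
example_change = 9.0   # mean HRSD improvement in one arm
example_d = 1.2        # reported effect size 'd' for that arm
example_sd = sd_from_effect_size(example_change, example_d)
print(f"Derived SD of change score: {example_sd:.2f}")
```

The same rearrangement would be applied separately to the drug and placebo arms of each study.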

The raw HRSD change scores and SDs were entered, along with the sample sizes from Table 1, into the Cochrane Collaboration RevMan Analyses (v 1.0.5) software to perform a weighted mean difference (WMD) meta-analysis with random effects.
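For readers without RevMan, the calculation can be sketched roughly as follows. This assumes the common DerSimonian-Laird form of the random effects model with inverse variance weights. The function name `random_effects_wmd` is my own, and the three example studies are made up for illustration rather than taken from Table 1.

```python
import math

def random_effects_wmd(studies):
    """Pool mean differences with a DerSimonian-Laird random effects model.

    Each study is a tuple:
    (mean_drug, sd_drug, n_drug, mean_placebo, sd_placebo, n_placebo),
    where the means and SDs describe HRSD change scores.
    Returns (pooled_difference, ci_low, ci_high, tau_squared).
    """
    diffs, variances = [], []
    for m_d, sd_d, n_d, m_p, sd_p, n_p in studies:
        diffs.append(m_d - m_p)
        variances.append(sd_d ** 2 / n_d + sd_p ** 2 / n_p)

    # Fixed effect (inverse variance) estimate, needed for the heterogeneity statistic Q.
    w = [1.0 / v for v in variances]
    fixed = sum(wi * di for wi, di in zip(w, diffs)) / sum(w)
    q = sum(wi * (di - fixed) ** 2 for wi, di in zip(w, diffs))

    # Between-study variance (tau squared), truncated at zero.
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - (len(studies) - 1)) / c) if c > 0 else 0.0

    # Random effects weights add tau squared to each study's own variance.
    w_re = [1.0 / (v + tau2) for v in variances]
    pooled = sum(wi * di for wi, di in zip(w_re, diffs)) / sum(w_re)
    se = math.sqrt(1.0 / sum(w_re))
    return pooled, pooled - 1.96 * se, pooled + 1.96 * se, tau2

# Made-up illustrative studies, NOT the Table 1 data.
example_studies = [
    (10.0, 8.0, 50, 7.0, 8.0, 50),
    (11.0, 7.5, 80, 8.5, 7.0, 40),
    (9.0, 8.5, 60, 5.5, 8.0, 60),
]
pooled, low, high, tau2 = random_effects_wmd(example_studies)
print(f"WMD = {pooled:.2f} (95% CI {low:.2f} to {high:.2f}), tau^2 = {tau2:.2f}")
```

When the estimated between-study variance is zero, the random effects weights reduce to the fixed effect weights, which is one reason the two models can give near-identical answers.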

Subgroups were defined to analyse paroxetine, fluoxetine, venlafaxine, and nefazodone separately. Sensitivity analyses were performed by omitting the outlying fluoxetine study of subjects with mild depression ('ELC 62 (mild)').

For completeness, a fixed effect analysis was also carried out, as well as an analysis of the standardised mean difference (the difference in change scores normalised to the standard deviation of the change scores) using Hedges' adjusted g (similar to Cohen's d, but with an adjustment for small sample bias). The standardised analysis is not in fact appropriate for studies that have all used the same outcome measure (HRSD scores in this case).
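To make the small sample adjustment concrete, here is a sketch of Hedges' adjusted g using the usual approximate correction factor 1 - 3/(4N - 9), where N is the combined sample size. The function name `hedges_g` and the example inputs are hypothetical, and RevMan's internal implementation may differ in detail.

```python
import math

def hedges_g(mean_1, sd_1, n_1, mean_2, sd_2, n_2):
    """Standardised mean difference with Hedges' small sample correction."""
    # Pooled SD across the two arms.
    pooled_sd = math.sqrt(
        ((n_1 - 1) * sd_1 ** 2 + (n_2 - 1) * sd_2 ** 2) / (n_1 + n_2 - 2)
    )
    d = (mean_1 - mean_2) / pooled_sd                   # uncorrected difference (Cohen's d)
    correction = 1.0 - 3.0 / (4.0 * (n_1 + n_2) - 9.0)  # shrinks d slightly for small samples
    return correction * d

# Hypothetical arms for illustration only: drug vs placebo HRSD change scores.
example_g = hedges_g(10.0, 8.0, 50, 7.0, 8.0, 50)
print(f"Hedges' g = {example_g:.3f}")
```

With 50 subjects per arm the correction shrinks the estimate by less than 1%, so at trial sizes like these g and d are close to interchangeable.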

**Links**

This study is an updated version of this analysis and this analysis, but deriving the SD for the SMD more accurately. I also discuss the Kirsch et al paper here and here.

*** UPDATE 11/3/8**

Following on from Robert Waldmann's findings, despite my protestations to the contrary, it looks like the confidence intervals of 'd' in Kirsch et al are a poor guide to the standard deviation of the change score, and the effect size 'd' may actually be the HRSD change score/SD change score, so the above analysis was corrected to be based on this new SD measure. Unsurprisingly it makes little difference.

## 11 comments:

It sounds like this should be a paper rather than a blogpost.

I agree with lemmuslemmus - can you submit this as a "Comment"?

Done - we'll see if it appears.

Well it has appeared under my pimp name here.

We'll see if it gets any bites.

Hi PJ,

I just want to say "Thanks" for debunking this dodgy paper. It will cause substantial harm nevertheless, as the press already got it terribly wrong, but at least some people try to stop this dangerous media frenzy and get the facts right.

As a long time depression sufferer, helped by the very drugs that "don't work", I am grateful for your analysis.

I am spreading the word in Germany, hopefully people will realise what is really going on with Kirsch and his "findings".

All the best,

TearsforFears

Cheers TearsforFears,

I'm not sure I necessarily want to allege that Kirsch et al are being deliberately misleading - but they ought to have checked alternative (and dare I say more conventional) methods of analysis before publishing such radical findings, but more importantly, their inflammatory interpretation.

But you're right, the damage has likely been done. The media can't really be trusted to report on science or medicine without getting it wrong, and it is a pity that this time the authors seem to have contributed more than their fair share to the misunderstanding.

I agree with your conclusion that Kirsch's results/conclusions depend critically on his use of the standardised mean difference (SMD). Plotting the (raw) change scores against baseline HRSD gives a very different picture.

I didn't understand why you use a rather complicated method to derive the standard deviations (SD) of the change scores via the confidence interval for the SMD. Since Kirsch calculated SMD (d) as change score divided by SD of change score, surely the SD can be reobtained as change score divided by SMD? Or have I misunderstood something?

Jeremy Franklin

Although the paper says it used the change score/SD of change score - this wouldn't actually be an SMD (where you have to use the SD of the baseline or outcome measures) and the reference given is to a paper on Hedges' g, which is like Cohen's d, and is calculated by dividing by some function of the baseline/outcome SD.

Also, if you divided the change by the SD then the SD of the effect size (what they call 'd') would have to be 1 - which it doesn't appear to be from the confidence intervals.

I discuss it here on badscience.

Awesome! I am trying to send the people Brad DeLong sent to me over to you. I have an idea as to why your results differ from theirs (such that you are right and they are wrong).

It implies that you can do a Wu-Hausman test to test if their meta-analysis is biased.

I'm not sure I'm not confused, since I'm not sure I understand exactly what they did. I am quite sure that a reasonable thing to do is to calculate the sample size weighted mean improvement with treatment and with placebo and compare them. I am quite confident that using reported standard deviations to calculate the weights when averaging can lead to a bias in the result of the meta-analysis (and I think it does in this case because I guess studies in which more patients respond to the SSRI have a greater variance in change under treatment than ones where fewer respond).

All at great length and written for the general public here

http://tinyurl.com/344t3c

My analysis weights by standard error - so it does include using the reported SD that I've derived from the confidence intervals of the SMD.

A quick calculation suggests that weighting by raw sample size gives an overall effect size of 2.7, with paroxetine 2.6, fluoxetine 2.0, nefazodone 1.8, and venlafaxine 4.7 - which isn't massively at odds with my analysis.
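For anyone wanting to repeat that kind of quick calculation, a sketch of the sample size weighted comparison might look like this. The function names and the arm data are hypothetical (not the Table 1 figures); each arm is given as (mean HRSD change, sample size).

```python
def weighted_mean(values, weights):
    """Mean of values, weighted by the supplied weights."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

def sample_size_weighted_difference(drug_arms, placebo_arms):
    """Difference between sample size weighted mean changes on drug and placebo.

    Each arm is a (mean_change, n) tuple.
    """
    drug = weighted_mean([m for m, _ in drug_arms], [n for _, n in drug_arms])
    placebo = weighted_mean([m for m, _ in placebo_arms], [n for _, n in placebo_arms])
    return drug - placebo

# Hypothetical arms for illustration only.
example_drug_arms = [(10.0, 100), (12.0, 300)]
example_placebo_arms = [(8.0, 50), (8.0, 150)]
example_difference = sample_size_weighted_difference(example_drug_arms, example_placebo_arms)
print(f"Sample size weighted difference: {example_difference:.2f} HRSD points")
```

This ignores the SDs entirely, so it avoids any bias that comes from weighting by reported variances, at the cost of treating every patient's change score as equally informative.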

They do report that their analyses were "weighted for the inverse of the variance", "analytic weights are derived from the sample size and the SDc", and "deriving its analytic weight from its standard error". I don't have all the data here, but back-of-the-envelope, I don't see how even weighting by variance alone could get you down to a 1.8 effect size.

Looking at a simple regression of the difference in change scores against the difference in SMDs suggests that a difference of 1.8 in HRSD scores is comparable to a difference of .22 in SMDs - and their effect size of .32 translates to 2.7 HRSD points - with 3 HRSD points being an SMD difference of .36 - and an SMD difference of .5 translates to a change score difference of 4.2 HRSD points.
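As a rough check on those conversions, a straight line can be fitted through the four (SMD, HRSD) pairs quoted above. This is only a sketch: it uses the rounded pairs from this comment rather than the per-study data the original regression was run on, and the function name `fit_line` is hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

# The conversion pairs quoted in the comment above (rounded values).
smd_points = [0.22, 0.32, 0.36, 0.50]
hrsd_points = [1.8, 2.7, 3.0, 4.2]

intercept, slope = fit_line(smd_points, hrsd_points)
print(f"HRSD difference is roughly {intercept:.2f} + {slope:.2f} * SMD")
print(f"SMD of 0.32 corresponds to about {intercept + slope * 0.32:.1f} HRSD points")
```

On these rounded pairs the slope comes out between 8 and 9 HRSD points per unit of SMD, which is why an SMD of .32 sits much closer to 2.7 points than to 1.8.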

So, again, not sure where they get their effect size of 1.8 HRSD points from given their effect size of a .32 SMD difference.
