Tuesday, 11 March 2008

Statistics and depression

Robert Waldmann has some statistical thoughts on the Kirsch et al meta-analysis of anti-depressants:

Just can't let it go. SSRI Meta-analysis meta-addiction

Caveat lector

Caveat Lector II

The Simplest Meta-Analysis Problem

That Hideous Strength

Prozac Fan Talks Back

Personal chat with pj

In particular, in response to my confusion about where they got an effect size of 1.8 from:

"Actually I think I understand how Kirsch et al got their results. I get a weighted average difference of change of 1.

notation: change_ij is the average change in HRSD of patients in study i who got the SSRI (j=1) or the placebo (j=0). N_ij is the sample size of patients in trial i who get j pills.

In one calculation I used change_ij/d_ij as the standard deviation of the change, and thus (change_ij/d_ij)^2/N_ij as the estimated variance of the average change. Then, *separately* for the drug and placebo data, I calculated the precision-weighted average over i of change_ij. This gave me an average change of 7.809 for the placebo and 9.592 for the SSRI treated, for a difference of 1.78.

I guess this is what they did. I think the confidence intervals are screwy and d is as described in the paper."

He also finds that it looks like Kirsch et al did indeed divide the change score by the standard deviation of the change score to obtain their 'd' measure - the wide confidence intervals that I thought argued against this seem to be a particular idiosyncrasy of their study. Fortunately this makes very little difference to my previous analyses (I've updated the effect sizes in this study; the only real impact is on calculating proper SMD effect measures, which come out larger because the estimated SD is smaller).
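To make Robert's reconstruction concrete, here's a rough Python sketch of the pooling scheme he describes, with made-up toy numbers (not the actual trial data) purely to illustrate the arithmetic:

```python
# Sketch of the pooling scheme described above, with invented toy numbers.
# For each arm (drug or placebo) of each trial: the SD is recovered as
# change/d, the variance of the mean change is (change/d)^2 / N, and the
# two arms are then precision-weighted *separately*, ignoring which
# placebo arm belonged to which drug arm.

def precision_weighted_mean(changes, ds, ns):
    """Inverse-variance (precision) weighted mean of the mean changes."""
    weights = [n / (change / d) ** 2 for change, d, n in zip(changes, ds, ns)]
    return sum(w * c for w, c in zip(weights, changes)) / sum(weights)

# Hypothetical per-trial data: mean HRSD change, reported d, sample size.
drug_change, drug_d, drug_n = [10.0, 9.0, 8.5], [1.2, 1.0, 0.9], [120, 80, 60]
plac_change, plac_d, plac_n = [8.0, 7.5, 7.0], [1.0, 0.9, 0.8], [60, 80, 40]

drug_pooled = precision_weighted_mean(drug_change, drug_d, drug_n)
plac_pooled = precision_weighted_mean(plac_change, plac_d, plac_n)
print(round(drug_pooled - plac_pooled, 2))  # the "drug minus placebo" difference
```

Note that the weights inherit the oddity Robert points out: because the SD is reconstructed as change/d, arms with bigger changes get bigger variances and so less weight, all else equal.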

I'll repeat one of my comments here:

"Hmm, that would be very annoying if someone had based their analyses on the confidence intervals being, you know, normal confidence intervals.

Looking back at the data it seems you're right: by essentially carrying out the meta-analysis on two entirely separate populations (the drug changes and the placebo changes) and then subtracting one from the other, they get their very low estimate of HRSD change.

That is a very odd way of doing things indeed. It basically assumes that each study is really two separate and entirely unrelated studies, one on how people improve with drugs and one on how they improve with placebo, so that the way to analyse them is to ignore the study design and just estimate the pooled effect size for each group (drug and placebo) as if they were unrelated. This has an effect partly because the SDs depend on response, and partly because sample sizes are skewed towards drug groups in some studies (so the placebo group is much smaller than the drug group).

Taking your SD = change/d approach, and just plugging it into a meta-analysis program (SE weighting, fixed effects), gives an overall effect size of 1.9. It is interesting to note that the fluoxetine trials contribute half as much to the drug analysis (in terms of weighting) as to the placebo analysis!

But as before, it is also interesting to see that segregating by drug gives effect sizes from 3.6 to .6 (or, given the silly form of this analysis, comparing individual drug groups to pooled placebo subjects, 3.8 to -.2)."

I'll expand on that last bit. Basically, if Robert is right, and his is the best explanation I've found (looking back at the paper there are tantalising suggestions that it is correct, because they report model statistics separately), then they have assumed that there are two entirely separate populations, the drug group and the placebo group, and that each trial is simply an attempt to estimate the size of the improvement in HRSD score within each group, ignoring any information about which placebo group went with which drug group in any particular trial (an approach that chimes with their regression analysis, which looks at each group separately).

When I attempted to replicate this sort of analysis, as I mention above, I found that the effect sizes are 9.6 and 7.7 (a difference of 1.9), with the drug groups paroxetine, fluoxetine, nefazodone, and venlafaxine at 9.6, 7.5, 10.6, and 11.5 respectively, making differences (to overall placebo) of 2.0, -.2, 3.0, and 3.8. Compared to their relevant placebo groups, however, these differences are 3.0, .6, 1.8, and 3.6, giving you an idea of how much the placebo groups vary by the drug study they are in.
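As a toy illustration (invented numbers, not the Kirsch data), here's a sketch contrasting the two approaches: pooling the within-study drug-minus-placebo differences, versus pooling each arm separately across trials and then subtracting. Whether separate pooling over- or under-states the difference depends on how arm sizes and response levels line up across trials:

```python
# Toy trials: (mean drug change, N_drug, mean placebo change, N_placebo).
# Numbers are invented to make the two pooling schemes visibly disagree.
trials = [
    (10.0, 200, 8.0, 50),  # large drug-heavy trial with high responses
    (6.0, 50, 5.0, 50),    # small balanced trial with lower responses
]

def paired_pooling(trials):
    """Respect the study design: difference within each trial,
    then a sample-size weighted average of those differences."""
    weights = [nd + np for _, nd, _, np in trials]
    diffs = [d - p for d, _, p, _ in trials]
    return sum(w * x for w, x in zip(weights, diffs)) / sum(weights)

def separate_pooling(trials):
    """The approach attributed to Kirsch et al above: pool each arm
    over all trials separately (weighted by its own N), then subtract."""
    drug = sum(d * nd for d, nd, _, _ in trials) / sum(nd for _, nd, _, _ in trials)
    plac = sum(p * np for _, _, p, np in trials) / sum(np for _, _, _, np in trials)
    return drug - plac

print(paired_pooling(trials), separate_pooling(trials))
```

With these toy numbers the two estimates differ because the drug arm of the big high-response trial dominates its pooled mean while the placebo arms are pooled with equal weight, which is exactly the kind of skew described above.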

Robert also finds a particularly telling aspect of the study:
"In my view in passing from the publication biased 3.23 to the final 1.78 only 0.6 of the change is due to removing the publication bias and 0.85 is due to inefficient and biased meta analysis.if the subsample of studies with references (I guess published studies) is analyzed with the method of Kirsch et al the weighted average improvement with SSRI is 9.63 and the weighted average improvement with placebo is 7.37 so the added improvement with SSRI is 2.26.If I have correctly inferred which studies were publicly available before Kirsch et al's FOIA request, I conclude that they would have argued that the effect of SSRI's is not clinically significant based on meta analysis of only published studies."
As mentioned in the comments, here's a pretty graph showing the effect size adjusted by regression on the baseline HRSD scores (to a baseline severity of 26 points), giving an overall effect of 3.0 (the grey line, as we'd expect from the regression lines, which reach 'clinical significance' at baseline = 26). We get 2.4, 2.9, 3.2, and 3.7 for the effect sizes of nefazodone, fluoxetine, paroxetine, and venlafaxine respectively, although the differences between drugs don't seem to be statistically significant (the closest is nefazodone versus venlafaxine).
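As a rough sketch of the general adjustment technique (not my actual regressions, and with invented numbers): fit change against baseline HRSD in each group, then read the fitted drug-placebo difference off at a chosen baseline severity such as 26.

```python
# Illustration of baseline adjustment by regression, with invented data.

def ols(xs, ys):
    """Simple least-squares slope and intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

# Hypothetical per-trial data: baseline HRSD vs mean change, per arm.
baselines = [22, 24, 25, 27, 28]
drug_changes = [7.0, 8.5, 9.0, 10.5, 11.0]
plac_changes = [6.5, 7.0, 7.2, 7.6, 7.9]

bd, ad = ols(baselines, drug_changes)  # drug slope and intercept
bp, ap = ols(baselines, plac_changes)  # placebo slope and intercept

def adjusted_difference(baseline):
    """Fitted drug-minus-placebo change at a given baseline severity."""
    return (bd * baseline + ad) - (bp * baseline + ap)

print(round(adjusted_difference(26), 2))
```

The drug slope being steeper than the placebo slope is what makes the adjusted difference grow with baseline severity, which is the pattern driving the 'clinical significance at baseline 26' result described above.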


Robert said...

I am very confused. I just sat down and saw that blogger had not accepted my comment because I didn't copy the wiggly letters correctly. Yet somehow you know about it.

I get it, you have ESP which is how you understand depression so well.

Robert said...

I agree that it is unreasonable to calculate weighted means for ssri and placebo data as if the ssri and placebo patients in the same study were in independent studies.

In particular the improvement with SSRI and with placebo are positively correlated across trials*.

I regressed the sum of the improvement with SSRI and placebo on n/(n+pn), the fraction of patients who got the SSRI.

As we already know, the coefficient was negative (-3.2). Not significant, but, since it is negative, results from separate averaging and comparing are biased against SSRIs.

* Of course one would guess that this correlation is positive, given different selection of patients (62 mild has small changes for both), probably different application of the Hamilton scale (excellent but partly subjective), and different clinical management of the patients.

Robert said...

I have applied my approach (first take differences, then use weights based only on sample sizes, which would be efficient if all disturbances had the same variance) separately for each SSRI.

Of course I find the same large differences you do, with venlafaxine and paroxetine performing better than Prozac, which outperforms nefazodone.

Here are my results (with obvious abbreviations):

dchange regressed on

prozac | 2.140214
ven | 3.445599
nef | 1.687342
par | 3.391007

Stata is convinced that nefazodone is significantly less effective than the weighted average of the other drugs.

Dchange regressed on (coefficient | std. err. | t):

nef | -1.389004 | .6742132 | -2.06
cons | 3.076346 | .4660721 | 6.60

pj said...

It makes you wonder why the drug companies haven't adopted this rather obvious approach (surely they have hundreds of biostatisticians sitting around) instead of appealing to anecdote.

It seems that both paroxetine and venlafaxine have 'clinically significant' effect sizes; fluoxetine is more equivocal, but excluding the 'mild' study suggests that it has an effect size comparable to the other two drugs, even if it doesn't exceed 3 HRSD points. Nefazodone, which has been withdrawn in Europe, and which is far from being an SSRI (it is a 5-HT2A antagonist with weaker serotonin and noradrenaline transporter antagonism), seems to have a pretty small effect size.

Obviously we must bear in mind the strong effect of baseline severity on the effect sizes (the nefazodone studies all had patients from the less severe end), but it really does look like Kirsch et al have overplayed their hand.

pj said...

Saying that, I've just had a look at the adjusted means (adjusted separately using my regression lines) and find that at baseline=23 the overall effect size (weighted by sample size) is 1.7, with 1.1, 1.6, 1.8, and 2.3 for nefaz, fluox, parox, and venla respectively. At baseline=26, where the overall effect is 3.0 (as we'd expect from the regression lines, which reach 'clinical significance' at baseline=26), we get 2.4, 2.9, 3.2, and 3.7 for these respective effect sizes - suggesting there is more to the differences in drug effect size than simply study baseline severity, because nefazodone is clearly below the others, and venlafaxine above.