Saturday, 15 March 2008

Kirsch et al reply

Blair Johnson and the other authors of the Kirsch et al paper in PLoS Medicine reply to the responses to their paper here. Some relevant parts for the discussions here:

" we reported in our results, this difference was more apparent than real, disappearing when we controlled for baseline severity. It is worth noting that Turner et al. (2008) found between-group effect size (d) estimates of 0.40 for venlafaxine and 0.26 for nefazodone, both of which are close to the mean of 0.40 for all 12 newer antidepressants and are identical to those for fluoxetine (0.26) and paroxetine (0.42)."

"Leonard took the trouble of re-analyzing the data from our Table 1 and concluded that a clinically significant difference emerged at a lower point of severity than we concluded in our article (i.e., 26 vs. 28). We are grateful that his work confirms our major conclusion, which is that the efficacy of anti-depressants depends on the initial severity of depression. Unfortunately, however, his estimates of the standard deviation underlying each effect size relied on between-subjects’ rather than within-subjects’ formulations. In examining improvement in response to drug or placebo, individual trials conventionally control for the correlation between the HRSD scores at baseline. We adopted this convention in our analyses of drug and placebo improvement. Reassuringly, the analyses at the end of our Results section pertaining to each trial’s drug vs. placebo comparison also used a between-subjects variance formulation and confirmed that clinical significance emerges in the vicinity of an HRSD score of 28."

"We found a nonsignificant benefit of drug compared to placebo for moderately depressed patients. Yet, consistent with our other conclusions, the difference between drug and placebo grows at higher levels of depression. Davies commented on the fact that there were few samples with scores below the category of very severe depression on the Hamilton Rating Scale of Depression (HRSD), a limitation that our Discussion mentioned. "

I note that they don't engage with the finding by 'Leonard' (that's me, that is) that there is no real decrease in placebo response with increasing severity. Nor do they address my concern that their use of the measure 'd' (mean change divided by the SD of the change) biases the effect size (expressed in HRSD change scores), or my point that the raw HRSD changes suggest that paroxetine and venlafaxine exceed the NICE 'clinical significance' criteria. I'm not quite sure what they mean by within-subjects versus between-subjects variance; since I've changed the analysis based on Robert Waldmann's findings, I don't know which analysis they looked at. They could be referring to normalising to the change score SD, which makes little difference compared to my previous analyses, or to analysing the drug and placebo groups separately, which is just plain statistically wrong (and seems to be what they did; note that my analysis of separate regression lines produces the same results as the between-subjects regression). They refer to the analyses at the end of their Results section as confirming their 'within-subjects' results. I wonder if they mean their Figure 4 (repeated here). You might want to compare that to my regression (and their Figure 2) and decide for yourself whether it confirms that the threshold for 'clinical significance' of 3 HRSD points difference is at a baseline HRSD of 28 points, as they claim, or 26, as I find.
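To show why dividing each group's change by its own SD worries me, here is a minimal sketch in Python. The numbers are made up purely for illustration (they are not from Kirsch et al or any trial), and the function name is my own. It only shows the arithmetic of 'd' as mean change over the SD of the change.

```python
# Hypothetical numbers, chosen only to illustrate the arithmetic.
# They are NOT taken from Kirsch et al or from any trial.

def standardised_change(mean_change, sd_change):
    """Within-group 'd': mean HRSD change divided by the SD of that change."""
    return mean_change / sd_change

drug_mean, drug_sd = 10.0, 8.0        # larger improvement, larger spread
placebo_mean, placebo_sd = 8.0, 6.0   # smaller improvement, smaller spread

# Difference expressed in raw HRSD points.
raw_difference = drug_mean - placebo_mean

# Difference after normalising each group to its own change SD.
d_difference = (standardised_change(drug_mean, drug_sd)
                - standardised_change(placebo_mean, placebo_sd))

print(f"raw HRSD difference: {raw_difference:.2f}")
print(f"difference in d:     {d_difference:.2f}")
```

With these invented figures the drug group is 2 HRSD points ahead on the raw scale, yet the difference in d comes out slightly negative, simply because the drug group's changes are more spread out.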

They also don't really seem sufficiently contrite over their claim that antidepressants should be avoided in 'moderate' depression, given that it was based on a single study plus an extrapolated regression line.

The only real, robust finding is that the difference between placebo and antidepressant response seems to increase with baseline HRSD severity. Although Kirsch et al emphasise that the level at which this difference becomes 'clinically significant' is in severe depression, it is worth noting that the level at which it becomes significant (around 26 according to both my analysis and theirs of the raw HRSD figures) is pretty much the middle of the pack in terms of the baseline severity of the studies, which were pretty much all in the 'very severe' range of over 23 HRSD points (see that figure). [Their finding that the differences between the drugs may be largely explained by the differing baselines of the studies is not unreasonable.]
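For anyone wondering where a threshold like 26 or 28 comes from, the idea is just to regress the drug-placebo difference on baseline severity and read off where the fitted line crosses 3 HRSD points. Below is a rough sketch in Python. The data are invented and placed on an exact straight line so the arithmetic is easy to check, the weights stand in for sample sizes, and the function names are mine. It is not a reproduction of my analysis or of theirs.

```python
# Hypothetical per-study data: baseline HRSD, drug-placebo difference in raw
# HRSD change, and a weight (here a made-up sample size). NOT the trial data.
baselines   = [23.0, 24.0, 25.0, 26.0, 27.0, 28.0, 29.0]
differences = [1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
weights     = [100, 80, 120, 90, 110, 70, 130]

def fit_weighted_line(xs, ys, ws):
    """Weighted least-squares fit of y = slope * x + intercept."""
    w_sum = sum(ws)
    x_bar = sum(w * x for x, w in zip(xs, ws)) / w_sum
    y_bar = sum(w * y for y, w in zip(ys, ws)) / w_sum
    sxy = sum(w * (x - x_bar) * (y - y_bar) for x, y, w in zip(xs, ys, ws))
    sxx = sum(w * (x - x_bar) ** 2 for x, w in zip(xs, ws))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept

def baseline_at_criterion(slope, intercept, criterion=3.0):
    """Baseline severity at which the fitted difference equals the criterion."""
    return (criterion - intercept) / slope

slope, intercept = fit_weighted_line(baselines, differences, weights)
threshold = baseline_at_criterion(slope, intercept)
print(f"slope={slope:.2f}, intercept={intercept:.2f}, threshold={threshold:.1f}")
```

The threshold is very sensitive to the slope and intercept, so modest changes in how the effect sizes and weights are computed can easily move it by a couple of HRSD points.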

'PJ Leonard' has submitted a response, titled 'Analytical differences', pretty much repeating what I said above:
"It is good of Johnson et al to reply to the responses here. However, I do not think they have sufficiently dealt with some of the reservations concerning their paper.

In particular, I do not think that they have engaged with my finding that using the raw HRSD change scores reveals that the placebo response does not in fact decrease with increasing baseline severity on the HRSD.

I am not clear exactly what they mean when they say that I have used between-subjects analyses to suggest that the effect size (when analysing the raw HRSD change scores) is larger than presented in their paper, whereas they have used within-subjects analyses.

My analyses utilise conventional methods for meta-analysis where the effect size in each study is analysed directly, whereas it seems likely that the low estimated effect size in HRSD units in this study is the result of carrying out the meta-analytic weighting on the drug and placebo groups separately (a 'within subjects' analysis?), and then comparing the effect sizes thus obtained (which would explain the lack of forest plots in the paper).

This is not an acceptable analytic technique because it ignores that there is a relationship between the improvement in placebo and drug groups from the same study, but that the placebo and drug groups from any given study can have grossly different weightings when considered separately (e.g. there would be half as much weighting to the results from the fluoxetine trials in the drug analysis as the placebo analysis, the result of, for example, different sample sizes between the experimental arms).

Normalising the HRSD change to the change standard deviation in each group separately is also unacceptable because a larger change in HRSD score in the drug group could be associated with a greater variance, although this does not appear to be the case in this study.

Robert Waldmann estimates that there is more bias in analytical method in this paper than publication bias present in the data itself:

I note that Figure 4 in the paper of Kirsch et al is actually more consistent with my finding of 'clinical significance' at a baseline of 26 (this threshold is found both by regression on the difference scores, or separate regressions for each group's change score) than their suggestion of 28 points, this difference is undoubtedly because this figure looks at raw HRSD scores, as did my analyses, and because the NICE 'clinical significance' threshold of d > .5 is actually stricter than the NICE threshold of an HRSD difference > 3.

I concur that there is a relationship between baseline HRSD severity and effect size but it is worth noting that almost all studies examined had baselines over 23 points (and were thus in APA/NICE categories of 'very severe' depression) so the threshold of 26 points is a fairly average baseline severity for the studies analysed in this paper (as can be seen from my regression plots or their Figure 4). Any generalisation to less severe categories of depression is unwarranted given that it would depend on extrapolating the regression line to a region with only a single study."
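The weighting objection in the response above is easier to see with numbers. Here is a small Python sketch with two invented studies (not real trials) in which the arms have unequal sizes. I weight by sample size as a simple stand-in for proper inverse-variance weights, which is my assumption for illustration only. I am not claiming this is exactly what Kirsch et al did.

```python
# Two hypothetical studies with unequal arm sizes. NOT real trial data.
# In both studies the drug beats placebo by exactly 3 HRSD points.
studies = [
    {"n_drug": 100, "drug_change": 12.0, "n_placebo": 300, "placebo_change": 9.0},
    {"n_drug": 300, "drug_change": 8.0,  "n_placebo": 100, "placebo_change": 5.0},
]

def weighted_mean(values, weights):
    """Weighted average; weights here are sample sizes (an assumption)."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

# Approach 1: pool the drug arms and the placebo arms separately,
# then subtract the two pooled means.
pooled_drug = weighted_mean([s["drug_change"] for s in studies],
                            [s["n_drug"] for s in studies])
pooled_placebo = weighted_mean([s["placebo_change"] for s in studies],
                               [s["n_placebo"] for s in studies])
separate_estimate = pooled_drug - pooled_placebo

# Approach 2: compute the drug-placebo difference within each study,
# then pool those differences with one weight per study.
paired_estimate = weighted_mean(
    [s["drug_change"] - s["placebo_change"] for s in studies],
    [s["n_drug"] + s["n_placebo"] for s in studies],
)

print(f"separate pooling: {separate_estimate:.1f} HRSD points")
print(f"paired pooling:   {paired_estimate:.1f} HRSD points")
```

Every study in this toy example shows a 3 point advantage, and pooling the within-study differences recovers that. Pooling the arms separately gives 1 point, because the study with the larger improvements dominates the placebo pool but not the drug pool.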
