Pyjamas in Bananas: Kirsch et al reply again

Huedo-Medina, Johnson, and Kirsch have submitted a further response on PLOS Medicine in reply to further comments by myself and others, after their last reply.*

Placebo response and severity

Interestingly, I had observed that they needed to:

"clarify their position on the claim that placebo response decreases with increasing baseline severity, since this appears to be an artefact"

Rather than address this observation, they simply repeat the claim, saying that it is the 'unique' contribution of their article:

"without the within-group analyses it would not have been possible to conclude that placebo responses were lessening as initial severity of depression increased (whereas drug response remained constant; see our article’s Figures 2 and 3). This unique contribution of our article contradicts Wohlfarth’s conclusion that it contained “nothing new.”"

Flawed meta-analytic methods

Further, they defend their bizarre and biased analytical method:

"One of the main concerns in the new commentaries centred on one of our main analyses, which evaluated change for drug and placebo groups without taking a direct difference between them. Thus, effect sizes were calculated separately for each group for this analysis, though the analysis combined them. Leonard regarded this practice as “unorthodox” and Wohlfarth regarded it as “erroneous because the effect size in an RCT is defined as the difference between the effect of active compound and placebo.” First, these concerns ignore the fact that our article’s between-group analyses confirmed the major trends present in the analyses that considered within-group change. Specifically, both sets of analyses concluded that antidepressants’ efficacy was greater at higher initial severity, attaining clinical significance standards only for samples with extremely severe initial depression. Second, although the commentators may be correct that our within-group analyses are relatively innovative in this literature, it does not mean that they were wrong. To the contrary, these statistics are in conventional usage elsewhere (e.g., 3, 4, 5), as Waldman’s commentary implies...Finally, the analyses did incorporate a direct contrast between drug and placebo (see Table 2, and Model 2c, for example).
...
Although alternative weighting strategies may yield somewhat different results, the choices converge well both for the overall mean difference and for analyses of the trends across the literature. As an example, Leonard (04 March 2008) reported replicating our meta-regression patterns using alternative precision weights.
Importantly, as our article documented (Figures 2 & 3), the size of the difference between drug and placebo grows as the samples’ initial severity increases to extremely severe depression (but is very small at lower observed levels of initial severity). Because the overall differences between drug and placebo depended on initial severity, it is misleading to consider the overall difference in isolation."

But this does not address my objection that:

"This is not an acceptable analytic technique because it ignores that there is a relationship between the improvement in placebo and drug groups from the same study, but that the placebo and drug groups from any given study can have grossly different weightings when considered separately (e.g. there would be half as much weighting to the results from the fluoxetine trials in the drug analysis as the placebo analysis, the result of, for example, different sample sizes between the experimental arms)."

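To make the weighting objection concrete, here is a minimal sketch in Python with two invented trials. The numbers are hypothetical and are not taken from the Kirsch et al data, and simple sample-size weights stand in for whatever precision weights are actually used. Both trials show a drug-placebo difference of exactly 2 points, but the first has three times as many patients on drug as on placebo. Pooling each arm separately and then subtracting gives a different answer from pooling the within-trial differences:

```python
# Hypothetical two-trial example (numbers invented for illustration only).
trials = [
    {"drug_change": 8.0, "placebo_change": 6.0, "n_drug": 300, "n_placebo": 100},
    {"drug_change": 12.0, "placebo_change": 10.0, "n_drug": 100, "n_placebo": 100},
]


def weighted_mean(values, weights):
    """Weighted average of values using the given weights."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


def separate_arm_estimate(trials):
    """Pool drug arms and placebo arms separately, then subtract the pooled means."""
    drug = weighted_mean(
        [t["drug_change"] for t in trials], [t["n_drug"] for t in trials]
    )
    placebo = weighted_mean(
        [t["placebo_change"] for t in trials], [t["n_placebo"] for t in trials]
    )
    return drug - placebo


def within_trial_estimate(trials):
    """Take the drug-placebo difference inside each trial, then pool the differences."""
    diffs = [t["drug_change"] - t["placebo_change"] for t in trials]
    weights = [t["n_drug"] + t["n_placebo"] for t in trials]
    return weighted_mean(diffs, weights)


separate = separate_arm_estimate(trials)
within = within_trial_estimate(trials)
print(f"separate arms: {separate:.2f}, within trial: {within:.2f}")
```

In this toy case the separate-arm estimate is 1.0 even though every trial shows a difference of 2.0, purely because the low-change trial dominates the drug average but not the placebo average.
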
And as I say about the more conventional analysis they claim supports their 'unorthodox' analysis:

"I note that Figure 4 in the paper of Kirsch et al is actually more consistent with my finding of 'clinical significance' at a baseline of 26 (this threshold is found both by regression on the difference scores, or separate regressions for each group's change score) than their suggestion of 28 points..."

And Robert notes:

"The available unbiased estimate of the overall average benefit of NDA’s is equal to 2.65 HRSD units, which is considerably higher than Kirsch et al’s biased estimate [of 1.8]."

So while my and Robert's analyses confirm that the effect size of antidepressants increases with increasing baseline severity, they also show that Kirsch et al's claim that placebo responses decrease with baseline severity of depression is false, and that Kirsch et al report effect sizes that are considerably biased downwards.

It is worth thinking about the references they give to support their analytical method (numbers 3, 4, and 5; notice that they are all in either psychology or education journals). The most recent, and thus most easily available, is reference 5: Morris & DeShon (2002), 'Combining Effect Size Estimates in Meta-Analysis With Repeated Measures and Independent-Groups Designs', Psychological Methods 7(1), 105-25. It is about combining the results from repeated measures designs and independent-groups designs, concentrating on training effectiveness, organizational development, and psychotherapy. It is not an article about medical meta-analysis:

"The issue of combining effect sizes across different research designs is particularly important when the primary research literature consists of a mixture of independent-groups and repeated measures designs. For example, consider two researchers attempting to determine whether the same training program results in improved outcomes (e.g., smoking cessation, job performance, academic achievement). One researcher may choose to use an independent-groups design, in which one group receives the training and the other group serves as a control. The difference between the groups on the outcome measure is used as an estimate of the treatment effect. The other researcher may choose to use a single-group pretest-posttest design, in which each individual is measured before and after treatment has occurred, allowing each individual to be used as his or her own control.1 In this design, the difference between the individuals’ scores before and after the treatment is used as an estimate of the treatment effect."

I hope you can already see why this situation is not comparable to a meta-analysis of double blind randomised placebo controlled drug trials. Single-group repeated measures designs would not be appropriate for such trials (you can have within-subjects cross-over designs, but that is not what is being discussed here): we know placebo effects are very important in drug trials, so we require placebo control arms. The Kirsch et al meta-analysis therefore involved only independent groups, so there is no need to worry about combining repeated measures and independent groups, and, as Morris & DeShon say:

"When the research base consists entirely of independent-groups designs, the calculation of effect sizes is straightforward and has been described in virtually every treatment of meta-analysis"

That is, there is no need to use this unusual method, because perfectly good methods already exist for analysing these data.

So Morris & DeShon are concerned with what to do when you have no control group for some of your studies - which is not the case in double blind RCTs because a study without a control group is considered an invalid measure of drug effects. The other two references, number 4, Gibbons et al (1993), and number 3, Becker (1988), also emphasise this aspect of the method (I haven't read these studies):

"With this approach, data from studies using different designs may be compared directly and studies without control groups do not need to be omitted."

But we have no need to do this, so we have no need for the analytical method used by Kirsch et al. We have no need for it precisely because medical meta-analyses consider only double blind RCTs and explicitly reject studies without control groups, since an estimate of the placebo response in each trial is considered essential.

So we have no reason to use the method of Kirsch et al, but what reasons do we have for not using these "innovative...statistics...in conventional usage elsewhere"? What do Morris & DeShon have to say? Obviously they're concerned about when it is acceptable to combine studies with and without controls, and they conclude that ideally, if you intend to do this, there oughtn't to be a change in the control group with time, i.e. there should be no placebo effect in the control group. But they do refer to Becker for a meta-analytic method proposed for use when there is a placebo effect:

"Becker (1988) described two methods that can be used to integrate results from single-group pretest-posttest designs with those from independent-groups pretest-posttest designs. In both cases, meta-analytic procedures are used to estimate the bias due to a time effect. The methods differ in whether the correction for the bias is performed on the aggregate results or separately for each individual effect size. The two methods are briefly outlined below, but interested readers should refer to Becker (1988) for a more thorough treatment of the issues.
An important assumption of this method is that the source of bias (i.e., the time effect) is constant across studies. This assumption should be tested as part of the initial meta-analysis used to estimate the pretest-posttest change in the control group. If effect sizes are heterogeneous, the investigator should explore potential moderators, and if found, separate time effects could be estimated for subsets of studies." [my emphasis]

That is, if you can't be sure that the placebo effect is constant across studies, you shouldn't combine studies using this method. And, of course, this is precisely the objection that I and others have raised to this method - because we already know that placebo responses can vary between trials - that is why we have placebo control arms in randomised controlled trials!

So Huedo-Medina, Johnson, and Kirsch are advocating the rejection of the usual meta-analytic techniques used in medical research, where the highest standards are required and control groups are considered very important, in favour of a method from psychology and education that is only used when two different designs, one of which is rejected in medical research, need to be combined, and where placebo effects are downplayed: a method that even its advocates recognise is unsuitable when placebo responses are heterogeneous between studies.

This is quite some defence when you look at the scatter on the placebo responses in the Kirsch et al meta-analysis: that's about as heterogeneous as it gets, and it isn't explained by baseline severity of depression. So the assumptions underlying the meta-analytic method used by Kirsch et al are violated, even according to the citations they refer to in justifying their approach! Even the original study (with standardised mean differences rather than raw change scores) showed great heterogeneity in the placebo arm:

"The amounts of change for...placebo groups varied widely around their respective means, Q(34)s = ... 74.59, p-values [less than] 0.05, and I2s = ... 54.47"

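For readers unfamiliar with these statistics, I² can be recovered from Q and its degrees of freedom using the standard Higgins and Thompson formula, I² = max(0, (Q - df) / Q) x 100. The following is only a minimal sketch in Python: the function name is made up for illustration, and the inputs are the placebo figures from the quote above.

```python
def i_squared(q, df):
    """Higgins and Thompson's I^2 as a percentage.

    q  : Cochran's Q statistic
    df : degrees of freedom (number of estimates minus one)
    """
    if q <= 0:
        return 0.0
    # Share of total variation in excess of what chance alone would produce.
    return max(0.0, (q - df) / q) * 100.0


# Placebo-arm figures as quoted: Q(34) = 74.59
placebo_i2 = i_squared(74.59, 34)
print(f"I^2 = {placebo_i2:.1f}%")
```

This gives roughly 54.4, close to the quoted 54.47 (the small gap may reflect rounding in the reported Q). In other words, on these figures a little over half of the variation in placebo change is more than sampling error would be expected to produce.
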
Precision versus bias

Huedo-Medina et al also completely misunderstand the objections of Robert Waldmann and others by saying:

"Waldman argued that our estimates of the overall difference between drug and placebo was conservatively biased (i.e., too small) because of assumptions present in our estimates of precision for each effect size. It is of course not possible to be certain that one has completely removed error from any measurement, or for that matter, to do so in an analysis of measures from independent trials. As Young noted, there are uncontrolled measurement errors or artefacts that necessitate the use of a control group and the randomised controlled trial design.
...
The calculation of a weighted effect size by using the inverse of each within-subjects variance is more precise than a sample-size weighted average (9), contrary to the Waldman’s assertion."

But his assertion, of course, is that their estimates may be precise but are biased:

"In each case, Kirsch chose a method which, under strong assumptions, gives an efficient and unbiased estimate of the true overall average benefit. In each case there are alternative approaches which are less efficient under those assumptions but which are unbiased not only when the Kirsch et al estimates are unbiased, but also for many cases in which the Kirsch et al estimates are biased. That is they are less efficient under the null but more robust. In each case the null hypothesis that the Kirsch et al estimator is unbiased has been tested and overwhelmingly rejected. The available unbiased estimate of the overall average benefit of NDA’s is equal to 2.65 HRSD units, which is considerably higher than Kirsch et al’s biased estimate."

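The distinction between precision and bias can be illustrated with a small Python sketch. This is a toy reading of the point, not Robert's actual calculation, and every number in it is invented. Three hypothetical trials of equal size have different drug-placebo differences, and the trial with the smallest difference also happens to have the smallest variance. Inverse-variance weighting is then pulled towards that trial, while sample-size weighting recovers the plain average benefit across trials:

```python
# Hypothetical drug-placebo differences and the variances of those estimates.
# All values are invented for illustration only.
differences = [1.0, 2.0, 3.0]
variances = [0.25, 1.0, 1.0]
sample_sizes = [100, 100, 100]


def weighted_mean(values, weights):
    """Weighted average of values using the given weights."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


# Inverse-variance weights: efficient if every trial estimates the same true effect.
inverse_variance_weights = [1.0 / v for v in variances]
inverse_variance_estimate = weighted_mean(differences, inverse_variance_weights)

# Sample-size weights: less efficient under that assumption, but not pulled
# towards trials whose variance happens to be small.
sample_size_estimate = weighted_mean(differences, sample_sizes)

print(f"inverse-variance: {inverse_variance_estimate:.2f}")
print(f"sample-size:      {sample_size_estimate:.2f}")
```

With these made-up inputs the inverse-variance estimate is 1.5 against an average benefit across trials of 2.0. The tighter weighting buys precision only under the assumption that variance is unrelated to effect size, which is exactly the assumption being disputed.
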
* UPDATE
PJ Leonard replies to Huedo-Medina et al on PLoS Medicine.

2 comments:

Anonymous said...: You're doing a very fine job of following this up and reporting this issue - thank you.

We'll update our posts relating to this.; 5 May 2008 at 03:26:00 BST
Robert said...: also they mispeled my last nname.
Thanks for doing a very fine job of following this up.; 5 May 2008 at 22:32:00 BST

Sunday, 4 May 2008
