Thursday, 28 February 2008

Antidepressants redux part deux (updated)

Rather than writing the systematic review I'm supposed to be doing, and following on from a suggestion made by Cyril Hoschl regarding the PLoS study, here's another back-of-the-envelope forest plot, this time using the final Hamilton depression score and calculating the weighted mean difference. Again the data is not 100% correct: I have estimated the SD from the standardised change difference and the final and baseline scores, so it is the pooled SD between baseline and final scores and likely an underestimate.

Many people would argue that it is better to focus on final scores than change scores in RCTs, and certainly this way we are keeping all results in units of Hamilton depression score.
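
For anyone who wants to see the arithmetic behind a weighted mean difference, here is a minimal Python sketch. It assumes a fixed-effect, inverse-variance pooling of final HRSD scores, which is not necessarily the model behind the forest plot described here; the function name and the two example trials are made up for illustration and are not taken from the PLoS data.

```python
import math

def pooled_mean_difference(trials):
    """Fixed-effect inverse-variance pooling of raw mean differences.

    Each trial is a tuple of final HRSD scores:
    (mean_placebo, sd_placebo, n_placebo, mean_drug, sd_drug, n_drug).
    A positive result means the drug group finished lower than placebo.
    Returns (pooled difference, lower 95% limit, upper 95% limit).
    """
    weights, diffs = [], []
    for m_p, sd_p, n_p, m_d, sd_d, n_d in trials:
        diff = m_p - m_d
        # Variance of a difference between two independent group means
        var = sd_p ** 2 / n_p + sd_d ** 2 / n_d
        diffs.append(diff)
        weights.append(1.0 / var)
    total_w = sum(weights)
    pooled = sum(w * d for w, d in zip(weights, diffs)) / total_w
    se = math.sqrt(1.0 / total_w)
    return pooled, pooled - 1.96 * se, pooled + 1.96 * se

# Hypothetical trials, invented purely to show the calculation
example_trials = [
    (17.0, 8.0, 50, 14.0, 8.0, 50),
    (16.0, 7.0, 100, 13.5, 7.5, 100),
]
pooled, low, high = pooled_mean_difference(example_trials)
print(f"WMD {pooled:.2f} (95% CI {low:.2f} to {high:.2f})")
```

Trials with smaller SDs or larger groups get more weight, and the result stays in Hamilton points, which is the attraction of this approach.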

So what does it show? Well dear reader, it is quite interesting. Overall the effect size is 2.93 (95% CI 1.99-3.87), but if we look at paroxetine and venlafaxine the weighted mean differences in HRSD scores are 3.67 and 3.93 points respectively - both above NICE's criteria of:
"a between group difference of at least 3 points (2 points for treatment-resistant depression) was considered clinically significant for both BDI and HRSD [Hamilton Rating Scale for Depression]"
Make of that what you will.

UPDATE
Ok, as I point out above, my estimates of SD in this analysis are a bit dodgy, so how can I sense-check the above results? Well, there's a rather obvious way: repeating the analysis below using the raw Hamilton change scores rather than the standardised mean difference (UPDATE2 a more accurate version of this analysis is presented here).

When you read a study and find that the authors have used an unusual method of analysis you always have to ask yourself why. Often it is just because they are particularly weird, or particularly clever. But sometimes it is because this method gets them the results they were looking for, and the more obvious 'normal' way of doing things doesn't. In this study they didn't just use the standardised mean difference to carry out their analysis; they actually calculated the difference between drug and placebo by subtracting the standardised mean difference of the improvement in one group from the SMD of the improvement in the other group. I'm not sure that's a very meaningful measure of improvement: they've normalised the improvement within each group and then calculated the difference between these scores, which have been standardised to two different groups. Usually you would look at the SMD of the change in raw scores between the two groups (i.e. you subtract the mean improvement in HRSD in placebo from the mean improvement in the drug group and then standardise by dividing by the pooled standard deviation derived from the SD of the mean improvement in each group).
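
To make the distinction concrete, here is a small Python sketch of the two calculations. The function names are hypothetical, the numbers are invented (and deliberately extreme), and it assumes each group's improvement is standardised by that group's own SD of change, which may not be exactly what the study authors did.

```python
import math

def difference_of_within_group_d(change_drug, sd_drug, change_placebo, sd_placebo):
    """Standardise each group's improvement by its own SD, then subtract."""
    return change_drug / sd_drug - change_placebo / sd_placebo

def smd_of_change(change_drug, sd_drug, n_drug, change_placebo, sd_placebo, n_placebo):
    """Subtract the raw improvements first, then divide by the pooled SD of change."""
    pooled_var = (
        (n_drug - 1) * sd_drug ** 2 + (n_placebo - 1) * sd_placebo ** 2
    ) / (n_drug + n_placebo - 2)
    return (change_drug - change_placebo) / math.sqrt(pooled_var)

# Invented trial: drug improves 10 points (SD 8), placebo 8 points (SD 5), 50 per arm
print(difference_of_within_group_d(10, 8, 8, 5))   # about -0.35
print(smd_of_change(10, 8, 50, 8, 5, 50))          # about 0.30
```

When the two groups have the same SD the two measures coincide; they drift apart (here even changing sign) as the SDs differ, which is why standardising to two different groups is hard to interpret.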

When the analysed studies have all used the same measurement scale, as in this case, I think most people would use weighted mean difference rather than standardised mean difference for the final effect size*. Although the study reports that they did perform a weighted mean difference analysis in addition to their odd SMD analysis, their description leaves me a bit unclear exactly how they went about it and whether their reported effect size was derived from this. They report finding:
"Confirming earlier analyses [2], but with a substantially larger number of clinical trials, weighted mean improvement was 9.60 points on the HRSD in the drug groups and 7.80 in the placebo groups, yielding a mean drug–placebo difference of 1.80 on HRSD improvement scores."
When I reanalysed the data I extracted from the paper, this time using the mean difference expressed in raw Hamilton depression scores, I found something curious (see the forest plot reproduced to the right).

I made the overall effect size 2.74 (95% CI 1.90-3.58) which is a fair bit higher than the 1.80 the study authors report, and more importantly I found that the effect sizes for paroxetine and venlafaxine were 3.13 and 3.92 Hamilton depression score points respectively. These are both above that oh so important NICE 'clinical efficacy' threshold.

So my conclusion is that I don't trust this study, and I certainly don't trust their conclusions. I'm not sure exactly why my analyses are so different from the authors' but I'm fairly sure it has something to do with their over-reliance on the standardised mean difference as a measure of effect size.


* The decision is more complicated than that, the Cochrane handbook says:

"There are two summary statistics used for meta-analysis of continuous data, the mean difference (MD) and the standardised mean difference (SMD) (see 8.2.2 Effect measures for continuous outcomes). Selection of summary statistics for continuous data is principally determined by whether trials all report the outcome using the same scale (when the mean difference can be used) or using different scales (when the standardised mean difference has to be used).
It is important to note the different roles played in the two approaches by the standard deviations of outcomes observed in the two groups.
For the mean difference method the standard deviations are used together with the sample sizes to compute the weight given to each study. Studies with small standard deviations are given relatively higher weight whilst studies with larger standard deviations are given relatively smaller weights. This is appropriate if variation in standard deviations between studies reflects differences in the reliability of outcome measurements, but is probably not appropriate if the differences in standard deviation reflect real differences in the variability of outcomes in the study populations.
For the standardised mean difference approach the standard deviation is used to standardise the mean differences to a single scale (see 8.2.2.2 The standardised mean difference), as well as in the computation of study weights. It is assumed that variation between standard deviations reflects only differences in measurement scales and not differences in the reliability of outcome measures or variability among trial populations.
These limitations of the methods should be borne in mind where unexpected variation of standard deviations across studies is observed."

Antidepressants redux

For people interested in this sort of thing I've done a back-of-the-envelope forest plot of the PLoS study - the data is derived from their Table 1 and isn't 100% accurate (due to the way I derived the SD from the confidence intervals), and I've used a weighted mean difference of standardised mean differences rather than a true standardised mean difference because of the way they've presented their data (but it shouldn't make much difference to comparing my results to theirs). Personally I wouldn't have used standardised mean difference scores like this, I'd have wanted to use the raw scores since all studies used the same rating scale and it seems odd to standardise within treatment group in this way (but I don't know how the data was presented to the FDA - they may have had to use the data this way).
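
One common way to back out an SD from a confidence interval can be sketched in Python as follows. This is a generic illustration, not necessarily the exact calculation used for the plot: it assumes a 95% CI around a group mean built from a normal approximation (for small groups a t value would be more appropriate), and the function name and example numbers are hypothetical.

```python
import math

def sd_from_ci(lower, upper, n, z=1.96):
    """Recover an approximate SD from a confidence interval around a mean.

    The CI half-width is z * SE, and SE = SD / sqrt(n), so
    SD = sqrt(n) * (upper - lower) / (2 * z).
    """
    se = (upper - lower) / (2 * z)
    return se * math.sqrt(n)

# Hypothetical group: 95% CI of 8.04 to 11.96 from 64 patients
print(round(sd_from_ci(8.04, 11.96, 64), 2))  # 8.0
```

Any rounding in the published limits feeds straight into the recovered SD, which is one reason estimates obtained this way are only approximate.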

So we can see that I've pretty much replicated their finding of a .32 effect size (95% CI .24-.41) and this holds if we exclude studies with group sizes below 40.

I think it is interesting to note that this study hasn't told us much more than we already knew since you'll note that effect sizes are not exactly huge if we were looking for a d > .5 'medium' effect. You'll also note that our confidence limits do not include .5 so NICE would classify it as:
"There is evidence suggesting that there is a statistically significant difference between x and y but the size of this difference is unlikely to be of clinical significance."
Someone somewhere was asking about how effect size is influenced by study size (commonly used as a proxy for study quality). I've already said that excluding small samples doesn't affect the conclusions and looking at a simple scatter plot, if anything, larger studies have a smaller effect size. The funnel plot I've derived here is also unremarkable. Excluding the outlying study of mildly depressed subjects doesn't make much difference either.

The interesting thing is to look at the data split by individual antidepressant. We can see that paroxetine and venlafaxine have larger effect sizes (both .42, with confidence intervals crossing .5) than nefazodone and fluoxetine (.22 and .24 respectively, neither CI crossing .5). In the PLoS study they remark that:
"Although venlafaxine and paroxetine had significantly (p < 0.001) larger weighted mean effect sizes comparing drug to placebo conditions (ds = 0.42 and 0.47, respectively) than fluoxetine (d = 0.22) or nefazodone (0.21), these differences disappeared when baseline severity was controlled."
But that is a rather troubling caveat to their overall conclusion. What they are saying is that their regression analysis suggests that the venlafaxine and paroxetine trials enrolled more severe patients and that could be why they had greater responses to the medication. But at the very least we must conclude that in the trials that were actually performed and submitted to the FDA there was a reasonable effect size due to these two drugs (we might also conclude that there was little evidence of a meaningful effect size of the other two). However, according to the NICE criteria we should still say that for venlafaxine and paroxetine:
"There is evidence suggesting that there is a statistically significant difference between x and y but there is insufficient evidence to determine its clinical significance."

Tuesday, 26 February 2008

Misrepresenting science

Further to the discussion below on the PLoS Medicine paper on antidepressants versus placebo, I've noted what I think is a strange way of reporting the paper that ultimately traces back to the authors themselves and the PLoS summary.

From the authors:

"the increased benefit for extremely depressed patients seems attributable to a
decrease in responsiveness to placebo, rather than an increase in responsiveness
to medication."

And the summary:

"The findings also show that the effect for these patients seems to be due to
decreased responsiveness to placebo, rather than increased responsiveness to
medication."

Now there are two ways to interpret this (see Figure 3 repeated here). One way would be to say that if we assume (for simplicity) that drug responses are due to placebo + 'true' drug effect then the 'true' drug effect in severe depression is greater than for milder depression. The decreased placebo effect at more severe depression suggests that there is less spontaneous remission (or less response to non-specific interventions) with more severe depression.

But a lot of people have been saying things like:

"The only exception is in the most severely depressed patients...But that is probably because the placebo stopped working so well, they say, rather than the drugs having worked better."

Or:

"The researchers said that the drug was more effective than a placebo in severely depressed patients but that this was because of a decreased placebo effect."

Even:

"...with slightly more benefit in severe depression but only because of less
response to placebo"

And:

"People with severe symptoms appeared to gain more clear-cut benefit - but this
might be more down to the fact that they were less likely to respond to the
placebo pill, rather than to respond positively to the drugs. "

Now there is a sense in which this could be true, and I really hope it is the interpretation that the authors intended: one that is consistent with the view that a large placebo effect can mask the response to a drug. So if the drugs cause a constant improvement in symptoms at any level of severity (see the figure) then at lower severity, where the placebo effect is greater, those people who got better with the drug would have got better anyway, so there is no net difference between the groups. At more severe levels of depression the drug works just as well but those people who get better on the drug wouldn't have got better with the placebo; that is, the decreasing placebo response unmasks the 'real' drug effect. Technically what we are saying here is that the drug effect is not additive: the response in the drug group is not simply the response to placebo plus the 'true' drug effect. But note here that we are saying that the drugs do in fact have an appreciable 'true' drug effect that is simply masked by high placebo response rates.
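
The masking idea can be sketched with a toy calculation in Python. It works in terms of the probability of responding rather than HRSD points, and it assumes (purely for illustration) that the drug's specific effect and the placebo-type response act independently, so a patient on the drug improves if either one occurs. The function names and the probabilities are hypothetical.

```python
def drug_arm_response(p_placebo, p_specific):
    """Probability of responding on the drug under the masking assumption:
    a patient responds if the placebo-type response OR the specific drug
    effect occurs, with the two treated as independent."""
    return 1.0 - (1.0 - p_placebo) * (1.0 - p_specific)

def drug_placebo_gap(p_placebo, p_specific):
    """Observed difference in response between the drug arm and the placebo arm."""
    return drug_arm_response(p_placebo, p_specific) - p_placebo

# The specific drug effect is held constant while the placebo response falls
for p_placebo in (0.6, 0.4, 0.2):
    gap = drug_placebo_gap(p_placebo, p_specific=0.3)
    print(f"placebo response {p_placebo:.1f} -> drug-placebo gap {gap:.2f}")
```

With the specific effect held at 0.3 throughout, the gap widens from 0.12 to 0.24 as the placebo response falls from 0.6 to 0.2, even though the drug itself is doing nothing different.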

However, the way the stories are written, and I admit this may just be me over-interpreting the subtext, suggests to me that the implication intended is that the falling placebo effect with increasing severity is somehow causing the greater apparent effect of the antidepressants over placebo and that this is thus not a 'real' effect. That would be a fundamental misunderstanding.

[Language Log has a nice discussion of the reporting of this paper]

The drugs don't work?

Interesting study looking at antidepressants out in PLoS Medicine. It does a meta-analysis:

"from the FDA all publicly releasable information about the clinical trials for efficacy conducted for marketing approval of fluoxetine, venlafaxine, nefazodone, paroxetine, sertraline, and citalopram, the six most widely prescribed antidepressants approved between 1987 and 1999".
So it looks at the evidence available to the FDA at the time it licensed these SSRIs (they aren't actually all SSRIs), but not necessarily all the evidence that is in fact available on these drugs. In particular, most studies lasted only six weeks and, despite the authors' conclusions about mild depression, only one study actually looked at mild depression; they reach those conclusions primarily by extrapolating a regression line.

What they find, in summary, is that there is in fact a statistically significant benefit of the SSRIs over placebo, but that this difference was below the criteria that NICE use to determine clinical significance*. They also find that efficacy increases (relative to placebo) as the severity of depression increases, reaching NICE's criteria for severe depression (see their Figure 2, or here).

They make something of the fact that the greater difference in severe depression is driven by a reduction in the efficacy of placebo, but that seems neither here nor there really - in fact it suggests that the very high placebo response rate for less severe forms of depression is masking the response to SSRIs (a problem well known in other areas such as low back pain).

It is worth noticing that NICE already recommends that:

"In mild and moderate depression, consider psychological treatment specifically focused on depression (problem-solving therapy, brief CBT and counselling) of 6 to 8 sessions over 10 to 12 weeks...Antidepressants are not recommended for the initial treatment of mild depression, because the risk–benefit ratio is poor."
Although they also recommend:

"In moderate depression, offer antidepressant medication to all patients routinely, before psychological interventions...CBT is the psychological treatment of choice. Consider interpersonal psychotherapy (IPT) if the patient expresses a preference for it or if you think the patient may benefit from it...For patients who have not made an adequate response to other treatments for depression (for example, antidepressants and brief psychological interventions), consider giving a course of CBT of 16 to 20 sessions over 6 to 9 months."
There was some mention in the news coverage this morning that 'talking' therapies would be a better idea instead of SSRIs. I am always interested in this view, which is very common, because there is little evidence that talking therapies, and in particular the best studied therapy CBT, are any better than medication or any cheaper (which isn't to say we don't need more clinical psychologists).


* Worth noting here, I think, that there is a difference between something being licensed because it is relatively safe and effective, and thus a permitted drug - as the FDA (in the US) or MHRA (in the UK) do - and something being cost effective - as NICE seeks to determine for the NHS in the UK.

The NICE criteria are:

"For continuous outcomes for which an SMD [standardised mean difference] was calculated (for example, when data from different versions of a scale are combined), an effect size of ~0.5 (a ‘medium’ effect size (Cohen, 1988)) or higher was considered clinically significant. Where a WMD [weighted mean difference] was calculated, a between group difference of at least 3 points (2 points for treatment-resistant depression) was considered clinically significant for both BDI and HRSD [Hamilton Rating Scale for Depression]...Where an ES [effect size] was statistically significant, but not clinically significant and the CI [confidence interval] excluded values judged clinically important, the result was characterised as ‘unlikely to be clinically significant’ (S3). Alternatively, if the CI included clinically important values, the result was characterised as ‘insufficient to determine clinical significance’ (S6)."
And NICE found that:

"There is evidence suggesting that there is a statistically significant difference favouring SSRIs over placebo on reducing depression symptoms as measured by the HRSD but the size of this difference is unlikely to be of clinical significance (N= 16; n= 2223; Random effects SMD= -0.34; 95% CI, -0.47 to -0.22).
In moderate depression there is evidence suggesting that there is a statistically significant difference favouring SSRIs over placebo on reducing depression symptoms as measured by the HRSD but the size of this difference is unlikely to be of clinical significance (N= 2; n= 386; SMD= -0.28; 95% CI, -0.48 to -0.08).
In severe depression there is some evidence suggesting that there is a clinically significant difference favouring SSRIs over placebo on reducing depression symptoms as measured by the HRSD (N= 4; n= 344; SMD= -0.61; 95% CI, -0.83 to -0.4).
In very severe depression there is evidence suggesting that there is a statistically significant difference favouring SSRIs over placebo on reducing depression symptoms, as measured by the HRSD, but the size of this difference is unlikely to be of clinical significance (N= 5; n= 726; SMD= -0.39; 95% CI, -0.54 to -0.24)."
So NICE's findings were not all that dissimilar to those of this study:

"weighted mean improvement was 9.60 points on the HRSD in the drug groups and 7.80 in the placebo groups, yielding a mean drug–placebo difference of 1.80 on HRSD improvement scores...the standardized mean difference, d, mean change for drug groups was 1.24 and that for placebo 0.92, both of extremely large magnitude according to conventional standards. Thus, the difference between improvement in the drug groups and improvement in the placebo groups was 0.32, which falls below the 0.50 standardized mean difference criterion that NICE suggested."
Of course the Cohen medium effect size criteria are completely arbitrary (even if not unreasonable; but see here for a discussion of whether Kirsch et al actually measured a true Cohen d effect size) and note that NICE is a lot more circumspect in dealing with statistically significant differences that they do not deem to be clinically significant than the authors of the PLoS study.
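
Read literally, the NICE wording quoted above amounts to a small decision rule, which can be sketched in Python. This is one reading of the quoted text, not NICE's own procedure: the function name is made up, effects are entered as positive when they favour the drug (so the signs of NICE's SMDs are flipped), and '~0.5' is treated as exactly 0.5.

```python
def nice_category(effect, ci_low, ci_high, threshold=0.5):
    """Classify an effect size along the lines of the quoted NICE wording.

    Positive values are assumed to favour the drug. Use threshold=0.5 for
    an SMD, or threshold=3.0 for a WMD in HRSD points.
    """
    if ci_low <= 0:
        return "no statistically significant benefit"
    if effect >= threshold:
        return "clinically significant"
    if ci_high < threshold:
        # CI excludes clinically important values
        return "unlikely to be of clinical significance"
    # CI includes clinically important values
    return "insufficient evidence to determine clinical significance"

# NICE's overall SSRI result, sign flipped: SMD 0.34 (0.22 to 0.47)
print(nice_category(0.34, 0.22, 0.47))  # unlikely to be of clinical significance
# NICE's severe depression result, sign flipped: SMD 0.61 (0.40 to 0.83)
print(nice_category(0.61, 0.40, 0.83))  # clinically significant
# Hypothetical result whose upper limit crosses 0.5
print(nice_category(0.42, 0.30, 0.54))  # insufficient evidence to determine clinical significance
```

Writing it out this way shows how much hangs on where the upper confidence limit happens to fall relative to an arbitrary cut-off.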

Note that NICE also found:

"There is strong evidence suggesting that there is a clinically significant difference favouring SSRIs over placebo on increasing the likelihood of patients achieving a 50% reduction in depression symptoms as measured by the HRSD (N = 1742; n = 3143; RR = 0.73; 95% CI, 0.69 to 0.78)."
And in the PLoS study dichotomous results such as 50% reductions on the HRSD are not addressed, only average changes in the HRSD score.

Turner & Rosenthal (from the NEJM paper on selective publication in antidepressant trials) have an interesting editorial on this topic in the BMJ, where they say:

"Clinical significance is an important concept because a clinical trial can show superiority of a drug to placebo in a way that is statistically, but not clinically, significant. Tests of statistical significance give a yes or no answer (for example, P<0.05 significant, P>0.05 non-significant) that tells us whether the true effect size is zero or not, but it tells us nothing about the size of the effect.3 In contrast, effect size does, and thus allows us to look at the question of clinical significance. Values of 0.2, 0.5, and 0.8 were proposed to represent small, medium, and large effects, respectively.4

NICE chose the "medium" value of 0.5 as a cut-off below which they deem benefit of a drug not clinically significant.5 This is problematic because it transforms effect size, a continuous measure, into a yes or no measure, thereby suggesting that drug efficacy is either totally present or absent, even when comparing values as close together as 0.51 and 0.49. Kirsch and colleagues compared their effect size of 0.32 to the 0.50 cut-off and concluded that the benefits of antidepressant drugs were of no clinical significance.

But on what basis did NICE adopt the 0.5 value as a cut-off? When Cohen first proposed these landmark effect size values, he wrote, "The terms ‘small’, ‘medium’, and ‘large’ are relative . . . to each other . . . the definitions are arbitrary . . . these proposed conventions were set forth throughout with much diffidence, qualifications, and invitations not to employ them if possible." He also said, "The values chosen had no more reliable a basis than my own intuition." Thus, it seems doubtful that he would have endorsed NICE’s use of an effect size of 0.5 as a litmus test for drug efficacy. "

Interestingly, even Moncrieff & Kirsch say:
"No research evidence or consensus is available about what constitutes a clinically meaningful difference in Hamilton scores, but it seems unlikely that a difference of less than 2 points could be considered meaningful. NICE required a difference of at least 3 points as the criterion for clinical importance but gave no justification for this figure."

Wednesday, 20 February 2008

Bridgend Suicides

Via the badscience forums, interesting discussion of 'clustering' of suicides in the Bridgend area.

I saw a bunch of journalists defending their coverage yesterday, usual dissembling about how it is really the fault of everyone else, and how it is a real story because:
"Latest statistics available from the Office for National Statistics show that there were three suicides in 2004-2005 in the Bridgend area for those aged between 15 and 30, and three in 2006."
But according to the Guardian:
"The sad fact is that 16 suicides among young people in Bridgend in 12 months is no worse than usual. There were 13 suicides by young people in 2007, and 21 in total. In 2006 the total was 28."
It looks like the former figures come from:
"A briefing document prepared last month for Bridgend Local Health Board (LHB) by the National Public Health Service for Wales...“Three LHBs have rates of suicide among males aged 15-24 that exceed the Welsh average to a level considered statistically significant. These are Denbighshire, Neath Port Talbot and Bridgend. Bridgend and Neath Port Talbot show the highest levels with an average of three cases of suicide among males of this age per year (1996-2006) in each area. They were lowest in Ceredigion with an average of one case per year.”"
So what is it, 3 for 15-30s in 2006, or 28? According to this:
"Four years ago, I began noticing a cluster of suicides of young males. It reached a peak in 2006 when 17 young people in the Bridgend constituency and four in Ogmore took their own lives - that makes a total of 21 in Bridgend. As the press has highlighted many times in recent weeks, a similar number of suicides occurred last year."
But looking at the National Public Health Service for Wales briefing document, we see that the rate for 15-24 year olds in 2004-2006 is 2 deaths/year.

I'm not sure what is going on here - anyone got any ideas? My first instinct was a difference in geographical areas, as Philip Irwin in the Guardian suggests, but it doesn't seem to be that, judging from the National Public Health Service figures. My next guess is that it depends on what you define as 'young': Irwin says "Men aged 16-35 are most at risk" and Bridgend's figures for 2004-2006 show an average of 16 suicides a year among over-15s, so perhaps those missing 'young people' are in that 25-35 age bracket (a peak age group for suicide).

Monday, 4 February 2008