Many people would argue that it is better to focus on final scores than change scores in RCTs, and certainly this way we are keeping all results in units of Hamilton depression score.

So what does it show? Well, dear reader, it is quite interesting. Overall the effect size is 2.93 (95% CI 1.99-3.87), but if we look at paroxetine and venlafaxine, the weighted mean differences in HRSD scores are 3.67 and 3.93 points respectively - both above NICE's criterion of:

"a between group difference of at least 3 points (2 points for treatment-resistant depression) was considered clinically significant for both BDI and HRSD [Hamilton Rating Scale for Depression]"

Make of that what you will.

**UPDATE**

Ok, as I point out above, my estimates of SD in this analysis are a bit dodgy, so how can I sense check the above results? Well, there's a rather obvious way: repeating the analysis below using the raw Hamilton change scores rather than the standardised mean difference (**UPDATE2** a more accurate version of this analysis is presented here).

When you read a study and find that the authors have used an unusual method of analysis you always have to ask yourself why. Often it is just because they are particularly weird, or particularly clever. But sometimes it is because this method gets them the results they were looking for, and the more obvious 'normal' way of doing things doesn't.

In this study they didn't just use the standardised mean difference (SMD) to carry out their analysis. They actually calculated the difference between drugs and placebo by subtracting the SMD of the improvement in one group from the SMD of the improvement in the other group. I'm not sure that's a very meaningful measure of improvement: they've normalised the improvement within each group, then calculated the difference between scores which have been standardised to two different groups. Usually you would look at the SMD of the change in raw scores between the two groups (i.e. you subtract the mean improvement in HRSD in the placebo group from the mean improvement in the drug group and *then* standardise by dividing by the pooled standard deviation derived from the SD of the mean improvement in each group).
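To make the distinction concrete, here is a minimal Python sketch of the two calculations as described above. The numbers are invented purely for illustration (they are not from any trial in the paper), and the function names are hypothetical.

```python
import math


def pooled_sd(sd_a, n_a, sd_b, n_b):
    """Pooled standard deviation of two groups."""
    return math.sqrt(((n_a - 1) * sd_a**2 + (n_b - 1) * sd_b**2) / (n_a + n_b - 2))


def difference_of_within_group_smds(change_drug, sd_drug, change_placebo, sd_placebo):
    """Standardise each group's improvement by its own SD, then subtract."""
    return change_drug / sd_drug - change_placebo / sd_placebo


def smd_of_change_scores(change_drug, sd_drug, n_drug,
                         change_placebo, sd_placebo, n_placebo):
    """Subtract the raw improvements first, then divide by the pooled SD."""
    raw_difference = change_drug - change_placebo
    return raw_difference / pooled_sd(sd_drug, n_drug, sd_placebo, n_placebo)


# Hypothetical trial: drug improves 10 points (SD 8), placebo 8 points (SD 5).
odd_method = difference_of_within_group_smds(10.0, 8.0, 8.0, 5.0)
usual_method = smd_of_change_scores(10.0, 8.0, 100, 8.0, 5.0, 100)
print(f"difference of within-group SMDs: {odd_method:.2f}")  # -0.35
print(f"SMD of the change scores:        {usual_method:.2f}")  # 0.30
```

With these made-up numbers the drug group improves by 2 more raw points than placebo, yet the first method comes out negative simply because the placebo group happens to have the smaller SD.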

When the analysed studies have all used the same measurement scale, as in this case, I think most people would use weighted mean difference rather than standardised mean difference for the final effect size*. Although the study reports that they did perform a weighted mean difference analysis in addition to their odd SMD analysis, their description leaves me a bit unclear exactly how they went about it and whether their reported effect size was derived from this. They report finding:

"Confirming earlier analyses [2], but with a substantially larger number of clinical trials, weighted mean improvement was 9.60 points on the HRSD in the drug groups and 7.80 in the placebo groups, yielding a mean drug–placebo difference of 1.80 on HRSD improvement scores."

When I reanalysed the data I extracted from the paper, this time using the mean difference expressed in raw Hamilton depression scores, I found something curious (see the forest plot reproduced to the right).

I made the overall effect size 2.74 (95% CI 1.90-3.58) which is a fair bit higher than the 1.80 the study authors report, and more importantly I found that the effect sizes for paroxetine and venlafaxine were 3.13 and 3.92 Hamilton depression score points respectively. These are both above that oh so important NICE 'clinical efficacy' threshold.
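For anyone wanting to try this sort of thing themselves, below is a rough Python sketch of a pooled mean difference with a 95% confidence interval. It assumes a fixed-effect, inverse-variance model, which is only one common choice and not necessarily the model behind the figures above; the trial values are hypothetical and the function name is my own invention.

```python
import math


def pooled_mean_difference(trials):
    """Fixed-effect, inverse-variance pooled mean difference with a 95% CI.

    Each trial is a tuple:
    (change_drug, sd_drug, n_drug, change_placebo, sd_placebo, n_placebo).
    """
    weighted_sum = 0.0
    total_weight = 0.0
    for change_d, sd_d, n_d, change_p, sd_p, n_p in trials:
        difference = change_d - change_p            # raw HRSD points
        variance = sd_d**2 / n_d + sd_p**2 / n_p    # variance of that difference
        weight = 1.0 / variance
        weighted_sum += weight * difference
        total_weight += weight
    pooled = weighted_sum / total_weight
    standard_error = math.sqrt(1.0 / total_weight)
    return pooled, pooled - 1.96 * standard_error, pooled + 1.96 * standard_error


# Hypothetical trials, not taken from the paper.
example_trials = [
    (10.0, 8.0, 64, 8.0, 8.0, 64),
    (11.0, 7.0, 49, 7.0, 7.0, 49),
]
pooled, lower, upper = pooled_mean_difference(example_trials)
print(f"pooled difference {pooled:.2f} (95% CI {lower:.2f} to {upper:.2f})")
```

The result stays in Hamilton points throughout, so it can be compared directly against a threshold such as NICE's 3 points.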

So my conclusion is that I don't trust this study, and I certainly don't trust their conclusions. I'm not sure exactly why my analyses are so different from the authors', but I'm fairly sure it has something to do with their over-reliance on the standardised mean difference as a measure of effect size.

* The decision is more complicated than that; the Cochrane handbook says:

"There are two summary statistics used for meta-analysis of continuous data, the mean difference (MD) and the standardised mean difference (SMD) (see 8.2.2 Effect measures for continuous outcomes). Selection of summary statistics for continuous data is principally determined by whether trials all report the outcome using the same scale (when the mean difference can be used) or using different scales (when the standardised mean difference has to be used).

It is important to note the different roles played in the two approaches by the standard deviations of outcomes observed in the two groups.

For the mean difference method the standard deviations are used together with the sample sizes to compute the weight given to each study. Studies with small standard deviations are given relatively higher weight whilst studies with larger standard deviations are given relatively smaller weights. This is appropriate if variation in standard deviations between studies reflects differences in the reliability of outcome measurements, but is probably not appropriate if the differences in standard deviation reflect real differences in the variability of outcomes in the study populations.

For the standardised mean difference approach the standard deviation is used to standardise the mean differences to a single scale (see 8.2.2.2 The standardised mean difference), as well as in the computation of study weights. It is assumed that variation between standard deviations reflects only differences in measurement scales and not differences in the reliability of outcome measures or variability among trial populations.

These limitations of the methods should be borne in mind where unexpected variation of standard deviations across studies is observed."
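The point about standard deviations driving the study weights in the mean difference method can be illustrated with a tiny Python sketch. It assumes the usual inverse-variance weight, and the two trials are hypothetical: same size, differing only in their SDs.

```python
def mean_difference_weight(sd_drug, n_drug, sd_placebo, n_placebo):
    """Inverse-variance weight of one trial's mean difference."""
    return 1.0 / (sd_drug**2 / n_drug + sd_placebo**2 / n_placebo)


# Two hypothetical trials of the same size that differ only in their SDs.
tight = mean_difference_weight(sd_drug=5.0, n_drug=50, sd_placebo=5.0, n_placebo=50)
noisy = mean_difference_weight(sd_drug=10.0, n_drug=50, sd_placebo=10.0, n_placebo=50)
share_tight = tight / (tight + noisy)
print(f"share of weight going to the low-SD trial: {share_tight:.0%}")  # 80%
```

Halving the SD quadruples the weight, which is fine if the smaller SD reflects more reliable measurement but questionable if it just reflects a less variable patient population.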

## 8 comments:

Thanks for this re-analysis. Looking at the study, I too thought it seemed a bit dodgy, and I appreciate your work and honesty.

Great work. Are you able to repeat the analysis with just those in the 'mild', 'moderate' and 'severe' HDRS categories? It would be interesting to see if effect sizes increased further.

Thanks

Sam

I'm not at home at the moment so I don't have access to the data/software, but the current analyses are all on trials with severe depression on the HDRS except for the fluoxetine trial 'ELC 62 (mild)', which won't make much difference to the overall analysis if it is removed (see here) but might make a difference to the fluoxetine-specific analysis.

UPDATE 10/3/8: Redid the above comment with the correct data; unsurprisingly, it didn't massively change:

Sam - forgot about your request - now done. Repeating this analysis. If we use the NICE/APA criteria but shift the labels (as suggested at the bottom of this post) then we have HRSD scores of 'mild' <19, 'moderate' <23, 'severe' >=23.

Looking at drug group baseline HRSD there is only one 'mild' trial (for fluoxetine; 'ELC 62 (mild)') with effect size essentially 0. There is also only one 'moderate' trial (for paroxetine; 'GSK UK 12') with effect size 2.4. The rest of the trials are 'severe', and excluding the two non-severe trials we get effect sizes of 2.9 overall, and 3.3, 2.8, 1.7, 4.0 for parox, fluox, nefaz, and venla respectively.

If we divide up 'severe' into two halves (median 25.6, so we'll say 26, which is the 'clinical significance' threshold above) then overall effect size for HRSD >=23, <26 is 2.5, with 2.2, 3.1, 1.7, 3.8 for parox, fluox (single study), nefaz, and venla. For >=26 the overall effect size is 3.9, with 4.7, 2.6, 4.8 for paroxetine, fluoxetine, venlafaxine (single study) respectively (fluox and venla not stat sig, no studies for nefaz).
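The banding used in this comment can be sketched in a few lines of Python. The cut-offs are the ones stated above (<19, <23, then 'severe' split at 26); the trial labels and baseline scores below are placeholders, not real trials, and the function names are hypothetical.

```python
def severity_band(baseline_hrsd):
    """Label a trial by its mean baseline HRSD using the shifted cut-offs."""
    if baseline_hrsd < 19:
        return "mild"
    if baseline_hrsd < 23:
        return "moderate"
    if baseline_hrsd < 26:
        return "severe (23 to <26)"
    return "severe (>=26)"


def group_by_band(trials):
    """Group (label, baseline_hrsd) pairs into severity bands."""
    groups = {}
    for label, baseline in trials:
        groups.setdefault(severity_band(baseline), []).append(label)
    return groups


# Hypothetical baselines; the labels are placeholders, not real trials.
example = [("trial A", 17.5), ("trial B", 22.0), ("trial C", 24.8), ("trial D", 27.1)]
print(group_by_band(example))
```

Each band's trials would then be pooled separately to get the per-band effect sizes.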

UPDATE 10/3/8: As above, rewrote comment using corrected data (there was a small error in the analysis when I wrote that comment originally so I've corrected the figures):

The finding of an effect size of about 2.5 in the 23-26 range compares with the above regression where that range goes from effect sizes of 1.5-3.0. The 19-23 range goes from around 0-1.5, so the single paroxetine trial at 2.4 is quite good. For the HRSD above 26 the effect size increases from 3 to as high as even 5, so 3.9 seems reasonable.

So these categorical severity analyses are in broad agreement with the regression.

UPDATE 11/3/8: Can't be bothered to use the new SD measures to recalculate these - they'll be much the same, and Robert finds the same sort of thing.
