Many people would argue that it is better to focus on final scores than change scores in RCTs, and certainly this way we are keeping all results in units of Hamilton depression score.

So what does it show? Well, dear reader, it is quite interesting. Overall the effect size is 2.93 (95% CI 1.99-3.87), but for paroxetine and venlafaxine the weighted mean differences in HRSD scores are 3.67 and 3.93 points respectively, both above NICE's criterion of:

"a between group difference of at least 3 points (2 points for treatment-resistant depression) was considered clinically significant for both BDI and HRSD [Hamilton Rating Scale for Depression]"

Make of that what you will.

**UPDATE**

OK, as I point out above, my estimates of SD in this analysis are a bit dodgy, so how can I sense-check the above results? Well, there's a rather obvious way: repeating the analysis below using the raw Hamilton change scores rather than the standardised mean difference (**UPDATE2** a more accurate version of this analysis is presented here).

When you read a study and find that the authors have used an unusual method of analysis, you always have to ask yourself why. Often it is just because they are particularly weird, or particularly clever. But sometimes it is because this method gets them the results they were looking for, and the more obvious 'normal' way of doing things doesn't.

In this study they didn't just use the standardised mean difference to carry out their analysis. They actually calculated the difference between drug and placebo by subtracting the standardised mean difference (SMD) of the improvement in one group from the SMD of the improvement in the other group. I'm not sure that's a very meaningful measure of improvement: they've normalised the improvement within each group, then calculated the difference between these scores, which have been standardised to two different groups. Usually you would look at the SMD of the change in raw scores between the two groups (i.e. you subtract the mean improvement in HRSD in the placebo group from the mean improvement in the drug group and *then* standardise by dividing by the pooled standard deviation derived from the SD of the mean improvement in each group).
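To make the distinction concrete, here is a minimal sketch of the two calculations as I understand them. The function names and the trial numbers are hypothetical, chosen for illustration and not taken from the paper, and the paper's exact procedure may differ in detail.

```python
import math


def pooled_sd(sd_a, n_a, sd_b, n_b):
    """Pooled standard deviation of two independent groups."""
    return math.sqrt(
        ((n_a - 1) * sd_a**2 + (n_b - 1) * sd_b**2) / (n_a + n_b - 2)
    )


def difference_of_standardised_changes(change_drug, sd_drug,
                                       change_placebo, sd_placebo):
    """Standardise each arm's improvement by its own SD, then subtract."""
    return change_drug / sd_drug - change_placebo / sd_placebo


def smd_of_change(change_drug, sd_drug, n_drug,
                  change_placebo, sd_placebo, n_placebo):
    """Subtract the raw improvements first, then divide by the pooled SD."""
    sd = pooled_sd(sd_drug, n_drug, sd_placebo, n_placebo)
    return (change_drug - change_placebo) / sd


# Hypothetical trial (illustrative numbers only, NOT from the paper):
# the drug arm improves by 10 HRSD points (SD 8), placebo by 8 (SD 5).
within_arm = difference_of_standardised_changes(10.0, 8.0, 8.0, 5.0)
between_arm = smd_of_change(10.0, 8.0, 100, 8.0, 5.0, 100)

print(f"difference of standardised changes: {within_arm:.2f}")  # about -0.35
print(f"SMD of the raw change difference:   {between_arm:.2f}")  # about 0.30
```

With these made-up numbers the drug arm improves by 2 raw points more than placebo, yet the first measure comes out negative simply because the two arms have different SDs. That is the sense in which scores standardised to two different groups are hard to compare.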

When the analysed studies have all used the same measurement scale, as in this case, I think most people would use weighted mean difference rather than standardised mean difference for the final effect size*. Although the study reports that they did perform a weighted mean difference analysis in addition to their odd SMD analysis, their description leaves me a bit unclear exactly how they went about it and whether their reported effect size was derived from this. They report finding:

"Confirming earlier analyses [2], but with a substantially larger number of clinical trials, weighted mean improvement was 9.60 points on the HRSD in the drug groups and 7.80 in the placebo groups, yielding a mean drug–placebo difference of 1.80 on HRSD improvement scores."

When I reanalysed the data I extracted from the paper, this time using the mean difference expressed in raw Hamilton depression scores, I found something curious (see the forest plot reproduced to the right).

I made the overall effect size 2.74 (95% CI 1.90-3.58) which is a fair bit higher than the 1.80 the study authors report, and more importantly I found that the effect sizes for paroxetine and venlafaxine were 3.13 and 3.92 Hamilton depression score points respectively. These are both above that oh so important NICE 'clinical efficacy' threshold.
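For readers who want to see what a weighted mean difference calculation looks like, here is a minimal sketch of a fixed-effect, inverse-variance pooled estimate with a 95% confidence interval. It is an assumption that this matches the model behind the figures above (a random-effects model would give somewhat different numbers), and the three studies are hypothetical placeholders, not data from the paper. Only the 3-point threshold comes from the NICE criterion quoted earlier.

```python
import math

NICE_THRESHOLD = 3.0  # HRSD points, from the NICE criterion quoted above


def pooled_mean_difference(studies):
    """Fixed-effect inverse-variance pooled mean difference with a 95% CI.

    Each study is a tuple:
    (mean_difference, sd_drug, n_drug, sd_placebo, n_placebo)
    """
    total_weight = 0.0
    weighted_sum = 0.0
    for md, sd_d, n_d, sd_p, n_p in studies:
        variance = sd_d**2 / n_d + sd_p**2 / n_p  # variance of the mean difference
        weight = 1.0 / variance                   # inverse-variance weight
        total_weight += weight
        weighted_sum += weight * md
    pooled = weighted_sum / total_weight
    se = math.sqrt(1.0 / total_weight)
    return pooled, pooled - 1.96 * se, pooled + 1.96 * se


# Hypothetical studies (illustrative only, NOT extracted from the paper)
example_studies = [
    (3.0, 8.0, 100, 8.0, 100),
    (2.0, 8.0, 50, 8.0, 50),
    (4.0, 8.0, 25, 8.0, 25),
]

pooled, lower, upper = pooled_mean_difference(example_studies)
print(f"pooled MD = {pooled:.2f} (95% CI {lower:.2f} to {upper:.2f})")
print("meets NICE threshold:", pooled >= NICE_THRESHOLD)
```

Because the result stays in raw Hamilton points, it can be compared directly with the NICE threshold without any conversion from standardised units.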

So my conclusion is that I don't trust this study, and I certainly don't trust their conclusions. I'm not sure exactly why my analyses are so different from the authors', but I'm fairly sure it has something to do with their over-reliance on the standardised mean difference as a measure of effect size.

* The decision is more complicated than that. The Cochrane Handbook says:

"There are two summary statistics used for meta-analysis of continuous data, the mean difference (MD) and the standardised mean difference (SMD) (see 8.2.2 Effect measures for continuous outcomes). Selection of summary statistics for continuous data is principally determined by whether trials all report the outcome using the same scale (when the mean difference can be used) or using different scales (when the standardised mean difference has to be used).

It is important to note the different roles played in the two approaches by the standard deviations of outcomes observed in the two groups.

For the mean difference method the standard deviations are used together with the sample sizes to compute the weight given to each study. Studies with small standard deviations are given relatively higher weight whilst studies with larger standard deviations are given relatively smaller weights. This is appropriate if variation in standard deviations between studies reflects differences in the reliability of outcome measurements, but is probably not appropriate if the differences in standard deviation reflect real differences in the variability of outcomes in the study populations.

For the standardised mean difference approach the standard deviation is used to standardise the mean differences to a single scale (see 8.2.2.2 The standardised mean difference), as well as in the computation of study weights. It is assumed that variation between standard deviations reflects only differences in measurement scales and not differences in the reliability of outcome measures or variability among trial populations.

These limitations of the methods should be borne in mind where unexpected variation of standard deviations across studies is observed."
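The point about standard deviations driving the weights in the mean difference method can be illustrated with a small sketch. It assumes the usual inverse-variance weighting, and the two studies are hypothetical: same sample sizes, different SDs.

```python
def md_weight(sd_drug, n_drug, sd_placebo, n_placebo):
    """Inverse-variance weight of a study's mean difference."""
    variance = sd_drug**2 / n_drug + sd_placebo**2 / n_placebo
    return 1.0 / variance


# Two hypothetical studies with identical sample sizes (illustrative only)
weight_small_sd = md_weight(4.0, 50, 4.0, 50)  # SD of 4 HRSD points per arm
weight_large_sd = md_weight(8.0, 50, 8.0, 50)  # SD of 8 HRSD points per arm

# Halving the SD quadruples the weight, with no change in sample size
print(weight_small_sd / weight_large_sd)
```

So a study with unusually small SDs can dominate the pooled estimate even if it is no larger than the others, which is why dodgy SD estimates matter so much here.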