Monday 31 March 2008

Poppy Shakespeare

WTF? Adapted from the novel (which I haven't read), which received nothing but plaudits, and it looks like the film will be treated the same way. But it seemed to me like the sort of work I'd expect from a middle-class re-imagining of 'One Flew Over the Cuckoo's Nest' for today (sane woman driven mad by the 'system'), supplemented with, at most, a cursory familiarity with modern psychiatric services (although apparently the author, Clare Allan, has a fair bit of experience of mental health services).

Frankly it didn't resemble any psychiatric service provision I've ever seen, from secure unit to day services - most strikingly the patients were bizarre one-dimensional cut-outs with no sense of humanity or even mundanity (psychiatric patients are everyday people after all). Sam Wollaston in the Guardian also notices:
"...a fine performance from Anna Maxwell Martin in the lead - the only convincing "dribbler", actually; the others overdo it, and look like actors pretending to be mentally ill."
Many reviewers talk about how this was some unsettling commentary on the modern British mental health system, on mental illness itself, or on 'institutional structures'. But since most of the plot depended on unbelievable and contrived devices (e.g. a pathetically engineered 'Catch-22' where you must 'prove' you are mad to get 'MAD money' to fund an appeal to prove you aren't mad and get yourself discharged), I think this is a pretty credulous reading.

The one negative review by Andy Boyd on Amazon says something similar:

"How this book was ever published is hard,very hard to imagine. Allan seems to have had in her mind another "Cuckoo`s Nest" She is no Ken Kesey.

A little research would have gone a long way.I am a Mental Health Project Worker and a user of services.I find her descriptions of mental health issues at the very least down right insulting with one dimensional characters of no substance whatsoever.

She cannot describe the reality of mental ill health,which,I agree has many moments of humour,empathy and understanding comes with it.

I suggest Ms Allan writes about something she remotely understands.This book is a total turn off she cannot explain the benefits system properly and continues a rant all through the book about "MAD" money - her reference to Disability Living Allowance which is irritating and downright wrong!!

At a time when we are trying to de-stigmatise mental illness and raise awareness of it Ms Allan only serves to describe a mental health system that does not exist - get a reality check!!!"

And another reader responds to him in the way I imagine the author and the book/film's fans would:


"The author, from what I have read, has a great deal of personal experience of the mental health system. I appreciate your stance, but this is a work of fiction."
But this is just an appeal to authority combined with an 'it's just fiction' get-out clause - since the book can't both be a searing indictment of psychiatric services and a completely inaccurate made-up story*, I submit it is the latter, and as such tells us fuck all.

"...Allan has given us something indigestibly, potently true"

My arse.



[* it has obviously occurred to me that there is supposed to be a 'blurring of reality' thing going on - with the most implausible events not 'real' - but that weakens it even more as both a story and a reflection on psychiatry. A much more interesting tale could have been told, for instance, from the perspective of someone diagnosed with a personality disorder and their brushes with psychiatric services, where there are real and interesting questions about both the patient's skewed perceptions of what is going on and the attitudes and actions of mental health services towards these people]

Sunday 30 March 2008

Kirsch et al update

Following on from the reply by Johnson et al, my re-analyses, and Robert Waldmann's work over the last few weeks, there's been a nice discussion of the paper by Nick Barrowman on Log base 2 which refers to both my and Robert's reservations.

Robert has submitted a response to PLoS and develops his argument and analysis in this post. Amongst the other responses to PLoS, biostatistician Jim Young develops a similar line of argument:

"...To see what might be regression to the mean with increasing initial disease severity, one would need to plot raw improvement against initial disease severity.

Most of the authors’ analyses seem to represent each trial as two group means, not as a single difference between groups. It is not possible to estimate the pure effect of treatment and of placebo within a single trial of this sort. Each effect is confounded with other effects – such as regression to mean and spontaneous improvement. If it were possible to measure separate effects of treatment and placebo within each trial, then there would be no need for the placebo group at all. It would be much more efficient to run trials with only a treatment group [1].

The authors’ write “Finally…we calculated the difference between the change for the drug group minus the change for the placebo group, leaving the difference in raw units and deriving its analytic weight from its standard error.” This is the only analysis that makes any sense because it is an analysis of the difference between groups in each trial. I think this corresponds to Models 3a and 3b in Table 2 but, as PJ Leonard notes, Model 3b is best because it’s sensible to drop out the one study where patients had only moderate depression.

The authors first conclude that “Drug–placebo differences in antidepressant efficacy increase as a function of baseline severity”. I have no problem with this – that’s what Figure 4 shows, at least in patients with more than moderate depression. But the authors go on to conclude that “The relationship between initial severity and antidepressant efficacy is attributable to decreased responsiveness to placebo among very severely depressed patients, rather than to increased responsiveness to medication.” This second conclusion requires the assumption that other effects (such as regression to the mean and spontaneous improvement or deterioration) are the same in each trial. Because of measurement error, it’s logical to expect regression to the mean to be greater in trials recruiting patients with more severe disease or greater in trials with longer follow up. Likewise it’s logical to expect spontaneous improvement or deterioration to differ with length of follow up. Even if the authors are happy to make the assumption that these other effects are the same in each trial, I think they should have made this assumption explicit, because I would not want to assume this myself.

Even if willing to make this assumption, why would you base this second conclusion on Figure 3? Why plot the standardised difference (between initial and final measurements in each group); why not just plot the raw difference? In meta-analysis, the only reason for standardising is to convert measurements on different scales into a common metric so that one can compare them. But if measurements are already on the same scale in each trial, why standardise them? It’s more difficult to interpret and requires stronger assumptions (“that variation between standard deviations reflects only differences in measurement scales and not differences in the reliability of outcome measures or variability among trial populations” [2]). Figure 3 may mislead because standard deviations will vary from trial to trial (there’s an order of magnitude difference in sample size between the smallest and largest trials). PJ Leonard says that if you use the raw differences, the “placebo response does not in fact decrease with increasing baseline severity”. If he’s right, then the authors’ second conclusion just looks like wishful thinking."
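Young's regression-to-the-mean point is easy to demonstrate with a quick simulation. Here's a minimal sketch (Python, entirely made-up numbers) of how recruiting on an observed HRSD cutoff produces apparent 'improvement' even when nothing truly changes, and produces more of it in trials recruiting more severe patients:

```python
import numpy as np

rng = np.random.default_rng(0)

def apparent_improvement(entry_cutoff, n=100_000, sd_true=5.0, sd_error=3.0):
    """Mean apparent HRSD 'improvement' when nothing truly changes.

    True severity is stable; baseline and follow-up scores are the true
    severity plus independent measurement error, and patients only enter
    the trial if their observed baseline exceeds the cutoff.
    """
    true = 20 + sd_true * rng.standard_normal(n)
    baseline = true + sd_error * rng.standard_normal(n)
    followup = true + sd_error * rng.standard_normal(n)
    recruited = baseline >= entry_cutoff
    return (baseline - followup)[recruited].mean()

for cutoff in (20, 23, 26, 29):
    print(f"entry cutoff {cutoff}: apparent improvement "
          f"{apparent_improvement(cutoff):.2f} HRSD points")
```

Nothing in the simulation improves; the 'improvement' is all selection on measurement error, and it grows with the entry cutoff - which is exactly why the assumption that these effects are equal across trials matters.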

Sunday 16 March 2008

Godless evangelicals

From John Gray:

"The irony is that, in its fanaticism and intolerance, atheism's militant tendency apes the worst aspects of religious fundamentalism"

Yawn.

Saturday 15 March 2008

Kirsch et al reply

Blair Johnson and the other authors of the Kirsch et al paper in PLoS Medicine reply to the responses to their paper here. Some parts relevant to the discussions here:

"...as we reported in our results, this difference was more apparent than real, disappearing when we controlled for baseline severity. It is worth noting that Turner et al. (2008) found between-group effect size (d) estimates of 0.40 for venlafaxine and 0.26 for nefazodone, both of which are close to the mean of 0.40 for all 12 newer antidepressants and are identical to those for fluoxetine (0.26) and paroxetine (0.42)."

"Leonard took the trouble of re-analyzing the data from our Table 1 and concluded that a clinically significant difference emerged at a lower point of severity than we concluded in our article (i.e., 26 vs. 28). We are grateful that his work confirms our major conclusion, which is that the efficacy of anti-depressants depends on the initial severity of depression. Unfortunately, however, his estimates of the standard deviation underlying each effect size relied on between-subjects’ rather than within-subjects’ formulations. In examining improvement in response to drug or placebo, individual trials conventionally control for the correlation between the HRSD scores at baseline. We adopted this convention in our analyses of drug and placebo improvement. Reassuringly, the analyses at the end of our Results section pertaining to each trial’s drug vs. placebo comparison also used a between-subjects variance formulation and confirmed that clinical significance emerges in the vicinity of an HRSD score of 28."

"We found a nonsignificant benefit of drug compared to placebo for moderately depressed patients. Yet, consistent with our other conclusions, the difference between drug and placebo grows at higher levels of depression. Davies commented on the fact that there were few samples with scores below the category of very severe depression on the Hamilton Rating Scale of Depression (HRSD), a limitation that our Discussion mentioned. "

I note that they don't engage with the finding by 'Leonard' (that's me that is) that there is no real decrease in placebo response with increasing severity, nor do they address my concerns that their use of the measure 'd' (mean change divided by SD of the change) biases the effect size (expressed in HRSD change scores), nor that looking at raw HRSD changes suggests that paroxetine and venlafaxine exceed the NICE 'clinical significance' criteria. I'm not quite sure what they mean by referring to within-subjects versus between-subjects variance (since I've changed the analysis based on Robert Waldmann's findings I don't know which analysis they looked at). They could be referring to normalising to the change score SD, which makes little difference compared to my previous analyses, or to analysing the drug and placebo groups separately, which is just plain statistically wrong (and seems to be what they did - note that my analysis of separate regression lines produces the same results as looking at the between-subjects regression). They refer to the analyses at the end of their Results section as confirming their 'within-subjects' results - I wonder if they mean their Figure 4 (repeated here). You might want to compare that to my regression (and their Figure 2) and decide for yourself whether it confirms that the threshold for 'clinical significance' of 3 HRSD points difference lies at a baseline HRSD of 28 points, as they claim, or 26, as I find.
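For what it's worth, one plausible reading of the 'within-subjects' versus 'between-subjects' distinction is the standard relationship between the SD of a change score and the SDs at the two time points. A quick sketch (Python, hypothetical numbers, nothing from the actual trials) of how allowing for the baseline-endpoint correlation shrinks the change-score SD:

```python
import numpy as np

# Hypothetical summary statistics for one trial arm (not real data).
sd_pre, sd_post = 4.0, 6.0   # SDs of baseline and endpoint HRSD scores
r = 0.5                      # correlation between baseline and endpoint scores

# 'Within-subjects' SD of the change score allows for that correlation:
sd_change_within = np.sqrt(sd_pre**2 + sd_post**2 - 2 * r * sd_pre * sd_post)

# Treating the two time points as independent ('between-subjects') ignores it:
sd_change_between = np.sqrt(sd_pre**2 + sd_post**2)

print(sd_change_within, sd_change_between)  # within-subjects SD is smaller whenever r > 0
```

If that is the distinction they have in mind, it bears on how the SD of each change score is estimated, not on whether drug and placebo arms are compared within or across trials.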

They also don't really seem sufficiently contrite over their claim that in 'moderate' depression antidepressants should be avoided, given that it was based on a single study plus extrapolating a regression line.
The only robust finding is that the difference between placebo and antidepressant response seems to increase with baseline HRSD severity. Although Kirsch et al emphasise that the level at which this difference becomes 'clinically significant' lies in severe depression, it is worth noting that the level at which it is significant (around 26 according to both my and their analysis of raw HRSD figures) is pretty much the middle of the pack in terms of the baseline severity of the studies (which were almost all in the 'very severe' range of over 23 HRSD points - see that figure). [Their finding that the differences between the drugs may be largely explained by the differing baselines of the studies is not unreasonable.]

UPDATE
'PJ Leonard' has submitted a response, titled 'Analytical differences', pretty much repeating what I said above:
"It is good of Johnson et al to reply to the responses here. However, I do not think they have sufficiently dealt with some of the reservations concerning their paper.

In particular, I do not think that they have engaged with my finding that using the raw HRSD change scores reveals that the placebo response does not in fact decrease with increasing baseline severity on the HRSD.

I am not clear exactly what they mean when they say that I have used between-subjects analyses to suggest that the effect size (when analysing the raw HRSD change scores) is larger than presented in their paper, whereas they have used within-subjects analyses.

My analyses utilise conventional methods for meta-analysis where the effect size in each study is analysed directly, whereas it seems likely that the low estimated effect size in HRSD units in this study is the result of carrying out the meta-analytic weighting on the drug and placebo groups separately (a 'within subjects' analysis?), and then comparing the effect sizes thus obtained (which would explain the lack of forest plots in the paper).

This is not an acceptable analytic technique because it ignores that there is a relationship between the improvement in placebo and drug groups from the same study, but that the placebo and drug groups from any given study can have grossly different weightings when considered separately (e.g. there would be half as much weighting to the results from the fluoxetine trials in the drug analysis as the placebo analysis, the result of, for example, different sample sizes between the experimental arms).

Normalising the HRSD change to the change standard deviation in each group separately is also unacceptable because a larger change in HRSD score in the drug group could be associated with a greater variance, although this does not appear to be the case in this study.

Robert Waldmann estimates that there is more bias in analytical method in this paper than publication bias present in the data itself:

http://rjwaldmann.blogspot.com/2008/03/just-cant-let-it-go.html

I note that Figure 4 in the paper of Kirsch et al is actually more consistent with my finding of 'clinical significance' at a baseline of 26 (this threshold is found both by regression on the difference scores, or separate regressions for each group's change score) than their suggestion of 28 points, this difference is undoubtedly because this figure looks at raw HRSD scores, as did my analyses, and because the NICE 'clinical significance' threshold of d > .5 is actually stricter than the NICE threshold of an HRSD difference > 3.

I concur that there is a relationship between baseline HRSD severity and effect size but it is worth noting that almost all studies examined had baselines over 23 points (and were thus in APA/NICE categories of 'very severe' depression) so the threshold of 26 points is a fairly average baseline severity for the studies analysed in this paper (as can be seen from my regression plots or their Figure 4). Any generalisation to less severe categories of depression is unwarranted given that it would depend on extrapolating the regression line to a region with only a single study."

Tuesday 11 March 2008

Statistics and depression

Robert Waldmann has some statistical thoughts on the Kirsch et al meta-analysis of anti-depressants:

Just can't let it go. SSRI Meta-analysis meta-addiction

Caveat lector

Caveat Lector II

The Simplest Meta-Analysis Problem

That Hideous Strength

Prozac Fan Talks Back

Personal chat with pj

In particular, in response to my confusion about where they got an effect size of 1.8 from:

"Actually I think I understand how Kirsch et al got their results. I get a weighted average difference of change of 1.

notation: changeij is the average change in HRSD of patients in study i who got the SSRI (j=1) or the placebo (j=0). Nij is the sample size of patients in trial i who get j pills.

In one calculation I used changeij/dij as the standard deviation of the change and thus (changeij/dij)^2/Nij as the estimated variance of the average change. Then *separately* for drug and placebo data, I calculated the precision weighted average over i of changeij. This gave me an average change of 7.809 for the placebo and 9.592 for the SSRI treated for a difference of 1.78.

I guess this is what they did. I the confidence intervals are screwy and d is as described in the paper."

He also finds that it looks like Kirsch et al did indeed divide the change score by the standard deviation of the change score to obtain their 'd' measure - the wide confidence intervals that I thought argued against this seem to be a particular idiosyncrasy of their study. Fortunately this makes very little difference to my previous analyses (I've updated the effect sizes in this study, but the only real impact is on calculating proper SMD effect measures where these are larger, because the estimated SD is smaller).

I'll repeat one of my comments here:

"Hmm, that would be very annoying if someone had based their analyses on the confidence intervals being, you know, normal confidence intervals.

Looking back at the data it seems you're right that it is by essentially carrying out the meta-analysis on two entirely separate populations, the drug changes, and the placebo changes, and then subtracting one from the other, that they get their very low estimate of HRSD change.

That is a very odd way of doing things indeed, it is basically assuming that each study is really two separate and entirely unrelated studies, one on how people improve with drugs, and one on how they improve with placebo, so the way to analyse them is to ignore the study design and just try and estimate the pooled effect size for each group (drug and placebo) as if they were unrelated. It partly has an effect because the SDs depend on response, and because sample sizes are skewed towards drug groups in some studies (so the placebo group is much smaller than the drug group).

Taking your SD = change/d approach, and just plugging it into a meta-analysis program (SE weighting, fixed effects) giving an overall effect size of 1.9, it is interesting to note that the fluoxetine trials contribute half as much to the drug analysis (in terms of weighting) compared to the placebo analysis!

But as before, it is also interesting to see that segregating by drug gives effect sizes from 3.6 to .6 (or, given the silly form of this analysis, comparing individual drug groups to pooled placebo subjects 3.8 to -.2)."

I'll expand on that last bit. Basically, if Robert is right - and it is the best explanation I've found (looking back at the paper there are tantalising suggestions that it is correct, because they report model statistics separately) - then they have assumed that there are two entirely separate populations, the drug group and the placebo group, and that each trial is simply an attempt to estimate the size of the improvement in HRSD score within each group, ignoring any information about which placebo group went with which drug group in any particular trial (an approach that chimes with their regression analysis, which looks at each group separately).

When I attempted to replicate this sort of analysis, as mentioned above, I find that the effect sizes are 9.6 and 7.7 (a difference of 1.9), with the drug groups for paroxetine, fluoxetine, nefazadone, and venlafaxine at 9.6, 7.5, 10.6, and 11.5 respectively, making differences (to the overall placebo) of 2.0, -.2, 3.0, and 3.8; compared to their relevant placebo groups these differences are 3.0, .6, 1.8, and 3.6, giving you an idea of how the placebo groups vary by the drug study they are in.
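To make the contrast concrete, here's a minimal sketch (Python, with made-up numbers standing in for the Table 1 summaries - not the real data) of the two approaches, both using Robert's SD = change/d reconstruction for the weights: pooling the drug and placebo arms as if they were two separate populations, versus pooling the within-trial differences:

```python
import numpy as np

# Made-up per-trial summaries (not the actual Kirsch et al Table 1 data):
# mean HRSD improvement, reported d (= change / SD of change) and sample
# size for the drug arm and the placebo arm of each trial.
change_drug    = np.array([10.0,  9.0, 11.5])
d_drug         = np.array([ 1.2,  1.1,  1.3])
n_drug         = np.array([ 120,   80,  200])
change_placebo = np.array([ 8.0,  7.5,  8.5])
d_placebo      = np.array([ 1.0,  0.9,  1.1])
n_placebo      = np.array([  60,   80,  100])

# SD of the change recovered as change / d, so each arm's mean change has
# estimated variance (change/d)^2 / n.
var_drug    = (change_drug / d_drug) ** 2 / n_drug
var_placebo = (change_placebo / d_placebo) ** 2 / n_placebo

def pooled(x, var):
    """Inverse-variance (precision) weighted mean."""
    w = 1.0 / var
    return (w * x).sum() / w.sum()

# (1) 'Two separate populations': pool each arm on its own, then subtract.
# The same trial can carry very different weight in the two pools when the
# arm sizes or SDs differ, which is the objection above.
separate = pooled(change_drug, var_drug) - pooled(change_placebo, var_placebo)

# (2) Conventional approach: form the drug-placebo difference within each
# trial, then pool those differences (a weighted mean difference).
within_trial = pooled(change_drug - change_placebo, var_drug + var_placebo)

print(f"separately pooled arms:          {separate:.2f} HRSD points")
print(f"pooled within-trial differences: {within_trial:.2f} HRSD points")
```

With weightings as unbalanced as in the fluoxetine example above, the two answers can diverge noticeably.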

Robert also finds a particularly telling aspect of the study:
"In my view in passing from the publication biased 3.23 to the final 1.78 only 0.6 of the change is due to removing the publication bias and 0.85 is due to inefficient and biased meta analysis.if the subsample of studies with references (I guess published studies) is analyzed with the method of Kirsch et al the weighted average improvement with SSRI is 9.63 and the weighted average improvement with placebo is 7.37 so the added improvement with SSRI is 2.26.If I have correctly inferred which studies were publicly available before Kirsch et al's FOIA request, I conclude that they would have argued that the effect of SSRI's is not clinically significant based on meta analysis of only published studies."
UPDATE
As mentioned in the comments, here's a pretty graph showing the effect size adjusted by regression on the baseline HRSD scores (to a baseline severity of 26 points), giving an overall effect of 3.0 (the grey line, as we'd expect from the regression lines which reach 'clinical significance' at baseline = 26). We get 2.4, 2.9, 3.2, and 3.7 for the effect sizes for nefazadone, fluoxetine, paroxetine, and venlafaxine respectively, although the differences between drugs don't seem to be statistically significant (the closest is nefazadone versus venlafaxine).

CBT

As I mentioned below, I'm amused that the apparent limitations of anti-depressants have led to an enthusiastic embracing of cognitive behavioural therapy.

I thought I'd share NICE's view of the evidence for CBT:

"In the only comparison available from a single trial there was insufficient evidence to determine the efficacy of individual CBT for depression compared with either pill placebo (plus clinical management) or other psychotherapies. However, stronger data do exist when CBT is compared with antidepressants (a number of which include clinical management); here individual CBT is as effective as antidepressants in reducing depression symptoms by the end of treatment. These effects are maintained a year after treatment in those treated with CBT whereas this may not be the case in those treated with antidepressants. CBT appears to be better tolerated than antidepressants, particularly in patients with severe to very severe depression. There is a trend suggesting that CBT is more effective than antidepressants on achieving remission in moderate depression, but not for severe depression. There was also evidence of greater maintenance of a benefit of treatment for CBT compared with antidepressants.
We recognise that this is a different finding to that of Elkin et al. (1989).
Adding CBT to antidepressants is more effective than treatment with antidepressants alone, particularly in those with severe symptoms. (This is the subject of a cost-effectiveness analysis in Chapter 9.) There is no evidence that adding an antidepressant to CBT is generally helpful, although we have not explored effects on specific symptoms (e.g. sleep). There is insufficient evidence to assess the effect of CBT plus antidepressants on relapse rates.
There is evidence from one large trial (Keller et al., 2000) for chronic depression that a combination of CBT and antidepressants is more beneficial in terms of remission than either CBT or antidepressants alone. In residual depression the addition of CBT may also improve outcomes.
It appears to be worthwhile adding CBT to antidepressants compared with antidepressants alone for patients with residual depression as this reduces relapse rates at follow-up, although the advantage is not apparent post-treatment.
In regard to modes of delivery there is evidence that group CBT is more effective than other group therapies, but little data on how group CBT fares in comparison with individual CBT. Much may depend on patient preferences for different modes of therapy.
However, group mindfulness-based CBT appears to be effective in maintaining response in people who have recovered from depression, particularly in those who have had more than two previous episodes."

What is worth noticing here is that CBT advocates ought to be a little careful in crowing about the alleged failings of anti-depressants (that their effect compared with placebo is not big enough to be 'clinically significant'), since CBT has not been shown to be better than pill placebo at all, and its effect size is comparable to that of anti-depressants:

"There is evidence suggesting that there is no clinically significant difference
between CBT and antidepressants on:
● reducing depression symptoms by the end of treatment as measured by the BDI (N = 86; n = 480; SMD = –0.06; 95% CI, –0.24 to 1.12) or HRSD (N = 107; n = 1096; SMD = 0.01; 95% CI, –0.11 to 0.13)
● increasing the likelihood of achieving remission as measured by the HRSD (N = 5; n = 839; RR = 1; 95% CI, 0.91 to 1.10).

A sub-analysis by severity did not indicate any particular advantage for antidepressants over CBT based on severity of depression at baseline.

When analysed by severity, there is evidence suggesting that there is no clinically significant difference between CBT and antidepressants on reducing depression symptoms by the end of treatment"

Monday 3 March 2008

Regression in depression

I haven't got time to do a proper meta-regression analysis, so unlike the last post, this is going to be another back-of-the-envelope analysis****. Given that we've found that using the raw HRSD change scores from Kirsch et al reveals rather larger effect sizes than the authors suggest, what would the regression look like?

First of all I don't like the Kirsch et al method of carrying out a regression on the drug and placebo groups separately - we are looking at how the difference between antidepressant and placebo groups varies with initial disease severity, and that is the analysis to carry out.

So taking my SD and mean effect from the last study, using the mean baseline HRSD score (in the drug group) from the paper, and weighting a linear regression by inverse variance*, what do we find?
Well, let's look at the graph on the top right. Here we see that, unsurprisingly, we replicate the Kirsch et al finding that effect size increases with baseline severity. We also see that those drugs which had statistically and 'clinically' significant effect sizes in the previous analysis did have greater disease severity - supporting the Kirsch et al contention that differences in baseline disease severity may underlie some of the differences in the efficacy of the different drugs (note that this is essentially the same as Kirsch et al's Figure 4, right).

But also compare this figure with one from the PLoS Medicine paper (Figure 2, right). Note that Kirsch et al have marked in green the region where their regression suggests an effect size over the NICE criterion (Cohen's d > .5). On my graph I have also marked (grey vertical line) where the regression effect size exceeds the NICE criterion (mean HRSD difference > 3, marked as the grey horizontal line). I find that this occurs around a baseline of 26 HRSD points**, versus the 28 points found by Kirsch et al. So there is not a large difference between my finding and theirs, and it is likely due to their use of standardised effect sizes and the associated NICE criterion versus my use of raw HRSD scores and the associated NICE criterion; but my figure is rather less striking, because a baseline of 26 points is actually more representative of the studies analysed, falling near the mean baseline score.
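This threshold calculation is straightforward to reproduce. Here's a minimal sketch (Python; placeholder numbers rather than the actual study summaries) of an inverse-variance weighted fit of the raw HRSD difference against baseline severity, reading off where the fitted line crosses the NICE 3-point criterion:

```python
import numpy as np

# Placeholder per-trial summaries (illustrative only, not the real data):
# mean baseline HRSD in the drug group, drug-placebo difference in mean
# HRSD change, and the standard error of that difference.
baseline = np.array([23.0, 24.5, 25.0, 26.5, 28.0, 29.0])
diff     = np.array([ 1.0,  2.0,  2.5,  3.2,  4.0,  4.5])
se_diff  = np.array([ 0.8,  0.7,  0.9,  0.6,  0.7,  0.8])

# Inverse-variance weighted least squares: np.polyfit expects weights of
# 1/sigma (not 1/sigma**2) for Gaussian errors.
slope, intercept = np.polyfit(baseline, diff, deg=1, w=1.0 / se_diff)

# Baseline severity at which the fitted difference reaches the NICE
# 'clinical significance' criterion of 3 HRSD points.
threshold = (3.0 - intercept) / slope
print(f"fitted line: diff = {intercept:.2f} + {slope:.2f} * baseline")
print(f"3-point criterion crossed at baseline HRSD ~ {threshold:.1f}")
```

For the sample-size weighting mentioned in the first footnote below, you would pass sqrt(n) as the weights instead, since polyfit squares them.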

If I repeat my analysis so that it is performed the same way as Kirsch et al's, modelling an inverse variance* weighted regression line on the mean change in HRSD for each group separately, we can see (bottom right) that the point at which the regression lines are 3 HRSD points apart (grey bar) is similar to my above analysis - so that result is replicated when we consider the change scores and baseline HRSD scores for drug and placebo groups separately***.

An interesting further point to note is that the placebo response does not seem to decrease with increasing baseline HRSD when we use raw HRSD change as our effect size, and that drug response does in fact rise with increasing severity. Again this suggests that Kirsch et al's findings were due to using standardised mean differences as a measure of effect size rather than the raw figures. It also puts the lie to the widely reported claim that increasing effectiveness with severity is due to decreasing placebo response rather than increasing drug response.

* Weighting by sample size makes little difference.

** This is still in the 'very severe' category of depression, although the labels used by NICE do not necessarily reflect clinical practice, as Moncrieff & Kirsch note that in
"...the NICE meta-analysis [of] “moderate” (Hamilton score 14-18) through “severe” (19-22) to “very severe” depression ( ≥ 23)...the middle group...would generally be referred to as moderately depressed..."
So the 'very severe' group may be better thought of as simply 'severe' in normal practice, explaining why most studies were in this range of severity.

*** The regression analyses here are in broad agreement with the meta-analysis split by severity category.

**** UPDATE 11/3/8
Thanks to Robert Waldmann I've had to re-do the regression weights, as the SD derived directly from the 'd' and change scores is lower than the SD I had estimated from the confidence intervals, but, as we would expect, it makes essentially no difference. As Robert points out in the comments, excluding the fluoxetine 'mild' depression study makes no difference to the regression, which is interesting as it does look from the plot as if it is driving the regression.

Sunday 2 March 2008

Final analysis

Summary
Carrying out a meta-analysis using raw Hamilton Rating Scale for Depression (HRSD) change scores derived from the Kirsch et al 2008 PLoS Medicine paper, I found that the effect size was larger than that found by Kirsch et al, and for paroxetine and venlafaxine this effect size exceeded the NICE criterion for 'clinical significance' (a difference of 3 points on the HRSD score). This suggests that the findings of Kirsch et al both depend on their particular method of analysis and cannot be generalised to all the antidepressants included in their analysis.

Results
The weighted mean difference (WMD) with a random effects model (see right) shows an overall effect size of 2.7 (95% CI 2.0-3.4), with paroxetine, fluoxetine, nefazadone, and venlafaxine at 3.4, 2.1 (non-significant), 1.7, and 3.5 respectively, so paroxetine and venlafaxine both exceeded the NICE criterion for 'clinical significance' of 3 HRSD points.

Analysing the standardised mean difference gave an overall effect size of .32 (95% CI .24-.41), with paroxetine, fluoxetine, nefazadone, and venlafaxine having effect sizes of .41, .24, .21, and .40 respectively.

There was minimal difference between using fixed versus random effects, and excluding the mild study gives an overall effect size of 2.81, with fluoxetine statistically significant at 2.85.

Methods*
The PLoS Medicine paper gives the data for individual studies in Table 1. It reports that the measure of effect size 'd' is the change score divided by the standard deviation (SD) of the change score, so the SD can be derived from the change score and 'd'.
The raw HRSD change scores and SDs were entered, along with the sample sizes from Table 1, into the Cochrane Collaboration RevMan Analyses (v 1.0.5) software to perform a weighted mean difference (WMD) meta-analysis with random effects.

Subgroups were defined to analyse paroxetine, fluoxetine, venlafaxine, and nefazadone separately. Sensitivity analyses were performed by omitting the outlying fluoxetine study of subjects with mild depression ('ELC 62 (mild)').

For completeness a fixed effect analysis was also carried out, as well as an analysis looking at the standardised mean difference (this is the difference in change scores normalised to the standard deviation of the change scores) using Hedges' adjusted g (similar to Cohen's d but including an adjustment for small sample bias), although this is not in fact appropriate for studies which have used the same outcome measure (HRSD scores in this case).
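For readers without RevMan, the same calculation is easy to sketch by hand. Below is a minimal Python version (illustrative numbers only, not the Table 1 data) which derives each SD as change/d, forms the per-trial weighted mean difference, and pools with inverse-variance weights, including a DerSimonian-Laird random-effects step (which, as I understand it, is what RevMan uses for its random effects model):

```python
import numpy as np

# Illustrative per-trial summaries (not the actual Table 1 values):
change_drug    = np.array([10.0,  9.0, 11.5, 12.0])
d_drug         = np.array([ 1.2,  1.1,  1.3,  1.4])
n_drug         = np.array([ 120,   80,  200,   90])
change_placebo = np.array([ 8.0,  7.5,  8.5,  8.0])
d_placebo      = np.array([ 1.0,  0.9,  1.1,  1.0])
n_placebo      = np.array([  60,   80,  100,   90])

# SD of the change score recovered from d = change / SD(change).
sd_drug    = change_drug / d_drug
sd_placebo = change_placebo / d_placebo

# Per-trial weighted mean difference and its variance.
wmd = change_drug - change_placebo
var = sd_drug**2 / n_drug + sd_placebo**2 / n_placebo

# Fixed-effect (inverse-variance) pooled estimate.
w = 1.0 / var
fixed = (w * wmd).sum() / w.sum()

# DerSimonian-Laird estimate of between-trial variance (tau^2).
q = (w * (wmd - fixed) ** 2).sum()
c = w.sum() - (w**2).sum() / w.sum()
tau2 = max(0.0, (q - (len(wmd) - 1)) / c)

# Random-effects pooled estimate and 95% confidence interval.
w_re = 1.0 / (var + tau2)
pooled = (w_re * wmd).sum() / w_re.sum()
se = np.sqrt(1.0 / w_re.sum())
print(f"fixed effect WMD:   {fixed:.2f}")
print(f"random effects WMD: {pooled:.2f} (95% CI {pooled - 1.96*se:.2f} to {pooled + 1.96*se:.2f})")
```

Subgroup estimates (per drug) come from running the same pooling over each drug's subset of trials.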

Links
This study is an updated version of this analysis and this analysis, but deriving the SD for the SMD more accurately. I also discuss the Kirsch et al paper here and here.

* UPDATE 11/3/8
Following on from Robert Waldmann's findings, despite my protestations to the contrary, it looks like the confidence intervals of 'd' in Kirsch et al are a poor guide to the standard deviation of the change score, and the effect size 'd' may actually be the HRSD change score/SD change score, so the above analysis was corrected to be based on this new SD measure. Unsurprisingly it makes little difference.