Friday, 7 September 2007

Lie Detectors

Looks like the government is getting into more business-driven pseudoscience (cf. personality tests like the Myers-Briggs) by backing "Voice Risk Analysis" to detect benefit cheats.

So, are lie detectors, and particularly voice stress recognition systems, any good? Standard polygraphs (as administered by trained and experienced personnel) detect physiological signs associated with lying, although these can be absent in the truly psychopathic, you can learn to fool them, and anxiety can produce them in the innocent. Most studies of standard polygraphs are carried out on offenders, and they tend to find fairly high detection rates. Reported figures are typically of the order of a sensitivity (proportion of liars correctly detected) of 76% and a specificity (proportion of truth-tellers correctly identified) of 63% ('average' values), with 87% and 88% representing the upper range of estimates ('maximal' values), which doesn't sound too bad. But the utility of the polygraph (or of any test, in fact) depends very much on how likely it is that the suspect is guilty, i.e. the prevalence of liars in the tested population. If few people are guilty then, even though only a small proportion of truth-tellers are falsely declared liars, the large number of truth-tellers tested compared to the small number of liars means that most people reported as liars will actually be truth-tellers. Conversely, if most of the people you are testing are guilty (and thus liars) then, even though a lot of guilty people will be detected, a lot of those declared innocent will actually be lying.

To make that rather convoluted explanation a bit more concrete, I refer you to a rather famous paper by Brett et al (1986, Lancet) which used the figures above (the 'average' and 'maximal' values). They showed that when the prevalence of offenders in the population is assumed to be 5% (i.e. not many, as with benefit cheats) the positive predictive value is 10%: only 1 in 10 of those testing positive is actually lying, with the rest falsely accused (that is with the 'average' values; using the 'maximal' values they find 25% true positives).
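These predictive values are just Bayes' theorem applied to the test's error rates. Here is a minimal sketch (my own illustration, not code from any of the papers) that reproduces Brett et al's 'average' figures:

```python
def predictive_values(sensitivity, specificity, prevalence):
    """Positive and negative predictive values of a test, given its
    sensitivity, specificity, and the prevalence of liars among those tested."""
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    true_neg = specificity * (1 - prevalence)
    false_neg = (1 - sensitivity) * prevalence
    ppv = true_pos / (true_pos + false_pos)
    npv = true_neg / (true_neg + false_neg)
    return ppv, npv

# Brett et al's 'average' polygraph figures, 5% prevalence of liars:
ppv, _ = predictive_values(0.76, 0.63, 0.05)
print(f"PPV = {ppv:.0%}")  # PPV = 10%
```

Plugging in the 'maximal' values, or a different prevalence, works the same way.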

For a pre-test probability of 50% (e.g. criminal investigations, hopefully, maybe) the positive predictive value is 67% (88% with the 'maximal' values), a gain in certainty after the test of only 17 percentage points, with 33% of positive results still being false positives. If most people tested are liars (90%) then the negative predictive value is only 23%, with 77% of negative test results generated by lying subjects. It is often said that if you are innocent you probably don't want to take the risk of being falsely labelled a liar, and thus a suspect, while it may be worth taking the chance if you're guilty anyway, as it could throw them off the scent!

So we know that polygraphs aren't going to be that great at detecting liars in the population, even though they do work to some extent. A study last year (Gamer et al 2006, Int J Psychophysiol) compared polygraph measures (heart rate etc.)* with voice stress recognition. It used the Guilty Knowledge Test (GKT):

"If, for example, a robbery of a fuel station is examined, a typical GKT-question could be: “Which car was used for the robbery of the fuel station last night?” If in fact a red BMW was used, proper items for this question could be “(a) a green Ford?”, “(b) a blue Mercedes?”, “(c) a red BMW?”, “(d) a yellow Chrysler?”, “(e) a black Pontiac?”. According to the assumptions of the GKT, only the culprit should be able to differentiate relevant and irrelevant items correctly and thus show more pronounced physiological responses to the relevant item."
They used the TrusterPro program (made by the Israeli company Nemesysco, and, I believe, the core of the Capita program used in the UK) and found a sensitivity of 30% and a specificity of 83%, which was not significantly above chance. More detailed analysis of specific raw factors did not reveal any further discriminative ability. If you analyse the figures in the same way as Brett et al did with the polygraph you find similar results: assuming a 5% prevalence of liars, the positive predictive value is only 8%. In other words, you aren't doing much better with this voice recognition system than by just randomly selecting people and deciding they're benefit cheats; over 90% of those you designate as liars aren't, while 70% of cheats will still get away with it. And this of course assumes optimal, scientific-study levels of operator training and question format (it is, I would imagine, unlikely that they will use the GKT structure).
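To check that 8% figure, here is my own back-of-envelope calculation, applying Bayes' theorem to the Gamer et al figures in the same way Brett et al did for the polygraph:

```python
# Gamer et al's voice analysis figures: sensitivity 30%, specificity 83%,
# with an assumed 5% prevalence of benefit cheats among those tested
sens, spec, prev = 0.30, 0.83, 0.05
ppv = (sens * prev) / (sens * prev + (1 - spec) * (1 - prev))
print(round(ppv, 3))  # 0.085, i.e. roughly 8%
```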

The low sensitivity is a real problem because the DWP themselves say:
"If the pilot is successful we will consider the case for changes to verification procedures for cases adjudged to be low risk, potentially reducing the need to issue and process forms and undertake unnecessary and expensive visits."
If they are only using this software on people they already suspect are dodgy it is truly useless. Say we have a 50% chance that a person is a benefit cheat (based on their funny-looking application): checking them with this software gives a negative predictive value (the proportion of negative results that are true negatives) of 54%, barely better than chance.
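That 54% falls out of the same predictive-value arithmetic (again my own check, not a figure from the papers):

```python
# Voice analysis figures again: sensitivity 30%, specificity 83%,
# but now suppose half of those tested really are cheats
sens, spec, prev = 0.30, 0.83, 0.50
npv = (spec * (1 - prev)) / (spec * (1 - prev) + (1 - sens) * prev)
print(round(npv, 2))  # 0.54: a 'low risk' result barely beats a coin toss
```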

*This study used a logistic regression analysis to calculate what is essentially a theoretical upper bound on the information that can be provided by polygraph-type measures in this study; sensitivity was 93% and specificity 97%. It is an upper bound because it inevitably overfits the data from this study and we don't know whether it would generalise to another sample.

The Deception Blog has more on this sort of thing.

There is a discussion on this topic at the Badscience forums.


Pedantica said...

Hi there,

You stated "over 90% of those you designate as liars aren't, while 70% of cheats will still get away with it"

This is not necessarily a problem provided you treat the resulting data in the correct way. Your resulting two groups should not be labelled "liars" and "truthtellers", they should be labelled "higher risk" and "lower risk".

Stratifying your population into two groups one of which has a higher proportion of "liars" than the other is actually very useful for audit purposes. It allows you to concentrate proportionally more of your subsequent effort on the "higher risk" group than the "lower risk" group. If you do that you will get a higher rate of hits than if you investigate the unstratified population.

Of course then you have to weigh up the time and effort of stratifying your population against the benefits. But, in principle, it could be useful.

pj said...


I talk about liars and truth-tellers to get the point across about what the software is doing.

But yes, the article says that they will single these people out for additional scrutiny and fast-track the rest through.

I also point out that the gain from stratification is only 3 percentage points: this software identifies a high-risk group of whom only 8% are actually liars, whereas completely random allocation would yield 5% liars anyway.

And this is assuming optimal conditions (also note that the original study did not find that the software performed better than chance - I have simply assumed that their figures are reliable indicators of the efficacy for illustrative purposes).

In theory, given the very low sensitivity, you could carry out a cost-benefit analysis to see whether the small increase in hit-rate you might expect was cost-effective compared with the high costs of purchase, servicing, and training. But that assumes that the DWP is approaching it fully cognisant of the limitations. That is not the impression I get (you would not fast-track 'low-risk' applications through knowing that you had a sensitivity of 30%!)

Actually, I think in reality, given the level of skill of phone operators, real mistakes will be made through over-confidence in the system (which is advertised as essentially 100% accurate), meaning that most fraudsters will be fast-tracked through with even less scrutiny than before:

"If the caller is deemed by the operator to be low risk, using the test results to support their own judgement, they are fast-tracked and avoid more rigorous vetting.
Those deemed to be at higher risk of lying must supply further evidence to support their claim."

And those labelled as high-risk will be treated appallingly by operators who have been led to believe that their magic gizmo has shown them to be liars.

Pedantica said...

[quote="RS"]I also point out that the gain from stratification is only 3% from 5%, that is this software identifies high-risk people of whom only 8% are actually liars, whereas completely randomly allocating them would yield 5% liars anyway. [/quote]

You'll note I didn't say "high risk" I said "higher risk".

So if I throw my expensive team of inspectors entirely at the 'higher risk' group they will find 60% more benefit cheats than they were finding before. That's quite a big improvement, isn't it?

I accept you would not want to go that far because some people might be able to systematically trick their way into the 'lower risk' group. So what you actually do is simply favour the 'higher risk' group in your sampling.

But any stratification is useful in itself. The issue then becomes one of cost benefit.

Issues of operator bias could easily be removed by not showing the results to the telephone operators at all, instead leaving that data to the team that selects individuals for closer inspection.

I'm not stating that the system could not be misused. Just that, used correctly, it would be useful. The question then becomes whether it is cost effective.

pj said...

There is a discussion on this topic at the Badscience forums where pedantica and I (screen name 'RS') continue this.