Humans fail to detect more than a quarter of deepfake speech samples, according to a study published in PLOS ONE. It is the first study to assess how well humans can spot artificially generated speech in a language other than English.
Deepfakes are synthetic media, audio or video, created to resemble a real person’s voice or appearance. They’re a form of generative artificial intelligence (AI), a type of machine learning in which an algorithm learns the patterns of a real person’s voice or likeness and uses them to produce original sounds and images.
Early deepfake algorithms needed thousands of samples of a person’s voice to generate original audio, but the process has evolved: the latest pre-trained algorithms can recreate a person’s voice from a three-second clip of them speaking. Using openly available tools, almost anyone could produce such audio within a few days.
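As an illustration of how low that barrier has become, a few lines of Python with an open-source toolkit such as Coqui TTS can clone a voice from a short reference clip. This is a minimal sketch, not a tool used in the study; the model name and file paths are examples.

```python
# Illustrative sketch: voice cloning with the open-source Coqui TTS library
# (https://github.com/coqui-ai/TTS). Model name and paths are examples only.
from TTS.api import TTS

# Load a pre-trained multilingual voice-cloning model.
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2")

# Generate speech in the voice of the speaker heard in reference.wav,
# a clip only a few seconds long.
tts.tts_to_file(
    text="This sentence was never spoken by the real speaker.",
    speaker_wav="reference.wav",  # short sample of the target voice
    language="en",
    file_path="cloned_output.wav",
)
```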
A team from UCL in London used a text-to-speech (TTS) algorithm trained on two datasets, one in English and one in Mandarin, to create 50 deepfake speech samples in each language. Over 500 participants listened to these samples alongside genuine recordings and judged which were fake and which were real.
After listening to all the samples, participants correctly identified the fake speech only 73% of the time, a figure that improved only slightly after they were trained to recognise aspects of deepfake speech.
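For context on what that figure measures, the sketch below shows the kind of scoring such a listening study implies: the fraction of fake clips a listener correctly flags as fake. The responses here are made up for illustration and are not data from the paper.

```python
# Toy scoring sketch; the (judgement, ground truth) pairs are invented.
responses = [
    ("fake", "fake"), ("real", "fake"), ("fake", "fake"),
    ("real", "real"), ("fake", "real"), ("real", "real"),
]

# Detection rate: fraction of fake clips correctly flagged as fake.
fake_trials = [(judged, truth) for judged, truth in responses if truth == "fake"]
detection_rate = sum(judged == "fake" for judged, _ in fake_trials) / len(fake_trials)
print(f"fake speech detected {detection_rate:.0%} of the time")
```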
“Our findings confirm that humans are unable to reliably detect deepfake speech, whether or not they have received training to help them spot artificial content. It’s also worth noting that the samples that we used in this study were created with algorithms that are relatively old, which raises the question of whether humans would be less able to detect deepfake speech created using the most sophisticated technology available now and in the future,” said Kimberly Mai (UCL Computer Science), first author of the study.
According to the authors, better automated detectors are needed to counter the threat of artificially generated audio and imagery. Generative audio technology has undeniable benefits: for example, it can give a voice to people with limited speech or those who have lost their voice through illness. However, there are growing fears that such technology could be used in criminal activity to cause significant harm to individuals and societies.
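Automated detectors of the kind the authors call for are typically binary classifiers trained on acoustic features. The sketch below is an illustrative toy, not the study’s method: it trains a logistic-regression classifier on MFCC features, with synthetic waveforms standing in for real and generated speech so the example runs without any data files.

```python
# Toy sketch of an automated deepfake-speech detector: a binary classifier
# over MFCC features. Synthetic waveforms stand in for real/fake audio;
# a real detector would train on labelled recordings.
import numpy as np
import librosa
from sklearn.linear_model import LogisticRegression

SR = 16000  # sample rate in Hz
rng = np.random.default_rng(0)

def mfcc_features(waveform: np.ndarray) -> np.ndarray:
    # Mean MFCC vector as a compact per-clip feature.
    mfcc = librosa.feature.mfcc(y=waveform, sr=SR, n_mfcc=13)
    return mfcc.mean(axis=1)

def fake_clip() -> np.ndarray:
    # Stand-in for generated speech: a harmonic tone plus light noise.
    t = np.arange(SR) / SR
    return np.sin(2 * np.pi * 220 * t) + 0.05 * rng.standard_normal(SR)

def real_clip() -> np.ndarray:
    # Stand-in for genuine speech: broadband noise.
    return rng.standard_normal(SR)

X = np.array([mfcc_features(fake_clip()) for _ in range(20)]
             + [mfcc_features(real_clip()) for _ in range(20)])
y = np.array([1] * 20 + [0] * 20)  # 1 = fake, 0 = real

clf = LogisticRegression(max_iter=1000).fit(X, y)
print("training accuracy:", clf.score(X, y))
```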
“With generative artificial intelligence technology getting more sophisticated and many of these tools openly available, we’re on the verge of seeing numerous benefits as well as risks. It would be prudent for governments and organisations to develop strategies to deal with abuse of these tools, certainly, but we should also recognise the positive possibilities that are on the horizon,” said Professor Lewis Griffin (UCL Computer Science), senior author of the study.
Kimberly Mai et al. (2023). Warning: humans cannot reliably detect speech deepfakes. PLOS ONE. https://doi.org/10.1371/journal.pone.0285333