Abstract
As generative artificial intelligence (AI) continues its ballistic trajectory, everything from text to audio, image, and video generation continues to improve at mimicking human-generated content. Through a series of perceptual studies, we report on the realism of AI-generated voices in terms of identity matching and naturalness. We find human participants cannot consistently identify recordings of AI-generated voices. Specifically, participants perceived the identity of an AI-generated voice to be the same as its real counterpart approximately
of the time, and correctly identified a voice as AI generated only about
of the time.
Subject terms: Information technology, Computer science
In January 2024, in the lead up to the November United States presidential election, an estimated tens of thousands of Democratic party voters received a robocall in the voice of President Biden instructing them not to vote in the upcoming New Hampshire primaries. The voice was AI-generated.
The perpetrators of this attempted election interference were Steven Kramer (a political consultant), Paul Carpenter (a magician and hypnotist who was paid $150 to create the fake audio), and a telecommunications company called Lingo Telecom1,2. Carpenter used ElevenLabs, a platform offering instant voice cloning for as little as $5 a month. Kramer was fined $6 million and subsequently charged with two dozen crimes including impersonating a candidate and voter suppression, while the telecommunications company, Lingo Telecom, received a $1 million fine for transmitting the calls. This is just one of many examples of how the rise of generative AI is being weaponized, from election interference, to disinformation campaigns3, to small-4 and large-scale5 financial fraud.
There is a large literature on technologies that can automatically determine whether media – such as audio, video, and images – has been manipulated either by humans or generative AI6. These techniques, however, largely operate asynchronously and not as an audio or video call is unfolding in real time. The synchronous detection of fraudulent media, such as the phone calls that attempted to suppress voter turnout in New Hampshire, still poses significant technological challenges. Until technology can monitor every landline, mobile device, and video call (which itself would raise additional privacy concerns), people are largely left to their own defenses to sort out the real from the fake.
The question then naturally arises: how well-equipped are people for the perceptual challenge of distinguishing real from AI-generated content? The answer, of course, depends on both the quality of the fake and the modality of the media. For example, studies focusing on visual perception of images of people have concluded that participants are at chance at distinguishing real and AI-generated head shots7,8. Results for video (with audio) are more mixed – likely due to differences in the types of videos that have been assessed. While some studies report that performance is only slightly better than chance for videos of people talking9,10, a recent large-scale study investigating how well people could distinguish fabricated political speeches from real ones reports a consistent accuracy of
and above11.
Here, we focus on people’s ability to distinguish real voices from AI-generated voice clones, as would be required to detect a fraudulent phone call or voicemail. Interestingly, prior work suggests that people are better at this task than detection of AI-manipulated images, but can still often be tricked. For example, Mai et al.12 report that human participants were able to accurately distinguish real voices from AI-generated voice clones with an accuracy of
. This study, however, only used a single English and a single Chinese speaker identity, and the spoken phrases consisted of a single sentence ranging in length from 2 to 11 seconds (by comparison the fake Biden robocall was 40 seconds in length). Müller et al.13 report a similar accuracy of
. This second study has the advantage that it employed multiple speaker identities (107), but the spoken phrases were still relatively short at one to two sentences in length. Mostly consistent with these earlier studies, a recent study by Warren et al.14 found that human participants detect AI-generated voices with an accuracy of
. For each of these studies, the AI-generated voices were not created using state-of-the-art, commercially available techniques, and all three focused on naturalness (is the voice real or not) and did not examine identity perception (who is speaking).
Groh et al.11 also investigated audio-only performance with political speeches. Performance when people were only given the isolated audio of the speeches was worse than with the full video with audio, but still above chance. Political speeches, however, are atypical given that the speakers are highly familiar to most listeners, and that political speeches have characteristic cadences and content that do not necessarily reflect natural conversational speech.
We expand on these previous studies by employing the state-of-the-art voice cloning of ElevenLabs (used in the Biden robocall), increasing the number of speakers to over 200, and considering how different tasks (identity and naturalness) impact our ability to distinguish AI-generated voices. This study reveals that, generally speaking, people are poorly equipped to identify AI-generated voice clones, both in terms of identity matching and naturalness.
Methods
Real speaker dataset
The stimuli for our study were sampled from the DeepSpeak dataset, which comprises recordings of 220 unique speakers collected through the Prolific research recruitment platform15. More details about this dataset can be found in the accompanying manuscript, but here we provide a brief summary.
All speakers were native English speakers and U.S. residents. Their ages ranged from 18-75 years (mean=38, sd=11.4), with 109 identifying as male, 107 female, and 4 non-binary. Racial identities included 158 White/Caucasian, 39 Black/African American, 26 Asian, 4 American Indian/Alaska Native, 2 Native Hawaiian/Other Pacific Islander, and 5 other.
Each speaker was instructed to record themselves responding to 32 prompts. The prompts were divided into four categories: (1) standardized scripted responses in which each speaker read the same prompt extracted from transcripts of the TIMIT dataset16; (2) randomized scripted responses in which each speaker read a randomized prompt from TIMIT; (3) unscripted responses in which each speaker responded to four open-ended questions and was asked for a response that was close to 30 seconds in length; and (4) combined responses consisting of four open-ended unscripted questions in which each speaker read a question out loud and then answered it.
Both audio and video were recorded using a custom-built web application. The audio/video recordings were converted from their initial .webm format to .mp4 at a bitrate of 192 kbps, from which the audio was extracted as a .wav file. All real audio files were converted to a .mp3 format with a sample rate of 44 kHz, with an amplitude normalized between −1 and 1, and with silences at the start and end removed. In this study, we only used a subset of the audio clips, which had a mean duration of 5.37s (min: 0.88s, max: 52.57s).
Voice cloning
A voice clone of each of the 220 speakers was generated using ElevenLabs’ Instant Voice Cloning API. For each speaker, a cloned voice model was first synthesized using the audio from just the first two prompts in the DeepSpeak dataset as input. Transcripts of speakers’ responses to other prompts were then used to create a cloned version of each original audio clip (32 total per speaker). For scripted responses, we assumed that the speaker correctly repeated the prompt; for unscripted responses, OpenAI’s Whisper17 was used to transcribe the audio. Cloned speaker audio files were converted to the same format, sample rate, and amplitude as the real speaker audio files.
Voice matching
For one of our studies, participants were presented with two real voices with different speaker identities. When people heard two speakers with different identities, we wanted the voices to be similar to each other. Thus, for each speaker in our dataset, we also determined another speaker with a perceptually similar voice. This voice matching was performed by first extracting a 192-D TitaNet embedding18 of the same scripted sample. The closest matching voice was determined by finding the voice of another speaker (with replacement) with the maximal cosine similarity between extracted embeddings (the mean similarity was 0.6 in the range
). Audios were paired by scriptedness to ensure comparisons were made between similar contexts.
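The matching step can be illustrated with a minimal sketch. It assumes each speaker is already represented by an embedding vector; the speaker IDs and three-dimensional toy vectors below are hypothetical stand-ins for the 192-D TitaNet embeddings, and the function names are ours, not taken from the study's code.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def closest_speakers(embeddings):
    """Map each speaker to the *other* speaker with maximal cosine similarity."""
    matches = {}
    for speaker, emb in embeddings.items():
        best_id, best_sim = None, -2.0  # cosine similarity is never below -1
        for other, other_emb in embeddings.items():
            if other == speaker:
                continue  # a speaker cannot be matched to themselves
            sim = cosine_similarity(emb, other_emb)
            if sim > best_sim:
                best_id, best_sim = other, sim
        matches[speaker] = (best_id, best_sim)
    return matches

# Hypothetical toy embeddings (the study used 192-D TitaNet embeddings).
toy = {
    "spk_a": [1.0, 0.0, 0.2],
    "spk_b": [0.9, 0.1, 0.3],
    "spk_c": [-0.5, 1.0, 0.0],
}
matches = closest_speakers(toy)
for speaker, (match, sim) in matches.items():
    print(f"{speaker} -> {match} (cosine similarity {sim:.3f})")
```

Because matching is done with replacement, one speaker can be the closest match for several others; in the toy data, `spk_b` is selected for both `spk_a` and `spk_c`.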
Study design
We examined people’s perception of AI-powered voice clones in two perceptual studies.
In the identity study (Fig. 1a, left), participants listened to two voices back-to-back (saying something different) and were asked to specify if the voices were from the same identity. Participants were randomly assigned to receive one of 10 batches comprising a randomized set of 44 stimuli (i.e., 44 unique random voice pairings). Thirty of these stimuli contained scripted single-sentence responses, 10 were unscripted responses, and 4 were attention checks (see below). There was no stimulus overlap between batches. Because the voice pairings were randomized, each stimulus could fall into six possible conditions. The three conditions of core interest for our analyses were:
[same identity] / [real speaker] ( ),
[same identity] / [real and AI-clone speaker] ( ),
[different identity] / [real speaker] ( ).
Fig. 1.
Study methods and results. (a) In the identity study, participants were presented with pairs of audio clips and asked if they were from the same or different people. We present results for three conditions: both voices were the same identity and a real person (
, green), both voices were the same identity but one clip was an AI-clone (
, yellow), and the voices were from real but different identities (
). Results are plotted as the percentage of trials on which each participant responded “same identity.” Thick lines indicate medians and thin lines indicate
and
quantiles. The
–
–
quantile for the three remaining, unplotted conditions are as follows: [same identity] / [AI-clone speaker] (
)
–
–
, [different identity] / [AI-clone speaker] (
)
–
–
, and [different identity] / [real and AI-clone speaker] (
)
–
–
. (b) In the naturalness study, participants were presented with audio clips one at a time and asked whether the voice was real or AI generated. Results are plotted in the same manner as panel (a), except now the ordinate indicates percent correct for real (blue) and AI-generated (orange) voices.
There were also trials that fell into three additional conditions, which ensured that the structure of the study could not be learned by participants over time:
[same identity] / [AI-clone speaker] ( ),
[different identity] / [AI-clone speaker] ( ),
[different identity] / [real and AI-clone speaker] ( ).
Participants were asked only to judge identity and were not told that some voices may be AI generated.
In the naturalness study, participants listened to one voice at a time and were asked to classify it as real or AI generated (Fig. 1b, left). The randomization and assignment of stimuli was identical to the identity study, except in this case each stimulus was one audio clip rather than two. Half of the audio clips were real and half were AI generated. Participants were not told of this distribution. For both studies, participants did not receive any explicit instructions to use headphones, earphones or speakers. The full instructions are included in the Supplementary Information.
Listener participants
A total of 604 participants were recruited from the Prolific crowd-sourcing platform, split into two groups of 304 and 300 for the identity and naturalness studies. Listener ages ranged from
years (mean=35, sd=12), with 293 male, 294 female, 11 non-binary and 6 not providing their gender. 411 listeners identified as White/Caucasian, 132 as Black/African American, 43 as Asian, 19 as American Indian/Alaska Native, 4 as Native Hawaiian/Other Pacific Islander, 30 as other, and 5 preferred not to share.
This study was approved by the UC Berkeley Committee for Protection of Human Subjects. Participants provided informed consent prior to participation and data collection was performed in accordance with relevant guidelines and regulations.
Before beginning the study, participants were given an overview of their task (either in terms of judging identity or naturalness) and tested their hardware. For the naturalness study, participants were given two examples of real voices and two examples of AI-generated voices to set expectations as to the realism of the voices.
To ensure that participants were paying attention, four attention checks were randomly distributed throughout. These checks consisted of audio clips that clearly described the correct answer to be selected. Participants who failed any of these attention checks were removed from subsequent analysis (the totals reported above only include participants who passed all attention checks). Participants could not respond without having listened to the entire audio(s). Participants were paid $5 for their time.
Statistical analysis
For each participant in the identity study, we calculated the percentage of stimuli for which they responded that the identities were the same, separately for each stimulus condition ((
), (
), (
), etc). The distribution of responses across conditions deviated substantially from normality, so we adopted non-parametric statistics to examine differences between conditions. We used single sample signed rank tests (with test statistic W) to examine differences from chance (
) separately for each condition, then a Friedman test was used to examine whether there were significant differences between conditions. Pairwise follow up tests were conducted via paired sample rank sum tests.
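As a rough illustration of the omnibus comparison across conditions, the sketch below computes a Friedman statistic from per-participant percentages of "same" responses. The data are hypothetical, the function names are ours, and the formula is the basic version without a correction for ties, so it may differ slightly from what a statistics package reports on tied data.

```python
def average_ranks(values):
    """Rank values from 1..k, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # mean of the 1-based positions i+1..j+1
        for pos in range(i, j + 1):
            ranks[order[pos]] = mean_rank
        i = j + 1
    return ranks

def friedman_statistic(rows):
    """Friedman chi-square for n participants (rows) by k conditions (columns)."""
    n, k = len(rows), len(rows[0])
    rank_sums = [0.0] * k
    for row in rows:
        for col, rank in enumerate(average_ranks(row)):
            rank_sums[col] += rank
    return 12.0 / (n * k * (k + 1)) * sum(r * r for r in rank_sums) - 3.0 * n * (k + 1)

# Hypothetical percentages of "same" responses for four participants in the
# three core conditions (same/real, same/real+clone, different/real).
toy_rows = [
    [100, 90, 10],
    [100, 80, 0],
    [90, 100, 20],
    [100, 70, 10],
]
stat = friedman_statistic(toy_rows)
print(f"Friedman statistic: {stat:.2f}")
```

The statistic is compared against a chi-square distribution with k − 1 degrees of freedom; pairwise follow-up tests are then run only if this omnibus test is significant.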
For the naturalness study, the response distributions were approximately normally distributed, so we used single and paired sample t-tests to examine differences from chance (
) for stimuli that were real and AI generated, as well as to directly compare responses for these two stimulus types. We report effect size as Cohen’s D.
For all analyses, we adopted a significance threshold of
and applied Bonferroni correction when the same analysis was run on multiple conditions. Because six participants in the naturalness study did not provide their gender, the participant count for these analyses was reduced from 300 to 294.
Because we used a large dataset of naturalistic stimuli and designed our study to obtain a large listener participant sample, audio clips in this experiment varied in length and each participant only listened to a subset of stimuli (the randomized batches). To examine how these, and other variables, might affect performance, we therefore fit the trial-by-trial responses with mixed-effects logistic regression models.
For the identity study, we focused on fitting performance on trials with the same real identity (
) and those with the same identity but one AI-generated clone (
) – that is, to keep the model relatively simple we only fitted the data from these two conditions. For the naturalness study, the model was fitted to all data. Each model also included random effects (intercepts) per participant and per stimulus batch. In addition to the stimulus condition, the fitted models included the following predictors: total audio duration (continuous), scripted vs unscripted (categorical), listener gender (categorical), listener age (continuous), speaker gender (categorical), speaker age (continuous). The combined audio duration of both audios was used.
For both models, the continuous predictors, age and duration, were scaled using Z-score normalization and robust scaling, respectively.
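The two scalings can be sketched as follows. The exact definitions are assumptions, since the text does not spell them out: Z-scoring is taken to use the population standard deviation, and robust scaling is taken to mean subtracting the median and dividing by the interquartile range. The ages and durations are hypothetical.

```python
import statistics

def z_score(values):
    """Center on the mean and divide by the (population) standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)
    return [(v - mean) / sd for v in values]

def robust_scale(values):
    """Center on the median and divide by the interquartile range (assumed definition)."""
    median = statistics.median(values)
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return [(v - median) / (q3 - q1) for v in values]

# Hypothetical listener ages (years) and clip durations (seconds).
ages = [20, 30, 40, 50, 60]
durations = [2.0, 4.0, 6.0, 8.0, 50.0]

scaled_ages = z_score(ages)
scaled_durations = robust_scale(durations)
print(scaled_ages)
print(scaled_durations)
```

Robust scaling is a natural choice for duration because a few very long clips (such as the 50-second value above) would otherwise inflate the mean and standard deviation and compress the scaled values of typical clips.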
Main results
Identity
When participants listened to a pair of real audio clips with the same speaker (
), they were on average highly accurate. That is, they overwhelmingly indicated that it was the same person speaking in both clips:
–
–
quantile of
–
–
, as shown in Fig. 1a, green. Indeed,
of participants indicated that it was the same identity on every trial for these stimuli. When one of the clips was an AI-clone (
), participants also overwhelmingly judged that it was the same person speaking in both clips:
–
–
quantile of
–
–
as shown in Fig. 1a, yellow. However, only
of participants indicated the same identity on every trial. This finding suggests that AI-clones are highly, but not uniformly, convincing. It is also possible that participants were simply biased to indicate that all speaker pairs were the same. The data, however, show that this was not the case: when participants listened to two different, but similar, voice identities (
), they rarely indicated that the speakers had the same identity:
–
–
quantile of
–
–
, as shown in Fig. 1a, red.
Statistical tests indicate that responses in each of these conditions are significantly different from chance (
: W = 94.5,
;
: W = 693.5,
;
: W = 301.5,
). There are also significant differences between conditions (Friedman test = 517.0,
). Follow up pairwise tests indicate that each pair of conditions differ significantly (
vs
: W = 3605.5,
;
vs
: W = 0.0,
;
vs
: W = 0.0,
). Responses for the other three conditions are generally consistent with these results and are reported in the caption of Fig. 1a.
The responses to the
and
conditions were further investigated with a logistic regression model (Table 1). As compared to the
condition, the
condition is associated with significantly fewer “same” responses. There is also a significant effect of speaker gender: the voices of male and non-binary genders are more often judged “same” as compared to female speakers. Interestingly, both combined audio duration and scriptedness have significant effects: longer clips and scripted clips are associated with more frequent “same” responses. We will return to explore these effects in more detail in the Exploratory results.
Table 1.
Generalized linear mixed-effects model for the identity study showing coefficients, z-values, p-values, and 95% confidence intervals. The model includes fixed effects for predictor variables and random intercepts by group. Predictor variables include combined duration (continuous), listener age (continuous), speaker average age (continuous), as well as scriptedness (scripted vs. unscripted), condition (real vs. AI-clone), and gender (categorical), with grouping by ResponseId (participant) and batch_number.
| Predictor | Coef. | z | p-value | 95% CI Lower | 95% CI Upper |
|---|---|---|---|---|---|
| Intercept | 2.619 | 18.201 | <0.001 | 2.337 | 2.901 |
| Condition assigned ( ) | −0.988 | −10.814 | <0.001 | −1.167 | −0.809 |
| Combined duration | 0.533 | 3.953 | <0.001 | 0.269 | 0.797 |
| Scripted (vs Unscripted) | −1.331 | −5.831 | <0.001 | −1.778 | −0.883 |
| Listener gender (Male) | −0.198 | −1.594 | 0.111 | −0.442 | 0.046 |
| Listener gender (Non-Binary) | −0.476 | −1.195 | 0.232 | −1.258 | 0.305 |
| Listener age | −0.079 | −1.293 | 0.196 | −0.199 | 0.041 |
| Speaker gender (Male) | 0.376 | 4.227 | <0.001 | 0.202 | 0.551 |
| Speaker gender (Non-Binary) | 0.736 | 3.028 | 0.002 | 0.260 | 1.213 |
| Speaker average age | −0.005 | −0.133 | 0.894 | −0.087 | 0.076 |
| Random effects | |||||
| Group: ResponseId (304 levels) | Standard deviation = 0.745 | ||||
| Group: Batch Number (10 levels) | Standard deviation = 0.257 | ||||
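To give the log-odds coefficients in Table 1 a more intuitive scale, the sketch below converts the intercept and the condition coefficient into predicted probabilities of a "same" response. It assumes that every other predictor is held at its reference level or scaled value of zero and that the random effects are zero, so the numbers describe a baseline trial for a typical participant rather than the raw averages in the data.

```python
import math

def inverse_logit(log_odds):
    """Convert log-odds to a probability."""
    return 1.0 / (1.0 + math.exp(-log_odds))

# Fixed-effect coefficients copied from Table 1.
INTERCEPT = 2.619            # baseline: both clips from the same real speaker
AI_CLONE_CONDITION = -0.988  # shift when one clip is an AI clone

p_same_real = inverse_logit(INTERCEPT)
p_same_clone = inverse_logit(INTERCEPT + AI_CLONE_CONDITION)

print(f"P('same' | real pair):       {p_same_real:.3f}")   # about 0.93
print(f"P('same' | real + AI clone): {p_same_clone:.3f}")  # about 0.84
```

Under these assumptions, replacing one clip with an AI clone lowers the baseline probability of a "same" response from roughly 0.93 to roughly 0.84.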
Naturalness
In the naturalness study, people were asked to directly judge whether individual audio clips were real or AI generated (irrespective of identity). This proved to be a more challenging task (Fig. 1b). When the audio clip contained a real voice, participants were correct on average
of the time (standard deviation
). Similarly, when the audio clip was one of our AI clones, participants were correct
of the time (standard deviation
). Indeed,
and
of participants were at or below chance for the real and AI-generated stimuli, respectively.
Nonetheless, for both stimulus types, average performance is significantly greater than chance with a medium or large effect size (real: t(299) = 20.4,
, D = 1.177; AI generated: t(299) = 11.2,
, D = 0.648). A pairwise comparison between stimulus types indicates a small but statistically significant difference associated with stimulus type (t(299) = 4.6,
, D = 0.263).
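The reported effect sizes can be cross-checked against the t statistics. The sketch below assumes Cohen's D was computed as the mean difference divided by the standard deviation of the scores (or of the paired differences), in which case D = t / √n for a one-sample or paired test; the text does not state the formula, so this is an assumption.

```python
import math

def cohens_d_from_t(t_value, n):
    """Cohen's d implied by a one-sample or paired t statistic with n participants."""
    return t_value / math.sqrt(n)

N_PARTICIPANTS = 300  # t(299) implies 300 participants

# (label, reported t, reported D) taken from the text above.
reported = [
    ("real vs chance", 20.4, 1.177),
    ("AI generated vs chance", 11.2, 0.648),
    ("real vs AI generated", 4.6, 0.263),
]

implied = {}
for label, t_value, d_reported in reported:
    implied[label] = cohens_d_from_t(t_value, N_PARTICIPANTS)
    print(f"{label}: implied d = {implied[label]:.3f}, reported D = {d_reported}")
```

The implied values agree with the reported ones to within the rounding of the t statistics, which is consistent with the assumed formula.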
Consistent with these results, the logistic regression model (Table 2) indicates a significant effect of stimulus type (real vs AI generated) on accuracy. None of the listener or speaker demographics are associated with statistically significant effects. However, as in the identity study, there are, again, significant effects associated with audio duration and scriptedness. In this study, longer clips and unscripted clips are associated with more accurate judgments of naturalness.
Table 2.
Mixed-effects logistic regression model for the naturalness study showing coefficients, z-values, p-values, and 95% confidence intervals. The model includes fixed effects for predictor variables and random intercepts by group. Predictor variables include duration (continuous), listener age (continuous), speaker age (continuous), as well as scriptedness (scripted vs. unscripted), condition (real vs. fake), and gender (categorical), with grouping by ResponseId (participant) and batch_number.
| Predictor | Coef. | z | p-value | 95% CI Lower | 95% CI Upper |
|---|---|---|---|---|---|
| Intercept | 0.497 | 10.650 | <0.001 | 0.405 | 0.588 |
| Condition assigned (AI generated) | −0.229 | −5.614 | <0.001 | −0.309 | −0.149 |
| Duration | 0.160 | 6.640 | <0.001 | 0.112 | 0.207 |
| Scripted (vs Unscripted) | 0.311 | 3.153 | 0.002 | 0.118 | 0.505 |
| Listener gender (Male) | −0.011 | −0.211 | 0.833 | −0.110 | 0.089 |
| Listener gender (Non-Binary) | 0.274 | 1.233 | 0.218 | −0.162 | 0.710 |
| Listener age | −0.004 | −0.162 | 0.871 | −0.054 | 0.046 |
| Speaker gender (Male) | 0.020 | 0.493 | 0.622 | −0.059 | 0.099 |
| Speaker gender (Non-Binary) | −0.126 | −0.866 | 0.387 | −0.410 | 0.159 |
| Speaker age | 0.028 | 1.359 | 0.174 | −0.012 | 0.067 |
| Random effects | |||||
| Group: ResponseId (294 Levels) | Standard deviation = 0.263 | ||||
| Group: Batch number (10 Levels) | Standard deviation = <0.001 | ||||
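Coefficients in Table 2 are on the log-odds scale; exponentiating them gives odds ratios, which some readers find easier to interpret. The sketch below does this for two rows of the table, including the confidence bounds. Note that duration was scaled before fitting, so its odds ratio applies per scaled unit rather than per second.

```python
import math

# (coefficient, 95% CI lower, 95% CI upper) copied from Table 2.
table2_rows = {
    "Condition assigned (AI generated)": (-0.229, -0.309, -0.149),
    "Duration": (0.160, 0.112, 0.207),
}

# Exponentiate each log-odds value to obtain an odds ratio.
odds_ratios = {
    name: tuple(math.exp(value) for value in values)
    for name, values in table2_rows.items()
}

for name, (estimate, lower, upper) in odds_ratios.items():
    print(f"{name}: OR = {estimate:.3f} (95% CI {lower:.3f} to {upper:.3f})")
```

An odds ratio below 1 (as for AI-generated clips) indicates lower odds of a correct response than in the reference condition, and a value above 1 (as for duration) indicates higher odds.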
Together with the results of the identity study, these results reflect a status quo whereby people can often be tricked into thinking that AI-generated clones have the same identity as a real speaker, and cannot reliably detect when a voice they hear is AI generated.
Exploratory results
AI-powered phone scams can range from brief, scripted robocalls to fully-fledged conversations. Since our analyses revealed a significant effect of audio clip length and scriptedness on performance in both studies, we conducted an exploratory follow up to understand better how these factors affect performance. We focused on the naturalness task and aimed to investigate whether the effect of clip length on performance in the naturalness study was due, at least in part, to the fact that the unscripted responses tended to be longer (i.e., was improved performance caused by clip length or unscripted content?). This study contained a new set of 25 additional audio clips with longer scripted and shorter unscripted audio clips. It also included combined scripted-unscripted clips, for example, a speaker reading aloud a written question and then answering it. These audio clips were played to 30 new participants.
We first conducted exploratory analyses separately on clip length and scriptedness for this new dataset. For the clip length analysis, we computed the average accuracy across all participants for each clip. This analysis revealed a weak to moderate (but statistically significant) positive relationship between audio duration and the accuracy of identifying the audio as real or AI generated (Spearman
= 0.245,
). Scriptedness was associated with a statistically significant difference in performance in this study (Friedman test = 13.0, p = 0.001), and additionally, we observed a qualitative trend consistent with the main study, whereby the median accuracy was highest for combined audios (
), followed by unscripted audios (
) and scripted audios (
).
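For readers unfamiliar with the rank correlation used in the clip-length analysis, the following minimal sketch computes a Spearman coefficient as the Pearson correlation of average ranks. The durations and per-clip accuracies are hypothetical and are not the study's data.

```python
import math

def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for pos in range(i, j + 1):
            ranks[order[pos]] = mean_rank
        i = j + 1
    return ranks

def spearman_rho(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mean_x, mean_y = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical clip durations (seconds) and mean accuracy per clip.
clip_durations = [1.0, 2.0, 3.0, 4.0, 5.0]
clip_accuracy = [0.50, 0.60, 0.55, 0.70, 0.80]

rho = spearman_rho(clip_durations, clip_accuracy)
print(f"Spearman rho = {rho:.3f}")
```

Because it depends only on ranks, the coefficient captures any monotonic relationship between duration and accuracy and is not dominated by a handful of unusually long clips.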
A logistic regression that included effects for both scriptedness and audio duration indicated that these effects were not statistically significant, likely because clip length and scriptedness were still somewhat correlated even in this dataset. However, the coefficients were consistent with the findings of the logistic regression model from the naturalness study, which identified positive relationships for both duration and scripting. Therefore, these results provisionally support the conclusion that both the duration of an audio and its scriptedness influence accuracy, but a larger dataset with a wider variety of audio clips would be necessary to confirm this.
These follow-up results, in conjunction with the findings from the naturalness study, suggest a potential strategy for listeners to better identify fraudulent AI voices: engage the speaker in a longer conversation by, for example, asking open-ended questions.
Discussion
Here, we discuss the implications of these results further and report an exploratory analysis of participant strategies.
Self-reported listener strategies
At the end of each study, participants were asked to share any tactics they used to differentiate between same/different voices (identity study) and real/AI-generated voices (naturalness study). Keywords were extracted from their responses using a qualitative coding analysis and grouped into thematic codes. As shown in Fig. 2, the top three most frequent codes in the identity study were “inflection,” “accents,” and “other.” For the naturalness study, these were “inflection,” “breathing,” and “background noise.”
Fig. 2.
The 21 most frequent thematic codes as reported by participants in the naturalness (top) and identity (bottom) studies. See also Fig. 3.
We also found evidence that these strategies were impacting performance by looking at participant responses. Specifically, six audio recordings were correctly classified by all participants in the naturalness study, all of which were real, and five of which were unscripted. Qualitatively, we observed that these six recordings did indeed contain audible background noise, opening mouth clicks, and several disfluencies (um, ah, etc.). At the same time, these cues were not necessarily diagnostic in general. For example, “background noise” was mentioned 79 times across both studies, yet, upon analyzing the average background noise in both datasets, no significant effects on performance were found. Indeed, given the variety of background noise conditions that can occur on phone calls, it seems unlikely that this cue holds a key to detecting AI voice clones on the phone. We therefore suggest that a good listener strategy would be to disregard background noise as a cue and focus on the qualities of the caller’s voice.
Additionally, we examined the strategies of the top and bottom
of performers for each study (Fig. 3). Interestingly, for the identity study, “inflection” remained the most popular strategy for both top and bottom performers, but “accents” were more popular with top performers than with bottom performers. For the naturalness study, the top performers also relied upon “inflection” as their most popular strategy, while the bottom performers relied equally upon “inflection” and “breathing.” While both groups relied upon “pauses” and “background noise” as popular strategies, top performers relied more frequently upon “speech pace,” while bottom performers tended to favor “conversational” strategies. These findings suggest that, while some authenticity cues such as breaths may be misleading, speech pace and accent may still offer effective indicators for detecting AI voice clones.
Fig. 3.
The most frequent thematic codes as reported by the top and bottom
performing participants for the naturalness (top) and identity (bottom) studies. See also Fig. 2.
Gender differences
Although there were no effects of gender (or other demographics) in the naturalness study, we found an effect of gender in the identity study. In particular, the voices of male and non-binary speakers were associated with more “same” responses as compared to female speakers. This could be the result of a bias in the AI training in which female voices are underrepresented and so AI-generated male voices are more identity preserving. However, we were also curious whether this finding might reflect an additional perceptual effect related to gender. For example, the well-known cross-race effect describes the phenomenon in which people more easily recognize faces that belong to their own racial group, or the group to which they have the most exposure19. We wondered if there was a similar cross-gender effect in voice perception.
To investigate this possibility, the main logistic regression models described in Tables 1 and 2 were modified to replace the speaker and listener gender predictors with a single same/different gender predictor corresponding to whether the listener and speaker were of the same gender (because non-binary listeners and speakers were heavily underrepresented in our dataset, they were excluded from this analysis). For both studies, we found no evidence for a cross-gender effect. That is, participants were not notably better or worse at detecting AI-generated voices within or across their own gender.
Forensic techniques
While modern forensic techniques20 are better than humans at distinguishing between certain characteristics of real and AI-generated voices, these techniques typically operate asynchronously, making it difficult to protect consumers on phone/video calls. While synchronous techniques can operate at the source of a call, this raises serious privacy concerns that would need to be addressed. Any such technology would also have to have exceedingly high accuracy to avoid false alarms and giving a false sense of security against AI-powered voice scams.
Our reporting of people’s ability to detect AI-generated voices derives from a task in which their attention is fully drawn to either the identity or realism of a voice. In real-world scenarios, people may be less attentive to the voices and therefore more likely to be fooled.
Another intervention that may prove useful (albeit not perfect) in mitigating the risk of AI-powered scams is the insertion of imperceptible watermarks that are difficult to remove and easy to identify into AI-generated voices. With the appropriate software at the receiver, AI-generated voices can then be easily identified. If, however, not all AI-powered voice generators deploy this technology, scammers would simply migrate to those services that opt out of this security protocol.
Conclusions
The quality and realism of AI-generated media is rapidly improving. Given the results reported here, there is good reason to believe that AI-generated voices will soon be indistinguishable from real ones both in terms of naturalness and identity. While this should be considered a triumph for those on the generative side, it raises real concerns for public safety. Our results highlight that relying on human perception to detect AI-generated voice clones is no longer consistently reliable. Thus, improved technologies that can detect AI-generated voices while protecting the user’s privacy will become essential tools for preventing phone-based – and eventually, video-based – frauds.
Acknowledgements
We thank ElevenLabs for API access to their voice cloning service. This work was supported by funding from the University of California Noyce Initiative and a gift from YouTube.
Author contributions
Authors SB and HF contributed equally to the conception, execution, and analysis of these studies. EC helped with conception, analysis, and writing. All authors reviewed the final manuscript.
Data availability
Audio data are available at https://huggingface.co/datasets/faridlab/deepspeak_v1. Anonymized speaker and listener data are available at https://doi.org/10.5281/zenodo.13654688.
Declarations
Competing interest
HF is the co-founder and Chief Science Officer at GetReal Labs, a company developing techniques to detect AI-generated content, an advisor to the Content Authenticity Initiative, and a member of the steering committee for the Coalition for Content Provenance and Authenticity.
Footnotes
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Change history
7/23/2025
This article has been updated to amend the license information.
References
- 1. Bond, S. A political consultant faces charges and fines for Biden deepfake robocalls. NPR (2024).
- 2. Ramer, H. Political consultant behind fake Biden robocalls says he was trying to highlight a need for AI rules (Associated Press, 2024). https://apnews.com/article/ai-robocall-biden-new-hampshire-primary-2024-f94aa2d7f835ccc3cc254a90cd481a99
- 3. Alsharif, M., Marquez, A., Mullen, A. Elon Musk retweets altered Kamala Harris campaign ad. NBC News (2024).
- 4. AARP. Chatbots and voice-cloning fuel rise in AI-powered scams. AARP Arizona (2024).
- 5. Magramo, K., Chen, H. Finance worker pays out $25 million after video call with deepfake ‘chief financial officer’. CNN (2024).
- 6. Farid, H. Creating, using, misusing, and detecting deep fakes. Journal of Online Trust and Safety 1 (2022).
- 7. Nightingale, S. J. & Farid, H. AI-synthesized faces are indistinguishable from real faces and more trustworthy. Proc. Natl. Acad. Sci. 119, e2120481119 (2022).
- 8. Bray, S. D., Johnson, S. D. & Kleinberg, B. Testing human ability to detect ‘deepfake’ images of human faces. J. Cybersecur. 9, tyad011 (2023).
- 9. Groh, M., Epstein, Z., Firestone, C. & Picard, R. Deepfake detection by human crowds, machines, and machine-informed crowds. Proc. Natl. Acad. Sci. 119, e2110013119 (2022).
- 10. Köbis, N. C., Doležalová, B. & Soraperra, I. Fooled twice: People cannot detect deepfakes but think they can. iScience 24 (2021).
- 11. Groh, M. et al. Human detection of political speech deepfakes across transcripts, audio, and video. Nat. Commun. 15, 7629 (2024).
- 12. Mai, K. T., Bray, S., Davies, T. & Griffin, L. D. Warning: humans cannot reliably detect speech deepfakes. PLoS ONE 18, 1–20 (2023).
- 13. Müller, N. M., Pizzi, K., Williams, J. Human perception of audio deepfakes. In 1st International Workshop on Deepfake Detection for Audio Multimedia, 85–91 (2022).
- 14. Warren, K. et al. Better be computer or I’m dumb: A large-scale evaluation of humans as audio deepfake detectors. In ACM SIGSAC Conference on Computer and Communications Security, 2696–2710 (2024).
- 15. Barrington, S., Bohacek, M., Farid, H. DeepSpeak Dataset v1.0. arXiv:2408.05366 (2024).
- 16. Garofolo, J. S. et al. DARPA TIMIT acoustic phonetic continuous speech corpus (1993).
- 17. Radford, A. et al. Robust speech recognition via large-scale weak supervision. In 40th International Conference on Machine Learning, ICML’23 (2023).
- 18. Koluguri, N. R., Park, T., Ginsburg, B. TitaNet: Neural model for speaker representation with 1D depth-wise separable convolutions and global context. In IEEE International Conference on Acoustics, Speech and Signal Processing, 8102–8106 (IEEE, 2022).
- 19. Young, S. G., Hugenberg, K., Bernstein, M. J. & Sacco, D. F. Perception and motivation in face recognition: A critical review of theories of the cross-race effect. Pers. Soc. Psychol. Rev. 16, 116–142 (2012).
- 20. Barrington, S., Barua, R., Koorma, G., Farid, H. Single and multi-speaker cloned voice detection: From perceptual to learned features. In IEEE International Workshop on Information Forensics and Security, 1–6 (IEEE, 2023).