NPJ Digital Medicine. 2025 Oct 17;8:616. doi: 10.1038/s41746-025-01987-3

Navigating the tradeoff between personal privacy and data utility in speech anonymization for clinical research

Catherine Diaz-Asper 1,2, Lars Ailo Bongo 3, Brita Elvevåg 4
PMCID: PMC12534620  PMID: 41107524

Abstract

Speech data inherently contains personally identifiable information. Anonymization strategies to obscure this while preserving essential characteristics all represent a tradeoff between privacy and utility. We examine this balancing act of modifying voice characteristics, masking identity, and eliminating identifiable content by showcasing challenges with the common techniques—generalization, suppression, anatomization, permutation, and perturbation—in the context of preserving utility for individual level speech data analyses in clinical research.

Subject terms: Communication, Ethics

Introduction

Human speech provides a unique window into cognitive, emotional, and mental health, with a growing body of research showing that both the content and delivery of speech can be analyzed by automated systems to detect, predict, and monitor psychiatric and neurological conditions1,2. These systems leverage advances in natural language processing, acoustic analysis, and machine learning to identify linguistic and acoustic features that correlate with conditions such as depression, schizophrenia, Parkinson’s disease, and Alzheimer’s disease. By non-invasively capturing speech data through clinical tasks or passive monitoring via smart devices, these tools offer the potential for personalized medicine in the form of earlier diagnosis, personalized treatment plans, and real-time tracking of disease progression or therapeutic response.

Speech also conveys valuable information about a person’s identity, thoughts, emotions, and intentions3,4. Important details about the speaker can be gleaned not only from the content and style of discourse, but also from the pitch, tone, accent, intonation, and pronunciation of what is said. Additionally, non-speech information such as the frequency, length, and placement of pauses, breathing, and background noise can all contribute to the identification of individuals with high accuracy4. Hence, to protect individual identities in datasets of recorded speech, deidentification efforts are necessary before the data are processed. However, research suggests that complete anonymization of speech is practically impossible to achieve without sacrificing the quality of the data itself5. Put differently, a balance must be struck between preserving utility of the data (by using all available information and risking exposure of personal data) and ensuring privacy of the speaker (by removing all identifiable information and obscuring the signal to the extent that useful information is lost). This tradeoff means that “(d)ata can be either useful or perfectly anonymous but never both”6.

In contrast to transcripts of spoken language, speech data include additional information that makes anonymization efforts especially challenging. Even when voice data are altered in some way, additional contextual cues often remain that can reveal a speaker’s identity. Studies have demonstrated that anonymized voice data can often be re-identified when cross-referenced with publicly available recordings, such as those found in social media posts, podcasts, and phone conversations (e.g., ref. 7). The combination of unique voice traits, artificial intelligence (AI)-driven analysis, metadata exploitation, and the extensive collection of voice recordings by various stakeholders (e.g., web services) means that even “anonymized” or modified voice data can often be traced back to its original source. Indeed, even when limited to textual data without additional voice components, large language models have successfully re-identified texts previously anonymized by other tools8.

Various approaches have been utilized to obscure personally identifiable information in speech datasets, including synthetic data generation, differential privacy, and adversarial training9,10. Research groups using shared datasets and evaluation protocols have developed speech anonymization systems based on these and related approaches, achieving strong privacy protections, but at the expense of speech quality10. Techniques that modify non-linguistic aspects of speech, such as signal distortion or alteration, can reduce the identifiability of a speech signal (i.e., its potential as a voiceprint), yet they are not viable in clinical settings where speech analysis relies on acoustic signals to detect subtle, non-linguistic abnormalities associated with neurological or psychiatric conditions11. Moreover, there is evidence that individuals with speech disorders face a higher risk of identification compared to those with typical speech12, and anonymization methods have varying effects across different diagnoses, suggesting that disorder-specific anonymization strategies may be necessary to achieve an optimal balance between privacy and utility13.

In this Perspective paper, we argue that anonymizing speech data for clinical applications is inherently complex, requiring careful evaluation of numerous factors, including speaker demographics, the type of speech elicitation task, and the recording environment. These variables must be considered when selecting anonymization techniques to ensure an acceptable balance between privacy protection and data utility. We begin by defining what constitutes speech data anonymization, then summarize the relevant legal frameworks governing speech data privacy. We also review general strategies for enhancing the privacy-utility tradeoff across diverse clinical research contexts. Finally, we recommend that ethical frameworks adopt detailed, quantifiable metrics to specify how, and to what extent, an individual’s speech data are safeguarded against re-identification.

What does “anonymization” mean?

In the context of speech data, anonymization refers to techniques used to remove or obscure personally identifiable information from recorded speech while preserving the usefulness of the data for research or machine learning14. The goal is to ensure that a person’s voice cannot be linked back to them while still allowing the speech to be analyzed for clinical or research purposes. Importantly, anonymization needs to address two distinct risks: the direct re-identification of a specific individual and the possible inference of more general characteristics (e.g., country of origin). While direct re-identification threatens personal privacy by linking data back to a known person, the inference of broader characteristics can also pose ethical and security concerns, as such information might still be exploited for nefarious purposes15. The likelihood of re-identification is substantially lower when individuals have little to no speech presence on social media or other public domains, due to fewer anchor points for matching16. Similarly, the collection of very brief and non-autobiographical speech samples lessens the risk of linking the data back to specific individuals17. This knowledge can aid researchers in contextualizing and articulating potential privacy risks in Institutional Review Board (IRB)—and similar ethics—applications. Figure 1 presents some of the key aspects of speech anonymization.

Fig. 1. An overview of speech anonymization.


Key aspects of anonymization (yellow box) may be conceptualized as resulting in three distinct methods (white boxes), namely (i) removing or altering voice characteristics, (ii) masking speaker identity, and (iii) eliminating identifiable content. Further, removing or altering voice characteristics (top white box) may be achieved by (a) modifying pitch, tone or timbre of speech or (b) replacing the speaker’s voice with a synthetic or generic voice (top two blue boxes). Masking speaker identity (middle white box) may be achieved by (a) modifying a speaker’s voiceprint by adjusting pitch, formants, and/or speech patterns or (b) using techniques like voice morphing or speaker embedding modification (middle two blue boxes). (1) Voice morphing is a technique that involves manipulating the spectral envelope and pitch of a voice to mimic another speaker’s characteristics. (2) Speaker embeddings are digital representations (vectors) of a speaker’s unique voice features, used by AI for speaker recognition. Modifying or replacing these embeddings makes it harder to match the anonymized speech to the original speaker. Finally, eliminating identifiable content (bottom white box) may be achieved by (a) removing speaker names, dates, locations, and experiences from speech recordings or (b) redacting personally identifiable words or phrases (bottom two blue boxes).

Once personally identifying information is removed, post-processing should also ensure secure usage of the data via encrypted or privacy preserving mechanisms such as multi-party computation (multiple parties jointly compute a function over their inputs while keeping the inputs private from each other) or homomorphic encryption (computations are performed directly on encrypted data without decrypting it; the output is also encrypted and can be decrypted only by the data owner). However, these techniques are generally impractical for individual-level speech data due to the complexity of speech recordings and the computational demands of these methods. Both approaches introduce significant processing and bandwidth demands, making tasks like feature extraction, automatic speech recognition, or model training prohibitively slow and resource-intensive when applied to raw or lengthy recordings. Additionally, they only protect data during computation, not after decryption, meaning speaker identifiable traits remain unless further anonymization is applied. While secure platforms and legal frameworks are essential components of a comprehensive data protection strategy, they can only mitigate risk through compliance and enforcement; they cannot eliminate the potential for data exposure.
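As a toy illustration of the multi-party computation idea described above, the following Python sketch uses additive secret sharing so that several hypothetical sites can learn the sum of one private number each (for example, a count of recordings) without any site seeing another site's value. The modulus, the function names, and the site values are assumptions made for illustration only; production protocols involve considerably more machinery, which is part of why they become costly for raw audio.

```python
import secrets

MODULUS = 2**61 - 1  # assumed public modulus for this toy example


def make_shares(value: int, n_parties: int) -> list[int]:
    """Split `value` into n additive shares that sum to it modulo MODULUS."""
    shares = [secrets.randbelow(MODULUS) for _ in range(n_parties - 1)]
    last = (value - sum(shares)) % MODULUS
    return shares + [last]


def secure_sum(private_values: list[int]) -> int:
    """Simulate all parties in one process; only partial sums are 'published'."""
    n = len(private_values)
    # shares_by_owner[i][j] is the share that party i would send to party j.
    shares_by_owner = [make_shares(v, n) for v in private_values]
    # Each party j adds up the shares it received, one from every owner.
    partial_sums = [
        sum(shares_by_owner[i][j] for i in range(n)) % MODULUS for j in range(n)
    ]
    # Combining the published partial sums reveals only the total.
    return sum(partial_sums) % MODULUS


if __name__ == "__main__":
    site_counts = [12, 7, 30]  # hypothetical per-site values
    print(secure_sum(site_counts))
```

Each individual share is a uniformly random number, so it carries no information about the value it was derived from; only the combination of all shares does. Even this single addition requires every party to exchange a message with every other party, which hints at the bandwidth cost of applying such schemes to long recordings.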

Anonymization and legal protections

The legal responsibility to preserve privacy is enshrined in the European Union’s General Data Protection Regulation (GDPR; https://gdpr-info.eu), which requires organizations processing personal data, including speech, to implement appropriate privacy and security safeguards (Articles 5 and 25). Speech data is considered personal data, and when used for identification, it is considered biometric data (Article 4(14)). Data that has been irreversibly anonymized, such that no reasonably available method (including modern AI-based techniques) could re-identify individuals, is no longer subject to GDPR (Recital 26). However, most processed speech remains identifiable due to unique vocal traits or potential linkages with other datasets, so GDPR typically applies. Further, speech varies as a function of demographic attributes such as race and ethnicity, which are considered “special categories of data” (Article 9) and prohibited from collection and use under the GDPR (unless specific consent is provided or other exceptions apply, such as substantial public interest, scientific research, or health purposes with safeguards). Hence, an individual’s voice can function as a “proxy” for ethnicity, even if no ethnicity data were collected, triggering the same GDPR concerns as if ethnicity data had been collected directly, and potentially leading to unintended, indirect discrimination18. While technical measures (e.g., anonymization, fairness algorithms) can mitigate risks, they do not by themselves guarantee GDPR compliance, which depends on a case-specific legal basis, safeguards, and risk assessment.

The Health Insurance Portability and Accountability Act (HIPAA; https://aspe.hhs.gov/reports/health-insurance-portability-accountability-act-1996) in the United States provides limited protection for speech data, applying only when the data qualifies as Protected Health Information (PHI). For speech to be considered PHI, it must be created, received, or stored by healthcare providers, insurers, clearinghouses, or third-party vendors handling PHI. Voice recordings that include a patient’s name, medical condition, treatment details, or similar information are protected under HIPAA. However, HIPAA no longer applies once speech data has been anonymized—that is, once all personally identifiable information has been removed. Importantly, if a voice recording has been stripped of personally identifiable information but still contains features that could be used to identify the speaker, HIPAA does not offer any further protections.

In the state of California, speech data collected by for-profit businesses, including healthcare entities, is protected under the California Consumer Privacy Act (CCPA; https://cppa.ca.gov/regulations/pdf/ccpa_statute.pdf). The CCPA grants consumers the right to know what personal data are being collected about them, to whom it is being disclosed or sold, and to request that their data be deleted. This includes audio recordings or transcripts that can be linked to an individual, such as those containing names, health information, or voice characteristics that could identify a person. Unlike HIPAA, which is limited to specific types of organizations and data, the CCPA has broader applicability and focuses on consumer rights across many industries. However, it does not apply to non-profit organizations or to data already regulated by HIPAA, meaning there can be gaps in protection depending on the nature of the organization and the context in which the data are collected.

So how do clinicians and researchers ensure that the speech data they collect and/or use cannot be traced back to the individuals who provided it, whilst maximizing the utility of the data for use in speech technologies (Fig. 2)? We argue that while complete anonymization is not currently possible without losing critical data for individual-level analyses, specific strategies can be applied to maximize the privacy-utility tradeoff and these will vary by the particular use case, as discussed below.

Fig. 2. A conceptual illustration of anonymization and the tradeoff between personal privacy and data utility.


A UNSAFE. Personally identifiable information remains, and the utility of the signal is not compromised (e.g., original speech file). B IDEAL. All personally identifiable information is removed without compromising the utility of the signal. C NOT USEFUL. Personally identifiable information remains, and the utility of the signal is compromised (e.g., poor recording quality). D CURRENT. Personally identifiable information is removed, but doing so compromises the utility of the data (e.g., adjusting pitch negatively affects acoustic analysis). The solid line on the graph illustrates the hypothetical tradeoff between privacy and utility when anonymizing a brief audio recording from an English-speaking individual. The three dashed lines underneath represent progressively greater uncertainty and deviation from the ideal situation (B), reflecting how variables such as speaker characteristics, type of speech task, technology used, recording conditions, and intended use of the data can influence this tradeoff. Grey shaded area: prioritizing usefulness of the data (signal and content) increases risk of individual re-identification. Green shaded area: prioritizing privacy of the individual risks the removal of data that may be critical for downstream purposes.

Anonymizing speech data often involves adapting techniques commonly used for tabular or textual data, such as generalization, suppression, anatomization, permutation, and perturbation, to the unique challenges of spoken language, each offering varying tradeoffs between privacy and utility (see dashed lines on Fig. 2). Figure 3 illustrates these techniques as they may apply to a hypothetical speech sample.

Fig. 3. Common anonymization strategies.


Several types of techniques (blue rectangles) can be applied to an original speech recording (top white speech bubble) to produce de-identified samples (yellow boxes). Generalization reduces the specificity of information, for instance, by converting a name or location into a broader category. Suppression removes identifiable details entirely, such as muting a speaker’s name or address. Anatomization separates identifying details (like metadata about the speaker) from the speech content itself, storing them in different locations to lower re-identification risk. Permutation replaces sensitive elements with alternative counterparts (e.g., switching “Jerry” for “Karyn” or “10 Pudding Lane” for “26 Smith Road”). Perturbation involves modifying the existing data by adding small distortions or noise, while keeping the structure and intelligibility intact.
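The five techniques in this caption can be sketched on a toy record. The Python below is a minimal illustration rather than a usable anonymizer: the transcript, the lookup tables, the placeholder strings, and all function names are hypothetical, and a real system would need entity recognition instead of fixed string matches. Permutation here follows the caption's usage (swapping a sensitive element for an alternative counterpart).

```python
import random

# Hypothetical lookups; a real system would detect entities automatically.
CATEGORIES = {"Jerry": "[PERSON]", "10 Pudding Lane": "[ADDRESS]"}
SWAPS = {"Jerry": "Karyn", "10 Pudding Lane": "26 Smith Road"}


def generalize(text):
    """Replace each specific item with a broader category label."""
    for term, category in CATEGORIES.items():
        text = text.replace(term, category)
    return text


def suppress(text, mask="***"):
    """Remove each identifying item entirely, leaving only a mask."""
    for term in CATEGORIES:
        text = text.replace(term, mask)
    return text


def permute(text):
    """Swap each sensitive item for an alternative counterpart."""
    for old, new in SWAPS.items():
        text = text.replace(old, new)
    return text


def anatomize(record, pseudonym):
    """Split a record into content and metadata rows linked only by a pseudonymous ID."""
    content = {"id": pseudonym, "transcript": record["transcript"]}
    metadata = {"id": pseudonym, "age": record["age"], "gender": record["gender"]}
    return content, metadata


def perturb(pause_durations, scale=0.05, seed=0):
    """Add small Gaussian jitter to pause durations (seconds), clipped at zero."""
    rng = random.Random(seed)
    return [max(0.0, d + rng.gauss(0.0, scale)) for d in pause_durations]


if __name__ == "__main__":
    record = {
        "transcript": "My name is Jerry and I live at 10 Pudding Lane.",
        "age": 67,
        "gender": "M",
    }
    print(generalize(record["transcript"]))
    print(suppress(record["transcript"]))
    print(permute(record["transcript"]))
    print(anatomize(record, pseudonym="P001"))
    print(perturb([0.4, 1.2, 0.3]))
```

Note that the first three functions act on content, anatomization acts on how records are stored, and perturbation acts on numeric features, so they address different parts of the re-identification risk and can be combined.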

Case studies illustrating the privacy-utility tradeoff

To strike the right balance between privacy and utility, anonymization strategies need to be customized at the individual level, according to the specific context. We demonstrate this by examining how factors such as the type of task used to elicit the actual speech data, speaker characteristics, recording conditions, and the intended use of the data can influence both the effectiveness of privacy protection and the usefulness of the data.

Type of speech task: story recall vs. autobiographical recall as an example

Strategies to preserve individual privacy will vary significantly depending on the type of speech task, as content and context determine the level and type of identifiable information present. In a typical story recall task, commonly used in behavioral studies, participants are asked to retell a fixed, short narrative soon after hearing it, ensuring that the speech content remains largely controlled and consistent across individuals. Since the story does not originate from the speaker’s personal life, it is unlikely to contain direct identifiers such as names, locations, or personal experiences (although exceptions may occur), so strategies such as removing or masking speaker metadata may be sufficient. However, identification can still be made via voice characteristics, so greater attention needs to be paid to masking the acoustic signal via strategies such as perturbation (see Table 1).

Table 1. Five types of anonymization strategies and case study examples of each

Strategy definitions
Generalization: replacing specific data with more general categories
Suppression: removing (or masking) identifying elements entirely
Anatomization: separating identifying information from the main dataset and storing it in a different location
Permutation: shuffling or reordering data to obscure relationships between identifying data and other attributes
Perturbation: modifying data slightly to obscure details while retaining structure or patterns

Task: Story recall
Generalization: Not needed (personally identifying information is not recalled)
Suppression: Not needed (personally identifying information is not recalled)
Anatomization: Separate audio signal from content. Speaker demographics stored separately and linked via pseudonymous IDs
Permutation: Speaker 1’s voice could be paired with Speaker 4’s age/gender in a released dataset
Perturbation: Audio: apply noise injection, etc., to disguise speaker voice. Features: add noise to timing (e.g., pauses) or prosodic data for machine learning datasets

Task: Autobiographical recall
Generalization: Identifying words (names, locations etc.) are abstracted
Suppression: Identifying words are removed or replaced with placeholders. (Useful for high-risk identifiers, but can fragment the story)
Anatomization: Separate audio signal from content. Speaker demographics stored separately and linked via pseudonymous IDs
Permutation: Speaker 1’s story is kept intact, but their demographics are swapped with Speaker 3. (Can destroy chronology/meaning of narratives)
Perturbation: Audio: apply noise injection, etc., to disguise speaker voice. Features: add noise to timing (e.g., pauses) or prosodic data for machine learning datasets

Speaker: English
Generalization: Can be applied efficiently using well-established ontologies
Suppression: Identifying words are removed or replaced with placeholders
Anatomization: Common in large datasets where speaker demographics are separated from transcripts/audio via participant IDs
Permutation: Permute participant demographics or reassign transcript labels. (Applied mostly at the metadata level)
Perturbation: Tools exist to perform synonym replacement, paraphrasing, or audio transformations

Speaker: Somali
Generalization: Generalization must often be done manually (fewer structured linguistic resources than English)
Suppression: Suppression may remove contextually important but culturally sensitive details
Anatomization: Anatomization should be tied to ethically sensitive metadata practices, not just technical separation
Permutation: Permutation must respect sociolinguistic identity in Somali more carefully than in English
Perturbation: Somali expressions and idioms may lose meaning when perturbed without cultural awareness

Conditions: Controlled
Generalization: Clean data allows generalizing demographic cues reliably
Suppression: Can apply voice activity detection and keyword spotting to mute names or identifiers
Anatomization: Can create voice embeddings and cluster similar sounding speakers, then mix or average their data
Permutation: Controlled conditions allow accurate feature alignment across speakers
Perturbation: Techniques like pitch shifting, formant warping, or spectral masking can mask identity while maintaining naturalness

Conditions: Naturalistic
Generalization: Might be less needed or less effective, since features like pitch may already be corrupted or unreliable
Suppression: Suppression may accidentally remove useful or non-sensitive information due to false positives in noisy conditions
Anatomization: May need to normalize recordings first or use noise-robust embeddings to anatomize meaningfully
Permutation: Restrict permutation to high-level features only (e.g., speaking rate) or avoid it altogether
Perturbation: Already noisy; adding more perturbation may degrade intelligibility

Intended use: Clinical
Generalization: Identifying words (names, symptoms, diagnoses etc.) are abstracted
Suppression: Identifying words (names, medical terminology etc.) are removed or replaced with placeholders
Anatomization: Store speech content and metadata (e.g., age, diagnosis) in separate databases linked only by pseudonymous IDs
Permutation: Swap demographic labels among speakers. (High risk of semantic misalignment affecting clinical utility)
Perturbation: Slightly alter timestamps or pitch to disguise identity without impacting diagnosis-relevant features

Intended use: Public
Generalization: Identifying words (names, locations etc.) are abstracted
Suppression: Identifying words are removed or replaced with placeholders
Anatomization: Detach demographic attributes from actual speech data
Permutation: Swap demographic labels among speakers
Perturbation: Change slight prosodic features or word choice. Mask speaker identity (voice conversion) while preserving topic or emotion

In contrast, autobiographical recall tasks, commonly included in cognitive test batteries used by neuropsychologists, involve personal memories that are naturally rich in self-referential and identifying details. These include names of people and places, familial relationships, dates, occupations, and culturally specific references. Such narratives often contain both direct and indirect identifiers, requiring a more nuanced and context-sensitive approach, such as redacting content from transcripts and audio and applying voice alteration techniques more rigorously. In some cases, portions of audio may need to be redacted or excluded entirely if they pose a high risk of re-identification. This, of course, will affect the utility of the speech signal for later processing. Ultimately, the autobiographical nature of the task demands greater attention to privacy, making privacy preservation more complex than in standardized speech tasks.
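Redacting content from both transcript and audio can be sketched as follows. This minimal Python example assumes that word-level timings are already available (for instance from a forced aligner or a speech recognizer) and that the identifying words have already been flagged by a reviewer or an entity detector; the data layout and function names are hypothetical. It also reports the fraction of audio silenced, which is one crude indicator of the utility cost mentioned above.

```python
def redact_audio(samples, sample_rate, words, flagged):
    """Return a copy of `samples` with the flagged word spans set to silence.

    words:   list of (token, start_sec, end_sec) tuples (assumed available).
    flagged: set of indices into `words` marked as identifying.
    """
    redacted = list(samples)
    for i in flagged:
        _, start, end = words[i]
        first = max(0, int(start * sample_rate))
        last = min(len(redacted), int(end * sample_rate))
        for k in range(first, last):
            redacted[k] = 0.0
    return redacted


def redact_transcript(words, flagged, placeholder="[REDACTED]"):
    """Replace flagged tokens with a placeholder, keeping word order."""
    return " ".join(
        placeholder if i in flagged else token
        for i, (token, _, _) in enumerate(words)
    )


def redacted_fraction(words, flagged, total_seconds):
    """Share of the recording's duration that was silenced."""
    removed = sum(words[i][2] - words[i][1] for i in flagged)
    return removed / total_seconds


if __name__ == "__main__":
    rate = 10  # deliberately tiny sample rate so the example is easy to inspect
    audio = [1.0] * 30  # three seconds of placeholder samples
    timings = [("I", 0.0, 0.5), ("am", 0.5, 1.0), ("Jerry", 1.0, 2.0)]
    print(redact_transcript(timings, {2}))
    print(redact_audio(audio, rate, timings, {2}))
    print(redacted_fraction(timings, {2}, total_seconds=3.0))
```

Silencing leaves the timing of the recording intact, which keeps pause and speech-rate measurements aligned, but any acoustic feature computed over the silenced span is lost.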

Speaker characteristics: English vs. Somali as an example

Strategies to preserve individual privacy can also differ significantly depending on the linguistic, cultural, demographic, and health characteristics of the speaker. In terms of language spoken, the global dominance of English in speech technologies has led to significant resource concentration, as most models, datasets, and tools are optimized for English. Since established voice anonymization techniques and tools are more likely to have been developed and tested in English, they tend to perform better for English speakers than for those of other languages19. Additionally, the sheer size and diversity of the pool of English speakers suggests a somewhat lower (but certainly not zero) risk of re-identification through voice alone. Also, rich metadata (e.g., location, age, gender) is often collected but can be stripped or generalized for privacy, and transcription tools can aid in content redaction for privacy (e.g., removing names, addresses). Finally, English speech data is governed by well-established legal frameworks, leading to greater privacy protections.

In contrast, for speakers of low resource languages such as Somali, the smaller pool of speakers may increase the risk of identifying individuals through accent, dialect, or unique voice traits. (The same can be said for other defined populations, such as those with specific neurological or psychiatric diagnoses; see, for example, ref. 13). The limited availability of Somali-specific anonymization tools or voice synthesis for privacy preservation, together with the poor performance of off-the-shelf tools developed in English, means that advanced voice obfuscation or synthesis techniques tailored to Somali phonetics would need to be developed. Similarly, native speakers would need to manually review and redact transcripts, as automation is less reliable. Hence, privacy preserving efforts generally are more labor intensive for low resource languages. Speakers also might be less aware of their digital rights and protections, making informed consent more complex, and depending on the location of data collection, there may be less rigorous national or institutional privacy frameworks in place (e.g., ref. 20), shifting the ethical responsibility more toward the data collector to define and implement protective measures.

Recording conditions: controlled environment vs. naturalistic setting as an example

The environment in which speech is recorded can introduce variables that influence how identifiable a speaker is, how much information can be extracted from the recording, and how resilient anonymization techniques need to be. In controlled laboratory environments, speech data are collected under standardized conditions, typically using high-quality recording equipment, with minimal background noise and consistent acoustic properties. These conditions produce clean, consistent recordings that facilitate the extraction of acoustic and linguistic features with high precision. However, the same consistency that benefits research utility can heighten privacy concerns, as the clarity of recordings allows for easy identification of personally identifiable characteristics. In this context, the selection of an anonymization technique directly shapes the privacy-utility tradeoff: using simpler, less aggressive methods increases the risk of re-identification, while applying stronger approaches can distort speech features essential for clinical or linguistic analysis.
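The idea that the strength of a method sets the position on the privacy-utility tradeoff can be made concrete with a single adjustable parameter. The Python sketch below adds white Gaussian noise to a signal at a chosen signal-to-noise ratio (SNR); lowering the SNR is the "more aggressive" direction. It is only an illustration of such a control knob: the function names are hypothetical, the test tone stands in for real speech, and additive noise on its own should not be assumed to defeat automatic speaker recognition.

```python
import math
import random


def power(x):
    """Mean squared amplitude of a sequence of samples."""
    return sum(v * v for v in x) / len(x)


def add_noise_at_snr(signal, snr_db, seed=0):
    """Add white Gaussian noise scaled so the result has the requested SNR (dB)."""
    rng = random.Random(seed)
    noise = [rng.gauss(0.0, 1.0) for _ in signal]
    target_noise_power = power(signal) / (10 ** (snr_db / 10))
    scale = math.sqrt(target_noise_power / power(noise))
    return [s + scale * n for s, n in zip(signal, noise)]


def measured_snr_db(clean, noisy):
    """SNR in dB, treating the difference between the signals as noise."""
    residual = [n - c for c, n in zip(clean, noisy)]
    return 10 * math.log10(power(clean) / power(residual))


if __name__ == "__main__":
    # A short sine tone stands in for a speech recording.
    tone = [math.sin(2 * math.pi * 5 * t / 200) for t in range(200)]
    for snr in (30, 20, 5):  # progressively more aggressive perturbation
        noisy = add_noise_at_snr(tone, snr)
        print(snr, round(measured_snr_db(tone, noisy), 3))
```

In practice the same parameter would be swept while tracking both a privacy measure and a utility measure, so that the chosen setting can be reported rather than assumed.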

In contrast, naturalistic settings, such as in the home, tend to be characterized by varying microphone quality (e.g., built-in laptop or phone microphones), greater background noise, variable reverberation and acoustics, and non-standardized recording setups. Background noise can be both beneficial and detrimental to anonymization strategies. On the one hand, background noise and distortions may partially anonymize the voice naturally, reducing the need for aggressive processing. On the other hand, non-verbal cues and background audio can inadvertently reveal location, household composition, or even identity through context (e.g., family voices, home sounds). Hence, this might require source separation or background filtering. Additionally, because naturalistic datasets are often large and demographically diverse, individuals can still be identifiable through distinctive accents, speech patterns, or environmental markers. Addressing these risks often requires more aggressive anonymization, which can erode important linguistic and contextual features, further exemplifying the inherent privacy-utility tradeoff.

Intended use of the data: clinical applications vs. public datasets as an example

In clinical research, when building machine learning models to detect cognitive impairment (e.g., refs. 21,22) or predict incipient psychosis (e.g., ref. 23), for example, the specific speech features that will be important are typically not known in advance. Consequently, applying anonymization techniques without careful consideration risks stripping away the richness and nuance of the original recordings, potentially degrading the model’s performance in identifying speech biomarkers. In clinical research contexts, where speech content, style, and emotion may all hold diagnostic value, anonymization techniques typically need to be less aggressive. Maximizing privacy might involve removing personally identifiable information and applying subtle pitch shifts to obscure speaker identity, while still preserving articulation and other diagnostically important features. However, retaining much of the speech signal also preserves individuals’ unique vocal traits, increasing the risk of re-identification. This risk is heightened in clinical datasets, which usually include fewer participants than public datasets, meaning each person represents a larger share of the sample, and rare traits (e.g., specific accents, speech disorders, or demographic combinations) stand out more prominently. Applying stronger anonymization methods, such as heavy perturbation, can mitigate this risk but often degrades the data’s clinical utility. Hence, the central challenge is to develop privacy preserving techniques that protect identity while retaining the essential speech characteristics (both known and yet to be discovered), a balance that is remarkably difficult to achieve.

In contrast, when developing generalized tools or public datasets that require only broad speech characteristics, stronger anonymization methods, such as fully synthetic speech generation, can be used to more thoroughly obscure speaker identity, environmental sounds, and other identifiable features. In large datasets, individuals are statistically less unique, reducing the likelihood that any one voice or trait stands out. This greater diversity also allows analyses to tolerate more aggressive anonymization, as the statistical power of a large sample enables population-level trends to remain detectable even when some signal fidelity is lost.

Next steps: anonymization metrics

The privacy-utility tradeoff exists and persists for speech data because the same features that make speech useful for clinical analysis also make it identifiable and therefore sensitive. To most effectively address this tradeoff, we argue that ethical frameworks in the near future should include detailed and quantifiable metrics of how and to what extent a person’s speech data have been protected against re-identification. This would enable individuals to make an educated judgment as to whether the privacy protections in place are sufficiently robust for them to contribute their speech for analysis. Presumably, it could also affect legal obligations and oversight. Such metrics would make anonymization practices more transparent, enabling regulators to evaluate compliance with laws such as the GDPR. By quantifying re-identification probabilities, these frameworks could push regulators and courts to formalize the much-needed thresholds for what qualifies as true anonymization versus pseudonymization, potentially triggering stricter compliance requirements. Over time, these practices will likely reduce flexibility as re-identification techniques become more advanced. However, they will also encourage the development of industry-wide standards for speech data anonymization, which will likely be codified into law and serve as benchmarks in regulatory enforcement.

These metrics would look different depending on the specific use case. For example, in a speech database of individuals with Parkinson’s disease, perturbation techniques such as adjusting pitch would hinder re-identification, but would also likely obscure diagnostically important speech characteristics. A better choice may be anatomization, where speaker metadata is stored separately, and the audio is separated from the speech content. To evaluate the efficacy of these techniques, metrics such as speaker ID accuracy drop, embedding similarity, and k-anonymity can be applied to quantify the extent to which identity is protected (ref. 24 presents a review of deep learning for protecting and evaluating privacy in NLP applications). A graphical representation of both privacy and utility together, inspired by the Data Nutrition Label project25, can effectively illustrate the tradeoff (see Fig. 4).
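Two of the privacy metrics named above can be sketched in a few lines. The following is a minimal illustration, not a method taken from the article: the metadata fields, the toy three-dimensional "embeddings", and the function names `k_anonymity` and `cosine_similarity` are all assumptions made for the example. Real speaker embeddings would come from a trained speaker-verification model and have hundreds of dimensions.

```python
import math
from collections import Counter


def k_anonymity(records, quasi_identifiers):
    """Return k: the size of the smallest group of records that share
    the same values for all of the given quasi-identifier fields."""
    if not records:
        raise ValueError("records must not be empty")
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return min(groups.values())


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical, already generalized speaker metadata (illustrative only).
speakers = [
    {"age_band": "60-69", "sex": "F", "region": "North"},
    {"age_band": "60-69", "sex": "F", "region": "North"},
    {"age_band": "70-79", "sex": "M", "region": "South"},
    {"age_band": "70-79", "sex": "M", "region": "South"},
    {"age_band": "70-79", "sex": "M", "region": "South"},
]
k = k_anonymity(speakers, ["age_band", "sex", "region"])

# Toy stand-ins for speaker embeddings before and after anonymization.
original_emb = [1.0, 0.0, 0.0]
anonymized_emb = [0.0, 1.0, 0.0]
sim = cosine_similarity(original_emb, anonymized_emb)

print(f"k-anonymity: {k}, embedding similarity: {sim:.2f}")
```

In this sketch a larger k means each speaker's metadata is shared with more people, and a lower embedding similarity suggests the anonymized voice has moved further from the original. Neither number says anything about what the anonymization cost in clinical utility, which has to be measured separately.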

Fig. 4. An example label of metrics illustrating the privacy-utility tradeoff.

Voice transformation + PII redaction using ASR + NER pipeline refers to a two-stage anonymization process for speech data. First, Automatic Speech Recognition (ASR) transcribes the audio into text, which is then processed by a Named Entity Recognition (NER) pipeline to detect and flag personally identifiable information (PII) such as names, locations, or phone numbers. These sensitive segments are then redacted or replaced in the audio. In parallel, voice transformation techniques, such as pitch shifting or speaker embedding modification, are applied to alter vocal characteristics and prevent speaker identification while preserving the intelligibility of the speech. Equal Error Rate (EER) is the operating point at which the false acceptance rate (the rate at which the system incorrectly accepts an impostor as the target speaker) equals the false rejection rate (the rate at which the system incorrectly rejects the genuine speaker). A higher EER typically indicates greater privacy in speaker anonymization tasks, but may be accompanied by reduced utility. Cosine similarity of x-vectors measures how similar the speaker embeddings (x-vectors) of the original and anonymized speech are. Human rater re-identification is when human raters try to re-identify speakers or infer private attributes (e.g., age, gender, identity); the score reflects how often they succeed relative to chance. Word Error Rate (WER) is a standard metric for evaluating ASR accuracy by measuring how closely the transcribed output matches the original spoken content. Here, WER quantifies how much speech intelligibility or recognizability is lost due to anonymization, helping to balance the tradeoff between privacy (e.g., high EER) and functionality (low WER). Tradeoff Score is a single metric, scaled from 0 to 10, that quantifies the balance between privacy and utility in the dataset. It may be derived from a weighted combination of normalized privacy and utility metrics (e.g., using a harmonic mean), the area under a privacy–utility curve, or a Pareto-optimality index that reflects how closely the dataset approaches the best achievable balance.
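The metrics in the label can be made concrete with a small sketch. The code below is illustrative only and rests on several assumptions that are not in the article: the function names, the toy scores and transcripts, the threshold sweep used to approximate EER, and in particular the normalization inside `tradeoff_score` (privacy taken as EER divided by a chance level of 0.5, utility taken as 1 minus WER, combined by an unweighted harmonic mean). Other normalizations and weightings would be equally defensible.

```python
def word_error_rate(reference, hypothesis):
    """WER = word-level edit distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    prev = list(range(len(hyp) + 1))  # distances for the empty reference
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


def equal_error_rate(genuine, impostor):
    """Approximate EER by sweeping every observed score as a threshold
    and returning the error rate where FAR and FRR are closest."""
    best_gap, eer = float("inf"), None
    for t in sorted(set(genuine) | set(impostor)):
        far = sum(s >= t for s in impostor) / len(impostor)  # false accepts
        frr = sum(s < t for s in genuine) / len(genuine)     # false rejects
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer


def tradeoff_score(eer, wer):
    """Hypothetical 0-10 score: harmonic mean of normalized privacy and
    utility. The normalization choices here are assumptions."""
    privacy = min(eer / 0.5, 1.0)  # EER of 0.5 is treated as chance level
    utility = max(1.0 - wer, 0.0)  # WER can exceed 1, so clip at zero
    if privacy + utility == 0:
        return 0.0
    return 10 * 2 * privacy * utility / (privacy + utility)


# Toy verification scores against anonymized speech (illustrative only).
genuine_scores = [0.9, 0.8, 0.7, 0.4]   # same-speaker trials
impostor_scores = [0.6, 0.3, 0.2, 0.1]  # different-speaker trials
eer = equal_error_rate(genuine_scores, impostor_scores)

wer = word_error_rate("the patient felt tired today",
                      "the patient fell tired")

score = tradeoff_score(eer, wer)
print(f"EER={eer:.2f}  WER={wer:.2f}  tradeoff={score:.2f}")
```

A harmonic mean is used here because it penalizes imbalance: a dataset that is very private but nearly unusable, or very usable but barely protected, scores low on both counts, which matches the intent of reporting a single balance figure.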

Conclusion

In clinical settings, speech technologies offer great promise for diagnosis, monitoring, and treatment. These tools enable continuous, remote, and longitudinal tracking of patients while providing clinicians with objective, data-driven insights to support personalized care and improve outcomes. However, because speech carries both overt and subtle indicators of identity, protecting patient privacy is critical. Current anonymization techniques that can effectively mask individual speaker identity risk distorting speech in ways that may compromise the effectiveness of downstream analysis tools. The case examples presented here highlight the need for customized approaches that balance privacy and utility—shaped by factors such as the nature of the speech task, speaker demographics, recording conditions, and analysis goals. Greater transparency can be achieved through the use and documentation of clear metrics that quantify both privacy protection and utility, empowering individuals to make informed decisions about sharing their speech data.

Acknowledgements

This work was supported by a research grant to Catherine Diaz-Asper from Rotary USA’s Coins for Alzheimer’s Research Trust (CART) Fund.

Author contributions

C.D.A.: conceptualization, writing. L.A.B.: editing. B.E.: conceptualization, editing.

Data availability

No datasets were generated or analysed during the current study.

Competing interests

The authors declare no competing interests.

Footnotes

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

References

  • 1. Kappen, M., Vanderhasselt, M. A. & Slavich, G. M. Speech as a promising biosignal in precision psychiatry. Neurosci. Biobehav. Rev. 148, 105121 (2023).
  • 2. Moell, B., Aronsson, F. S., Östberg, P. & Beskow, J. The order in speech disorder: a scoping review of state of the art machine learning methods for clinical speech classification. Preprint at https://arxiv.org/abs/2503.04802 (2025).
  • 3. Goldstein-Stewart, J., Winder, R. & Sabin, R. Person identification from text and speech genre samples. In Proceedings of the 12th Conference of the European Chapter of the ACL (EACL 2009) 336–344 (eds Lascarides, A. et al.) (Association for Computational Linguistics, 2009).
  • 4. Kröger, J. L., Lutz, O. H. M. & Raschke, P. Privacy implications of voice and speech analysis–information disclosure by inference. In Privacy and Identity Management. Data for Better Living: A. I. and Privacy (eds. Fischer-Hübner, S. et al.) 242–258 (Springer, 2020).
  • 5. Weitzenboeck, E. M., Lison, P., Cyndecka, M. & Langford, M. The GDPR and unstructured data: is anonymization possible? Int. Data Priv. Law 12, 184–206 (2022).
  • 6. Ohm, P. Broken promises of privacy: responding to the surprising failure of anonymization. UCLA Law Rev. 57, 1701 (2009).
  • 7. Rocher, L., Hendrickx, J. M. & de Montjoye, Y. A. Estimating the success of reidentifications in incomplete datasets using generative models. Nat. Commun. 10, 3069 (2019).
  • 8. Patsakis, C. & Lykousas, N. Man vs the machine in the struggle for effective text anonymisation in the age of large language models. Sci. Rep. 13, 16026 (2023).
  • 9. Arshad, R. & Asghar, M. R. Characterisation and quantification of user privacy: key challenges, regulations, and future directions. IEEE Commun. Surv. Tutor. 26, 1 (2024).
  • 10. Panariello, M. et al. The VoicePrivacy 2022 Challenge: Progress and perspectives in voice anonymisation. IEEE/ACM Trans. Audio Speech Lang. Process. 32, 3477–3491 (2024).
  • 11. Mandal, A., Chakraborty, T. & Gurevych, I. Towards privacy-aware mental health AI models: advances, challenges, and opportunities. Preprint at https://arxiv.org/abs/2502.00451 (2025).
  • 12. Tayebi Arasteh, S. et al. The effect of speech pathology on automatic speaker verification: a large-scale study. Sci. Rep. 13, 20476 (2023).
  • 13. Tayebi Arasteh, S. et al. Addressing challenges in speaker anonymization to maintain utility while ensuring privacy of pathological speech. Commun. Med. 4, 182 (2024).
  • 14. Nautsch, A. et al. Preserving privacy in speaker and speech characterisation. Comput. Speech Lang. 58, 441–480 (2019).
  • 15. Narayan, S. M., Kohli, N. & Martin, M. M. Addressing contemporary threats in anonymised healthcare data using privacy engineering. NPJ Digit. Med. 8, 145 (2025).
  • 16. Wiepert, D. et al. Reidentification of participants in shared clinical data sets: experimental study. JMIR AI 3, e52054 (2024).
  • 17. Viñals, I., Ortega, A., Miguel, A. & Lleida, M. An analysis of the short utterance problem for speaker characterization. Appl. Sci. 9, 3697 (2019).
  • 18. Van Bekkum, M. & Borgesius, F. Z. Using sensitive data to prevent discrimination by artificial intelligence: does the GDPR need a new exception? Comput. Law Secur. Rev. 48, 105770 (2023).
  • 19. Meyer, S., Lux, F. & Vu, N. T. Probing the feasibility of multilingual speaker anonymization. In Proc. Interspeech 2024, 4448–4452 (2024).
  • 20. Edgcumbe, A. et al. ‘Potato potahto’? Disentangling de-identification, anonymisation, and pseudonymisation for health research in Africa. J. Law Biosci. 12, lsae029 (2025).
  • 21. Diaz-Asper, C. et al. Increasing access to cognitive screening in the elderly: applying natural language processing methods to speech collected over the telephone. Cortex 156, 26–38 (2022).
  • 22. Chandler, C. et al. An explainable machine learning model of cognitive decline derived from speech. Alzheimers Dement. Diagn. Assess. Dis. Monit. 15, e12516 (2023).
  • 23. Zaher, F. et al. Speech markers to predict and prevent recurrent episodes of psychosis: A narrative overview and emerging opportunities. Schizophr. Res. 266, 205–215 (2024).
  • 24. Sousa, S. & Kern, R. How to keep text private? A systematic review of deep learning methods for privacy-preserving natural language processing. Artif. Intell. Rev. 56, 1427–1492 (2023).
  • 25. Holland, S., Hosny, A., Newman, S., Joseph, J. & Chmielinski, K. The dataset nutrition label. Data Prot. Priv. 12, 1 (2020).

Articles from NPJ Digital Medicine are provided here courtesy of Nature Publishing Group