Skip to main content
Data in Brief logoLink to Data in Brief
. 2020 Feb 26;29:105331. doi: 10.1016/j.dib.2020.105331

A dataset of histograms of original and fake voice recordings (H-Voice)

Dora M Ballesteros 1,, Yohanna Rodriguez 1, Diego Renza 1
PMCID: PMC7058910  PMID: 32154354

Abstract

This paper presents H-Voice, a dataset of 6672 histograms of original and fake voice recordings obtained by the Imitation [1,2] and the Deep Voice [3] methods. The dataset is organized into six directories: Training_fake, Training_original, Validation_fake, Validation_original, External_test1, and External_test2. The training directories include 2088 histograms of fake voice recordings and 2020 histograms of original voice recordings. Each validation directory has 864 histograms obtained from fake voice recordings and original voice recordings. Finally, External_test1 has 760 histograms (380 from fake voice recordings obtained by the Imitation method and 380 from original voice recordings), and External_test2 has 76 histograms (72 from fake voice recordings obtained by the Deep Voice method and 4 from original voice recordings). With this dataset, the researchers can train, cross-validate and test classification models using machine learning techniques to identify fake voice recordings.

Keywords: Fake voice, Machine learning, Convolutional neural networks, Binary classification, Imitation, Deep voice, H-Voice


Specifications Table

Subject Computer Vision and Pattern Recognition
Specific subject area Image processing related to identify/classify tampered data
Type of data Images
How data were acquired The images were obtained by calculating the histogram of original and fake voice recordings from a repository of the Deep Voice (https://audiodemos.github.io/) and the Imitation methods (https://doi.org/10.17632/ytkv9w92t6.1)
Data format Raw: histograms (PNG)
Parameters for data collection The voice recordings are re-quantized to 16 bits. The histograms with 216 bins are calculated from the voice recording (original or fake)
Description of data collection The dataset is composed by six directories, organized as follows:
  • 1.

    Training_fake: 2088 histograms from fake voice recordings (by the Imitation and the Deep Voice methods)

  • 2.

    Training_original: 2020 histograms from original voice recordings

  • 3.

    Validation_fake: 864 histograms from fake voice recordings (by the Imitation method)

  • 4.

    Validation_original: 864 histograms from original voice recordings

  • 5.

    External_test1: 760 histograms from original and fake voice recordings (by the Imitation method)

  • 6.

    External_test2: 76 histograms from original and fake voice recordings (by the Deep voice method)

Data source location City: Bogotá
Country: Colombia
Data accessibility Repository name: Mendeley
Data name: H-Voice: Fake voice histograms (Imitation + DeepVoice) [4]
Direct URL to data: https://doi.org/10.17632/k47yd3m28w.4
Value of the Data
  • This is the first dataset of histograms from original and fake voice recordings. The histograms are obtained from real signals (original and fake) using the Imitation [1,2] and the Deep Voice [3] methods.

  • This dataset of histograms allows fake voice classifiers to be trained, cross-validated and tested using machine learning techniques such as convolutional neural networks, like how it is done in anti-spoofing speaker verification systems that use spectrograms as features [5,6].

  • The dataset is balanced between original and fake voice recordings which is a desirable condition to obtain a good trade-off between precision and recall.

  • This dataset can be used for comparing the performance of different fake voice classification models.

1. Data description

This dataset is composed by histograms (images) from original and fake voice recordings obtained by two methods: Imitation [1,2] and Deep Voice [3]. This data set has four versions in Mendeley, the difference between them corresponding to the number of histograms. Version 1 has 3432 histograms, version 2 has 3792 histograms and version 3 has 6672 histograms. In version 4, corrupted images have been fixed. The latest version (i.e. version 4) is the one explained in this document, which is organized in six directories: Training_original, Training_fake, Validation_original, Validation_fake, External_test1, and External_test 2 [4].

Fig. 1 shows the structure of the dataset. This is explained below:

  • 1.

    Training_original: 2020 histograms from original voice recordings.

  • 2.

    Training_fake: 2088 histograms from fake voice recordings, of which 2016 histograms are obtained by the Imitation method, and 72 by the Deep Voice method.

  • 3.

    Validation_original: 864 histograms from original voice recordings.

  • 4.

    Validation_fake: 864 histograms from fake voice recordings obtained by the Imitation method.

  • 5.

    External_test1: this is composed of 380 histograms of original voice recordings and 380 histograms of fake voice recordings obtained by the Imitation method.

  • 6.

    External_test2: this is composed of 4 histograms of original voice recordings and 72 histograms of fake voice recordings obtained by the Deep Voice method.

Fig. 1.

Fig. 1

H-Voice dataset structure.

Fig. 2 shows examples of histograms of original and fake voice recordings of the training and validation directories. Fig. 3 and Fig. 4 show examples of the External_test1 and External_test2 directories, respectively.

Fig. 2.

Fig. 2

First example of histograms, located at: a)Training_original, b)Training_fake, c)Validation_original, and d)Validation_fake directories.

Fig. 3.

Fig. 3

Second example of histograms, located at External_test1 directory): a) original voice recording, b) fake voice recording obtained by the Imitation method.

Fig. 4.

Fig. 4

Third example of histograms, located at External_test2 directory): a) original voice recording, b) fake voice recording obtained by the Deep Voice method.

2. Experimental design, materials, and methods

Fake voice files are created entirely by a machine, either by machine learning (e.g. the Deep Voice method) or by signal processing techniques (e.g. the Imitation method). Unlike false voice recordings that are obtained by spoofing the voice, or by manipulating an original voice signal with insertion tasks, deletion or splicing. In the case of the Deep Voice method, a convolutional neural network is trained with original voice recordings to create new (fake) voice recordings with different plain text than the original. On the other hand, the Imitation method uses a re-ordering process of the wavelet coefficients of the original voice signal by imitating the genre, intonation and rhythm of another speaker.

The first step in creating our histograms was to obtain examples of fake voice recordings from the Deep Voice and the Imitation methods. In the case of Deep Voice, we use the voice recordings publicly available at https://audiodemos.github.io/. But, in the case of Imitation, we ourselves created fake voice recordings with the following code (based on the algorithm proposed in Ref. [2]):

  • % Inputs: original.wav, target.wav.

  • % Outputs: fake.wav, key.

  • [original, FS] = audioread(‘original.wav’); % read the original voice recording.

  • [target, FS] = audioread(‘target.wav’); % read the target voice recording (to be imitated).

  • [C1,L1] = wavedec(target,4,‘db10’); % obtain the wavelet coefficients of the original voice recording.

  • [C2,L2] = wavedec(original,4,‘db10’); % obtain the wavelet coefficients of the target voice recording.

  • [B1,IX1] = sort(C1,‘descend’); % sort the wavelet coefficients of the original voice recording.

  • [B2,IX2] = sort(C2,‘descend’); % sort the wavelet coefficients of the target voice recording.

  • C2m(IX1) = C2(IX2); % re-ordering the wavelet coefficients of the original voice recording.

  • key(IX1) = IX2; % obtain the key to reverse the process.

  • fake = waverec(C2m,L1,‘db10’); % create the fake voice obtained from the original voice recording.

  • audiowrite(‘fake.wav’,fake,FS,‘BitsPerSample’,16); % save the fake voice recording.

Examples of original and fake voice recordings obtained with the above algorithm are available at https://doi.org/10.17632/ytkv9w92t6.1.

Once the fake voice recordings have been generated, the following code in Matlab allows us to draw the histograms (original/fake):

  • % Input: name.wav.

  • % Output: histogram of the voice recording.

  • [voice, FS] = audioread(‘name.wav’); % read the original/fake voice recording.

  • nbins = 65536; % number of bins of the histogram.

  • h = histogram(x, nbins); % plot the histogram.

It is important to note that the examples of fake voice recordings obtained by Deep Voice published at https://audiodemos.github.io/have been re-quantized to 16-bits before their histograms were obtained.

Acknowledgments

This work is supported by the “Universidad Militar Nueva Granada – Vicerrectoría de Investigaciones” under the grant IMP-ING-2936 of 2019.

Footnotes

Appendix A

Supplementary data to this article can be found online at https://doi.org/10.1016/j.dib.2020.105331.

Conflict of Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Appendix A. Supplementary data

The following is the Supplementary data to this article:

Multimedia component 1
mmc1.xml (2.3KB, xml)

References

  • 1.Ballesteros L D.M., Moreno A J.M. Highly transparent steganography model of speech signals using Efficient Wavelet Masking. Expert Syst. Appl. 2012;39:9141–9149. [Google Scholar]
  • 2.Ballesteros L D.M., Moreno A J.M. On the ability of adaptation of speech signals and data hiding. Expert Syst. Appl. 2012;39:12574–12579. [Google Scholar]
  • 3.Arik S.O., Chrzanowski M., Coates A., Diamos G., Gibiansky A., Kang Y., Li X., Miller J., Ng A., Raiman J., Sengupta S., Shoeybi M. vol. 70. 2017. Deep voice: real-time neural text-to-speech; pp. 195–204.https://arxiv.org/abs/1702.07825 (Proceedings of the 34th International Conference on Machine Learning). [Google Scholar]
  • 4.Ballesteros D.M., Rodriguez Y.P., Renza D. 2020. (H-Voice: Fake Voice Histograms (Imitation+DeepVoice)). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 5.Himawan I., Villavicencio F., Sridharan S., Fookes C. Deep domain adaptation for anti-spoofing in speaker verification systems. Comput. Speech Lang. 2019;58:377–402. [Google Scholar]
  • 6.Zhang C., Yu C., Hansen J.H.L. An investigation of deep-learning frameworks for speaker verification antispoofing. IEEE J. Sel. Top. Signal Process. 2017;11:684–694. [Google Scholar]

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

Multimedia component 1
mmc1.xml (2.3KB, xml)

Articles from Data in Brief are provided here courtesy of Elsevier

RESOURCES