This repository contains the code for the paper Speaker Anonymization with Distribution-Preserving X-Vector Generation for the Voice Privacy Challenge 2020, presented at the Voice Privacy Challenge Special Session at Interspeech 2020. This work is a collaboration between Henry Turner and Giulio Lovisotto from the System Security Lab at the University of Oxford.
The idea for our challenge entry began by comparing the similarity distributions of x-vectors obtained from anonymized voices generated by the baseline system with those from organic voices. This comparison revealed that the anonymized voices had significantly higher inter-voice similarity: the similarity between pairs of fake voices is much higher than the similarity between pairs of organic ones, as can be seen in the following density plots of the similarity scores:
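A minimal sketch of how such inter-voice similarity scores can be computed, assuming cosine scoring between x-vector pairs (the function name and the 512-dimensional toy vectors are illustrative; the challenge's own evaluation uses its provided scoring setup):

```python
import numpy as np

def pairwise_cosine_similarities(xvecs):
    """Cosine similarity for every distinct pair of x-vectors (rows of `xvecs`)."""
    normed = xvecs / np.linalg.norm(xvecs, axis=1, keepdims=True)
    sims = normed @ normed.T
    # keep only the upper triangle: distinct pairs, no self-similarity
    iu = np.triu_indices(len(xvecs), k=1)
    return sims[iu]

# toy example: 10 random 512-dimensional stand-ins for x-vectors
rng = np.random.default_rng(0)
scores = pairwise_cosine_similarities(rng.normal(size=(10, 512)))
print(scores.shape)  # (45,) -- one score per distinct pair
```

Plotting a kernel density estimate of `scores` for anonymized vs. organic x-vectors gives density plots like those referenced above.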
We examined this further with a t-SNE analysis, conducted on a set comprised of half anonymized x-vectors and half organic x-vectors. This t-SNE analysis, shown below, clearly highlights the differences between the two types of x-vectors, with the fake ones being far more similar to one another than the organic x-vectors are.
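The t-SNE step can be sketched as follows; this is an illustrative setup with random stand-in data, not the paper's actual vectors (the scaling of the "anonymized" set merely mimics the tighter clustering we observed):

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
# stand-ins for 512-dimensional organic and anonymized x-vectors
organic = rng.normal(size=(50, 512))
anonymized = rng.normal(size=(50, 512)) * 0.1  # tighter cluster, mimicking the observation

# pool both halves and embed into 2-D for visualization
pooled = np.vstack([organic, anonymized])
embedded = TSNE(n_components=2, perplexity=15, random_state=0).fit_transform(pooled)
print(embedded.shape)  # (100, 2)
```

Scatter-plotting `embedded` with one color per half reproduces the kind of figure shown below.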
We identify the cause as the possibly biased fake x-vector generation method used by the baseline system, which averages and selects subsets of x-vectors from other users.
We set out to remedy this by developing a technique to generate pseudo-x-vectors that better utilize the x-vector hyperspace, creating more diverse x-vectors and, as a result, anonymized voices that are more distinct from one another.
We do this by training a PCA decomposition model on organic x-vectors and subsequently fitting a Gaussian Mixture Model (GMM) on the PCA-transformed version of these x-vectors. Now, rather than using a possibly biased x-vector generation method, we can simply sample fake x-vectors from the GMM, which better preserves the original inter-similarity across voices compared to the baseline system. We then apply the inverse of the PCA transformation, which maps each sample back into the correct dimensional space. Finally, we use the same voice processing pipeline as the baseline solution to generate an anonymized voice, replacing the speaker's x-vector with the newly created anonymized x-vector, as shown in the diagram below.
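The PCA + GMM sampling idea can be sketched as below; the component counts, dimensions, and function names are illustrative assumptions (the actual parameters and implementation live in `local/anon/gen_pseudo_xvecs.py` in this repository):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture

def fit_generator(organic_xvecs, n_pca=32, n_gmm=4):
    """Fit PCA on organic x-vectors, then a GMM in the PCA-reduced space."""
    pca = PCA(n_components=n_pca).fit(organic_xvecs)
    gmm = GaussianMixture(n_components=n_gmm, random_state=0)
    gmm.fit(pca.transform(organic_xvecs))
    return pca, gmm

def sample_fake_xvectors(pca, gmm, n):
    """Sample fake x-vectors from the GMM, then invert the PCA transform."""
    reduced, _ = gmm.sample(n)
    return pca.inverse_transform(reduced)  # back to full x-vector dimensionality

# toy data standing in for real 512-dimensional x-vectors
rng = np.random.default_rng(0)
organic = rng.normal(size=(200, 512))
pca, gmm = fit_generator(organic)
fakes = sample_fake_xvectors(pca, gmm, n=5)
print(fakes.shape)  # (5, 512)
```

Because the GMM is fitted to the organic distribution, samples drawn from it inherit the inter-voice similarity structure of real x-vectors rather than the collapsed structure produced by averaging.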
Please see our paper for further details:
📄 - Voice Privacy Challenge System Description
📄 - arXiv version
🎥 - Voice Privacy Challenge Presentation
This work was generously supported by a grant from Mastercard and by the Engineering and Physical Sciences Research Council [grant numbers EP/N509711/1, EP/P00881X/1].
The system is based on the Voice-Privacy-Challenge 2020 baseline system, which can be found on GitHub here.
A Dockerfile is provided, which clones the above repository and installs Kaldi.
As part of running the recipe, the models used for evaluation must be downloaded from the Voice Privacy Challenge organisers (stage 1). This currently requires a password from them; the process for obtaining it is described in the GitHub repository linked above. After the challenge has fully completed I will ask about including a download link separately in this code base.
- Docker installation with NVIDIA GPU support
- Docker Compose installed
- run `docker-compose up` in the Experiment folder. N.B. this can take a while, as it builds Kaldi and runs several installation steps.
- attach to the container once it is built: `docker attach DistPrevXvecs`
- run the code with `./run.sh`
If you wish to only create the x-vector generator, run `train_models.sh` and then see `local/anon/gen_pseudo_xvecs.py` for its usage to create fake x-vectors. To turn the fake x-vectors into audio, see `local/anon/anonymize_data_dir.sh`, which runs the anonymization on a directory in Kaldi format.