Automatic Speaker Verification (ASV) enables high-security applications
like user authentication or criminal investigation. However, ASV can
be subjected to malicious attacks, which could compromise that security.
The ASV literature mainly studies spoofing (a.k.a. impersonation) attacks
such as voice replay, synthesis, or conversion. Meanwhile, other kinds
of attacks, known as adversarial attacks, have become a threat to all
kinds of machine learning systems. Adversarial attacks introduce an
imperceptible perturbation in the input signal that radically changes
the behavior of the system. These attacks have been intensively studied
in the image domain but less in the speech domain.
In this work, we investigate
the vulnerability of state-of-the-art ASV systems to adversarial attacks.
We consider a threat model consisting of adding perturbation noise
to the test waveform to alter the ASV decision. We also discuss the
methodology and metrics to benchmark adversarial attacks and defenses
in ASV. We evaluated three x-vector architectures, which were among the
best-performing systems in recent ASV evaluations, against fast gradient
sign method (FGSM) and Carlini-Wagner attacks. All networks were highly vulnerable in
the white-box attack scenario, even for high SNR (30–60 dB).
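To make the threat model concrete, below is a minimal FGSM-style sketch, assuming a PyTorch scorer; ToyASVScorer and fgsm_attack are hypothetical stand-ins for illustration only, not the x-vector systems evaluated in this work, and the cosine scoring and eps value are assumptions. The sketch takes one signed-gradient step on the raw test waveform and reports the resulting SNR.

# Minimal sketch of the threat model: perturb the test waveform so the ASV
# score for an enrollment/test pair moves toward the attacker's goal.
# ToyASVScorer is a hypothetical stand-in for a real x-vector scorer.
import torch
import torch.nn as nn

class ToyASVScorer(nn.Module):
    """Hypothetical scorer: embeds waveforms and returns a cosine similarity score."""
    def __init__(self, emb_dim=64):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv1d(1, 16, 9, stride=4), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1), nn.Flatten(),
            nn.Linear(16, emb_dim))

    def forward(self, enroll_wave, test_wave):
        e = nn.functional.normalize(self.encoder(enroll_wave), dim=-1)
        t = nn.functional.normalize(self.encoder(test_wave), dim=-1)
        return (e * t).sum(dim=-1)  # cosine score, higher = same speaker

def fgsm_attack(scorer, enroll_wave, test_wave, target_same_speaker, eps=1e-3):
    """One-step FGSM on the raw test waveform.
    target_same_speaker=True pushes the score up (impersonation),
    False pushes it down (evasion)."""
    test_wave = test_wave.detach().clone().requires_grad_(True)
    score = scorer(enroll_wave, test_wave)
    loss = -score if target_same_speaker else score
    loss.sum().backward()
    delta = eps * test_wave.grad.sign()          # signed-gradient perturbation
    adv_wave = (test_wave + delta).detach()
    # SNR of the clean signal relative to the added perturbation, in dB
    snr_db = 10 * torch.log10(test_wave.detach().pow(2).sum() / delta.pow(2).sum())
    return adv_wave, snr_db.item()

# Usage on random 1-second, 16 kHz waveforms (batch of 1, single channel)
scorer = ToyASVScorer()
enroll = torch.randn(1, 1, 16000)
test = torch.randn(1, 1, 16000)
adv, snr = fgsm_attack(scorer, enroll, test, target_same_speaker=False)
print(f"clean score {scorer(enroll, test).item():.3f}, "
      f"adversarial score {scorer(enroll, adv).item():.3f}, SNR {snr:.1f} dB")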
Furthermore, attacks generated on smaller white-box networks transferred
successfully to a larger black-box network.
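The transfer setting can be sketched in the same toy fashion, assuming the attacker can only query the larger model for scores; this reuses the hypothetical ToyASVScorer and fgsm_attack from the sketch above and is an illustration, not the actual systems attacked in this work.

# Craft the perturbation on a small white-box surrogate, then simply
# replay it against a larger model that is only queried for scores.
surrogate = ToyASVScorer(emb_dim=32)    # white-box: gradients available
black_box = ToyASVScorer(emb_dim=128)   # black-box: scores only

adv, snr = fgsm_attack(surrogate, enroll, test, target_same_speaker=False)
with torch.no_grad():                   # no gradients are taken through the black box
    clean_score = black_box(enroll, test).item()
    adv_score = black_box(enroll, adv).item()
print(f"black-box score: clean {clean_score:.3f} vs adversarial {adv_score:.3f} "
      f"(SNR {snr:.1f} dB)")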