Vocal fingerprints

24 Nov 2009

More secure voice identification systems could stem from an improved algorithm for comparing speech patterns

Your voice could become your next ‘fingerprint’ when confirming your identity in situations that normally involve supplying passwords.


When a person needs to confirm their identity over the phone, they must usually give a password or other personal details known to the second party. But these details can be discovered relatively easily and used to forge a person’s identity.

Speaker recognition technology, which enables a person’s identity to be verified from the sound of his/her voice, is being investigated as a potential means to tackle the growing problem of identity fraud. Changhuai You and colleagues at the Institute for Infocomm Research of A*STAR, Singapore, have now developed a method for comparing two different samples of human speech that improves the accuracy and reliability of this technology.

“When a person speaks, his/her voice not only conveys information in the words they say, but also non-linguistic characteristics determined by the properties of the person’s vocal tract,” explains You. These characteristics include the mixture of different frequencies in a person’s voice and the particular way that this mixture varies as the person talks. Such characteristics are unique to an individual and difficult to mimic perfectly—even by professional voice impressionists—and can, in principle, be used to build a ‘vocal fingerprint’ for that individual. Unfortunately, the complexity of these characteristics makes them much more difficult to compare than conventional fingerprints.
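The frequency mixture mentioned above can be illustrated with a toy sketch. This is not the authors' feature-extraction pipeline (real systems typically use richer features such as mel-frequency cepstral coefficients); it simply shows how the frequency content of a signal can be measured with a Fourier transform:

```python
import numpy as np

# synthetic one-second "voice" signal: a 220 Hz fundamental plus a
# weaker 440 Hz overtone, sampled at 8 kHz
sr = 8000
t = np.arange(sr) / sr
signal = np.sin(2 * np.pi * 220 * t) + 0.5 * np.sin(2 * np.pi * 440 * t)

# magnitude spectrum: how strongly each frequency occurs in the sample
spectrum = np.abs(np.fft.rfft(signal))
freqs = np.fft.rfftfreq(len(signal), d=1 / sr)

dominant = freqs[spectrum.argmax()]  # strongest frequency component
print(dominant)  # 220.0
```

The relative strengths of such components, and how they change over time, are among the speaker-specific characteristics a recognition system can exploit.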

You and colleagues simplified the task by using a Gaussian mixture model (GMM) to represent the human voice mathematically. This involved analyzing how often features such as certain frequencies or changes in volume occur within a sample of speech, and representing the occurrence of these features as a probability distribution. They then approximated this complex distribution as a mixture of Gaussian-shaped distributions, reducing the information contained in a voice sample to only that which is unique to an individual speaker.
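The idea of fitting a mixture of Gaussians to observed feature values can be sketched with a minimal one-dimensional expectation-maximization (EM) loop. This is a toy illustration, not the authors' implementation, and the data are synthetic:

```python
import numpy as np

def fit_gmm_1d(x, k=2, iters=100):
    """Fit a k-component 1-D Gaussian mixture to data x with EM (toy sketch)."""
    # initialise means at spread-out quantiles, with a shared broad variance
    means = np.quantile(x, np.linspace(0.25, 0.75, k))
    variances = np.full(k, x.var())
    weights = np.full(k, 1.0 / k)
    for _ in range(iters):
        # E-step: responsibility of each Gaussian component for each sample
        dens = (weights / np.sqrt(2 * np.pi * variances)
                * np.exp(-0.5 * (x[:, None] - means) ** 2 / variances))
        resp = dens / dens.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights, means and variances from the responsibilities
        nk = resp.sum(axis=0)
        means = (resp * x[:, None]).sum(axis=0) / nk
        variances = (resp * (x[:, None] - means) ** 2).sum(axis=0) / nk
        weights = nk / len(x)
    return weights, means, variances

# toy "speech feature" data: two well-separated clusters of feature values
rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(-4.0, 1.0, 500), rng.normal(4.0, 1.0, 500)])
weights, means, variances = fit_gmm_1d(x)
```

After fitting, the weights, means, and variances of the components form a compact statistical summary of the sample, which is the role the GMM plays in a speaker model.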

The researchers were then faced with deciding how to calculate the similarity of two voice samples represented by mixture models. To solve this, they quantified the difference between the probability distributions of the samples using a measure called the ‘Bhattacharyya distance’, mathematically derived suitable statistics from it, and verified these through experiments. Applying this idea enabled the researchers to construct a speaker recognition algorithm that outperformed the current state-of-the-art method, correctly identifying individual voice samples from a large database.
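For two single univariate Gaussians, the Bhattacharyya distance has a simple closed form; the paper works with full GMM-based representations, so the sketch below is a deliberate simplification of that idea:

```python
import math

def bhattacharyya_gaussian(mu1, var1, mu2, var2):
    """Bhattacharyya distance between two univariate Gaussian distributions."""
    # term penalising a difference in means, scaled by the combined spread
    mean_term = 0.25 * (mu1 - mu2) ** 2 / (var1 + var2)
    # term penalising a difference in variances (zero when var1 == var2)
    var_term = 0.5 * math.log((var1 + var2) / (2.0 * math.sqrt(var1 * var2)))
    return mean_term + var_term

# identical distributions are zero distance apart ...
print(bhattacharyya_gaussian(0.0, 1.0, 0.0, 1.0))  # 0.0
# ... and the distance grows as the distributions separate
print(bhattacharyya_gaussian(0.0, 1.0, 3.0, 1.0))  # 1.125
```

A small distance between the distributions of two voice samples suggests they come from the same speaker; a large distance suggests different speakers.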

In future work, the team intends to apply this approach to identifying the language a speaker is using, an important step towards recognizing the individual words being spoken.

The A*STAR-affiliated authors in this highlight are from the Institute for Infocomm Research.



You, C. H., Lee, K. A. & Li, H. An SVM kernel with GMM-supervector based on the Bhattacharyya distance for speaker recognition. IEEE Signal Processing Letters 16, 49–52 (2009).

This article was made for A*STAR Research by Nature Research Custom Media, part of Springer Nature