Automatic speaker recognition: modelling, feature extraction and effects of clinical environment

Memon, S 2010, Automatic speaker recognition: modelling, feature extraction and effects of clinical environment, Doctor of Philosophy (PhD), Electrical and Computer Engineering, RMIT University.


Document type: Thesis
Collection: Theses

Attached Files
Name Description MIMEType Size
Memon.pdf Thesis application/pdf 1.86MB
Title Automatic speaker recognition: modelling, feature extraction and effects of clinical environment
Author(s) Memon, S
Year 2010
Abstract Speaker recognition is the task of establishing the identity of an individual based on his or her voice. It has significant potential as a convenient biometric method for telephony applications and does not require sophisticated or dedicated hardware.

The speaker recognition task is typically performed in two stages of signal processing: training and testing. The training process calculates speaker-specific feature parameters from the speech. The features are used to generate statistical models of different speakers. In the testing phase, speech samples from unknown speakers are compared with the models and classified.

Current state-of-the-art speaker recognition systems use the Gaussian mixture model (GMM) technique in combination with the Expectation Maximization (EM) algorithm to build the speaker models. The most frequently used features are the Mel Frequency Cepstral Coefficients (MFCC).
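The classical GMM/EM approach described above can be illustrated with a minimal sketch. It is not the thesis implementation: it assumes diagonal covariances, random initialisation from the data, and feature vectors that are already extracted (any array of shape frames x dimensions stands in for MFCCs). The function names `fit_gmm_em`, `avg_log_likelihood` and `identify_speaker` are hypothetical.

```python
import numpy as np

def log_gauss_diag(X, means, variances):
    """Log density of every row of X under every diagonal Gaussian, shape (N, K)."""
    diff = X[:, None, :] - means[None, :, :]                      # (N, K, D)
    mahal = np.sum(diff ** 2 / variances[None, :, :], axis=2)     # (N, K)
    log_det = np.sum(np.log(2.0 * np.pi * variances), axis=1)     # (K,)
    return -0.5 * (mahal + log_det[None, :])

def fit_gmm_em(X, n_components, n_iter=50, seed=0, var_floor=1e-3):
    """Fit a diagonal-covariance GMM to X (N, D) with plain EM (assumed setup)."""
    rng = np.random.default_rng(seed)
    N, _ = X.shape
    # Assumption: initial means are randomly chosen training frames.
    means = X[rng.choice(N, n_components, replace=False)].copy()
    variances = np.tile(X.var(axis=0) + var_floor, (n_components, 1))
    weights = np.full(n_components, 1.0 / n_components)
    for _ in range(n_iter):
        # E-step: posterior probability of each component for each frame.
        log_p = np.log(weights)[None, :] + log_gauss_diag(X, means, variances)
        log_norm = np.logaddexp.reduce(log_p, axis=1, keepdims=True)
        resp = np.exp(log_p - log_norm)                           # (N, K)
        # M-step: re-estimate weights, means and variances.
        Nk = resp.sum(axis=0) + 1e-10
        weights = Nk / Nk.sum()
        means = (resp.T @ X) / Nk[:, None]
        second_moment = (resp.T @ X ** 2) / Nk[:, None]
        variances = np.maximum(second_moment - means ** 2, var_floor)
    return weights, means, variances

def avg_log_likelihood(X, weights, means, variances):
    """Average per-frame log-likelihood of X under a fitted GMM."""
    log_p = np.log(weights)[None, :] + log_gauss_diag(X, means, variances)
    return float(np.mean(np.logaddexp.reduce(log_p, axis=1)))

def identify_speaker(X, models):
    """Testing phase: return the name of the model that best explains X."""
    return max(models, key=lambda name: avg_log_likelihood(X, *models[name]))
```

In this sketch, training amounts to calling `fit_gmm_em` once per enrolled speaker and storing the result in a dictionary keyed by speaker name; testing passes the features of an unknown utterance to `identify_speaker`.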

This thesis investigated areas of possible improvement in the field of speaker recognition. The identified drawbacks of current speaker recognition systems included slow convergence rates of the modelling techniques and the sensitivity of the features to changes caused by aging of speakers, use of alcohol and drugs, changing health conditions and mental state.

The thesis proposed a new method of deriving the Gaussian mixture model (GMM) parameters called the EM-ITVQ algorithm. The EM-ITVQ showed a significant improvement in equal error rates and higher convergence rates when compared to the classical GMM based on the expectation maximization (EM) method.
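For readers unfamiliar with the equal error rate (EER) used as the comparison metric here, the following is a minimal sketch of how it is commonly estimated from verification scores. It does not reproduce the thesis evaluation: the function name `equal_error_rate` is hypothetical, and the sketch assumes that higher scores mean a better match and that a trial is accepted when its score is at or above the threshold.

```python
import numpy as np

def equal_error_rate(genuine_scores, impostor_scores):
    """Estimate the EER and the threshold at which it occurs.

    Assumptions: higher score = more likely the claimed speaker;
    a trial is accepted when score >= threshold.
    """
    genuine = np.asarray(genuine_scores, dtype=float)
    impostor = np.asarray(impostor_scores, dtype=float)
    # Candidate thresholds: every observed score.
    thresholds = np.unique(np.concatenate([genuine, impostor]))
    # False acceptance rate: impostor trials that are accepted.
    far = np.array([(impostor >= t).mean() for t in thresholds])
    # False rejection rate: genuine trials that are rejected.
    frr = np.array([(genuine < t).mean() for t in thresholds])
    # The EER lies where the two error rates are closest to equal.
    i = int(np.argmin(np.abs(far - frr)))
    return (far[i] + frr[i]) / 2.0, thresholds[i]
```

A lower EER indicates better verification performance; with finite score sets the two error curves rarely cross exactly, so the sketch averages them at the closest point.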

It was demonstrated that features based on the nonlinear model of speech production (TEO based features) provided better performance compared to the conventional MFCC features.
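The abstract does not specify how the TEO based features were constructed. As background only, the sketch below shows the standard discrete Teager energy operator, Psi[x(n)] = x(n)^2 - x(n-1) x(n+1), on which such features are generally built. The function name `teager_energy` is hypothetical, and any further processing such as band filtering or framing is left out.

```python
import numpy as np

def teager_energy(x):
    """Discrete Teager energy operator: psi[n] = x[n]^2 - x[n-1] * x[n+1].

    Returns an array two samples shorter than x, because the first and
    last samples have no neighbour on one side.
    """
    x = np.asarray(x, dtype=float)
    return x[1:-1] ** 2 - x[:-2] * x[2:]

# Illustration: for a pure sinusoid A*cos(omega*n + phi) the operator
# gives the constant A^2 * sin(omega)^2, so it tracks both the amplitude
# and the frequency of the oscillation.
n = np.arange(200)
amplitude, omega = 2.0, 0.3
signal = amplitude * np.cos(omega * n + 0.1)
psi = teager_energy(signal)
```

Because the output depends on frequency as well as amplitude, the operator responds to changes in the way the signal is produced that a plain squared-amplitude energy measure would not capture.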

For the first time, the effect of clinical depression on speaker verification rates was tested. It was demonstrated that speaker verification results deteriorate if the speakers are clinically depressed. The deterioration was demonstrated using conventional (MFCC) features.

The thesis also showed that replacing the MFCC features with features based on the nonlinear model of speech production (TEO based features) can reduce the detrimental effect of clinical depression on speaker verification rates.
Degree Doctor of Philosophy (PhD)
Institution RMIT University
School, Department or Centre Electrical and Computer Engineering
Keyword(s) Speaker Recognition
Gaussian Mixture Models
Expectation Maximization
Information Theory
Feature extraction
Clinical Environment
Access Statistics: 322 Abstract Views, 3195 File Downloads
Created: Fri, 21 Oct 2011, 10:35:14 EST by Guy Aron
© 2014 RMIT Research Repository