Robust visual speech recognition using optical flow analysis and rotation invariant features

Shaikh, A 2011, Robust visual speech recognition using optical flow analysis and rotation invariant features, Doctor of Philosophy (PhD), Electrical and Computer Engineering, RMIT University.

Document type: Thesis
Collection: Theses

Attached Files
Name Description MIMEType Size
Shaikh.pdf Thesis Click to show the corresponding preview/stream application/pdf;... 1.68MB
Title Robust visual speech recognition using optical flow analysis and rotation invariant features
Author(s) Shaikh, A
Year 2011
Abstract The focus of this thesis is to develop computer vision algorithms for visual speech recognition system to identify the visemes. The majority of existing speech recognition systems is based on audio-visual signals and has been developed for speech enhancement and is prone to acoustic noise. Considering this problem, aim of this research is to investigate and develop a visual only speech recognition system which should be suitable for noisy environments. Potential applications of such a system include the lip-reading mobile phones, human computer interface (HCI) for mobility-impaired users, robotics, surveillance, improvement of speech based computer control in a noisy environment and for the rehabilitation of the persons who have undergone a laryngectomy surgery.

In the literature, there are several models and algorithms available for visual feature extraction. These features are extracted from static mouth images and characterized as appearance and shape based features. However, these methods rarely incorporate the time dependent information of mouth dynamics. This dissertation presents two optical flow based approaches of visual feature extraction, which capture the mouth motions in an image sequence. The motivation for using motion features is, because the human perception of lip-reading is concerned with the temporal dynamics of mouth motion.

The first approach is based on extraction of features from the optical flow vertical component. The optical flow vertical component is decomposed into multiple non-overlapping fixed scale blocks and statistical features of each block are computed for successive video frames of an utterance. To overcome the issue of large variation in speed of speech, each utterance is normalized using simple linear interpolation method.

In the second approach, four directional motion templates based on optical flow are developed, each representing the consolidated motion information in an utterance in four directions (i.e.,up, down, left and right). This approach is an evolution of a view based approach known as motion history image (MHI). One of the main issues with the MHI method is its motion overwriting problem because of self-occlusion. DMHIs seem to solve this issue of overwriting. Two types of image descriptors, Zernike moments and Hu moments are used to represent each image of DMHIs. A support vector machine (SVM) classifier was used to classify the features obtained from the optical flow vertical component, Zernike and Hu moments separately. For identification of visemes, a multiclass SVM approach was employed.

A video speech corpus of seven subjects was used for evaluating the efficiency of the proposed methods for lip-reading. The experimental results demonstrate the promising performance of the optical flow based mouth movement representations. Performance comparison between DMHI and MHI based on Zernike moments, shows that the DMHI technique outperforms the MHI technique.

A video based adhoc temporal segmentation method is proposed in the thesis for isolated utterances. It has been used to detect the start and the end frame of an utterance from an image sequence. The technique is based on a pair-wise pixel comparison method. The efficiency of the proposed technique was tested on the available data set with short pauses between each utterance.
Degree Doctor of Philosophy (PhD)
Institution RMIT University
School, Department or Centre Electrical and Computer Engineering
Keyword(s) Lip-reading
visual speech recognition
Speech reading
visual feature extraction
temporal speech segmentation
classification of visual features
Support Vector Machines
Version Filter Type
Access Statistics: 287 Abstract Views, 788 File Downloads  -  Detailed Statistics
Created: Thu, 13 Sep 2012, 15:30:41 EST by Brett Fenton
© 2014 RMIT Research Repository • Powered by Fez SoftwareContact us