Video analysis of mouth movement using motion templates for computer-based lip-reading

Yau, W 2008, Video analysis of mouth movement using motion templates for computer-based lip-reading, Doctor of Philosophy (PhD), Electrical and Computer Engineering, RMIT University.


Document type: Thesis
Collection: Theses

Attached Files
Name Description MIMEType Size
Yau.pdf Thesis application/pdf 2.59MB
Title Video analysis of mouth movement using motion templates for computer-based lip-reading
Author(s) Yau, W
Year 2008
Abstract This thesis presents a novel lip-reading approach to classifying utterances from video data, without evaluating voice signals. This work addresses two important issues which are
• the efficient representation of mouth movement for visual speech recognition
• the temporal segmentation of utterances from video.

The first part of the thesis describes a robust movement-based technique used to identify mouth movement patterns while uttering phonemes. This method temporally integrates the video data of each phoneme into a 2-D grayscale image named as a motion template (MT). This is a view-based approach that implicitly encodes the temporal component of an image sequence into a scalar-valued MT.

The data size was reduced by extracting image descriptors such as Zernike moments (ZM) and discrete cosine transform (DCT) coefficients from MT. Support vector machine (SVM) and hidden Markov model (HMM) were used to classify the feature descriptors. A video speech corpus of 2800 utterances was collected for evaluating the efficacy of MT for lip-reading. The experimental results demonstrate the promising performance of MT in mouth movement representation. The advantages and limitations of MT for visual speech recognition were identified and validated through experiments.

A comparison between ZM and DCT features indicates that the accuracy of classification for both methods is very comparable when there is no relative motion between the camera and the mouth. Nevertheless, ZM is resilient to rotation of the camera and continues to give good results despite rotation but DCT is sensitive to rotation. DCT features are demonstrated to have better tolerance to image noise than ZM. The results also demonstrate a slight improvement of 5% using SVM as compared to HMM.

The second part of this thesis describes a video-based, temporal segmentation framework to detect key frames corresponding to the start and stop of utterances from an image sequence, without using the acoustic signals. This segmentation technique integrates mouth movement and appearance information. The efficacy of this technique was tested through experimental evaluation and satisfactory performance was achieved. This segmentation method has been demonstrated to perform efficiently for utterances separated with short pauses.

Potential applications for lip-reading technologies include human computer interface (HCI) for mobility-impaired users, defense applications that require voice-less communication, lip-reading mobile phones, in-vehicle systems, and improvement of speech-based computer control in noisy environments.
Degree Doctor of Philosophy (PhD)
Institution RMIT University
School, Department or Centre Electrical and Computer Engineering
Keyword(s) Deaf -- Means of communication -- Technological innovations
Lipreading -- Data processing
Versions
Version Filter Type
Access Statistics: 426 Abstract Views, 1290 File Downloads  -  Detailed Statistics
Created: Mon, 29 Nov 2010, 16:09:00 EST by Catalyst Administrator
© 2014 RMIT Research Repository • Powered by Fez SoftwareContact us