Feature Extraction for Audiovisual Speech Recognition

Audiovisual speech recognition (AVSR) provides quality speech recognition in noisy environments. This research investigates the visual aspect of AVSR. First, video from an AVSR database was converted to images. Talkers in the video were recorded saying digits, sentences, and phone numbers in an automobile under varying conditions such as speed. Selected digits were used in the first experiment, and the digits 0 through 10 were used in subsequent experiments. After image conversion, the mouth region was extracted from each image, and feature mean normalization was used to compensate for lighting variations. The mouth subimage was then compressed and optimized by a discrete cosine transform and linear discriminant analysis. Finally, Euclidean distance was used to determine which training digit was closest to the digit being recognized. In the five experiments performed, accuracy ranged from 0 to 28.57%. Recognition accuracy neither improved nor degraded significantly when the data was tested against training data recorded at a different angle. Ambiguous face images prevent reliable speech recognition that depends on visual features alone. Therefore, visual features are most beneficial when used to enhance an audio speech recognition system.
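The feature pipeline described in the abstract (mean normalization, discrete cosine transform, nearest match by Euclidean distance) can be illustrated with a minimal sketch. This is not the author's implementation: the function names, the 4x4 toy patches, the labels, and the `keep` parameter are all hypothetical, and the linear discriminant analysis step is left out to keep the example short and dependency-free.

```python
import math

def mean_normalize(image):
    """Subtract the mean pixel value so a uniform brightness shift cancels out."""
    pixels = [p for row in image for p in row]
    mean = sum(pixels) / len(pixels)
    return [[p - mean for p in row] for row in image]

def dct_1d(values):
    """Orthonormal DCT-II of a 1-D sequence."""
    n = len(values)
    out = []
    for k in range(n):
        total = sum(v * math.cos(math.pi * (i + 0.5) * k / n)
                    for i, v in enumerate(values))
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(scale * total)
    return out

def dct_2d(image):
    """2-D DCT: transform every row, then every column of the result."""
    rows = [dct_1d(row) for row in image]
    cols = [dct_1d(list(col)) for col in zip(*rows)]
    return [list(row) for row in zip(*cols)]  # transpose back to [vertical][horizontal]

def features(image, keep=2):
    """Keep only the keep x keep lowest-frequency coefficients (the compression step)."""
    coeffs = dct_2d(mean_normalize(image))
    return [coeffs[u][v] for u in range(keep) for v in range(keep)]

def classify(test_image, templates, keep=2):
    """Return the label of the template whose feature vector is nearest (Euclidean)."""
    test_vec = features(test_image, keep)
    return min(templates,
               key=lambda label: math.dist(test_vec, features(templates[label], keep)))

# Hypothetical 4x4 "mouth" patches standing in for two training digits.
templates = {
    "digit_a": [[200] * 4, [200] * 4, [50] * 4, [50] * 4],  # varies top to bottom
    "digit_b": [[200, 200, 50, 50] for _ in range(4)],      # varies left to right
}
# The same pattern as digit_a, but 30 gray levels brighter (a lighting change).
brighter_a = [[p + 30 for p in row] for row in templates["digit_a"]]
predicted = classify(brighter_a, templates)
```

In this toy case the brighter patch is still matched to `digit_a`, because mean normalization removes the constant brightness offset before the transform. Real mouth images differ in far more than brightness, which is consistent with the low accuracies reported above.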
Author: 
Kimberly Wright
School: 
Southern University and A&M College at Baton Rouge
Department: 
Electrical Engineering
Research Advisor: 
Mark Hasegawa-Johnson
Department of Research Advisor: 
Electrical and Computer Engineering
Year of Publication: 
2003
The Graduate College at the University of Illinois Urbana-Champaign 801 South Wright Street 204 Coble Hall, MC-322 Champaign, IL 61820-6210 Phone: (217) 333-0035 Fax: (217) 333-8019