Publication

Deep Learning applied to Visual Speech Recognition

datacite.subject.fos: Engineering and Technology::Electrical Engineering, Electronic Engineering, Information Engineering
dc.contributor.advisor: Coelho, Paulo Jorge Simões
dc.contributor.advisor: Cunha, António Manuel Trigueiros da Silva
dc.contributor.author: Santos, Carlos Manuel Simões dos
dc.date.accessioned: 2024-05-14T17:28:27Z
dc.date.available: 2024-05-14T17:28:27Z
dc.date.issued: 2023-11-28
dc.description.abstract: Visual Speech Recognition (VSR), or Automatic Lip-Reading (ALR), the artificial process used to infer visemes, words, or sentences from video inputs, is efficient yet far from being a day-to-day tool. With the evolution of deep learning models and the proliferation of databases (DB), vocabularies increase in quality and quantity. Large DBs feed end-to-end deep learning (DL) models that extract speech solely from the visual recognition of the speaker's lip movements. However, producing a large DB requires large resources, unavailable to the majority of ALR researchers, impairing larger-scale evolution. This dissertation contributes to the development of ALR by diversifying the training data on which DL depends. This includes producing a new DB, in the Portuguese language, capable of supporting state-of-the-art (SOTA) performance. As DL only reaches SOTA performance when trained on a large DB, whose resources are beyond the scope of this dissertation, a knowledge-leveraging method emerges as a necessary subsequent objective. A large DB and a SOTA model are selected and used as templates, from which a smaller DB (LusaPt) is created, comprising 100 phrases by 10 speakers uttering 50 typical Portuguese digits and words, recorded and processed with day-to-day equipment. After pre-training on the SOTA DB, the new model is fine-tuned on the new DB. To validate LusaPt, the performances of the new model and the SOTA model are compared. Results reveal that, when the same video is repeatedly submitted to the same model, the same prediction is obtained. Tests also show a clear increase in the word recognition rate (WRR), from 0% when inferring with the SOTA model without further training on the new DB to over 95% when inferring with the new model. Besides showing a "powerful belief" of the SOTA model in its predictions, this work also validates the new DB and its creation methodology. It reinforces that transfer learning is effective for learning a new language and, therefore, new words. Another contribution is to demonstrate that, with day-to-day equipment and limited human resources, it is possible to enrich the DB corpora and, ultimately, to positively impact the performance and future of Automatic Lip-Reading.
dc.identifier.tid: 203607988
dc.identifier.uri: http://hdl.handle.net/10400.8/9664
dc.language.iso: eng
dc.subject: Automatic Lip-Reading
dc.subject: Deep Learning
dc.subject: Database
dc.subject: Transfer Learning
dc.title: Deep Learning applied to Visual Speech Recognition
dc.type: master thesis
dspace.entity.type: Publication
rcaap.rights: openAccess
rcaap.type: masterThesis
thesis.degree.name: Master's in Electrical Engineering (Mestrado em Engenharia Electrotécnica)
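
The abstract above outlines a transfer-learning pipeline: pre-train a VSR model on a large SOTA database, fine-tune it on the smaller LusaPt DB, and evaluate it by word recognition rate (WRR). Below is a minimal sketch of such a fine-tune-and-evaluate loop, assuming a PyTorch setup; the TinyVSRModel, tensor shapes, and hyperparameters are illustrative assumptions, not the dissertation's actual model or code.

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for a pre-trained visual speech recognition backbone.
# In the dissertation's setting this would be a SOTA model pre-trained on a
# large lip-reading DB; a tiny module keeps the sketch self-contained.
class TinyVSRModel(nn.Module):
    def __init__(self, num_classes: int):
        super().__init__()
        # 3D convolution over (channels, frames, height, width) mouth crops.
        self.frontend = nn.Sequential(
            nn.Conv3d(1, 8, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),
        )
        self.classifier = nn.Linear(8, num_classes)

    def forward(self, x):
        feats = self.frontend(x).flatten(1)
        return self.classifier(feats)

def finetune(model, loader, epochs=3, lr=1e-4):
    """Fine-tune only the classification head on the new (small) vocabulary."""
    for p in model.frontend.parameters():
        p.requires_grad = False  # keep the pre-trained visual features frozen
    opt = torch.optim.Adam(model.classifier.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    model.train()
    for _ in range(epochs):
        for clips, labels in loader:
            opt.zero_grad()
            loss = loss_fn(model(clips), labels)
            loss.backward()
            opt.step()

def word_recognition_rate(model, loader):
    """WRR = fraction of clips whose predicted word matches the label."""
    model.eval()
    correct = total = 0
    with torch.no_grad():
        for clips, labels in loader:
            preds = model(clips).argmax(dim=1)
            correct += (preds == labels).sum().item()
            total += labels.numel()
    return correct / max(total, 1)

if __name__ == "__main__":
    # Synthetic stand-in for LusaPt: 50-word vocabulary, grayscale mouth clips.
    clips = torch.randn(20, 1, 16, 48, 48)   # (N, C, frames, H, W)
    labels = torch.randint(0, 50, (20,))
    loader = torch.utils.data.DataLoader(
        torch.utils.data.TensorDataset(clips, labels), batch_size=4
    )
    model = TinyVSRModel(num_classes=50)  # pre-trained weights would load here
    finetune(model, loader)
    print(f"WRR: {word_recognition_rate(model, loader):.2%}")
```

Freezing the pre-trained visual front-end and retraining only the classification head is one common transfer-learning choice when the target DB is small; fine-tuning more (or all) layers is equally plausible and depends on the data available.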

Files

Original bundle
Name: DeepLearningAppliedtoVisualSpeechRecognition 2_cf.pdf
Size: 2.99 MB
Format: Adobe Portable Document Format
License bundle
Name: license.txt
Size: 1.32 KB
Format: Item-specific license agreed upon to submission