Publication

Deep Learning applied to Visual Speech Recognition

datacite.subject.fos: Engineering and Technology::Electrical Engineering, Electronic Engineering, Information Engineering
dc.contributor.advisor: Coelho, Paulo Jorge Simões
dc.contributor.advisor: Cunha, António Manuel Trigueiros da Silva
dc.contributor.author: Santos, Carlos Manuel Simões dos
dc.date.accessioned: 2024-05-14T17:28:27Z
dc.date.available: 2024-05-14T17:28:27Z
dc.date.issued: 2023-11-28
dc.description.abstract: Visual Speech Recognition (VSR), or Automatic Lip-Reading (ALR), the artificial process used to infer visemes, words, or sentences from video inputs, is efficient yet far from being a day-to-day tool. With the evolution of deep learning models and the proliferation of databases (DB), vocabularies increase in quality and quantity. Large DBs feed end-to-end deep learning (DL) models that extract speech solely from the visual recognition of the speaker's lip movements. However, producing a large DB requires large resources, unavailable to the majority of ALR researchers, impairing larger-scale evolution. This dissertation contributes to the development of ALR by diversifying the training data on which DL depends. This includes producing a new DB, in the Portuguese language, capable of supporting state-of-the-art (SOTA) performance. As DL only reaches SOTA performance when trained on a large DB, whose resources are beyond the scope of this dissertation, a knowledge-leveraging method emerges as a necessary subsequent objective. A large DB and a SOTA model are selected and used as templates, from which a smaller DB (LusaPt) is created, comprising 100 phrases by 10 speakers uttering 50 typical Portuguese digits and words, recorded and processed with day-to-day equipment. After pre-training on the SOTA DB, the new model is fine-tuned on the new DB. To validate LusaPt, the performances of the new model and the SOTA model are compared. Results reveal that, when the same video is repeatedly submitted to the same model, the same prediction is obtained. Tests also show a clear increase in the word recognition rate (WRR), from 0% when inferring with the SOTA model without further training on the new DB to over 95% when inferring with the new model. Besides showing a "powerful belief" of the SOTA model in its predictions, this work also validates the new DB and its creation methodology. It reinforces that transfer learning is effective for learning a new language and, therefore, new words. Another contribution is to demonstrate that, with day-to-day equipment and limited human resources, it is possible to enrich the DB corpora and, ultimately, to positively impact the performance and future of Automatic Lip-Reading.
dc.identifier.tid: 203607988
dc.identifier.uri: http://hdl.handle.net/10400.8/9664
dc.language.iso: eng
dc.subject: Automatic Lip-Reading
dc.subject: Deep Learning
dc.subject: Database
dc.subject: Transfer Learning
dc.title: Deep Learning applied to Visual Speech Recognition
dc.type: master thesis
dspace.entity.type: Publication
rcaap.rights: openAccess
rcaap.type: masterThesis
thesis.degree.name: Master's in Electrical Engineering (Mestrado em Engenharia Electrotécnica)
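
The abstract above outlines a transfer-learning pipeline: pre-train a VSR model on a large SOTA database, fine-tune it on the smaller LusaPt DB, and evaluate it by word recognition rate (WRR). Below is a minimal sketch of such a fine-tune-and-evaluate loop, assuming a PyTorch setup; the TinyVSRModel, tensor shapes, and hyperparameters are illustrative assumptions, not the dissertation's actual model or code.

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for a pre-trained visual speech recognition backbone.
# In the dissertation's setting this would be a SOTA model pre-trained on a
# large lip-reading DB; a tiny module keeps the sketch self-contained.
class TinyVSRModel(nn.Module):
    def __init__(self, num_classes: int):
        super().__init__()
        # 3D convolution over (channels, frames, height, width) mouth crops.
        self.frontend = nn.Sequential(
            nn.Conv3d(1, 8, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),
        )
        self.classifier = nn.Linear(8, num_classes)

    def forward(self, x):
        feats = self.frontend(x).flatten(1)
        return self.classifier(feats)

def finetune(model, loader, epochs=3, lr=1e-4):
    """Fine-tune only the classification head on the new (small) vocabulary."""
    for p in model.frontend.parameters():
        p.requires_grad = False  # keep the pre-trained visual features frozen
    opt = torch.optim.Adam(model.classifier.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    model.train()
    for _ in range(epochs):
        for clips, labels in loader:
            opt.zero_grad()
            loss = loss_fn(model(clips), labels)
            loss.backward()
            opt.step()

def word_recognition_rate(model, loader):
    """WRR = fraction of clips whose predicted word matches the label."""
    model.eval()
    correct = total = 0
    with torch.no_grad():
        for clips, labels in loader:
            preds = model(clips).argmax(dim=1)
            correct += (preds == labels).sum().item()
            total += labels.numel()
    return correct / max(total, 1)

if __name__ == "__main__":
    # Synthetic stand-in for LusaPt: 50-word vocabulary, grayscale mouth clips.
    clips = torch.randn(20, 1, 16, 48, 48)   # (N, C, frames, H, W)
    labels = torch.randint(0, 50, (20,))
    loader = torch.utils.data.DataLoader(
        torch.utils.data.TensorDataset(clips, labels), batch_size=4
    )
    model = TinyVSRModel(num_classes=50)  # pre-trained weights would load here
    finetune(model, loader)
    print(f"WRR: {word_recognition_rate(model, loader):.2%}")
```

Freezing the pre-trained visual front-end and retraining only the classification head is one common transfer-learning choice when the target DB is small; fine-tuning more (or all) layers is equally plausible and depends on the data available.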

Files

Original bundle
Name: DeepLearningAppliedtoVisualSpeechRecognition 2_cf.pdf
Size: 2.99 MB
Format: Adobe Portable Document Format
License bundle
Name: license.txt
Size: 1.32 KB
Format: Item-specific license agreed upon to submission