Deep Learning applied to Visual Speech Recognition

Santos, Carlos Manuel Simões dos

http://hdl.handle.net/10400.8/9664

Use this identifier to cite or link to this record.

Name:	Description:	Size:	Format:
DeepLearningAppliedtoVisualSpeechRecognition 2_cf.pdf		2.99 MB	Adobe PDF

Authors

Santos, Carlos Manuel Simões dos

Advisor(s)

Coelho, Paulo Jorge Simões

Cunha, António Manuel Trigueiros da Silva

Abstract(s)

Visual Speech Recognition (VSR), or Automatic Lip-Reading (ALR), the artificial process used to infer visemes, words, or sentences from video input, is efficient yet far from being a day-to-day tool. With the evolution of deep learning models and the proliferation of databases (DB), vocabularies are increasing in quality and quantity. Large DBs feed end-to-end deep learning (DL) models that extract speech based solely on visual recognition of the speaker’s lip movements. However, producing a large DB requires resources that are unavailable to most ALR researchers, which hampers progress on a larger scale. This dissertation contributes to the development of ALR by diversifying the training data on which DL depends. This includes producing a new DB, in the Portuguese language, capable of state-of-the-art (SOTA) performance. Since DL only reaches SOTA performance when trained on a large DB, the resources for which are beyond the scope of this dissertation, a knowledge-leveraging method emerges as a necessary subsequent objective. A large DB and a SOTA model are selected and used as templates, from which a smaller DB (LusaPt) is created, comprising 100 phrases by 10 speakers, uttering 50 typical Portuguese digits and words, recorded and processed with day-to-day equipment. After pre-training on the SOTA DB, the new model is fine-tuned on the new DB. To validate LusaPt, the performance of the new model is compared with that of the SOTA model. Results reveal that, if the same video is repeatedly submitted to the same model, the same prediction is obtained. Tests also show a clear increase in the word recognition rate (WRR), from 0% when inferring with the SOTA model with no further training on the new DB, to over 95% when inferring with the new model. Besides showing a “powerful belief” of the SOTA model in its predictions, this work also validates the new DB and its creation methodology. It reinforces that the transfer learning process is efficient in learning a new language, and therefore new words. Another contribution is to demonstrate that, with day-to-day equipment and limited human resources, it is possible to enrich the DB corpora and, ultimately, to positively impact the performance and future of Automatic Lip-Reading.
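
The word recognition rate (WRR) reported above can be illustrated with a minimal sketch. It assumes WRR is the percentage of utterances whose predicted word exactly matches the reference label; the abstract does not state the precise definition used in the dissertation, so this is an assumption. The function name and the sample labels are hypothetical and are not taken from LusaPt.

```python
def word_recognition_rate(predictions, references):
    """Return the percentage of utterances whose predicted word matches the reference.

    Assumption: one word per utterance, compared case-insensitively after
    trimming whitespace. The dissertation may define WRR differently.
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        raise ValueError("at least one utterance is required")
    correct = sum(
        p.strip().lower() == r.strip().lower()
        for p, r in zip(predictions, references)
    )
    return 100.0 * correct / len(references)


# Hypothetical labels, for illustration only (not real results).
references = ["um", "dois", "sim", "obrigado"]
predictions = ["um", "dois", "sim", "nao"]
wrr = word_recognition_rate(predictions, references)
print(f"WRR = {wrr:.1f}%")
```

Under this definition, a model that has never seen the new vocabulary would score 0% whenever none of its predicted words appear among the reference labels.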

Keywords

Automatic Lip-Reading; Deep Learning; Database; Transfer Learning

URI

http://hdl.handle.net/10400.8/9664

Collections

ESTG - Mestrado em Engenharia Eletrotécnica - Energia e Automação