Publication

Segment Level Voice Conversion with Recurrent Neural Networks

dc.contributor.author: Ramos, Miguel Varela
dc.contributor.author: Black, Alan W.
dc.contributor.author: Astudillo, Ramon Fernandez
dc.contributor.author: Trancoso, Isabel
dc.contributor.author: Fonseca, Nuno
dc.date.accessioned: 2025-10-28T14:00:12Z
dc.date.available: 2025-10-28T14:00:12Z
dc.date.issued: 2017-08-20
dc.description.abstract: Voice conversion techniques aim to modify a subject's voice characteristics in order to mimic those of another person. Due to the difference in utterance length between source and target speaker, state-of-the-art voice conversion systems often rely on a frame alignment pre-processing step. This step aligns entire utterances with algorithms such as dynamic time warping (DTW) that introduce errors, hindering system performance. In this paper we present a new technique that avoids the alignment of entire utterances at frame level, while keeping the local context during training. For this purpose, we combine an RNN model with the use of phoneme- or syllable-level information, obtained from a speech recognition system. This system divides the utterances into segments, which can then be grouped into overlapping windows, providing the context needed for the model to learn the temporal dependencies. We show that with this approach, notable improvements can be attained over a state-of-the-art RNN voice conversion system on the CMU ARCTIC database. It is also worth noting that with this technique it is possible to halve the training data size and still outperform the baseline.
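The windowing idea described in the abstract (segments obtained from a speech recognizer, grouped into overlapping windows so the RNN sees local temporal context) can be sketched as follows. This is a minimal illustration, not the paper's implementation; the function name, window size, and stride are assumptions.

```python
# Hypothetical sketch of the segment-windowing step from the abstract:
# utterances are split into phoneme/syllable segments, which are then
# grouped into overlapping windows to give the RNN local context.
# window_size and stride are illustrative choices, not from the paper.

def overlapping_windows(segments, window_size=3, stride=1):
    """Group consecutive segments into overlapping windows of fixed size."""
    windows = []
    for start in range(0, len(segments) - window_size + 1, stride):
        windows.append(segments[start:start + window_size])
    return windows

# Example: five phoneme segments (labels stand in for frame matrices).
segs = ["s1", "s2", "s3", "s4", "s5"]
print(overlapping_windows(segs))
# → [['s1', 's2', 's3'], ['s2', 's3', 's4'], ['s3', 's4', 's5']]
```

With stride 1, consecutive windows share all but one segment, so each segment is seen in several local contexts during training while no full-utterance frame alignment is ever required.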
dc.description.sponsorship: This project was sponsored by the CMU Portugal UIP and supported by national funds through Fundação para a Ciência e a Tecnologia (FCT) with reference UID/CEC/50021/2013 and EU project LAW-TRAIN with reference H2020-EU.3.7.653587.
dc.identifier.citation: Ramos, M.V., Black, A.W., Astudillo, R.F., Trancoso, I., Fonseca, N. (2017) Segment Level Voice Conversion with Recurrent Neural Networks. Proc. Interspeech 2017, 3414-3418, doi: 10.21437/Interspeech.2017-1538
dc.identifier.doi: 10.21437/interspeech.2017-1538
dc.identifier.uri: http://hdl.handle.net/10400.8/14414
dc.language.iso: eng
dc.peerreviewed: yes
dc.publisher: ISCA
dc.relation: UID/CEC/50021/2013
dc.relation: H2020-EU.3.7.653587
dc.relation.ispartof: Interspeech 2017
dc.rights.uri: N/A
dc.subject: Voice conversion
dc.subject: Recurrent neural networks
dc.subject: Deep learning
dc.subject: Spectral mapping
dc.title: Segment Level Voice Conversion with Recurrent Neural Networks
dc.type: conference paper
dspace.entity.type: Publication
oaire.citation.conferenceDate: 2017-08-20
oaire.citation.conferencePlace: Stockholm, Sweden
oaire.citation.endPage: 3418
oaire.citation.startPage: 3414
oaire.citation.title: ISCA - International Speech Communication Association
oaire.version: http://purl.org/coar/version/c_970fb48d4fbd8a85
person.familyName: Fonseca
person.givenName: Nuno
person.identifier.orcid: 0000-0002-0769-5306
relation.isAuthorOfPublication: 5203623c-18fa-4e28-b172-064dd133f026
relation.isAuthorOfPublication.latestForDiscovery: 5203623c-18fa-4e28-b172-064dd133f026

Files

Original bundle
Name: ramos17_interspeech.pdf
Size: 273.82 KB
Format: Adobe Portable Document Format

License bundle
Name: license.txt
Size: 1.32 KB
Format: Item-specific license agreed upon to submission