Search Results
Now showing 1 - 5 of 5
- AES White Paper: Best Practices in Network Audio
  Publication · Bouillot, Nicolas; Cohen, Elizabeth; Cooperstock, Jeremy R.; Floros, Andreas; Fonseca, Nuno; Foss, Richard; Goodman, Michael; Grant, John; Gross, Kevin; Harris, Steven; Harshbarger, Brent; Heyraud, Joffrey; Jonsson, Lars; Narus, John; Page, Michael; Snook, Tom; Tanaka, Atau; Trieger, Justin; Zanghieri, Umberto
  Analog audio needs a separate physical circuit for each channel: each microphone in a studio or on a stage, for example, must have its own circuit back to the mixer, and routing of the signals is inflexible. Digital audio is frequently wired in a similar way to analog. Although several channels can share a single physical circuit (e.g., up to 64 with AES10), reducing the number of cores needed in a cable, routing of signals is still inflexible, and any change to the equipment in a location is liable to require new cabling. Networks allow much more flexibility: any piece of equipment plugged into the network is able to communicate with any other. However, installers of audio networks need to be aware of a number of issues that affect audio signals but are not important for data networks and are not addressed by current IT networking technologies such as IP. This white paper examines these issues and provides guidance to installers and users that can help them build successful networked systems.
- Impulse response upmixing using particle systems
  Publication · Fonseca, Nuno
  With the increase in the computational power of DSPs and CPUs, impulse responses (IRs) and the convolution process are becoming a very popular approach to recreating audio effects such as reverb. But although many IR repositories exist, most IRs cover only mono or stereo. This paper presents an approach to impulse response upmixing using particle systems. Through a reverse-engineering process, a particle system is created that is capable of reproducing the original impulse response. By re-rendering the obtained particle system with virtual microphones, an upmixing result can be obtained. Depending on the type of virtual microphone, several different output formats can be supported, ranging from stereo to surround, and including binaural support, Ambisonics, or even custom speaker scenarios (VBAP). (A sketch of the re-rendering step follows the results list.)
- Singing voice resynthesis using vocal sound libraries
  Publication · Fonseca, Nuno; Ferreira, Aníbal
  Although resynthesis may seem a simple analysis/synthesis process, it is a quite complex task, all the more so when it comes to recreating a singing voice. This paper presents a system whose goal is to start with an original audio stream of someone singing and recreate the same performance (melody, phonetics, dynamics) using an internal vocal sound library (choir or solo voice). By extracting dynamics and pitch information, and looking for phonetic similarities between the original audio frames and the frames of the sound library, a completely new audio stream is created. The obtained audio results, although not perfect (mainly due to the existence of audio artifacts), show that this technological approach may become an extremely powerful audio tool. (A sketch of the frame-matching step follows the results list.)
- Concatenative singing voice resynthesis
  Publication · Fonseca, Nuno; Ferreira, Aníbal; Rocha, Ana Paula
  The concept of capturing the sound of “something” for later replication is not new, and it is used in many synthesizers. But capturing sounds and using them as an audio effect is less common. This paper presents an approach to the resynthesis of a singing voice, based on concatenative techniques, that uses pre-recorded audio material as a high-level semantic audio effect, replacing an original audio recording with the sound of a different singer while trying to keep the same musical/phonetic performance. (A sketch of the unit-concatenation step follows the results list.)
- Segment Level Voice Conversion with Recurrent Neural Networks
  Publication · Ramos, Miguel Varela; Black, Alan W.; Astudillo, Ramon Fernandez; Trancoso, Isabel; Fonseca, Nuno
  Voice conversion techniques aim to modify a subject’s voice characteristics in order to mimic those of another person. Because source and target utterances differ in length, state-of-the-art voice conversion systems often rely on a frame-alignment pre-processing step. This step aligns entire utterances with algorithms such as dynamic time warping (DTW) that introduce errors, hindering system performance. In this paper we present a new technique that avoids the alignment of entire utterances at the frame level while keeping the local context during training. For this purpose, we combine an RNN model with phoneme- or syllable-level information obtained from a speech recognition system. The recognizer splits each utterance into segments, which can then be grouped into overlapping windows, providing the context the model needs to learn temporal dependencies. We show that with this approach notable improvements can be attained over a state-of-the-art RNN voice conversion system on the CMU ARCTIC database. It is also worth noting that with this technique it is possible to halve the training data size and still outperform the baseline. (A sketch of the segment-windowing step follows the results list.)
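
The re-rendering step described in “Impulse response upmixing using particle systems” can be illustrated with a minimal sketch: given a set of particles, each a delayed, attenuated reflection with an arrival direction, a new impulse response is synthesized for an arbitrary virtual microphone. The particle fields and the first-order cardioid pickup pattern below are illustrative assumptions, not the paper’s actual model.

```python
# Minimal sketch: render an impulse response from a particle system
# through a virtual cardioid microphone. Fields and pickup pattern
# are assumptions for illustration.
import numpy as np

def render_ir(particles, mic_azimuth_deg, sr=48000, length_s=2.0):
    """particles: iterable of (time_s, amplitude, azimuth_deg) tuples."""
    ir = np.zeros(int(sr * length_s))
    mic = np.deg2rad(mic_azimuth_deg)
    for t, amp, az in particles:
        # First-order cardioid pickup: 0.5 * (1 + cos(theta))
        gain = 0.5 * (1.0 + np.cos(np.deg2rad(az) - mic))
        n = int(round(t * sr))
        if n < len(ir):
            ir[n] += amp * gain
    return ir

# Rendering the same particle cloud for several differently aimed
# virtual microphones is what yields stereo, surround, or binaural
# outputs; here, a stereo pair of cardioids at +/-45 degrees.
particles = [(0.010, 0.8, 30.0), (0.023, 0.5, -70.0), (0.041, 0.3, 150.0)]
ir_left = render_ir(particles, mic_azimuth_deg=+45.0)
ir_right = render_ir(particles, mic_azimuth_deg=-45.0)
```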
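
For “Singing voice resynthesis using vocal sound libraries”, a minimal sketch of the frame-matching idea: each frame of the original performance is replaced by the library frame with the most similar spectrum, scaled to the original frame’s energy, then overlap-added into the output. The paper’s actual features involve pitch, dynamics, and phonetic similarity; the plain spectral cosine similarity here is a simplified stand-in.

```python
# Minimal sketch: replace each frame of `original` with its best
# spectral match from `library`, energy-matched and overlap-added.
import numpy as np

def frame_spectra(signal, frame=1024, hop=512):
    n = 1 + max(0, (len(signal) - frame) // hop)
    win = np.hanning(frame)
    return np.array([np.abs(np.fft.rfft(win * signal[i*hop:i*hop+frame]))
                     for i in range(n)])

def resynthesize(original, library, frame=1024, hop=512):
    src = frame_spectra(original, frame, hop)
    lib = frame_spectra(library, frame, hop)
    lib_norm = lib / (np.linalg.norm(lib, axis=1, keepdims=True) + 1e-12)
    out = np.zeros(len(original))
    win = np.hanning(frame)
    for i, spec in enumerate(src):
        # Cosine similarity against every library frame; take the best.
        sims = lib_norm @ (spec / (np.linalg.norm(spec) + 1e-12))
        j = int(np.argmax(sims))
        seg = library[j*hop:j*hop+frame]
        if len(seg) < frame:
            continue
        # Match the original frame's RMS energy, then overlap-add.
        src_seg = original[i*hop:i*hop+frame]
        gain = (np.sqrt(np.mean(src_seg**2)) /
                (np.sqrt(np.mean(seg**2)) + 1e-12))
        out[i*hop:i*hop+frame] += win * seg * gain
    return out
```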
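
For “Concatenative singing voice resynthesis”, a minimal sketch of the concatenation step only: joining pre-recorded units, assumed already selected to match the target phonetics, with short crossfades so the splices do not click. Unit selection, the hard part of the paper, is omitted; the `units` list, sample rate, and crossfade length are illustrative assumptions.

```python
# Minimal sketch: splice selected units together with linear,
# equal-gain crossfades. Each unit must be longer than the fade.
import numpy as np

def concatenate_units(units, sr=44100, xfade_ms=20.0):
    xf = int(sr * xfade_ms / 1000.0)
    out = units[0].astype(float)
    fade_in = np.linspace(0.0, 1.0, xf)
    for u in units[1:]:
        u = u.astype(float)
        # Crossfade the tail of `out` against the head of the next unit;
        # the two linear gains sum to one, so level stays constant.
        out[-xf:] = out[-xf:] * fade_in[::-1] + u[:xf] * fade_in
        out = np.concatenate([out, u[xf:]])
    return out
```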
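
For “Segment Level Voice Conversion with Recurrent Neural Networks”, a minimal sketch of the segment-windowing idea: phoneme boundaries from a recognizer split each utterance into segments, and consecutive segments are grouped into overlapping windows that give the model local temporal context without aligning whole utterances. The window size and feature dimensions are illustrative choices, not the paper’s configuration.

```python
# Minimal sketch: group recognizer-derived segments into overlapping
# windows, hopping one segment at a time.
import numpy as np

def segment_windows(features, boundaries, win=3):
    """features: (T, D) frame matrix; boundaries: frame indices where
    segments start (from a speech recognizer). Yields windows of `win`
    consecutive segments."""
    edges = list(boundaries) + [len(features)]
    segs = [features[edges[i]:edges[i+1]] for i in range(len(edges) - 1)]
    for i in range(len(segs) - win + 1):
        yield np.concatenate(segs[i:i + win], axis=0)

# Example: 100 frames of 24-dim features split at three boundaries.
feats = np.random.randn(100, 24)
for w in segment_windows(feats, boundaries=[0, 30, 55, 80], win=2):
    print(w.shape)  # each window spans two consecutive segments
```

Because windows overlap by all but one segment, each training example still sees its neighbors’ frames, which is what lets the RNN learn temporal dependencies without whole-utterance DTW alignment.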
