Videos 1. Test dataset. Synthesized talking head video samples extracted from the test dataset. The labels 320, 640, 1280, and 2560 indicate the number of training epochs of the framework's first stage, which uses a Transformer-based model. The second-stage model is fixed at its optimal epoch, 120. The label 'original_keypoints' indicates that the ground-truth landmarks were used as input to the second stage. The final, unlabeled sample is the real video.
Videos 2. Exploration Test. Synthesized talking head video samples extracted from YouTube speech videos.
Speech-driven facial animation takes speech signals as input and aims to generate realistic and expressive talking head animations. Despite advances in talking head synthesis, challenges persist in achieving precise control, robust generalization, and adaptability to different scenarios and speaker characteristics. In addition, most existing approaches target a narrow range of languages, with English predominating.
This work introduces a novel two-stage architecture for Brazilian Portuguese talking head generation, combining the strengths of Transformers and Generative Adversarial Networks (GANs). In the first stage, a Transformer-based model extracts rich contextual information from the input speech audio and generates facial landmarks. In the second stage, a GAN-based framework translates these landmark representations into photorealistic video frames.
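The data flow between the two stages can be illustrated with a minimal sketch. It shows only the interfaces, not the trained models: both stage functions are hypothetical placeholders that return zeros, and the landmark count (68), the audio feature size, and the frame resolution are assumptions that are not stated in the text above.

```python
import numpy as np

# Assumption: a 68-point face layout; the source does not state the landmark count.
NUM_LANDMARKS = 68


def stage_one_audio_to_landmarks(audio_features: np.ndarray) -> np.ndarray:
    """Hypothetical stand-in for the Transformer stage.

    Maps (T, F) audio features to (T, NUM_LANDMARKS, 2) landmark coordinates.
    """
    num_frames = audio_features.shape[0]
    # A trained model would use context from the whole sequence; zeros stand in here.
    return np.zeros((num_frames, NUM_LANDMARKS, 2), dtype=np.float32)


def stage_two_landmarks_to_frames(
    landmarks: np.ndarray, height: int = 64, width: int = 64
) -> np.ndarray:
    """Hypothetical stand-in for the GAN generator.

    Maps (T, NUM_LANDMARKS, 2) landmarks to (T, height, width, 3) RGB frames.
    """
    num_frames = landmarks.shape[0]
    return np.zeros((num_frames, height, width, 3), dtype=np.uint8)


def synthesize(audio_features: np.ndarray) -> np.ndarray:
    """Chain the two stages: audio features -> landmarks -> video frames."""
    landmarks = stage_one_audio_to_landmarks(audio_features)
    return stage_two_landmarks_to_frames(landmarks)


# Example: 10 time steps of 80-dimensional audio features (sizes are illustrative).
frames = synthesize(np.zeros((10, 80), dtype=np.float32))
```

Because the stages communicate only through the landmark array, either placeholder could be swapped for a different model without touching the other.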
This framework separates the modeling of dynamic shape variations from the realistic appearance, partially addressing the challenge of generalization. Moreover, it becomes possible to assign multiple appearances to the same speaker by adjusting the trained weights of the second stage. Objective metrics were used to evaluate the synthesized facial speech, showing that it closely matches the ground-truth landmarks.
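As an illustration of how predicted landmarks can be compared against ground truth, the sketch below computes a mean Euclidean landmark distance. This is one common objective measure and is an assumption here: the text above does not specify which metrics were used, and the function name `mean_landmark_distance` is hypothetical.

```python
import numpy as np


def mean_landmark_distance(predicted, ground_truth) -> float:
    """Mean Euclidean distance between matching landmarks.

    Both inputs have shape (T, K, 2): T frames, K landmarks, (x, y) coordinates.
    This is an illustrative metric, not necessarily the one used in the paper.
    """
    predicted = np.asarray(predicted, dtype=np.float64)
    ground_truth = np.asarray(ground_truth, dtype=np.float64)
    if predicted.shape != ground_truth.shape:
        raise ValueError("predicted and ground_truth must have the same shape")
    # Distance per landmark per frame, then averaged over all of them.
    per_point = np.linalg.norm(predicted - ground_truth, axis=-1)
    return float(per_point.mean())


# Example: every predicted point is offset by (3, 4) from the ground truth.
truth = np.zeros((2, 3, 2))
shifted = truth + np.array([3.0, 4.0])
distance = mean_landmark_distance(shifted, truth)
```

Lower values mean the synthesized landmark trajectory stays closer to the reference; in practice the result is often normalized by a face-size measure so that scores are comparable across speakers.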
@inproceedings{bernardo-costa-2024-speech,
title = "A Speech-Driven Talking Head based on a Two-Stage Generative Framework",
author = "Bernardo, Brayan and
Costa, Paula",
editor = "Gamallo, Pablo and
Claro, Daniela and
Teixeira, Ant{\'o}nio and
Real, Livy and
Garcia, Marcos and
Oliveira, Hugo Gon{\c{c}}alo and
Amaro, Raquel",
booktitle = "Proceedings of the 16th International Conference on Computational Processing of Portuguese",
month = mar,
year = "2024",
address = "Santiago de Compostela, Galicia/Spain",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2024.propor-1.64",
pages = "580--586",
}