Speech-Driven 2D Facial Animation Based on a Two-Stage Generative Framework

Dept. of Computer Engineering and Automation (DCA),
Universidade Estadual de Campinas (UNICAMP)

Our framework synthesizes talking head videos from Brazilian Portuguese speech audio.

Videos 1. Test dataset. Synthesized talking head video samples from the test dataset. The labels 320, 640, 1280, and 2560 denote the training epochs of the framework's first stage, which uses a Transformer-based model. The second-stage model is fixed at its optimal epoch, 120. The label 'original_keypoints' indicates the ground-truth landmarks used as input to the second stage. The final, unlabeled sample is the real video.

Videos 2. Exploration test. Synthesized talking head video samples generated from speech audio extracted from YouTube videos.

Abstract

Speech-driven facial animation, a technique that uses speech signals as input, aims to generate realistic and expressive talking head animations. Despite advances in talking head synthesis methods, challenges persist in achieving precise control, robust generalization, and adaptability to diverse scenarios and speaker characteristics. In addition, most existing approaches target a restricted range of languages, with English being the predominant focus.

This work introduces a novel two-stage architecture for Brazilian Portuguese talking head generation, combining the strengths of Transformers and Generative Adversarial Networks (GANs). In the first stage, a Transformer-based model extracts rich contextual information from the input speech audio and generates facial landmarks. In the second stage, a GAN-based framework translates these facial representations into photorealistic video frames.
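To make the two-stage pipeline concrete, below is a minimal PyTorch sketch of how an audio-to-landmarks Transformer could be chained with a landmark-to-frame generator at inference time. All module names, layer sizes, and the landmark rasterization placeholder are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of the two-stage pipeline; dimensions are illustrative.
import torch
import torch.nn as nn

class AudioToLandmarks(nn.Module):
    """Stage 1: Transformer encoder mapping audio features to facial landmarks."""
    def __init__(self, audio_dim=80, d_model=256, n_landmarks=68):
        super().__init__()
        self.proj = nn.Linear(audio_dim, d_model)
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=4)
        self.head = nn.Linear(d_model, n_landmarks * 2)  # (x, y) per landmark

    def forward(self, audio_feats):            # (B, T, audio_dim)
        h = self.encoder(self.proj(audio_feats))
        return self.head(h)                    # (B, T, n_landmarks * 2)

class LandmarksToFrame(nn.Module):
    """Stage 2: GAN generator rendering an RGB frame from a landmark sketch image."""
    def __init__(self, in_ch=1, out_ch=3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, out_ch, 4, stride=2, padding=1), nn.Tanh(),
        )

    def forward(self, landmark_img):           # (B, 1, H, W) rasterized landmarks
        return self.net(landmark_img)          # (B, 3, H, W) RGB frame

# Inference chaining: audio features -> landmarks -> rasterize -> frames.
stage1, stage2 = AudioToLandmarks(), LandmarksToFrame()
mel = torch.randn(1, 100, 80)                  # 100 frames of mel-spectrogram features
landmarks = stage1(mel)                        # predicted landmark sequence
sketch = torch.randn(1, 1, 128, 128)           # placeholder for rasterized landmarks
frame = stage2(sketch)                         # one synthesized video frame
```

The point of the sketch is the separation of concerns: the first stage only has to model speech-to-shape dynamics, while the second stage only has to render appearance from a shape representation.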

This framework separates the modeling of dynamic shape variations from the realistic appearance, partially addressing the challenge of generalization. Moreover, it becomes possible to assign multiple appearances to the same speaker by adjusting the trained weights of the second stage. Objective metrics were used to evaluate the synthesized facial speech, showing that it closely matches the ground-truth landmarks.
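As an illustration of the kind of objective landmark evaluation mentioned above, the snippet below computes a mean Euclidean error between predicted and ground-truth landmark sequences. The function name, array shapes, and the use of raw pixel distances are assumptions; the paper's exact evaluation protocol may differ (e.g., it may normalize or aggregate distances differently).

```python
# Illustrative landmark-error metric, not necessarily the authors' exact protocol.
import numpy as np

def mean_landmark_error(pred, gt):
    """pred, gt: arrays of shape (T, n_landmarks, 2) in pixel coordinates."""
    per_point = np.linalg.norm(pred - gt, axis=-1)  # Euclidean distance per landmark
    return per_point.mean()                         # average over time and landmarks

pred = np.random.rand(100, 68, 2) * 256             # dummy predicted landmarks
gt = np.random.rand(100, 68, 2) * 256                # dummy ground-truth landmarks
print(f"mean landmark error: {mean_landmark_error(pred, gt):.2f} px")
```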

Framework Inference Overview

Overview of the framework at inference time.

BibTeX


      @inproceedings{bernardo-costa-2024-speech,
        title = "A Speech-Driven Talking Head based on a Two-Stage Generative Framework",
        author = "Bernardo, Brayan  and
          Costa, Paula",
        editor = "Gamallo, Pablo  and
          Claro, Daniela  and
          Teixeira, Ant{\'o}nio  and
          Real, Livy  and
          Garcia, Marcos  and
          Oliveira, Hugo Gon{\c{c}}alo  and
          Amaro, Raquel",
        booktitle = "Proceedings of the 16th International Conference on Computational Processing of Portuguese",
        month = mar,
        year = "2024",
        address = "Santiago de Compostela, Galicia/Spain",
        publisher = "Association for Computational Linguistics",
        url = "https://aclanthology.org/2024.propor-1.64",
        pages = "580--586",
    }