Abstract: Speech-driven 3D facial animation has attracted considerable attention due to its extensive applicability across diverse domains. Most existing 3D facial animation methods ignore the avatar’s expression, while emotion-controllable methods struggle to specify the avatar’s identity and portray varying emotional intensities, resulting in a lack of naturalness and realism in the animation. To address these issues, we first present the Emolib dataset, containing 10,736 expression images spanning eight emotion categories, i.e., neutral, happy, angry, sad, fear, surprise, disgust, and contempt, where each image is accompanied by an emotion label and a 3D model with the corresponding expression. Additionally, we present a novel 3D facial animation framework that operates on unpaired training data. This framework produces emotional facial animations aligned with the input face image, effectively conveying diverse emotional expressions and intensities. Our framework first generates a lip-synchronized model and an expression model separately; these models are then combined by a fusion network to produce face models that synchronize with speech while conveying emotion. Moreover, the mouth structure is incorporated to create a complete face model, which is then fed into our skin-realistic renderer to yield a highly realistic animation. Experimental results demonstrate that our approach outperforms state-of-the-art 3D facial animation methods in terms of realism and emotional expressiveness while maintaining precise lip synchronization. The Emolib dataset is available at https://github.com/yuminjing/Emolib.git.
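To make the described pipeline concrete, below is a minimal, runnable Python sketch of the data flow the abstract outlines (speech branch and emotion branch generated separately, fused, then rendered). All function names, parameter shapes, and the averaging-style fusion here are hypothetical placeholders for illustration only, not the authors' actual implementation or API.

```python
import numpy as np


def generate_lip_sync_model(speech_audio: np.ndarray) -> np.ndarray:
    """Hypothetical stand-in: map speech features to lip-synced face-model parameters."""
    return np.zeros(52)  # e.g., blendshape-like coefficients (placeholder size)


def generate_expression_model(face_image: np.ndarray,
                              emotion: str,
                              intensity: float) -> np.ndarray:
    """Hypothetical stand-in: map an emotion label and intensity to expression parameters."""
    return np.full(52, intensity)


def fuse(lip_params: np.ndarray, expr_params: np.ndarray) -> np.ndarray:
    """Placeholder fusion step; in the paper this is a learned fusion network."""
    return 0.5 * (lip_params + expr_params)


def render(face_params: np.ndarray) -> list:
    """Placeholder for mouth-structure completion and skin-realistic rendering."""
    return [face_params]  # one toy "frame"


def animate(speech_audio, face_image, emotion="happy", intensity=0.8):
    lip_params = generate_lip_sync_model(speech_audio)                    # speech branch
    expr_params = generate_expression_model(face_image, emotion, intensity)  # emotion branch
    fused = fuse(lip_params, expr_params)                                 # combine branches
    return render(fused)                                                  # complete model + render


frames = animate(np.zeros(16000), np.zeros((256, 256, 3)))
```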
Recommended citation: Minjing Yu, Delong Pang, Ziwen Kang, Zhiyao Sun, Tian Lv, Jenny Sheng, Ran Yi, Yu-Hui Wen, and Yong-Jin Liu*. 2024. ECAvatar: 3D Avatar Facial Animation with Controllable Identity and Emotion. In Proceedings of the 32nd ACM International Conference on Multimedia (MM '24).