Speech-driven 3D facial motions describe the dynamics of a 3D human face while the person is speaking. The behavior is repeatable and person-specific, making it promising for many applications, e.g., person recognition and lip-language analysis. This database focuses on dynamic human faces while the subjects speak a short phrase. It collects 1030 samples in two parts: Speaking with Frontal Pose (S3DFM-FP) and Speaking with Varying Pose (S3DFM-VP). There are 770 samples from 77 participants in the FP sub-dataset and 260 samples from 26 participants in the VP sub-dataset. The participants span different ages, genders, ethnicities, and mother tongues.
A high-frame-rate (500 fps) 3D video sensor from DI4D Ltd was used to capture the data. The sensor is a binocular stereo vision system consisting mainly of two intensity cameras. Each participant was asked to repeat a short phrase, ni'hao (a Chinese word meaning 'Hello'), 10 times while looking naturally straight at the cameras. For each repetition, we captured a video sequence with the sensor and a synchronized audio sequence via a microphone. For the varying-pose capture, the participant repeated the same phrase while moving the head naturally.
The 3D reconstruction of each video sequence was done using DI4D's commercial software with additional spatial smoothing and temporal filtering. Each sample contains a depth/3D sequence, a pixel-wise registered intensity sequence, and a short 'passphrase' (the synchronized audio sequence). Each video sequence contains 500 frames, and each audio sequence likewise covers 1 second at a sampling frequency of 44.1 kHz. The depth/3D and intensity images each have a resolution of 600×600 points. (The video sequences were downsampled from their original resolution of 1200×1200 pixels to improve processing efficiency and reduce 3D noise.)
Overall, the database contains 2 parts: Frontal Pose (S3DFM-FP), Varying Pose (S3DFM-VP).
In the S3DFM-FP, there are 770 samples with frontal pose, 10 from each of the 77 participants.
In the S3DFM-VP, there are 260 samples with varying pose, 10 from each of the 26 participants.
As examples, Fig. 1 shows the cosine-shaded depth data from two participants, their registered intensity frames (frames 50, 150, 300, and 450) from one video sequence, and the synchronized audio sequence.
The mouth is the principal dynamic region of a speaking face. We represent the 3D mouth region by its width and opening. Fig. 2 shows how the 3D mouth region of one participant changes over time, and how repeatable this change is across the 10 sequences.
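As an illustration, the two mouth measurements can be computed as 3D Euclidean distances between landmark pixels in a frame's XYZ array. This is a minimal sketch, not the authors' exact procedure; the landmark pixel coordinates are assumed to come from an external landmark detector.

```python
import numpy as np

def mouth_measurements(xyz, left_corner, right_corner, top_lip, bottom_lip):
    """Mouth width and opening as 3D Euclidean distances.

    xyz: (H, W, 3) array of per-pixel (x, y, z) points for one frame.
    The four landmark arguments are (row, col) pixel coordinates, assumed
    to come from an external landmark detector (not part of the database).
    """
    def dist(a, b):
        return float(np.linalg.norm(xyz[a] - xyz[b]))

    width = dist(left_corner, right_corner)    # mouth-corner to mouth-corner
    opening = dist(top_lip, bottom_lip)        # upper lip to lower lip
    return width, opening
```

Tracking these two values over the 500 frames of a sequence yields curves like those in Fig. 2.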
The database is freely available for use by other researchers or parties, under CC-BY-NC-ND license terms. Note that the database can only be used for academic research. If you use the data in a publication, please cite:
Each file listed below contains the 10 sequences (3D, intensity, and audio) from one participant. You can download them by clicking on a file and then unzipping it.
Each frame file (e.g., seq1_050.mat) contains 2 arrays: Img(600,600) and XYZ(600,600,3), both stored in single precision. Img is the infrared intensity image at the corresponding frame. XYZ is the (x,y,z) point computed for the corresponding pixel.

The demographics structure array describes the participants:
demographics(1:77,1).Subject - subject identifier (1..77)
demographics(1:77,2).Age - age category: 1-Youth; 2-Middle age; 3-Senior (minimum age is 16, maximum age is 73)
demographics(1:77,3).Gender - 1-Female; 2-Male (27 females, 50 males)
demographics(1:77,4).Nationality - 0-Unknown; 1-North American; 2-South American; 3-African; 4-European; 5-East Asian; 6-South/Southeast Asian
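The frame files can also be read outside MATLAB. Below is a minimal Python sketch using scipy.io.loadmat; scipy is assumed to be installed, and the example path in the usage comment is illustrative only.

```python
def load_frame(path):
    """Load one S3DFM frame file and return its two arrays.

    Img is the (600, 600) single-precision infrared intensity image;
    XYZ is the (600, 600, 3) single-precision array of per-pixel
    (x, y, z) points.
    """
    from scipy.io import loadmat  # requires scipy to be installed
    data = loadmat(path)
    return data["Img"], data["XYZ"]

# Illustrative usage (the directory layout here is hypothetical):
# img, xyz = load_frame("participant01/seq1_050.mat")
```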
Participant 1: Seq1 Seq2 Seq3 Seq4 Seq5 Seq6 Seq7 Seq8 Seq9 Seq10
Participant 2: Seq1 Seq2 Seq3 Seq4 Seq5 Seq6 Seq7 Seq8 Seq9 Seq10
Participant 3: Seq1 Seq2 Seq3 Seq4 Seq5 Seq6 Seq7 Seq8 Seq9 Seq10
Participant 4: Seq1 Seq2 Seq3 Seq4 Seq5 Seq6 Seq7 Seq8 Seq9 Seq10
Participant 5: Seq1 Seq2 Seq3 Seq4 Seq5 Seq6 Seq7 Seq8 Seq9 Seq10
Participant 6: Seq1 Seq2 Seq3 Seq4 Seq5 Seq6 Seq7 Seq8 Seq9 Seq10
Participant 7: Seq1 Seq2 Seq3 Seq4 Seq5 Seq6 Seq7 Seq8 Seq9 Seq10
Participant 8: Seq1 Seq2 Seq3 Seq4 Seq5 Seq6 Seq7 Seq8 Seq9 Seq10
Participant 9: Seq1 Seq2 Seq3 Seq4 Seq5 Seq6 Seq7 Seq8 Seq9 Seq10
Participant 10: Seq1 Seq2 Seq3 Seq4 Seq5 Seq6 Seq7 Seq8 Seq9 Seq10
Participant 11: Seq1 Seq2 Seq3 Seq4 Seq5 Seq6 Seq7 Seq8 Seq9 Seq10
Participant 12: Seq1 Seq2 Seq3 Seq4 Seq5 Seq6 Seq7 Seq8 Seq9 Seq10
Participant 13: Seq1 Seq2 Seq3 Seq4 Seq5 Seq6 Seq7 Seq8 Seq9 Seq10
Participant 14: Seq1 Seq2 Seq3 Seq4 Seq5 Seq6 Seq7 Seq8 Seq9 Seq10
Participant 15: Seq1 Seq2 Seq3 Seq4 Seq5 Seq6 Seq7 Seq8 Seq9 Seq10
Participant 16: Seq1 Seq2 Seq3 Seq4 Seq5 Seq6 Seq7 Seq8 Seq9 Seq10
Participant 17: Seq1 Seq2 Seq3 Seq4 Seq5 Seq6 Seq7 Seq8 Seq9 Seq10
Participant 18: Seq1 Seq2 Seq3 Seq4 Seq5 Seq6 Seq7 Seq8 Seq9 Seq10
Participant 19: Seq1 Seq2 Seq3 Seq4 Seq5 Seq6 Seq7 Seq8 Seq9 Seq10
Participant 20: Seq1 Seq2 Seq3 Seq4 Seq5 Seq6 Seq7 Seq8 Seq9 Seq10
Participant 21: Seq1 Seq2 Seq3 Seq4 Seq5 Seq6 Seq7 Seq8 Seq9 Seq10
Participant 22: Seq1 Seq2 Seq3 Seq4 Seq5 Seq6 Seq7 Seq8 Seq9 Seq10
Participant 23: Seq1 Seq2 Seq3 Seq4 Seq5 Seq6 Seq7 Seq8 Seq9 Seq10
Participant 24: Seq1 Seq2 Seq3 Seq4 Seq5 Seq6 Seq7 Seq8 Seq9 Seq10
Participant 25: Seq1 Seq2 Seq3 Seq4 Seq5 Seq6 Seq7 Seq8 Seq9 Seq10
Participant 26: Seq1 Seq2 Seq3 Seq4 Seq5 Seq6 Seq7 Seq8 Seq9 Seq10
Participant 27: Seq1 Seq2 Seq3 Seq4 Seq5 Seq6 Seq7 Seq8 Seq9 Seq10
Participant 28: Seq1 Seq2 Seq3 Seq4 Seq5 Seq6 Seq7 Seq8 Seq9 Seq10
Participant 29: Seq1 Seq2 Seq3 Seq4 Seq5 Seq6 Seq7 Seq8 Seq9 Seq10
Participant 30: Seq1 Seq2 Seq3 Seq4 Seq5 Seq6 Seq7 Seq8 Seq9 Seq10
Participant 31: Seq1 Seq2 Seq3 Seq4 Seq5 Seq6 Seq7 Seq8 Seq9 Seq10
Participant 32: Seq1 Seq2 Seq3 Seq4 Seq5 Seq6 Seq7 Seq8 Seq9 Seq10
Participant 33: Seq1 Seq2 Seq3 Seq4 Seq5 Seq6 Seq7 Seq8 Seq9 Seq10
Participant 34: Seq1 Seq2 Seq3 Seq4 Seq5 Seq6 Seq7 Seq8 Seq9 Seq10
Participant 35: Seq1 Seq2 Seq3 Seq4 Seq5 Seq6 Seq7 Seq8 Seq9 Seq10
Participant 36: Seq1 Seq2 Seq3 Seq4 Seq5 Seq6 Seq7 Seq8 Seq9 Seq10
Participant 37: Seq1 Seq2 Seq3 Seq4 Seq5 Seq6 Seq7 Seq8 Seq9 Seq10
Participant 38: Seq1 Seq2 Seq3 Seq4 Seq5 Seq6 Seq7 Seq8 Seq9 Seq10
Participant 39: Seq1 Seq2 Seq3 Seq4 Seq5 Seq6 Seq7 Seq8 Seq9 Seq10
Participant 40: Seq1 Seq2 Seq3 Seq4 Seq5 Seq6 Seq7 Seq8 Seq9 Seq10
Participant 41: Seq1 Seq2 Seq3 Seq4 Seq5 Seq6 Seq7 Seq8 Seq9 Seq10
Participant 42: Seq1 Seq2 Seq3 Seq4 Seq5 Seq6 Seq7 Seq8 Seq9 Seq10
Participant 43: Seq1 Seq2 Seq3 Seq4 Seq5 Seq6 Seq7 Seq8 Seq9 Seq10
Participant 44: Seq1 Seq2 Seq3 Seq4 Seq5 Seq6 Seq7 Seq8 Seq9 Seq10
Participant 45: Seq1 Seq2 Seq3 Seq4 Seq5 Seq6 Seq7 Seq8 Seq9 Seq10
Participant 46: Seq1 Seq2 Seq3 Seq4 Seq5 Seq6 Seq7 Seq8 Seq9 Seq10
Participant 47: Seq1 Seq2 Seq3 Seq4 Seq5 Seq6 Seq7 Seq8 Seq9 Seq10
Participant 48: Seq1 Seq2 Seq3 Seq4 Seq5 Seq6 Seq7 Seq8 Seq9 Seq10
Participant 49: Seq1 Seq2 Seq3 Seq4 Seq5 Seq6 Seq7 Seq8 Seq9 Seq10
Participant 50: Seq1 Seq2 Seq3 Seq4 Seq5 Seq6 Seq7 Seq8 Seq9 Seq10
Participant 51: Seq1 Seq2 Seq3 Seq4 Seq5 Seq6 Seq7 Seq8 Seq9 Seq10
Participant 52: Seq1 Seq2 Seq3 Seq4 Seq5 Seq6 Seq7 Seq8 Seq9 Seq10
Participant 53: Seq1 Seq2 Seq3 Seq4 Seq5 Seq6 Seq7 Seq8 Seq9 Seq10
Participant 54: Seq1 Seq2 Seq3 Seq4 Seq5 Seq6 Seq7 Seq8 Seq9 Seq10
Participant 55: Seq1 Seq2 Seq3 Seq4 Seq5 Seq6 Seq7 Seq8 Seq9 Seq10
Participant 56: Seq1 Seq2 Seq3 Seq4 Seq5 Seq6 Seq7 Seq8 Seq9 Seq10
Participant 57: Seq1 Seq2 Seq3 Seq4 Seq5 Seq6 Seq7 Seq8 Seq9 Seq10
Participant 58: Seq1 Seq2 Seq3 Seq4 Seq5 Seq6 Seq7 Seq8 Seq9 Seq10
Participant 59: Seq1 Seq2 Seq3 Seq4 Seq5 Seq6 Seq7 Seq8 Seq9 Seq10
Participant 60: Seq1 Seq2 Seq3 Seq4 Seq5 Seq6 Seq7 Seq8 Seq9 Seq10
Participant 61: Seq1 Seq2 Seq3 Seq4 Seq5 Seq6 Seq7 Seq8 Seq9 Seq10
Participant 62: Seq1 Seq2 Seq3 Seq4 Seq5 Seq6 Seq7 Seq8 Seq9 Seq10
Participant 63: Seq1 Seq2 Seq3 Seq4 Seq5 Seq6 Seq7 Seq8 Seq9 Seq10
Participant 64: Seq1 Seq2 Seq3 Seq4 Seq5 Seq6 Seq7 Seq8 Seq9 Seq10
Participant 65: Seq1 Seq2 Seq3 Seq4 Seq5 Seq6 Seq7 Seq8 Seq9 Seq10
Participant 66: Seq1 Seq2 Seq3 Seq4 Seq5 Seq6 Seq7 Seq8 Seq9 Seq10
Participant 67: Seq1 Seq2 Seq3 Seq4 Seq5 Seq6 Seq7 Seq8 Seq9 Seq10
Participant 68: Seq1 Seq2 Seq3 Seq4 Seq5 Seq6 Seq7 Seq8 Seq9 Seq10
Participant 69: Seq1 Seq2 Seq3 Seq4 Seq5 Seq6 Seq7 Seq8 Seq9 Seq10
Participant 70: Seq1 Seq2 Seq3 Seq4 Seq5 Seq6 Seq7 Seq8 Seq9 Seq10
Participant 71: Seq1 Seq2 Seq3 Seq4 Seq5 Seq6 Seq7 Seq8 Seq9 Seq10
Participant 72: Seq1 Seq2 Seq3 Seq4 Seq5 Seq6 Seq7 Seq8 Seq9 Seq10
Participant 73: Seq1 Seq2 Seq3 Seq4 Seq5 Seq6 Seq7 Seq8 Seq9 Seq10
Participant 74: Seq1 Seq2 Seq3 Seq4 Seq5 Seq6 Seq7 Seq8 Seq9 Seq10
Participant 75: Seq1 Seq2 Seq3 Seq4 Seq5 Seq6 Seq7 Seq8 Seq9 Seq10
Participant 76: Seq1 Seq2 Seq3 Seq4 Seq5 Seq6 Seq7 Seq8 Seq9 Seq10
Participant 77: Seq1 Seq2 Seq3 Seq4 Seq5 Seq6 Seq7 Seq8 Seq9 Seq10
The video sequences above recorded the participants facing forward and essentially static, apart from their speaking. For each of 26 participants, we recorded an additional 10 videos in which the participant moves the head while speaking the same passphrase.
Participant 1: Seq1 Seq2 Seq3 Seq4 Seq5 Seq6 Seq7 Seq8 Seq9 Seq10
Participant 2: Seq1 Seq2 Seq3 Seq4 Seq5 Seq6 Seq7 Seq8 Seq9 Seq10
Participant 3: Seq1 Seq2 Seq3 Seq4 Seq5 Seq6 Seq7 Seq8 Seq9 Seq10
Participant 4: Seq1 Seq2 Seq3 Seq4 Seq5 Seq6 Seq7 Seq8 Seq9 Seq10
Participant 5: Seq1 Seq2 Seq3 Seq4 Seq5 Seq6 Seq7 Seq8 Seq9 Seq10
Participant 6: Seq1 Seq2 Seq3 Seq4 Seq5 Seq6 Seq7 Seq8 Seq9 Seq10
Participant 7: Seq1 Seq2 Seq3 Seq4 Seq5 Seq6 Seq7 Seq8 Seq9 Seq10
Participant 8: Seq1 Seq2 Seq3 Seq4 Seq5 Seq6 Seq7 Seq8 Seq9 Seq10
Participant 9: Seq1 Seq2 Seq3 Seq4 Seq5 Seq6 Seq7 Seq8 Seq9 Seq10
Participant 10: Seq1 Seq2 Seq3 Seq4 Seq5 Seq6 Seq7 Seq8 Seq9 Seq10
Participant 11: Seq1 Seq2 Seq3 Seq4 Seq5 Seq6 Seq7 Seq8 Seq9 Seq10
Participant 12: Seq1 Seq2 Seq3 Seq4 Seq5 Seq6 Seq7 Seq8 Seq9 Seq10
Participant 13: Seq1 Seq2 Seq3 Seq4 Seq5 Seq6 Seq7 Seq8 Seq9 Seq10
Participant 14: Seq1 Seq2 Seq3 Seq4 Seq5 Seq6 Seq7 Seq8 Seq9 Seq10
Participant 15: Seq1 Seq2 Seq3 Seq4 Seq5 Seq6 Seq7 Seq8 Seq9 Seq10
Participant 16: Seq1 Seq2 Seq3 Seq4 Seq5 Seq6 Seq7 Seq8 Seq9 Seq10
Participant 17: Seq1 Seq2 Seq3 Seq4 Seq5 Seq6 Seq7 Seq8 Seq9 Seq10
Participant 18: Seq1 Seq2 Seq3 Seq4 Seq5 Seq6 Seq7 Seq8 Seq9 Seq10
Participant 19: Seq1 Seq2 Seq3 Seq4 Seq5 Seq6 Seq7 Seq8 Seq9 Seq10
Participant 20: Seq1 Seq2 Seq3 Seq4 Seq5 Seq6 Seq7 Seq8 Seq9 Seq10
Participant 21: Seq1 Seq2 Seq3 Seq4 Seq5 Seq6 Seq7 Seq8 Seq9 Seq10
Participant 22: Seq1 Seq2 Seq3 Seq4 Seq5 Seq6 Seq7 Seq8 Seq9 Seq10
Participant 23: Seq1 Seq2 Seq3 Seq4 Seq5 Seq6 Seq7 Seq8 Seq9 Seq10
Participant 24: Seq1 Seq2 Seq3 Seq4 Seq5 Seq6 Seq7 Seq8 Seq9 Seq10
Participant 25: Seq1 Seq2 Seq3 Seq4 Seq5 Seq6 Seq7 Seq8 Seq9 Seq10
Participant 26: Seq1 Seq2 Seq3 Seq4 Seq5 Seq6 Seq7 Seq8 Seq9 Seq10
Note: the participants in S3DFM-VP are a subset of those in S3DFM-FP. The identity-number correspondences are given below for linking purposes.
S3DFM-VP | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 | 25 | 26 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
S3DFM-FP | 1 | 60 | 43 | 59 | 11 | 2 | 18 | 7 | 49 | 61 | 62 | 63 | 64 | 65 | 66 | 67 | 68 | 69 | 70 | 71 | 72 | 73 | 74 | 75 | 76 | 77 |
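For programmatic linking, the correspondence table above can be encoded directly, e.g., as a Python dictionary mapping S3DFM-VP identity numbers to their S3DFM-FP counterparts:

```python
# S3DFM-VP participant ID -> corresponding S3DFM-FP participant ID
# (values transcribed from the correspondence table above)
VP_TO_FP = {
     1:  1,  2: 60,  3: 43,  4: 59,  5: 11,  6:  2,  7: 18,  8:  7,  9: 49,
    10: 61, 11: 62, 12: 63, 13: 64, 14: 65, 15: 66, 16: 67, 17: 68, 18: 69,
    19: 70, 20: 71, 21: 72, 23: 74, 22: 73, 24: 75, 25: 76, 26: 77,
}
```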
The database was established by Jie Zhang as part of her PhD research while she was a visiting PhD student at the University of Edinburgh (UoE). Jie Zhang was with Beihang University and UoE. Robert B. Fisher and Luis Horna are with UoE.
You might be interested in these related papers:
If you have any questions, please don't hesitate to contact us.
This research was supported by funding from the China Scholarship Council (CSC) under grant 201606020087 and the National Council for Science and Technology (CONACyT) of Mexico. We thank all the participants in the data acquisition, and DI4D Ltd for their support.