

Beyond FVD: Enhanced Evaluation Metrics for Video Generation Quality

Mila - Quebec AI Institute

JEDi, which combines the feature space of a V-JEPA model with a Maximum Mean Discrepancy (MMD) metric, is a far more efficient framework for evaluating distributions of generated videos than conventional methods.

Abstract

The Fréchet Video Distance (FVD) is a widely adopted metric for evaluating video generation distribution quality. However, its effectiveness relies on critical assumptions. Our analysis reveals three significant limitations: (1) the non-Gaussianity of the Inflated 3D Convnet (I3D) feature space; (2) the insensitivity of I3D features to temporal distortions; (3) the impractical sample sizes required for reliable estimation. These findings undermine FVD's reliability and show that FVD falls short as a standalone metric for video generation evaluation. After extensive analysis of a wide range of metrics and backbone architectures, we propose JEDi, the JEPA Embedding Distance, based on features derived from a Joint Embedding Predictive Architecture, measured using Maximum Mean Discrepancy with a polynomial kernel. Our experiments on multiple open-source datasets show clear evidence that it is a superior alternative to the widely used FVD metric, requiring only 16% of the samples to reach its steady value while improving alignment with human evaluation by 34% on average.

Limitations of the Fréchet Video Distance

This paper addresses three significant challenges undermining the reliability of the FVD metric:

  • (1) Non-Gaussian feature space: The Inflated 3D Convnet (I3D) feature space exhibits non-Gaussianity.
  • (2) Temporal distortion insensitivity: I3D features are insensitive to temporal distortions.
  • (3) Impractical sample sizes: Reliable estimation requires impractically large sample sizes.

Non-Gaussian Feature Spaces

The Fréchet distance (FD) measures the difference between means and covariances. This can offer insights into the first two moments of the distributions, but fails to do so with respect to higher-order moments (e.g., skewness, kurtosis) that arise when either the real or generated data distribution is non-Gaussian. For many video datasets, the I3D feature space (in which FVD is computed) is non-Gaussian, which can lead to misleading FD values.
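In closed form, the Fréchet distance between Gaussians fitted to two feature sets is FD = ||mu_1 - mu_2||^2 + Tr(Sigma_1 + Sigma_2 - 2 (Sigma_1 Sigma_2)^(1/2)). The sketch below makes this moment truncation explicit; it assumes the I3D features have already been extracted into NumPy arrays and is illustrative rather than the reference FVD implementation.

import numpy as np
from scipy import linalg

def frechet_distance(feats_real, feats_gen):
    """Frechet distance between Gaussians fit to two (N, D) feature sets.

    Only means and covariances enter the formula, so any information in
    higher-order moments (skewness, kurtosis) is discarded.
    """
    mu1, mu2 = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    sigma1 = np.cov(feats_real, rowvar=False)
    sigma2 = np.cov(feats_gen, rowvar=False)

    diff = mu1 - mu2
    # Matrix square root of the covariance product; numerical error can
    # introduce a tiny imaginary component, which we drop.
    covmean = linalg.sqrtm(sigma1 @ sigma2)
    if np.iscomplexobj(covmean):
        covmean = covmean.real

    return float(diff @ diff + np.trace(sigma1 + sigma2 - 2.0 * covmean))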

Gaussianity Assumption: FVD assumes a multivariate Gaussian distribution of the I3D feature space, which may not reflect the complexity and variability of real video distributions.

Temporal Distortion Insensitivity

Metrics in the I3D feature space are impacted by salt and pepper noise (a spatial distortion) significantly more than by other types of distortions. This is contrary to human perception, which is more sensitive to temporal distortions or elastic deformations. The insensitivity of I3D features to temporal distortions can lead to misleading FD values, as the metric may not capture the most salient aspects of video quality.

We demonstrate this insensitivity by comparing the FVD and JEDi metrics on a video with temporal distortions and salt and pepper noise. JEDi provides a more accurate evaluation of video quality, according to its agreement with survey responses.
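For illustration, here is a minimal sketch of the two distortion types, assuming a video is a (T, H, W, C) uint8 NumPy array; the function names and noise level are our own and only approximate the distortions used in the study.

import numpy as np

def salt_and_pepper(video, amount=0.05, seed=0):
    """Spatial distortion: flip a random fraction of pixels to black or white."""
    rng = np.random.default_rng(seed)
    out = video.copy()
    noisy = rng.random(video.shape[:3]) < amount  # per-pixel mask, shared across channels
    salt = rng.random(video.shape[:3]) < 0.5
    out[noisy & salt] = 255  # salt: white pixels
    out[noisy & ~salt] = 0   # pepper: black pixels
    return out

def shuffle_frames(video, seed=0):
    """Temporal distortion: permute frame order while leaving each frame intact."""
    rng = np.random.default_rng(seed)
    return video[rng.permutation(len(video))]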


Temporal Distortion Insensitivity: I3D features are insensitive to temporal distortions, leading to misleading FVD values. FVD strongly prefers the highly blurred video (left) over the video with salt and pepper noise (right), contrary to 95% of human raters. JEDi, on the other hand, agrees with human perception more closely.

Alignment with Human Evaluation

We additionally find that while metrics within a given feature space generally perform at the same level, distances calculated in the feature space of a V-JEPA model agree with human evaluation more closely than those in alternative feature spaces. Twenty independent raters were presented with anonymized video pairs differing only in noise distortion and asked to select the video with the higher quality, or to indicate no observable difference.

Following the Analytic Hierarchy Process (AHP) (Saaty, 1987), a pairwise comparison matrix was used to aggregate the responses and compute their similarity to a series of metrics computed in different feature spaces. The V-JEPA feature space was found to have the highest correlation with human perception.
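As a rough sketch of this aggregation step, suppose rater responses are tallied into a pairwise win-count matrix (an input format we adopt here for illustration; the paper's exact construction may differ). The AHP priority vector is then the principal eigenvector of the reciprocal comparison matrix:

import numpy as np

def ahp_priorities(wins, eps=1e-6):
    """AHP priority weights from a pairwise win-count matrix.

    wins[i, j] counts how often option i was preferred over option j;
    eps avoids division by zero for unanimous comparisons.
    """
    a = (wins + eps) / (wins.T + eps)  # reciprocal matrix: a_ij = 1 / a_ji
    np.fill_diagonal(a, 1.0)
    eigvals, eigvecs = np.linalg.eig(a)
    v = np.abs(np.real(eigvecs[:, np.argmax(np.real(eigvals))]))
    return v / v.sum()  # normalized principal (Perron) eigenvector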

Figure: cosine similarity with human evaluation for metrics computed in each feature space, on UCF-101 and Sky Scenes.

Impractical Sample Sizes

It often takes thousands of video clips for FVD to converge to a stable value; however, many datasets do not contain enough unique videos to reach this threshold. A common workaround is to split videos into shorter, partly overlapping clips, but the repetitiveness of the resulting data biases the metric. This fault has remained largely unchallenged.

We note three key hurdles stemming from this sample efficiency issue in video generation: (1) Data size: limited samples compromise estimate reliability, undermining robust statistical analysis; (2) Computational resources: generating samples is computationally expensive and time-consuming; (3) Metric convergence speed: slow convergence rates hinder accurate assessments. While dataset size and computational resources are largely beyond our control, we can address the third concern by selecting metrics with higher sample efficiency, i.e., metrics that converge with fewer samples.
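MMD-based metrics are a natural fit here, since unbiased MMD estimates stabilize with far fewer samples than the high-dimensional covariance estimates that FD depends on. Below is a minimal sketch of an unbiased squared-MMD estimate with a polynomial kernel, assuming x and y are (N, D) and (M, D) arrays of video embeddings; the kernel hyperparameters (degree 3, gamma = 1/D, coef0 = 1, as in KID-style estimators) are illustrative rather than the paper's exact settings.

import numpy as np

def poly_kernel(a, b, degree=3, gamma=None, coef0=1.0):
    gamma = gamma if gamma is not None else 1.0 / a.shape[1]
    return (gamma * a @ b.T + coef0) ** degree

def mmd2(x, y):
    """Unbiased estimate of the squared Maximum Mean Discrepancy."""
    n, m = len(x), len(y)
    kxx, kyy, kxy = poly_kernel(x, x), poly_kernel(y, y), poly_kernel(x, y)
    # Drop diagonal terms for the unbiased within-set averages.
    sum_xx = (kxx.sum() - np.trace(kxx)) / (n * (n - 1))
    sum_yy = (kyy.sum() - np.trace(kyy)) / (m * (m - 1))
    return float(sum_xx + sum_yy - 2.0 * kxy.mean())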


Impractical Sample Sizes: Comparing the convergence rates of FVD and JEDi. FVD requires substantially more samples to reach a stable value.

Stable Video Diffusion Fine-Tuning Dynamics: JEDi vs FVD

We evaluated FVD and JEDi video distribution distances across 5 training checkpoints while fine-tuning Stable Video Diffusion on the BDD dataset. JEDi tracks the incremental gains at every checkpoint, whereas FVD registers only the large early improvements. Visually, video quality continues to increase as fine-tuning progresses, but this is not reflected when tracking performance with FVD.

Figure: JEDi and FVD tracked across fine-tuning checkpoints.

For reference, we show sample generations at different fine-tuning iterations on the BDD dataset; the quality of the generated videos visibly improves as fine-tuning progresses.

Figure: sample videos at fine-tuning iterations 0, 1, 6, and 1200.

BibTeX

@misc{luo2024jedi,
  title={Beyond FVD: Enhanced Evaluation Metrics for Video Generation Quality},
  author={Ge Ya Luo and Gian Favero and Zhi Hao Luo and Alexia Jolicoeur-Martineau and Christopher Pal},
  year={2024},
  eprint={2410.05203},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2410.05203}
}