Abstract

Current metrics for text-to-image models typically rely on statistical measures which inadequately represent the real preferences of humans. Although recent works attempt to learn these preferences via human-annotated images, they reduce the rich tapestry of human preference to a single overall score. However, preference results vary when humans evaluate images along different aspects. Therefore, to learn the multi-dimensional human preferences, we propose the Multi-dimensional Preference Score (MPS), the first multi-dimensional preference scoring model for the evaluation of text-to-image models. The MPS introduces a preference condition module on top of the CLIP model to learn these diverse preferences. It is trained on our Multi-dimensional Human Preference (MHP) Dataset, which comprises 918,315 human preference choices across 4 dimensions (i.e., aesthetics, semantic alignment, detail quality, and overall assessment) on 607,541 images. The images are generated by a wide range of the latest text-to-image models. The MPS outperforms existing scoring methods across 3 datasets in 4 dimensions, making it a promising metric for evaluating and improving text-to-image generation. The model and dataset will be made publicly available to facilitate future research.

Multi-dimensional Human Preference (MHP) Dataset

To learn multi-dimensional human preferences, we propose the Multi-dimensional Human Preference (MHP) dataset. Compared to prior efforts, the MHP dataset offers significant enhancements in prompt collection, image generation, and preference annotation.
(1) For prompt collection, based on the category schema of Parti, we annotate the collected prompts with 7 category labels (e.g., characters, scenes, objects, animals). For the underrepresented tail categories, we employ Large Language Models (LLMs), e.g., GPT-4, to generate additional prompts. This yields a prompt collection balanced across categories, which is used for subsequent image generation.
(2) For image generation, we utilize not only existing open-source diffusion models and their variants, but also GANs and auto-regressive models. In total, we generate 607,541 images, which are then used to create 918,315 pairwise image comparisons for preference annotation.
(3) For the annotation of human preferences, in contrast to the single overall annotation of existing work, we consider a broader range of preference dimensions and employ human annotators to label each image pair across four dimensions: aesthetics, detail quality, semantic alignment, and overall score.
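The exact data schema of MHP is not given here, so the following is only a minimal Python sketch of how such pairwise annotation records could be assembled from images generated for the same prompt. All field names (`prompt`, `model`, `image_path`, `choices`), the dimension keys, and the sampling rule are illustrative assumptions rather than the released format.

```python
from itertools import combinations
import random

# Hypothetical record layout: one entry per image pair, with a preference
# choice to be filled in by annotators for each of the four dimensions.
DIMENSIONS = ["aesthetics", "detail_quality", "semantic_alignment", "overall"]

def build_pairs(prompt, generations, pairs_per_prompt=3):
    """Sample image pairs generated for one prompt; each pair is later
    annotated with a preference choice on every dimension."""
    all_pairs = list(combinations(generations, 2))
    random.shuffle(all_pairs)
    records = []
    for img_a, img_b in all_pairs[:pairs_per_prompt]:
        records.append({
            "prompt": prompt,
            "image_a": img_a["image_path"],
            "image_b": img_b["image_path"],
            # to be filled by human annotators: "a", "b", or "tie" per dimension
            "choices": {dim: None for dim in DIMENSIONS},
        })
    return records

# Usage with two hypothetical generations for one prompt
pairs = build_pairs(
    "a corgi wearing sunglasses on the beach",
    [{"model": "model_a", "image_path": "gen/0001.png"},
     {"model": "model_b", "image_path": "gen/0002.png"}],
)
print(pairs[0]["choices"])  # {'aesthetics': None, 'detail_quality': None, ...}
```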

Comparison of text-to-image model quality datasets

| Dataset | Prompt Source | Prompt Annotation | Image Generation Models | Images | Ratings | Annotated Dimensions |
|---|---|---|---|---|---|---|
| DiffusionDB | DiffusionDB | × | Diffusion (1) | 1,819,808 | 0 | None |
| AGIQA-1K | DiffusionDB | × | Diffusion (2) | 1,080 | 23,760 | Overall |
| PickScore | Web application | × | Diffusion (3) | 583,747 | 583,747 | Overall |
| ImageReward | DiffusionDB | × | Auto-regressive; Diffusion (6) | 136,892 | 410,676 | Overall |
| HPS | DiffusionDB | × | Diffusion (1) | 98,807 | 98,807 | Overall |
| HPS v2 | DiffusionDB, COCO | ✓ | GAN; Auto-regressive; Diffusion; COCO (9) | 430,060 | 798,090 | Overall |
| AGIQA-3K | DiffusionDB | × | GAN; Auto-regressive; Diffusion (6) | 2,982 | 125,244 | Overall; Alignment |
| MHP (Ours) | DiffusionDB, PromptHero, KOLORS, GPT-4 | ✓ | GAN; Auto-regressive; Diffusion (9) | 607,541 | 918,315 | Aesthetics; Detail; Alignment; Overall |

Multi-dimensional Preference Score (MPS)

To learn human preferences, we propose the Multi-dimensional Preference Score (MPS), a unified model capable of predicting scores under various preference conditions.
(1) A given preference is denoted by a series of descriptive words. For instance, the 'aesthetics' condition is decomposed into words such as 'light', 'color', and 'clarity' that describe the attributes of this condition.
(2) These attribute words are used to compute similarities with the prompt, resulting in a similarity matrix that reflects the correspondence between words in the prompt and the specified condition.
(3) Features from images and text are extracted using a pre-trained vision-language model; the two modalities are then fused through a multimodal cross-attention layer.
(4) The similarity matrix serves as a mask within the cross-attention layer, ensuring that only the text related to the condition is attended to by the visual modality. The fused features are then used to predict the preference scores.
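The steps above can be made concrete with a short PyTorch sketch. This is only a minimal illustration of the masking idea under simplifying assumptions: CLIP-style token embeddings are taken as given, the condition is represented by a handful of attribute-word embeddings, the mask keeps the top-k most relevant prompt tokens (the actual selection rule in MPS may differ), and a single attention layer with mean pooling stands in for the full preference head.

```python
import torch
import torch.nn.functional as F

def condition_mask(prompt_emb, condition_emb, keep_k=6):
    """Similarity between prompt tokens and condition attribute words;
    prompt tokens unrelated to the condition are masked out."""
    sim = F.normalize(prompt_emb, dim=-1) @ F.normalize(condition_emb, dim=-1).T
    relevance = sim.max(dim=-1).values                      # (num_prompt_tokens,)
    keep = torch.zeros_like(relevance, dtype=torch.bool)
    keep[relevance.topk(min(keep_k, relevance.numel())).indices] = True
    return keep

def masked_cross_attention(image_tokens, prompt_emb, keep):
    """Image patches attend only to prompt tokens kept by the condition mask."""
    d = image_tokens.shape[-1]
    scores = image_tokens @ prompt_emb.T / d ** 0.5         # (num_patches, num_tokens)
    scores = scores.masked_fill(~keep, float("-inf"))
    return scores.softmax(dim=-1) @ prompt_emb              # fused features

# Toy usage with random embeddings standing in for CLIP features
torch.manual_seed(0)
prompt_emb = torch.randn(12, 512)        # 12 prompt tokens
condition_emb = torch.randn(3, 512)      # e.g. "light", "color", "clarity"
image_tokens = torch.randn(49, 512)      # a 7x7 grid of image patches

keep = condition_mask(prompt_emb, condition_emb)
fused = masked_cross_attention(image_tokens, prompt_emb, keep)
score = fused.mean()                     # a learned head on the fused features would predict the preference score
```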




Visualization Results

The visualization results indicate that our MPS attends to different regions of prompts and images depending on the specific preference condition. This is attributed to the condition mask, which allows only those words in the prompt related to the preference condition to be observed by the image. The condition mask ensures that the model predicts each preference from different inputs: it only needs to compute the similarity between image patches and the retained part of the prompt to determine the final score. This selective focus therefore allows a unified model to predict multi-dimensional preferences effectively, even when some preferences are only weakly correlated with others.
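As a rough illustration of how such visualizations can be produced, the snippet below reuses the same masking scheme as the sketch in the previous section, with placeholder embeddings in place of the actual CLIP features. It lists the prompt words retained for each condition and reshapes the per-patch attention into a 7x7 heatmap that could be upsampled and overlaid on the image; the prompt, attribute words, top-k rule, and grid size are all assumptions.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
prompt_tokens = ["a", "corgi", "wearing", "sunglasses", "on", "the", "beach"]
prompt_emb = torch.randn(len(prompt_tokens), 512)   # placeholder for CLIP text features
image_tokens = torch.randn(49, 512)                 # placeholder for a 7x7 grid of patch features

# placeholder attribute-word embeddings for two preference conditions
conditions = {"aesthetics": torch.randn(3, 512), "semantic alignment": torch.randn(3, 512)}

for name, cond_emb in conditions.items():
    # keep the prompt tokens most related to this condition
    sim = F.normalize(prompt_emb, dim=-1) @ F.normalize(cond_emb, dim=-1).T
    keep = torch.zeros(len(prompt_tokens), dtype=torch.bool)
    keep[sim.max(dim=-1).values.topk(3).indices] = True

    # masked cross-attention: image patches attend only to the kept tokens
    attn = (image_tokens @ prompt_emb.T / 512 ** 0.5).masked_fill(~keep, float("-inf")).softmax(dim=-1)
    heatmap = attn.max(dim=-1).values.reshape(7, 7)  # per-patch focus, ready to overlay on the image

    print(name, "-> retained prompt words:", [w for w, k in zip(prompt_tokens, keep) if k])
```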



Citation

If our model or paper has been helpful to you, we kindly ask you to cite it as follows:

@inproceedings{MPS,
  title={Learning Multi-dimensional Human Preference for Text-to-Image Generation},
  author={Zhang, Sixian and Wang, Bohan and Wu, Junqiang and Li, Yan and Gao, Tingting and Zhang, Di and Wang, Zhongyuan},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages={8018--8027},
  year={2024}
}

Thanks for your support!