Abstract

Current metrics for text-to-image models typically rely on statistical measures which inadequately represent the real preferences of humans. Although recent works attempt to learn these preferences via human-annotated images, they reduce the rich tapestry of human preference to a single overall score. However, preference results vary when humans evaluate images along different aspects. Therefore, to learn the multi-dimensional human preferences, we propose the Multi-dimensional Preference Score (MPS), the first multi-dimensional preference scoring model for the evaluation of text-to-image models. The MPS introduces a preference condition module on top of the CLIP model to learn these diverse preferences. It is trained on our Multi-dimensional Human Preference (MHP) Dataset, which comprises 918,315 human preference choices across 4 dimensions (i.e., aesthetics, semantic alignment, detail quality, and overall assessment) on 607,541 images. The images are generated by a wide range of the latest text-to-image models. The MPS outperforms existing scoring methods across 3 datasets in 4 dimensions, making it a promising metric for evaluating and improving text-to-image generation. The model and dataset will be made publicly available to facilitate future research.

Multi-dimensional Human Preference (MHP) Dataset

To learn multi-dimensional human preferences, we propose the Multi-dimensional Human Preference (MHP) dataset. Compared to prior efforts, the MHP dataset offers significant enhancements in prompt collection, image generation, and preference annotation.
(1) For prompt collection, based on the category schema of Parti, we annotate the collected prompts with 7 category labels (e.g., characters, scenes, objects, animals). For the underrepresented tail categories, we employ Large Language Models (LLMs), e.g., GPT-4, to generate additional prompts. This yields a prompt collection balanced across categories, which is used for subsequent image generation.
(2) For image generation, we utilize not only existing open-source diffusion models and their variants, but also GANs and auto-regressive models. In total, we generate 607,541 images, which are then used to create 918,315 pairwise image comparisons for preference annotation.
(3) For the annotation of human preferences, in contrast to the single overall annotation of existing work, we consider a broader range of preference dimensions and employ human annotators to label each image pair across four dimensions: aesthetics, detail quality, semantic alignment, and overall score.
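The exact data schema of MHP is not given here, so the following is only a minimal Python sketch of how such pairwise annotation records could be assembled from images generated for the same prompt. All field names (`prompt`, `model`, `image_path`, `choices`), the dimension keys, and the sampling rule are illustrative assumptions rather than the released format.

```python
from itertools import combinations
import random

# Hypothetical record layout: one entry per image pair, with a preference
# choice to be filled in by annotators for each of the four dimensions.
DIMENSIONS = ["aesthetics", "detail_quality", "semantic_alignment", "overall"]

def build_pairs(prompt, generations, pairs_per_prompt=3):
    """Sample image pairs generated for one prompt; each pair is later
    annotated with a preference choice on every dimension."""
    all_pairs = list(combinations(generations, 2))
    random.shuffle(all_pairs)
    records = []
    for img_a, img_b in all_pairs[:pairs_per_prompt]:
        records.append({
            "prompt": prompt,
            "image_a": img_a["image_path"],
            "image_b": img_b["image_path"],
            # to be filled by human annotators: "a", "b", or "tie" per dimension
            "choices": {dim: None for dim in DIMENSIONS},
        })
    return records

# Usage with two hypothetical generations for one prompt
pairs = build_pairs(
    "a corgi wearing sunglasses on the beach",
    [{"model": "model_a", "image_path": "gen/0001.png"},
     {"model": "model_b", "image_path": "gen/0002.png"}],
)
print(pairs[0]["choices"])  # {'aesthetics': None, 'detail_quality': None, ...}
```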

Comparison of text-to-image model quality datasets

| Dataset | Prompt Source | Prompt Annotation | Image Generation Models | Images | Ratings | Annotated Dimensions |
|---|---|---|---|---|---|---|
| DiffusionDB | DiffusionDB | × | Diffusion (1) | 1,819,808 | 0 | None |
| AGIQA-1K | DiffusionDB | × | Diffusion (2) | 1,080 | 23,760 | Overall |
| PickScore | Web application | × | Diffusion (3) | 583,747 | 583,747 | Overall |
| ImageReward | DiffusionDB | × | Auto-regressive; Diffusion (6) | 136,892 | 410,676 | Overall |
| HPS | DiffusionDB | × | Diffusion (1) | 98,807 | 98,807 | Overall |
| HPS v2 | DiffusionDB, COCO | ✓ | GAN; Auto-regressive; Diffusion; COCO (9) | 430,060 | 798,090 | Overall |
| AGIQA-3K | DiffusionDB | × | GAN; Auto-regressive; Diffusion (6) | 2,982 | 125,244 | Overall; Alignment |
| MHP (Ours) | DiffusionDB, PromptHero, KOLORS, GPT-4 | ✓ | GAN; Auto-regressive; Diffusion (9) | 607,541 | 918,315 | Aesthetics; Detail; Alignment; Overall |

Multi-dimensional Preference Score (MPS)

To learn human preferences, we propose the Multi-dimensional Preference Score (MPS), a unified model capable of predicting scores under various preference conditions.
(1) A given preference is denoted by a series of descriptive words. For instance, the 'aesthetics' condition is decomposed into words such as 'light', 'color', and 'clarity' that describe the attributes of this condition.
(2) These attribute words are used to compute similarities with the prompt, resulting in a similarity matrix that reflects the correspondence between words in the prompt and the specified condition.
(3) Features from images and text are extracted using a pre-trained vision-language model; the two modalities are then fused through a multimodal cross-attention layer.
(4) The similarity matrix serves as a mask within the cross-attention layer, ensuring that only the text related to the condition is attended to by the visual modality. The fused features are then used to predict the preference scores.
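The steps above can be made concrete with a short PyTorch sketch. This is only a minimal illustration of the masking idea under simplifying assumptions: CLIP-style token embeddings are taken as given, the condition is represented by a handful of attribute-word embeddings, the mask keeps the top-k most relevant prompt tokens (the actual selection rule in MPS may differ), and a single attention layer with mean pooling stands in for the full preference head.

```python
import torch
import torch.nn.functional as F

def condition_mask(prompt_emb, condition_emb, keep_k=6):
    """Similarity between prompt tokens and condition attribute words;
    prompt tokens unrelated to the condition are masked out."""
    sim = F.normalize(prompt_emb, dim=-1) @ F.normalize(condition_emb, dim=-1).T
    relevance = sim.max(dim=-1).values                      # (num_prompt_tokens,)
    keep = torch.zeros_like(relevance, dtype=torch.bool)
    keep[relevance.topk(min(keep_k, relevance.numel())).indices] = True
    return keep

def masked_cross_attention(image_tokens, prompt_emb, keep):
    """Image patches attend only to prompt tokens kept by the condition mask."""
    d = image_tokens.shape[-1]
    scores = image_tokens @ prompt_emb.T / d ** 0.5         # (num_patches, num_tokens)
    scores = scores.masked_fill(~keep, float("-inf"))
    return scores.softmax(dim=-1) @ prompt_emb              # fused features

# Toy usage with random embeddings standing in for CLIP features
torch.manual_seed(0)
prompt_emb = torch.randn(12, 512)        # 12 prompt tokens
condition_emb = torch.randn(3, 512)      # e.g. "light", "color", "clarity"
image_tokens = torch.randn(49, 512)      # a 7x7 grid of image patches

keep = condition_mask(prompt_emb, condition_emb)
fused = masked_cross_attention(image_tokens, prompt_emb, keep)
score = fused.mean()                     # a learned head on the fused features would predict the preference score
```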




Visualization Results

The visualization results indicate that our MPS attends to different regions of prompts and images depending on the specific preference condition. This is attributed to the condition mask, which allows only those words in the prompt related to the preference condition to be observed by the image. The condition mask ensures that the model predicts each preference from different inputs: it only needs to compute the similarity between image patches and the retained part of the prompt to determine the final score. This selective focus therefore allows a unified model to predict multi-dimensional preferences effectively, even when some preferences are only weakly correlated with others.
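As a rough illustration of how such visualizations can be produced, the snippet below reuses the same masking scheme as the sketch in the previous section, with placeholder embeddings in place of the actual CLIP features. It lists the prompt words retained for each condition and reshapes the per-patch attention into a 7x7 heatmap that could be upsampled and overlaid on the image; the prompt, attribute words, top-k rule, and grid size are all assumptions.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
prompt_tokens = ["a", "corgi", "wearing", "sunglasses", "on", "the", "beach"]
prompt_emb = torch.randn(len(prompt_tokens), 512)   # placeholder for CLIP text features
image_tokens = torch.randn(49, 512)                 # placeholder for a 7x7 grid of patch features

# placeholder attribute-word embeddings for two preference conditions
conditions = {"aesthetics": torch.randn(3, 512), "semantic alignment": torch.randn(3, 512)}

for name, cond_emb in conditions.items():
    # keep the prompt tokens most related to this condition
    sim = F.normalize(prompt_emb, dim=-1) @ F.normalize(cond_emb, dim=-1).T
    keep = torch.zeros(len(prompt_tokens), dtype=torch.bool)
    keep[sim.max(dim=-1).values.topk(3).indices] = True

    # masked cross-attention: image patches attend only to the kept tokens
    attn = (image_tokens @ prompt_emb.T / 512 ** 0.5).masked_fill(~keep, float("-inf")).softmax(dim=-1)
    heatmap = attn.max(dim=-1).values.reshape(7, 7)  # per-patch focus, ready to overlay on the image

    print(name, "-> retained prompt words:", [w for w, k in zip(prompt_tokens, keep) if k])
```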



Citation

If our model or paper has been helpful to you, we kindly ask you to cite it as follows:

@inproceedings{MPS,
  title={Learning Multi-dimensional Human Preference for Text-to-Image Generation},
  author={Zhang, Sixian and Wang, Bohan and Wu, Junqiang and Li, Yan and Gao, Tingting and Zhang, Di and Wang, Zhongyuan},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages={8018--8027},
  year={2024}
}

Thanks for your support!