ICCV 2025
Visual quality assessment plays a crucial role in computer vision, especially in tasks like image quality assessment (IQA), image super-resolution, and document image enhancement. Traditional visual quality assessment techniques often rely on scalar metrics such as the peak signal-to-noise ratio (PSNR) and the structural similarity index measure (SSIM), which do not capture the perceptual quality of images or videos as experienced by humans. As visual quality assessment becomes more critical across fields such as medical imaging, satellite remote sensing, and document processing, there is a growing need for more comprehensive evaluation methods that account for human perception more accurately.
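As a concrete illustration of the scalar metrics mentioned above, PSNR can be computed directly from its definition in a few lines of NumPy; the sketch below is illustrative only (SSIM involves a windowed computation and is usually taken from a library such as scikit-image).

```python
import numpy as np

def psnr(reference, test, max_val=255.0):
    """Peak signal-to-noise ratio, computed from its definition:
    PSNR = 10 * log10(MAX^2 / MSE). A purely numerical fidelity score
    that does not model how visible the errors are to a human observer."""
    reference = np.asarray(reference, dtype=np.float64)
    test = np.asarray(test, dtype=np.float64)
    mse = np.mean((reference - test) ** 2)
    if mse == 0.0:
        return float("inf")
    return 10.0 * np.log10((max_val ** 2) / mse)
```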
Additionally, recent advances in multimodal large language models (MLLMs) have expanded the potential of visual quality assessment by incorporating open-ended questions and natural language explanations, enabling a more nuanced understanding of visual quality. However, current methods focus mainly on absolute quality ratings, which are inherently ambiguous. There is therefore an urgent need to explore visual quality comparison with MLLMs, particularly in the context of open-ended comparative assessments. Such an approach would enable a more reliable and consistent evaluation of visual quality and improve the alignment of computer vision models with human perceptual judgments.
The first dataset is ISRGen-QA, which is specifically designed for evaluating state-of-the-art super-resolution (SR) algorithms. The dataset comprises 720 super-resolved images with resolutions of 2040×(1152~1440) (approximately 2K), generated with ×2, ×3, ×4, and ×8 upscaling factors by 14 recent SR algorithms: five GAN-based methods, three diffusion-based approaches, four transformer-based techniques, and two regression-based models. A total of 23 trained observers with normal vision participated in the subjective quality assessments, and the final dataset incorporates scores from 21 participants after anomaly filtering. ISRGen-QA provides both the mean opinion score (MOS) and detailed individual score distributions for each image, enabling in-depth analysis of human perceptual preferences and of the distinct characteristics of different SR algorithms.
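To make the aggregation concrete, the following minimal sketch shows one common way to screen anomalous observers and compute the per-image MOS; the correlation-based rejection rule and the threshold are illustrative assumptions, not the exact protocol used for ISRGen-QA.

```python
import numpy as np

def screen_and_compute_mos(scores, corr_threshold=0.7):
    """scores: array of shape (num_observers, num_images) with raw opinion scores.

    Illustrative observer screening: drop raters whose scores correlate poorly
    with the provisional mean, then recompute the per-image MOS on the
    remaining raters. The rule and threshold are assumptions for this sketch.
    """
    scores = np.asarray(scores, dtype=float)
    provisional_mos = scores.mean(axis=0)
    kept = [
        i for i, row in enumerate(scores)
        if np.corrcoef(row, provisional_mos)[0, 1] >= corr_threshold
    ]
    cleaned = scores[kept]        # individual score distributions after filtering
    mos = cleaned.mean(axis=0)    # mean opinion score per image
    return mos, cleaned, kept
```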
The second dataset, DIQA-5000, is a document image perceptual quality assessment dataset. It comprises 5,000 sets of document image samples, each containing a user-captured raw document image and its corresponding enhanced image produced by document enhancement algorithms. The raw document images were captured with a variety of mobile devices under different resolutions, lighting conditions, and shooting angles, and cover a range of document types. The enhancement process applies random combinations of multiple algorithms, including document cropping, distortion correction, image enhancement, and clutter removal. The dataset employs MOS as the evaluation criterion: 15 human raters assess document quality along five key dimensions, namely overall document quality, image clarity, geometric rectification, color fidelity, and image cleanliness.
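A minimal sketch of how one sample and its per-dimension MOS might be represented is shown below; the class, field, and dimension names are hypothetical and do not reflect the official file format.

```python
from dataclasses import dataclass
from typing import Dict, List

DIMENSIONS = ["overall", "clarity", "rectification", "color_fidelity", "cleanliness"]

# Hypothetical record layout for one DIQA-5000 sample; names are illustrative.
@dataclass
class DocQASample:
    raw_image_path: str              # user-captured document photo
    enhanced_image_path: str         # output of the enhancement pipeline
    ratings: Dict[str, List[float]]  # per-dimension scores from the 15 raters

    def mos(self) -> Dict[str, float]:
        """Per-dimension mean opinion score, averaged over raters."""
        return {dim: sum(r) / len(r) for dim, r in self.ratings.items()}

sample = DocQASample(
    raw_image_path="raw/0001.jpg",
    enhanced_image_path="enhanced/0001.jpg",
    ratings={dim: [3.0] * 15 for dim in DIMENSIONS},  # placeholder scores
)
print(sample.mos())
```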
The third dataset is the short-video engagement prediction dataset, EVQA-SnapUGC. The dataset comprises 90,000 short videos, all published on Snapchat Spotlight. For each video, we have curated corresponding aggregated engagement data derived from viewing statistics, e.g., the average watch time and the engagement continuation rate (the probability that the watch time exceeds 5 s). All short videos in the dataset have durations ranging from 10 to 60 seconds. To mitigate sampling bias from small view counts, only videos with more than 2,000 views are selected. The dataset is notably diverse, encompassing a wide range of video categories such as Family, Food & Dining, Pets, Hobbies, Travel, Music Appreciation, and Sports.
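The two aggregated labels described above reduce to simple statistics over per-view watch times; the sketch below shows the intended computation, with the function name and input format as assumptions for illustration.

```python
import numpy as np

def engagement_labels(watch_times_s, min_views=2000, threshold_s=5.0):
    """watch_times_s: per-view watch times (in seconds) for one short video.

    Returns the two aggregated labels described above, or None when the
    video has too few views to be included (view count must exceed 2,000).
    """
    watch_times_s = np.asarray(watch_times_s, dtype=float)
    if watch_times_s.size <= min_views:
        return None                                            # filtered out: too few views
    avg_watch_time = watch_times_s.mean()                      # average watch time
    continuation_rate = (watch_times_s > threshold_s).mean()   # P(watch time > 5 s)
    return avg_watch_time, continuation_rate
```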
The fourth dataset is Co-Instruct-690K, a large-scale dataset for visual quality comparison comprising 420K coarse-grained and 270K fine-grained open-ended question-answer pairs. A hybrid data collection strategy integrates structured comparisons from existing IQA datasets with synthetic comparative annotations derived from Merge2Compare and Teach2Compare, two complementary methods that generate pseudo-supervised labels from human-labeled single-image quality descriptions and from high-accuracy GPT-4o comparisons, respectively. The test set contains 3,200 expert-crafted multiple-choice questions (MCQs), covering 2,000 pairwise, 600 triple-image, and 600 quad-image comparisons. In addition to standard question types (Yes/No, What, How), VQC-Bench introduces "Which" questions, explicitly designed for comparative reasoning, a core aspect of human visual evaluation. A team of 10 experts annotated the questions, with each MCQ answer cross-verified by another expert.
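For illustration, a comparison MCQ and its accuracy-based evaluation could be represented as below; the class, field names, and predict interface are assumptions for this sketch rather than the released data format.

```python
from dataclasses import dataclass
from typing import Callable, List

# Illustrative structure for one comparison MCQ; field names are assumptions.
@dataclass
class CompareMCQ:
    images: List[str]      # 2, 3, or 4 images being compared
    question: str          # e.g. a "Which" question about relative quality
    options: List[str]     # candidate answers
    answer: int            # index of the expert-verified correct option

def mcq_accuracy(questions: List[CompareMCQ],
                 predict: Callable[[List[str], str, List[str]], int]) -> float:
    """Fraction of MCQs where the model's chosen option matches the key."""
    correct = sum(
        predict(q.images, q.question, q.options) == q.answer for q in questions
    )
    return correct / max(len(questions), 1)
```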
Prof. Alan Bovik (HonFRPS) holds the Cockrell Family Endowed Regents Chair in Engineering in the Chandra Family Department of Electrical and Computer Engineering in the Cockrell School of Engineering at The University of Texas at Austin, where he is Director of the Laboratory for Image and Video Engineering (LIVE). He is a faculty member in the Department of Electrical and Computer Engineering, the Wireless Networking and Communication Group, and the Institute for Neuroscience. His research interests include digital television, digital photography, visual perception, social media, and image and video processing. His work broadly focuses on creating new theories and algorithms that allow for the perceptually optimized streaming and sharing of visual media. The outcomes of his work help ensure the visual satisfaction of billions of viewers worldwide while substantially reducing global bandwidth consumption. He has published over 1,000 technical articles in these areas. His publications have been cited more than 175,000 times in the literature, his H-index is above 135, and he is listed as a Highly-Cited Researcher by The Web of Science Group. His books include the Handbook of Image and Video Processing (Academic Press, 2000, 2005), Modern Image Quality Assessment (2006), and the companion volumes The Essential Guides to Image and Video Processing (Academic Press, 2009).
Dr. Balu Adsumilli (IEEE Fellow) is the Head of the Media Algorithms group at YouTube/Google, where he and his team research and develop algorithms that transform uploaded videos into formats played across all devices. Over the years, he has been instrumental in building and scaling technologies in the areas of video processing, computer vision, video compression, and video quality, which garnered two Technology and Engineering Emmy Awards for Google. Prior to YouTube, he was the Director of Advanced Technology at GoPro, where he led the Camera Architecture and Advanced Software teams and developed their ProTune mode in collaboration with ACES and Technicolor. This paved the way for GoPro cameras to capture industry-neutral formats and enabled their widespread applicability in the movie and television industry. Dr. Adsumilli serves on the board of the Television Academy, on the Visual Effects Society board, on the NATAS technical committee, on the IEEE Multimedia Signal Processing (MMSP) Technical Committee, on the IEEE Image, Video, and Multidimensional Signal Processing (IVMSP) Technical Committee, and on the ACM Mile High Video Steering Committee. He has co-authored 125+ technical publications and holds 200+ US patents. He has served on technical program and organizing committees for various conferences and has organized numerous workshops. He is a Fellow of IEEE and an active member of ACM, SMPTE, VES, SPIE, and the Internet Society. He received his PhD from the University of California, Santa Barbara, and his master's degree from the University of Wisconsin–Madison.