Table of Contents
cs.CV
[1] Multimodal Cinematic Video Synthesis Using Text-to-Image and Audio Generation Models
Sridhar S,Nithin A,Shakeel Rifath,Vasantha Raj
Main category: cs.CV
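TL;DR: This paper proposes a method for automatically generating 60-second cinematic videos using Stable Diffusion, GPT-2, and a hybrid audio pipeline, combined with linear frame interpolation and post-processing, to achieve high-quality text-to-video synthesis.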
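As a rough illustration of the linear frame interpolation step this entry mentions, the sketch below blends two key frames into evenly spaced intermediate frames. It is a minimal sketch in plain Python, not the authors' implementation: the function name `lerp_frames`, the nested-list grayscale frame format, and the `num_intermediate` parameter are all assumptions made for the example.

```python
def lerp_frames(frame_a, frame_b, num_intermediate):
    """Return `num_intermediate` frames blended linearly between two key frames.

    Frames are hypothetical 2-D lists of grayscale pixel values; a real
    pipeline would more likely operate on RGB image arrays.
    """
    frames = []
    for k in range(1, num_intermediate + 1):
        # Blend weight moves evenly from frame_a (t near 0) to frame_b (t near 1).
        t = k / (num_intermediate + 1)
        blended = [
            [(1.0 - t) * a + t * b for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(frame_a, frame_b)
        ]
        frames.append(blended)
    return frames


# Example: three in-between frames for a 1x2 "image".
in_between = lerp_frames([[0.0, 100.0]], [[100.0, 0.0]], 3)
```

With three intermediate frames the blend weights are 0.25, 0.5, and 0.75; inserting such frames between generated key frames is one simple way to raise the effective frame rate.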
Details
Motivation: Use generative AI to improve multimedia creation by automatically producing high-quality cinematic videos from text input, supporting creative, educational, and industrial applications.
Method: Stable Diffusion generates high-fidelity images, GPT-2 structures the narrative, and a hybrid audio pipeline (gTTS and YouTube-sourced music) handles sound; a five-scene framework, frame interpolation, and post-processing refine video quality.
Result: Experiments show strong visual quality, narrative coherence, and efficiency, with support for resolutions up to 1024x768 and frame rates of 15-30 FPS.
Conclusion: The method advances text-to-video synthesis and offers an efficient, reliable solution for applications across multiple domains.
Abstract: Advances in generative artificial intelligence have altered multimedia creation, allowing for automatic cinematic video synthesis from text inputs. This work describes a method for creating 60-second cinematic movies incorporating Stable Diffusion for high-fidelity image synthesis, GPT-2 for narrative structuring, and a hybrid audio pipeline using gTTS and YouTube-sourced music. It uses a five-scene framework, which is augmented by linear frame interpolation, cinematic post-processing (e.g., sharpening), and audio-video synchronization to provide professional-quality results. It was created in a GPU-accelerated Google Colab environment using Python 3.11. It has a dual-mode Gradio interface (Simple and Advanced), which supports resolutions of up to 1024x768 and frame rates of 15-30 FPS. Optimizations such as CUDA memory management and error handling ensure reliability. The experiments demonstrate outstanding visual quality, narrative coherence, and efficiency, furthering text-to-video synthesis for creative, educational, and industrial applications.
[2] LoRA-Edit: Controllable First-Frame-Guided Video Editing via Mask-Aware LoRA Fine-Tuning
Chenjian Gao,Lihe Ding,Xin Cai,Zhanpeng Huang,Zibin Wang,Tianfan Xue
Main category: cs.CV
TL;DR: Proposes a mask-based LoRA tuning method for flexible video editing that requires no changes to the model architecture.
Details
Motivation: Existing video editing methods rely on large-scale pretraining and lack flexibility, in particular offering little control over frames after the first.
Method: A mask-driven LoRA tuning strategy that combines the input video with reference images and dynamically modulates what the model attends to.
Result: Experiments show the method outperforms existing techniques in video editing performance.
Conclusion: The method enables efficient, flexible video editing and addresses the limitations of existing approaches.
Abstract: Video editing using diffusion models has achieved remarkable results in generating high-quality edits for videos. However, current methods often rely on large-scale pretraining, limiting flexibility for specific edits. First-frame-guided editing provides control over the first frame, but lacks flexibility over subsequent frames. To address this, we propose a mask-based LoRA (Low-Rank Adaptation) tuning method that adapts pretrained Image-to-Video (I2V) models for flexible video editing. Our approach preserves background regions while enabling controllable edit propagation. This solution offers efficient and adaptable video editing without altering the model architecture. To better steer this process, we incorporate additional references, such as alternate viewpoints or representative scene states, which serve as visual anchors for how content should unfold. We address the control challenge using a mask-driven LoRA tuning strategy that adapts a pre-trained image-to-video model to the editing context. The model must learn from two distinct sources: the input video provides spatial structure and motion cues, while reference images offer appearance guidance. A spatial mask enables region-specific learning by dynamically modulating what the model attends to, ensuring that each area draws from the appropriate source. Experimental results show our method achieves superior video editing performance compared to state-of-the-art methods.
[3] DeepTraverse: A Depth-First Search Inspired Network for Algorithmic Visual Understanding
Bin Guo,John H. L. Hansen
Main category: cs.CV
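TL;DR: DeepTraverse is a new vision architecture inspired by algorithmic search strategies; it learns features through systematic elucidation and adaptive refinement and outperforms conventional approaches.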
Details
Motivation: Conventional vision backbones build features in a largely uniform way and lack pathways for adaptive, iterative refinement; can principles from search algorithms yield a more structured, logical processing flow?
Method: DeepTraverse combines recursive exploration modules (which deepen feature analysis) with adaptive calibration modules (which dynamically adjust feature salience), constructing features through their algorithmic interplay.
Result: Across multiple image classification benchmarks, DeepTraverse performs strongly, with classification accuracy and feature discrimination exceeding those of conventional models.
Conclusion: Integrating algorithmic priors is an effective strategy for building more efficient, performant, and structured vision backbones.
Abstract: Conventional vision backbones, despite their success, often construct features through a largely uniform cascade of operations, offering limited explicit pathways for adaptive, iterative refinement. This raises a compelling question: can principles from classical search algorithms instill a more algorithmic, structured, and logical processing flow within these networks, leading to representations built through more interpretable, perhaps reasoning-like decision processes? We introduce DeepTraverse, a novel vision architecture directly inspired by algorithmic search strategies, enabling it to learn features through a process of systematic elucidation and adaptive refinement distinct from conventional approaches. DeepTraverse operationalizes this via two key synergistic components: recursive exploration modules that methodically deepen feature analysis along promising representational paths with parameter sharing for efficiency, and adaptive calibration modules that dynamically adjust feature salience based on evolving global context. The resulting algorithmic interplay allows DeepTraverse to intelligently construct and refine feature patterns. Comprehensive evaluations across a diverse suite of image classification benchmarks show that DeepTraverse achieves highly competitive classification accuracy and robust feature discrimination, often outperforming conventional models with similar or larger parameter counts. Our work demonstrates that integrating such algorithmic priors provides a principled and effective strategy for building more efficient, performant, and structured vision backbones.
[4] Test-Time Adaptation for Generalizable Task Progress Estimation
Christos Ziakas,Alessandra Russo
Main category: cs.CV
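TL;DR: Proposes a test-time adaptation method that optimizes a self-supervised objective so that a progress estimation model can adapt online to the visual and temporal context of test trajectories.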
Details
Motivation: Progress estimation models generalize poorly across diverse tasks, environments, and embodiments.
Method: A gradient-based meta-learning strategy trains the model on expert visual trajectories and natural language task descriptions, so that test-time adaptation improves progress estimation based on semantic content.
Result: The method outperforms the state-of-the-art in-context learning approach across diverse tasks.
Conclusion: Test-time adaptation significantly improves generalization to out-of-distribution tasks.
Abstract: We propose a test-time adaptation method that enables a progress estimation model to adapt online to the visual and temporal context of test trajectories by optimizing a learned self-supervised objective. To this end, we introduce a gradient-based meta-learning strategy to train the model on expert visual trajectories and their natural language task descriptions, such that test-time adaptation improves progress estimation relying on semantic content over temporal order. Our test-time adaptation method generalizes from a single training environment to diverse out-of-distribution tasks, environments, and embodiments, outperforming the state-of-the-art in-context learning approach using autoregressive vision-language models.
[5] EfficientVLA: Training-Free Acceleration and Compression for Vision-Language-Action Models
Yantai Yang,Yuhao Wang,Zichen Wen,Luo Zhongwei,Chang Zou,Zhipeng Zhang,Chuan Wen,Linfeng Zhang
Main category: cs.CV
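TL;DR: EfficientVLA is a training-free inference acceleration framework that systematically removes redundancy to significantly improve the inference speed and efficiency of Vision-Language-Action models.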
Details
Motivation: Existing VLA models suffer from heavy computational and memory redundancy that limits practical deployment, calling for a holistic solution.
Method: Combines three strategies: layer pruning in the language module, task-aware optimization of the visual pathway, and feature caching in the diffusion action head.
Result: On the CogACT model, achieves a 1.93X speedup and reduces FLOPs to 28.9%, with performance dropping by only 0.6%.
Conclusion: EfficientVLA offers an efficient, low-loss inference acceleration scheme for VLA models.
Abstract: Vision-Language-Action (VLA) models, particularly diffusion-based architectures, demonstrate transformative potential for embodied intelligence but are severely hampered by high computational and memory demands stemming from extensive inherent and inference-time redundancies. While existing acceleration efforts often target isolated inefficiencies, such piecemeal solutions typically fail to holistically address the varied computational and memory bottlenecks across the entire VLA pipeline, thereby limiting practical deployability. We introduce EfficientVLA, a structured and training-free inference acceleration framework that systematically eliminates these barriers by cohesively exploiting multifaceted redundancies. EfficientVLA synergistically integrates three targeted strategies: (1) pruning of functionally inconsequential layers from the language module, guided by an analysis of inter-layer redundancies; (2) optimizing the visual processing pathway through a task-aware strategy that selects a compact, diverse set of visual tokens, balancing task-criticality with informational coverage; and (3) alleviating temporal computational redundancy within the iterative diffusion-based action head by strategically caching and reusing key intermediate features. We apply our method to a standard VLA model, CogACT, yielding a 1.93X inference speedup and reducing FLOPs to 28.9%, with only a 0.6% success rate drop on the SIMPLER benchmark.
[6] A Manually Annotated Image-Caption Dataset for Detecting Children in the Wild
Klim Kireev,Ana-Maria Creţu,Raphael Meier,Sarah Adel Bargal,Elissa Redmiles,Carmela Troncoso
Main category: cs.CV
TL;DR: Introduces the Image-Caption Children in the Wild Dataset (ICCWD), a multimodal dataset for detecting images of minors that fills a gap in existing resources.
Details
Motivation: No benchmark dataset currently exists for detecting content depicting minors in a multimodal setting, which hinders the development and evaluation of such tools.
Method: Releases ICCWD, containing 10,000 image-caption pairs manually labeled for the presence of a minor, and uses it to evaluate three detection methods.
Result: The best detection method reaches a 75.3% true positive rate, indicating that detecting minors is a challenging task.
Conclusion: The release of ICCWD is expected to support the design of better minor detection methods across a range of scenarios.
Abstract: Platforms and the law regulate digital content depicting minors (defined as individuals under 18 years of age) differently from other types of content. Given the sheer amount of content that needs to be assessed, machine learning-based automation tools are commonly used to detect content depicting minors. To our knowledge, no dataset or benchmark currently exists for detecting these identification methods in a multi-modal environment. To fill this gap, we release the Image-Caption Children in the Wild Dataset (ICCWD), an image-caption dataset aimed at benchmarking tools that detect depictions of minors. Our dataset is richer than previous child image datasets, containing images of children in a variety of contexts, including fictional depictions and partially visible bodies. ICCWD contains 10,000 image-caption pairs manually labeled to indicate the presence or absence of a child in the image. To demonstrate the possible utility of our dataset, we use it to benchmark three different detectors, including a commercial age estimation system applied to images. Our results suggest that child detection is a challenging task, with the best method achieving a 75.3% true positive rate. We hope the release of our dataset will aid in the design of better minor detection methods in a wide range of scenarios.
[7] Detecção da Psoríase Utilizando Visão Computacional: Uma Abordagem Comparativa Entre CNNs e Vision Transformers
Natanael Lucena,Fábio S. da Silva,Ricardo Rios
Main category: cs.CV
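TL;DR: Compares CNNs and ViTs on multi-class classification of images of psoriasis and similar lesions; ViTs perform better with smaller models, and DaViT-B, with an f1-score of 96.4%, is the most recommended architecture.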
Details
Motivation: Examine the performance differences between CNNs and ViTs on medical image classification, specifically multi-class classification of psoriasis and similar lesions.
Method: CNN and ViT models pre-trained on ImageNet are adapted to a specific dataset and evaluated.
Result: ViTs outperform CNNs, notably the DaViT-B model with an f1-score of 96.4%.
Conclusion: ViTs show significant potential for medical image classification, and DaViT-B is an efficient architecture for automated psoriasis detection.
Abstract: This paper presents a comparison of the performance of Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs) in the task of multi-classifying images containing lesions of psoriasis and diseases similar to it. Models pre-trained on ImageNet were adapted to a specific data set. Both achieved high predictive metrics, but the ViTs stood out for their superior performance with smaller models. Dual Attention Vision Transformer-Base (DaViT-B) obtained the best results, with an f1-score of 96.4%, and is recommended as the most efficient architecture for automated psoriasis detection. This article reinforces the potential of ViTs for medical image classification tasks.
[8] ViCrit: A Verifiable Reinforcement Learning Proxy Task for Visual Perception in VLMs
Xiyao Wang,Zhengyuan Yang,Chao Feng,Yongyuan Liang,Yuhang Zhou,Xiaoyu Liu,Ziyi Zang,Ming Li,Chung-Ching Lin,Kevin Lin,Linjie Li,Furong Huang,Lijuan Wang
Main category: cs.CV
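TL;DR: The ViCrit task trains vision-language models to localize synthetically injected visual hallucinations, improving visual perception and yielding strong results on multiple benchmarks.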
Details
Motivation: Vision-language models lack visual perception tasks that are both challenging and unambiguously verifiable, which has impeded the successful application of reinforcement learning.
Method: Introduces the ViCrit task: a subtle visual description error is injected and the model is trained to localize it, providing a clear binary reward.
Result: Models improve substantially on multiple vision-language benchmarks, and the gains transfer to abstract image reasoning and visual math.
Conclusion: ViCrit is an effective and generalizable way to enhance the visual perception of vision-language models.
Abstract: Reinforcement learning (RL) has shown great effectiveness for fine-tuning large language models (LLMs) using tasks that are challenging yet easily verifiable, such as math reasoning or code generation. However, extending this success to visual perception in vision-language models (VLMs) has been impeded by the scarcity of vision-centric tasks that are simultaneously challenging and unambiguously verifiable. To this end, we introduce ViCrit (Visual Caption Hallucination Critic), an RL proxy task that trains VLMs to localize a subtle, synthetic visual hallucination injected into paragraphs of human-written image captions. Starting from 200-word captions, we inject a single, subtle visual description error (altering a few words on objects, attributes, counts, or spatial relations) and task the model with pinpointing the corrupted span given the image and the modified caption. This formulation preserves the full perceptual difficulty while providing a binary, exact-match reward that is easy to compute and unambiguous. Models trained with the ViCrit Task exhibit substantial gains across a variety of VL benchmarks. Crucially, the improvements transfer beyond natural-image training data to abstract image reasoning and visual math, showing promise of learning to perceive rather than merely memorizing seen objects. To facilitate evaluation, we further introduce ViCrit-Bench, a category-balanced diagnostic benchmark that systematically probes perception errors across diverse image domains and error types. Together, our results demonstrate that fine-grained hallucination criticism is an effective and generalizable objective for enhancing visual perception in VLMs.
[9] RoCA: Robust Cross-Domain End-to-End Autonomous Driving
Rajeev Yasarla,Shizhong Han,Hsin-Pai Cheng,Litian Liu,Shweta Mahajan,Apratim Bhattacharyya,Yunxiao Shi,Risheek Garrepalli,Hong Cai,Fatih Porikli
Main category: cs.CV
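TL;DR: RoCA is a framework for cross-domain end-to-end autonomous driving that improves generalization through joint probabilistic distribution modeling and basis token learning, adapting to new domains without extra inference computation.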
Details
Motivation: Existing end-to-end autonomous driving methods face insufficient performance and high retraining costs when deployed across domains; RoCA aims to address these challenges.
Method: RoCA models the joint probabilistic distribution with a Gaussian process, learns basis tokens and their trajectories, supports probabilistic inference of future trajectories, and is combined with a base model to improve generalization.
Result: RoCA performs well across a variety of cross-domain scenarios, significantly outperforming direct finetuning and achieving strong domain generalization and adaptation.
Conclusion: RoCA offers an efficient and robust solution for cross-domain end-to-end autonomous driving, with potential for practical deployment.
Abstract: End-to-end (E2E) autonomous driving has recently emerged as a new paradigm, offering significant potential. However, few studies have looked into the practical challenge of deployment across domains (e.g., cities). Although several works have incorporated Large Language Models (LLMs) to leverage their open-world knowledge, LLMs do not guarantee cross-domain driving performance and may incur prohibitive retraining costs during domain adaptation. In this paper, we propose RoCA, a novel framework for robust cross-domain E2E autonomous driving. RoCA formulates the joint probabilistic distribution over the tokens that encode ego and surrounding vehicle information in the E2E pipeline. Instantiating with a Gaussian process (GP), RoCA learns a set of basis tokens with corresponding trajectories, which span diverse driving scenarios. Then, given any driving scene, it is able to probabilistically infer the future trajectory. By using RoCA together with a base E2E model in source-domain training, we improve the generalizability of the base model, without requiring extra inference computation. In addition, RoCA enables robust adaptation on new target domains, significantly outperforming direct finetuning. We extensively evaluate RoCA on various cross-domain scenarios and show that it achieves strong domain generalization and adaptation performance.
[10] SPARKE: Scalable Prompt-Aware Diversity Guidance in Diffusion Models via RKE Score
Mohammad Jalali,Haoyu Lei,Amin Gohari,Farzan Farnia
Main category: cs.CV
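TL;DR: Proposes SPARKE, a method that uses conditional entropy for prompt-aware diversity guidance, addressing the computational challenges of generating diverse samples with diffusion models.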
Details
Motivation: Diffusion models excel at high-fidelity image generation, but producing diverse samples under prompt guidance remains challenging, especially when diversity must be evaluated across semantically similar prompts.
Method: SPARKE uses conditional entropy for diversity guidance, dynamically conditioning the diversity measurement on similar prompts, and reduces computational complexity from $O(n^3)$ to $O(n)$.
Result: Experiments show that SPARKE significantly improves the prompt-aware diversity of generated data across several text-to-image diffusion models at low computational cost.
Conclusion: SPARKE effectively addresses the computational cost of prompt-aware diversity guidance and offers a practical solution for large-scale generation tasks.
Abstract: Diffusion models have demonstrated remarkable success in high-fidelity image synthesis and prompt-guided generative modeling. However, ensuring adequate diversity in generated samples of prompt-guided diffusion models remains a challenge, particularly when the prompts span a broad semantic spectrum and the diversity of generated data needs to be evaluated in a prompt-aware fashion across semantically similar prompts. Recent methods have introduced guidance via diversity measures to encourage more varied generations. In this work, we extend the diversity measure-based approaches by proposing the Scalable Prompt-Aware Rényi Kernel Entropy Diversity Guidance (SPARKE) method for prompt-aware diversity guidance. SPARKE utilizes conditional entropy for diversity guidance, which dynamically conditions diversity measurement on similar prompts and enables prompt-aware diversity control. While the entropy-based guidance approach enhances prompt-aware diversity, its reliance on the matrix-based entropy scores poses computational challenges in large-scale generation settings. To address this, we focus on the special case of Conditional latent RKE Score Guidance, reducing entropy computation and gradient-based optimization complexity from the $O(n^3)$ of general entropy measures to $O(n)$. The reduced computational complexity allows for diversity-guided sampling over potentially thousands of generation rounds on different prompts. We numerically test the SPARKE method on several text-to-image diffusion models, demonstrating that the proposed method improves the prompt-aware diversity of the generated data without incurring significant computational costs.
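To make the quantity behind this guidance more concrete, the sketch below computes an order-2 Rényi kernel entropy (RKE) score, taken here as $1/\lVert K/n \rVert_F^2$ for a kernel matrix $K$ with unit diagonal. It is an illustrative sketch under stated assumptions, not the SPARKE code: the Gaussian kernel, the `bandwidth` parameter, and the `RunningRKE` helper are hypothetical, and the conditional, latent-space guidance described in the abstract is not reproduced. The running variant only suggests why a score built from squared kernel entries can be updated with $O(n)$ kernel evaluations per new sample, with no eigendecomposition.

```python
import math


def gaussian_kernel(x, y, bandwidth=1.0):
    """Gaussian kernel with k(x, x) = 1 (an assumed choice of kernel)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def rke_score(samples, bandwidth=1.0):
    """Batch order-2 score: n^2 divided by the sum of squared kernel entries."""
    n = len(samples)
    sq_sum = sum(
        gaussian_kernel(x, y, bandwidth) ** 2 for x in samples for y in samples
    )
    return n * n / sq_sum


class RunningRKE:
    """Hypothetical helper: maintains the same score one sample at a time."""

    def __init__(self, bandwidth=1.0):
        self.bandwidth = bandwidth
        self.samples = []
        self.sq_sum = 0.0  # sum of k(x_i, x_j)^2 over all ordered pairs

    def add(self, x):
        # Only the n kernel values against existing samples are new.
        cross = sum(
            gaussian_kernel(x, y, self.bandwidth) ** 2 for y in self.samples
        )
        self.sq_sum += 2.0 * cross + 1.0  # symmetric pairs plus k(x, x)^2 = 1
        self.samples.append(x)
        n = len(self.samples)
        return n * n / self.sq_sum
```

Under this definition the score is 1 when all samples coincide and approaches n when the n samples are mutually far apart, which is why it can be read as a rough count of distinct modes.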
We release our code on the project page: https://mjalali.github.io/SPARKE
[11] Retrieval of Surface Solar Radiation through Implicit Albedo Recovery from Temporal Context
Yael Frischholz,Devis Tuia,Michael Lehning
Main category: cs.CV
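TL;DR: Proposes an attention-based emulator that learns clear-sky surface reflectance from satellite image sequences, addressing the inaccurate background reflectance estimates that traditional methods produce in mountainous regions due to changing snow cover.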
Details
Motivation: Traditional methods estimate background reflectance from monthly statistics, which fails in mountainous regions where snow cover changes frequently; a method that learns surface reflectance dynamically is needed.
Method: An attention-based emulator built on a temporo-spatial vision transformer, requiring no hand-crafted features (such as albedo maps or cloud masks), trained on multi-spectral satellite imagery, topographic features, and solar geometry.
Result: Given a sufficiently long temporal context, the model performs on par with albedo-informed models, and it does especially well in mountainous regions.
Conclusion: The method implicitly learns surface reflectance dynamics and improves generalization in complex terrain; code and data are publicly available.
Abstract: Accurate retrieval of surface solar radiation (SSR) from satellite imagery critically depends on estimating the background reflectance that a spaceborne sensor would observe under clear-sky conditions. Deviations from this baseline can then be used to detect cloud presence and guide radiative transfer models in inferring atmospheric attenuation. Operational retrieval algorithms typically approximate background reflectance using monthly statistics, assuming surface properties vary slowly relative to atmospheric conditions. However, this approach fails in mountainous regions where intermittent snow cover and changing snow surfaces are frequent. We propose an attention-based emulator for SSR retrieval that implicitly learns to infer clear-sky surface reflectance from raw satellite image sequences. Built on the Temporo-Spatial Vision Transformer, our approach eliminates the need for hand-crafted features such as explicit albedo maps or cloud masks. The emulator is trained on instantaneous SSR estimates from the HelioMont algorithm over Switzerland, a region characterized by complex terrain and dynamic snow cover. Inputs include multi-spectral SEVIRI imagery from the Meteosat Second Generation platform, augmented with static topographic features and solar geometry. The target variable is HelioMont's SSR, computed as the sum of its direct and diffuse horizontal irradiance components, given at a spatial resolution of 1.7 km. We show that, when provided a sufficiently long temporal context, the model matches the performance of albedo-informed models, highlighting the model's ability to internally learn and exploit latent surface reflectance dynamics.
Our geospatial analysis shows this effect is most powerful in mountainous regions and improves generalization in both simple and complex topographic settings. Code and datasets are publicly available at https://github.com/frischwood/HeMu-dev.git
[12] Attention, Please! Revisiting Attentive Probing for Masked Image Modeling
Bill Psomas,Dionysis Christopoulos,Eirini Baltzi,Ioannis Kakogeorgiou,Tilemachos Aravanis,Nikos Komodakis,Konstantinos Karantzalos,Yannis Avrithis,Giorgos Tolias
Main category: cs.CV
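TL;DR: Proposes efficient probing (EP), which uses a multi-query cross-attention mechanism to improve both the performance and efficiency of attentive probing, with significant gains in speed and parameter efficiency.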
Details
Motivation: Linear probing (LP) does not adequately reflect the potential of masked image modeling (MIM), and existing attentive probing methods are over-parameterized and computationally inefficient, so a more efficient probing method is needed.
Method: Efficient probing (EP) uses a multi-query cross-attention mechanism that removes redundant projections and reduces trainable parameters, improving computational efficiency.
Result: EP outperforms LP and existing attentive probing methods on seven benchmarks, generalizes well, produces interpretable attention maps, and performs strongly in low-shot and layer-wise settings.
Conclusion: EP is an efficient, general probing method that significantly improves the performance and efficiency of attentive probing.
Abstract: As fine-tuning (FT) becomes increasingly impractical at scale, probing is emerging as the preferred evaluation protocol for self-supervised learning (SSL). Yet, the standard linear probing (LP) fails to adequately reflect the potential of models trained with Masked Image Modeling (MIM), due to the distributed nature of patch tokens. This motivates the need for attentive probing, an alternative that uses attention to selectively aggregate patch-level features. Despite its growing adoption, attentive probing remains under-explored, with existing methods suffering from excessive parameterization and poor computational efficiency. In this work, we revisit attentive probing through the lens of the accuracy-efficiency trade-off. We conduct a systematic study of existing methods, analyzing their mechanisms and benchmarking their performance. We introduce efficient probing (EP), a multi-query cross-attention mechanism that eliminates redundant projections, reduces the number of trainable parameters, and achieves up to a 10$\times$ speed-up over conventional multi-head attention. Despite its simplicity, EP outperforms LP and prior attentive probing approaches across seven benchmarks, generalizes well beyond MIM to diverse pre-training paradigms, produces interpretable attention maps, and achieves strong gains in low-shot and layer-wise settings. Code available at https://github.com/billpsomas/efficient-probing.
[13] Improving Personalized Search with Regularized Low-Rank Parameter Updates
Fiona Ryan,Josef Sivic,Fabian Caba Heilbron,Judy Hoffman,James M. Rehg,Bryan Russell
Main category: cs.CV
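TL;DR: Proposes a method for personalized vision-language retrieval that adapts the internal representation of a dual encoder model using regularized low-rank adaptation and a parameter addition strategy, significantly improving recognition of new concepts.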
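The parameter-addition idea in this entry can be illustrated with a small sketch: each personal concept contributes a low-rank update, written here as a product of two thin matrices `B` and `A`, and several concepts are combined by summing their updates onto the same base weight. This is a minimal sketch under assumptions, not the authors' code: the `(B, A)` factorization follows the usual LoRA convention, the matrix shapes and the helper names `matmul` and `add_low_rank_updates` are hypothetical, and the regularization used during training is omitted.

```python
def matmul(a, b):
    """Multiply two matrices given as nested lists."""
    inner = len(b)
    cols = len(b[0])
    return [
        [sum(row[k] * b[k][j] for k in range(inner)) for j in range(cols)]
        for row in a
    ]


def add_low_rank_updates(weight, updates):
    """Return weight + sum_i (B_i @ A_i) without modifying `weight`.

    `updates` is a list of (B, A) pairs, one per learned concept
    (hypothetical layout: B is d_out x r, A is r x d_in).
    """
    merged = [row[:] for row in weight]
    for b_mat, a_mat in updates:
        delta = matmul(b_mat, a_mat)
        for i, delta_row in enumerate(delta):
            for j, value in enumerate(delta_row):
                merged[i][j] += value
    return merged


# Two rank-1 "concept" updates applied to a 2x2 base weight.
base = [[1.0, 0.0], [0.0, 1.0]]
concept_a = ([[1.0], [0.0]], [[0.0, 2.0]])
concept_b = ([[0.0], [1.0]], [[3.0, 0.0]])
combined = add_low_rank_updates(base, [concept_a, concept_b])
```

Because the updates are simply summed, concepts can be added or removed independently of one another; how well this preserves general knowledge is an empirical question that the paper evaluates.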
Details
Motivation: Personalized vision-language retrieval must learn new concepts from a few examples and integrate personal and general knowledge to recognize the concept in different contexts.
Method: Regularized low-rank adaptation updates a small set of parameters in the final layer of the language encoder, and parameter addition combines multiple learned personal concepts.
Result: On two benchmarks (DeepFashion2 and ConCon-Chi), personalized retrieval accuracy improves over prior methods by 4%-22%.
Conclusion: The method improves recognition of new concepts while preserving general knowledge, offering a new approach to personalized vision-language retrieval.
Abstract: Personalized vision-language retrieval seeks to recognize new concepts (e.g. "my dog Fido") from only a few examples. This task is challenging because it requires not only learning a new concept from a few images, but also integrating the personal and general knowledge together to recognize the concept in different contexts. In this paper, we show how to effectively adapt the internal representation of a vision-language dual encoder model for personalized vision-language retrieval. We find that regularized low-rank adaptation of a small set of parameters in the language encoder's final layer serves as a highly effective alternative to textual inversion for recognizing the personal concept while preserving general knowledge. Additionally, we explore strategies for combining parameters of multiple learned personal concepts, finding that parameter addition is effective. To evaluate how well general knowledge is preserved in a finetuned representation, we introduce a metric that measures image retrieval accuracy based on captions generated by a vision language model (VLM). Our approach achieves state-of-the-art accuracy on two benchmarks for personalized image retrieval with natural language queries (DeepFashion2 and ConCon-Chi), outperforming the prior art by 4%-22% on personal retrievals.
[14] ScoreMix: Improving Face Recognition via Score Composition in Diffusion Generators
Parsa Rahimi,Sebastien Marcel
Main category: cs.CV
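TL;DR: ScoreMix is a data augmentation strategy that exploits the score compositional properties of diffusion models to significantly improve discriminator performance, particularly when labeled data is limited.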
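The score composition this entry relies on amounts to a convex combination of two class-conditional scores at each sampling step. The sketch below shows only that mixing rule on a toy problem. It is a hedged illustration, not the paper's sampler: `gaussian_score` is a hypothetical stand-in for a class-conditioned score network, `lam` is an assumed name for the mixing weight, and the diffusion sampling loop itself is left out.

```python
def gaussian_score(mean):
    """Score (gradient of log-density) of a unit-variance 1-D Gaussian.

    Toy stand-in for a class-conditional score network.
    """
    def score(x):
        return -(x - mean)
    return score


def mix_scores(score_a, score_b, lam):
    """Return the convex combination lam * score_a + (1 - lam) * score_b."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1] for a convex combination")

    def mixed(x):
        return lam * score_a(x) + (1.0 - lam) * score_b(x)
    return mixed


# Mixing the scores of two "classes" centred at 0 and 4.
mixed_score = mix_scores(gaussian_score(0.0), gaussian_score(4.0), lam=0.25)
```

In this toy case the mixed score equals the score of a unit-variance Gaussian centred at the interpolated mean `lam * 0 + (1 - lam) * 4 = 3`, so samples drawn by following it would land between the two classes.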
Details
Motivation: Address weak discriminator performance when labeled data is limited.
Method: Generates challenging synthetic samples by convexly combining the scores of different class-conditioned trajectories during diffusion sampling.
Result: Discriminative performance improves significantly on multiple benchmarks, and combining classes that are far apart in the discriminator's embedding space works better.
Conclusion: ScoreMix delivers notable gains without extensive parameter searches and suits practical settings where data collection is difficult.
Abstract: In this paper, we propose ScoreMix, a novel yet simple data augmentation strategy leveraging the score compositional properties of diffusion models to enhance discriminator performance, particularly under scenarios with limited labeled data. By convexly mixing the scores from different class-conditioned trajectories during diffusion sampling, we generate challenging synthetic samples that significantly improve discriminative capabilities in all studied benchmarks. We systematically investigate class-selection strategies for mixing and discover that greater performance gains arise when combining classes distant in the discriminator's embedding space, rather than close in the generator's condition space. Moreover, we empirically show that, under standard metrics, the correlation between the generator's learned condition space and the discriminator's embedding space is minimal. Our approach achieves notable performance improvements without extensive parameter searches, demonstrating practical advantages for training discriminative models while effectively mitigating problems regarding collections of large datasets. Paper website: https://parsa-ra.github.io/scoremix
[15] California Crop Yield Benchmark: Combining Satellite Image, Climate, Evapotranspiration, and Soil Data Layers for County-Level Yield Forecasting of Over 70 Crops
Hamid Kamangir,Mona Hajiesmaeeli,Mason Earles
Main category: cs.CV
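TL;DR: Presents a comprehensive California crop yield benchmark dataset and a multi-modal deep learning model built on multi-source data for county-level crop yield forecasting, achieving high predictive accuracy.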
Details
Motivation: Despite abundant historical yield data, accurate crop yield forecasting remains challenging because of the complex interplay of environmental, climatic, and soil factors.
Method: Integrates satellite imagery, climate records, evapotranspiration, and soil data, and develops a multi-modal deep learning model combining stratified feature extraction with a time-series encoder.
Result: The model achieves an R² score of 0.76 on an unseen test dataset, showing strong predictive performance.
Conclusion: The benchmark and modeling framework provide a foundation for agricultural forecasting, climate adaptation, and precision farming; the dataset and code are publicly available.
Abstract: California is a global leader in agricultural production, contributing 12.5% of the United States' total output and ranking as the fifth-largest food and cotton supplier in the world. Despite the availability of extensive historical yield data from the USDA National Agricultural Statistics Service, accurate and timely crop yield forecasting remains a challenge due to the complex interplay of environmental, climatic, and soil-related factors. In this study, we introduce a comprehensive crop yield benchmark dataset covering over 70 crops across all California counties from 2008 to 2022. The benchmark integrates diverse data sources, including Landsat satellite imagery, daily climate records, monthly evapotranspiration, and high-resolution soil properties. To effectively learn from these heterogeneous inputs, we develop a multi-modal deep learning model tailored for county-level, crop-specific yield forecasting. The model employs stratified feature extraction and a time-series encoder to capture spatial and temporal dynamics during the growing season. Static inputs such as soil characteristics and crop identity inform long-term variability. Our approach achieves an overall R² score of 0.76 across all crops on an unseen test dataset, highlighting strong predictive performance across California's diverse agricultural regions. This benchmark and modeling framework offer a valuable foundation for advancing agricultural forecasting, climate adaptation, and precision farming. The full dataset and codebase are publicly available at our GitHub repository.
[16] DySS: Dynamic Queries and State-Space Learning for Efficient 3D Object Detection from Multi-Camera Videos
Rajeev Yasarla,Shizhong Han,Hong Cai,Fatih Porikli
Main category: cs.CV
TL;DR: DySS is a 3D object detection method based on state-space learning and dynamic queries that significantly improves performance and efficiency.
Details
Motivation: Existing methods rely on dense BEV features or a large number of queries, making them computationally expensive and hard to scale to multi-frame video.
Method: DySS uses a state-space model (SSM) to process temporal features, introduces auxiliary tasks of future prediction and masked reconstruction, and dynamically updates the queries.
Result: On the nuScenes test split, DySS achieves 65.31 NDS and 57.4 mAP, outperforming existing methods; real-time inference runs at 33 FPS.
Conclusion: Through state-space learning and dynamic queries, DySS achieves efficient, high-performance 3D object detection.
Abstract: Camera-based 3D object detection in Bird's Eye View (BEV) is one of the most important perception tasks in autonomous driving. Earlier methods rely on dense BEV features, which are costly to construct. More recent works explore sparse query-based detection. However, they still require a large number of queries and can become expensive to run when more video frames are used. In this paper, we propose DySS, a novel method that employs state-space learning and dynamic queries. More specifically, DySS leverages a state-space model (SSM) to sequentially process the sampled features over time steps. In order to encourage the model to better capture the underlying motion and correspondence information, we introduce auxiliary tasks of future prediction and masked reconstruction to better train the SSM. The state of the SSM then provides an informative yet efficient summarization of the scene. Based on the state-space learned features, we dynamically update the queries via merge, remove, and split operations, which help maintain a useful, lean set of detection queries throughout the network. Our proposed DySS achieves both superior detection performance and efficient inference. Specifically, on the nuScenes test split, DySS achieves 65.31 NDS and 57.4 mAP, outperforming the latest state of the art. On the val split, DySS achieves 56.2 NDS and 46.2 mAP, as well as a real-time inference speed of 33 FPS.
[17] HalLoc: Token-level Localization of Hallucinations for Vision Language Models
Eunkyu Park,Minyeong Kim,Gunhee Kim
Main category: cs.CV
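TL;DR: Presents the HalLoc dataset and a baseline model for efficient, probabilistic hallucination detection in vision-language models, supporting graded-confidence detection and improving model reliability.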
Details
Motivation: Current hallucination detection methods are computationally expensive and cannot handle the blurred boundaries found in real-world scenarios, so an efficient solution is needed.
Method: Builds the HalLoc dataset of 150K annotated samples and develops a low-overhead baseline model that supports real-time detection.
Result: The HalLoc dataset and baseline model enable efficient hallucination detection and integrate seamlessly into existing vision-language models.
Conclusion: HalLoc opens a new avenue for improving the trustworthiness of vision-language models; the dataset and code are open source.
Abstract: Hallucinations pose a significant challenge to the reliability of large vision-language models, making their detection essential for ensuring accuracy in critical applications. Current detection methods often rely on computationally intensive models, leading to high latency and resource demands. Their definitive outcomes also fail to account for real-world scenarios where the line between hallucinated and truthful information is unclear. To address these issues, we propose HalLoc, a dataset designed for efficient, probabilistic hallucination detection. It features 150K token-level annotated samples, including hallucination types, across Visual Question Answering (VQA), instruction-following, and image captioning tasks. This dataset facilitates the development of models that detect hallucinations with graded confidence, enabling more informed user interactions. Additionally, we introduce a baseline model trained on HalLoc, offering low-overhead, concurrent hallucination detection during generation. The model can be seamlessly integrated into existing VLMs, improving reliability while preserving efficiency. The prospect of a robust plug-and-play hallucination detection module opens new avenues for enhancing the trustworthiness of vision-language models in real-world applications. The HalLoc dataset and code are publicly available at: https://github.com/dbsltm/cvpr25_halloc.
[18] Uncertainty-Aware Deep Learning for Automated Skin Cancer Classification: A Comprehensive Evaluation
Hamzeh Asgharnezhad,Pegah Tabarisaadi,Abbas Khosravi,Roohallah Alizadehsani,U. Rajendra Acharya
Main category: cs.CV
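TL;DR: Evaluates deep learning-based skin lesion classification combining transfer learning and uncertainty quantification; CLIP-based vision transformers perform best, and ensemble methods balance accuracy and uncertainty handling.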
Details
Motivation: Accurate skin cancer diagnosis is critical for early treatment, but existing deep learning models are limited by data scarcity and a lack of uncertainty awareness.
Method: Using transfer learning and uncertainty quantification (UQ), evaluate a range of pre-trained feature extractors and classifiers on the HAM10000 dataset, and add UQ techniques such as MCD and ensembles.
Result: CLIP-based vision transformers (for example, LAION CLIP ViT-H/14 with an SVM) perform best; ensemble methods perform well on both accuracy and uncertainty handling.
Conclusion: Integrating UQ into deep-learning-based medical diagnosis can improve performance and reliability in clinical use.
Abstract: Accurate and reliable skin cancer diagnosis is critical for early treatment and improved patient outcomes. Deep learning (DL) models have shown promise in automating skin cancer classification, but their performance can be limited by data scarcity and a lack of uncertainty awareness. In this study, we present a comprehensive evaluation of DL-based skin lesion classification using transfer learning and uncertainty quantification (UQ) on the HAM10000 dataset. In the first phase, we benchmarked several pre-trained feature extractors, including Contrastive Language-Image Pretraining (CLIP) variants, Residual Network-50 (ResNet50), Densely Connected Convolutional Network (DenseNet121), Visual Geometry Group network (VGG16), and EfficientNet-V2-Large, combined with a range of traditional classifiers such as Support Vector Machine (SVM), eXtreme Gradient Boosting (XGBoost), and logistic regression. Our results show that CLIP-based vision transformers, particularly LAION CLIP ViT-H/14 with SVM, deliver the highest classification performance. In the second phase, we incorporated UQ using Monte Carlo Dropout (MCD), Ensemble, and Ensemble Monte Carlo Dropout (EMCD) to assess not only prediction accuracy but also the reliability of model outputs. We evaluated these models using uncertainty-aware metrics such as uncertainty accuracy (UAcc), uncertainty sensitivity (USen), uncertainty specificity (USpe), and uncertainty precision (UPre). The results demonstrate that ensemble methods offer a good trade-off between accuracy and uncertainty handling, while EMCD is more sensitive to uncertain predictions.
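The abstract names the uncertainty-aware metrics but does not define them. The sketch below is a minimal illustration in Python, assuming the confusion-matrix style definitions often used in the uncertainty-quantification literature: a prediction is counted as "true uncertain" when it is wrong and flagged uncertain, and as "true certain" when it is right and not flagged. The function names, the entropy-based uncertainty score, and these definitions are assumptions for illustration, not the authors' code.

```python
import math


def predictive_entropy(mc_probs):
    """Entropy of the mean class distribution over several stochastic passes.

    mc_probs is a list of probability vectors, one per Monte Carlo Dropout
    pass (or per ensemble member). Higher entropy means more uncertainty.
    """
    n_passes = len(mc_probs)
    n_classes = len(mc_probs[0])
    mean = [sum(p[c] for p in mc_probs) / n_passes for c in range(n_classes)]
    return -sum(q * math.log(q) for q in mean if q > 0.0)


def uncertainty_metrics(correct, uncertain):
    """Assumed definitions of UAcc, USen, USpe and UPre.

    correct[i]   -- True if prediction i matches the label
    uncertain[i] -- True if prediction i was flagged as uncertain
    """
    pairs = list(zip(correct, uncertain))
    tu = sum(1 for c, u in pairs if not c and u)      # wrong and uncertain
    tc = sum(1 for c, u in pairs if c and not u)      # right and certain
    fu = sum(1 for c, u in pairs if c and u)          # right but uncertain
    fc = sum(1 for c, u in pairs if not c and not u)  # wrong but certain

    def ratio(num, den):
        return num / den if den else float("nan")

    return {
        "UAcc": ratio(tu + tc, tu + tc + fu + fc),
        "USen": ratio(tu, tu + fc),
        "USpe": ratio(tc, tc + fu),
        "UPre": ratio(tu, tu + fu),
    }
```

In this sketch a prediction would be flagged uncertain when `predictive_entropy` exceeds a chosen threshold; the threshold itself is a free parameter that the abstract does not specify.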
This study highlights the importance of integrating UQ into DL-based medical diagnosis to enhance both performance and trustworthiness in real-world clinical applications.
[19] Towards Scalable SOAP Note Generation: A Weakly Supervised Multimodal Framework
Sadia Kamal,Tim Oates,Joy Wan
Main category: cs.CV
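TL;DR: Proposes a weakly supervised multimodal framework that generates clinically structured SOAP notes from limited inputs (such as lesion images and sparse clinical text), easing physician workload and reducing dependence on large annotated datasets.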
Details
Motivation: Skin cancer is the most common cancer worldwide, with annual healthcare spending above $8 billion. Writing SOAP notes by hand is time-consuming and contributes to physician fatigue, so an automated solution is needed.
Method: A weakly supervised multimodal framework combines lesion images and sparse clinical text to generate SOAP notes, reducing reliance on manual annotation.
Result: Performance is comparable to GPT-4o, Claude, and DeepSeek Janus Pro; two new metrics, MedConceptEval and CCS, are introduced to assess clinical quality.
Conclusion: The method is scalable and clinically practical, substantially easing physician workload while reducing the need for large annotated datasets.
Abstract: Skin carcinoma is the most prevalent form of cancer globally, accounting for over $8 billion in annual healthcare expenditures. In clinical settings, physicians document patient visits using detailed SOAP (Subjective, Objective, Assessment, and Plan) notes. However, manually generating these notes is labor-intensive and contributes to clinician burnout. In this work, we propose a weakly supervised multimodal framework to generate clinically structured SOAP notes from limited inputs, including lesion images and sparse clinical text. Our approach reduces reliance on manual annotations, enabling scalable, clinically grounded documentation while alleviating clinician burden and reducing the need for large annotated data. Our method achieves performance comparable to GPT-4o, Claude, and DeepSeek Janus Pro across key clinical relevance metrics. To evaluate clinical quality, we introduce two novel metrics, MedConceptEval and Clinical Coherence Score (CCS), which assess semantic alignment with expert medical concepts and input features, respectively.
[20] Research on Audio-Visual Quality Assessment Dataset and Method for User-Generated Omnidirectional Video
Fei Zhao,Da Pan,Zelu Qi,Ping Shi
Main category: cs.CV
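TL;DR: Addressing the lack of research on audio-visual quality assessment (AVQA) for user-generated omnidirectional video (ODV) in the Metaverse, this paper builds a UGC omnidirectional audio-video dataset and proposes a baseline model.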
Details
Motivation: With the rise of the Metaverse, omnidirectional video (ODV) is shifting from professionally generated content (PGC) to user-generated content (UGC), yet research on audio-visual quality assessment (AVQA) remains limited.
Method: Build a UGC omnidirectional audio-video dataset of 300 videos covering 10 scene types and run a subjective AVQA experiment on it. Then propose a baseline model comprising video feature extraction, audio feature extraction, and audio-visual fusion modules.
Result: Experiments show the model performs best on the proposed dataset.
Conclusion: The study provides a dataset and a baseline model to support progress in UGC-ODV AVQA.
Abstract: In response to the rising prominence of the Metaverse, omnidirectional videos (ODVs) have garnered notable interest, gradually shifting from professional-generated content (PGC) to user-generated content (UGC). However, the study of audio-visual quality assessment (AVQA) within ODVs remains limited. To address this, we construct a dataset of UGC omnidirectional audio and video (A/V) content. The videos are captured by five individuals using two different types of omnidirectional cameras, shooting 300 videos covering 10 different scene types. A subjective AVQA experiment is conducted on the dataset to obtain the Mean Opinion Scores (MOSs) of the A/V sequences. After that, to facilitate the development of the UGC-ODV AVQA field, we construct an effective AVQA baseline model on the proposed dataset, which consists of a video feature extraction module, an audio feature extraction module, and an audio-visual fusion module. The experimental results demonstrate that our model achieves optimal performance on the proposed dataset.
[21] Using Vision Language Models to Detect Students' Academic Emotion through Facial Expressions
Deliang Wang,Chao Yang,Gaowei Chen
Main category: cs.CV
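TL;DR: The study examines the potential of vision-language models (VLMs) for analyzing students' academic emotions and finds that Qwen2.5-VL-7B-Instruct outperforms Llama-3.2-11B-Vision-Instruct, especially in recognizing happy and confused expressions.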
Details
Motivation: Traditional supervised learning methods generalize poorly, whereas VLMs offer a new approach through zero-shot prompting with no fine-tuning.
Method: Apply two VLMs (Llama-3.2-11B-Vision-Instruct and Qwen2.5-VL-7B-Instruct) to zero-shot analysis of 5,000 images of student facial expressions.
Result: Qwen2.5-VL-7B-Instruct performs better, especially on happy and confused expressions, but distracted behavior is not detected.
Conclusion: VLMs show potential for academic emotion analysis; Qwen2.5-VL-7B-Instruct in particular could be used to identify student confusion.
Abstract: Students' academic emotions significantly influence their social behavior and learning performance. Traditional approaches to automatically and accurately analyze these emotions have predominantly relied on supervised machine learning algorithms. However, these models often struggle to generalize across different contexts, necessitating repeated cycles of data collection, annotation, and training. The emergence of Vision-Language Models (VLMs) offers a promising alternative, enabling generalization across visual recognition tasks through zero-shot prompting without requiring fine-tuning. This study investigates the potential of VLMs to analyze students' academic emotions via facial expressions in an online learning environment. We employed two VLMs, Llama-3.2-11B-Vision-Instruct and Qwen2.5-VL-7B-Instruct, to analyze 5,000 images depicting confused, distracted, happy, neutral, and tired expressions using zero-shot prompting. Preliminary results indicate that both models demonstrate moderate performance in academic facial expression recognition, with Qwen2.5-VL-7B-Instruct outperforming Llama-3.2-11B-Vision-Instruct. Notably, both models excel in identifying students' happy emotions but fail to detect distracted behavior. Additionally, Qwen2.5-VL-7B-Instruct exhibits relatively high performance in recognizing students' confused expressions, highlighting its potential for practical applications in identifying content that causes student confusion.
[22] PointGS: Point Attention-Aware Sparse View Synthesis with Gaussian Splatting
Lintao Xiang,Hongpei Zheng,Yating Huang,Qijun Yang,Hujun Yin
Main category: cs.CV
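TL;DR: Proposes a point-feature-aware Gaussian splatting framework that addresses the overfitting of 3DGS under sparse-view input and achieves real-time, high-quality rendering.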
Details
Motivation: Existing 3DGS methods need many calibrated views and tend to overfit when inputs are sparse, which degrades rendering quality.
Method: Use a stereo foundation model to estimate camera poses and a dense point cloud, strengthen point-to-point interaction through multiscale 2D feature sampling and self-attention, and decode Gaussian parameters with MLPs.
Result: Significantly outperforms NeRF-based methods on diverse benchmarks and is competitive with state-of-the-art 3DGS methods in few-shot settings.
Conclusion: The method achieves high-quality real-time rendering from sparse views and addresses the overfitting problem of 3DGS.
Abstract: 3D Gaussian splatting (3DGS) is an innovative rendering technique that surpasses the neural radiance field (NeRF) in both rendering speed and visual quality by leveraging an explicit 3D scene representation. Existing 3DGS approaches require a large number of calibrated views to generate a consistent and complete scene representation. When input views are limited, 3DGS tends to overfit the training views, leading to noticeable degradation in rendering quality. To address this limitation, we propose a Point-wise Feature-Aware Gaussian Splatting framework that enables real-time, high-quality rendering from sparse training views. Specifically, we first employ the latest stereo foundation model to estimate accurate camera poses and reconstruct a dense point cloud for Gaussian initialization. We then encode the colour attributes of each 3D Gaussian by sampling and aggregating multiscale 2D appearance features from sparse inputs. To enhance point-wise appearance representation, we design a point interaction network based on a self-attention mechanism, allowing each Gaussian point to interact with its nearest neighbors. These enriched features are subsequently decoded into Gaussian parameters through two lightweight multi-layer perceptrons (MLPs) for final rendering. Extensive experiments on diverse benchmarks demonstrate that our method significantly outperforms NeRF-based approaches and achieves competitive performance under few-shot settings compared to the state-of-the-art 3DGS methods.
[23] GeoCAD: Local Geometry-Controllable CAD Generation
Zhanwei Zhang,Kaiyuan Liu,Junjie Liu,Wenxiao Wang,Binbin Lin,Liang Xie,Chen Shen,Deng Cai
Main category: cs.CV
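TL;DR: GeoCAD is a user-friendly, local geometry-controllable CAD generation method. It uses a complementary captioning strategy to produce geometric instructions for local parts and an LLM to predict masked parts; experiments demonstrate its effectiveness.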
Details
Motivation: Existing methods cannot both follow textual instructions and focus on local parts; GeoCAD aims to solve this.
Method: Propose a complementary captioning strategy (vertex-based and VLLM-based), randomly mask a local part during training and have an LLM predict it, and let users specify local modifications at inference.
Result: Experiments show GeoCAD performs well in generation quality, validity, and text-to-CAD consistency.
Conclusion: GeoCAD offers an efficient solution for local geometry-controllable CAD generation, and the code will be released.
Abstract: Local geometry-controllable computer-aided design (CAD) generation aims to modify local parts of CAD models automatically, enhancing design efficiency. It also ensures that the shapes of newly generated local parts follow user-specific geometric instructions (e.g., an isosceles right triangle or a rectangle with one corner cut off). However, existing methods encounter challenges in achieving this goal. Specifically, they either lack the ability to follow textual instructions or are unable to focus on the local parts. To address this limitation, we introduce GeoCAD, a user-friendly and local geometry-controllable CAD generation method. Specifically, we first propose a complementary captioning strategy to generate geometric instructions for local parts. This strategy involves vertex-based and VLLM-based captioning for systematically annotating simple and complex parts, respectively. In this way, we caption ~221k different local parts in total. In the training stage, given a CAD model, we randomly mask a local part. Then, using its geometric instruction and the remaining parts as input, we prompt large language models (LLMs) to predict the masked part. During inference, users can specify any local part for modification while adhering to a variety of predefined geometric instructions. Extensive experiments demonstrate the effectiveness of GeoCAD in generation quality, validity and text-to-CAD consistency. Code will be available at https://github.com/Zhanwei-Z/GeoCAD.
[24] UrbanSense: A Framework for Quantitative Analysis of Urban Streetscapes leveraging Vision Large Language Models
Jun Yin,Jing Zhong,Peilin Li,Pengyu Zeng,Miao Zhang,Ran Luo,Shuai Lu
Main category: cs.CV
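TL;DR: Proposes a multimodal research framework based on vision-language models for automated analysis of stylistic differences in urban streetscapes, and builds the UrbanDiffBench dataset and the UrbanSense framework.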
Details
Motivation: Urban cultures and architectural styles differ markedly because of geographical, historical, and socio-political factors, and traditional research relies on expert interpretation and historical documents, which are hard to standardize.
Method: Build UrbanSense, a multimodal framework based on vision-language models, to generate and compare urban style representations quantitatively.
Result: Over 80% of descriptions pass the t-test (p < 0.05), and high Phi scores from subjective evaluation (0.912 for cities, 0.833 for periods) validate the method.
Conclusion: The method can quantify urban style evolution on a scientific basis and provide data to support future design.
Abstract: Urban cultures and architectural styles vary significantly across cities due to geographical, chronological, historical, and socio-political factors. Understanding these differences is essential for anticipating how cities may evolve in the future. As representative cases of historical continuity and modern innovation in China, Beijing and Shenzhen offer valuable perspectives for exploring the transformation of urban streetscapes. However, conventional approaches to urban cultural studies often rely on expert interpretation and historical documentation, which are difficult to standardize across different contexts. To address this, we propose a multimodal research framework based on vision-language models, enabling automated and scalable analysis of urban streetscape style differences. This approach enhances the objectivity and data-driven nature of urban form research. The contributions of this study are as follows: First, we construct UrbanDiffBench, a curated dataset of urban streetscapes containing architectural images from different periods and regions. Second, we develop UrbanSense, the first vision-language-model-based framework for urban streetscape analysis, enabling the quantitative generation and comparison of urban style representations. Third, experimental results show that over 80% of generated descriptions pass the t-test (p < 0.05). High Phi scores (0.912 for cities, 0.833 for periods) from subjective evaluations confirm the method's ability to capture subtle stylistic differences.
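The abstract does not say how its "Phi scores" are computed. If they are the standard phi coefficient for a 2x2 contingency table (an assumption here, for example "evaluator's choice" against "true city"), the statistic is $\phi = (n_{11} n_{00} - n_{10} n_{01}) / \sqrt{n_{1\cdot}\, n_{0\cdot}\, n_{\cdot 1}\, n_{\cdot 0}}$. A minimal Python sketch under that assumption, with a hypothetical function name and made-up counts:

```python
import math


def phi_coefficient(n11, n10, n01, n00):
    """Phi coefficient of a 2x2 table [[n11, n10], [n01, n00]].

    Ranges from -1 (perfect disagreement) to 1 (perfect agreement).
    """
    row1, row0 = n11 + n10, n01 + n00
    col1, col0 = n11 + n01, n10 + n00
    denom = math.sqrt(row1 * row0 * col1 * col0)
    if denom == 0:
        raise ValueError("a row or column of the table is empty")
    return (n11 * n00 - n10 * n01) / denom


# Illustrative counts only, not data from the paper.
print(round(phi_coefficient(23, 1, 1, 23), 3))
```

Values near 1 mean the two binary variables almost always agree, which is how a score such as 0.912 would be read under this interpretation.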
These results highlight the method's potential to quantify and interpret urban style evolution, offering a scientifically grounded lens for future design.
[25] RealKeyMorph: Keypoints in Real-world Coordinates for Resolution-agnostic Image Registration
Mina C. Moghadam,Alan Q. Wang,Omer Taub,Martin R. Prince,Mert R. Sabuncu
Main category: cs.CV
TL;DR: RealKeyMorph (RKM) is a resolution-agnostic medical image registration method that avoids the artifacts caused by resampling in conventional methods.
Details
Motivation: Existing machine-learning-based registration methods resample images to a fixed resolution, which introduces interpolation artifacts, while medical images often differ in resolution because of differing acquisition parameters.
Method: RKM extends the KeyMorph framework: a network is trained to learn keypoints for an image pair and outputs them in real-world coordinates, using the scanner's affine matrix to achieve resolution independence.
Result: Experiments show RKM's advantages in registering orthogonal 2D stacks of abdominal MRIs and 3D brain datasets of varying resolution.
Conclusion: RKM provides an efficient registration approach that needs no resampling and suits multi-resolution medical images.
Abstract: Many real-world settings require registration of a pair of medical images that differ in spatial resolution, which may arise from differences in image acquisition parameters like pixel spacing, slice thickness, and field-of-view. However, all previous machine learning-based registration techniques resample images onto a fixed resolution. This is suboptimal because resampling can introduce artifacts due to interpolation. To address this, we present RealKeyMorph (RKM), a resolution-agnostic method for image registration. RKM is an extension of KeyMorph, a registration framework which works by training a network to learn corresponding keypoints for a given pair of images, after which a closed-form keypoint matching step is used to derive the transformation that aligns them. To avoid resampling and enable operating on the raw data, RKM outputs keypoints in real-world coordinates of the scanner. To do this, we leverage the affine matrix produced by the scanner (e.g., MRI machine) that encodes the mapping from voxel coordinates to real world coordinates. By transforming keypoints into real-world space and integrating this into the training process, RKM effectively enables the extracted keypoints to be resolution-agnostic. In our experiments, we demonstrate the advantages of RKM on the registration task for orthogonal 2D stacks of abdominal MRIs, as well as 3D volumes with varying resolutions in brain datasets.
[26] Motion-R1: Chain-of-Thought Reasoning and Reinforcement Learning for Human Motion Generation
Runqi Ouyang,Haoyun Li,Zhenyuan Zhang,Xiaofeng Wang,Zheng Zhu,Guan Huang,Xingang Wang
Main category: cs.CV
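TL;DR: Motion-R1 is a unified motion-language modeling framework with a Chain-of-Thought mechanism; by decomposing complex textual instructions into logical action paths, it significantly improves the controllability and diversity of motion generation.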
Details
Motivation: Existing methods have advanced semantic alignment and motion synthesis but rely on end-to-end mapping strategies that struggle to capture deep linguistic structure and logical reasoning, so generated motions lack controllability and diversity.
Method: Propose the Motion-R1 framework, which integrates a Chain-of-Thought mechanism to decompose complex instructions into logical action paths and uses the Group Relative Policy Optimization reinforcement learning algorithm to optimize reasoning chains and motion synthesis jointly.
Result: Experiments on multiple benchmark datasets show Motion-R1 matches or outperforms existing methods, particularly in scenarios requiring nuanced semantic understanding and long-term temporal coherence.
Conclusion: Through explicit reasoning and joint optimization, Motion-R1 significantly improves text-to-motion generation; the code and model will be released.
Abstract: Recent advances in large language models, especially in natural language understanding and reasoning, have opened new possibilities for text-to-motion generation. Although existing approaches have made notable progress in semantic alignment and motion synthesis, they often rely on end-to-end mapping strategies that fail to capture deep linguistic structures and logical reasoning. Consequently, generated motions tend to lack controllability, consistency, and diversity. To address these limitations, we propose Motion-R1, a unified motion-language modeling framework that integrates a Chain-of-Thought mechanism. By explicitly decomposing complex textual instructions into logically structured action paths, Motion-R1 provides high-level semantic guidance for motion generation, significantly enhancing the model's ability to interpret and execute multi-step, long-horizon, and compositionally rich commands. To train our model, we adopt Group Relative Policy Optimization, a reinforcement learning algorithm designed for large models, which leverages motion quality feedback to optimize reasoning chains and motion synthesis jointly. Extensive experiments across multiple benchmark datasets demonstrate that Motion-R1 achieves competitive or superior performance compared to state-of-the-art methods, particularly in scenarios requiring nuanced semantic understanding and long-term temporal coherence. The code, model and data will be publicly available.
[27] FaceLiVT: Face Recognition using Linear Vision Transformer with Structural Reparameterization For Mobile Device
Novendra Setyawan,Chi-Chia Sun,Mao-Hsiu Hsu,Wen-Kai Kuo,Jun-Wei Hsieh
Main category: cs.CV
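TL;DR: FaceLiVT is a lightweight yet powerful face recognition model that combines a CNN-Transformer architecture with a novel Multi-Head Linear Attention (MHLA) mechanism, reducing computational complexity while maintaining high accuracy.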
Details
Motivation: Meet the need for real-time face recognition on resource-constrained platforms by balancing model size and performance.
Method: A hybrid CNN-Transformer architecture with the MHLA mechanism, combined with a reparameterized token mixer.
Result: Outperforms existing lightweight models on multiple benchmarks with markedly faster inference (8.6× faster than EdgeFace and 21.2× faster than a pure ViT model).
Conclusion: FaceLiVT offers an efficient, practical solution for real-time face recognition on resource-constrained platforms.
Abstract: This paper introduces FaceLiVT, a lightweight yet powerful face recognition model that integrates a hybrid Convolutional Neural Network (CNN)-Transformer architecture with an innovative and lightweight Multi-Head Linear Attention (MHLA) mechanism. By combining MHLA with a reparameterized token mixer, FaceLiVT effectively reduces computational complexity while preserving competitive accuracy. Extensive evaluations on challenging benchmarks, including LFW, CFP-FP, AgeDB-30, IJB-B, and IJB-C, highlight its superior performance compared to state-of-the-art lightweight models. MHLA notably improves inference speed, allowing FaceLiVT to deliver high accuracy with lower latency on mobile devices. Specifically, FaceLiVT is 8.6× faster than EdgeFace, a recent hybrid CNN-Transformer model optimized for edge devices, and 21.2× faster than a pure ViT-based model. With its balanced design, FaceLiVT offers an efficient and practical solution for real-time face recognition on resource-constrained platforms.
[28] FSATFusion: Frequency-Spatial Attention Transformer for Infrared and Visible Image Fusion
Tianpei Zhang,Jufeng Zhao,Yiming Zhu,Guangmang Cui,Yuhan Lyu
Main category: cs.CV
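TL;DR: Proposes FSATFusion, an infrared and visible image fusion network that uses a frequency-spatial attention Transformer (FSAT) module and an improved Transformer module (ITM) to strengthen global context extraction, significantly outperforming existing methods.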
Details
Motivation: Existing deep learning methods lose information because convolution struggles to capture global context, which limits fusion performance.
Method: Design the FSAT module (which contains the frequency-spatial attention mechanism, FSAM) and the ITM module to strengthen feature extraction.
Result: Experiments show FSATFusion surpasses other methods in fusion quality and efficiency and generalizes well.
Conclusion: FSATFusion performs strongly in infrared and visible image fusion and in downstream vision tasks.
Abstract: Infrared and visible image fusion (IVIF) is receiving increasing attention from both the research community and industry due to its excellent results in downstream applications. Existing deep learning approaches often utilize convolutional neural networks to extract image features. However, the inherently limited capacity of convolution operations to capture global context can lead to information loss, thereby restricting fusion performance. To address this limitation, we propose an end-to-end fusion network named the Frequency-Spatial Attention Transformer Fusion Network (FSATFusion). FSATFusion contains a frequency-spatial attention Transformer (FSAT) module designed to effectively capture discriminative features from source images. This FSAT module includes a frequency-spatial attention mechanism (FSAM) capable of extracting significant features from feature maps. Additionally, we propose an improved Transformer module (ITM) to enhance the vanilla Transformer's ability to extract global context information. We conducted both qualitative and quantitative comparative experiments, demonstrating the superior fusion quality and efficiency of FSATFusion compared to other state-of-the-art methods. Furthermore, our network was tested on two additional tasks without any modifications, to verify the excellent generalization capability of FSATFusion. Finally, the object detection experiment demonstrated the superiority of FSATFusion in downstream visual tasks. Our code is available at https://github.com/Lmmh058/FSATFusion.
[29] Revisiting Transformers with Insights from Image Filtering
Laziz U. Abdullaev,Maksim Tkachenko,Tan M. Nguyen
Main category: cs.CV
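TL;DR: The paper proposes a unified image processing framework that explains the role of self-attention and its components (such as positional encoding and residual connections), and verifies experimentally that the resulting modifications improve performance.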
Details
Motivation: Self-attention is successful but lacks a theoretical explanation, and existing frameworks offer little mechanistic understanding of its components.
Method: Develop a unifying image processing framework that explains self-attention and its components, and introduce two independent architectural modifications.
Result: Experiments show the image-processing-inspired modifications improve not only interpretability but also task accuracy and robustness.
Conclusion: The framework gives self-attention a deeper theoretical foundation and shows its practical potential.
Abstract: The self-attention mechanism, a cornerstone of Transformer-based state-of-the-art deep learning architectures, is largely heuristic-driven and fundamentally challenging to interpret. Establishing a robust theoretical foundation to explain its remarkable success and limitations has therefore become an increasingly prominent focus in recent research. Some notable directions have explored understanding self-attention through the lens of image denoising and nonparametric regression. While promising, existing frameworks still lack a deeper mechanistic interpretation of various architectural components that enhance self-attention, both in its original formulation and subsequent variants. In this work, we aim to advance this understanding by developing a unifying image processing framework, capable of explaining not only the self-attention computation itself but also the role of components such as positional encoding and residual connections, including numerous later variants. We also pinpoint potential distinctions between the two concepts building upon our framework, and make an effort to close this gap. We introduce two independent architectural modifications within transformers. While our primary objective is interpretability, we empirically observe that image processing-inspired modifications can also lead to notably improved accuracy and robustness against data contamination and adversaries across language and vision tasks as well as better long sequence understanding.
[30] Leveraging 6DoF Pose Foundation Models For Mapping Marine Sediment Burial
Jerry Yan,Chinmay Talegaonkar,Nicholas Antipa,Eric Terrill,Sophia Merrifield
Main category: cs.CV
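TL;DR: The paper presents PoseIDON, a computer vision method that combines deep foundation model features with multiview photogrammetry to estimate the six-degrees-of-freedom pose and burial depth of seafloor objects from ROV video.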
Details
Motivation: The burial state of anthropogenic objects on the seafloor matters for studying local sedimentation dynamics, assessing ecological risk and pollutant transport, and planning recovery or mitigation for hazardous materials such as munitions.
Method: Infer burial depth by aligning CAD models with observed imagery and fitting a local planar approximation of the seafloor. The method is validated on ROV footage of 54 objects, including barrels and munitions.
Result: Mean burial depth error is about 10 centimeters, and the model resolves spatial burial patterns that reflect sediment transport processes.
Conclusion: The method enables non-invasive, scalable mapping of seafloor burial and supports environmental assessment at contaminated sites.
Abstract: The burial state of anthropogenic objects on the seafloor provides insight into localized sedimentation dynamics and is also critical for assessing ecological risks, potential pollutant transport, and the viability of recovery or mitigation strategies for hazardous materials such as munitions. Accurate burial depth estimation from remote imagery remains difficult due to partial occlusion, poor visibility, and object degradation. This work introduces a computer vision pipeline, called PoseIDON, which combines deep foundation model features with multiview photogrammetry to estimate six degrees of freedom object pose and the orientation of the surrounding seafloor from ROV video. Burial depth is inferred by aligning CAD models of the objects with observed imagery and fitting a local planar approximation of the seafloor. The method is validated using footage of 54 objects, including barrels and munitions, recorded at a historic ocean dumpsite in the San Pedro Basin. The model achieves a mean burial depth error of approximately 10 centimeters and resolves spatial burial patterns that reflect underlying sediment transport processes. This approach enables scalable, non-invasive mapping of seafloor burial and supports environmental assessment at contaminated sites.
[31] DART: Differentiable Dynamic Adaptive Region Tokenizer for Vision Transformer and Mamba
Shicheng Yin,Kaixuan Yin,Yang Liu,Weixing Chen,Liang Lin
Main category: cs.CV
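TL;DR: The paper proposes the Dynamic Adaptive Region Tokenizer (DART), which adaptively partitions images into content-dependent patches of varying sizes, addressing the over-encoding of background regions and loss of local detail caused by fixed-size patches.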
Details
Motivation: Existing non-convolutional models (such as ViT and Vim) rely on fixed-size patches, which over-encode background regions and lose critical local detail, especially when information is sparsely distributed.
Method: DART combines learnable region scores with piecewise differentiable quantile operations to allocate denser tokens to information-rich regions.
Result: With only about 1M additional parameters, DART improves accuracy by 2.1% on DeiT (ImageNet-1K) and, compared with methods that uniformly increase token density, cuts FLOPs by 45%.
Conclusion: DART consistently improves performance on DeiT, Vim, and VideoMamba with minimal or even reduced computational overhead.
Abstract: Recently, non-convolutional models such as the Vision Transformer (ViT) and Vision Mamba (Vim) have achieved remarkable performance in computer vision tasks. However, their reliance on fixed-size patches often results in excessive encoding of background regions and omission of critical local details, especially when informative objects are sparsely distributed. To address this, we introduce a fully differentiable Dynamic Adaptive Region Tokenizer (DART), which adaptively partitions images into content-dependent patches of varying sizes. DART combines learnable region scores with piecewise differentiable quantile operations to allocate denser tokens to information-rich areas. Despite introducing only approximately 1 million (1M) additional parameters, DART improves accuracy by 2.1% on DeiT (ImageNet-1K). Unlike methods that uniformly increase token density to capture fine-grained details, DART offers a more efficient alternative, achieving 45% FLOPs reduction with superior performance. Extensive experiments on DeiT, Vim, and VideoMamba confirm that DART consistently enhances accuracy while incurring minimal or even reduced computational overhead. Code is available at https://github.com/HCPLab-SYSU/DART.
[32] ReconMOST: Multi-Layer Sea Temperature Reconstruction with Observations-Guided Diffusion
Yuanyi Song,Pumeng Lyu,Ben Fei,Fenghua Ling,Wanli Ouyang,Lei Bai
Main category: cs.CV
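TL;DR: This paper proposes ReconMOST, a data-driven diffusion model framework for multi-layer sea temperature reconstruction that overcomes the limits of conventional methods under sparse data and high computational cost.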
Details
Motivation: Accurate ocean reconstruction is essential for research on global climate dynamics and marine meteorology, but conventional methods are constrained by sparse data and high computational cost, and machine learning methods have been limited to the sea surface or local regions.
Method: The ReconMOST framework first pre-trains an unconditional diffusion model on historical numerical simulation data, then uses sparse, high-accuracy observations as guidance points for reverse diffusion to reconstruct multi-layer temperature.
Result: Reconstruction remains accurate with more than 92.5% of data missing; MSE is 0.049 on guidance, 0.680 on reconstruction, and 0.633 overall.
Conclusion: ReconMOST extends machine learning for ocean temperature reconstruction, with high accuracy, high resolution, and strong generalization.
Abstract: Accurate reconstruction of the ocean is essential for reflecting global climate dynamics and supporting marine meteorological research. Conventional methods face challenges due to sparse data, algorithmic complexity, and high computational costs, while the increasing use of machine learning (ML) methods remains limited to reconstruction problems at the sea surface and local regions, struggling with issues like cloud occlusion. To address these limitations, this paper proposes ReconMOST, a data-driven guided diffusion model framework for multi-layer sea temperature reconstruction. Specifically, we first pre-train an unconditional diffusion model using a large collection of historical numerical simulation data, enabling the model to attain physically consistent distribution patterns of ocean temperature fields. During the generation phase, sparse yet high-accuracy in-situ observational data are utilized as guidance points for the reverse diffusion process, generating accurate reconstruction results. Importantly, in regions lacking direct observational data, the physically consistent spatial distribution patterns learned during pre-training enable implicitly guided and physically plausible reconstructions. Our method extends ML-based SST reconstruction to a global, multi-layer setting, handling over 92.5% missing data while maintaining reconstruction accuracy, spatial resolution, and superior generalization capability. We pre-train our model on CMIP6 numerical simulation data and conduct guided reconstruction experiments on CMIP6 and EN4 analysis data.
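The three MSE figures quoted for this paper (0.049 on guidance, 0.680 on reconstruction, 0.633 in total) can be read as errors over different subsets of grid points. The Python sketch below assumes, since the abstract does not spell it out, that "guidance" means points where an observation was supplied, "reconstruction" means points with no observation, and "total" means all points. Function names are hypothetical.

```python
def mse(pred, truth, keep):
    """Mean squared error over the positions where keep[i] is True."""
    sq = [(p - t) ** 2 for p, t, k in zip(pred, truth, keep) if k]
    if not sq:
        raise ValueError("mask selects no points")
    return sum(sq) / len(sq)


def split_mse(pred, truth, observed):
    """MSE on observed points, on unobserved points, and on all points."""
    unobserved = [not o for o in observed]
    everything = [True] * len(truth)
    return {
        "guidance": mse(pred, truth, observed),
        "reconstruction": mse(pred, truth, unobserved),
        "total": mse(pred, truth, everything),
    }


def total_from_parts(guidance, reconstruction, observed_fraction):
    """Total MSE as the point-count-weighted mean of the two partial MSEs."""
    return observed_fraction * guidance + (1.0 - observed_fraction) * reconstruction
```

Under this reading the numbers are mutually consistent: with 92.5% of points missing, `total_from_parts(0.049, 0.680, 0.075)` gives 0.075 × 0.049 + 0.925 × 0.680 ≈ 0.633, which matches the reported total. The abstract says "over 92.5%", so this is a plausibility check on the interpretation, not a confirmation of it.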
The mean squared error (MSE) is 0.049 on guidance, 0.680 on reconstruction, and 0.633 in total, demonstrating the effectiveness and robustness of the proposed framework. Our source code is available at https://github.com/norsheep/ReconMOST.
[33] Pisces: An Auto-regressive Foundation Model for Image Understanding and Generation
Zhiyang Xu,Jiuhai Chen,Zhaojiang Lin,Xichen Pan,Lifu Huang,Tianyi Zhou,Madian Khabsa,Qifan Wang,Di Jin,Michihiro Yasunaga,Lili Yu,Xi Victoria Lin,Shaoliang Nie
Main category: cs.CV
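TL;DR: Pisces is a multimodal foundation model that performs well on both image understanding and image generation through a decoupled visual encoding architecture and tailored training techniques.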
Details
Motivation: Unified models underperform on image understanding and generation, mainly because of differences in the visual features and training processes each task needs.
Method: A decoupled visual encoding architecture and tailored training techniques, combined with data curation, pretraining, and finetuning.
Result: Strong performance on more than 20 image understanding benchmarks and on the GenEval generation benchmark.
Conclusion: Pisces reveals the synergy between image understanding and generation and advances unified multimodal models.
Abstract: Recent advances in large language models (LLMs) have enabled multimodal foundation models to tackle both image understanding and generation within a unified framework. Despite these gains, unified models often underperform compared to specialized models in either task. A key challenge in developing unified models lies in the inherent differences between the visual features needed for image understanding versus generation, as well as the distinct training processes required for each modality. In this work, we introduce Pisces, an auto-regressive multimodal foundation model that addresses this challenge through a novel decoupled visual encoding architecture and tailored training techniques optimized for multimodal generation. Combined with meticulous data curation, pretraining, and finetuning, Pisces achieves competitive performance in both image understanding and image generation. We evaluate Pisces on over 20 public benchmarks for image understanding, where it demonstrates strong performance across a wide range of tasks. Additionally, on GenEval, a widely adopted benchmark for image generation, Pisces exhibits robust generative capabilities. Our extensive analysis reveals the synergistic relationship between image understanding and generation, and the benefits of using separate visual encoders, advancing the field of unified multimodal models.
[34] It's Not the Target, It's the Background: Rethinking Infrared Small Target Detection via Deep Patch-Free Low-Rank Representations
Guoyi Zhang,Guangsheng Xu,Siyang Chen,Han Wang,Xiaohu Zhang
Main category: cs.CV
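TL;DR: LRRNet is a new end-to-end infrared small target detection framework that exploits the low-rank property of backgrounds, using a compression-reconstruction-subtraction paradigm to model structure-aware low-rank background representations directly, without patch-based processing or explicit matrix decomposition.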
Details
Motivation: Infrared small target detection in complex backgrounds is hard because of low signal-to-clutter ratios, diverse target morphologies, and a lack of visual cues; existing deep learning methods are unstable because of the intrinsic variability and weak priors of targets.
Method: Propose the LRRNet framework, which adopts a compression-reconstruction-subtraction (CRS) paradigm to learn low-rank background structure directly in the image domain without explicit matrix decomposition.
Result: On multiple public datasets, LRRNet outperforms 38 state-of-the-art methods in detection accuracy, robustness, and computational efficiency, runs in real time at 82.34 FPS, and remains robust under noise.
Conclusion: According to the authors, LRRNet is the first to model low-rank background structure directly through end-to-end deep learning, significantly improving the performance and efficiency of infrared small target detection.
Abstract: Infrared small target detection (IRSTD) remains a long-standing challenge in complex backgrounds due to low signal-to-clutter ratios (SCR), diverse target morphologies, and the absence of distinctive visual cues. While recent deep learning approaches aim to learn discriminative representations, the intrinsic variability and weak priors of small targets often lead to unstable performance. In this paper, we propose a novel end-to-end IRSTD framework, termed LRRNet, which leverages the low-rank property of infrared image backgrounds. Inspired by the physical compressibility of cluttered scenes, our approach adopts a compression-reconstruction-subtraction (CRS) paradigm to directly model structure-aware low-rank background representations in the image domain, without relying on patch-based processing or explicit matrix decomposition. To the best of our knowledge, this is the first work to directly learn low-rank background structures using deep neural networks in an end-to-end manner. Extensive experiments on multiple public datasets demonstrate that LRRNet outperforms 38 state-of-the-art methods in terms of detection accuracy, robustness, and computational efficiency. Remarkably, it achieves real-time performance with an average speed of 82.34 FPS. Evaluations on the challenging NoisySIRST dataset further confirm the model's resilience to sensor noise. The source code will be made publicly available upon acceptance.
[35] MF2Summ: Multimodal Fusion for Video Summarization with Temporal Alignment
Shuo wang,Jihao Zhang
Main category: cs.CV
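TL;DR: MF2Summ is a video summarization model based on multimodal content understanding that combines visual and auditory information and improves summaries through a five-stage pipeline.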
Details
Motivation: Traditional single-modality video summarization methods struggle to capture the full semantic richness of videos, which motivates a multimodal approach.
Method: MF2Summ follows a five-stage process (feature extraction, cross-modal attention interaction, feature fusion, segment prediction, and key shot selection) and uses Transformers to model inter-modal dependencies and temporal correspondences.
Result: On the SumMe and TVSum datasets, MF2Summ improves F1-scores over the DSNet model by 1.9% and 0.6% respectively, and compares favorably with other state-of-the-art methods.
Conclusion: Multimodal fusion improves video summarization performance in MF2Summ, supporting the effectiveness of multimodal methods.
Abstract: The rapid proliferation of online video content necessitates effective video summarization techniques. Traditional methods, often relying on a single modality (typically visual), struggle to capture the full semantic richness of videos. This paper introduces MF2Summ, a novel video summarization model based on multimodal content understanding, integrating both visual and auditory information. MF2Summ employs a five-stage process: feature extraction, cross-modal attention interaction, feature fusion, segment prediction, and key shot selection. Visual features are extracted using a pre-trained GoogLeNet model, while auditory features are derived using SoundNet. The core of our fusion mechanism involves a cross-modal Transformer and an alignment-guided self-attention Transformer, designed to effectively model inter-modal dependencies and temporal correspondences. Segment importance, location, and center-ness are predicted, followed by key shot selection using Non-Maximum Suppression (NMS) and the Kernel Temporal Segmentation (KTS) algorithm. Experimental results on the SumMe and TVSum datasets demonstrate that MF2Summ achieves competitive performance, notably improving F1-scores by 1.9% and 0.6% respectively over the DSNet model, and performing favorably against other state-of-the-art methods.
[36] Towards Robust Multimodal Emotion Recognition under Missing Modalities and Distribution Shifts
Guowei Zhong,Ruohong Huan,Mingzhen Wu,Ronghua Liang,Peng Chen
Main category: cs.CV
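TL;DR: CIDer is a new multimodal emotion recognition framework that uses self-distillation and causal inference modules to handle missing modalities and out-of-distribution data, outperforming existing methods.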
Details
Motivation: Multimodal Emotion Recognition (MER) faces the twin challenges of missing modalities and Out-Of-Distribution (OOD) data, and existing methods have limited practicality.
Method: The CIDer framework comprises a self-distillation module (MSSD) and a causal inference module (MACI), and introduces the RMFM task. MSSD improves robustness through weight sharing, while MACI reduces bias through a causal graph.
Result: CIDer performs well in both RMFM and OOD scenarios, with fewer parameters and faster training.
Conclusion: CIDer offers an efficient and robust solution for MER that is suitable for practical use.
Abstract: Recent advancements in Multimodal Emotion Recognition (MER) face challenges in addressing both modality missing and Out-Of-Distribution (OOD) data simultaneously. Existing methods often rely on specific models or introduce excessive parameters, which limits their practicality. To address these issues, we propose a novel robust MER framework, Causal Inference Distiller (CIDer), and introduce a new task, Random Modality Feature Missing (RMFM), to generalize the definition of modality missing. CIDer integrates two key components: a Model-Specific Self-Distillation (MSSD) module and a Model-Agnostic Causal Inference (MACI) module. MSSD enhances robustness under the RMFM task through a weight-sharing self-distillation approach applied across low-level features, attention maps, and high-level representations. Additionally, a Word-level Self-aligned Attention Module (WSAM) reduces computational complexity, while a Multimodal Composite Transformer (MCT) facilitates efficient multimodal fusion. To tackle OOD challenges, MACI employs a tailored causal graph to mitigate label and language biases using a Multimodal Causal Module (MCM) and fine-grained counterfactual texts. Notably, MACI can independently enhance OOD generalization with minimal additional parameters. Furthermore, we also introduce the new repartitioned MER OOD datasets. Experimental results demonstrate that CIDer achieves robust performance in both RMFM and OOD scenarios, with fewer parameters and faster training compared to state-of-the-art methods. The implementation of this work is publicly accessible at https://github.com/gw-zhong/CIDer.
[37] Rethinking Generative Human Video Coding with Implicit Motion Transformation
Bolin Chen,Ru-Ling Liao,Jie Chen,Yan Ye
Main category: cs.CV
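TL;DR: The paper proposes a Generative Human Video Coding (GHVC) method based on Implicit Motion Transformation (IMT) to address the distortions that explicit motion guidance causes when compressing complex human body video.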
Details
Motivation: Traditional explicit motion guidance produces distorted reconstructions and inaccurate motion in human body video compression because motion patterns are complex and diverse, so a more effective approach is needed.
Method: Complex human body signals are characterized as compact visual features, which are then transformed into implicit motion guidance for signal reconstruction.
Result: Experiments show that the IMT paradigm improves both the compression efficiency and the reconstruction fidelity of GHVC.
Conclusion: Implicit Motion Transformation (IMT) provides an efficient, high-fidelity solution for generative human video coding.
Abstract: Beyond traditional hybrid video codecs, generative video codecs can achieve promising compression performance by evolving high-dimensional signals into compact feature representations for bitstream compactness at the encoder side and developing explicit motion fields as intermediate supervision for high-quality reconstruction at the decoder side. This paradigm has achieved significant success in face video compression. However, compared to facial videos, human body videos pose greater challenges due to their more complex and diverse motion patterns, i.e., when using explicit motion guidance for Generative Human Video Coding (GHVC), the reconstruction results could suffer severe distortions and inaccurate motion. As such, this paper highlights the limitations of explicit motion-based approaches for human body video compression and investigates the GHVC performance improvement with the aid of Implicit Motion Transformation, namely IMT. In particular, we propose to characterize the complex human body signal as compact visual features and transform these features into implicit motion guidance for signal reconstruction. Experimental results demonstrate the effectiveness of the proposed IMT paradigm, which can facilitate GHVC to achieve high-efficiency compression and high-fidelity synthesis.
[38] Boosting Adversarial Transferability for Hyperspectral Image Classification Using 3D Structure-invariant Transformation and Intermediate Feature Distance
Chun Liu,Bingqian Zhu,Tao Xu,Zheng Zheng,Zheng Li,Wei Yang,Zhigang Han,Jiayao Wang
Main category: cs.CV
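TL;DR: This paper proposes a new method to improve the transferability of adversarial examples against hyperspectral image classification models, using block-wise processing and a feature distance loss to strengthen attacks.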
Details
Motivation: Hyperspectral images (HSIs) differ from natural images in their high dimensionality and rich spectral information; adversarial attack research on HSIs is limited and struggles to fully exploit image structure.
Method: The method combines block-wise processing (keeping the image structure unchanged, dividing the image into random blocks, and applying transformations per block) with a feature distance loss (intermediate-layer feature distance as the primary loss, output-layer prediction as the auxiliary loss).
Result: Experiments show that adversarial examples generated on public HSI datasets transfer effectively to black-box models and keep strong attack performance under defense strategies.
Conclusion: Block-wise processing and the feature distance loss together improve the transferability and attack effectiveness of adversarial examples.
Abstract: Deep Neural Networks (DNNs) are vulnerable to adversarial attacks, which pose security challenges to hyperspectral image (HSI) classification technologies based on DNNs. In the domain of natural images, numerous transfer-based adversarial attack methods have been studied. However, HSIs differ from natural images due to their high-dimensional and rich spectral information. Current research on HSI adversarial examples remains limited and faces challenges in fully utilizing the structural and feature information of images. To address these issues, this paper proposes a novel method to enhance the transferability of the adversarial examples for HSI classification models. First, while keeping the image structure unchanged, the proposed method randomly divides the image into blocks in both spatial and spectral dimensions. Then, various transformations are applied on a block by block basis to increase input diversity and mitigate overfitting. Second, a feature distancing loss targeting intermediate layers is designed, which measures the distance between the amplified features of the original examples and the features of the adversarial examples as the primary loss, while the output layer prediction serves as the auxiliary loss. This guides the perturbation to disrupt the features of the true class in adversarial examples, effectively enhancing transferability. Extensive experiments demonstrate that the adversarial examples generated by the proposed method achieve effective transferability to black-box models on two public HSI datasets. Furthermore, the method maintains robust attack performance even under defense strategies.
[39] Starting Positions Matter: A Study on Better Weight Initialization for Neural Network Quantization
Stone Yun,Alexander Wong
Main category: cs.CV
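TL;DR: The study examines how weight initialization affects the quantization robustness of deep neural networks (DNNs) and proposes GHN-QAT, a new method based on Graph Hypernetworks (GHN) that significantly improves quantized accuracy.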
Details
Motivation: Quantization is an important tool for reducing DNN inference cost, but little research has examined how weight initialization affects quantization robustness.
Method: After analyzing how different weight initializations affect the quantization robustness of CNNs, the authors propose GHN-QAT, which finetunes a GHN to predict the parameters of quantized DNNs.
Result: GHN-QAT significantly improves accuracy for 4-bit quantization and achieves better-than-random accuracy for 2-bit quantization.
Conclusion: GHN-QAT offers a new approach to quantized DNN design; future work could combine it with quantization-aware training for further gains.
Abstract: Deep neural network (DNN) quantization for fast, efficient inference has been an important tool in limiting the cost of machine learning (ML) model inference. Quantization-specific model development techniques such as regularization, quantization-aware training, and quantization-robustness penalties have served to greatly boost the accuracy and robustness of modern DNNs. However, very little exploration has been done on improving the initial conditions of DNN training for quantization. Just as random weight initialization has been shown to significantly impact test accuracy of floating point models, it would make sense that different weight initialization methods impact quantization robustness of trained models. We present an extensive study examining the effects of different weight initializations on a variety of CNN building blocks commonly used in efficient CNNs. This analysis reveals that even with varying CNN architectures, the choice of random weight initializer can significantly affect final quantization robustness. Next, we explore a new method for quantization-robust CNN initialization: using Graph Hypernetworks (GHN) to predict parameters of quantized DNNs. Besides showing that GHN-predicted parameters are quantization-robust after regular float32 pretraining (of the GHN), we find that finetuning GHNs to predict parameters for quantized graphs (which we call GHN-QAT) can further improve quantized accuracy of CNNs. Notably, GHN-QAT shows significant accuracy improvements for even 4-bit quantization and better-than-random accuracy for 2-bits. To the best of our knowledge, this is the first in-depth study on quantization-aware DNN weight initialization. GHN-QAT offers a novel approach to quantized DNN model design. Future investigations, such as using GHN-QAT-initialized parameters for quantization-aware training, can further streamline the DNN quantization process.
[40] MedSeg-R: Reasoning Segmentation in Medical Images with Multimodal Large Language Models
Yu Huang,Zelin Peng,Yichen Zhao,Piao Yang,Xiaokang Yang,Wei Shen
Main category: cs.CV
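TL;DR: The paper proposes MedSeg-R, an end-to-end framework for medical image reasoning segmentation that uses multimodal large language models (MLLMs) to generate precise segmentation masks, and introduces the MedSeg-QA dataset.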
Details
Motivation: Existing medical image segmentation models rely on explicit human instructions and lack active reasoning, so they cannot handle complex clinical questions.
Method: MedSeg-R consists of a global context understanding module and a pixel-level grounding module; it uses MLLMs to generate intermediate tokens, which are decoded into segmentation masks and textual responses.
Result: Experiments show that MedSeg-R performs strongly on several benchmarks, achieving high segmentation accuracy and interpretable textual analysis of medical images.
Conclusion: MedSeg-R is an effective solution for medical image reasoning segmentation that draws on MLLM capabilities and advances automatic medical diagnosis.
Abstract: Medical image segmentation is crucial for clinical diagnosis, yet existing models are limited by their reliance on explicit human instructions and lack the active reasoning capabilities to understand complex clinical questions. While recent advancements in multimodal large language models (MLLMs) have improved medical question-answering (QA) tasks, most methods struggle to generate precise segmentation masks, limiting their application in automatic medical diagnosis. In this paper, we introduce medical image reasoning segmentation, a novel task that aims to generate segmentation masks based on complex and implicit medical instructions. To address this, we propose MedSeg-R, an end-to-end framework that leverages the reasoning abilities of MLLMs to interpret clinical questions while also producing corresponding precise segmentation masks for medical images. It is built on two core components: 1) a global context understanding module that interprets images and comprehends complex medical instructions to generate multi-modal intermediate tokens, and 2) a pixel-level grounding module that decodes these tokens to produce precise segmentation masks and textual responses. Furthermore, we introduce MedSeg-QA, a large-scale dataset tailored for the medical image reasoning segmentation task. It includes over 10,000 image-mask pairs and multi-turn conversations, automatically annotated using large language models and refined through physician reviews. Experiments show MedSeg-R's superior performance across several benchmarks, achieving high segmentation accuracy and enabling interpretable textual analysis of medical images.
[41] LLMs Are Not Yet Ready for Deepfake Image Detection
Shahroz Tariq,David Nguyen,M. A. P. Chamikara,Tingmin Wu,Alsharif Abuadbba,Kristen Moore
Main category: cs.CV
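TL;DR: The study evaluates the zero-shot performance of four vision-language models (VLMs) on deepfake detection and finds that, although they produce plausible explanations, they are not yet reliable on their own and are better suited to an assistive role.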
Details
Motivation: Increasingly sophisticated deepfakes threaten media integrity and public trust, while the broad potential of VLMs has raised interest in applying them to deepfake detection.
Method: A structured zero-shot evaluation of four VLMs (ChatGPT, Claude, Gemini, and Grok) measures classification accuracy and reasoning depth on three deepfake types: faceswap, reenactment, and synthetic generation.
Result: VLMs produce coherent explanations and detect surface-level anomalies, but they are not dependable and are easily misled by deceptive visual patterns. Their strengths lie in interpretability and contextual analysis.
Conclusion: General-purpose models cannot yet be used for standalone deepfake detection, but they can serve as components of hybrid or human-in-the-loop detection frameworks.
Abstract: The growing sophistication of deepfakes presents substantial challenges to the integrity of media and the preservation of public trust. Concurrently, vision-language models (VLMs), large language models enhanced with visual reasoning capabilities, have emerged as promising tools across various domains, sparking interest in their applicability to deepfake detection. This study conducts a structured zero-shot evaluation of four prominent VLMs: ChatGPT, Claude, Gemini, and Grok, focusing on three primary deepfake types: faceswap, reenactment, and synthetic generation. Leveraging a meticulously assembled benchmark comprising authentic and manipulated images from diverse sources, we evaluate each model's classification accuracy and reasoning depth. Our analysis indicates that while VLMs can produce coherent explanations and detect surface-level anomalies, they are not yet dependable as standalone detection systems. We highlight critical failure modes, such as an overemphasis on stylistic elements and vulnerability to misleading visual patterns like vintage aesthetics. Nevertheless, VLMs exhibit strengths in interpretability and contextual analysis, suggesting their potential to augment human expertise in forensic workflows. These insights imply that although general-purpose models currently lack the reliability needed for autonomous deepfake detection, they hold promise as integral components in hybrid or human-in-the-loop detection frameworks.
[42] Sheet Music Benchmark: Standardized Optical Music Recognition Evaluation
Juan C. Martinez-Sevilla,Joan Cerveto-Serrano,Noelia Luna,Greg Chapman,Craig Sapp,David Rizo,Jorge Calvo-Zaragoza
Main category: cs.CV
TL;DR: Introduces the Sheet Music Benchmark (SMB) dataset and the OMR-NED evaluation metric to support Optical Music Recognition (OMR) research.
Details
Motivation: OMR research lacks standardized evaluation methods and datasets.
Method: The authors create the SMB dataset and the OMR-NED metric, and validate them with baseline experiments.
Result: The work provides a dataset covering diverse musical textures and a fine-grained evaluation metric that supports clear comparison of OMR performance.
Conclusion: The work fills a gap in OMR evaluation and gives researchers and end-users a practical tool.
Abstract: In this work, we introduce the Sheet Music Benchmark (SMB), a dataset of six hundred and eighty-five pages specifically designed to benchmark Optical Music Recognition (OMR) research. SMB encompasses a diverse array of musical textures, including monophony, pianoform, quartet, and others, all encoded in Common Western Modern Notation using the Humdrum **kern format. Alongside SMB, we introduce the OMR Normalized Edit Distance (OMR-NED), a new metric tailored explicitly for evaluating OMR performance. OMR-NED builds upon the widely-used Symbol Error Rate (SER), offering a fine-grained and detailed error analysis that covers individual musical elements such as note heads, beams, pitches, accidentals, and other critical notation features. The resulting numeric score provided by OMR-NED facilitates clear comparisons, enabling researchers and end-users alike to identify optimal OMR approaches. Our work thus addresses a long-standing gap in OMR evaluation, and we support our contributions with baseline experiments using standardized SMB dataset splits for training and assessing state-of-the-art methods.
[43] Class-Incremental Learning for Honey Botanical Origin Classification with Hyperspectral Images: A Study with Continual Backpropagation
Guyang Zhang,Waleed Abdulla
Main category: cs.CV
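TL;DR: The study proposes a new technique that improves class-incremental learning (CIL) performance by combining CIL with the continual backpropagation (CB) algorithm, applied to honey botanical origin classification.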
Details
Motivation: The botanical origin of honey affects its flavor and health value, but collecting every honey variety at once to train a model is impractical, so class-incremental learning is needed.
Method: The study compares several CIL algorithms and proposes combining them with the CB algorithm, which reinitializes less-used hidden neurons to improve performance.
Result: Experiments show that CB improves the performance of most CIL algorithms by 1-7%.
Conclusion: The CB algorithm mitigates loss of plasticity in class-incremental learning and improves the accuracy of honey botanical origin classification.
Abstract: Honey is an important commodity in the global market. Honey types of different botanical origins provide diversified flavors and health benefits, thus having different market values. Developing accurate and effective botanical origin-distinguishing techniques is crucial to protect consumers' interests. However, it is impractical to collect all the varieties of honey products at once to train a model for botanical origin differentiation. Therefore, researchers developed class-incremental learning (CIL) techniques to address this challenge. This study examined and compared multiple CIL algorithms on a real-world honey hyperspectral imaging dataset. A novel technique is also proposed to improve the performance of class-incremental learning algorithms by combining with a continual backpropagation (CB) algorithm. The CB method addresses the issue of loss-of-plasticity by reinitializing a proportion of less-used hidden neurons to inject variability into neural networks. Experiments showed that CB improved the performance of most CIL methods by 1-7%.
[44] Semantic Localization Guiding Segment Anything Model For Reference Remote Sensing Image Segmentation
Shuyang Li,Shuang Wang,Zhuangzhuang Sun,Jing Xiao
Main category: cs.CV
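TL;DR: The PSLG-SAM framework decomposes the RRSIS task into two stages, combining a visual grounding network with an enhanced SAM model to substantially reduce the annotation burden and improve performance.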
Details
Motivation: Existing RRSIS methods depend on dense annotation and struggle to interpret complex scenes.
Method: The task is split into coarse localization and fine segmentation: a visual grounding network roughly locates the target, and the SAM model then segments it precisely under coordinate guidance.
Result: Performance improves significantly on the RRSIS-D and RRSIS-M datasets, surpassing existing state-of-the-art models.
Conclusion: PSLG-SAM reduces annotation requirements while improving segmentation accuracy, and suits complex scenes.
Abstract: The Reference Remote Sensing Image Segmentation (RRSIS) task generates segmentation masks for specified objects in images based on textual descriptions, which has attracted widespread attention and research interest. Current RRSIS methods rely on multi-modal fusion backbones and semantic segmentation heads but face challenges like dense annotation requirements and complex scene interpretation. To address these issues, we propose a framework named prompt-generated semantic localization guiding Segment Anything Model (PSLG-SAM), which decomposes the RRSIS task into two stages: coarse localization and fine segmentation. In the coarse localization stage, a visual grounding network roughly locates the text-described object. In the fine segmentation stage, the coordinates from the first stage guide the Segment Anything Model (SAM), enhanced by a clustering-based foreground point generator and a mask boundary iterative optimization strategy for precise segmentation. Notably, the second stage can be train-free, significantly reducing the annotation data burden for the RRSIS task. Additionally, decomposing the RRSIS task into two stages allows for focusing on specific region segmentation, avoiding interference from complex scenes. We further contribute a high-quality, multi-category manually annotated dataset. Experimental validation on two datasets (RRSIS-D and RRSIS-M) demonstrates that PSLG-SAM achieves significant performance improvements and surpasses existing state-of-the-art models. Our code will be made publicly available.
[45] J-DDL: Surface Damage Detection and Localization System for Fighter Aircraft
Jin Huang,Mingqiang Wei,Zikuan Li,Hangyu Qu,Wei Zhao,Xinyu Bai
Main category: cs.CV
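TL;DR: Proposes J-DDL, a smart surface damage detection and localization system for fighter aircraft that combines 2D images with 3D point clouds and an optimized YOLO architecture to improve detection efficiency and accuracy.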
Details
Motivation: Fighter aircraft require frequent and complex surface inspections, and traditional manual methods are limited in scalability, efficiency, and consistency.
Method: The system integrates 2D image and 3D point cloud data and uses an optimized YOLO architecture with lightweight Fasternet blocks, EMA modules, and the Inner-CIOU loss function.
Result: The system detects surface damage efficiently and localizes it precisely; experiments validate its effectiveness.
Conclusion: J-DDL advances automated aircraft inspection, and the authors release what they describe as the first publicly available aircraft damage dataset.
Abstract: Ensuring the safety and extended operational life of fighter aircraft necessitates frequent and exhaustive inspections. While surface defect detection is feasible for human inspectors, manual methods face critical limitations in scalability, efficiency, and consistency due to the vast surface area, structural complexity, and operational demands of aircraft maintenance. We propose a smart surface damage detection and localization system for fighter aircraft, termed J-DDL. J-DDL integrates 2D images and 3D point clouds of the entire aircraft surface, captured using a combined system of laser scanners and cameras, to achieve precise damage detection and localization. Central to our system is a novel damage detection network built on the YOLO architecture, specifically optimized for identifying surface defects in 2D aircraft images. Key innovations include lightweight Fasternet blocks for efficient feature extraction, an optimized neck architecture incorporating Efficient Multiscale Attention (EMA) modules for superior feature aggregation, and the introduction of a novel loss function, Inner-CIOU, to enhance detection accuracy. After detecting damage in 2D images, the system maps the identified anomalies onto corresponding 3D point clouds, enabling accurate 3D localization of defects across the aircraft surface. Our J-DDL not only streamlines the inspection process but also ensures more comprehensive and detailed coverage of large and complex aircraft exteriors. To facilitate further advancements in this domain, we have developed the first publicly available dataset specifically focused on aircraft damage. Experimental evaluations validate the effectiveness of our framework, underscoring its potential to significantly advance automated aircraft inspection technologies.
[46] CogStream: Context-guided Streaming Video Question Answering
Zicheng Zhao,Kangyu Wang,Shijie Li,Rui Qian,Weiyao Lin,Huabin Liu
Main category: cs.CV
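TL;DR: The paper proposes CogStream, a new task focused on context-guided reasoning over streaming video, and a baseline model, CogReasoner, which tackles the task efficiently through visual stream compression and historical dialogue retrieval.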
Details
Motivation: Existing Video Large Language Models (Vid-LLMs) rely on the full historical context when processing streaming video, which imposes a heavy computational burden and exposes them to distraction from irrelevant information.
Method: The authors define the CogStream task and develop the CogReasoner model, which uses visual stream compression and historical dialogue retrieval.
Result: Experiments show that the method is effective.
Conclusion: The CogStream task and the CogReasoner model offer a new approach to streaming video reasoning; the code is due to be released.
Abstract: Despite advancements in Video Large Language Models (Vid-LLMs) improving multimodal understanding, challenges persist in streaming video reasoning due to its reliance on contextual information. Existing paradigms feed all available historical contextual information into Vid-LLMs, resulting in a significant computational burden for visual data processing. Furthermore, the inclusion of irrelevant context distracts models from key details. This paper introduces a challenging task called Context-guided Streaming Video Reasoning (CogStream), which simulates real-world streaming video scenarios, requiring models to identify the most relevant historical contextual information to deduce answers for questions about the current stream. To support CogStream, we present a densely annotated dataset featuring extensive and hierarchical question-answer pairs, generated by a semi-automatic pipeline. Additionally, we present CogReasoner as a baseline model. It efficiently tackles this task by leveraging visual stream compression and historical dialogue retrieval. Extensive experiments prove the effectiveness of this method. Code will be released soon.
[47] ALBERT: Advanced Localization and Bidirectional Encoder Representations from Transformers for Automotive Damage Evaluation
Teerapong Panboonyuen
Main category: cs.CV
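TL;DR: ALBERT is an instance segmentation model designed for car damage and part segmentation; it combines bidirectional encoder representations with advanced localization mechanisms to distinguish real from fake damage and to segment car parts.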
Details
Motivation: Intelligent automotive inspection and assessment need an efficient, accurate solution for damage and part segmentation.
Method: The model uses bidirectional encoder representations and advanced localization mechanisms, and is trained on a large-scale annotated automotive dataset.
Result: The model performs strongly in segmentation accuracy and damage classification, covering 26 damage types, 7 fake damage variants, and 61 car parts.
Conclusion: ALBERT provides solid technical support for intelligent automotive inspection applications.
Abstract: This paper introduces ALBERT, an instance segmentation model specifically designed for comprehensive car damage and part segmentation. Leveraging the power of Bidirectional Encoder Representations, ALBERT incorporates advanced localization mechanisms to accurately identify and differentiate between real and fake damages, as well as segment individual car parts. The model is trained on a large-scale, richly annotated automotive dataset that categorizes damage into 26 types, identifies 7 fake damage variants, and segments 61 distinct car parts. Our approach demonstrates strong performance in both segmentation accuracy and damage classification, paving the way for intelligent automotive inspection and assessment applications.
[48] SLICK: Selective Localization and Instance Calibration for Knowledge-Enhanced Car Damage Segmentation in Automotive Insurance
Teerapong Panboonyuen
Main category: cs.CV
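TL;DR: SLICK is a new framework for precise and robust car damage segmentation that uses structural priors and domain knowledge to address real-world automotive inspection challenges.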
Details
Motivation: In real-world automotive inspection, occlusion, deformation, and complex scenes make damage segmentation inaccurate.
Method: SLICK introduces five key components: selective part segmentation, localization-aware attention blocks, an instance-sensitive refinement head, cross-channel calibration, and a knowledge fusion module.
Result: On large-scale automotive datasets, SLICK shows superior segmentation performance, robustness, and practical applicability.
Conclusion: SLICK is efficient and practical for car damage segmentation and suits insurance and automotive inspection workflows.
Abstract: We present SLICK, a novel framework for precise and robust car damage segmentation that leverages structural priors and domain knowledge to tackle real-world automotive inspection challenges. SLICK introduces five key components: (1) Selective Part Segmentation using a high-resolution semantic backbone guided by structural priors to achieve surgical accuracy in segmenting vehicle parts even under occlusion, deformation, or paint loss; (2) Localization-Aware Attention blocks that dynamically focus on damaged regions, enhancing fine-grained damage detection in cluttered and complex street scenes; (3) an Instance-Sensitive Refinement head that leverages panoptic cues and shape priors to disentangle overlapping or adjacent parts, enabling precise boundary alignment; (4) Cross-Channel Calibration through multi-scale channel attention that amplifies subtle damage signals such as scratches and dents while suppressing noise like reflections and decals; and (5) a Knowledge Fusion Module that integrates synthetic crash data, part geometry, and real-world insurance datasets to improve generalization and handle rare cases effectively. Experiments on large-scale automotive datasets demonstrate SLICK's superior segmentation performance, robustness, and practical applicability for insurance and automotive inspection workflows.
[49] ContextRefine-CLIP for EPIC-KITCHENS-100 Multi-Instance Retrieval Challenge 2025
Jing He,Yiqing Wang,Lingling Li,Kexin Zhang,Puhua Chen
Main category: cs.CV
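TL;DR: CR-CLIP is an efficient visual-textual multi-instance retrieval model that uses a cross-modal attention flow module for dynamic interaction and feature refinement, significantly improving retrieval performance.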
Details
Motivation: Visual-textual multi-instance retrieval poses challenges in feature alignment and semantic optimization.
Method: The model builds on the dual-encoder AVION, adds a cross-modal attention flow module, and is optimized with Symmetric Multi-Similarity Loss.
Result: CR-CLIP reaches 66.78 mAP and 82.08 nDCG on EPIC-KITCHENS-100, significantly outperforming the baseline model.
Conclusion: CR-CLIP performs strongly in cross-modal retrieval, which validates its effectiveness.
Abstract: This report presents ContextRefine-CLIP (CR-CLIP), an efficient model for visual-textual multi-instance retrieval tasks. The approach is based on the dual-encoder AVION, on which we introduce a cross-modal attention flow module to achieve bidirectional dynamic interaction and refinement between visual and textual features to generate more context-aware joint representations. For soft-label relevance matrices provided in tasks such as EPIC-KITCHENS-100, CR-CLIP can work with Symmetric Multi-Similarity Loss to achieve more accurate semantic alignment and optimization using the refined features. Without using ensemble learning, the CR-CLIP model achieves 66.78 mAP and 82.08 nDCG on the EPIC-KITCHENS-100 public leaderboard, which significantly outperforms the baseline model and fully validates its effectiveness in cross-modal retrieval. The code will be released open-source on https://github.com/delCayr/ContextRefine-Clip
[50] From Images to Insights: Explainable Biodiversity Monitoring with Plain Language Habitat Explanations
Yutong Zhou,Masahiro Ryo
Main category: cs.CV
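TL;DR: Proposes an end-to-end visual-to-causal framework that turns a species image into interpretable causal insights about its habitat preference.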
Details
Motivation: Understanding species habitat preferences is essential for ecological research and biodiversity conservation, but existing workflows are fragmented and hard for non-specialists to use.
Method: The framework integrates species recognition, global occurrence retrieval, pseudo-absence sampling, and climate data extraction, then applies causal inference methods to analyze how environmental features influence species occurrence.
Result: Early results on a bee species and a flower species illustrate the potential of a multimodal AI assistant for ecological modeling.
Conclusion: The framework gives species habitat descriptions a statistical grounding and human-readable explanations, and could benefit ecological research and practice.
Abstract: Explaining why a species lives at a particular location is important for understanding ecological systems and conserving biodiversity. However, existing ecological workflows are fragmented and often inaccessible to non-specialists. We propose an end-to-end visual-to-causal framework that transforms a species image into interpretable causal insights about its habitat preference. The system integrates species recognition, global occurrence retrieval, pseudo-absence sampling, and climate data extraction. We then discover causal structures among environmental features and estimate their influence on species occurrence using modern causal inference methods. Finally, we generate statistically grounded, human-readable causal explanations from structured templates and large language models. We demonstrate the framework on a bee and a flower species and report early results as part of an ongoing project, showing the potential of the multimodal AI assistant backed up by a recommended ecological modeling practice for describing species habitat in human-understandable language.
[51] Balancing Tails when Comparing Distributions: Comprehensive Equity Index (CEI) with Application to Bias Evaluation in Operational Face Biometrics
Imanol Solano,Julian Fierrez,Aythami Morales,Alejandro Peña,Ruben Tolosana,Francisco Zamora-Martinez,Javier San Agustin
Main category: cs.CV
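TL;DR: The paper proposes a new metric, CEI, for detecting subtle demographic bias in high-performance face recognition systems, especially differences in the tails of score distributions.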
Details
Motivation: Existing metrics struggle to detect subtle demographic bias in face recognition systems, especially in the tails of the score distribution.
Method: The CEI metric analyzes genuine and impostor score distributions separately, with a configurable focus on tail probabilities that also accounts for overall distribution shape. An automated version, CEI^A, is also proposed.
Result: Experiments show that CEI detects subtle bias better than existing methods.
Conclusion: CEI is a sensitive and robust tool for assessing the fairness of face recognition systems, and it applies to any statistical problem that requires analyzing distribution tails.
Abstract: Demographic bias in high-performance face recognition (FR) systems often eludes detection by existing metrics, especially with respect to subtle disparities in the tails of the score distribution. We introduce the Comprehensive Equity Index (CEI), a novel metric designed to address this limitation. CEI uniquely analyzes genuine and impostor score distributions separately, enabling a configurable focus on tail probabilities while also considering overall distribution shapes. Our extensive experiments (evaluating state-of-the-art FR systems, intentionally biased models, and diverse datasets) confirm CEI's superior ability to detect nuanced biases where previous methods fall short. Furthermore, we present CEI^A, an automated version of the metric that enhances objectivity and simplifies practical application. CEI provides a robust and sensitive tool for operational FR fairness assessment. The proposed methods have been developed particularly for bias evaluation in face biometrics but, in general, they are applicable for comparing statistical distributions in any problem where one is interested in analyzing the distribution tails.
[52] LRSLAM: Low-rank Representation of Signed Distance Fields in Dense Visual SLAM System
Hongbeen Park,Minjeong Park,Giljoo Nam,Jinkyu Kim
Main category: cs.CV
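TL;DR: LRSLAM is an efficient visual SLAM model based on low-rank tensor decomposition that addresses the real-time, memory, and scalability limits of existing methods and outperforms prior work on several metrics.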
Details
Motivation: Dense visual SLAM faces challenges in real-time performance, robustness, and scalability to large scenes, and existing neural implicit representations have high computational and memory costs.
Method: Low-rank tensor decomposition (Six-axis and CP decompositions) is used to improve convergence rate, memory efficiency, and reconstruction and localization quality.
Result: Across several indoor RGB-D datasets, LRSLAM performs well in parameter efficiency, processing time, and accuracy.
Conclusion: Low-rank decomposition substantially improves SLAM performance in LRSLAM; the code will be open-sourced.
Abstract: Simultaneous Localization and Mapping (SLAM) has been crucial across various domains, including autonomous driving, mobile robotics, and mixed reality. Dense visual SLAM, leveraging RGB-D camera systems, offers advantages but faces challenges in achieving real-time performance, robustness, and scalability for large-scale scenes. Recent approaches utilizing neural implicit scene representations show promise but suffer from high computational costs and memory requirements. ESLAM introduced a plane-based tensor decomposition but still struggled with memory growth. Addressing these challenges, we propose a more efficient visual SLAM model, called LRSLAM, utilizing low-rank tensor decomposition methods. Our approach, leveraging the Six-axis and CP decompositions, achieves better convergence rates, memory efficiency, and reconstruction/localization quality than existing state-of-the-art approaches. Evaluation across diverse indoor RGB-D datasets demonstrates LRSLAM's superior performance in terms of parameter efficiency, processing time, and accuracy, retaining reconstruction and localization quality. Our code will be publicly available upon publication.
[53] DreamActor-H1: High-Fidelity Human-Product Demonstration Video Generation via Motion-designed Diffusion Transformers
Lizhen Wang,Zhurong Xia,Tianshu Hu,Pengrui Wang,Pengfei Wang,Zerong Zheng,Ming Zhou
Main category: cs.CV
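TL;DR: Proposes a Diffusion Transformer (DiT)-based framework for generating high-quality human-product demonstration videos that preserves the identities of both the human and the product and models their spatial relationship.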
Details
Motivation: Existing frameworks for human-product demonstration video generation struggle to preserve the identities of both the human and the product and lack an understanding of their spatial relationship, which leads to unrealistic representations and unnatural interactions.
Method: Builds on a DiT framework, injects paired human-product reference information, and uses masked cross-attention; a 3D body mesh template and product bounding boxes provide precise motion guidance; structured text encoding improves 3D consistency.
Result: Trained on a hybrid dataset with data augmentation, the method outperforms existing techniques in preserving human and product identity and in generating realistic demonstration motions.
Conclusion: The framework addresses identity preservation and spatial-relationship modeling in human-product demonstration video generation and produces more realistic, natural results.
Abstract: In e-commerce and digital marketing, generating high-fidelity human-product demonstration videos is important for effective product presentation. However, most existing frameworks either fail to preserve the identities of both humans and products or lack an understanding of human-product spatial relationships, leading to unrealistic representations and unnatural interactions. To address these challenges, we propose a Diffusion Transformer (DiT)-based framework. Our method simultaneously preserves human identities and product-specific details, such as logos and textures, by injecting paired human-product reference information and utilizing an additional masked cross-attention mechanism. We employ a 3D body mesh template and product bounding boxes to provide precise motion guidance, enabling intuitive alignment of hand gestures with product placements. Additionally, structured text encoding is used to incorporate category-level semantics, enhancing 3D consistency during small rotational changes across frames. Trained on a hybrid dataset with extensive data augmentation strategies, our approach outperforms state-of-the-art techniques in maintaining the identity integrity of both humans and products and generating realistic demonstration motions. Project page: https://submit2025-dream.github.io/DreamActor-H1/.
[54] Improving Medical Visual Representation Learning with Pathological-level Cross-Modal Alignment and Correlation Exploration
Jun Wang,Lixing Zhu,Xiaohan Yu,Abhir Bhalerao,Yulan He
Main category: cs.CV
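TL;DR: The PLACE framework improves medical visual representation learning through pathological-level alignment and correlation exploration, requires no additional annotations, and reaches state-of-the-art performance on multiple downstream tasks.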
Details
Motivation: Addresses data scarcity in the medical domain while handling the challenges of complex reports and semantic pathologies, with an emphasis on pathological-level consistency.
Method: Proposes pathological-level cross-modal alignment (PCMA), combined with a Visual Pathology Observation Extractor and a correlation-exploration proxy task.
Result: Performs strongly on classification, image-to-text retrieval, semantic segmentation, object detection, and report generation.
Conclusion: PLACE improves medical visual representation learning and shows broad applicability and robustness.
Abstract: Learning medical visual representations from image-report pairs through joint learning has garnered increasing research attention due to its potential to alleviate the data scarcity problem in the medical domain. The primary challenges stem from the lengthy reports that feature complex discourse relations and semantic pathologies. Previous works have predominantly focused on instance-wise or token-wise cross-modal alignment, often neglecting the importance of pathological-level consistency. This paper presents a novel framework PLACE that promotes the Pathological-Level Alignment and enriches the fine-grained details via Correlation Exploration without additional human annotations. Specifically, we propose a novel pathological-level cross-modal alignment (PCMA) approach to maximize the consistency of pathology observations from both images and reports. To facilitate this, a Visual Pathology Observation Extractor is introduced to extract visual pathological observation representations from localized tokens. The PCMA module operates independently of any external disease annotations, enhancing the generalizability and robustness of our methods. Furthermore, we design a proxy task that enforces the model to identify correlations among image patches, thereby enriching the fine-grained details crucial for various downstream tasks. Experimental results demonstrate that our proposed framework achieves new state-of-the-art performance on multiple downstream tasks, including classification, image-to-text retrieval, semantic segmentation, object detection and report generation.
[55] DanceChat: Large Language Model-Guided Music-to-Dance Generation
Qing Wang,Xiaohang Yang,Yilan Dong,Naveen Raj Govindaraj,Gregory Slabaugh,Shanxin Yuan
Main category: cs.CV
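TL;DR: DanceChat uses a large language model (LLM) as a choreographer whose textual instructions guide the generation of diverse dance motion consistent with the musical style, bridging the semantic gap between music and dance.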
Details
Motivation: There is a semantic gap between music and dance: music offers only abstract cues, while dance motion needs more explicit guidance. In addition, one piece of music can map to many dance motions, and data scarcity limits what models can learn.
Method: DanceChat has three modules: LLM-based pseudo-instruction generation, multi-modal feature extraction and fusion, and diffusion-based motion synthesis with an alignment loss.
Result: On the AIST++ dataset and in human evaluations, DanceChat outperforms existing methods in both quality and diversity.
Conclusion: With explicit guidance from an LLM, DanceChat achieves higher-quality and more diverse music-to-dance generation.
Abstract: Music-to-dance generation aims to synthesize human dance motion conditioned on musical input. Despite recent progress, significant challenges remain due to the semantic gap between music and dance motion, as music offers only abstract cues, such as melody, groove, and emotion, without explicitly specifying the physical movements. Moreover, a single piece of music can produce multiple plausible dance interpretations. This one-to-many mapping demands additional guidance, as music alone provides limited information for generating diverse dance movements. The challenge is further amplified by the scarcity of paired music and dance data, which restricts the model's ability to learn diverse dance patterns. In this paper, we introduce DanceChat, a Large Language Model (LLM)-guided music-to-dance generation approach. We use an LLM as a choreographer that provides textual motion instructions, offering explicit, high-level guidance for dance generation. This approach goes beyond implicit learning from music alone, enabling the model to generate dance that is both more diverse and better aligned with musical styles. Our approach consists of three components: (1) an LLM-based pseudo instruction generation module that produces textual dance guidance based on music style and structure, (2) a multi-modal feature extraction and fusion module that integrates music, rhythm, and textual guidance into a shared representation, and (3) a diffusion-based motion synthesis module together with a multi-modal alignment loss, which ensures that the generated dance is aligned with both musical and textual cues. Extensive experiments on AIST++ and human evaluations show that DanceChat outperforms state-of-the-art methods both qualitatively and quantitatively.
[56] Text to Image for Multi-Label Image Recognition with Joint Prompt-Adapter Learning
Chun-Mei Feng,Kai Yu,Xinxing Xu,Salman Khan,Rick Siow Mong Goh,Wangmeng Zuo,Yong Liu
Main category: cs.CV
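TL;DR: T2I-PAL uses text-to-image generation to reduce the modality gap and combines heatmaps and prototypes to strengthen multi-label recognition, yielding significant performance gains.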
Details
Motivation: Addresses the modality gap between text and images in CLIP models to improve image recognition when parameter-efficient fine-tuning uses text only.
Method: Generates diverse images with a text-to-image model, incorporates class-wise heatmaps and learnable prototypes, and combines prompt tuning with adapter learning.
Result: Improves average performance by 3.47% over existing methods on multiple benchmarks.
Conclusion: T2I-PAL needs no fully semantically annotated training images, reduces manual annotation work, remains compatible with existing CLIP frameworks, and improves performance significantly.
Abstract: Benefiting from image-text contrastive learning, pre-trained vision-language models, e.g., CLIP, allow texts to be used directly as images (TaI) for parameter-efficient fine-tuning (PEFT). While CLIP is capable of making image features similar to the corresponding text features, the modality gap remains a nontrivial issue and limits the image recognition performance of TaI. Using multi-label image recognition (MLR) as an example, we present a novel method, called T2I-PAL, to tackle the modality gap issue when using only text captions for PEFT. The core design of T2I-PAL is to leverage pre-trained text-to-image generation models to generate photo-realistic and diverse images from text captions, thereby reducing the modality gap. To further enhance MLR, T2I-PAL incorporates a class-wise heatmap and learnable prototypes. This aggregates local similarities, making the representation of local visual features more robust and informative for multi-label recognition. For better PEFT, we further combine both prompt tuning and adapter learning to enhance classification performance. T2I-PAL offers significant advantages: it eliminates the need for fully semantically annotated training images, thereby reducing the manual annotation workload, and it preserves the intrinsic mode of the CLIP model, allowing for seamless integration with any existing CLIP framework. Extensive experiments on multiple benchmarks, including MS-COCO, VOC2007, and NUS-WIDE, show that our T2I-PAL can boost recognition performance by 3.47% on average over the top-ranked state-of-the-art methods.
[57] Harmonizing Geometry and Uncertainty: Diffusion with Hyperspheres
Muskan Dosi,Chiranjeev Chiranjeev,Kartik Thakral,Mayank Vatsa,Richa Singh
Main category: cs.CV
TL;DR: The paper studies how diffusion models behave on hyperspherical data and proposes HyperSphereDiff to better preserve class geometry.
Details
Motivation: Standard diffusion models rely on isotropic Gaussian noise in Euclidean space and cannot handle the angular geometry of hyperspherical data effectively, which hurts generative performance.
Method: Proposes HyperSphereDiff, which aligns with hyperspherical structure through directional noise, preserving class geometry and capturing angular uncertainty.
Result: Theory and experiments show that the method aligns better with the intrinsic geometry of hyperspherical data and yields more accurate, geometry-aware generative models.
Conclusion: HyperSphereDiff is validated on multiple datasets and better preserves the hyperspherical manifold structure.
Abstract: Do contemporary diffusion models preserve the class geometry of hyperspherical data? Standard diffusion models rely on isotropic Gaussian noise in the forward process, inherently favoring Euclidean spaces. However, many real-world problems involve non-Euclidean distributions, such as hyperspherical manifolds, where class-specific patterns are governed by angular geometry within hypercones. When modeled in Euclidean space, these angular subtleties are lost, leading to suboptimal generative performance. To address this limitation, we introduce HyperSphereDiff to align hyperspherical structures with directional noise, preserving class geometry and effectively capturing angular uncertainty. We demonstrate both theoretically and empirically that this approach aligns the generative process with the intrinsic geometry of hyperspherical data, resulting in more accurate and geometry-aware generative models. We evaluate our framework on four object datasets and two face datasets, showing that incorporating angular uncertainty better preserves the underlying hyperspherical manifold. Resources are available at: https://github.com/IAB-IITJ/Harmonizing-Geometry-and-Uncertainty-Diffusion-with-Hyperspheres/
[58] Rethinking Random Masking in Self Distillation on ViT
Jihyeon Seong,Hyunkyung Han
Main category: cs.CV
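TL;DR: In the DINO self-distillation framework, applying random masking only to the student's global view, while leaving the student's local views and the teacher's global view unmasked, yields more robust and fine-grained attention maps and improves downstream performance.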
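The asymmetric setup in this entry can be sketched as a per-view patch mask, where only the student's global view receives a random mask. This is a minimal illustration, not the authors' code: the patch counts, the mask ratio, and the function names are assumptions.

```python
import random

def random_patch_mask(num_patches, mask_ratio, rng):
    """Return one boolean per patch; True marks a patch that is masked out."""
    num_masked = int(num_patches * mask_ratio)
    masked = set(rng.sample(range(num_patches), num_masked))
    return [index in masked for index in range(num_patches)]

def build_view_masks(num_global_patches, num_local_patches, mask_ratio, seed=0):
    """Mask only the student's global view; the other views stay unmasked."""
    rng = random.Random(seed)
    return {
        "teacher_global": [False] * num_global_patches,
        "student_global": random_patch_mask(num_global_patches, mask_ratio, rng),
        "student_local": [False] * num_local_patches,
    }

# Hypothetical sizes: 14 x 14 global patches, 6 x 6 local patches, 30% masking.
masks = build_view_masks(num_global_patches=196, num_local_patches=36, mask_ratio=0.3)
```

Keeping the teacher's view unmasked means the distillation target is computed from a clean input, which is the design choice the entry describes.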
Details
Motivation: Random masking may inadvertently remove critical semantic information, which calls for more informed masking strategies.
Method: Within DINO, applies random masking only to the student's global view and keeps the student's local views and the teacher's global view in their original form.
Result: On mini-ImageNet, the method produces more robust and fine-grained attention maps and improves performance.
Conclusion: An asymmetric random masking strategy is effective in self-distillation and strengthens the model.
Abstract: Vision Transformers (ViTs) have demonstrated remarkable performance across a wide range of vision tasks. In particular, self-distillation frameworks such as DINO have contributed significantly to these advances. Within such frameworks, random masking is often utilized to improve training efficiency and introduce regularization. However, recent studies have raised concerns that indiscriminate random masking may inadvertently eliminate critical semantic information, motivating the development of more informed masking strategies. In this study, we explore the role of random masking in the self-distillation setting, focusing on the DINO framework. Specifically, we apply random masking exclusively to the student's global view, while preserving the student's local views and the teacher's global view in their original, unmasked forms. This design leverages DINO's multi-view augmentation scheme to retain clean supervision while inducing robustness through masked inputs. We evaluate our approach using DINO-Tiny on the mini-ImageNet dataset and show that random masking under this asymmetric setup yields more robust and fine-grained attention maps, ultimately enhancing downstream performance.
[59] Hierarchical Error Assessment of CAD Models for Aircraft Manufacturing-and-Measurement
Jin Huang,Honghua Chen,Mingqiang Wei
Main category: cs.CV
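TL;DR: Proposes HEA-MM, a hierarchical error assessment framework for quality inspection of aircraft CAD models that analyzes error at three levels: global, part, and feature.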
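For the feature level, this entry's abstract reports error analysis on circular holes. As a rough illustration of what measuring one hole can involve, the sketch below fits a single circle to 2D edge points with an algebraic least-squares fit and reports per-point radial deviations. This is a simpler, swapped-in technique, not the tensor-voting and hypothesize-and-clusterize pipeline of the paper; all names and the sample data are assumptions.

```python
import math

def solve3(matrix, vector):
    """Solve a 3x3 linear system by Gaussian elimination with partial pivoting."""
    a = [row[:] + [value] for row, value in zip(matrix, vector)]
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(col + 1, 3):
            factor = a[r][col] / a[col][col]
            for c in range(col, 4):
                a[r][c] -= factor * a[col][c]
    x = [0.0, 0.0, 0.0]
    for r in reversed(range(3)):
        x[r] = (a[r][3] - sum(a[r][c] * x[c] for c in range(r + 1, 3))) / a[r][r]
    return x

def fit_circle(points):
    """Algebraic least-squares fit of x^2 + y^2 + D*x + E*y + F = 0."""
    ata = [[0.0] * 3 for _ in range(3)]
    atb = [0.0] * 3
    for x, y in points:
        row = (x, y, 1.0)
        rhs = -(x * x + y * y)
        for i in range(3):
            atb[i] += row[i] * rhs
            for j in range(3):
                ata[i][j] += row[i] * row[j]
    d, e, f = solve3(ata, atb)
    cx, cy = -d / 2.0, -e / 2.0
    return cx, cy, math.sqrt(cx * cx + cy * cy - f)

def radial_errors(points, cx, cy, radius):
    """Signed distance of each point from the fitted circle."""
    return [math.hypot(x - cx, y - cy) - radius for x, y in points]

# Hypothetical edge points sampled from a hole centered at (2, -1) with radius 3.
edge_points = [(2 + 3 * math.cos(2 * math.pi * k / 12),
                -1 + 3 * math.sin(2 * math.pi * k / 12)) for k in range(12)]
cx, cy, radius = fit_circle(edge_points)
errors = radial_errors(edge_points, cx, cy, radius)
```

Comparing the fitted center and radius against the nominal values in the CAD model would give a per-hole deviation; how HEA-MM reports that deviation is not specified in this entry.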
Details
Motivation: The high quality demanded of aviation equipment (high performance, stability, and reliability) requires precise error assessment.
Method: Acquires 3D data of workpieces with structured light scanners, then performs point cloud registration and hierarchical error analysis (global, part, feature); proposes an optimization-based primitive refinement method and a two-stage circular hole detection algorithm.
Result: Experiments show the method is effective on a variety of aircraft CAD models.
Conclusion: HEA-MM effectively assesses errors in aircraft CAD models and meets the needs of high-quality manufacturing.
Abstract: The most essential feature of aviation equipment is high quality, including high performance, high stability and high reliability. In this paper, we propose a novel hierarchical error assessment framework for aircraft CAD models within a manufacturing-and-measurement platform, termed HEA-MM. HEA-MM employs structured light scanners to obtain comprehensive 3D measurements of manufactured workpieces. The measured point cloud is registered with the reference CAD model, followed by an error analysis conducted at three hierarchical levels: global, part, and feature. At the global level, the error analysis evaluates the overall deviation of the scanned point cloud from the reference CAD model. At the part level, error analysis is performed on the patches underlying the point clouds. We propose a novel optimization-based primitive refinement method to obtain a set of meaningful patches of point clouds. Two basic operations, splitting and merging, are introduced to refine the coarse primitives. At the feature level, error analysis is performed on circular holes, which are commonly found in CAD models. To facilitate it, a two-stage algorithm is introduced for the detection of circular holes. First, edge points are identified using a tensor-voting algorithm. Then, multiple circles are fitted through a hypothesize-and-clusterize framework, ensuring accurate detection and analysis of the circular features. Experimental results on various aircraft CAD models demonstrate the effectiveness of our proposed method.
[60] Semantic-decoupled Spatial Partition Guided Point-supervised Oriented Object Detection
Xinyuan Liu,Hang Xu,Yike Ma,Yucheng Zhang,Feng Dai
Main category: cs.CV
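TL;DR: The SSP framework uses semantic-decoupled spatial partitioning to address sample assignment and instance confusion in point-supervised oriented object detection for high-density scenes, with significant performance gains.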
Details
Motivation: Annotation for oriented object detection in high-density scenes is costly, and existing methods suffer from inadequate sample assignment and instance confusion because of rigid rule-based designs.
Method: SSP combines rule-driven prior injection with data-driven label purification and proposes pixel-level and semantic-level spatial partitioning.
Result: On DOTA-v1.0 and other datasets, SSP reaches 45.78% mAP, 4.10% above PointOBB-v2, and improves further when combined with ORCNN and ReDet.
Conclusion: SSP offers an efficient, high-performing solution for oriented object detection in high-density scenes.
Abstract: Recent advances in remote sensing technology have driven rapid growth in imagery and rapid development of oriented object detection, which is nevertheless hindered by labor-intensive annotation for high-density scenes. Oriented object detection with point supervision offers a cost-effective solution for densely packed scenes in remote sensing, yet existing methods suffer from inadequate sample assignment and instance confusion due to rigid rule-based designs. To address this, we propose SSP (Semantic-decoupled Spatial Partition), a unified framework that synergizes rule-driven prior injection and data-driven label purification. Specifically, SSP introduces two core innovations: 1) Pixel-level Spatial Partition-based Sample Assignment, which compactly estimates the upper and lower bounds of object scales and mines high-quality positive samples and hard negative samples through spatial partitioning of pixel maps. 2) Semantic Spatial Partition-based Box Extraction, which derives instances from spatial partitions modulated by semantic maps and reliably converts them into bounding boxes to form pseudo-labels for supervising the learning of downstream detectors. Experiments on DOTA-v1.0 and others demonstrate SSP's superiority: it achieves 45.78% mAP under point supervision, outperforming SOTA method PointOBB-v2 by 4.10%. Furthermore, when integrated with ORCNN and ReDet architectures, the SSP framework achieves mAP values of 47.86% and 48.50%, respectively. The code is available at https://github.com/antxinyuan/ssp.
[61] High-resolution efficient image generation from WiFi CSI using a pretrained latent diffusion model
Eshan Ramesh,Nishio Takayuki
Main category: cs.CV
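TL;DR: LatentCSI uses a pretrained latent diffusion model (LDM) to generate images of the environment from WiFi CSI measurements, avoiding the heavy computation of earlier methods and enabling efficient, high-quality, text-guided image synthesis.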
Details
Motivation: Earlier methods rely on complex GAN-based techniques that are computationally expensive and inefficient; LatentCSI aims to simplify the pipeline and improve quality with a lightweight network and an LDM.
Method: A lightweight neural network maps CSI amplitudes directly into the LDM's latent space; a text-guided diffusion model and the pretrained decoder then produce a high-resolution image.
Result: Validated on a public dataset and a self-collected dataset, LatentCSI outperforms baselines in computational efficiency and perceptual quality and supports text-guided control.
Conclusion: LatentCSI provides an efficient, high-quality, and controllable way to generate images from WiFi CSI.
Abstract: We present LatentCSI, a novel method for generating images of the physical environment from WiFi CSI measurements that leverages a pretrained latent diffusion model (LDM). Unlike prior approaches that rely on complex and computationally intensive techniques such as GANs, our method employs a lightweight neural network to map CSI amplitudes directly into the latent space of an LDM. We then apply the LDM's denoising diffusion model to the latent representation with text-based guidance before decoding using the LDM's pretrained decoder to obtain a high-resolution image. This design bypasses the challenges of pixel-space image generation and avoids the explicit image encoding stage typically required in conventional image-to-image pipelines, enabling efficient and high-quality image synthesis. We validate our approach on two datasets: a wide-band CSI dataset we collected with off-the-shelf WiFi devices and cameras; and a subset of the publicly available MM-Fi dataset. The results demonstrate that LatentCSI outperforms baselines of comparable complexity trained directly on ground-truth images in both computational efficiency and perceptual quality, while additionally providing practical advantages through its unique capacity for text-guided controllability.
[62] MSTAR: Box-free Multi-query Scene Text Retrieval with Attention Recycling
Liang Yin,Xudong Xie,Zhang Li,Xiang Bai,Yuliang Liu
Main category: cs.CV
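TL;DR: MSTAR is a scene text retrieval method that needs no bounding box annotations; it dynamically captures multi-grained text representations and unifies free-style text queries, significantly improving retrieval performance.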
Details
Motivation: Existing methods depend on costly bounding box annotations and struggle to unify diverse query types.
Method: Uses progressive vision embedding and style-aware instructions, together with a multi-instance matching module that strengthens vision-language alignment.
Result: Performs strongly on multiple datasets, with an average gain of 8.5% on the MQTR benchmark.
Conclusion: MSTAR reduces annotation cost while significantly improving multi-query scene text retrieval.
Abstract: Scene text retrieval has made significant progress with the assistance of accurate text localization. However, existing approaches typically require costly bounding box annotations for training. Besides, they mostly adopt a customized retrieval strategy but struggle to unify various types of queries to meet diverse retrieval needs. To address these issues, we introduce Multi-query Scene Text retrieval with Attention Recycling (MSTAR), a box-free approach for scene text retrieval. It incorporates progressive vision embedding to dynamically capture the multi-grained representation of texts and harmonizes free-style text queries with style-aware instructions. Additionally, a multi-instance matching module is integrated to enhance vision-language alignment. Furthermore, we build the Multi-Query Text Retrieval (MQTR) dataset, the first benchmark designed to evaluate the multi-query scene text retrieval capability of models, comprising four query types and 16k images. Extensive experiments demonstrate the superiority of our method across seven public datasets and the MQTR dataset. Notably, MSTAR surpasses the previous state-of-the-art model by 6.4% in MAP on Total-Text while eliminating box annotation costs. Moreover, on the MQTR benchmark, MSTAR significantly outperforms the previous models by an average of 8.5%. The code and datasets are available at https://github.com/yingift/MSTAR.
[63] TexTailor: Customized Text-aligned Texturing via Effective Resampling
Suin Lee,Dae-Shik Kim
Main category: cs.CV
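TL;DR: TexTailor is a new method for generating consistent object textures from text descriptions; it addresses the gradual drift in texture properties across viewpoints seen in existing methods.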
Details
Motivation: Existing text-to-texture methods produce inconsistent textures because the diffusion process integrates previously synthesized texture information insufficiently and the predefined camera positions are poorly chosen.
Method: TexTailor integrates previously synthesized texture information through a resampling scheme, fine-tunes a depth-aware diffusion model, and introduces a performance preservation loss and adaptive camera position adjustment.
Result: Experiments on Objaverse and the ShapeNet car dataset show that TexTailor outperforms existing methods at synthesizing view-consistent textures.
Conclusion: By improving texture information integration and camera position selection, TexTailor significantly improves the view consistency of synthesized textures.
Abstract: We present TexTailor, a novel method for generating consistent object textures from textual descriptions. Existing text-to-texture synthesis approaches utilize depth-aware diffusion models to progressively generate images and synthesize textures across predefined multiple viewpoints. However, these approaches lead to a gradual shift in texture properties across viewpoints due to (1) insufficient integration of previously synthesized textures at each viewpoint during the diffusion process and (2) the autoregressive nature of the texture synthesis process. Moreover, the predefined selection of camera positions, which does not account for the object's geometry, limits the effective use of texture information synthesized from different viewpoints, ultimately degrading overall texture consistency. In TexTailor, we address these issues by (1) applying a resampling scheme that repeatedly integrates information from previously synthesized textures within the diffusion process, and (2) fine-tuning a depth-aware diffusion model on these resampled textures. During this process, we observed that using only a few training images restricts the model's original ability to generate high-fidelity images aligned with the conditioning, and therefore propose a performance preservation loss to mitigate this issue. Additionally, we improve the synthesis of view-consistent textures by adaptively adjusting camera positions based on the object's geometry. Experiments on a subset of the Objaverse dataset and the ShapeNet car dataset demonstrate that TexTailor outperforms state-of-the-art methods in synthesizing view-consistent textures. The source code for TexTailor is available at https://github.com/Adios42/Textailor
[64] Anatomy-Grounded Weakly Supervised Prompt Tuning for Chest X-ray Latent Diffusion Models
Konstantinos Vilouras,Ilias Stogiannidis,Junyu Yan,Alison Q. O'Neil,Sotirios A. Tsaftaris
Main category: cs.CV
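TL;DR: The paper proposes a fine-tuning framework for text-to-image latent diffusion models in medical imaging that improves the alignment between clinical information and images and achieves state-of-the-art performance on a standard dataset.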
Details
Motivation: Text-to-image latent diffusion models are underexplored in medical imaging, mainly because privacy concerns limit the available data. Standard models fail to align clinical text with image information.
Method: Proposes a fine-tuning framework that improves the multi-modal alignment of a pre-trained model so it can be used for downstream tasks such as phrase grounding.
Result: Achieves state-of-the-art performance on a standard dataset (MS-CXR) and performs robustly on external data (VinDr-CXR).
Conclusion: The method provides an efficient solution for text-to-image tasks in medical imaging and generalizes well.
Abstract: Latent Diffusion Models have shown remarkable results in text-guided image synthesis in recent years. In the domain of natural (RGB) images, recent works have shown that such models can be adapted to various vision-language downstream tasks with little to no supervision involved. On the contrary, text-to-image Latent Diffusion Models remain relatively underexplored in the field of medical imaging, primarily due to limited data availability (e.g., due to privacy concerns). In this work, focusing on the chest X-ray modality, we first demonstrate that a standard text-conditioned Latent Diffusion Model has not learned to align clinically relevant information in free-text radiology reports with the corresponding areas of the given scan. Then, to alleviate this issue, we propose a fine-tuning framework to improve multi-modal alignment in a pre-trained model such that it can be efficiently repurposed for downstream tasks such as phrase grounding. Our method sets a new state-of-the-art on a standard benchmark dataset (MS-CXR), while also exhibiting robust performance on out-of-distribution data (VinDr-CXR). Our code will be made publicly available.
[65] Symmetrical Flow Matching: Unified Image Generation, Segmentation, and Classification with Score-Based Generative Models
Francisco Caetano,Christiaan Viviers,Peter H. N. De With,Fons van der Sommen
Main category: cs.CV
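TL;DR: SymmFlow is a symmetrical flow matching framework that unifies semantic segmentation, classification, and image generation; it learns bi-directional consistency while retaining semantic information, enabling efficient sampling and diverse generation.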
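As background for this entry, the sketch below shows the standard linear-path flow matching setup: a point on the straight path between a source sample and a target sample, the constant velocity a model would be trained to regress, and fixed-step Euler sampling. It does not implement SymmFlow's symmetric objective or its semantic-retention term; the function names, the toy vectors, and the oracle velocity function are assumptions. The 25 steps mirror the inference step count quoted in the abstract, while the choice of an Euler integrator is an assumption.

```python
def interpolate(x0, x1, t):
    """Point at time t on the straight path from x0 (t = 0) to x1 (t = 1)."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]

def target_velocity(x0, x1):
    """Regression target for the velocity field along the straight path."""
    return [b - a for a, b in zip(x0, x1)]

def euler_sample(x0, velocity_fn, num_steps):
    """Integrate dx/dt = velocity_fn(x, t) from t = 0 to t = 1 with Euler steps."""
    x = list(x0)
    dt = 1.0 / num_steps
    for step in range(num_steps):
        v = velocity_fn(x, step * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

# Toy example: an oracle velocity stands in for a trained network.
source = [0.0, 0.0]
target = [2.0, 4.0]
midpoint = interpolate(source, target, 0.25)
sampled = euler_sample(source, lambda x, t: target_velocity(source, target), 25)
```

In a symmetric formulation the same machinery would be run in both directions (for example, image to mask and mask to image); that coupling is not shown here.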
Details
Motivation: Conventional methods lack bi-directional consistency and semantic preservation when transforming between distributions; SymmFlow aims to address this while supporting flexible pixel-level and image-level conditional generation.
Method: Uses a symmetric learning objective to model forward and reverse transformations jointly, and introduces a new training objective that explicitly retains semantic information, supporting one-step segmentation and classification.
Result: Achieves FID 11.9 on CelebAMask-HQ and 7.0 on COCO-Stuff with only 25 inference steps, and performs well on segmentation and classification.
Conclusion: SymmFlow delivers efficient and consistent performance across tasks and provides a unified framework for generative modeling and semantic tasks.
Abstract: Flow Matching has emerged as a powerful framework for learning continuous transformations between distributions, enabling high-fidelity generative modeling. This work introduces Symmetrical Flow Matching (SymmFlow), a new formulation that unifies semantic segmentation, classification, and image generation within a single model. Using a symmetric learning objective, SymmFlow models forward and reverse transformations jointly, ensuring bi-directional consistency, while preserving sufficient entropy for generative diversity. A new training objective is introduced to explicitly retain semantic information across flows, featuring efficient sampling while preserving semantic structure, allowing for one-step segmentation and classification without iterative refinement. Unlike previous approaches that impose strict one-to-one mapping between masks and images, SymmFlow generalizes to flexible conditioning, supporting both pixel-level and image-level class labels. Experimental results on various benchmarks demonstrate that SymmFlow achieves state-of-the-art performance on semantic image synthesis, obtaining FID scores of 11.9 on CelebAMask-HQ and 7.0 on COCO-Stuff with only 25 inference steps. Additionally, it delivers competitive results on semantic segmentation and shows promising capabilities in classification tasks. The code will be publicly available.
[66] VideoDeepResearch: Long Video Understanding With Agentic Tool Using
Huaying Yuan,Zheng Liu,Junjie Zhou,Ji-Rong Wen,Zhicheng Dou
Main category: cs.CV
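TL;DR: VideoDeepResearch is an agentic framework that relies only on a text-only large reasoning model (LRM) and a multi-modal toolkit; it substantially improves long video understanding (LVU) and surpasses existing multi-modal large language model (MLLM) baselines.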
Details
Motivation: Challenges the assumption that long video understanding (LVU) requires multi-modal large language models (MLLMs) with extended context windows or strong visual perception, and proposes an alternative that needs neither.
Method: Combines a text-only large reasoning model (LRM) with a modular multi-modal toolkit (multimodal retrievers and visual perceivers); the system reasons out a problem-solving strategy and accesses video content selectively.
Result: Surpasses the previous state of the art by 9.6%, 6.6%, and 3.9% on MLVU (test), LVBench, and LongVideoBench, respectively, as reported in the abstract.
Conclusion: Agentic systems show promise for the key challenges of long video understanding without relying on complex multi-modal large language models.
Abstract: Long video understanding (LVU) presents a significant challenge for current multi-modal large language models (MLLMs) due to the task's inherent complexity and context window constraint. It is widely assumed that addressing LVU tasks requires foundation MLLMs with extended context windows, strong visual perception capabilities, and proficient domain expertise. In this work, we challenge this common belief by introducing VideoDeepResearch, a novel agentic framework for long video understanding. Our approach relies solely on a text-only large reasoning model (LRM) combined with a modular multi-modal toolkit, including multimodal retrievers and visual perceivers, all of which are readily available in practice. For each LVU task, the system formulates a problem-solving strategy through reasoning, while selectively accessing and utilizing essential video content via tool use. We conduct extensive experiments on popular LVU benchmarks, including MLVU, Video-MME, and LVBench. Our results demonstrate that VideoDeepResearch achieves substantial improvements over existing MLLM baselines, surpassing the previous state-of-the-art by 9.6%, 6.6%, and 3.9% on MLVU (test), LVBench, and LongVideoBench, respectively. These findings highlight the promise of agentic systems in overcoming key challenges in LVU problems.
[67] GigaVideo-1: Advancing Video Generation via Automatic Feedback with 4 GPU-Hours Fine-Tuning
Xiaoyi Bao,Jindi Lv,Xiaofeng Wang,Zheng Zhu,Xinze Chen,YuKun Zhou,Jiancheng Lv,Xingang Wang,Guan Huang
Main category: cs.CV
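TL;DR: GigaVideo-1 is an efficient fine-tuning framework for video generation that needs no human annotation and improves pre-trained models through automatic feedback.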
Details
Motivation: Existing fine-tuning methods for video generation models depend on human annotation and large-scale compute, which limits their practicality.
Method: Designs a prompt-driven data engine and a reward-guided training strategy that weights samples using feedback from pre-trained vision-language models.
Result: On the VBench-2.0 benchmark, GigaVideo-1 achieves an average gain of about 4% across 17 dimensions using only 4 GPU-hours.
Conclusion: GigaVideo-1 requires no manual annotation and little real data, and is both effective and efficient.
Abstract: Recent progress in diffusion models has greatly enhanced video generation quality, yet these models still require fine-tuning to improve specific dimensions like instance preservation, motion rationality, composition, and physical plausibility. Existing fine-tuning approaches often rely on human annotations and large-scale computational resources, limiting their practicality. In this work, we propose GigaVideo-1, an efficient fine-tuning framework that advances video generation without additional human supervision. Rather than injecting large volumes of high-quality data from external sources, GigaVideo-1 unlocks the latent potential of pre-trained video diffusion models through automatic feedback. Specifically, we focus on two key aspects of the fine-tuning process: data and optimization. To improve fine-tuning data, we design a prompt-driven data engine that constructs diverse, weakness-oriented training samples. On the optimization side, we introduce a reward-guided training strategy, which adaptively weights samples using feedback from pre-trained vision-language models with a realism constraint. We evaluate GigaVideo-1 on the VBench-2.0 benchmark using Wan2.1 as the baseline across 17 evaluation dimensions. Experiments show that GigaVideo-1 consistently improves performance on almost all the dimensions with an average gain of about 4% using only 4 GPU-hours. Requiring no manual annotations and minimal real data, GigaVideo-1 demonstrates both effectiveness and efficiency. Code, model, and data will be publicly available.
[68] VINCIE: Unlocking In-context Image Editing from Video
Leigang Qu,Feng Cheng,Ziyan Yang,Qi Zhao,Shanchuan Lin,Yichun Shi,Yicong Li,Wenjie Wang,Tat-Seng Chua,Lu Jiang
Main category: cs.CV
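TL;DR: The paper proposes learning in-context image editing directly from video; with videos annotated as multimodal sequences and a block-causal diffusion transformer, it achieves state-of-the-art multi-task learning and multi-turn image editing performance.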
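The term "block-causal" in this entry can be illustrated with a generic attention mask: tokens attend to every token in their own block and in earlier blocks, but never to later blocks. The sketch below builds such a mask from a list of block lengths. It shows the general pattern only; how VINCIE defines its blocks (for example, one block per text segment or image) and the function name are assumptions.

```python
def block_causal_mask(block_sizes):
    """Boolean mask[q][k]: True if query token q may attend to key token k.

    Attention is bidirectional inside a block and causal across blocks.
    """
    block_of = [block for block, size in enumerate(block_sizes) for _ in range(size)]
    n = len(block_of)
    return [[block_of[k] <= block_of[q] for k in range(n)] for q in range(n)]

# Hypothetical sequence: a 2-token block followed by a 3-token block.
mask = block_causal_mask([2, 3])
```

In this example the first two tokens see only each other, while the last three tokens see all five.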
Details
Motivation: Existing methods depend on task-specific pipelines and expert models, which limits the flexibility of data annotation and the scalability of the model. The paper explores whether in-context image editing can be learned directly from video.
Method: Proposes a scalable way to annotate videos as multimodal sequences and designs a block-causal diffusion transformer trained on three proxy tasks: next-image prediction, current segmentation prediction, and next-segmentation prediction.
Result: The model shows strong in-context image editing ability, reaches state-of-the-art results on multi-turn image editing benchmarks, and shows promise in multi-concept composition, story generation, and chain-of-editing applications.
Conclusion: Learning in-context image editing directly from video is feasible, and the model has broad potential for multi-task and multi-turn editing.
Abstract: In-context image editing aims to modify images based on a contextual sequence comprising text and previously generated images. Existing methods typically depend on task-specific pipelines and expert models (e.g., segmentation and inpainting) to curate training data. In this work, we explore whether an in-context image editing model can be learned directly from videos. We introduce a scalable approach to annotate videos as interleaved multimodal sequences. To effectively learn from this data, we design a block-causal diffusion transformer trained on three proxy tasks: next-image prediction, current segmentation prediction, and next-segmentation prediction. Additionally, we propose a novel multi-turn image editing benchmark to advance research in this area. Extensive experiments demonstrate that our model exhibits strong in-context image editing capabilities and achieves state-of-the-art results on two multi-turn image editing benchmarks. Despite being trained exclusively on videos, our model also shows promising abilities in multi-concept composition, story generation, and chain-of-editing applications.
[69] PiPViT: Patch-based Visual Interpretable Prototypes for Retinal Image Analysis
Marzieh Oghbaie,Teresa Araújo,Hrvoje Bogunović
Main category: cs.CV
TL;DR: PiPViT is a vision-transformer-based prototypical model that learns interpretable lesion prototypes to make medical image classification more transparent.
Details
Motivation: Addresses two problems of existing prototype-based methods in medical imaging: inconsistent visualizations and overly fine-grained prototypes.
Method: Uses a ViT to capture long-range dependencies and combines contrastive learning with multi-resolution input processing to learn interpretable lesion prototypes.
Result: Performs competitively on retinal OCT image classification, and the learned prototypes are clinically relevant.
Conclusion: PiPViT can explain its decisions transparently and support clinical diagnosis.
Abstract: Background and Objective: Prototype-based methods improve interpretability by learning fine-grained part-prototypes; however, their visualization in the input pixel space is not always consistent with human-understandable biomarkers. In addition, well-known prototype-based approaches typically learn extremely granular prototypes that are less interpretable in medical imaging, where both the presence and extent of biomarkers and lesions are critical. Methods: To address these challenges, we propose PiPViT (Patch-based Visual Interpretable Prototypes), an inherently interpretable prototypical model for image recognition. Leveraging a vision transformer (ViT), PiPViT captures long-range dependencies among patches to learn robust, human-interpretable prototypes that approximate lesion extent only using image-level labels. Additionally, PiPViT benefits from contrastive learning and multi-resolution input processing, which enables effective localization of biomarkers across scales. Results: We evaluated PiPViT on retinal OCT image classification across four datasets, where it achieved competitive quantitative performance compared to state-of-the-art methods while delivering more meaningful explanations. Moreover, quantitative evaluation on a hold-out test set confirms that the learned prototypes are semantically and clinically relevant. We believe PiPViT can transparently explain its decisions and assist clinicians in understanding diagnostic outcomes. GitHub page: https://github.com/marziehoghbaie/PiPViT
[70] MMMG: A Massive, Multidisciplinary, Multi-Tier Generation Benchmark for Text-to-Image Reasoning
Yuxuan Luo,Yuhui Yuan,Junwen Chen,Haonan Cai,Ziyi Yue,Yuwei Yang,Fatima Zohra Daha,Ji Li,Zhouhui Lian
Main category: cs.CV
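TL;DR: The paper introduces the knowledge image generation task and the MMMG benchmark to evaluate the multimodal reasoning ability of image generation models, finds that existing models perform poorly, and releases the baseline model FLUX-Reason.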
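As a rough illustration of the factual-fidelity idea in this entry (graph-edit distance between a reference knowledge graph and the one extracted from a generated image), the sketch below counts node and edge insertions and deletions. It assumes entities are matched by exact label, which sidesteps the correspondence search a full graph-edit distance performs; the function names and the normalization are hypothetical and are not taken from the MMMG-Score definition.

```python
def kg_edit_distance(ref_nodes, ref_edges, gen_nodes, gen_edges):
    """Insertions plus deletions needed to turn the generated KG into the reference KG.

    Assumption: nodes with the same label are the same entity, so no
    node-correspondence search is needed (a simplification of true GED).
    """
    node_ops = len(set(ref_nodes) ^ set(gen_nodes))  # missing or spurious entities
    edge_ops = len(set(ref_edges) ^ set(gen_edges))  # missing or spurious relations
    return node_ops + edge_ops


def kg_fidelity(ref_nodes, ref_edges, gen_nodes, gen_edges):
    """Hypothetical normalization of the distance into a [0, 1] fidelity score."""
    total = len(set(ref_nodes) | set(gen_nodes)) + len(set(ref_edges) | set(gen_edges))
    if total == 0:
        return 1.0
    return 1.0 - kg_edit_distance(ref_nodes, ref_edges, gen_nodes, gen_edges) / total


# Reference graph for a toy "solar system" knowledge image.
ref_n = {"sun", "earth", "moon"}
ref_e = {("earth", "orbits", "sun"), ("moon", "orbits", "earth")}
# Graph recovered from a generated image that omits the moon.
gen_n = {"sun", "earth"}
gen_e = {("earth", "orbits", "sun")}

distance = kg_edit_distance(ref_n, ref_e, gen_n, gen_e)
fidelity = kg_fidelity(ref_n, ref_e, gen_n, gen_e)
```

In this toy case one entity and one relation are missing, so the distance is 2 out of 5 graph elements.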
Details
Motivation: Knowledge images are essential to human civilization and learning, but generating them requires fusing world knowledge with pixel-level reasoning, which existing models handle poorly. Method: Builds the MMMG benchmark (4,456 expert-validated image-prompt pairs), adopts a unified knowledge graph representation, and proposes the MMMG-Score metric. Result: Evaluation of 16 state-of-the-art text-to-image models reveals reasoning deficits (for example, GPT-4o scores only 50.20); the baseline model FLUX-Reason (score 34.45) is released. Conclusion: The MMMG benchmark exposes reasoning deficits in image generation models, and FLUX-Reason provides an open baseline for future research. Abstract: In this paper, we introduce knowledge image generation as a new task, alongside the Massive Multi-Discipline Multi-Tier Knowledge-Image Generation Benchmark (MMMG) to probe the reasoning capability of image generation models. Knowledge images have been central to human civilization and to the mechanisms of human learning--a fact underscored by dual-coding theory and the picture-superiority effect. Generating such images is challenging, demanding multimodal reasoning that fuses world knowledge with pixel-level grounding into clear explanatory visuals. To enable comprehensive evaluation, MMMG offers 4,456 expert-validated (knowledge) image-prompt pairs spanning 10 disciplines, 6 educational levels, and diverse knowledge formats such as charts, diagrams, and mind maps. To eliminate confounding complexity during evaluation, we adopt a unified Knowledge Graph (KG) representation. Each KG explicitly delineates a target image's core entities and their dependencies. We further introduce MMMG-Score to evaluate generated knowledge images. This metric combines factual fidelity, measured by graph-edit distance between KGs, with visual clarity assessment. Comprehensive evaluations of 16 state-of-the-art text-to-image generation models expose serious reasoning deficits--low entity fidelity, weak relations, and clutter--with GPT-4o achieving an MMMG-Score of only 50.20, underscoring the benchmark's difficulty. To spur further progress, we release FLUX-Reason (MMMG-Score of 34.45), an effective and open baseline that combines a reasoning LLM with diffusion models and is trained on 16,000 curated knowledge image-prompt pairs.
[71] Enhancing Deepfake Detection using SE Block Attention with CNN
Subhram Dasgupta,Janelle Mason,Xiaohong Yuan,Olusola Odeyomi,Kaushik Roy
Main category: cs.CV
TL;DR: Proposes a lightweight CNN with an SE attention module for Deepfake detection, achieving high accuracy with low resource consumption.
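The channel-wise recalibration that a squeeze-and-excitation (SE) block performs can be sketched in a few lines. This is a minimal, dependency-free illustration of the standard SE computation (global average pooling, a two-layer bottleneck with ReLU and sigmoid, then channel scaling); the function name, the omission of biases, and the toy weights are assumptions for illustration and do not reproduce the paper's network.

```python
import math


def se_block(feature_maps, w_reduce, w_expand):
    """Apply squeeze-and-excitation to a list of C channels (each an H x W list).

    w_reduce: R x C weight rows for the bottleneck layer (biases omitted).
    w_expand: C x R weight rows that map back to one gate per channel.
    """
    # Squeeze: one scalar per channel via global average pooling.
    squeezed = [
        sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in feature_maps
    ]
    # Excitation: fully connected -> ReLU -> fully connected -> sigmoid.
    hidden = [max(0.0, sum(w * s for w, s in zip(row, squeezed))) for row in w_reduce]
    gates = [
        1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
        for row in w_expand
    ]
    # Scale: multiply every value in a channel by that channel's gate.
    scaled = [[[g * v for v in row] for row in ch] for ch, g in zip(feature_maps, gates)]
    return scaled, gates


# Two 2x2 channels; zero expansion weights give a neutral gate of 0.5 per channel.
maps = [[[1.0, 1.0], [1.0, 1.0]], [[3.0, 3.0], [3.0, 3.0]]]
scaled, gates = se_block(maps, w_reduce=[[1.0, 1.0]], w_expand=[[0.0], [0.0]])
```

With learned weights the gates differ per channel, which is how the block emphasizes informative channels and suppresses less useful ones.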
Details
Motivation: Deepfake technology threatens information authenticity and security, and existing detection models are large and resource-hungry. Method: Uses a lightweight CNN with an SE attention module that dynamically reweights features to improve detection efficiency. Result: Reaches 94.14% classification accuracy and an AUC-ROC score of 0.985 on the Style GAN dataset. Conclusion: The method offers a feasible route to efficient, low-cost Deepfake detection. Abstract: In the digital age, Deepfakes present a formidable challenge by using advanced artificial intelligence to create highly convincing manipulated content, undermining information authenticity and security. These sophisticated fabrications surpass traditional detection methods in complexity and realism. To address this issue, we aim to harness cutting-edge deep learning methodologies to engineer an innovative deepfake detection model. However, most of the models designed for deepfake detection are large, causing heavy storage and memory consumption. In this research, we propose a lightweight convolutional neural network (CNN) with squeeze and excitation block attention (SE) for Deepfake detection. The SE block module is designed to perform dynamic channel-wise feature recalibration. The SE block allows the network to emphasize informative features and suppress less useful ones, which leads to a more efficient and effective learning module. This module is integrated with a simple sequential model to perform Deepfake detection. The model is smaller in size and it achieves competitive accuracy with the existing models for deepfake detection tasks. The model achieved an overall classification accuracy of 94.14% and AUC-ROC score of 0.985 on the Style GAN dataset from the Diverse Fake Face Dataset. Our proposed approach presents a promising avenue for combating the Deepfake challenge with minimal computational resources, developing efficient and scalable solutions for digital content verification.
[72] Unsourced Adversarial CAPTCHA: A Bi-Phase Adversarial CAPTCHA Framework
Xia Du,Xiaoyuan Liu,Jizhe Zhou,Zheng Lin,Chi-man Pun,Zhe Chen,Wei Ni,Jun Luo
Main category: cs.CV
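TL;DR: Proposes a new framework called UAC that generates high-fidelity adversarial examples from attacker-specified text prompts, addressing the limitations of conventional adversarial attack methods.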
Details
Motivation: Traditional CAPTCHA schemes are increasingly vulnerable as deep learning advances, and existing adversarial attack methods rely on original image characteristics, which causes distortion and limits applicability. Method: UAC uses a large language model (LLM) to increase CAPTCHA diversity and supports both targeted and untargeted attacks. For targeted attacks, it uses the EDICT method to optimize dual latent variables in a diffusion model; for untargeted attacks, it proposes the BP-UAC strategy, which uses multimodal gradients and bi-path optimization. Result: Experiments show that BP-UAC achieves high attack success rates across diverse systems, and the generated CAPTCHAs are hard for both humans and DNNs to distinguish. Conclusion: The UAC framework improves the applicability and stealth of adversarial attacks and offers a new direction for CAPTCHA security. Abstract: With the rapid advancements in deep learning, traditional CAPTCHA schemes are increasingly vulnerable to automated attacks powered by deep neural networks (DNNs). Existing adversarial attack methods often rely on original image characteristics, resulting in distortions that hinder human interpretation and limit applicability in scenarios lacking initial input images. To address these challenges, we propose the Unsourced Adversarial CAPTCHA (UAC), a novel framework generating high-fidelity adversarial examples guided by attacker-specified text prompts. Leveraging a Large Language Model (LLM), UAC enhances CAPTCHA diversity and supports both targeted and untargeted attacks. For targeted attacks, the EDICT method optimizes dual latent variables in a diffusion model for superior image quality. In untargeted attacks, especially for black-box scenarios, we introduce bi-path unsourced adversarial CAPTCHA (BP-UAC), a two-step optimization strategy employing multimodal gradients and bi-path optimization for efficient misclassification. Experiments show BP-UAC achieves high attack success rates across diverse systems, generating natural CAPTCHAs indistinguishable to humans and DNNs.
[73] Underage Detection through a Multi-Task and MultiAge Approach for Screening Minors in Unconstrained Imagery
Christopher Gaul,Eduardo Fidalgo,Enrique Alegre,Rocío Alaiz Rodríguez,Eri Pérez Corral
Main category: cs.CV
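TL;DR: The paper proposes a multi-task architecture that combines a frozen FaRL vision-language backbone with a compact MLP for screening minors, addressing distribution shift and class imbalance.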
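To make the class-imbalance handling in this entry concrete, here is a minimal sketch of an α-weighted focal loss for one binary under-age head, together with one possible reading of the "age gap" that drops samples close to the threshold. The standard focal-loss form is used; the paper's exact weighting, the default values of `alpha`, `gamma`, and `gap`, and the helper names are assumptions, not details confirmed by the abstract.

```python
import math


def alpha_focal_loss(p, y, alpha=0.75, gamma=2.0, eps=1e-12):
    """Standard alpha-weighted focal loss for one sample.

    p: predicted probability of the positive (under-age) class.
    y: ground-truth label, 1 for under-age and 0 otherwise.
    """
    p_t = p if y == 1 else 1.0 - p              # probability given to the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha  # class-dependent weight
    # (1 - p_t)^gamma shrinks the loss of examples that are already easy.
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(max(p_t, eps))


def under_age_head_loss(p_under, age, threshold, gap=1.0, **loss_kwargs):
    """Loss for one threshold head, skipping samples inside the age gap.

    Assumption: the age gap is interpreted as ignoring samples whose true
    age lies within `gap` years of the threshold.
    """
    if abs(age - threshold) < gap:
        return 0.0
    label = 1 if age < threshold else 0
    return alpha_focal_loss(p_under, label, **loss_kwargs)


easy = alpha_focal_loss(0.9, 1)   # confident and correct: small loss
hard = alpha_focal_loss(0.1, 1)   # confident and wrong: large loss
edge = under_age_head_loss(0.3, age=17.5, threshold=18)  # inside the gap, ignored
```

Setting `gamma=0` and `alpha=0.5` reduces the loss to half of the ordinary binary cross-entropy, which is a convenient sanity check.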
Details
Motivation: Addresses distribution shift and class imbalance in the screening of minors, improving model robustness in real-world scenarios. Method: Uses a multi-task architecture that combines a frozen FaRL backbone with an MLP, and introduces an α-weighted focal loss and age-balanced mini-batch sampling. Result: The model lowers RMSE and raises the F2 score on the ASORES-39k test set, and generalizes strongly on the ASWIFT-20k test set. Conclusion: The proposed method improves the accuracy and robustness of screening minors and is suitable for real-world scenarios. Abstract: Accurate automatic screening of minors in unconstrained images demands models that are robust to distribution shift and resilient to the under-representation of children in publicly available data. To overcome these issues, we propose a multi-task architecture with dedicated under/over-age discrimination tasks based on a frozen FaRL vision-language backbone joined with a compact two-layer MLP that shares features across one age-regression head and four binary under-age heads for age thresholds of 12, 15, 18, and 21 years, focusing on the legally critical age range. To address the severe class imbalance, we introduce an $\alpha$-reweighted focal-style loss and age-balanced mini-batch sampling, which equalizes twelve age bins during stochastic optimization. Further improvement is achieved with an age gap that removes edge cases from the loss. Moreover, we set a rigorous evaluation by proposing the Overall Under-Age Benchmark, with 303k cleaned training images and 110k test images, defining both the "ASORES-39k" restricted overall test, which removes the noisiest domains, and the age estimation wild shifts test "ASWIFT-20k" of 20k images, stressing extreme pose ($>45^\circ$), expression, and low image quality to emulate real-world shifts. Trained on the cleaned overall set with resampling and age gap, our multiage model "F" lowers the root-mean-square-error on the ASORES-39k restricted test from 5.733 (age-only baseline) to 5.656 years and lifts under-18 detection from F2 score of 0.801 to 0.857 at 1% false-adult rate. Under the domain shift to the wild data of ASWIFT-20k, the same configuration nearly sustains 0.99 recall while boosting F2 from 0.742 to 0.833 with respect to the age-only baseline, demonstrating strong generalization under distribution shift. For the under-12 and under-15 tasks, the boosts in F2 are from 0.666 to 0.955 and from 0.689 to 0.916, respectively.
[74] Continual Hyperbolic Learning of Instances and Classes
Melika Ayoughi,Mina Ghadimi Atigh,Mohammad Mahdi Derakhshani,Cees G. M. Snoek,Pascal Mettes,Paul Groth
Main category: cs.CV
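TL;DR: The paper proposes HyperCLIC, a continual learning algorithm that handles hierarchical learning of instances and classes at the same time, models the hierarchy in hyperbolic space, and is validated in dynamic real-world environments.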
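Why hyperbolic space suits hierarchies can be seen from its distance function. The sketch below computes the geodesic distance in the Poincaré ball, one common model of hyperbolic space; whether HyperCLIC uses this particular model and curvature is an assumption here, and the function name is illustrative.

```python
import math


def poincare_distance(u, v):
    """Geodesic distance between two points inside the unit Poincare ball.

    d(u, v) = arcosh(1 + 2 * |u - v|^2 / ((1 - |u|^2) * (1 - |v|^2)))
    Both points must have Euclidean norm strictly below 1.
    """
    sq_norm = lambda x: sum(a * a for a in x)
    diff = sq_norm([a - b for a, b in zip(u, v)])
    denom = (1.0 - sq_norm(u)) * (1.0 - sq_norm(v))
    return math.acosh(1.0 + 2.0 * diff / denom)


# The same Euclidean-style separation costs far more near the boundary than
# near the origin, leaving room for many well-separated leaves (instances)
# under a few central parents (classes).
near_origin = poincare_distance([0.1, 0.0], [-0.1, 0.0])
near_boundary = poincare_distance([0.9, 0.0], [-0.9, 0.0])
from_origin = poincare_distance([0.0, 0.0], [0.5, 0.0])
```

Distances grow rapidly toward the boundary of the ball, which is what allows tree-like structures to be embedded with low distortion in few dimensions.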
Details
Motivation: Real-world applications such as robotics and self-driving cars need models that learn instances and classes continually at the same time, but existing research addresses only one level. Method: Proposes the HyperCLIC algorithm, which models hierarchical relations in hyperbolic space and combines hyperbolic classification and distillation objectives. Result: Validated on the EgoObjects dataset, HyperCLIC performs well on multi-granularity tasks and improves hierarchical generalization. Conclusion: By modeling hierarchical relations in hyperbolic space, HyperCLIC offers a new approach to continual learning in complex real-world scenarios. Abstract: Continual learning has traditionally focused on classifying either instances or classes, but real-world applications, such as robotics and self-driving cars, require models to handle both simultaneously. To mirror real-life scenarios, we introduce the task of continual learning of instances and classes, at the same time. This task challenges models to adapt to multiple levels of granularity over time, which requires balancing fine-grained instance recognition with coarse-grained class generalization. In this paper, we identify that classes and instances naturally form a hierarchical structure. To model these hierarchical relationships, we propose HyperCLIC, a continual learning algorithm that leverages hyperbolic space, which is uniquely suited for hierarchical data due to its ability to represent tree-like structures with low distortion and compact embeddings. Our framework incorporates hyperbolic classification and distillation objectives, enabling the continual embedding of hierarchical relations. To evaluate performance across multiple granularities, we introduce continual hierarchical metrics. We validate our approach on EgoObjects, the only dataset that captures the complexity of hierarchical object recognition in dynamic real-world environments. Empirical results show that HyperCLIC operates effectively at multiple granularities with improved hierarchical generalization.
[75] Uncertainty-Masked Bernoulli Diffusion for Camouflaged Object Detection Refinement
Yuqi Shen,Fengyang Xiao,Sujie Hu,Youwei Pang,Yifan Pu,Chengyu Fang,Xiu Li,Chunming He
Main category: cs.CV
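TL;DR: The paper proposes UMBD, a generative refinement framework that improves camouflaged object detection (COD) through targeted refinement driven by an uncertainty-guided masking mechanism and Bernoulli diffusion.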
Details
Motivation: Existing COD methods are limited when targets differ only subtly from their backgrounds, and post-processing refinement in particular remains underexplored. Method: Proposes the UMBD framework, which combines an uncertainty-guided masking mechanism with Bernoulli diffusion, and designs HUQNet, a multi-branch network that fuses uncertainty from multiple sources. Result: Across multiple COD benchmarks, UMBD improves performance consistently, with average gains of 5.5% in MAE and 3.2% in weighted F-measure. Conclusion: The UMBD framework integrates seamlessly with existing COD models, combines discriminative and generative strengths, and provides an effective post-processing solution for COD. Abstract: Camouflaged Object Detection (COD) presents inherent challenges due to the subtle visual differences between targets and their backgrounds. While existing methods have made notable progress, there remains significant potential for post-processing refinement that has yet to be fully explored. To address this limitation, we propose the Uncertainty-Masked Bernoulli Diffusion (UMBD) model, the first generative refinement framework specifically designed for COD. UMBD introduces an uncertainty-guided masking mechanism that selectively applies Bernoulli diffusion to residual regions with poor segmentation quality, enabling targeted refinement while preserving correctly segmented areas. To support this process, we design the Hybrid Uncertainty Quantification Network (HUQNet), which employs a multi-branch architecture and fuses uncertainty from multiple sources to improve estimation accuracy. This enables adaptive guidance during the generative sampling process. The proposed UMBD framework can be seamlessly integrated with a wide range of existing Encoder-Decoder-based COD models, combining their discriminative capabilities with the generative advantages of diffusion-based refinement. Extensive experiments across multiple COD benchmarks demonstrate consistent performance improvements, achieving average gains of 5.5% in MAE and 3.2% in weighted F-measure with only modest computational overhead. Code will be released.
[76] Deep Learning-based Multi Project InP Wafer Simulation for Unsupervised Surface Defect Detection
Emílio Dolgener Cantú,Rolf Klemens Wittmann,Oliver Abdeen,Patrick Wagner,Wojciech Samek,Moritz Baier,Sebastian Lapuschkin
Main category: cs.CV
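TL;DR: Proposes a deep-neural-network-based synthetic golden standard for defect detection in InP wafer manufacturing, replacing traditional manual, labor-intensive inspection.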
Details
Motivation: In InP wafer manufacturing, low production scale and high design variability mean that golden-standard templates are unavailable, so defect detection is manual and inefficient. Method: Uses deep neural networks to generate photo-realistic InP wafer images from CAD data as a synthetic golden standard, and evaluates different training objectives. Result: The deep learning method outperforms a decision-tree-based baseline and can generate a simulated golden standard from CAD plans, making defect detection more efficient. Conclusion: The method has practical value for surface defect detection and can improve detection efficiency through template matching. Abstract: Quality management in semiconductor manufacturing often relies on template matching with known golden standards. For Indium-Phosphide (InP) multi-project wafer manufacturing, low production scale and high design variability lead to such golden standards being typically unavailable. Defect detection, in turn, is manual and labor-intensive. This work addresses this challenge by proposing a methodology to generate a synthetic golden standard using Deep Neural Networks, trained to simulate photo-realistic InP wafer images from CAD data. We evaluate various training objectives and assess the quality of the simulated images on both synthetic data and InP wafer photographs. Our deep-learning-based method outperforms a baseline decision-tree-based approach, enabling the use of a 'simulated golden die' from CAD plans in any user-defined region of a wafer for more efficient defect detection. We apply our method to a template matching procedure, to demonstrate its practical utility in surface defect detection.
[77] IQE-CLIP: Instance-aware Query Embedding for Zero-/Few-shot Anomaly Detection in Medical Domain
Hong Huang,Weixiang Sun,Zhijian Wu,Jingwen Niu,Donghuan Lu,Xian Wu,Yefeng Zheng
Main category: cs.CV
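TL;DR: IQE-CLIP is a new framework for zero-shot and few-shot anomaly detection in the medical domain that combines textual and visual information to generate anomaly-sensitive embeddings, significantly improving performance.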
Details
Motivation: Existing CLIP-based methods rely on scenario-specific text prompts, struggle to separate normal from anomalous instances, and have seen limited study in the medical domain. Method: Proposes the IQE-CLIP framework, which combines class-based and learnable prompting tokens and designs an instance-aware query module that extracts region-level contextual information. Result: On six medical datasets, IQE-CLIP achieves the best performance in both zero-shot and few-shot settings. Conclusion: By fusing multimodal information, IQE-CLIP effectively addresses the challenges of medical anomaly detection. Abstract: Recent advances in vision-language models, such as CLIP, have significantly improved performance in zero- and few-shot anomaly detection (ZFSAD) tasks. However, most existing CLIP-based methods assume prior knowledge of categories and rely on carefully designed prompts tailored to specific scenarios. While these text prompts capture semantic information in the textual space, they often fail to distinguish normal and anomalous instances in the joint embedding space. Moreover, most ZFSAD approaches focus on industrial domains, with limited exploration in medical tasks. To address these limitations, we propose IQE-CLIP, a novel framework for ZFSAD in the medical domain. We show that query embeddings integrating both textual and instance-aware visual information serve as more effective indicators of anomalies. Specifically, we introduce class-based and learnable prompting tokens to better adapt CLIP to the medical setting. Furthermore, we design an instance-aware query module that extracts region-level contextual information from both modalities, enabling the generation of anomaly-sensitive embeddings. Extensive experiments on six medical datasets demonstrate that IQE-CLIP achieves state-of-the-art performance in both zero-shot and few-shot settings. Code and data are available at https://github.com/hongh0/IQE-CLIP/.
[78] PosterCraft: Rethinking High-Quality Aesthetic Poster Generation in a Unified Framework
SiXiang Chen,Jianyu Lai,Jialin Gao,Tian Ye,Haoyu Chen,Hengyu Shi,Shitong Shao,Yunlong Lin,Song Fei,Zhaohu Xing,Yeying Jin,Junfeng Luo,Xiaoming Wei,Lei Zhu
Main category: cs.CV
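TL;DR: PosterCraft is a unified framework for generating high-quality posters that uses a multi-stage optimization pipeline to improve text rendering and the integration of artistic content.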
Details
Motivation: Generating aesthetic posters requires precise text rendering and seamless integration of artistic content, and existing methods are constrained by modular pipelines and fixed layouts. Method: PosterCraft uses a cascaded workflow comprising text-rendering optimization, region-aware fine-tuning, aesthetic-text reinforcement learning, and joint vision-language feedback refinement. Result: PosterCraft significantly outperforms open-source baselines in rendering accuracy, layout coherence, and visual appeal, approaching the level of commercial systems. Conclusion: Through automated data construction and multi-stage optimization, PosterCraft achieves high-quality poster generation; the code and datasets are open-sourced. Abstract: Generating aesthetic posters is more challenging than simple design images: it requires not only precise text rendering but also the seamless integration of abstract artistic content, striking layouts, and overall stylistic harmony. To address this, we propose PosterCraft, a unified framework that abandons prior modular pipelines and rigid, predefined layouts, allowing the model to freely explore coherent, visually compelling compositions. PosterCraft employs a carefully designed, cascaded workflow to optimize the generation of high-aesthetic posters: (i) large-scale text-rendering optimization on our newly introduced Text-Render-2M dataset; (ii) region-aware supervised fine-tuning on HQ-Poster100K; (iii) aesthetic-text-reinforcement learning via best-of-n preference optimization; and (iv) joint vision-language feedback refinement. Each stage is supported by a fully automated data-construction pipeline tailored to its specific needs, enabling robust training without complex architectural modifications. Evaluated on multiple experiments, PosterCraft significantly outperforms open-source baselines in rendering accuracy, layout coherence, and overall visual appeal, approaching the quality of SOTA commercial systems. Our code, models, and datasets can be found in the Project page: https://ephemeral182.github.io/PosterCraft
[79] Stroke-based Cyclic Amplifier: Image Super-Resolution at Arbitrary Ultra-Large Scales
Wenhao Guo,Peng Lu,Xujun Peng,Zhaoran Zhao,Sheng Li
Main category: cs.CV
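TL;DR: Proposes SbCA, a unified model built on a stroke vector amplifier for ultra-large-scale image super-resolution, addressing the performance drop of conventional methods beyond their training range.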
Details
Motivation: Conventional arbitrary-scale image super-resolution methods degrade markedly beyond their training range, producing blur. Method: A stroke vector amplifier decomposes the image into vector graphics for magnification, a detail completion module restores missing details, and a cyclic strategy refines the result iteratively. Result: Experiments show that SbCA significantly outperforms existing methods on ultra-large upsampling tasks (such as ×100) and produces high-quality super-resolved images. Conclusion: SbCA effectively addresses the distribution drift issue, eliminates artifacts and blur, and provides an efficient solution for ultra-large-scale image super-resolution. Abstract: Prior Arbitrary-Scale Image Super-Resolution (ASISR) methods often experience a significant performance decline when the upsampling factor exceeds the range covered by the training data, introducing substantial blurring. To address this issue, we propose a unified model, Stroke-based Cyclic Amplifier (SbCA), for ultra-large upsampling tasks. The key component of SbCA is the stroke vector amplifier, which decomposes the image into a series of strokes represented as vector graphics for magnification. Then, the detail completion module restores missing details, ensuring high-fidelity image reconstruction. Our cyclic strategy achieves ultra-large upsampling by iteratively refining details with this unified SbCA model, trained only once for all, while keeping sub-scales within the training range. Our approach effectively addresses the distribution drift issue and eliminates artifacts, noise and blurring, producing high-quality, high-resolution super-resolved images. Experimental validations on both synthetic and real-world datasets demonstrate that our approach significantly outperforms existing methods in ultra-large upsampling tasks (e.g. $\times100$), delivering visual quality far superior to state-of-the-art techniques.
[80] SlotPi: Physics-informed Object-centric Reasoning Models
Jian Li,Wan Han,Ning Lin,Yu-Liang Zhan,Ruizhi Chengze,Haining Wang,Yi Zhang,Hongsheng Liu,Zidong Wang,Fan Yu,Hao Sun
Main category: cs.CV
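TL;DR: SlotPi is a physics-informed object-centric reasoning model that addresses the neglect of physical knowledge integration and of adaptability validation in existing methods.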
Details
Motivation: Humans acquire physical knowledge by observing the world and apply it to reason about dynamic scenes, whereas existing methods lack both physical knowledge integration and validation across diverse scenarios. Method: SlotPi combines a physical module based on Hamiltonian principles with a spatio-temporal prediction module for dynamic forecasting. Result: Experiments show that SlotPi performs well on prediction and visual question answering tasks, and its adaptability is validated on a real-world dataset. Conclusion: SlotPi's strong adaptability lays a foundation for developing more advanced world models. Abstract: Understanding and reasoning about dynamics governed by physical laws through visual observation, akin to human capabilities in the real world, poses significant challenges. Currently, object-centric dynamic simulation methods, which emulate human behavior, have achieved notable progress but overlook two critical aspects: 1) the integration of physical knowledge into models. Humans gain physical insights by observing the world and apply this knowledge to accurately reason about various dynamic scenarios; 2) the validation of model adaptability across diverse scenarios. Real-world dynamics, especially those involving fluids and objects, demand models that not only capture object interactions but also simulate fluid flow characteristics. To address these gaps, we introduce SlotPi, a slot-based physics-informed object-centric reasoning model. SlotPi integrates a physical module based on Hamiltonian principles with a spatio-temporal prediction module for dynamic forecasting. Our experiments highlight the model's strengths in tasks such as prediction and Visual Question Answering (VQA) on benchmark and fluid datasets. Furthermore, we have created a real-world dataset encompassing object interactions, fluid dynamics, and fluid-object interactions, on which we validated our model's capabilities. The model's robust performance across all datasets underscores its strong adaptability, laying a foundation for developing more advanced world models.
[81] Human-Robot Navigation using Event-based Cameras and Reinforcement Learning
Ignacio Bugueno-Cordova,Javier Ruiz-del-Solar,Rodrigo Verschae
Main category: cs.CV
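TL;DR: Proposes a robot navigation controller that combines event cameras with reinforcement learning to achieve real-time human-centered navigation and obstacle avoidance.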
Details
Motivation: Conventional image-based controllers suffer from fixed frame rates, motion blur, and latency, which the asynchronous nature of event cameras can address. Method: Integrates event-camera perception, additional range sensors, and Deep Deterministic Policy Gradient optimization, with imitation learning to improve efficiency. Result: Demonstrates robust navigation, pedestrian following, and obstacle avoidance in simulated environments. Conclusion: The method performs well in real-time navigation and has potential for practical use. Abstract: This work introduces a robot navigation controller that combines event cameras and other sensors with reinforcement learning to enable real-time human-centered navigation and obstacle avoidance. Unlike conventional image-based controllers, which operate at fixed rates and suffer from motion blur and latency, this approach leverages the asynchronous nature of event cameras to process visual information over flexible time intervals, enabling adaptive inference and control. The framework integrates event-based perception, additional range sensing, and policy optimization via Deep Deterministic Policy Gradient, with an initial imitation learning phase to improve sample efficiency. Promising results are achieved in simulated environments, demonstrating robust navigation, pedestrian following, and obstacle avoidance. A demo video is available at the project website.
[82] Prompts to Summaries: Zero-Shot Language-Guided Video Summarization
Mario Barbara,Alaa Maalouf
Main category: cs.CV
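TL;DR: This paper proposes Prompts-to-Summaries, a zero-shot, text-queryable video summarization method that produces user-controllable summaries without training data, outperforming unsupervised methods and matching supervised ones.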
Details
Motivation: The explosive growth of video data calls for flexible, user-controllable summarization tools, yet existing methods either depend on datasets, which limits generalization, or cannot incorporate user intent expressed in natural language. Method: The pipeline segments the video, generates scene descriptions, scores scenes with an LLM, and propagates the scores through consistency and uniqueness metrics to obtain fine-grained frame importance. Result: On the SumMe and TVSum datasets, the method surpasses all unsupervised methods, and it performs well on the QFVS benchmark. Conclusion: With carefully designed prompting and score propagation, pretrained multimodal models already provide a strong foundation for universal, text-queryable video summarization. Abstract: The explosive growth of video data has intensified the need for flexible user-controllable summarization tools that can operate without domain-specific training data. Existing methods either rely on datasets, limiting generalization, or cannot incorporate user intent expressed in natural language. We introduce Prompts-to-Summaries: the first zero-shot, text-queryable video summarizer that converts off-the-shelf video-language model (VidLM) captions into user-guided skims via large language model (LLM) judging, without the use of training data at all, beating all unsupervised and matching supervised methods. Our pipeline (i) segments raw video footage into coherent scenes, (ii) generates rich scene-level descriptions through a memory-efficient, batch-style VidLM prompting scheme that scales to hours-long videos on a single GPU, (iii) leverages an LLM as a judge to assign scene-level importance scores under a carefully crafted prompt, and finally, (iv) propagates those scores to the short-segment level via two new metrics: consistency (temporal coherency) and uniqueness (novelty), yielding fine-grained frame importance. On SumMe and TVSum, our data-free approach surpasses all prior data-hungry unsupervised methods. It also performs competitively on the Query-Focused Video Summarization (QFVS) benchmark, despite using no training data and the competing methods requiring supervised frame-level importance. To spur further research, we release VidSum-Reason, a new query-driven dataset featuring long-tailed concepts and multi-step reasoning; our framework attains robust F1 scores and serves as the first challenging baseline. Overall, our results demonstrate that pretrained multimodal models, when orchestrated with principled prompting and score propagation, already provide a powerful foundation for universal, text-queryable video summarization.
[83] Unsupervised Deformable Image Registration with Structural Nonparametric Smoothing
Hang Zhang,Xiang Chen,Renjiu Hu,Rongguang Wang,Jinwei Zhang,Min Liu,Yaonan Wang,Gaolei Li,Xinxing Cheng,Jinming Duan
Main category: cs.CV
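TL;DR: SmoothProper is a plug-and-play neural module that introduces a duality-based optimization layer and tailored interaction terms to handle sparse features and large displacements, improving the accuracy of unsupervised image registration.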
Details
Motivation: Conventional unsupervised image registration methods perform poorly with sparse features and large displacements; SmoothProper aims to solve this by enforcing smoothness and message passing within the network's forward pass. Method: SmoothProper combines a duality-based optimization layer with tailored interaction terms to propagate flow signals, enforce smoothness, and preserve structural consistency during the forward pass. Result: On a retinal vessel dataset, SmoothProper reduces registration error to 1.88 pixels (2912x2912 images), the first unsupervised approach to effectively address both the sparse-feature and large-displacement challenges. Conclusion: SmoothProper is a model-agnostic method that needs no regularizer tuning, integrates seamlessly into existing registration frameworks, and significantly improves unsupervised registration performance. Abstract: Learning-based deformable image registration (DIR) accelerates alignment by amortizing traditional optimization via neural networks. Label supervision further enhances accuracy, enabling efficient and precise nonlinear alignment of unseen scans. However, images with sparse features amid large smooth regions, such as retinal vessels, introduce aperture and large-displacement challenges that unsupervised DIR methods struggle to address. This limitation occurs because neural networks predict deformation fields in a single forward pass, leaving fields unconstrained post-training and shifting the regularization burden entirely to network weights. To address these issues, we introduce SmoothProper, a plug-and-play neural module enforcing smoothness and promoting message passing within the network's forward pass. By integrating a duality-based optimization layer with tailored interaction terms, SmoothProper efficiently propagates flow signals across spatial locations, enforces smoothness, and preserves structural consistency. It is model-agnostic, seamlessly integrates into existing registration frameworks with minimal parameter overhead, and eliminates regularizer hyperparameter tuning. Preliminary results on a retinal vessel dataset exhibiting aperture and large-displacement challenges demonstrate our method reduces registration error to 1.88 pixels on 2912x2912 images, marking the first unsupervised DIR approach to effectively address both challenges. The source code will be available at https://github.com/tinymilky/SmoothProper.
[84] Occlusion-Aware 3D Hand-Object Pose Estimation with Masked AutoEncoders
Hui Yang,Wei Sun,Jian Liu,Jin Zheng,Jian Xiao,Ajmal Mian
Main category: cs.CV
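TL;DR: HOMAE is an occlusion-aware hand-object pose estimation method based on masked autoencoders that uses a target-focused masking strategy and multi-scale feature fusion to significantly improve performance under occlusion.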
Details
Motivation: Existing methods fall short on occluded hand-object interactions because they lack global structural perception and reasoning. Method: Proposes a target-focused masking strategy, predicts an SDF from multi-scale features, and fuses it with a point cloud to strengthen geometric perception. Result: Achieves state-of-the-art performance on the DexYCB and HO3Dv2 benchmarks. Conclusion: By combining global context with local geometry, HOMAE handles occlusion effectively and delivers strong performance. Abstract: Hand-object pose estimation from monocular RGB images remains a significant challenge mainly due to the severe occlusions inherent in hand-object interactions. Existing methods do not sufficiently explore global structural perception and reasoning, which limits their effectiveness in handling occluded hand-object interactions. To address this challenge, we propose an occlusion-aware hand-object pose estimation method based on masked autoencoders, termed HOMAE. Specifically, we propose a target-focused masking strategy that imposes structured occlusion on regions of hand-object interaction, encouraging the model to learn context-aware features and reason about the occluded structures. We further integrate multi-scale features extracted from the decoder to predict a signed distance field (SDF), capturing both global context and fine-grained geometry. To enhance geometric perception, we combine the implicit SDF with an explicit point cloud derived from the SDF, leveraging the complementary strengths of both representations. This fusion enables more robust handling of occluded regions by combining the global context from the SDF with the precise local geometry provided by the point cloud. Extensive experiments on challenging DexYCB and HO3Dv2 benchmarks demonstrate that HOMAE achieves state-of-the-art performance in hand-object pose estimation. We will release our code and model.
[85] Post-Training Quantization for Video Matting
Tianrui Zhu,Houyuan Chen,Ruihao Gong,Michele Magno,Haotong Qin,Kai Zhang
Main category: cs.CV
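TL;DR: This paper proposes a PTQ framework for video matting models that uses a two-stage quantization strategy, global affine calibration, and an optical-flow assistance component to significantly improve quantized model performance, approaching full-precision performance even at 4-bit quantization.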
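For readers unfamiliar with what 4-bit quantization does to a tensor, the sketch below shows plain min/max uniform affine quantization followed by dequantization. This is the generic baseline operation only; it does not implement the block-reconstruction optimization, Global Affine Calibration, or Optical Flow Assistance described in this entry, and the function names are illustrative.

```python
def quantization_params(values, num_bits=4):
    """Derive the scale and zero point of an unsigned uniform affine quantizer."""
    qmax = 2 ** num_bits - 1
    lo = min(min(values), 0.0)  # keep 0.0 inside the representable range
    hi = max(max(values), 0.0)
    scale = (hi - lo) / qmax or 1.0  # fall back to 1.0 for an all-zero input
    zero_point = round(-lo / scale)
    return scale, zero_point


def fake_quantize(values, num_bits=4):
    """Quantize to integers in [0, 2^bits - 1], then map back to floats."""
    scale, zero_point = quantization_params(values, num_bits)
    qmax = 2 ** num_bits - 1
    codes = [min(max(round(v / scale) + zero_point, 0), qmax) for v in values]
    restored = [(q - zero_point) * scale for q in codes]
    return restored, codes, scale


weights = [-1.0, 0.0, 0.6, 2.0]
restored, codes, scale = fake_quantize(weights, num_bits=4)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
```

With only 16 levels, each value can be off by up to half a step (`scale / 2`), and such rounding errors are the source of the accuracy loss that post-training calibration methods try to compensate for.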
Details
Motivation: Deploying computationally intensive video matting models on resource-constrained devices is challenging, and existing PTQ methods fall short in preserving accuracy and temporal coherence. Method: Proposes a two-stage PTQ strategy (block-reconstruction optimization followed by global calibration), a statistically driven Global Affine Calibration (GAC) method, and an Optical Flow Assistance (OFA) component. Result: PTQ4VM achieves state-of-the-art accuracy across bit-widths; at 4-bit quantization it performs close to the full-precision model while saving 8x in computation. Conclusion: The framework provides a systematic solution for quantizing video matting models and significantly improves post-quantization performance and efficiency. Abstract: Video matting is crucial for applications such as film production and virtual reality, yet deploying its computationally intensive models on resource-constrained devices presents challenges. Quantization is a key technique for model compression and acceleration. As an efficient approach, Post-Training Quantization (PTQ) is still in its nascent stages for video matting, facing significant hurdles in maintaining accuracy and temporal coherence. To address these challenges, this paper proposes a novel and general PTQ framework specifically designed for video matting models, marking, to the best of our knowledge, the first systematic attempt in this domain. Our contributions include: (1) A two-stage PTQ strategy that combines block-reconstruction-based optimization for fast, stable initial quantization and local dependency capture, followed by a global calibration of quantization parameters to minimize accuracy loss. (2) A Statistically-Driven Global Affine Calibration (GAC) method that enables the network to compensate for cumulative statistical distortions arising from factors such as neglected BN layer effects, even reducing the error of existing PTQ methods on video matting tasks by up to 20%. (3) An Optical Flow Assistance (OFA) component that leverages temporal and semantic priors from frames to guide the PTQ process, enhancing the model's ability to distinguish moving foregrounds in complex scenes and ultimately achieving near full-precision performance even under ultra-low-bit quantization. Comprehensive quantitative and visual results show that our PTQ4VM achieves state-of-the-art accuracy across different bit-widths compared to the existing quantization methods. We highlight that the 4-bit PTQ4VM even achieves performance close to the full-precision counterpart while enjoying 8x FLOP savings.
[86] VRBench: A Benchmark for Multi-Step Reasoning in Long Narrative Videos
Jiashuo Yu,Yue Wu,Meng Chu,Zhifei Ren,Zizheng Huang,Pei Chu,Ruijie Zhang,Yinan He,Qirui Li,Songze Li,Zhenxiang Li,Zhongying Tu,Conghui He,Yu Qiao,Yali Wang,Yi Wang,Limin Wang
Main category: cs.CV
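TL;DR: VRBench is a long narrative video benchmark for evaluating the multi-step reasoning of large models, comprising 1,010 long videos and extensive annotations, with quality ensured through multi-stage filtering and expert review.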
Details
Motivation: Existing evaluations overlook temporal reasoning and procedural validity, a gap that VRBench fills. Method: Curates videos through multi-stage filtering and expert review, develops a human-AI collaborative framework to generate multi-step reasoning chains, and designs a multi-phase evaluation pipeline. Result: Evaluates 12 LLMs and 16 VLMs and provides in-depth analysis and insights into multi-step reasoning. Conclusion: VRBench offers a comprehensive, high-quality tool for evaluating multi-step reasoning and advances the field. Abstract: We present VRBench, the first long narrative video benchmark crafted for evaluating large models' multi-step reasoning capabilities, addressing limitations in existing evaluations that overlook temporal reasoning and procedural validity. It comprises 1,010 long videos (with an average duration of 1.6 hours), along with 9,468 human-labeled multi-step question-answering pairs and 30,292 reasoning steps with timestamps. These videos are curated via a multi-stage filtering process including expert inter-rater reviewing to prioritize plot coherence. We develop a human-AI collaborative framework that generates coherent reasoning chains, each requiring multiple temporally grounded steps, spanning seven types (e.g., event attribution, implicit inference). VRBench designs a multi-phase evaluation pipeline that assesses models at both the outcome and process levels. Apart from the MCQs for the final results, we propose a progress-level LLM-guided scoring metric to evaluate the quality of the reasoning chain from multiple dimensions comprehensively. Through extensive evaluations of 12 LLMs and 16 VLMs on VRBench, we undertake a thorough analysis and provide valuable insights that advance the field of multi-step reasoning.
[87] CreatiPoster: Towards Editable and Controllable Multi-Layer Graphic Design Generation
Zhao Zhang,Yutao Cheng,Dexiang Hong,Maoke Yang,Gonglei Shi,Lei Ma,Hui Zhang,Jie Shao,Xinglong Wu
Main category: cs.CV
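TL;DR: CreatiPoster is a framework that generates editable, multi-layer graphic designs from natural-language instructions or assets. It combines a protocol model with a conditional background model and surpasses existing open-source and commercial systems.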
Details
Motivation: Current AI tools for graphic design struggle to incorporate user assets accurately, maintain editability, and achieve professional visual appeal, while commercial systems rely on template libraries that are hard to replicate.
Method: A protocol model generates a JSON specification describing the layout, content, and style of every layer; a conditional background model then synthesizes the background.
Result: CreatiPoster surpasses leading open-source and commercial systems in graphic design generation, and a copyright-free corpus of 100,000 multi-layer designs is released.
Conclusion: CreatiPoster advances the democratization of AI-assisted graphic design and supports a variety of applications.
Abstract: Graphic design plays a crucial role in both commercial and personal contexts, yet creating high-quality, editable, and aesthetically pleasing graphic compositions remains a time-consuming and skill-intensive task, especially for beginners. Current AI tools automate parts of the workflow, but struggle to accurately incorporate user-supplied assets, maintain editability, and achieve professional visual appeal. Commercial systems, like Canva Magic Design, rely on vast template libraries, which are impractical to replicate. In this paper, we introduce CreatiPoster, a framework that generates editable, multi-layer compositions from optional natural-language instructions or assets. A protocol model, an RGBA large multimodal model, first produces a JSON specification detailing every layer (text or asset) with precise layout, hierarchy, content and style, plus a concise background prompt. A conditional background model then synthesizes a coherent background conditioned on these rendered foreground layers. We construct a benchmark with automated metrics for graphic-design generation and show that CreatiPoster surpasses leading open-source approaches and proprietary commercial systems. To catalyze further research, we release a copyright-free corpus of 100,000 multi-layer designs. CreatiPoster supports diverse applications such as canvas editing, text overlay, responsive resizing, multilingual adaptation, and animated posters, advancing the democratization of AI-assisted graphic design. Project homepage: https://github.com/graphic-design-ai/creatiposter
[88] AIR: Zero-shot Generative Model Adaptation with Iterative Refinement
Guimeng Liu,Milad Abdollahzadeh,Ngai-Man Cheung
Main category: cs.CV
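TL;DR: This paper studies zero-shot generative model adaptation (ZSGM). It analyzes the misalignment between text offsets and image offsets in the CLIP embedding space and proposes an iterative refinement method (AIR) to improve the quality of generated images.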
Details
Motivation: Existing ZSGM methods assume that text offsets and image offsets are perfectly aligned in the CLIP embedding space, which degrades generated image quality. This paper analyzes the misalignment and proposes an improved method.
Method: An empirical study analyzes offset misalignment in the CLIP embedding space, and Adaptation with Iterative Refinement (AIR) is proposed to iteratively improve generated image quality.
Result: Across 26 experimental setups, AIR achieves state-of-the-art performance in qualitative, quantitative, and user studies.
Conclusion: Offset misalignment is correlated with concept distance, and AIR significantly improves zero-shot generative model adaptation through iterative refinement.
Abstract: Zero-shot generative model adaptation (ZSGM) aims to adapt a pre-trained generator to a target domain using only text guidance and without any samples from the target domain. Central to recent ZSGM approaches is a directional loss that uses the text guidance by aligning the image offset with the text offset in the embedding space of a vision-language model like CLIP. This is similar to analogical reasoning in NLP, where the offset between one pair of words is used to identify a missing element in another pair by aligning the offsets between these two pairs. However, a major limitation of existing ZSGM methods is that the learning objective assumes complete alignment between image offset and text offset in the CLIP embedding space, resulting in quality degradation in generated images. Our work makes two main contributions. Inspired by the offset misalignment studies in NLP, as our first contribution, we perform an empirical study to analyze the misalignment between text offset and image offset in CLIP embedding space for various large publicly available datasets. Our important finding is that offset misalignment in CLIP embedding space is correlated with concept distance, i.e., close concepts have less offset misalignment. To address the limitations of the current approaches, as our second contribution, we propose Adaptation with Iterative Refinement (AIR), which is the first ZSGM approach to focus on improving target domain image quality based on our new insight on offset misalignment. Qualitative, quantitative, and user studies in 26 experiment setups consistently demonstrate that the proposed AIR approach achieves SOTA performance.
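The directional loss mentioned in this abstract can be illustrated with a small sketch. It assumes the loss is one minus the cosine similarity between the image offset and the text offset, which is a common formulation but is not spelled out in the abstract. The function names `offset` and `directional_loss` are hypothetical, and plain Python lists stand in for CLIP embeddings.

```python
import math


def offset(src, tgt):
    """Element-wise difference tgt - src between two embedding vectors."""
    return [t - s for s, t in zip(src, tgt)]


def directional_loss(img_src, img_tgt, txt_src, txt_tgt, eps=1e-8):
    """Assumed form: 1 - cos(image offset, text offset).

    The loss is near 0 when the two offsets point in the same direction
    and grows as they diverge. The inputs are stand-ins for embeddings of
    source/target images and source/target text prompts.
    """
    d_img = offset(img_src, img_tgt)
    d_txt = offset(txt_src, txt_tgt)
    dot = sum(a * b for a, b in zip(d_img, d_txt))
    norm = math.sqrt(sum(a * a for a in d_img)) * math.sqrt(sum(b * b for b in d_txt))
    return 1.0 - dot / (norm + eps)


if __name__ == "__main__":
    # Toy 2-D vectors: the image offset is only partially aligned with the text offset.
    print(directional_loss([0.0, 0.0], [1.0, 1.0], [0.0, 0.0], [1.0, 0.0]))
```

A loss of this form treats the text offset as a fixed target direction, which is exactly the full-alignment assumption the abstract identifies as a limitation.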
Additional experiments are in Supp.
[89] M4V: Multi-Modal Mamba for Text-to-Video Generation
Jiancheng Huang,Gengwei Zhang,Zequn Jie,Siyu Jiao,Yinlong Qian,Ling Chen,Yunchao Wei,Lin Ma
Main category: cs.CV
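TL;DR: M4V is a Mamba-based multi-modal text-to-video generation framework. Its multi-modal diffusion Mamba block and reward learning strategy substantially reduce computational cost while improving video quality.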
Details
Motivation: Address the high computational complexity of Transformers in video generation while improving the efficiency of multi-modal and spatiotemporal modeling.
Method: Proposes a multi-modal diffusion Mamba (MM-DiM) block, combined with a multi-modal token re-composition design and a reward learning strategy.
Result: At 768$\times$1280 resolution, M4V reduces FLOPs by 45% compared with attention-based methods while generating high-quality videos.
Conclusion: M4V improves the quality and efficiency of text-to-video generation while lowering computational cost.
Abstract: Text-to-video generation has significantly enriched content creation and holds the potential to evolve into powerful world simulators. However, modeling the vast spatiotemporal space remains computationally demanding, particularly when employing Transformers, which incur quadratic complexity in sequence processing and thus limit practical applications. Recent advancements in linear-time sequence modeling, particularly the Mamba architecture, offer a more efficient alternative. Nevertheless, its plain design limits its direct applicability to multi-modal and spatiotemporal video generation tasks. To address these challenges, we introduce M4V, a Multi-Modal Mamba framework for text-to-video generation. Specifically, we propose a multi-modal diffusion Mamba (MM-DiM) block that enables seamless integration of multi-modal information and spatiotemporal modeling through a multi-modal token re-composition design. As a result, the Mamba blocks in M4V reduce FLOPs by 45% compared to the attention-based alternative when generating videos at 768$\times$1280 resolution. Additionally, to mitigate the visual quality degradation in long-context autoregressive generation processes, we introduce a reward learning strategy that further enhances per-frame visual realism. Extensive experiments on text-to-video benchmarks demonstrate M4V's ability to produce high-quality videos while significantly lowering computational costs. Code and models will be publicly available at https://huangjch526.github.io/M4V_project.
[90] SpectralAR: Spectral Autoregressive Visual Generation
Yuanhui Huang,Weiliang Chen,Wenzhao Zheng,Yueqi Duan,Jie Zhou,Jiwen Lu
Main category: cs.CV
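TL;DR: Proposes SpectralAR, an autoregressive visual generation framework built on a spectral perspective. Nested spectral tokenization makes the image sequence causal, and its efficiency is validated on ImageNet-1K.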
Details
Motivation: Existing autoregressive visual generation methods build visual sequences from spatial patches, which contradicts the causal nature of autoregressive modeling. The paper therefore achieves sequence causality from a spectral perspective.
Method: (1) Nested spectral tokenization converts an image into ordered spectral tokens; (2) the spectral token sequence is generated autoregressively in a coarse-to-fine manner.
Result: On ImageNet-1K, SpectralAR reaches 3.02 gFID with only 64 tokens and 310M parameters.
Conclusion: SpectralAR achieves both causality and efficiency for visual sequences through the spectral perspective, offering a new direction for autoregressive visual generation.
Abstract: Autoregressive visual generation has garnered increasing attention due to its scalability and compatibility with other modalities compared with diffusion models. Most existing methods construct visual sequences as spatial patches for autoregressive generation. However, image patches are inherently parallel, contradicting the causal nature of autoregressive modeling. To address this, we propose a Spectral AutoRegressive (SpectralAR) visual generation framework, which realizes causality for visual sequences from the spectral perspective. Specifically, we first transform an image into ordered spectral tokens with Nested Spectral Tokenization, representing lower to higher frequency components. We then perform autoregressive generation in a coarse-to-fine manner with the sequences of spectral tokens. By considering different levels of detail in images, our SpectralAR achieves both sequence causality and token efficiency without bells and whistles. We conduct extensive experiments on ImageNet-1K for image reconstruction and autoregressive generation, and SpectralAR achieves 3.02 gFID with only 64 tokens and 310M parameters. Project page: https://huang-yh.github.io/spectralar/.
[91] Beyond Attention or Similarity: Maximizing Conditional Diversity for Token Pruning in MLLMs
Qizhe Zhang,Mengzhen Liu,Lichen Li,Ming Lu,Yuan Zhang,Junwen Pan,Qi She,Shanghang Zhang
Main category: cs.CV
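TL;DR: CDPruner is a new visual token pruning method that reduces the inference cost of multimodal large language models (MLLMs) by maximizing the conditional diversity of retained tokens.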
Details
Motivation: Redundant visual tokens make MLLM inference expensive, and existing methods that prune by attention or similarity perform suboptimally.
Method: Proposes CDPruner, a pruning method based on conditional diversity that uses a determinantal point process (DPP) to maximize the conditional diversity of the retained tokens.
Result: Achieves SOTA performance across multiple MLLMs, cutting FLOPs by 95% and CUDA latency by 78% while retaining 94% of the original accuracy.
Conclusion: CDPruner optimizes token selection through conditional diversity, reducing computational cost efficiently without sacrificing performance.
Abstract: In multimodal large language models (MLLMs), the length of input visual tokens is often significantly greater than that of their textual counterparts, leading to a high inference cost. Many works aim to address this issue by removing redundant visual tokens. However, current approaches either rely on attention-based pruning, which retains numerous duplicate tokens, or use similarity-based pruning, overlooking the instruction relevance, consequently causing suboptimal performance. In this paper, we go beyond attention or similarity by proposing a novel visual token pruning method named CDPruner, which maximizes the conditional diversity of retained tokens. We first define the conditional similarity between visual tokens conditioned on the instruction, and then reformulate the token pruning problem with determinantal point process (DPP) to maximize the conditional diversity of the selected subset. The proposed CDPruner is training-free and model-agnostic, allowing easy application to various MLLMs. Extensive experiments across diverse MLLMs show that CDPruner establishes new state-of-the-art on various vision-language benchmarks. By maximizing conditional diversity through DPP, the selected subset better represents the input images while closely adhering to user instructions, thereby preserving strong performance even with high reduction ratios. When applied to LLaVA, CDPruner reduces FLOPs by 95% and CUDA latency by 78%, while maintaining 94% of the original accuracy. Our code is available at https://github.com/Theia-4869/CDPruner.
[92] GenWorld: Towards Detecting AI-generated Real-world Simulation Videos
Weiliang Chen,Wenzhao Zheng,Yu Zheng,Lei Chen,Jie Zhou,Jiwen Lu,Yueqi Duan
Main category: cs.CV
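TL;DR: GenWorld is a large-scale, high-quality real-world simulation dataset for AI-generated video detection that addresses the shortcomings of existing datasets. The proposed SpannDetector model uses multi-view consistency to substantially improve detection.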
Details
Motivation: Rapid progress in video generation threatens the credibility of real-world information, while the lack of high-quality real-world datasets hinders the development of reliable detectors.
Method: Proposes the GenWorld dataset, which simulates real-world scenarios using multiple generation models and diverse prompt modalities, and designs the SpannDetector model, which detects AI-generated videos through multi-view consistency.
Result: Experiments show that SpannDetector performs strongly on high-quality generated videos and point to an explainable detection direction based on physical plausibility.
Conclusion: GenWorld and SpannDetector provide an important resource and method for AI-generated video detection and advance the field.
Abstract: The flourishing of video generation technologies has endangered the credibility of real-world information and intensified the demand for AI-generated video detectors. Despite some progress, the lack of high-quality real-world datasets hinders the development of trustworthy detectors. In this paper, we propose GenWorld, a large-scale, high-quality, and real-world simulation dataset for AI-generated video detection. GenWorld features the following characteristics: (1) Real-world Simulation: GenWorld focuses on videos that replicate real-world scenarios, which have a significant impact due to their realism and potential influence; (2) High Quality: GenWorld employs multiple state-of-the-art video generation models to provide realistic and high-quality forged videos; (3) Cross-prompt Diversity: GenWorld includes videos generated from diverse generators and various prompt modalities (e.g., text, image, video), offering the potential to learn more generalizable forensic features. We analyze existing methods and find they fail to detect high-quality videos generated by world models (i.e., Cosmos), revealing potential drawbacks of ignoring real-world clues. To address this, we propose a simple yet effective model, SpannDetector, to leverage multi-view consistency as a strong criterion for real-world AI-generated video detection. Experiments show that our method achieves superior results, highlighting a promising direction for explainable AI-generated video detection based on physical plausibility. We believe that GenWorld will advance the field of AI-generated video detection.
Project Page: https://chen-wl20.github.io/GenWorld
[93] QuadricFormer: Scene as Superquadrics for 3D Semantic Occupancy Prediction
Sicheng Zuo,Wenzhao Zheng,Xiaoyong Han,Longchao Yang,Yong Pan,Jiwen Lu
Main category: cs.CV
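TL;DR: The paper proposes QuadricFormer, a superquadric-based method for 3D occupancy prediction. Geometric diversity and efficient modeling address the shortcomings of existing methods in handling scene sparsity and shape diversity.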
Details
Motivation: Existing 3D occupancy prediction methods typically use dense voxel or sparse Gaussian representations. The former is inefficient, and the latter struggles to model complex structures because of its ellipsoidal shape prior. Objects in real driving scenes have diverse geometries and call for a more efficient representation.
Method: Uses superquadrics as scene primitives, develops a probabilistic superquadric mixture model that combines geometry priors with probabilistic semantic mixing, and designs the QuadricFormer model with a pruning-and-splitting module to improve efficiency.
Result: Experiments on the nuScenes dataset show that QuadricFormer achieves state-of-the-art performance and efficiency.
Conclusion: Superquadric primitives model complex structures efficiently, and QuadricFormer offers a better solution for 3D occupancy prediction.
Abstract: 3D occupancy prediction is crucial for robust autonomous driving systems as it enables comprehensive perception of environmental structures and semantics. Most existing methods employ dense voxel-based scene representations, ignoring the sparsity of driving scenes and resulting in inefficiency. Recent works explore object-centric representations based on sparse Gaussians, but their ellipsoidal shape prior limits the modeling of diverse structures. In real-world driving scenes, objects exhibit rich geometries (e.g., cuboids, cylinders, and irregular shapes), necessitating excessive ellipsoidal Gaussians densely packed for accurate modeling, which leads to inefficient representations. To address this, we propose to use geometrically expressive superquadrics as scene primitives, enabling efficient representation of complex structures with fewer primitives through their inherent shape diversity. We develop a probabilistic superquadric mixture model, which interprets each superquadric as an occupancy probability distribution with a corresponding geometry prior, and calculates semantics through probabilistic mixture. Building on this, we present QuadricFormer, a superquadric-based model for efficient 3D occupancy prediction, and introduce a pruning-and-splitting module to further enhance modeling efficiency by concentrating superquadrics in occupied regions. Extensive experiments on the nuScenes dataset demonstrate that QuadricFormer achieves state-of-the-art performance while maintaining superior efficiency.
[94] Fine-Grained Perturbation Guidance via Attention Head Selection
Donghoon Ahn,Jiwon Kang,Sanghyun Lee,Minjae Kim,Jaewon Min,Wooseok Jang,Saungwu Lee,Sayak Paul,Susung Hong,Seungryong Kim
Main category: cs.CV
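TL;DR: The paper proposes HeadHunter, a systematic framework that improves generation quality in diffusion models through fine-grained selection of attention heads, and introduces SoftPAG to tune perturbation strength.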
Details
Motivation: Existing attention perturbation methods lack a principled way to decide where to apply perturbations, especially in DiT architectures, where quality-relevant computations are distributed across layers.
Method: Studies the granularity of attention perturbation from the layer level down to individual attention heads, and proposes the HeadHunter framework and the SoftPAG method.
Result: Experiments on Stable Diffusion 3 and FLUX.1 confirm superior performance and fine-grained control over generation quality and visual attributes.
Conclusion: The work presents the first head-level analysis of attention perturbation in diffusion models, reveals interpretable specialization within attention layers, and supports the design of effective perturbation strategies.
Abstract: Recent guidance methods in diffusion models steer reverse sampling by perturbing the model to construct an implicit weak model and guide generation away from it. Among these approaches, attention perturbation has demonstrated strong empirical performance in unconditional scenarios where classifier-free guidance is not applicable. However, existing attention perturbation methods lack principled approaches for determining where perturbations should be applied, particularly in Diffusion Transformer (DiT) architectures where quality-relevant computations are distributed across layers. In this paper, we investigate the granularity of attention perturbations, ranging from the layer level down to individual attention heads, and discover that specific heads govern distinct visual concepts such as structure, style, and texture quality. Building on this insight, we propose "HeadHunter", a systematic framework for iteratively selecting attention heads that align with user-centric objectives, enabling fine-grained control over generation quality and visual attributes. In addition, we introduce SoftPAG, which linearly interpolates each selected head's attention map toward an identity matrix, providing a continuous knob to tune perturbation strength and suppress artifacts. Our approach not only mitigates the oversmoothing issues of existing layer-level perturbation but also enables targeted manipulation of specific visual styles through compositional head selection. We validate our method on modern large-scale DiT-based text-to-image models including Stable Diffusion 3 and FLUX.1, demonstrating superior performance in both general quality enhancement and style-specific guidance.
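The SoftPAG interpolation described in this abstract can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `soft_identity_attention`, the `strength` parameter, and its [0, 1] range are hypothetical, and a nested list stands in for one head's attention map.

```python
def soft_identity_attention(attn, strength):
    """Linearly blend one head's attention map toward the identity matrix.

    attn:     square row-stochastic matrix (list of lists) for a single head.
    strength: assumed knob in [0, 1]; 0 leaves the map unchanged and
              1 replaces it with the identity matrix.
    """
    if not 0.0 <= strength <= 1.0:
        raise ValueError("strength must be in [0, 1]")
    n = len(attn)
    return [
        [
            (1.0 - strength) * attn[i][j] + strength * (1.0 if i == j else 0.0)
            for j in range(n)
        ]
        for i in range(n)
    ]


if __name__ == "__main__":
    head = [[0.5, 0.5], [0.25, 0.75]]
    for row in soft_identity_attention(head, 0.5):
        print(row)
```

Because both the original map and the identity have rows that sum to one, every interpolated map is still a valid attention distribution, which is what makes the strength a continuous knob.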
Our work provides the first head-level analysis of attention perturbation in diffusion models, uncovering interpretable specialization within attention layers and enabling practical design of effective perturbation strategies.
[95] InstaInpaint: Instant 3D-Scene Inpainting with Masked Large Reconstruction Model
Junqi You,Chieh Hubert Lin,Weijie Lyu,Zhengbo Zhang,Ming-Hsuan Yang
Main category: cs.CV
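TL;DR: InstaInpaint is a reference-based feed-forward framework that completes 3D scene inpainting within 0.4 seconds, a 1000x speed-up, while maintaining strong performance.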
Details
Motivation: Current 3D scene inpainting methods rely on lengthy, computationally intensive optimization and cannot meet the needs of real-time or online applications.
Method: Proposes the InstaInpaint framework, which trains a large reconstruction model (LRM) with a self-supervised masked-finetuning strategy.
Result: Performs strongly on standard benchmarks and supports flexible downstream applications such as object insertion and multi-region inpainting.
Conclusion: InstaInpaint outperforms existing methods in both speed and performance and is suitable for real-time 3D scene inpainting.
Abstract: Recent advances in 3D scene reconstruction enable real-time viewing in virtual and augmented reality. To support interactive operations for better immersiveness, such as moving or editing objects, 3D scene inpainting methods are proposed to repair or complete the altered geometry. However, current approaches rely on lengthy and computationally intensive optimization, making them impractical for real-time or online applications. We propose InstaInpaint, a reference-based feed-forward framework that produces 3D-scene inpainting from a 2D inpainting proposal within 0.4 seconds. We develop a self-supervised masked-finetuning strategy to enable training of our custom large reconstruction model (LRM) on the large-scale dataset. Through extensive experiments, we analyze and identify several key designs that improve generalization, textural consistency, and geometric correctness. InstaInpaint achieves a 1000x speed-up from prior methods while maintaining state-of-the-art performance across two standard benchmarks. Moreover, we show that InstaInpaint generalizes well to flexible downstream applications such as object insertion and multi-region inpainting. More video results are available at our project page: https://dhmbb2.github.io/InstaInpaint_page/.
[96] SceneCompleter: Dense 3D Scene Completion for Generative Novel View Synthesis
Weiliang Chen,Jiayi Bi,Yuanhui Huang,Wenzhao Zheng,Yueqi Duan
Main category: cs.CV
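TL;DR: SceneCompleter is a framework for 3D-consistent generative novel view synthesis through dense 3D scene completion. It addresses the geometric distortion caused by separating 2D completion from 3D recovery in conventional pipelines.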
Details
Motivation: Existing generative models for novel view synthesis rely on a pipeline that separates 2D completion from 3D recovery, which leads to distorted geometry and overly smooth surfaces.
Method: Proposes the SceneCompleter framework, consisting of a geometry-appearance dual-stream diffusion model and a scene embedder, to jointly synthesize novel views in RGBD space.
Result: The method shows better visual coherence and 3D consistency across diverse datasets.
Conclusion: By fusing structural and textural information, SceneCompleter achieves better novel view synthesis.
Abstract: Generative models have gained significant attention in novel view synthesis (NVS) by alleviating the reliance on dense multi-view captures. However, existing methods typically fall into a conventional paradigm, where generative models first complete missing areas in 2D, followed by 3D recovery techniques to reconstruct the scene, which often results in overly smooth surfaces and distorted geometry, as generative models struggle to infer 3D structure solely from RGB data. In this paper, we propose SceneCompleter, a novel framework that achieves 3D-consistent generative novel view synthesis through dense 3D scene completion. SceneCompleter achieves both visual coherence and 3D-consistent generative scene completion through two key components: (1) a geometry-appearance dual-stream diffusion model that jointly synthesizes novel views in RGBD space; (2) a scene embedder that encodes a more holistic scene understanding from the reference image. By effectively fusing structural and textural information, our method demonstrates superior coherence and plausibility in generative novel view synthesis across diverse datasets. Project Page: https://chen-wl20.github.io/SceneCompleter
cs.GR [Back]
[97] Learning-based density-equalizing map
Yanwen Huang,Lok Ming Lui,Gary P. T. Choi
Main category: cs.GR
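TL;DR: Proposes a deep-learning-based density-equalizing mapping framework (LDEM) that overcomes the limitations of traditional methods in accuracy, overlapping artifacts, and extension from 2D to 3D.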
Details
Motivation: Traditional density-equalizing map methods suffer from limited accuracy, produce overlapping artifacts in extreme cases, and require algorithmic redesign when extended from 2D to 3D.
Method: Uses deep neural networks with a loss function that enforces density uniformity and geometric regularity, and adopts a hierarchical approach to predict transformations at both the coarse and dense levels.
Result: LDEM shows better density-equalizing and bijectivity properties than traditional methods across a range of density distributions and extends seamlessly to 3D.
Conclusion: LDEM offers a scalable and robust new approach for practical use of density-equalizing maps.
Abstract: Density-equalizing map (DEM) serves as a powerful technique for creating shape deformations with the area changes reflecting an underlying density function. In recent decades, DEM has found widespread applications in fields such as data visualization, geometry processing, and medical imaging. Traditional approaches to DEM primarily rely on iterative numerical solvers for diffusion equations or optimization-based methods that minimize handcrafted energy functionals. However, these conventional techniques often face several challenges: they may suffer from limited accuracy, produce overlapping artifacts in extreme cases, and require substantial algorithmic redesign when extended from 2D to 3D, due to the derivative-dependent nature of their energy formulations. In this work, we propose a novel learning-based density-equalizing mapping framework (LDEM) using deep neural networks. Specifically, we introduce a loss function that enforces density uniformity and geometric regularity, and utilize a hierarchical approach to predict the transformations at both the coarse and dense levels. Our method demonstrates superior density-equalizing and bijectivity properties compared to prior methods for a wide range of simple and complex density distributions, and can be easily applied to surface remeshing with different effects. Also, it generalizes seamlessly from 2D to 3D domains without structural changes to the model architecture or loss formulation. Altogether, our work opens up new possibilities for scalable and robust computation of density-equalizing maps for practical applications.
[98] FastFLUX: Pruning FLUX with Block-wise Replacement and Sandwich Training
Fuhan Cai,Yong Guo,Jie Li,Wenbo Li,Xiangzhong Fang,Jian Chen
Main category: cs.GR
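TL;DR: FastFLUX is an architecture-level pruning framework that uses the BRLL method and the ST training strategy to substantially improve the inference efficiency of FLUX while preserving image quality.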
Details
Motivation: Existing T2I generation models such as FLUX have massive parameter counts, slow inference, and poor deployability, while existing acceleration methods suffer noticeable performance degradation and high training costs.
Method: Proposes BRLL, which replaces complex residual branches with lightweight linear layers, and introduces the ST training strategy, which uses LoRA to supervise neighboring blocks and reduce performance loss.
Result: FastFLUX maintains high-quality image generation with 20% of the hierarchy pruned while significantly improving inference speed.
Conclusion: FastFLUX addresses the efficiency problem of FLUX and provides an efficient solution for T2I generation.
Abstract: Recent advancements in text-to-image (T2I) generation have led to the emergence of highly expressive models such as diffusion transformers (DiTs), exemplified by FLUX. However, their massive parameter sizes lead to slow inference, high memory usage, and poor deployability. Existing acceleration methods (e.g., single-step distillation and attention pruning) often suffer from significant performance degradation and incur substantial training costs. To address these limitations, we propose FastFLUX, an architecture-level pruning framework designed to enhance the inference efficiency of FLUX. At its core is the Block-wise Replacement with Linear Layers (BRLL) method, which replaces structurally complex residual branches in ResBlocks with lightweight linear layers while preserving the original shortcut connections for stability. Furthermore, we introduce Sandwich Training (ST), a localized fine-tuning strategy that leverages LoRA to supervise neighboring blocks, mitigating performance drops caused by structural replacement. Experiments show that our FastFLUX maintains high image quality under both qualitative and quantitative evaluations, while significantly improving inference speed, even with 20% of the hierarchy pruned. Our code will be available soon.
[99] Token Perturbation Guidance for Diffusion Models
Javad Rajabi,Soroush Mehraban,Seyedmorteza Sadat,Babak Taati
Main category: cs.GR
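TL;DR: The paper proposes Token Perturbation Guidance (TPG), a new method that improves the generation quality of diffusion models by perturbing intermediate token representations. It requires no additional training and applies to both conditional and unconditional generation.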
Details
Motivation: Classifier-free guidance (CFG) requires specific training procedures and applies only to conditional generation, which limits its use. TPG aims to overcome these limitations.
Method: TPG directly perturbs intermediate token representations in the diffusion network with a norm-preserving shuffling operation, providing a stable and effective guidance signal.
Result: Experiments show that TPG substantially improves FID in unconditional generation (nearly a 2$\times$ improvement) while closely matching CFG in conditional generation.
Conclusion: TPG is a general, training-free guidance method that brings CFG-like benefits to a broader class of diffusion models.
Abstract: Classifier-free guidance (CFG) has become an essential component of modern diffusion models to enhance both generation quality and alignment with input conditions. However, CFG requires specific training procedures and is limited to conditional generation. To address these limitations, we propose Token Perturbation Guidance (TPG), a novel method that applies perturbation matrices directly to intermediate token representations within the diffusion network. TPG employs a norm-preserving shuffling operation to provide effective and stable guidance signals that improve generation quality without architectural changes. As a result, TPG is training-free and agnostic to input conditions, making it readily applicable to both conditional and unconditional generation. We further analyze the guidance term provided by TPG and show that its effect on sampling more closely resembles CFG compared to existing training-free guidance techniques. Extensive experiments on SDXL and Stable Diffusion 2.1 show that TPG achieves nearly a 2$\times$ improvement in FID for unconditional generation over the SDXL baseline, while closely matching CFG in prompt alignment. These results establish TPG as a general, condition-agnostic guidance method that brings CFG-like benefits to a broader class of diffusion models. The code is available at https://github.com/TaatiTeam/Token-Perturbation-Guidance
[100] Ambient Diffusion Omni: Training Good Models with Bad Data
Giannis Daras,Adrian Rodriguez-Munoz,Adam Klivans,Antonio Torralba,Constantinos Daskalakis
Main category: cs.GR
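TL;DR: Uses low-quality, synthetic, and out-of-distribution images to improve diffusion models. The proposed Ambient Diffusion Omni framework exploits the spectral power law decay and locality of images to significantly improve generation quality and diversity.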
Details
Motivation: Diffusion models are conventionally trained on high-quality datasets, and low-quality images are often discarded. This paper aims to use those images to improve model performance.
Method: Proposes the Ambient Diffusion Omni framework, which exploits the spectral power law decay and locality of images to extract signal from all available images.
Result: Achieves state-of-the-art FID on ImageNet and significantly improves the quality and diversity of text-to-image generation.
Conclusion: Noise dampens the skew between the high-quality distribution and the mixed distribution, and theoretical analysis supports the effectiveness of the method.
Abstract: We show how to use low-quality, synthetic, and out-of-distribution images to improve the quality of a diffusion model. Typically, diffusion models are trained on curated datasets that emerge from highly filtered data pools from the Web and other sources. We show that there is immense value in the lower-quality images that are often discarded. We present Ambient Diffusion Omni, a simple, principled framework to train diffusion models that can extract signal from all available images during training. Our framework exploits two properties of natural images -- spectral power law decay and locality. We first validate our framework by successfully training diffusion models with images synthetically corrupted by Gaussian blur, JPEG compression, and motion blur. We then use our framework to achieve state-of-the-art ImageNet FID, and we show significant improvements in both image quality and diversity for text-to-image generative modeling. The core insight is that noise dampens the initial skew between the desired high-quality distribution and the mixed distribution we actually observe. We provide rigorous theoretical justification for our approach by analyzing the trade-off between learning from biased data versus limited unbiased data across diffusion times.
[101] Low-Barrier Dataset Collection with Real Human Body for Interactive Per-Garment Virtual Try-On
Zaiqiang Wu,Yechen Li,Jingyuan Liu,Yuki Shibata,Takayuki Hori,I-Chao Shen,Takeo Igarashi
Main category: cs.GR
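TL;DR: Proposes a low-cost virtual try-on method based on real human bodies, addressing existing methods' reliance on expensive robotic mannequins and their inaccurate garment alignment.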
Details
Motivation: Existing virtual try-on methods are limited to the front view, lack real-time performance, and rely on expensive robotic mannequins that cannot accurately reproduce human body deformation or garment alignment.
Method: Collects per-garment datasets with real human bodies and enhances the intermediate representation with a simplified DensePose map to align garments accurately with the body.
Result: Qualitative and quantitative evaluations show better image quality and temporal consistency than existing techniques, and a user study confirms that the system helps with purchasing decisions.
Conclusion: The method lowers data collection cost and improves the accuracy and practicality of virtual try-on.
Abstract: Existing image-based virtual try-on methods are often limited to the front view and lack real-time performance. While per-garment virtual try-on methods have tackled these issues by capturing per-garment datasets and training per-garment neural networks, they still encounter practical limitations: (1) the robotic mannequin used to capture per-garment datasets is prohibitively expensive for widespread adoption and fails to accurately replicate natural human body deformation; (2) the synthesized garments often misalign with the human body. To address these challenges, we propose a low-barrier approach for collecting per-garment datasets using real human bodies, eliminating the necessity for a customized robotic mannequin. We also introduce a hybrid person representation that enhances the existing intermediate representation with a simplified DensePose map. This ensures accurate alignment of synthesized garment images with the human body and enables human-garment interaction without the need for customized wearable devices. We performed qualitative and quantitative evaluations against other state-of-the-art image-based virtual try-on methods and conducted ablation studies to demonstrate the superiority of our method regarding image quality and temporal consistency. Finally, our user study results indicated that most participants found our virtual try-on system helpful for making garment purchasing decisions.
[102] Edit360: 2D Image Edits to 3D Assets from Any Angle
Junchao Huang,Xinting Hu,Zhuotao Tian,Shaoshuai Shi,Li Jiang
Main category: cs.GR
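TL;DR: Edit360 is a tuning-free framework that uses video diffusion models to extend 2D edits into multi-view consistent 3D editing, removing the viewpoint restrictions of existing methods.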
Details
Motivation: Existing 3D asset editing methods are restricted to limited viewpoints and struggle to achieve multi-view consistency, which limits flexibility and practical use.
Method: Built on video diffusion models, Edit360 uses anchor-view selection and an editing propagation mechanism to align and merge multi-view information within the latent and attention spaces.
Result: Edit360 supports editing from arbitrary viewpoints, ensures multi-view consistency, and enables reconstruction of high-quality 3D assets.
Conclusion: Edit360 provides a flexible and consistent solution for customizable 3D content creation.
Abstract: Recent advances in diffusion models have significantly improved image generation and editing, but extending these capabilities to 3D assets remains challenging, especially for fine-grained edits that require multi-view consistency. Existing methods typically restrict editing to predetermined viewing angles, severely limiting their flexibility and practical applications. We introduce Edit360, a tuning-free framework that extends 2D modifications to multi-view consistent 3D editing. Built upon video diffusion models, Edit360 enables user-specific editing from arbitrary viewpoints while ensuring structural coherence across all views. The framework selects anchor views for 2D modifications and propagates edits across the entire 360-degree range. To achieve this, Edit360 introduces a novel Anchor-View Editing Propagation mechanism, which effectively aligns and merges multi-view information within the latent and attention spaces of diffusion models. The resulting edited multi-view sequences facilitate the reconstruction of high-quality 3D assets, enabling customizable 3D content creation.
[103] Transformer IMU Calibrator: Dynamic On-body IMU Calibration for Inertial Motion Capture
Chengxu Zuo,Jiawei Huang,Xiao Jiang,Yuan Yao,Xiangren Shi,Rui Cao,Xinyu Yi,Feng Xu,Shihui Guo,Yipeng Qin
Main category: cs.GR
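TL;DR: Proposes a new dynamic calibration method for sparse inertial motion capture systems that breaks the absolute static assumption of traditional IMU calibration.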
Details
Motivation: Traditional IMU calibration relies on an absolute static assumption, which limits application scenarios. This paper aims to broaden the use of IMUs through dynamic calibration.
Method: Under two assumptions (the matrices change negligibly within a short time window, and the movements and readings in that window are diverse), a Transformer model learns the mapping between the matrices and IMU readings, and a calibration trigger is designed.
Result: Achieves real-time dynamic calibration, the first implicit IMU calibration, and long-term accurate motion capture with sparse IMUs.
Conclusion: The method significantly improves the flexibility and practicality of IMU calibration and opens the way to wider use of sparse IMUs.
Abstract: In this paper, we propose a novel dynamic calibration method for sparse inertial motion capture systems, which is the first to break the restrictive absolute static assumption in IMU calibration, i.e., the coordinate drift $R_{G'G}$ and measurement offset $R_{BS}$ remain constant during the entire motion, thereby significantly expanding their application scenarios. Specifically, we achieve real-time estimation of $R_{G'G}$ and $R_{BS}$ under two relaxed assumptions: i) the matrices change negligibly in a short time window; ii) the human movements/IMU readings are diverse in such a time window. Intuitively, the first assumption reduces the number of candidate matrices, and the second assumption provides diverse constraints, which greatly reduces the solution space and allows for accurate estimation of $R_{G'G}$ and $R_{BS}$ from a short history of IMU readings in real time. To achieve this, we created synthetic datasets of paired $R_{G'G}$, $R_{BS}$ matrices and IMU readings, and learned their mappings using a Transformer-based model. We also designed a calibration trigger based on the diversity of IMU readings to ensure that assumption ii) is met before applying our method. To our knowledge, we are the first to achieve implicit IMU calibration (i.e., seamlessly putting IMUs into use without the need for an explicit calibration process), as well as the first to enable long-term and accurate motion capture using sparse IMUs. The code and dataset are available at https://github.com/ZuoCX1996/TIC.
cs.CL [Back]
[104] A Survey of Automatic Evaluation Methods on Text, Visual and Speech Generations
Tian Lan,Yang-Hao Zhou,Zi-Ao Ma,Fanshu Sun,Rui-Qing Sun,Junyu Luo,Rong-Cheng Tu,Heyan Huang,Chen Xu,Zhijing Wu,Xian-Ling Mao
Main category: cs.CL
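TL;DR: Presents a unified taxonomy of automatic evaluation methods for generated content across text, image, and audio modalities, addressing the lack of a systematic framework in current research.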
Details
Motivation: Although deep learning has markedly advanced generative AI, automatic evaluation of generated content still lacks a systematic framework.
Method: Analyzes evaluation methods for text generation, extends the analysis to image and audio generation, and identifies five fundamental paradigms.
Result: Establishes a unified taxonomy that applies across modalities and demonstrates its broad applicability.
Conclusion: Future research should focus on developing cross-modal evaluation methods.
Abstract: Recent advances in deep learning have significantly enhanced generative AI capabilities across text, images, and audio. However, automatically evaluating the quality of these generated outputs presents ongoing challenges. Although numerous automatic evaluation methods exist, current research lacks a systematic framework that comprehensively organizes these methods across text, visual, and audio modalities. To address this issue, we present a comprehensive review and a unified taxonomy of automatic evaluation methods for generated content across all three modalities. We identify five fundamental paradigms that characterize existing evaluation approaches across these domains. Our analysis begins by examining evaluation methods for text generation, where techniques are most mature. We then extend this framework to image and audio generation, demonstrating its broad applicability. Finally, we discuss promising directions for future research in cross-modal evaluation methodologies.
[105] TaskCraft: Automated Generation of Agentic Tasks
Dingfeng Shi,Jingyi Cao,Qianben Chen,Weichen Sun,Weizhen Li,Hongxuan Lu,Fangchen Dong,Tianrui Qin,King Zhu,Minghao Yang,Jian Yang,Ge Zhang,Jiaheng Liu,Changwang Zhang,Jun Wang,Yuchen Eleanor Jiang,Wangchunshu Zhou
Main category: cs.CL
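TL;DR: TaskCraft is an automated workflow for generating scalable, multi-tool, verifiable agentic tasks, addressing the lack of tool interaction in existing instruction data and the reliance on human annotation.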
Details
Motivation: Agentic tasks are increasingly important in NLP and AI, but existing instruction data lacks tool interaction and current benchmarks are hard to scale.
Method: Expands atomic tasks with depth-based and width-based extensions to produce structurally and hierarchically complex tasks, and builds a large-scale synthetic dataset.
Result: Experiments show that these tasks improve prompt optimization in the generation workflow and strengthen supervised fine-tuning of agentic foundation models.
Conclusion: TaskCraft provides a scalable solution for agent tuning and evaluation and supports future research.
Abstract: Agentic tasks, which require multi-step problem solving with autonomy, tool use, and adaptive reasoning, are becoming increasingly central to the advancement of NLP and AI. However, existing instruction data lacks tool interaction, and current agentic benchmarks rely on costly human annotation, limiting their scalability. We introduce TaskCraft, an automated workflow for generating difficulty-scalable, multi-tool, and verifiable agentic tasks with execution trajectories. TaskCraft expands atomic tasks using depth-based and width-based extensions to create structurally and hierarchically complex challenges. Empirical results show that these tasks improve prompt optimization in the generation workflow and enhance supervised fine-tuning of agentic foundation models. We present a large-scale synthetic dataset of approximately 36,000 tasks with varying difficulty to support future research on agent tuning and evaluation.
[106] A quantum semantic framework for natural language processing
Christopher J. Agostino,Quan Le Thien,Molly Apsel,Denizhan Pak,Elina Lesyk,Ashabari Majumdar
Main category: cs.CL
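TL;DR: Examines how semantic degeneracy affects natural language understanding, argues that the chance of recovering a single intended meaning vanishes as expression complexity grows, and shows experimentally that linguistic interpretation under ambiguity exhibits non-classical contextuality.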
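As a rough illustration of the Bell-type test this entry reports, the sketch below computes a CHSH statistic from four correlation values and checks it against the classical bound $|S| \leq 2$ quoted in the abstract. The sign convention S = E(a,b) - E(a,b') + E(a',b) + E(a',b') is one common form and is an assumption here, since the abstract does not say which convention the authors used. The function names and the example correlations are hypothetical and are not data from the paper.

```python
def chsh(e_ab: float, e_ab2: float, e_a2b: float, e_a2b2: float) -> float:
    """CHSH statistic for correlations E(a,b), E(a,b'), E(a',b), E(a',b').

    Assumed convention: S = E(a,b) - E(a,b') + E(a',b) + E(a',b').
    """
    return e_ab - e_ab2 + e_a2b + e_a2b2


CLASSICAL_BOUND = 2.0  # |S| <= 2 for any classical (local hidden variable) model


def violates_classical_bound(s: float) -> bool:
    """True when |S| exceeds the classical bound."""
    return abs(s) > CLASSICAL_BOUND


# Hypothetical correlation values chosen only to illustrate a violation.
s_example = chsh(0.6, -0.6, 0.6, 0.6)

# A deterministic classical assignment (all outcomes +1) sits exactly on the bound.
s_classical = chsh(1.0, 1.0, 1.0, 1.0)
```

Each correlation is an average of products of two outcomes in {-1, +1}, so a single run's value of S depends on how those outcomes are estimated. That is one reason the abstract reports a range of values across experiments instead of a single number.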
Details
Motivation: Studies the limits that semantic degeneracy places on natural language processing (NLP) systems and on human understanding, challenging the traditional view that linguistic forms carry meaning in themselves.
Method: Analyzes semantic degeneracy using Kolmogorov complexity and runs a semantic Bell inequality test in which LLM agents, treated as computational cognitive systems, interpret ambiguous word pairs.
Result: Average CHSH expectation values ranged from 1.2 to 2.8, with several runs (e.g., 2.3-2.4) significantly violating the classical bound, indicating that linguistic interpretation can exhibit non-classical contextuality.
Conclusion: Classical frequentist approaches to language analysis are limited; Bayesian-style repeated sampling is proposed as a more appropriate way to characterize linguistic meaning in context.
Abstract: Semantic degeneracy represents a fundamental property of natural language that extends beyond simple polysemy to encompass the combinatorial explosion of potential interpretations that emerges as semantic expressions increase in complexity. Large Language Models (LLMs) and other modern NLP systems face inherent limitations precisely because they operate within natural language itself, making them subject to the same interpretive constraints imposed by semantic degeneracy. In this work, we argue using Kolmogorov complexity that as an expression's complexity grows, the likelihood of any interpreting agent (human or LLM-powered AI) recovering the single intended meaning vanishes. This computational intractability suggests the classical view that linguistic forms possess meaning in and of themselves is flawed. We alternatively posit that meaning is instead actualized through an observer-dependent interpretive act. To test this, we conducted a semantic Bell inequality test using diverse LLM agents as "computational cognitive systems" to interpret ambiguous word pairs under varied contextual settings. Across several independent experiments, we found average CHSH expectation values ranging from 1.2 to 2.8, with several runs yielding values (e.g., 2.3-2.4) that significantly violate the classical boundary ($|S| \leq 2$). This demonstrates that linguistic interpretation under ambiguity can exhibit non-classical contextuality, consistent with results from human cognition experiments. These results inherently imply that classical frequentist-based analytical approaches for natural language are necessarily lossy. Instead, we propose that Bayesian-style repeated sampling approaches can provide more practically useful and appropriate characterizations of linguistic meaning in context.
[107] Chat-of-Thought: Collaborative Multi-Agent System for Generating Domain Specific Information
Christodoulos Constantinides,Shuxin Lin,Nianjun Zhou,Dhaval Patel
Main category: cs.CL
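TL;DR: Proposes Chat-of-Thought, a multi-agent system for generating FMEA documents for industrial assets, which uses dynamic task routing and multi-role LLM agents to optimize content generation and validation.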
Details
Motivation: Addresses the challenge of generating FMEA documents for industrial equipment monitoring, using multi-agent collaboration to improve efficiency and accuracy.
Method: Uses multi-role LLM agents with dynamic task routing and iteratively refines content through a Chat of Thought.
Result: Demonstrates the potential of Chat-of-Thought in template-driven workflows and context-aware agent collaboration.
Conclusion: Chat-of-Thought offers a novel and efficient approach to generating FMEA documents for industrial assets.
Abstract: This paper presents a novel multi-agent system called Chat-of-Thought, designed to facilitate the generation of Failure Modes and Effects Analysis (FMEA) documents for industrial assets. Chat-of-Thought employs multiple collaborative Large Language Model (LLM)-based agents with specific roles, leveraging advanced AI techniques and dynamic task routing to optimize the generation and validation of FMEA tables. A key innovation in this system is the introduction of a Chat of Thought, where dynamic, multi-persona-driven discussions enable iterative refinement of content. This research explores the application domain of industrial equipment monitoring, highlights key challenges, and demonstrates the potential of Chat-of-Thought in addressing these challenges through interactive, template-driven workflows and context-aware agent collaboration.
[108] When Meaning Stays the Same, but Models Drift: Evaluating Quality of Service under Token-Level Behavioral Instability in LLMs
Xiao Li,Joel Kreuzwieser,Alan Peters
Main category: cs.CL
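TL;DR: Studies how large language models respond differently to prompts that share the same meaning but differ in wording (termed prompt variance), proposes the PBSS framework to measure the difference, and finds consistent behavioral drift under semantically equivalent prompts.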
Details
Motivation: Investigates how LLM responses differ across prompts with the same meaning but different wording, exposing a stability problem in model evaluation.
Method: Proposes the Prompt-Based Semantic Shift (PBSS) framework and applies it to ten constrained tasks to measure behavioral drift under semantically equivalent prompts.
Result: Finds consistent, model-specific behavioral drift under semantically equivalent prompts, possibly linked to tokenization and decoding.
Conclusion: Prompt wording affects model behavior, and tokenization and decoding strategies may contribute to post-training quality-of-service instability.
Abstract: We investigate how large language models respond to prompts that differ only in their token-level realization but preserve the same semantic intent, a phenomenon we call prompt variance. We propose Prompt-Based Semantic Shift (PBSS), a diagnostic framework for measuring behavioral drift in LLMs under semantically equivalent prompt rewordings. Applied to ten constrained tasks, PBSS reveals consistent, model-specific response shifts, suggesting statistical regularities linked to tokenization and decoding. These results highlight an overlooked dimension of model evaluation stability under rephrasing and suggest that tokenization strategies and decoding dynamics may contribute to post-training quality of service instability.
[109] ChartReasoner: Code-Driven Modality Bridging for Long-Chain Reasoning in Chart Question Answering
Caijun Jia,Nan Xu,Jingxuan Wei,Qingli Wang,Lei Wang,Bihui Yu,Junnan Zhu
Main category: cs.CL
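TL;DR: Proposes ChartReasoner, a two-stage, code-driven framework that addresses information loss in visual reasoning tasks and performs strongly on chart question answering.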
Details
Motivation: Existing methods lose critical information when they convert visual reasoning tasks into textual ones, especially in chart question answering.
Method: A two-stage framework: (1) convert chart images into structured ECharts code; (2) build a data synthesis pipeline that generates reasoning trajectories, then train a multimodal model on them.
Result: Strong results on four public benchmarks: comparable to state-of-the-art open-source models with fewer parameters, and close to GPT-4o in out-of-domain settings.
Conclusion: ChartReasoner preserves chart details effectively and approaches the performance of proprietary systems while being more efficient.
Abstract: Recently, large language models have shown remarkable reasoning capabilities through long-chain reasoning before responding. However, how to extend this capability to visual reasoning tasks remains an open challenge. Existing multimodal reasoning approaches convert such visual reasoning tasks into textual reasoning tasks via several image-to-text conversions, which often lose critical structural and semantic information embedded in visualizations, especially for tasks like chart question answering that require a large amount of visual detail. To bridge this gap, we propose ChartReasoner, a novel code-driven two-stage framework designed to enable precise, interpretable reasoning over charts. We first train a high-fidelity model to convert diverse chart images into structured ECharts code, preserving both layout and data semantics as losslessly as possible. Then, we design a general chart reasoning data synthesis pipeline, which leverages this pretrained transport model to automatically and scalably generate chart reasoning trajectories and utilizes a code validator to filter out low-quality samples. Finally, we train the final multimodal model using a combination of supervised fine-tuning and reinforcement learning on our synthesized chart reasoning dataset. Experimental results on four public benchmarks clearly demonstrate the effectiveness of our proposed ChartReasoner. It can preserve the original details of the charts as much as possible and perform comparably with state-of-the-art open-source models while using fewer parameters, approaching the performance of proprietary systems like GPT-4o in out-of-domain settings.
[110] Unsupervised Elicitation of Language Models
Jiaxin Wen,Zachary Ankner,Arushi Somani,Peter Hase,Samuel Marks,Jacob Goldman-Wetzler,Linda Petrini,Henry Sleight,Collin Burns,He He,Shi Feng,Ethan Perez,Jan Leike
Main category: cs.CL
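TL;DR: Proposes ICM, an unsupervised algorithm that fine-tunes pretrained language models without external supervision and, on several tasks, matches training on golden labels and outperforms training on crowdsourced human labels.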
Details
Motivation: For models with superhuman capabilities, high-quality human supervision is difficult or impossible to obtain.
Method: Introduces Internal Coherence Maximization (ICM), which fine-tunes a model on labels it generates itself.
Result: On GSM8k-verification, TruthfulQA, and Alpaca reward modeling, ICM matches training on golden supervision and outperforms training on crowdsourced human supervision, and it elicits superhuman capabilities better than training on human labels.
Conclusion: ICM can improve the training of frontier language models; the resulting reward model and assistant outperform their human-supervised counterparts.
Abstract: To steer pretrained language models for downstream tasks, today's post-training paradigm relies on humans to specify desired behaviors. However, for models with superhuman capabilities, it is difficult or impossible to get high-quality human supervision. To address this challenge, we introduce a new unsupervised algorithm, Internal Coherence Maximization (ICM), to fine-tune pretrained language models on their own generated labels, without external supervision. On GSM8k-verification, TruthfulQA, and Alpaca reward modeling tasks, our method matches the performance of training on golden supervision and outperforms training on crowdsourced human supervision. On tasks where LMs' capabilities are strongly superhuman, our method can elicit those capabilities significantly better than training on human labels. Finally, we show that our method can improve the training of frontier LMs: we use our method to train an unsupervised reward model and use reinforcement learning to train a Claude 3.5 Haiku-based assistant. Both the reward model and the assistant outperform their human-supervised counterparts.
[111] When Large Language Models are Reliable for Judging Empathic Communication
Aakriti Kumar,Nalin Poungpeth,Diyi Yang,Erina Farrell,Bruce Lambert,Matthew Groh
Main category: cs.CL
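TL;DR: Compares how experts, crowdworkers, and LLMs annotate empathic communication under four psychology-informed frameworks, and finds that LLMs approach expert-level reliability and exceed that of crowdworkers.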
Details
Motivation: Examines how reliable LLMs are at judging empathic communication, in order to support transparency and oversight in emotionally sensitive applications.
Method: Compares empathy annotations from experts, crowdworkers, and LLMs on 200 real-world conversations and analyzes their agreement.
Result: Expert agreement is high but varies by framework; LLMs approach expert-level reliability and exceed that of crowdworkers.
Conclusion: When validated on specific tasks, LLMs are reliable enough for emotionally sensitive settings such as conversational companions.
Abstract: Large language models (LLMs) excel at generating empathic responses in text-based conversations. But how reliably do they judge the nuances of empathic communication? We investigate this question by comparing how experts, crowdworkers, and LLMs annotate empathic communication across four evaluative frameworks drawn from psychology, natural language processing, and communications applied to 200 real-world conversations where one speaker shares a personal problem and the other offers support. Drawing on 3,150 expert annotations, 2,844 crowd annotations, and 3,150 LLM annotations, we assess inter-rater reliability between these three annotator groups. We find that expert agreement is high but varies across the frameworks' sub-components depending on their clarity, complexity, and subjectivity. We show that expert agreement offers a more informative benchmark for contextualizing LLM performance than standard classification metrics. Across all four frameworks, LLMs consistently approach this expert level benchmark and exceed the reliability of crowdworkers. These results demonstrate how LLMs, when validated on specific tasks with appropriate benchmarks, can support transparency and oversight in emotionally sensitive applications including their use as conversational companions.
[112] Analyzing Emotions in Bangla Social Media Comments Using Machine Learning and LIME
Bidyarthi Paul,SM Musfiqur Rahman,Dipta Biswas,Md. Ziaul Hasan,Md. Zahid Hossain
Main category: cs.CL
TL;DR: Studies emotion analysis of Bangla social media comments using several machine learning models and an explanation tool.
Details
Motivation: Advances emotion analysis for low-resource languages such as Bangla, with attention to their distinctive regional expressions and cultural features.
Method: Combines Linear SVM, KNN, Random Forest, BiLSTM, and AdaBoost models with TF-IDF vectorization and PCA dimensionality reduction, and uses LIME to explain the AdaBoost classifier's predictions.
Result: Explores effective approaches to Bangla emotion recognition through a range of techniques.
Conclusion: The study offers a varied set of techniques for emotion analysis in low-resource languages and stresses the importance of model interpretability.
Abstract: Research on understanding emotions in written language continues to expand, especially for understudied languages with distinctive regional expressions and cultural features, such as Bangla. This study examines emotion analysis using 22,698 social media comments from the EmoNoBa dataset. For language analysis, we employ machine learning models: Linear SVM, KNN, and Random Forest with n-gram data from a TF-IDF vectorizer. We additionally investigated how PCA affects the reduction of dimensionality. Moreover, we utilized a BiLSTM model and AdaBoost to improve decision trees. To make our machine learning models easier to understand, we used LIME to explain the predictions of the AdaBoost classifier, which uses decision trees. With the goal of advancing sentiment analysis in languages with limited resources, our work examines various techniques to find efficient techniques for emotion identification in Bangla.
[113] Measuring Corporate Human Capital Disclosures: Lexicon, Data, Code, and Research Opportunities
Elizabeth Demers,Victor Xiaoqi Wang,Kean Wu
Main category: cs.CL
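TL;DR: Proposes a machine-learning-based method for developing a lexicon that measures the multidimensional management and disclosure of human capital (HC), and shares the data and code for future research.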
Details
Motivation: Human capital is increasingly important to corporate value creation, yet it lacks well-defined measurement and disclosure rules, so a systematic way to capture its multidimensional nature is needed.
Method: Trains the word2vec algorithm on a confirmed set of HC disclosures to develop a lexicon of HC-related keywords in five subcategories.
Result: Produces a comprehensive HC lexicon and releases data, code, and examples to support future research on HC questions.
Conclusion: The paper provides tools and direction for research on HC management and discusses future research opportunities.
Abstract: Human capital (HC) is increasingly important to corporate value creation. Unlike other assets, however, HC is not currently subject to well-defined measurement or disclosure rules. We use a machine learning algorithm (word2vec) trained on a confirmed set of HC disclosures to develop a comprehensive list of HC-related keywords classified into five subcategories (DEI; health and safety; labor relations and culture; compensation and benefits; and demographics and other) that capture the multidimensional nature of HC management. We share our lexicon, corporate HC disclosures, and the Python code used to develop the lexicon, and we provide detailed examples of using our data and code, including for fine-tuning a BERT model. Researchers can use our HC lexicon (or modify the code to capture another construct of interest) with their samples of corporate communications to address pertinent HC questions. We close with a discussion of future research opportunities related to HC management and disclosure.
[114] Can LLMs Generate Good Stories? Insights and Challenges from a Narrative Planning Perspective
Yi Wang,Max Kreminski
Main category: cs.CL
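TL;DR: Studies the story generation ability of large language models (LLMs) by having them solve narrative planning problems. GPT-4 tier LLMs do well on small-scale stories but still struggle with character intentionality and dramatic conflict.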
Details
Motivation: Understanding of how well LLMs generate high-quality stories is limited by weak automatic evaluation methods and by the cost and subjectivity of manual evaluation, so a deeper study is needed.
Method: Uses LLMs to solve narrative planning problems and builds an evaluation benchmark from literature examples, focusing on causal soundness, character intentionality, and dramatic conflict.
Result: GPT-4 tier LLMs can generate causally sound stories at small scales, but planning with character intentionality and dramatic conflict remains hard and requires LLMs trained with reinforcement learning for complex reasoning.
Conclusion: The study exposes the limits of LLMs in story generation and identifies challenges and considerations for applying LLM narrative planning in game environments.
Abstract: Story generation has been a prominent application of Large Language Models (LLMs). However, understanding LLMs' ability to produce high-quality stories remains limited due to challenges in automatic evaluation methods and the high cost and subjectivity of manual evaluation. Computational narratology offers valuable insights into what constitutes a good story, which has been applied in the symbolic narrative planning approach to story generation. This work aims to deepen the understanding of LLMs' story generation capabilities by using them to solve narrative planning problems. We present a benchmark for evaluating LLMs on narrative planning based on literature examples, focusing on causal soundness, character intentionality, and dramatic conflict. Our experiments show that GPT-4 tier LLMs can generate causally sound stories at small scales, but planning with character intentionality and dramatic conflict remains challenging, requiring LLMs trained with reinforcement learning for complex reasoning. The results offer insights on the scale of stories that LLMs can generate while maintaining quality from different aspects. Our findings also highlight interesting problem-solving behaviors and shed light on challenges and considerations for applying LLM narrative planning in game environments.
[115] Q2E: Query-to-Event Decomposition for Zero-Shot Multilingual Text-to-Video Retrieval
Shubhashis Roy Dipta,Francis Ferraro
Main category: cs.CL
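TL;DR: Q2E decomposes queries and draws on the latent knowledge in LLMs and VLMs to improve zero-shot multilingual text-to-video retrieval; it supports multimodal inputs and outperforms existing baselines in experiments.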
Details
Motivation: Improves the identification and retrieval of videos related to complex real-world events by automatically extracting latent knowledge from LLMs and VLMs.
Method: Proposes Q2E, which decomposes a query into events using knowledge from LLMs and VLMs, supports visual and speech-based inputs, and combines them with entropy-based fusion scoring.
Result: Outperforms existing baselines on multiple datasets and metrics; audio information significantly improves retrieval.
Conclusion: Q2E improves text-to-video retrieval and supports multimodal inputs; code and data have been released.
Abstract: Recent approaches have shown impressive proficiency in extracting and leveraging parametric knowledge from Large Language Models (LLMs) and Vision-Language Models (VLMs). In this work, we consider how we can improve the identification and retrieval of videos related to complex real-world events by automatically extracting latent parametric knowledge about those events. We present Q2E: a Query-to-Event decomposition method for zero-shot multilingual text-to-video retrieval, adaptable across datasets, domains, LLMs, or VLMs. Our approach demonstrates that we can enhance the understanding of otherwise overly simplified human queries by decomposing the query using the knowledge embedded in LLMs and VLMs. We additionally show how to apply our approach to both visual and speech-based inputs. To combine this varied multimodal knowledge, we adopt entropy-based fusion scoring for zero-shot fusion. Through evaluations on two diverse datasets and multiple retrieval metrics, we demonstrate that Q2E outperforms several state-of-the-art baselines. Our evaluation also shows that integrating audio information can significantly improve text-to-video retrieval. We have released code and data for future research.
[116] TTT-Bench: A Benchmark for Evaluating Reasoning Ability with Simple and Novel Tic-Tac-Toe-style Games
Prakamya Mishra,Jiang Liu,Jialian Wu,Xiaodong Yu,Zicheng Liu,Emad Barsoum
Main category: cs.CL
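TL;DR: TTT-Bench is a new benchmark that uses simple Tic-Tac-Toe-style games to evaluate the strategic, spatial, and logical reasoning of large reasoning models (LRMs), and finds that these models perform poorly on such simple reasoning tasks.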
Details
Motivation: LRMs perform well in STEM domains, but their reasoning ability in broader task domains remains underexplored.
Method: Proposes TTT-Bench, which generates verifiable problems from four simple two-player Tic-Tac-Toe-style games to evaluate LRM reasoning.
Result: LRMs perform poorly on TTT-Bench, scoring on average 41% and 5% lower than on MATH 500 and AIME 2024 respectively, and struggle with long-term strategic reasoning.
Conclusion: LRM performance on simple reasoning tasks does not match their ability on hard math problems, which exposes limits in their reasoning.
Abstract: Large reasoning models (LRMs) have demonstrated impressive reasoning capabilities across a broad range of tasks including Olympiad-level mathematical problems, indicating evidence of their complex reasoning abilities. While many reasoning benchmarks focus on the STEM domain, the ability of LRMs to reason correctly in broader task domains remains underexplored. In this work, we introduce TTT-Bench, a new benchmark that is designed to evaluate basic strategic, spatial, and logical reasoning abilities in LRMs through a suite of four two-player Tic-Tac-Toe-style games that humans can effortlessly solve from a young age. We propose a simple yet scalable programmatic approach for generating verifiable two-player game problems for TTT-Bench. Although these games are trivial for humans, they require reasoning about the intentions of the opponent, as well as the game board's spatial configurations, to ensure a win. We evaluate a diverse set of state-of-the-art LRMs, and discover that the models that excel at hard math problems frequently fail at these simple reasoning games. Further testing reveals that our evaluated reasoning models score on average 41% and 5% lower on TTT-Bench compared to MATH 500 and AIME 2024, respectively, with larger models achieving higher performance using shorter reasoning traces, where most of the models struggle on long-term strategic reasoning situations on simple and new TTT-Bench tasks.
[117] Classifying Unreliable Narrators with Large Language Models
Anneliese Brei,Katharine Henry,Abhisheik Sharma,Shashank Srivastava,Snigdha Chaturvedi
Main category: cs.CL
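TL;DR: Proposes using computational methods to identify unreliable narrators, builds the TUNa dataset, and analyzes how several LLMs perform on the classification tasks.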
Details
Motivation: Studies how to determine computationally whether a narrator is reliable, bridging literary theory and practical text analysis.
Method: Draws on literary theory to define types of unreliable narrators, builds the TUNa dataset, and tests LLMs in few-shot, fine-tuning, and curriculum learning settings.
Result: The task is very challenging, but LLMs show potential for identifying unreliable narrators.
Conclusion: The dataset and code are released to encourage further research in this area.
Abstract: Often when we interact with a first-person account of events, we consider whether or not the narrator, the primary speaker of the text, is reliable. In this paper, we propose using computational methods to identify unreliable narrators, i.e., those who unintentionally misrepresent information. Borrowing literary theory from narratology to define different types of unreliable narrators based on a variety of textual phenomena, we present TUNa, a human-annotated dataset of narratives from multiple domains, including blog posts, subreddit posts, hotel reviews, and works of literature. We define classification tasks for intra-narrational, inter-narrational, and inter-textual unreliabilities and analyze the performance of popular open-weight and proprietary LLMs for each. We propose learning from literature to perform unreliable narrator classification on real-world text data. To this end, we experiment with few-shot, fine-tuning, and curriculum learning settings. Our results show that this task is very challenging, and there is potential for using LLMs to identify unreliable narrators. We release our expert-annotated dataset and code and invite future research in this area.
[118] ToxSyn-PT: A Large-Scale Synthetic Dataset for Hate Speech Detection in Portuguese
Iago Alves Brito,Julia Soares Dollis,Fernanda Bufon Färber,Diogo Fernandes Costa Silva,Arlindo Rodrigues Galvão Filho
Main category: cs.CL
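TL;DR: ToxSyn-PT is the first large-scale Portuguese corpus for hate-speech classification, covering nine legally protected minority groups with 53,274 synthetic sentences. It is built with a four-stage pipeline, and experiments show strong results on binary and multi-label classification.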
Details
Motivation: Addresses the scarcity of Portuguese hate-speech data, in particular the need for fine-grained classification across minority groups.
Method: A four-stage pipeline: a manually curated seed, few-shot expansion, paraphrase-based augmentation, and enrichment with additional neutral texts.
Result: Models trained on the corpus achieve strong binary and multi-label classification results and generalize across domains.
Conclusion: ToxSyn-PT is a valuable resource for hate-speech detection research in low-resource settings.
Abstract: We present ToxSyn-PT, the first large-scale Portuguese corpus that enables fine-grained hate-speech classification across nine legally protected minority groups. The dataset contains 53,274 synthetic sentences equally distributed between minority groups and toxicity labels. ToxSyn-PT is created through a novel four-stage pipeline: (1) a compact, manually curated seed; (2) few-shot expansion with an instruction-tuned LLM; (3) paraphrase-based augmentation; and (4) enrichment, plus additional neutral texts to curb overfitting to group-specific cues. The resulting corpus is class-balanced, stylistically diverse, and free from the social-media domain that dominates existing Portuguese datasets. Despite domain differences with traditional benchmarks, experiments on both binary and multi-label classification on the corpus yield strong results across five public Portuguese hate-speech datasets, demonstrating robust generalization even across domain boundaries. The dataset is publicly released to advance research on synthetic data and hate-speech detection in low-resource settings.
[119] Do Language Models Have Bayesian Brains? Distinguishing Stochastic and Deterministic Decision Patterns within Large Language Models
Andrea Yaoyun Cui,Pengfei Yu
Main category: cs.CL
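TL;DR: Asks whether language models have Bayesian brains, finds that under certain conditions they make near-deterministic decisions, which challenges the earlier sampling assumption, and proposes a method to distinguish stochastic from deterministic decision patterns.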
Details
Motivation: Examines whether language model decision-making resembles Bayesian inference and whether existing methods infer model priors accurately.
Method: Runs simulated Gibbs sampling experiments to analyze decision patterns under different conditions and proposes a way to distinguish stochastic from deterministic behavior.
Result: Language models can behave deterministically even at a non-zero sampling temperature, so simulated Gibbs sampling may converge to a "false prior."
Conclusion: The paper shows how complex language model decision behavior is, proposes a way to avoid misleading prior inference, and offers a new perspective on large language models.
Abstract: Language models are essentially probability distributions over token sequences. Auto-regressive models generate sentences by iteratively computing and sampling from the distribution of the next token. This iterative sampling introduces stochasticity, leading to the assumption that language models make probabilistic decisions, similar to sampling from unknown distributions. Building on this assumption, prior research has used simulated Gibbs sampling, inspired by experiments designed to elicit human priors, to infer the priors of language models. In this paper, we revisit a critical question: Do language models possess Bayesian brains? Our findings show that under certain conditions, language models can exhibit near-deterministic decision-making, such as producing maximum likelihood estimations, even with a non-zero sampling temperature. This challenges the sampling assumption and undermines previous methods for eliciting human-like priors. Furthermore, we demonstrate that without proper scrutiny, a system with deterministic behavior undergoing simulated Gibbs sampling can converge to a "false prior." To address this, we propose a straightforward approach to distinguish between stochastic and deterministic decision patterns in Gibbs sampling, helping to prevent the inference of misleading language model priors. We experiment on a variety of large language models to identify their decision patterns under various circumstances. Our results provide key insights in understanding decision making of large language models.
[120] ClusterUCB: Efficient Gradient-Based Data Selection for Targeted Fine-Tuning of LLMs
Zige Wang,Qi Zhu,Fei Mi,Minghui Xu,Ruochun Jin,Wenjing Yang
Main category: cs.CL
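TL;DR: Proposes ClusterUCB, an efficient gradient-based data selection framework that uses clustering and a modified UCB algorithm to cut computing cost while matching the performance of the original gradient-based methods.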
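To make the bandit framing in this entry concrete, here is a minimal sketch that treats each cluster as an arm and spends a fixed sampling budget with the textbook UCB1 rule. It is not the paper's modified UCB algorithm, whose details the entry does not give. The names `ucb1_pick` and `allocate_budget`, the exploration constant, and the per-cluster influence scores are all hypothetical; a real pipeline would compute each sample's gradient-based influence only at the moment it is drawn, which is where the savings would come from.

```python
import math
import random


def ucb1_pick(counts, means, t, c=2.0):
    """Return the index of the cluster to sample next (textbook UCB1 rule)."""
    # Standard UCB1 initialization: try every cluster once before comparing bounds.
    for i, n in enumerate(counts):
        if n == 0:
            return i
    scores = [m + math.sqrt(c * math.log(t) / n) for m, n in zip(means, counts)]
    return max(range(len(scores)), key=lambda i: scores[i])


def allocate_budget(clusters, budget, seed=0):
    """Spend `budget` draws across clusters, favouring high observed influence.

    `clusters` is a list of lists of influence scores. They are precomputed
    here purely for illustration (an assumption of this sketch).
    """
    rng = random.Random(seed)
    counts = [0] * len(clusters)
    means = [0.0] * len(clusters)
    for t in range(1, budget + 1):
        i = ucb1_pick(counts, means, t)
        reward = rng.choice(clusters[i])  # stands in for one influence computation
        counts[i] += 1
        means[i] += (reward - means[i]) / counts[i]  # running mean per cluster
    return counts, means


# Hypothetical influence scores for three clusters; cluster 1 is the useful one.
clusters = [[0.1] * 10, [0.9] * 10, [0.2] * 10]
counts, means = allocate_budget(clusters, budget=300)
```

With these toy scores most of the budget flows to the second cluster, while the other two receive only enough draws to confirm that they are less useful.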
Details
Motivation: Conventional gradient-based data selection consumes too much compute during LLM fine-tuning to be practical.
Method: Combines clustering with a modified UCB algorithm, framing inter-cluster data selection as a multi-armed bandit problem and using historical influence information to estimate each cluster's distribution.
Result: Experiments show that ClusterUCB greatly reduces computing consumption while performing comparably to the original gradient-based methods.
Conclusion: ClusterUCB is an efficient and practical data selection framework suited to resource-constrained settings.
Abstract: Gradient-based data influence approximation has been leveraged to select useful data samples in the supervised fine-tuning of large language models. However, the computation of gradients throughout the fine-tuning process requires too many resources to be feasible in practice. In this paper, we propose an efficient gradient-based data selection framework with clustering and a modified Upper Confidence Bound (UCB) algorithm. Based on the intuition that data samples with similar gradient features will have similar influences, we first perform clustering on the training data pool. Then, we frame the inter-cluster data selection as a constrained computing budget allocation problem and consider it a multi-armed bandit problem. A modified UCB algorithm is leveraged to solve this problem. Specifically, during the iterative sampling process, historical data influence information is recorded to directly estimate the distributions of each cluster, and a cold start is adopted to balance exploration and exploitation. Experimental results on various benchmarks show that our proposed framework, ClusterUCB, can achieve comparable results to the original gradient-based data selection methods while greatly reducing computing consumption.
[121] Flick: Few Labels Text Classification using K-Aware Intermediate Learning in Multi-Task Low-Resource Languages
Ali Almutairi,Abdullah Alsuhaibani,Shoaib Jameel,Usman Naseem,Gelareh Mohammadi,Imran Razzak
Main category: cs.CL
TL;DR: Proposes Flick, a method for few-label text classification in low-resource languages that improves model performance by raising pseudo-label quality.
Details
Motivation: Reduces reliance on large labelled datasets and addresses the pseudo-label noise and domain adaptation problems that existing methods face in low-resource languages.
Method: Flick distils high-confidence pseudo-labels from a broad set of initial clusters and adds a pseudo-label refinement component that uses single-cluster cohesion and an adaptive top-k selection mechanism.
Result: Validated on 14 diverse datasets, including low-resource languages such as Arabic and Urdu, where it shows superior performance and adaptability.
Conclusion: By improving pseudo-label quality, Flick markedly improves few-label classification in low-resource languages and offers a new approach for similar problems.
Abstract: Training deep learning networks with minimal supervision has gained significant research attention due to its potential to reduce reliance on extensive labelled data. While self-training methods have proven effective in semi-supervised learning, they remain vulnerable to errors from noisy pseudo labels. Moreover, most recent approaches to the few-label classification problem are either designed for resource-rich languages such as English or involve complex cascading models that are prone to overfitting. To address the persistent challenge of few-label text classification in truly low-resource linguistic contexts, where existing methods often struggle with noisy pseudo-labels and domain adaptation, we propose Flick. Unlike prior methods that rely on generic multi-cluster pseudo-labelling or complex cascading architectures, Flick leverages the fundamental insight that distilling high-confidence pseudo-labels from a broader set of initial clusters can dramatically improve pseudo-label quality, particularly for linguistically diverse, low-resource settings. Flick introduces a novel pseudo-label refinement component, a departure from traditional pseudo-labelling strategies by identifying and leveraging top-performing pseudo-label clusters. This component specifically learns to distil highly reliable pseudo-labels from an initial broad set by focusing on single-cluster cohesion and leveraging an adaptive top-k selection mechanism. This targeted refinement process is crucial for mitigating the propagation of errors inherent in low-resource data, allowing for robust fine-tuning of pre-trained language models with only a handful of true labels. We demonstrate Flick's efficacy across 14 diverse datasets, encompassing challenging low-resource languages such as Arabic, Urdu, and Setswana, alongside English, showcasing its superior performance and adaptability.
[122] "Check My Work?": Measuring Sycophancy in a Simulated Educational Context
Chuck Arvin
Main category: cs.CL
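TL;DR: Examines how user-provided suggestions affect large language models (LLMs) in a simulated educational setting, and finds that model performance depends strongly on query framing and that the models show clear sycophantic behavior.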
Details
Motivation: Aims to expose the performance bias that user suggestions induce in LLMs in educational settings, in particular the potential effect of sycophancy on educational equity.
Method: Tests five LLMs from OpenAI's GPT-4o and GPT-4.1 model classes under five experimental conditions and analyzes how response quality changes.
Result: When the student mentions an incorrect answer, LLM correctness drops by up to 15 percentage points; mentioning the correct answer raises it by the same margin. The bias is stronger in smaller models (up to 30% for GPT-4.1-nano versus 8% for GPT-4o).
Conclusion: Sycophantic behavior in LLMs may widen educational inequality; its mechanism and mitigation need further study.
Abstract: This study examines how user-provided suggestions affect Large Language Models (LLMs) in a simulated educational context, where sycophancy poses significant risks. Testing five different LLMs from the OpenAI GPT-4o and GPT-4.1 model classes across five experimental conditions, we show that response quality varies dramatically based on query framing. In cases where the student mentions an incorrect answer, the LLM correctness can degrade by as much as 15 percentage points, while mentioning the correct answer boosts accuracy by the same margin. Our results also show that this bias is stronger in smaller models, with an effect of up to 30% for the GPT-4.1-nano model, versus 8% for the GPT-4o model. Our analysis of how often LLMs "flip" their answer, and an investigation into token level probabilities, confirm that the models are generally changing their answers to answer choices mentioned by students in line with the sycophancy hypothesis. This sycophantic behavior has important implications for educational equity, as LLMs may accelerate learning for knowledgeable students while the same tools may reinforce misunderstanding for less knowledgeable students. Our results highlight the need to better understand the mechanism, and ways to mitigate, such bias in the educational context.
[123] Scheduled Interleaved Speech-Text Training for Speech-to-Speech Translation with LLMs
Hayato Futami,Emiru Tsunoo,Yosuke Kashiwagi,Yuki Ito,Hassan Shahmohammadi,Siddhant Arora,Shinji Watanabe
Main category: cs.CL
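TL;DR: The paper proposes an LLM-based speech-to-speech translation (S2ST) method that trains on interleaved speech and text units and gradually reduces the proportion of text, addressing the modality adaptation problem. Experiments show the method performs well on languages with limited data.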
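The scheduling idea summarized above can be sketched as a text ratio that decays over training, combined with word-level interleaving. This is only an illustration under stated assumptions: the linear decay, the start and end ratios, the choice to substitute a word's text token for its speech units, and the unit names such as `<u12>` are all hypothetical, not the paper's exact recipe.

```python
import random

def text_ratio(step, total_steps, start=0.9, end=0.0):
    """Share of words rendered as text at a given step (assumed linear decay)."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return start + (end - start) * frac

def interleave(words, speech_units, ratio, rng):
    """For each aligned word, emit its text token with probability `ratio`,
    otherwise emit the discrete speech units aligned to that word."""
    out = []
    for word, units in zip(words, speech_units):
        if rng.random() < ratio:
            out.append(word)
        else:
            out.extend(units)
    return out

# Hypothetical word-aligned example: two words, each with its speech units.
words = ["good", "morning"]
speech_units = [["<u12>", "<u7>"], ["<u3>", "<u44>", "<u9>"]]

rng = random.Random(0)
early = interleave(words, speech_units, text_ratio(0, 1000), rng)    # mostly text
late = interleave(words, speech_units, text_ratio(1000, 1000), rng)  # speech only
print(early, late)
```

At the end of the schedule the ratio reaches zero, so the sequence consists of speech units only, which matches the target format at inference time.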
Details
Motivation: LLMs are typically trained on text data and are hard to adapt directly to the speech modality. The scarcity of speech-to-speech data makes training harder still. Method: A scheduled interleaved training method is proposed: speech and text units are interleaved during training, and the text ratio is gradually lowered to achieve progressive modality adaptation from text to speech. Result: Fine-tuning LLaMA3.2-1B on the CVSS dataset shows that the method consistently improves translation performance, especially for languages with limited training data. Conclusion: Interleaved training effectively addresses modality adaptation in S2ST and yields better translation performance for languages with limited data. Abstract: Speech-to-speech translation (S2ST) has been advanced with large language models (LLMs), which are fine-tuned on discrete speech units. In such approaches, modality adaptation from text to speech has been an issue. LLMs are trained on text-only data, which makes it challenging to adapt them to the speech modality with limited speech-to-speech data. To address the training difficulty, we propose scheduled interleaved speech-text training in this study. We use interleaved speech-text units instead of speech units during training, where aligned text tokens are interleaved at the word level. We gradually decrease the ratio of text as training progresses, to facilitate progressive modality adaptation from text to speech. We conduct experimental evaluations by fine-tuning LLaMA3.2-1B for S2ST on the CVSS dataset. We show that the proposed method consistently improves translation performance, especially for languages with limited training data.
[124] Code Execution as Grounded Supervision for LLM Reasoning
Dongwon Jung,Wenxuan Zhou,Muhao Chen
Main category: cs.CL
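TL;DR: Proposes a method that exploits the determinism of program execution to generate high-quality CoT supervision data, avoiding reliance on human annotation or unreliable LLM-generated CoT, and improving LLM reasoning ability.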
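To make the idea of execution-grounded traces concrete, the sketch below records the local variables at each executed line of a small function using Python's standard `sys.settrace` hook, then renders each step as a sentence. It is an illustration only: the helper names `trace_execution` and `verbalize`, the example function, and the sentence template are assumptions, and the paper's actual extraction and verbalization pipeline may differ.

```python
import sys

def trace_execution(func, *args):
    """Run func(*args) and record (line offset, local variables) per executed line."""
    steps = []
    code = func.__code__

    def tracer(frame, event, arg):
        # Only record line events that belong to the traced function itself.
        if frame.f_code is code and event == "line":
            offset = frame.f_lineno - code.co_firstlineno
            steps.append((offset, dict(frame.f_locals)))
        return tracer

    sys.settrace(tracer)
    try:
        result = func(*args)
    finally:
        sys.settrace(None)  # always remove the hook, even if func raises
    return result, steps

def running_sum(n):
    total = 0
    for i in range(1, n + 1):
        total += i
    return total

def verbalize(steps):
    """Turn each recorded step into a plain sentence (hypothetical template)."""
    return [f"At line +{off}, the variables are {state}." for off, state in steps]

result, steps = trace_execution(running_sum, 3)
for sentence in verbalize(steps):
    print(sentence)
```

Because every recorded state comes from an actual run, each step is verifiable by construction, which is the property the summary contrasts with LLM-generated reasoning.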
Details
Motivation: Existing methods for generating CoT supervision data rely on human annotation or LLM generation, which is costly and unreliable; a scalable and accurate approach is needed. Method: Verifiable, step-by-step reasoning traces are extracted from code execution and converted into natural-language CoT reasoning data. Result: Experiments show that the method effectively improves LLMs' cross-domain reasoning ability and reduces redundant repetition at inference time. Conclusion: The method produces high-quality CoT data, strengthens LLM reasoning, and improves inference efficiency. Abstract: Training large language models (LLMs) with chain-of-thought (CoT) supervision has proven effective for enhancing their reasoning abilities. However, obtaining reliable and accurate reasoning supervision remains a significant challenge. We propose a scalable method for generating a high-quality CoT supervision dataset by leveraging the determinism of program execution. Unlike existing reasoning dataset generation methods that rely on costly human annotations or error-prone LLM-generated CoT, our approach extracts verifiable, step-by-step reasoning traces from code execution and transforms them into natural language CoT reasoning. Experiments on reasoning benchmarks across various domains show that our method effectively equips LLMs with transferable reasoning abilities across diverse tasks. Furthermore, the ablation studies validate that our method produces highly accurate reasoning data and reduces overall token length during inference by reducing meaningless repetition and overthinking.
[125] TableRAG: A Retrieval Augmented Generation Framework for Heterogeneous Document Reasoning
Xiaohan Yu,Pu Jian,Chong Chen
Main category: cs.CL
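TL;DR: TableRAG is a retrieval-augmented generation framework for heterogeneous documents (containing both text and tables) that combines textual understanding with table operations, addressing the table-structure disruption and information loss of existing RAG methods.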
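The table-operation side of such a framework can be illustrated by keeping a table in a relational store and answering a sub-question with SQL instead of flattening the table into text. The sketch below uses Python's standard `sqlite3` module; the `sales` table, its rows, the SQL string, and the helper `run_sql_step` are invented for illustration and are not taken from TableRAG's code.

```python
import sqlite3

# Hypothetical table extracted from a heterogeneous document.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, year INTEGER, revenue REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        ("North", 2023, 120.0),
        ("North", 2024, 150.0),
        ("South", 2023, 90.0),
        ("South", 2024, 80.0),
    ],
)

def run_sql_step(connection, query):
    """Execute one SQL sub-query and return its rows as an intermediate answer."""
    return connection.execute(query).fetchall()

# In a full system an LLM would write this SQL for a decomposed sub-question
# such as "Which region has the highest total revenue?"; here it is hard-coded.
sub_query_sql = (
    "SELECT region, SUM(revenue) AS total FROM sales "
    "GROUP BY region ORDER BY total DESC LIMIT 1"
)
intermediate_answer = run_sql_step(conn, sub_query_sql)
print(intermediate_answer)
```

A global aggregate like this is exactly the kind of query that chunked, flattened tables tend to break, since the relevant rows may be split across chunks.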
Details
Motivation: Existing RAG methods disrupt table structure and lose information when handling heterogeneous documents, which limits multi-hop reasoning. Method: TableRAG uses a four-step iterative framework: context-sensitive query decomposition, text retrieval, SQL programming and execution, and compositional intermediate answer generation. Result: Experiments show that TableRAG outperforms existing baselines on public datasets and on the HeteQA benchmark, setting a new state of the art for heterogeneous document question answering. Conclusion: By unifying text and table processing, TableRAG substantially improves heterogeneous document question answering; the framework is released as open source. Abstract: Retrieval-Augmented Generation (RAG) has demonstrated considerable effectiveness in open-domain question answering. However, when applied to heterogeneous documents, comprising both textual and tabular components, existing RAG approaches exhibit critical limitations. The prevailing practice of flattening tables and chunking strategies disrupts the intrinsic tabular structure, leads to information loss, and undermines the reasoning capabilities of LLMs in multi-hop, global queries. To address these challenges, we propose TableRAG, a hybrid framework that unifies textual understanding and complex manipulations over tabular data. TableRAG iteratively operates in four steps: context-sensitive query decomposition, text retrieval, SQL programming and execution, and compositional intermediate answer generation. We also develop HeteQA, a novel benchmark designed to evaluate multi-hop heterogeneous reasoning capabilities. Experimental results demonstrate that TableRAG consistently outperforms existing baselines on both public datasets and our HeteQA, establishing a new state of the art for heterogeneous document question answering. We release TableRAG at https://github.com/yxh-y/TableRAG/tree/main.
[126] PAG: Multi-Turn Reinforced LLM Self-Correction with Policy as Generative Verifier
Yuhua Jiang,Yuwen Xiong,Yufeng Yuan,Chao Xin,Wenyuan Xu,Yu Yue,Qianchuan Zhao,Lin Yan
Main category: cs.CL
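TL;DR: The PAG framework uses a unified reinforcement learning paradigm in which the LLM alternates between policy and verifier roles to self-correct, improving both reasoning and verification ability.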
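The control flow of selective revision can be sketched independently of any model: generate, verify, and revise only if verification fails. In the sketch below, `generate`, `verify`, and `revise` are hypothetical callables standing in for the same LLM prompted in different roles, and the toy arithmetic functions exist only to make the loop runnable; the RL training itself is not shown.

```python
def solve_with_selective_revision(question, generate, verify, revise, max_turns=2):
    """Generate an answer, then revise it only when verification flags an error."""
    answer = generate(question)
    history = [answer]
    for _ in range(max_turns):
        if verify(question, answer):
            break  # verifier accepts: keep the answer, skip the second attempt
        answer = revise(question, answer)
        history.append(answer)
    return answer, history

# Toy stand-ins (assumptions for illustration): the "question" is a list of
# numbers to add, and the first attempt is deliberately off by one.
def toy_generate(q):
    return sum(q) + 1

def toy_verify(q, a):
    return a == sum(q)

def toy_revise(q, a):
    return a - 1

final, history = solve_with_selective_revision(
    [2, 3], toy_generate, toy_verify, toy_revise
)
print(final, history)
```

When the first attempt already passes verification, the loop exits immediately and `history` holds a single entry, which is the difference from approaches that always produce a second attempt.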
Details
Motivation: Existing methods depend on external verifier models or multi-stage training, which limits scalability; PAG aims to address this through self-verification and selective revision. Method: PAG uses multi-turn reinforcement learning: after generating an answer the model verifies it itself, and revises the answer only when an error is detected. Result: Experiments show that PAG performs well on both reasoning and verification, with its self-verification outperforming self-consistency. Conclusion: Through a unified verify-then-revise mechanism, PAG markedly improves LLM self-correction while strengthening reasoning and verification performance. Abstract: Large Language Models (LLMs) have demonstrated impressive capabilities in complex reasoning tasks, yet they still struggle to reliably verify the correctness of their own outputs. Existing solutions to this verification challenge often depend on separate verifier models or require multi-stage self-correction training pipelines, which limit scalability. In this paper, we propose Policy as Generative Verifier (PAG), a simple and effective framework that empowers LLMs to self-correct by alternating between policy and verifier roles within a unified multi-turn reinforcement learning (RL) paradigm. Distinct from prior approaches that always generate a second attempt regardless of model confidence, PAG introduces a selective revision mechanism: the model revises its answer only when its own generative verification step detects an error. This verify-then-revise workflow not only alleviates model collapse but also jointly enhances both reasoning and verification abilities. Extensive experiments across diverse reasoning benchmarks highlight PAG's dual advancements: as a policy, it enhances direct generation and self-correction accuracy; as a verifier, its self-verification outperforms self-consistency.
[127] Burn After Reading: Do Multimodal Large Language Models Truly Capture Order of Events in Image Sequences?
Yingjin Song,Yupei Du,Denis Paperno,Albert Gatt
Main category: cs.CL
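TL;DR: TempVS is a benchmark focused on the temporal grounding and reasoning abilities of multimodal large language models (MLLMs) in image sequences. It comprises three main tests, and the evaluation shows that MLLMs perform poorly, with a substantial gap to human ability.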
Details
Motivation: To study how MLLMs perform on temporal reasoning and grounding, filling a gap in existing benchmarks. Method: Three tests (event relation inference, sentence ordering, and image ordering) were designed, each with a basic grounding test, and 38 MLLMs were evaluated. Result: MLLMs perform poorly on TempVS, with a substantial gap to human ability. Conclusion: TempVS points to directions for future research; the data and code are publicly available. Abstract: This paper introduces the TempVS benchmark, which focuses on temporal grounding and reasoning capabilities of Multimodal Large Language Models (MLLMs) in image sequences. TempVS consists of three main tests (i.e., event relation inference, sentence ordering and image ordering), each accompanied by a basic grounding test. TempVS requires MLLMs to rely on both visual and linguistic modalities to understand the temporal order of events. We evaluate 38 state-of-the-art MLLMs, demonstrating that models struggle to solve TempVS, with a substantial performance gap compared to human capabilities. We also provide fine-grained insights that suggest promising directions for future research. Our TempVS benchmark data and code are available at https://github.com/yjsong22/TempVS.
[128] Beyond the Battlefield: Framing Analysis of Media Coverage in Conflict Reporting
Avneet Kaur,Arnav Arora
Main category: cs.CL
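TL;DR: Using computational analysis that combines frame semantics and large language models, the study analyzes war and peace journalism framing in news coverage of the Israel-Palestine conflict, revealing a tendency toward war-oriented reporting and biases among media outlets from different regions.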
Details
Motivation: Existing research on conflict framing is mostly qualitative or looks only at surface-level frames, without deeper analysis. This study uses computational means to examine more deeply the framing in news coverage of conflict and its effects. Method: Frame semantics and large language models are used to analyze indicators of war and peace journalism in a news corpus and to compare coverage across US, UK, and Middle Eastern outlets. Result: Coverage leans toward war framing, and outlets from different regions differ substantially in how they portray the assailant and the victims of the conflict, revealing media bias. Conclusion: The study highlights the potential influence of news framing on readers' opinions and calls for more balanced conflict reporting to reduce bias and escalation. Abstract: Framing used by news media, especially in times of conflict, can have substantial impact on readers' opinion, potentially aggravating the conflict itself. Current studies on conflict framing offer limited insight, either because they are qualitative in nature or because they look only at surface-level generic frames without going deeper. In this work, we identify indicators of war and peace journalism, as outlined by prior work in conflict studies, in a corpus of news articles reporting on the Israel-Palestine war. For our analysis, we use computational approaches, using a combination of frame semantics and large language models to identify both communicative framing and its connection to linguistic framing. Our analysis reveals a higher focus on war-based reporting than on peace-based reporting. We also show substantial differences in reporting across the US, UK, and Middle Eastern news outlets in framing who the assailant and victims of the conflict are, surfacing biases within the media.
[129] Fast on the Easy, Deep on the Hard: Efficient Reasoning via Powered Length Penalty
Zehui Ling,Deshu Chen,Hongwei Zhang,Yifeng Jiao,Xin Guo,Yuan Cheng
Main category: cs.CL
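TL;DR: The study proposes a method for improving the reasoning efficiency of large language models (LLMs): an output-length penalty that adapts to problem difficulty shortens outputs on simple problems while preserving accuracy on complex ones.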
Details
Motivation: Existing methods for shortening reasoning outputs apply uniform penalties without considering problem complexity, leading to suboptimal results. This study aims to improve LLM reasoning efficiency while balancing conciseness on simple problems and accuracy on complex ones. Method: The model's reasoning efficiency is managed by dividing the reward function and introducing a new output-length penalty. Result: On GSM8K and MATH500 the method shortens output length while preserving or improving accuracy; on the more demanding AIME2024 dataset accuracy improves. Conclusion: The method markedly improves LLM reasoning efficiency and applies to tasks of varying complexity. Abstract: Large language models (LLMs) have demonstrated significant advancements in reasoning capabilities, performing well on various challenging benchmarks. Techniques like Chain-of-Thought prompting have been introduced to further improve reasoning. However, these approaches frequently generate longer outputs, which in turn increase computational latency. Although some methods use reinforcement learning to shorten reasoning, they often apply uniform penalties without considering the problem's complexity, leading to suboptimal outcomes. In this study, we seek to enhance the efficiency of LLM reasoning by promoting conciseness for simpler problems while preserving sufficient reasoning for more complex ones for accuracy, thus improving the model's overall performance. Specifically, we manage the model's reasoning efficiency by dividing the reward function and including a novel penalty for output length. Our approach has yielded impressive outcomes in benchmark evaluations across three datasets: GSM8K, MATH500, and AIME2024. For the comparatively simpler datasets GSM8K and MATH500, our method has effectively shortened output lengths while preserving or enhancing accuracy. On the more demanding AIME2024 dataset, our approach has resulted in improved accuracy.
[130] Table-Text Alignment: Explaining Claim Verification Against Tables in Scientific Papers
Xanh Ho,Sunisth Kumar,Yun-Ang Wu,Florian Boudin,Atsuhiro Takasu,Akiko Aizawa
Main category: cs.CL
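TL;DR: The paper reframes table-text alignment as an explanation task that requires models to identify the table cells essential for verifying a scientific claim, and builds a dataset with human annotations. Experiments show that adding alignment information improves verification performance, but most LLMs fail to recover the human-annotated rationales.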
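One natural way to score how well a model recovers the annotated cells is set overlap between predicted and gold cell coordinates. The sketch below computes precision, recall, and F1 over `(row, column)` pairs; this choice of metric, the function name `cell_prf`, and the example cells are assumptions for illustration, since the summary above does not state how the paper scores rationales.

```python
def cell_prf(predicted, gold):
    """Precision, recall, and F1 between predicted and gold sets of table cells."""
    predicted, gold = set(predicted), set(gold)
    overlap = len(predicted & gold)
    precision = overlap / len(predicted) if predicted else 0.0
    recall = overlap / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical example: cells are (row, column) indices into the table.
gold_cells = {(1, 2), (3, 2)}            # minimal set highlighted by annotators
model_cells = {(1, 2), (2, 2), (3, 1)}   # cells the model cites as evidence

p, r, f1 = cell_prf(model_cells, gold_cells)
print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
```

A model can predict the correct label while scoring poorly here, which is the gap between label accuracy and faithful reasoning that the entry describes.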
Details
Motivation: Predicting only the label of a scientific claim (supported or refuted) reveals little about a model's reasoning and offers limited interpretability. The study therefore seeks to improve interpretability by identifying the key table cells. Method: The SciTab benchmark is extended with human-annotated cell-level rationales, and a taxonomy for handling ambiguous cases is proposed. Result: Experiments show that (i) incorporating table alignment information improves claim verification performance, and (ii) most LLMs, while often predicting correct labels, fail to recover human-aligned rationales. Conclusion: The study underlines the importance of interpretability in scientific claim verification and exposes the limits of current LLMs in faithful reasoning. Abstract: Scientific claim verification against tables typically requires predicting whether a claim is supported or refuted given a table. However, we argue that predicting the final label alone is insufficient: it reveals little about the model's reasoning and offers limited interpretability. To address this, we reframe table-text alignment as an explanation task, requiring models to identify the table cells essential for claim verification. We build a new dataset by extending the SciTab benchmark with human-annotated cell-level rationales. Annotators verify the claim label and highlight the minimal set of cells needed to support their decision. After the annotation process, we utilize the collected information and propose a taxonomy for handling ambiguous cases. Our experiments show that (i) incorporating table alignment information improves claim verification performance, and (ii) most LLMs, while often predicting correct labels, fail to recover human-aligned rationales, suggesting that their predictions do not stem from faithful reasoning.
[131] Surface Fairness, Deep Bias: A Comparative Study of Bias in Language Models
Aleksandra Sorokovikova,Pavel Chizhov,Iuliia Eremenko,Ivan P. Yamshchikov
Main category: cs.CL
TL;DR: The paper studies bias in large language models (LLMs) and finds that different task designs reveal different degrees of bias.
Details
Motivation: Modern language models inevitably absorb biased content from their training data, which leads to biased outputs. The study explores how these biases manifest. Method: LLM bias is evaluated through three task designs: pre-prompted personae, grading of user answers, and salary negotiation advice. Result: Pre-prompted personae have a negligible and mostly random effect on model scores; the answer-grading task shows clearer bias; bias is most pronounced in the salary negotiation advice task. Conclusion: As memory and personalization features for LLM assistants develop, models may become biased automatically on the basis of users' socio-demographic characteristics, which warrants attention. Abstract: Modern language models are trained on large amounts of data. These data inevitably include controversial and stereotypical content, which contains all sorts of biases related to gender, origin, age, etc. As a result, the models express biased points of view or produce different results based on the assigned personality or the personality of the user. In this paper, we investigate various proxy measures of bias in large language models (LLMs). We find that evaluating models with pre-prompted personae on a multi-subject benchmark (MMLU) leads to negligible and mostly random differences in scores. However, if we reformulate the task and ask a model to grade the user's answer, this shows more significant signs of bias. Finally, if we ask the model for salary negotiation advice, we see pronounced bias in the answers. With the recent trend for LLM assistant memory and personalization, these problems open up from a different angle: modern LLM users do not need to pre-prompt the description of their persona since the model already knows their socio-demographics.
[132] Beyond Single-User Dialogue: Assessing Multi-User Dialogue State Tracking Capabilities of Large Language Models
Sangmin Song,Juhwan Choi,JungMin Yun,YoungBin Kim
Main category: cs.CL
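TL;DR: The paper studies how large language models (LLMs) perform in multi-user dialogue state tracking (DST) and finds that performance drops significantly in multi-user settings, calling for further improvement.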
Details
Motivation: Existing DST benchmarks focus mainly on structured user-agent conversations and fail to reflect the complexity of real multi-user interactions, so the robustness of LLMs in multi-user DST needs to be assessed. Method: An existing DST dataset is extended by generating a second user's utterances based on speech act theory, systematically introducing multi-user dialogue to evaluate LLMs. Result: Compared with single-user DST, LLM performance drops significantly in multi-user settings, showing limited ability to extract and track dialogue states among multiple speakers. Conclusion: The study stresses the need to improve LLMs for multi-user DST scenarios, paving the way for more realistic and robust DST models. Abstract: Large language models (LLMs) have demonstrated remarkable performance in zero-shot dialogue state tracking (DST), reducing the need for task-specific training. However, conventional DST benchmarks primarily focus on structured user-agent conversations, failing to capture the complexities of real-world multi-user interactions. In this study, we assess the robustness of LLMs in multi-user DST while minimizing dataset construction costs. Inspired by recent advances in LLM-based data annotation, we extend an existing DST dataset by generating utterances of a second user based on speech act theory. Our methodology systematically incorporates a second user's utterances into conversations, enabling a controlled evaluation of LLMs in multi-user settings. Experimental results reveal a significant performance drop compared to single-user DST, highlighting the limitations of current LLMs in extracting and tracking dialogue states amidst multiple speakers. Our findings emphasize the need for future research to enhance LLMs for multi-user DST scenarios, paving the way for more realistic and robust DST models.
[133] Reliable Reasoning Path: Distilling Effective Guidance for LLM Reasoning with Knowledge Graphs
Yilin Xiao,Chuang Zhou,Qinggang Zhang,Bo Li,Qing Li,Xiao Huang
Main category: cs.CL
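TL;DR: The RRP framework combines the semantic strengths of LLMs with the structural information of knowledge graphs to generate high-quality reasoning paths, improving LLM performance on complex questions.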
Details
Motivation: To address the weak performance of LLMs on knowledge-intensive tasks caused by missing background knowledge and a tendency to hallucinate. Method: The RRP framework combines the semantic strengths of LLMs with structural information from knowledge graphs, extracts reasoning paths through relation embedding and bidirectional distribution learning, and adds a rethinking module that refines the paths. Result: On two public datasets RRP outperforms existing baselines and can strengthen LLM reasoning in a plug-and-play manner. Conclusion: By generating high-quality reasoning paths, RRP markedly improves LLM reasoning on complex questions. Abstract: Large language models (LLMs) often struggle with knowledge-intensive tasks due to a lack of background knowledge and a tendency to hallucinate. To address these limitations, integrating knowledge graphs (KGs) with LLMs has been intensively studied. Existing KG-enhanced LLMs focus on supplementary factual knowledge, but still struggle with solving complex questions. We argue that refining the relationships among facts and organizing them into a logically consistent reasoning path is as important as the factual knowledge itself. Despite their potential, extracting reliable reasoning paths from KGs poses two challenges: the complexity of graph structures and the existence of multiple generated paths, which make it difficult to distinguish between useful and redundant ones. To tackle these challenges, we propose the RRP framework to mine the knowledge graph, which combines the semantic strengths of LLMs with structural information obtained through relation embedding and bidirectional distribution learning. Additionally, we introduce a rethinking module that evaluates and refines reasoning paths according to their significance. Experimental results on two public datasets show that RRP achieves state-of-the-art performance compared to existing baseline methods. Moreover, RRP can be easily integrated into various LLMs to enhance their reasoning abilities in a plug-and-play manner. By generating high-quality reasoning paths tailored to specific questions, RRP distills effective guidance for LLM reasoning.
[134] Unsupervised Protoform Reconstruction through Parsimonious Rule-guided Heuristics and Evolutionary Search
Promise Dodzi Kpoglu
Main category: cs.CL
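TL;DR: Proposes an unsupervised method that combines data-driven inference with rule-based heuristics to reconstruct protoforms, substantially outperforming existing baselines.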
Details
Motivation: Existing methods rely mainly on probabilistic models and, being predominantly data-driven, cannot make full use of linguistic constraints. Method: Data-driven inference is combined with rule-based heuristics, integrating statistical patterns and linguistic constraints within an evolutionary optimization framework. Result: On the task of reconstructing Latin protoforms, both character-level accuracy and phonological plausibility improve substantially. Conclusion: The hybrid approach outperforms purely data-driven methods for protoform reconstruction, supporting the value of combining statistical and linguistic constraints. Abstract: We propose an unsupervised method for the reconstruction of protoforms, i.e., ancestral word forms from which modern language forms are derived. While prior work has primarily relied on probabilistic models of phonological edits to infer protoforms from cognate sets, such approaches are limited by their predominantly data-driven nature. In contrast, our model integrates data-driven inference with rule-based heuristics within an evolutionary optimization framework. This hybrid approach leverages both statistical patterns and linguistically motivated constraints to guide the reconstruction process. We evaluate our method on the task of reconstructing Latin protoforms using a dataset of cognates from five Romance languages. Experimental results demonstrate substantial improvements over established baselines across both character-level accuracy and phonological plausibility metrics.
[135] SDialog: A Python Toolkit for Synthetic Dialogue Generation and Analysis
Sergio Burdisso,Esaú Villatoro-Tello,Petr Motlicek
Main category: cs.CL
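TL;DR: SDialog is a modular, extensible Python toolkit for generating and analyzing synthetic dialogues, supporting multi-agent simulation and scenario-driven generation.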
Details
Motivation: To provide high-quality, flexible, and reproducible synthetic dialogue data for training, evaluation, and benchmarking. Method: Instruction-tuned large language models (LLMs) are used, with abstractions for personas, orchestration, and scenario management. Result: The toolkit can generate realistic, diverse, and controllable dialogue data and advances the standardization of tools and frameworks for synthetic data generation. Conclusion: SDialog is an important tool for ensuring research reproducibility in a fast-moving research field. Abstract: The advancement of conversational AI systems relies on the availability of high-quality, flexible, and reproducible synthetic dialogues for training, evaluation, and benchmarking. SDialog is a modular, extensible Python toolkit designed to address the challenges of synthetic dialogue generation and analysis. By leveraging instruction-tuned Large Language Models (LLMs), SDialog provides abstractions for personas, orchestration, and scenario management, enabling the creation of realistic, diverse, and controllable conversational data for research and development. SDialog supports workflows such as multi-agent simulation and scenario-driven generation, and represents a step forward in the standardization of tools and frameworks for synthetic data generation, a crucial advancement for ensuring reproducibility in today's fast-evolving research landscape.
[136] NeuralNexus at BEA 2025 Shared Task: Retrieval-Augmented Prompting for Mistake Identification in AI Tutors
Numaan Naeem,Sarfraz Ahmad,Momina Ahsan,Hasan Iqbal
Main category: cs.CL
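TL;DR: The paper presents a system for mistake identification in the BEA 2025 Shared Task. It explores four approaches; the final system combines retrieval-augmented few-shot prompting with LLM reasoning and outperforms the baselines.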
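The retrieve-then-prompt pattern summarized above can be sketched in a few lines: rank labeled examples by similarity to the new tutor response, then place the top matches in the prompt as few-shot demonstrations. Note the swap: the system is described as retrieving semantically similar examples, while this sketch substitutes bag-of-words cosine similarity so that it runs without an embedding model. The example pool, labels, and prompt wording are invented for illustration.

```python
import math
from collections import Counter

def bow(text):
    """Bag-of-words vector (a simple stand-in for a semantic embedding)."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0

def retrieve(query, pool, k=2):
    """Return the k labeled examples most similar to the query response."""
    q = bow(query)
    return sorted(pool, key=lambda ex: cosine(q, bow(ex["response"])), reverse=True)[:k]

def build_prompt(query, examples):
    """Assemble a few-shot prompt from retrieved examples (hypothetical wording)."""
    lines = ["Does the tutor response identify the student's mistake?"]
    for ex in examples:
        lines.append(f"Response: {ex['response']}\nAnswer: {ex['label']}")
    lines.append(f"Response: {query}\nAnswer:")
    return "\n\n".join(lines)

# Hypothetical labeled pool.
pool = [
    {"response": "You added the fractions but forgot a common denominator", "label": "Yes"},
    {"response": "Great job keep going", "label": "No"},
    {"response": "Check the denominator when you add fractions", "label": "Yes"},
]
query = "Remember to find a common denominator before adding fractions"

top = retrieve(query, pool, k=2)
prompt = build_prompt(query, top)
print(prompt)
```

The resulting prompt would then be sent to the LLM, whose output a full system would parse against a fixed schema; that step is omitted here.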
Details
Motivation: To evaluate whether AI tutors can identify mistakes in students' mathematical reasoning, with the aim of improving the quality of pedagogical feedback. Method: (1) an ensemble over multiple pretrained language models; (2) a frozen sentence-transformer with an MLP classifier; (3) a history-aware multi-head attention model; (4) a retrieval-augmented few-shot prompting system (GPT-4o). Result: The final system outperforms all baselines, demonstrating the effectiveness of combining example-driven prompting with LLM reasoning. Conclusion: Combining retrieval-augmented few-shot prompting with LLM reasoning is an effective way to assess pedagogical feedback. Abstract: This paper presents our system for Track 1: Mistake Identification in the BEA 2025 Shared Task on Pedagogical Ability Assessment of AI-powered Tutors. The task involves evaluating whether a tutor's response correctly identifies a mistake in a student's mathematical reasoning. We explore four approaches: (1) an ensemble of machine learning models over pooled token embeddings from multiple pretrained language models (LMs); (2) a frozen sentence-transformer using [CLS] embeddings with an MLP classifier; (3) a history-aware model with multi-head attention between token-level history and response embeddings; and (4) a retrieval-augmented few-shot prompting system with a large language model (LLM), i.e., GPT-4o. Our final system retrieves semantically similar examples, constructs structured prompts, and uses schema-guided output parsing to produce interpretable predictions. It outperforms all baselines, demonstrating the effectiveness of combining example-driven prompting with LLM reasoning for pedagogical feedback assessment. Our code is available at https://github.com/NaumanNaeem/BEA_2025.
[137] Spelling-out is not Straightforward: LLMs' Capability of Tokenization from Token to Characters
Tatsuya Hiraoka,Kentaro Inui
Main category: cs.CL
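TL;DR: LLMs can spell words character by character but struggle with more complex character-level tasks. The study finds that character-level information is incompletely encoded and that the models rely on intermediate layers to reconstruct it.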
Details
Motivation: To investigate how LLMs internally represent and use character-level information during spelling-out. Method: The embedding layer and the intermediate and higher Transformer layers are analyzed using probing classifiers, identification of knowledge neurons, and inspection of attention weights. Result: The embedding layer does not fully encode character-level information; LLMs rely on intermediate layers to reconstruct this knowledge, and spelling behavior shows a breakthrough at specific layers. Conclusion: Character-level processing in LLMs depends on a multi-layer mechanism: the embedding layer carries incomplete information, and intermediate layers play the key role. Abstract: Large language models (LLMs) can spell out tokens character by character with high accuracy, yet they struggle with more complex character-level tasks, such as identifying compositional subcomponents within tokens. In this work, we investigate how LLMs internally represent and utilize character-level information during the spelling-out process. Our analysis reveals that, although spelling out is a simple task for humans, it is not handled in a straightforward manner by LLMs. Specifically, we show that the embedding layer does not fully encode character-level information, particularly beyond the first character. As a result, LLMs rely on intermediate and higher Transformer layers to reconstruct character-level knowledge, where we observe a distinct "breakthrough" in their spelling behavior. We validate this mechanism through three complementary analyses: probing classifiers, identification of knowledge neurons, and inspection of attention weights.
[138] Large Language Models for Detection of Life-Threatening Texts
Thanh Thi Nguyen,Campbell Wilson,Janis Dalins
Main category: cs.CL
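TL;DR: The paper presents an approach to detecting life-threatening language with large language models (LLMs) and compares it with traditional methods. Experiments show that LLMs perform strongly in both balanced and imbalanced data scenarios, with Mistral and Llama-2 performing best.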
Details
Motivation: Detecting life-threatening language is essential for protecting individuals' mental health and preventing potential harm. Method: Three open-source LLMs (Gemma, Mistral, and Llama-2) are fine-tuned and compared with traditional methods such as bag of words, word embedding, topic modeling, and BERT. Result: LLMs outperform traditional methods in both balanced and imbalanced data scenarios, with Mistral and Llama-2 performing best. Upsampling helps traditional methods but has less effect on LLMs. Conclusion: LLMs show great potential for real-world detection of life-threatening language. Abstract: Detecting life-threatening language is essential for safeguarding individuals in distress, promoting mental health and well-being, and preventing potential harm and loss of life. This paper presents an effective approach to identifying life-threatening texts using large language models (LLMs) and compares them with traditional methods such as bag of words, word embedding, topic modeling, and Bidirectional Encoder Representations from Transformers. We fine-tune three open-source LLMs (Gemma, Mistral, and Llama-2), using their 7B parameter variants, on different datasets constructed for class-balanced, imbalanced, and extremely imbalanced scenarios. Experimental results demonstrate the strong performance of LLMs against traditional methods. More specifically, Mistral and Llama-2 models are top performers in both balanced and imbalanced data scenarios while Gemma is slightly behind. We employ the upsampling technique to deal with the imbalanced data scenarios and demonstrate that while this method benefits traditional approaches, it does not have as much impact on LLMs. This study demonstrates the great potential of LLMs for real-world life-threatening language detection problems.
[139] Inferring Adjective Hypernyms with Language Models to Increase the Connectivity of Open English Wordnet
Lorenzo Augello,John P. McCrae
Main category: cs.CL
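TL;DR: The paper explores how to establish hypernymy relations between adjectives, presents a new resource for adjective hypernymy, and shows how fine-tuning large language models (following TaxoLLaMa) can predict this relation.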
Details
Motivation: Open English Wordnet lacks hypernymy links for many adjectives, so the study looks at how to fill this gap. Method: The characteristics of adjective hypernymy are analyzed theoretically, a new resource is developed, and large language models are fine-tuned (following TaxoLLaMa) to predict adjective hypernymy. Result: A resource for adjective hypernymy was developed, and the TaxoLLaMa methodology was shown to be adaptable to this task. Conclusion: Adjective hypernymy can be established effectively by combining theoretical analysis with language models. Abstract: Open English Wordnet is a key resource published in OntoLex-lemon as part of the linguistic linked open data cloud. There are, however, many links missing in the resource, and in this paper, we look at how we can establish hypernymy between adjectives. We present a theoretical discussion of the hypernymy relation and how it differs for adjectives in contrast to nouns and verbs. We develop a new resource for adjective hypernymy and fine-tune large language models to predict adjective hypernymy, showing that the methodology of TaxoLLaMa can be adapted to this task.
[140] PREMISE: Scalable and Strategic Prompt Optimization for Efficient Mathematical Reasoning in Large Models
Ye Yu,Yaoning Yu,Haohan Wang
Main category: cs.CL
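TL;DR: PREMISE is a prompt optimization framework that requires no changes to model weights; by cutting redundant reasoning steps it substantially lowers computational cost while maintaining accuracy.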
Details
Motivation: Large reasoning models (LRMs) perform strongly on mathematical benchmarks, but their lengthy reasoning incurs high computational cost and latency. Method: PREMISE combines trace-level diagnostics with gradient-inspired prompt optimization, balancing reasoning length and answer accuracy through a multi-objective textual search. Result: On GSM8K, SVAMP, and Math500, PREMISE matches or exceeds baseline accuracy (Claude 96%→96%, Gemini 91%→92%) while reducing reasoning tokens by up to 87.5% and cutting cost by 69-82%. Conclusion: Prompt-level optimization is an effective and scalable way to improve LRM inference efficiency without sacrificing reasoning quality. Abstract: Large reasoning models (LRMs) such as Claude 3.7 Sonnet and OpenAI o1 achieve strong performance on mathematical benchmarks using lengthy chain-of-thought (CoT) reasoning, but the resulting traces are often unnecessarily verbose. This inflates token usage and cost, limiting deployment in latency-sensitive or API-constrained settings. We introduce PREMISE (PRompt-based Efficient Mathematical Inference with Strategic Evaluation), a prompt-only framework that reduces reasoning overhead without modifying model weights. PREMISE combines trace-level diagnostics with gradient-inspired prompt optimization to minimize redundant computation while preserving answer accuracy. The approach jointly optimizes brevity and correctness through a multi-objective textual search that balances token length and answer validity. Unlike prior work, PREMISE runs in a single-pass black-box interface, so it can be applied directly to commercial LLMs. On GSM8K, SVAMP, and Math500 we match or exceed baseline accuracy ($96\%\rightarrow96\%$ with Claude, $91\%\rightarrow92\%$ with Gemini) while reducing reasoning tokens by up to $87.5\%$ and cutting dollar cost by $69$--$82\%$. These results show that prompt-level optimization is a practical and scalable path to efficient LRM inference without compromising reasoning quality.
[141] Beyond True or False: Retrieval-Augmented Hierarchical Analysis of Nuanced Claims
Priyanka Kargupta,Runchu Tian,Jiawei Han
Main category: cs.CL
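TL;DR: The ClaimSpect framework uses hierarchical decomposition and retrieval-augmented generation to provide multi-angle validation and perspective analysis for nuanced claims.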
Details
Motivation: To address the problem that claims (such as scientific or political claims) often cannot simply be labeled true or false, by providing a more comprehensive, structured approach to validation. Method: ClaimSpect hierarchically decomposes a claim into aspects and sub-aspects and uses retrieval-augmented generation to extract relevant perspectives from a corpus. Result: ClaimSpect's robustness and accuracy are demonstrated on a dataset of real-world scientific and political claims, where it outperforms multiple baselines. Conclusion: ClaimSpect effectively deconstructs nuanced claims and represents the diverse perspectives within a corpus, giving it practical value. Abstract: Claims made by individuals or entities are oftentimes nuanced and cannot be clearly labeled as entirely "true" or "false", as is frequently the case with scientific and political claims. However, a claim (e.g., "vaccine A is better than vaccine B") can be dissected into its integral aspects and sub-aspects (e.g., efficacy, safety, distribution), which are individually easier to validate. This enables a more comprehensive, structured response that provides a well-rounded perspective on a given problem while also allowing the reader to prioritize specific angles of interest within the claim (e.g., safety towards children). Thus, we propose ClaimSpect, a retrieval-augmented generation-based framework for automatically constructing a hierarchy of aspects typically considered when addressing a claim and enriching them with corpus-specific perspectives. This structure hierarchically partitions an input corpus to retrieve relevant segments, which assist in discovering new sub-aspects. Moreover, these segments enable the discovery of varying perspectives towards an aspect of the claim (e.g., support, neutral, or oppose) and their respective prevalence (e.g., "how many biomedical papers believe vaccine A is more transportable than B?"). We apply ClaimSpect to a wide variety of real-world scientific and political claims featured in our constructed dataset, showcasing its robustness and accuracy in deconstructing a nuanced claim and representing perspectives within a corpus. Through real-world case studies and human evaluation, we validate its effectiveness over multiple baselines.
[142] TaxoAdapt: Aligning LLM-Based Multidimensional Taxonomy Construction to Evolving Research Corpora
Priyanka Kargupta,Nan Zhang,Yunyi Zhang,Rui Zhang,Prasenjit Mitra,Jiawei Han
Main category: cs.CL
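TL;DR: TaxoAdapt is a framework that dynamically adapts multidimensional taxonomies of scientific literature, using iterative hierarchical classification to produce more granular and coherent taxonomies.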
Details
Motivation: Expert-curated taxonomies are time-consuming and expensive; existing automatic methods lack generalizability and adaptability to evolving fields and overlook the multidimensional nature of scientific literature. Method: TaxoAdapt starts from an LLM-generated taxonomy and uses iterative hierarchical classification to expand its width and depth, adapting to the evolution of different scientific fields. Result: On computer science conference corpora TaxoAdapt outperforms baselines, producing taxonomies that are 26.51% more granularity-preserving and 50.41% more coherent. Conclusion: TaxoAdapt captures the evolution of scientific fields and produces more granular and coherent taxonomies, making it suitable for organizing scientific literature along multiple dimensions. Abstract: The rapid evolution of scientific fields introduces challenges in organizing and retrieving scientific literature. While expert-curated taxonomies have traditionally addressed this need, the process is time-consuming and expensive. Furthermore, recent automatic taxonomy construction methods either (1) over-rely on a specific corpus, sacrificing generalizability, or (2) depend heavily on the general knowledge of large language models (LLMs) contained within their pre-training datasets, often overlooking the dynamic nature of evolving scientific domains. Additionally, these approaches fail to account for the multi-faceted nature of scientific literature, where a single research paper may contribute to multiple dimensions (e.g., methodology, new tasks, evaluation metrics, benchmarks). To address these gaps, we propose TaxoAdapt, a framework that dynamically adapts an LLM-generated taxonomy to a given corpus across multiple dimensions. TaxoAdapt performs iterative hierarchical classification, expanding both the taxonomy width and depth based on the corpus's topical distribution. We demonstrate its state-of-the-art performance across a diverse set of computer science conferences over the years to showcase its ability to structure and capture the evolution of scientific fields. As a multidimensional method, TaxoAdapt generates taxonomies that are 26.51% more granularity-preserving and 50.41% more coherent than the most competitive baselines judged by LLMs.
[143] One Tokenizer To Rule Them All: Emergent Language Plasticity via Multilingual Tokenizers
Diana Abagyan,Alejandro R. Salamanca,Andres Felipe Cruz-Salinas,Kris Cao,Hangyu Lin,Acyr Locatelli,Marzieh Fadaee,Ahmet Üstün,Sara Hooker
Main category: cs.CL
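TL;DR: The study examines how low-cost interventions early in training, such as a universal tokenizer design, improve the language plasticity of multilingual LLMs so that they adapt to new languages more efficiently after training.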
Details
Motivation: Training multilingual LLMs is constrained by limited model capacity, scarce high-quality data, and compute budgets, and tokenizers offer insufficient support for new languages. Method: Use a universal tokenizer that covers more languages than the primary pretraining languages, to improve adaptation to new languages after pretraining. Result: Experiments show that the universal tokenizer significantly improves language adaptation, with up to a 20.2% increase in win rates, and yields up to a 5% win-rate gain even on languages unseen by the tokenizer and in pretraining. Conclusion: Universal tokenizer design is a low-cost, effective way to substantially improve a model's adaptation to new languages without noticeably hurting performance on the primary pretraining languages. Abstract: Pretraining massively multilingual Large Language Models (LLMs) for many languages at once is challenging due to limited model capacity, scarce high-quality data, and compute constraints. Moreover, the lack of language coverage of the tokenizer makes it harder to address the gap for new languages purely at the post-training stage. In this work, we study what relatively cheap interventions early on in training improve "language plasticity", or adaptation capabilities of the model post-training to new languages. We focus on tokenizer design and propose using a universal tokenizer that is trained for more languages than the primary pretraining languages to enable efficient adaptation in expanding language coverage after pretraining. Our systematic experiments across diverse groups of languages and different training strategies show that a universal tokenizer enables significantly higher language adaptation, with up to 20.2% increase in win rates compared to tokenizers specific to pretraining languages. Furthermore, a universal tokenizer also leads to better plasticity towards languages that are completely unseen in the tokenizer and pretraining, by up to 5% win rate gain. We achieve this adaptation to an expanded set of languages with minimal compromise in performance on the majority of languages included in pretraining.[144] Different Questions, Different Models: Fine-Grained Evaluation of Uncertainty and Calibration in Clinical QA with LLMs
Alberto Testoni,Iacer Calixto
Main category: cs.CL
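TL;DR: The paper evaluates uncertainty estimation methods for ten open-source LLMs on clinical multiple-choice question answering, finding that lightweight single-pass methods approach the performance of Semantic Entropy and that results vary substantially across specialties and question types.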
Details
Motivation: In high-stakes domains such as clinical decision support, accurate and well-calibrated uncertainty estimates are essential for deploying LLMs. Method: Across two datasets, eleven medical specialties, and six question types, the authors compare standard single-generation and sampling-based methods, and explore lightweight single-pass estimators based on behavioral signals in reasoning traces. Result: The lightweight methods approach the performance of Semantic Entropy, and performance varies substantially across specialties and question types. Conclusion: Model selection should account for both the nature of the question and model-specific strengths; lightweight methods show promise for clinical multiple-choice QA. Abstract: Accurate and well-calibrated uncertainty estimates are essential for deploying large language models (LLMs) in high-stakes domains such as clinical decision support. We present a fine-grained evaluation of uncertainty estimation methods for clinical multiple-choice question answering, covering ten open-source LLMs (general-purpose, biomedical, and reasoning models) across two datasets, eleven medical specialties, and six question types. We compare standard single-generation and sampling-based methods, and present a case study exploring simple, single-pass estimators based on behavioral signals in reasoning traces. These lightweight methods approach the performance of Semantic Entropy while requiring only one generation. Our results reveal substantial variation across specialties and question types, underscoring the importance of selecting models based on both the nature of the question and model-specific strengths.[145] Improving Named Entity Transcription with Contextual LLM-based Revision
Viet Anh Trinh,Xinlu He,Jacob Whitehill
Main category: cs.CL
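TL;DR: Proposes using a large language model (LLM) to revise named-entity errors in ASR predictions, substantially reducing the word error rate (WER) on named entities.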
Details
Motivation: Although ASR systems perform well on general speech, their error rate on named entities remains high; named entities carry key information, so errors propagate to downstream applications. Method: Revise errors in ASR predictions by combining the LLM's reasoning ability with correct named entities drawn from local context (e.g., lecture notes). Result: On the NER-MIT-OpenCourseWare dataset, the method achieves up to a 30% relative WER reduction for named entities. Conclusion: The method effectively improves ASR accuracy on named entities, which matters for downstream applications. Abstract: With recent advances in modeling and the increasing amount of supervised training data, automatic speech recognition (ASR) systems have achieved remarkable performance on general speech. However, the word error rate (WER) of state-of-the-art ASR remains high for named entities. Since named entities are often the most critical keywords, misrecognizing them can affect all downstream applications, especially when the ASR system functions as the front end of a complex system. In this paper, we introduce a large language model (LLM) revision mechanism to revise incorrect named entities in ASR predictions by leveraging the LLM's reasoning ability as well as local context (e.g., lecture notes) containing a set of correct named entities. Finally, we introduce the NER-MIT-OpenCourseWare dataset, containing 45 hours of data from MIT courses for development and testing. On this dataset, our proposed technique achieves up to 30% relative WER reduction for named entities.[146] Mitigating Negative Interference in Multilingual Sequential Knowledge Editing through Null-Space Constraints
Wei Sun,Tingyu Qu,Mingxiao Li,Jesse Davis,Marie-Francine Moens
Main category: cs.CL
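TL;DR: LangEdit is a novel framework that isolates language-specific knowledge updates via orthogonal projection, addressing interference between knowledge updates in multilingual LLMs.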
Details
Motivation: The challenge in updating knowledge in multilingual LLMs is avoiding parameter interference during cross-lingual edits while preserving multilingual generalization. Method: LangEdit uses a null-space constrained framework that projects the parameter updates for each language onto the orthogonal complement of the previously updated subspaces, keeping updates independent. Result: Evaluation across three model architectures, six languages, and four downstream tasks shows that LangEdit effectively reduces parameter interference and outperforms existing methods. Conclusion: LangEdit offers a viable approach to efficient and accurate knowledge updates in multilingual LLMs; the code is open-source. Abstract: Efficiently updating multilingual knowledge in large language models (LLMs), while preserving consistent factual representations across languages, remains a long-standing and unresolved challenge. While deploying separate editing systems for each language might seem viable, this approach incurs substantial costs due to the need to manage multiple models. A more efficient solution involves integrating knowledge updates across all languages into a unified model. However, performing sequential edits across languages often leads to destructive parameter interference, significantly degrading multilingual generalization and the accuracy of injected knowledge. To address this challenge, we propose LangEdit, a novel null-space constrained framework designed to precisely isolate language-specific knowledge updates. The core innovation of LangEdit lies in its ability to project parameter updates for each language onto the orthogonal complement of previously updated subspaces. This approach mathematically guarantees update independence while preserving multilingual generalization capabilities. We conduct a comprehensive evaluation across three model architectures, six languages, and four downstream tasks, demonstrating that LangEdit effectively mitigates parameter interference and outperforms existing state-of-the-art editing methods. Our results highlight its potential for enabling efficient and accurate multilingual knowledge updates in LLMs. The code is available at https://github.com/VRCMF/LangEdit.git.[147] ReCUT: Balancing Reasoning Length and Accuracy in LLMs via Stepwise Trails and Preference Optimization
Zhensheng Jin,Xinze Li,Yifan Ji,Chunyi Peng,Zhenghao Liu,Qi Shi,Yukun Yan,Shuo Wang,Furong Peng,Ge Yu
Main category: cs.CL
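TL;DR: ReCUT is a new method that balances the accuracy and length of reasoning trajectories through stepwise exploration and a long-short switched sampling strategy, reducing reasoning length by roughly 30-50% while maintaining or improving accuracy.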
Details
Motivation: Existing CoT prompting methods suffer from overthinking, which produces lengthy or redundant reasoning traces, and existing remedies are limited by the quality of generated data and are prone to overfitting. Method: ReCUT uses a stepwise exploration mechanism and a long-short switched sampling strategy to generate diverse reasoning paths, trains two specialized models (one optimized for accuracy, one for reasoning length), and integrates them by parameter interpolation. Result: Experiments across multiple math reasoning datasets and backbone models show that ReCUT reduces reasoning length by roughly 30-50% while maintaining or improving accuracy. Conclusion: ReCUT effectively addresses overlong reasoning and balances accuracy against efficiency, offering a new direction for optimizing LLM reasoning. Abstract: Recent advances in Chain-of-Thought (CoT) prompting have substantially improved the reasoning capabilities of Large Language Models (LLMs). However, these methods often suffer from overthinking, leading to unnecessarily lengthy or redundant reasoning traces. Existing approaches attempt to mitigate this issue through curating multiple reasoning chains for training LLMs, but their effectiveness is often constrained by the quality of the generated data and prone to overfitting. To address the challenge, we propose Reasoning Compression ThroUgh Stepwise Trials (ReCUT), a novel method aimed at balancing the accuracy and length of reasoning trajectory. Specifically, ReCUT employs a stepwise exploration mechanism and a long-short switched sampling strategy, enabling LLMs to incrementally generate diverse reasoning paths. These paths are evaluated and used to construct preference pairs to train two specialized models, one optimized for reasoning accuracy and the other for shorter reasoning. A final integrated model is obtained by interpolating the parameters of these two models. Experimental results across multiple math reasoning datasets and backbone models demonstrate that ReCUT significantly reduces reasoning lengths by approximately 30-50%, while maintaining or improving reasoning accuracy compared to various baselines. All code and data will be released via https://github.com/NEUIR/ReCUT.[148] CIIR@LiveRAG 2025: Optimizing Multi-Agent Retrieval Augmented Generation through Self-Training
Alireza Salemi,Mukta Maddipatla,Hamed Zamani
Main category: cs.CL
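TL;DR: mRAG is a multi-agent retrieval-augmented generation framework in which specialized agents handle subtasks such as planning, searching, reasoning, and coordination; it optimizes agent collaboration through self-training with reward-guided trajectory sampling and outperforms conventional RAG baselines in the SIGIR 2025 LiveRAG competition.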
Details
Motivation: Address the limitations of conventional RAG on complex tasks by improving generation through multi-agent collaboration. Method: Use a self-training paradigm with reward-guided trajectory sampling to optimize inter-agent collaboration. Result: mRAG outperforms conventional RAG baselines on DataMorgana-derived datasets, and case studies illustrate its effectiveness. Conclusion: The mRAG framework performs well on complex RAG tasks and has potential for practical use. Abstract: This paper presents mRAG, a multi-agent retrieval-augmented generation (RAG) framework composed of specialized agents for subtasks such as planning, searching, reasoning, and coordination. Our system uses a self-training paradigm with reward-guided trajectory sampling to optimize inter-agent collaboration and enhance response generation. Evaluated on DataMorgana-derived datasets during the SIGIR 2025 LiveRAG competition, mRAG outperforms conventional RAG baselines. We further analyze competition outcomes and showcase the framework's strengths with case studies, demonstrating its efficacy for complex, real-world RAG tasks.[149] Accelerating Diffusion Large Language Models with SlowFast: The Three Golden Principles
Qingyan Wei,Yaojie Zhang,Zhiyuan Liu,Dongrui Liu,Linfeng Zhang
Main category: cs.CL
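TL;DR: SlowFast Sampling is a dynamic sampling strategy that alternates between exploratory and accelerated decoding stages, substantially improving the generation efficiency and flexibility of diffusion language models, and combines with caching to reduce redundant computation.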
Details
Motivation: Existing sampling strategies for diffusion language models behave statically, which limits efficiency and flexibility; a dynamically adaptive method is needed. Method: SlowFast Sampling dynamically switches decoding stages according to three principles (certainty, convergence, and position) and is integrated with dLLM-Cache to reduce redundant computation. Result: Experiments show that SlowFast Sampling achieves up to a 15.63× speedup on LLaDA, and up to 34.22× when combined with caching, with generation quality close to the baseline. Conclusion: SlowFast Sampling shows that a dynamic sampling strategy can unlock the potential of diffusion language models for fast, high-quality generation. Abstract: Diffusion-based language models (dLLMs) have emerged as a promising alternative to traditional autoregressive LLMs by enabling parallel token generation and significantly reducing inference latency. However, existing sampling strategies for dLLMs, such as confidence-based or semi-autoregressive decoding, often suffer from static behavior, leading to suboptimal efficiency and limited flexibility. In this paper, we propose SlowFast Sampling, a novel dynamic sampling strategy that adaptively alternates between exploratory and accelerated decoding stages. Our method is guided by three golden principles: certainty principle, convergence principle, and positional principle, which govern when and where tokens can be confidently and efficiently decoded. We further integrate our strategy with dLLM-Cache to reduce redundant computation. Extensive experiments across benchmarks and models show that SlowFast Sampling achieves up to 15.63× speedup on LLaDA with minimal accuracy drop, and up to 34.22× when combined with caching. Notably, our approach outperforms strong autoregressive baselines like LLaMA3 8B in throughput, demonstrating that well-designed sampling can unlock the full potential of dLLMs for fast and high-quality generation.[150] Analyzing the relationships between pretraining language, phonetic, tonal, and speaker information in self-supervised speech models
Michele Gubian,Ioana Krehan,Oli Liu,James Kirby,Sharon Goldwater
Main category: cs.CL
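TL;DR: The study analyzes how wav2vec2 models trained on four languages encode speech information and finds that the structure of their representations is largely independent of the pretraining language.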
Details
Motivation: Existing analyses focus mainly on English and say little about other languages, so the authors examine the representational properties of wav2vec2 models across multiple languages. Method: Use probing classifiers and geometric analyses to study how the models encode phones, lexical tones, and speaker information. Result: For all pretraining and test languages, the subspaces encoding phones, tones, and speakers are largely orthogonal, and layerwise probing accuracy follows similar patterns, with a small advantage for matched-language probes in the later layers. Conclusion: The structure of wav2vec2 representations is largely independent of the pretraining language. Abstract: Analyses of self-supervised speech models have begun to reveal where and how they represent different types of information. However, almost all analyses have focused on English. Here, we examine how wav2vec2 models trained on four different languages encode both language-matched and non-matched speech. We use probing classifiers and geometric analyses to examine how phones, lexical tones, and speaker information are represented. We show that for all pretraining and test languages, the subspaces encoding phones, tones, and speakers are largely orthogonal, and that layerwise patterns of probing accuracy are similar, with a relatively small advantage for matched-language phone and tone (but not speaker) probes in the later layers. Our findings suggest that the structure of representations learned by wav2vec2 is largely independent of the speech material used during pretraining.[151] Enhancing Medical Dialogue Generation through Knowledge Refinement and Dynamic Prompt Adjustment
Hongda Sun,Jiaren Peng,Wenzhong Yang,Liang He,Bo Du,Rui Yan
Main category: cs.CL
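TL;DR: MedRef is a new medical dialogue system that addresses the shortcomings of existing systems through knowledge refining and dynamic prompt adjustment, substantially improving generation quality and medical entity accuracy.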
Details
Motivation: Existing medical dialogue systems struggle to identify relevant medical knowledge and to generate personalized, accurate responses; MedRef aims to address both problems. Method: Apply a knowledge refining mechanism to filter out irrelevant medical data, design a comprehensive prompt structure, and introduce the Triplet Filter and Demo Selector modules for real-time adaptability. Result: On the MedDG and KaMed benchmarks, MedRef outperforms existing baselines in both generation quality and medical entity accuracy. Conclusion: MedRef is effective and reliable for real-world healthcare applications. Abstract: Medical dialogue systems (MDS) have emerged as crucial online platforms for enabling multi-turn, context-aware conversations with patients. However, existing MDS often struggle to (1) identify relevant medical knowledge and (2) generate personalized, medically accurate responses. To address these challenges, we propose MedRef, a novel MDS that incorporates knowledge refining and dynamic prompt adjustment. First, we employ a knowledge refining mechanism to filter out irrelevant medical data, improving predictions of critical medical entities in responses. Additionally, we design a comprehensive prompt structure that incorporates historical details and evident details. To enable real-time adaptability to diverse patient conditions, we implement two key modules, Triplet Filter and Demo Selector, providing appropriate knowledge and demonstrations equipped in the system prompt. Extensive experiments on MedDG and KaMed benchmarks show that MedRef outperforms state-of-the-art baselines in both generation quality and medical entity accuracy, underscoring its effectiveness and reliability for real-world healthcare applications.[152] Slimming Down LLMs Without Losing Their Minds
Qingda,Mai
Main category: cs.CL
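TL;DR: The paper studies how parameter-efficient fine-tuning methods (LoRA and QLoRA) affect large language model performance, evaluating them on commonsense reasoning, mathematical reasoning, and multi-domain knowledge tasks.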
Details
Motivation: Explore how to adapt large language models efficiently under limited resources to improve performance on specific tasks. Method: Apply LoRA and QLoRA and evaluate the models in three domains using HellaSwag, GSM8K, and MMLU-CS. Result: LoRA-based methods effectively improve task-specific performance while remaining computationally efficient, and performance depends strongly on how well the fine-tuning dataset aligns with the benchmark tasks. Conclusion: The study offers theoretical insight into parameter-efficient mechanisms and practical guidance for developers with limited resources. Abstract: This paper investigates and validates the impact of fine-tuning on large language model performance, focusing on parameter-efficient methods (LoRA and QLoRA). We evaluate model capabilities across three key domains: (1) commonsense reasoning (HellaSwag), (2) mathematical reasoning (GSM8K), and (3) multi-domain knowledge (MMLU-CS). Our findings demonstrate that: (1) LoRA-based methods effectively improve task-specific performance while maintaining computational efficiency, and (2) performance strongly depends on alignment between fine-tuning dataset and benchmark tasks. The study provides both theoretical insights into parameter-efficient mechanisms and practical guidance for developers implementing efficient LLM adaptation with limited resources.[153] Generalization or Hallucination? Understanding Out-of-Context Reasoning in Transformers
Yixiao Huang,Hanlin Zhu,Tianyu Guo,Jiantao Jiao,Somayeh Sojoudi,Michael I. Jordan,Stuart Russell,Song Mei
Main category: cs.CL
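TL;DR: The paper examines the dual behavior of large language models (LLMs) when they acquire new knowledge through fine-tuning: they can generalize from new facts yet are prone to hallucination. It argues that both behaviors stem from a single mechanism, out-of-context reasoning (OCR).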
Details
Motivation: Understand the root cause of the dual behavior (generalization and hallucination) that LLMs exhibit during fine-tuning. Method: Verify the OCR mechanism experimentally, formalize it as a synthetic factual recall task, and analyze the performance of a one-layer single-head attention-only model. Result: OCR is the common source of both generalization and hallucination, and matrix factorization plays a crucial role in how the model learns to associate facts with implications. Conclusion: The work provides a theoretical foundation for understanding OCR and a new lens for analyzing and mitigating undesirable behaviors from knowledge injection. Abstract: Large language models (LLMs) can acquire new knowledge through fine-tuning, but this process exhibits a puzzling duality: models can generalize remarkably from new facts, yet are also prone to hallucinating incorrect information. However, the reasons for this phenomenon remain poorly understood. In this work, we argue that both behaviors stem from a single mechanism known as out-of-context reasoning (OCR): the ability to deduce implications by associating concepts, even those without a causal link. Our experiments across five prominent LLMs confirm that OCR indeed drives both generalization and hallucination, depending on whether the associated concepts are causally related. To build a rigorous theoretical understanding of this phenomenon, we then formalize OCR as a synthetic factual recall task. We empirically show that a one-layer single-head attention-only transformer with factorized output and value matrices can learn to solve this task, while a model with combined weights cannot, highlighting the crucial role of matrix factorization. Our theoretical analysis shows that the OCR capability can be attributed to the implicit bias of gradient descent, which favors solutions that minimize the nuclear norm of the combined output-value matrix. This mathematical structure explains why the model learns to associate facts and implications with high sample efficiency, regardless of whether the correlation is causal or merely spurious. Ultimately, our work provides a theoretical foundation for understanding the OCR phenomenon, offering a new lens for analyzing and mitigating undesirable behaviors from knowledge injection.[154] BioClinical ModernBERT: A State-of-the-Art Long-Context Encoder for Biomedical and Clinical NLP
Thomas Sounack,Joshua Davis,Brigitte Durieux,Antoine Chaffin,Tom J. Pollard,Eric Lehman,Alistair E. W. Johnson,Matthew McDermott,Tristan Naumann,Charlotta Lindvall
Main category: cs.CL
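TL;DR: BioClinical ModernBERT is a domain-adapted encoder for biomedical and clinical NLP that builds on ModernBERT, supports long-context processing, and delivers substantial gains in speed and performance.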
Details
Motivation: Existing encoder models have developed slowly and show limited domain adaptation in biomedical and clinical settings, so a more effective solution is needed. Method: Develop BioClinical ModernBERT through continued pretraining on a large biomedical and clinical corpus (over 53.5B tokens), drawing on 20 datasets from diverse institutions, domains, and geographic regions. Result: It outperforms existing biomedical and clinical encoders on four downstream tasks. Conclusion: BioClinical ModernBERT provides a more effective encoder for biomedical and clinical NLP, and the models and training checkpoints are released to support further research. Abstract: Encoder-based transformer models are central to biomedical and clinical Natural Language Processing (NLP), as their bidirectional self-attention makes them well-suited for efficiently extracting structured information from unstructured text through discriminative tasks. However, encoders have seen slower development compared to decoder models, leading to limited domain adaptation in biomedical and clinical settings. We introduce BioClinical ModernBERT, a domain-adapted encoder that builds on the recent ModernBERT release, incorporating long-context processing and substantial improvements in speed and performance for biomedical and clinical NLP. BioClinical ModernBERT is developed through continued pretraining on the largest biomedical and clinical corpus to date, with over 53.5 billion tokens, and addresses a key limitation of prior clinical encoders by leveraging 20 datasets from diverse institutions, domains, and geographic regions, rather than relying on data from a single source. It outperforms existing biomedical and clinical encoders on four downstream tasks spanning a broad range of use cases. We release both base (150M parameters) and large (396M parameters) versions of BioClinical ModernBERT, along with training checkpoints to support further research.[155] Beyond Gold Standards: Epistemic Ensemble of LLM Judges for Formal Mathematical Reasoning
Lan Zhang,Marco Valentino,Andre Freitas
Main category: cs.CL
TL;DR: The paper proposes a systematic, LLM-based automatic evaluation method (EFG) for assessing autoformalization in advanced formal mathematical reasoning.
Details
Motivation: Current evaluation methods for autoformalization lack granularity and transparency, particularly in complex mathematical domains, where human evaluation is costly and depends on domain expertise. Method: Propose an ensemble of LLM judges (EFG) defined on four criteria: logical preservation (LP), mathematical consistency (MC), formal validity (FV), and formal quality (FQ). Result: Experiments show that the EFG ensemble correlates more strongly with human assessments than a coarse-grained model, especially when assessing formal quality. Conclusion: EFG offers a scalable, interpretable, and reliable approach to evaluating formal mathematical reasoning, with particular promise for fine-grained assessment. Abstract: Autoformalization plays a crucial role in formal mathematical reasoning by enabling the automatic translation of natural language statements into formal languages. While recent advances using large language models (LLMs) have shown promising results, methods for automatically evaluating autoformalization remain underexplored. As one moves to more complex domains (e.g., advanced mathematics), human evaluation requires significant time and domain expertise, especially as the complexity of the underlying statements and background knowledge increases. LLM-as-a-judge presents a promising approach for automating such evaluation. However, existing methods typically employ coarse-grained and generic evaluation criteria, which limit their effectiveness for advanced formal mathematical reasoning, where quality hinges on nuanced, multi-granular dimensions. In this work, we take a step toward addressing this gap by introducing a systematic, automatic method to evaluate autoformalization tasks. The proposed method is based on an epistemically and formally grounded ensemble (EFG) of LLM judges, defined on criteria encompassing logical preservation (LP), mathematical consistency (MC), formal validity (FV), and formal quality (FQ), resulting in a transparent assessment that accounts for different contributing factors. We validate the proposed framework to serve as a proxy for autoformalization assessment within the domain of formal mathematics. Overall, our experiments demonstrate that the EFG ensemble of LLM judges is a suitable emerging proxy for evaluation, more strongly correlating with human assessments than a coarse-grained model, especially when assessing formal qualities.
These findings suggest that LLM-as-judges, especially when guided by a well-defined set of atomic properties, could offer a scalable, interpretable, and reliable support for evaluating formal mathematical reasoning.[156] Magistral
Mistral-AI,Abhinav Rastogi,Albert Q. Jiang,Andy Lo,Gabrielle Berrada,Guillaume Lample,Jason Rute,Joep Barmentlo,Karmesh Yadav,Kartik Khandelwal,Khyathi Raghavi Chandu,Léonard Blier,Lucile Saulnier,Matthieu Dinot,Maxime Darrin,Neha Gupta,Roman Soletskyi,Sagar Vaze,Teven Le Scao,Yihan Wang,Adam Yang,Alexander H. Liu,Alexandre Sablayrolles,Amélie Héliou,Amélie Martin,Andy Ehrenberg,Anmol Agarwal,Antoine Roux,Arthur Darcet,Arthur Mensch,Baptiste Bout,Baptiste Rozière,Baudouin De Monicault,Chris Bamford,Christian Wallenwein,Christophe Renaudin,Clémence Lanfranchi,Darius Dabert,Devon Mizelle,Diego de las Casas,Elliot Chane-Sane,Emilien Fugier,Emma Bou Hanna,Gauthier Delerce,Gauthier Guinet,Georgii Novikov,Guillaume Martin,Himanshu Jaju,Jan Ludziejewski,Jean-Hadrien Chabran,Jean-Malo Delignon,Joachim Studnia,Jonas Amar,Josselin Somerville Roberts,Julien Denize,Karan Saxena,Kush Jain,Lingxiao Zhao,Louis Martin,Luyu Gao,Lélio Renard Lavaud,Marie Pellat,Mathilde Guillaumin,Mathis Felardos,Maximilian Augustin,Mickaël Seznec,Nikhil Raghuraman,Olivier Duchenne,Patricia Wang,Patrick von Platen,Patryk Saffer,Paul Jacob,Paul Wambergue,Paula Kurylowicz,Pavankumar Reddy Muddireddy,Philomène Chagniot,Pierre Stock,Pravesh Agrawal,Romain Sauvestre,Rémi Delacourt,Sanchit Gandhi,Sandeep Subramanian,Shashwat Dalal,Siddharth Gandhi,Soham Ghosh,Srijan Mishra,Sumukh Aithal,Szymon Antoniak,Thibault Schueller,Thibaut Lavril,Thomas Robert,Thomas Wang,Timothée Lacroix,Valeriia Nemychnikova,Victor Paltz,Virgile Richard,Wen-Ding Li,William Marshall,Xuanyu Zhang,Yunhao Tang
Main category: cs.CL
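TL;DR: Magistral is Mistral's first reasoning model, built with an in-house reinforcement learning (RL) pipeline; the work explores the limits of pure RL training of LLMs and shows how to force the model's reasoning language.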
Details
Motivation: Study the potential of pure RL training of LLMs without relying on existing implementations or RL traces distilled from prior models. Method: Take a ground-up approach that relies solely on in-house models and infrastructure, and present a simple method to force the model's reasoning language. Result: RL on text data alone maintains or improves multimodal understanding, instruction following, and function calling. Conclusion: Magistral Medium and Magistral Small demonstrate the viability of pure RL training, and Magistral Small is open-sourced under Apache 2.0. Abstract: We introduce Magistral, Mistral's first reasoning model and our own scalable reinforcement learning (RL) pipeline. Instead of relying on existing implementations and RL traces distilled from prior models, we follow a ground-up approach, relying solely on our own models and infrastructure. Notably, we demonstrate a stack that enabled us to explore the limits of pure RL training of LLMs, present a simple method to force the reasoning language of the model, and show that RL on text data alone maintains most of the initial checkpoint's capabilities. We find that RL on text maintains or improves multimodal understanding, instruction following and function calling. We present Magistral Medium, trained for reasoning on top of Mistral Medium 3 with RL alone, and we open-source Magistral Small (Apache 2.0) which further includes cold-start data from Magistral Medium.[157] Decomposing MLP Activations into Interpretable Features via Semi-Nonnegative Matrix Factorization
Or Shafran,Atticus Geiger,Mor Geva
Main category: cs.CL
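TL;DR: The paper proposes a new method based on semi-nonnegative matrix factorization (SNMF) that decomposes MLP activations to identify interpretable features, outperforming existing methods.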
Details
Motivation: Existing methods such as sparse autoencoders (SAEs) perform poorly in causal evaluations and lack intrinsic interpretability, so a more direct approach is needed. Method: Decompose MLP activations directly with SNMF, learning features that are sparse linear combinations of neurons and mapping them to their activating inputs. Result: Experiments on Llama 3.1, Gemma 2, and GPT-2 show that SNMF features outperform SAEs and a supervised baseline on causal steering while aligning with human-interpretable concepts. Conclusion: SNMF is a simple and effective tool for identifying interpretable features and concept representations in LLMs. Abstract: A central goal for mechanistic interpretability has been to identify the right units of analysis in large language models (LLMs) that causally explain their outputs. While early work focused on individual neurons, evidence that neurons often encode multiple concepts has motivated a shift toward analyzing directions in activation space. A key question is how to find directions that capture interpretable features in an unsupervised manner. Current methods rely on dictionary learning with sparse autoencoders (SAEs), commonly trained over residual stream activations to learn directions from scratch. However, SAEs often struggle in causal evaluations and lack intrinsic interpretability, as their learning is not explicitly tied to the computations of the model. Here, we tackle these limitations by directly decomposing MLP activations with semi-nonnegative matrix factorization (SNMF), such that the learned features are (a) sparse linear combinations of co-activated neurons, and (b) mapped to their activating inputs, making them directly interpretable. Experiments on Llama 3.1, Gemma 2 and GPT-2 show that SNMF-derived features outperform SAEs and a strong supervised baseline (difference-in-means) on causal steering, while aligning with human-interpretable concepts. Further analysis reveals that specific neuron combinations are reused across semantically-related features, exposing a hierarchical structure in the MLP's activation space. Together, these results position SNMF as a simple and effective tool for identifying interpretable features and dissecting concept representations in LLMs.[158] Dynamic Epistemic Friction in Dialogue
Timothy Obiso,Kenneth Lai,Abhijnan Nath,Nikhil Krishnaswamy,James Pustejovsky
Main category: cs.CL
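TL;DR: The paper examines "epistemic friction" in aligning large language models (LLMs) with human preferences, proposes a model of dynamic epistemic friction, and analyzes its ability to predict belief updates in dialogue within the framework of Dynamic Epistemic Logic.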
Details
Motivation: Existing LLM alignment methods neglect epistemic friction (resistance to updating beliefs in response to new information), which limits their effectiveness in human-AI collaboration. Method: Define dynamic epistemic friction, situate it within the framework of Dynamic Epistemic Logic, and analyze its predictive power on a collaborative task. Result: The model effectively predicts belief updates in dialogues and points toward more sophisticated belief alignment for complex dialogue scenarios. Conclusion: The dynamic epistemic friction model offers a new perspective on belief alignment for LLMs and may improve their adaptability in complex dialogues. Abstract: Recent developments in aligning Large Language Models (LLMs) with human preferences have significantly enhanced their utility in human-AI collaborative scenarios. However, such approaches often neglect the critical role of "epistemic friction," or the inherent resistance encountered when updating beliefs in response to new, conflicting, or ambiguous information. In this paper, we define dynamic epistemic friction as the resistance to epistemic integration, characterized by the misalignment between an agent's current belief state and new propositions supported by external evidence. We position this within the framework of Dynamic Epistemic Logic (Van Benthem and Pacuit, 2011), where friction emerges as nontrivial belief-revision during the interaction. We then present analyses from a situated collaborative task that demonstrate how this model of epistemic friction can effectively predict belief updates in dialogues, and we subsequently discuss how the model of belief alignment as a measure of epistemic resistance or friction can naturally be made more sophisticated to accommodate the complexities of real-world dialogue scenarios.[159] Domain2Vec: Vectorizing Datasets to Find the Optimal Data Mixture without Training
Mozhi Zhang,Howe Tissue,Lu Wang,Xipeng Qiu
Main category: cs.CL
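TL;DR: Domain2Vec is a new method that decomposes datasets into linear combinations of meta-domains to optimize the data mixture for language model pretraining, substantially reducing computational overhead and improving performance.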
Details
Motivation: Existing methods for optimizing the data mixture in language model pretraining are inefficient and computationally expensive, so a more efficient approach is needed. Method: Domain2Vec decomposes each dataset into a linear combination of meta-domains to produce a domain vector, then optimizes the data mixture under the Distribution Alignment Assumption (DA²). Result: Experiments show that Domain2Vec reaches the same validation loss with only 51.5% of the computation and, under an equivalent compute budget, improves downstream task performance by 2.83% on average. Conclusion: Domain2Vec offers an efficient, scalable way to optimize the data mixture for language model pretraining. Abstract: We introduce Domain2Vec, a novel approach that decomposes any dataset into a linear combination of several meta-domains, a new concept designed to capture the key underlying features of datasets. Domain2Vec maintains a vocabulary of meta-domains and uses a classifier to decompose any given dataset into a domain vector that corresponds to a distribution over this vocabulary. These domain vectors enable the identification of the optimal data mixture for language model (LM) pretraining in a training-free manner under the Distribution Alignment Assumption (DA²), which suggests that when the data distributions of the training set and the validation set are better aligned, a lower validation loss is achieved. Moreover, Domain2Vec can be seamlessly integrated into previous works to model the relationship between domain vectors and LM performance, greatly enhancing the efficiency and scalability of previous methods. Extensive experiments demonstrate that Domain2Vec helps find the data mixture that enhances downstream task performance with minimal computational overhead. Specifically, Domain2Vec achieves the same validation loss on Pile-CC using only 51.5% of the computation required when training on the original mixture of The Pile dataset. Under equivalent compute budget, Domain2Vec improves downstream performance by an average of 2.83%.[160] ChineseHarm-Bench: A Chinese Harmful Content Detection Benchmark
Kangwei Liu,Siyuan Cheng,Bozhong Tian,Xiaozhuan Liang,Yuyang Yin,Meng Han,Ningyu Zhang,Bryan Hooi,Xi Chen,Shumin Deng
Main category: cs.CL
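TL;DR: The paper presents a benchmark dataset for Chinese harmful content detection and a knowledge-augmented baseline, addressing the scarcity of Chinese-language resources.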
Details
Motivation: Existing resources for harmful content detection focus mainly on English, while Chinese datasets are scarce and limited in scope, so a comprehensive Chinese benchmark is needed. Method: Build a professionally annotated Chinese dataset covering six categories of harmful content, and propose a knowledge-augmented baseline that combines human-annotated knowledge rules with implicit knowledge from large language models. Result: The method enables smaller models to achieve performance comparable to state-of-the-art LLMs. Conclusion: The work provides important resources and methods for Chinese harmful content detection and advances the field. Abstract: Large language models (LLMs) have been increasingly applied to automated harmful content detection tasks, assisting moderators in identifying policy violations and improving the overall efficiency and accuracy of content review. However, existing resources for harmful content detection are predominantly focused on English, with Chinese datasets remaining scarce and often limited in scope. We present a comprehensive, professionally annotated benchmark for Chinese content harm detection, which covers six representative categories and is constructed entirely from real-world data. Our annotation process further yields a knowledge rule base that provides explicit expert knowledge to assist LLMs in Chinese harmful content detection. In addition, we propose a knowledge-augmented baseline that integrates both human-annotated knowledge rules and implicit knowledge from large language models, enabling smaller models to achieve performance comparable to state-of-the-art LLMs. Code and data are available at https://github.com/zjunlp/ChineseHarm-bench.[161] AutoMind: Adaptive Knowledgeable Agent for Automated Data Science
Yixin Ou,Yujie Luo,Jingsheng Zheng,Lanning Wei,Shuofei Qiao,Jintian Zhang,Da Zheng,Huajun Chen,Ningyu Zhang
Main category: cs.CL
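TL;DR: AutoMind is an adaptive, knowledgeable LLM-agent framework that improves the automation of data science tasks through an expert knowledge base, a tree search algorithm, and a self-adaptive coding strategy, outperforming existing methods.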
Details
Motivation: Existing LLM-driven data science agents depend on fixed workflows and coding strategies, so they suit only simple tasks and cannot handle complex, innovative problems. Method: AutoMind introduces three key techniques: an expert knowledge base, a tree search algorithm, and a self-adaptive coding strategy. Result: On automated data science benchmarks, AutoMind outperforms existing methods and is both efficient and robust. Conclusion: AutoMind is an efficient and robust step toward fully automated data science. Abstract: Large Language Model (LLM) agents have shown great potential in addressing real-world data science problems. LLM-driven data science agents promise to automate the entire machine learning pipeline, yet their real-world effectiveness remains limited. Existing frameworks depend on rigid, pre-defined workflows and inflexible coding strategies; consequently, they excel only on relatively simple, classical problems and fail to capture the empirical expertise that human practitioners bring to complex, innovative tasks. In this work, we introduce AutoMind, an adaptive, knowledgeable LLM-agent framework that overcomes these deficiencies through three key advances: (1) a curated expert knowledge base that grounds the agent in domain expert knowledge, (2) an agentic knowledgeable tree search algorithm that strategically explores possible solutions, and (3) a self-adaptive coding strategy that dynamically tailors code generation to task complexity. Evaluations on two automated data science benchmarks demonstrate that AutoMind delivers superior performance versus state-of-the-art baselines. Additional analyses confirm favorable effectiveness, efficiency, and qualitative solution quality, highlighting AutoMind as an efficient and robust step toward fully automated data science.[162] How Well Can Reasoning Models Identify and Recover from Unhelpful Thoughts?
Sohee Yang,Sang-Woo Lee,Nora Kassner,Daniela Gottesman,Sebastian Riedel,Mor Geva
Main category: cs.CL
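TL;DR: The study finds that reasoning models can identify unhelpful thoughts but struggle to recover from them, with larger models performing worse, and that their self-evaluation abilities need improvement.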