cs.CV

[1] ReStNet: A Reusable & Stitchable Network for Dynamic Adaptation on IoT Devices

Maoyu Wang,Yao Lu,Jiaqi Nie,Zeyu Wang,Yun Lin,Qi Xuan,Guan Gui

Main category: cs.CV

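TL;DR: ReStNet is a reusable and stitchable network that dynamically stitches two pre-trained models together to fit the resource constraints of different devices, addressing the inflexibility of traditional compression methods.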

Details

Motivation: Devices differ in computational and memory resources, so a single pre-trained model cannot fit every platform, and traditional compression methods lack flexibility.

Method: ReStNet selects the stitching point by computing layer-wise similarity with Centered Kernel Alignment (CKA), keeps the early layers of the larger model and the deeper layers of the smaller one, fine-tunes only the stitching layer, and supports both homogeneous and heterogeneous stitching.

Result: Experiments show that ReStNet flexibly trades accuracy against efficiency at runtime while significantly reducing training cost.

Conclusion: ReStNet offers an efficient and flexible way to adapt dynamically to resource constraints.

Abstract: With the rapid development of deep learning, a growing number of pre-trained models have become publicly available. However, deploying these fixed models in real-world IoT applications is challenging because different devices possess heterogeneous computational and memory resources, making it impossible to deploy a single model across all platforms. Although traditional compression methods, such as pruning, quantization, and knowledge distillation, can improve efficiency, they become inflexible once applied and cannot adapt to changing resource constraints. To address these issues, we propose ReStNet, a Reusable and Stitchable Network that dynamically constructs a hybrid network by stitching two pre-trained models together. Implementing ReStNet requires addressing several key challenges, including how to select the optimal stitching points, determine the stitching order of the two pre-trained models, and choose an effective fine-tuning strategy. To systematically address these challenges and adapt to varying resource constraints, ReStNet determines the stitching point by calculating layer-wise similarity via Centered Kernel Alignment (CKA). It then constructs the hybrid model by retaining early layers from a larger-capacity model and appending deeper layers from a smaller one. To facilitate efficient deployment, only the stitching layer is fine-tuned. This design enables rapid adaptation to changing budgets while fully leveraging available resources. Moreover, ReStNet supports both homogeneous (CNN-CNN, Transformer-Transformer) and heterogeneous (CNN-Transformer) stitching, allowing different model families to be combined flexibly. Extensive experiments on multiple benchmarks demonstrate that ReStNet achieves flexible accuracy-efficiency trade-offs at runtime while significantly reducing training cost.
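The CKA-based choice of stitching point in the ReStNet abstract can be illustrated with a small sketch. It computes linear CKA between layer activations and picks the most similar layer pair. The abstract does not give the selection rule or the CKA variant, so the linear form, the "highest score wins" rule, and the names `linear_cka` and `best_stitch_point` are assumptions for illustration, not the authors' implementation.

```python
import math


def center_columns(m):
    """Subtract the per-feature mean from a samples x features matrix."""
    n = len(m)
    means = [sum(col) / n for col in zip(*m)]
    return [[v - mu for v, mu in zip(row, means)] for row in m]


def cross_gram_sq_norm(a, b):
    """Squared Frobenius norm of a^T b for two samples x features matrices."""
    total = 0.0
    for col_a in zip(*a):
        for col_b in zip(*b):
            dot = sum(x * y for x, y in zip(col_a, col_b))
            total += dot * dot
    return total


def linear_cka(x, y):
    """Linear CKA between two activation matrices over the same samples."""
    x, y = center_columns(x), center_columns(y)
    num = cross_gram_sq_norm(y, x)
    den = math.sqrt(cross_gram_sq_norm(x, x) * cross_gram_sq_norm(y, y))
    return num / den if den else 0.0


def best_stitch_point(layers_a, layers_b):
    """Return (i, j, score) for the most similar layer pair (assumed rule)."""
    best = (0, 0, -1.0)
    for i, feats_a in enumerate(layers_a):
        for j, feats_b in enumerate(layers_b):
            score = linear_cka(feats_a, feats_b)
            if score > best[2]:
                best = (i, j, score)
    return best
```

Linear CKA is invariant to isotropic scaling of the features, which is one reason it is convenient for comparing layers of differently sized models. The layer widths may differ as long as both matrices are computed on the same input samples.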

[2] Enhancing the Safety of Medical Vision-Language Models by Synthetic Demonstrations

Zhiyu Xue,Reza Abbasi-Asl,Ramtin Pedarsani

Main category: cs.CV

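TL;DR: Proposes a new inference-time defense strategy that protects generative medical vision-language models (Med-VLMs) against harmful queries while avoiding the performance degradation caused by over-defense.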

Details

Motivation: Med-VLMs that generate medical reports have security vulnerabilities and must reject harmful queries, but existing approaches can over-defend and hurt performance on benign queries.

Method: An inference-time defense based on synthetic clinical demonstrations, evaluated on diverse medical imaging datasets, plus a mixed demonstration strategy that balances safety and performance.

Result: Experiments show that the strategy defends effectively against visual and textual jailbreak attacks, and that a larger demonstration budget alleviates over-defense.

Conclusion: The mixed demonstration strategy balances safety and performance under few-shot demonstration budget constraints.

Abstract: Generative medical vision-language models (Med-VLMs) are primarily designed to generate complex textual information (e.g., diagnostic reports) from multimodal inputs including vision modality (e.g., medical images) and language modality (e.g., clinical queries). However, their security vulnerabilities remain underexplored. Med-VLMs should be capable of rejecting harmful queries, such as "Provide detailed instructions for using this CT scan for insurance fraud". At the same time, addressing security concerns introduces the risk of over-defense, where safety-enhancing mechanisms may degrade general performance, causing Med-VLMs to reject benign clinical queries. In this paper, we propose a novel inference-time defense strategy to mitigate harmful queries, enabling defense against visual and textual jailbreak attacks. Using diverse medical imaging datasets collected from nine modalities, we demonstrate that our defense strategy based on synthetic clinical demonstrations enhances model safety without significantly compromising performance. Additionally, we find that increasing the demonstration budget alleviates the over-defense issue. We then introduce a mixed demonstration strategy as a trade-off solution for balancing security and performance under few-shot demonstration budget constraints.

[3] BG-HOP: A Bimanual Generative Hand-Object Prior

Sriram Krishna,Sravan Chittupalli,Sungjae Park

Main category: cs.CV

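TL;DR: BG-HOP is a generative prior for modeling bimanual hand-object interactions in 3D. It addresses data scarcity by extending single-hand generative priors, and experiments show that it can generate bimanual interactions and synthesize grasps.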

Details

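Motivation: Address the limited availability of bimanual hand-object interaction data by extending single-hand generative priors to model more complex interactions.

Method: Extend existing single-hand generative priors to model the joint distribution of both hands and the object.

Result: The model can generate bimanual interactions and synthesize grasps for given objects.

Conclusion: BG-HOP shows the potential of modeling bimanual hand-object interactions; code and models are publicly available.

Abstract: In this work, we present BG-HOP, a generative prior that seeks to model bimanual hand-object interactions in 3D. We address the challenge of limited bimanual interaction data by extending existing single-hand generative priors, demonstrating preliminary results in capturing the joint distribution of hands and objects. Our experiments showcase the model's capability to generate bimanual interactions and synthesize grasps for given objects. We make code and models publicly available.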

[4] Segment Any Architectural Facades (SAAF): An automatic segmentation model for building facades, walls and windows based on multimodal semantics guidance

Peilin Li,Jun Yin,Jing Zhong,Ran Luo,Pengyu Zeng,Miao Zhang

Main category: cs.CV

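TL;DR: SAAF uses multimodal semantic guidance to segment walls and windows on building facades automatically. It draws on NLP techniques to improve semantic understanding, uses an end-to-end training framework to improve automation and robustness, and outperforms existing methods in experiments.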

Details

Motivation: Improve the efficiency of building information modeling and computer-aided design, and address accuracy and generalization issues in wall and window segmentation.

Method: A multimodal semantic collaborative feature extraction mechanism that fuses text descriptions with image features, and an end-to-end training framework that reduces manual intervention.

Result: On multiple facade datasets, SAAF outperforms existing methods in mIoU, showing high-precision segmentation and good generalization.

Conclusion: SAAF provides a reference for architectural computer vision and explores new paths for applying multimodal learning in architecture.

Abstract: In the context of the digital development of architecture, the automatic segmentation of walls and windows is a key step in improving the efficiency of building information models and computer-aided design. This study proposes an automatic segmentation model for building facade walls and windows based on multimodal semantic guidance, called Segment Any Architectural Facades (SAAF). First, SAAF has a multimodal semantic collaborative feature extraction mechanism. By combining natural language processing technology, it can fuse the semantic information in text descriptions with image features, enhancing the semantic understanding of building facade components. Second, we developed an end-to-end training framework that enables the model to autonomously learn the mapping relationship from text descriptions to image segmentation, reducing the influence of manual intervention on the segmentation results and improving the automation and robustness of the model. Finally, we conducted extensive experiments on multiple facade datasets. The segmentation results of SAAF outperformed existing methods in the mIoU metric, indicating that the SAAF model can maintain high-precision segmentation ability when faced with diverse datasets. Our model has made progress in improving the accuracy and generalization ability of the wall and window segmentation task. It is expected to provide a reference for the development of architectural computer vision technology and to explore new ideas and technical paths for the application of multimodal learning in the architectural field.
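The mIoU metric used to compare SAAF with prior methods can be sketched as follows. This is a generic mean intersection-over-union on flattened label maps, not code from the paper. The class ids (0 = background, 1 = wall, 2 = window) and the choice to skip classes absent from both prediction and target are assumptions.

```python
def mean_iou(pred, target, num_classes):
    """Mean IoU over classes that appear in the prediction or the target."""
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union:  # skip classes absent from both maps
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0


# Hypothetical flattened label maps: 0 = background, 1 = wall, 2 = window.
pred = [0, 0, 1, 1, 2, 2]
target = [0, 1, 1, 1, 2, 0]
score = mean_iou(pred, target, num_classes=3)  # per-class IoUs are 1/3, 2/3, 1/2
```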

[5] VersaVid-R1: A Versatile Video Understanding and Reasoning Model from Question Answering to Captioning Tasks

Xinlong Chen,Yuanxing Zhang,Yushuo Guan,Bohan Zeng,Yang Shi,Sihan Yang,Pengfei Wan,Qiang Liu,Liang Wang,Tieniu Tan

Main category: cs.CV

TL;DR: The paper introduces two new datasets, DarkEventInfer and MixVidQA, to improve video reasoning, and develops VersaVid-R1, which significantly outperforms existing models.

Details

Motivation: Video reasoning lacks high-quality data and effective training methods, which limits the development of multimodal large language models.

Method: Train on the DarkEventInfer and MixVidQA datasets with reinforcement learning to develop VersaVid-R1.

Result: VersaVid-R1 performs strongly across video understanding, reasoning, and captioning tasks.

Conclusion: The new datasets and model fill a gap in video reasoning and provide a basis for future research.

Abstract: Recent advancements in multimodal large language models have successfully extended the Reason-Then-Respond paradigm to image-based reasoning, yet video-based reasoning remains an underdeveloped frontier, primarily due to the scarcity of high-quality reasoning-oriented data and effective training methodologies. To bridge this gap, we introduce DarkEventInfer and MixVidQA, two novel datasets specifically designed to stimulate the model's advanced video understanding and reasoning abilities. DarkEventInfer presents videos with masked event segments, requiring models to infer the obscured content based on contextual video cues. MixVidQA, on the other hand, presents interleaved video sequences composed of two distinct clips, challenging models to isolate and reason about one while disregarding the other. Leveraging these carefully curated training samples together with reinforcement learning guided by diverse reward functions, we develop VersaVid-R1, the first versatile video understanding and reasoning model under the Reason-Then-Respond paradigm capable of handling multiple-choice and open-ended question answering, as well as video captioning tasks. Extensive experiments demonstrate that VersaVid-R1 significantly outperforms existing models across a broad spectrum of benchmarks, covering general video understanding, cognitive reasoning, and captioning tasks.

[6] FlagEvalMM: A Flexible Framework for Comprehensive Multimodal Model Evaluation

Zheqi He,Yesheng Liu,Jing-shu Zheng,Xuejing Li,Richeng Xuan,Jin-Ge Yao,Xi Yang

Main category: cs.CV

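TL;DR: FlagEvalMM is an open-source evaluation framework for multimodal models that supports a range of vision-language tasks and improves evaluation efficiency through an independent evaluation service and inference acceleration tools.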

Details

Motivation: Multimodal model evaluation lacks a unified framework; FlagEvalMM aims to fill this gap with flexible, efficient evaluation tooling.

Method: Decouple model inference from evaluation through an independent evaluation service, and use inference acceleration tools (e.g., vLLM, SGLang) to improve efficiency.

Result: Experiments show that FlagEvalMM evaluates model strengths and limitations accurately and efficiently.

Conclusion: FlagEvalMM is a useful open-source tool for comprehensive evaluation of multimodal models.

Abstract: We present FlagEvalMM, an open-source evaluation framework designed to comprehensively assess multimodal models across a diverse range of vision-language understanding and generation tasks, such as visual question answering, text-to-image/video generation, and image-text retrieval. We decouple model inference from evaluation through an independent evaluation service, thus enabling flexible resource allocation and seamless integration of new tasks and models. Moreover, FlagEvalMM utilizes advanced inference acceleration tools (e.g., vLLM, SGLang) and asynchronous data loading to significantly enhance evaluation efficiency. Extensive experiments show that FlagEvalMM offers accurate and efficient insights into model strengths and limitations, making it a valuable tool for advancing multimodal research. The framework is publicly accessible at https://github.com/flageval-baai/FlagEvalMM.

[7] AVA-Bench: Atomic Visual Ability Benchmark for Vision Foundation Models

Zheda Mai,Arpita Chowdhury,Zihe Wang,Sooyoung Jeon,Lemeng Wang,Jiacheng Hou,Jihyung Kil,Wei-Lun Chao

Main category: cs.CV

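TL;DR: AVA-Bench is a new benchmark that systematically evaluates vision foundation models (VFMs) by disentangling 14 Atomic Visual Abilities (AVAs), addressing the distribution mismatch and entangled abilities of existing VQA benchmarks.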

Details

Motivation: Existing VQA benchmarks have two blind spots: instruction-tuning data may not match the test distribution, and multi-ability tasks make it hard to pinpoint specific deficiencies. AVA-Bench disentangles AVAs to give a more precise evaluation.

Method: AVA-Bench explicitly disentangles 14 AVAs (e.g., localization, depth estimation) and matches training and test distributions within each ability.

Result: AVA-Bench reveals distinctive "ability fingerprints" of VFMs and finds that a 0.5B LLM yields VFM rankings similar to those of a 7B LLM while cutting GPU hours by 8x.

Conclusion: AVA-Bench provides a comprehensive, transparent evaluation basis for next-generation VFMs, turning model selection from educated guesswork into principled engineering.

Abstract: The rise of vision foundation models (VFMs) calls for systematic evaluation. A common approach pairs VFMs with large language models (LLMs) as general-purpose heads, followed by evaluation on broad Visual Question Answering (VQA) benchmarks. However, this protocol has two key blind spots: (i) the instruction tuning data may not align with VQA test distributions, meaning a wrong prediction can stem from such data mismatch rather than a VFM's visual shortcomings; (ii) VQA benchmarks often require multiple visual abilities, making it hard to tell whether errors stem from lacking all required abilities or just a single critical one. To address these gaps, we introduce AVA-Bench, the first benchmark that explicitly disentangles 14 Atomic Visual Abilities (AVAs), foundational skills like localization, depth estimation, and spatial understanding that collectively support complex visual reasoning tasks. By decoupling AVAs and matching training and test distributions within each, AVA-Bench pinpoints exactly where a VFM excels or falters. Applying AVA-Bench to leading VFMs thus reveals distinctive "ability fingerprints," turning VFM selection from educated guesswork into principled engineering. Notably, we find that a 0.5B LLM yields VFM rankings similar to those of a 7B LLM while cutting GPU hours by 8x, enabling more efficient evaluation. By offering a comprehensive and transparent benchmark, we hope AVA-Bench lays the foundation for the next generation of VFMs.

[8] BakuFlow: A Streamlining Semi-Automatic Label Generation Tool

Jerry Lin,Partick P. W. Chen

Main category: cs.CV

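TL;DR: BakuFlow is a semi-automatic labeling tool that combines pixel-precise manual correction, interactive data augmentation, label propagation, and an automatic labeling module to substantially improve annotation efficiency for computer vision tasks.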

Details

Motivation: Manual labeling for large-scale computer vision tasks is time-consuming and error-prone, and existing tools still require manual work, so a more efficient solution is needed.

Method: BakuFlow combines a live adjustable magnifier, interactive data augmentation, label propagation, and an automatic labeling module based on a modified YOLOE.

Result: The tool substantially reduces labeling workload and speeds up annotation of video data and dynamic datasets.

Conclusion: BakuFlow offers a flexible, efficient labeling solution for object detection and tracking that suits practical industrial scenarios.

Abstract: Accurate data labeling (or annotation) is still a bottleneck in computer vision, especially for large-scale tasks where manual labeling is time-consuming and error-prone. While tools like LabelImg can handle the labeling task, some of them still require annotators to manually label each image. In this paper, we introduce BakuFlow, a streamlining semi-automatic label generation tool. Key features include (1) a live adjustable magnifier for pixel-precise manual corrections, improving user experience; (2) an interactive data augmentation module to diversify training datasets; (3) label propagation for rapidly copying labeled objects between consecutive frames, greatly accelerating annotation of video data; and (4) an automatic labeling module powered by a modified YOLOE framework. Unlike the original YOLOE, our extension supports adding new object classes and any number of visual prompts per class during annotation, enabling flexible and scalable labeling for dynamic, real-world datasets. These innovations make BakuFlow especially effective for object detection and tracking, substantially reducing labeling workload and improving efficiency in practical computer vision and industrial scenarios.

[9] Bias Analysis in Unconditional Image Generative Models

Xiaofeng Zhang,Michelle Lin,Simon Lacoste-Julien,Aaron Courville,Yash Goyal

Main category: cs.CV

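TL;DR: The study finds that attribute shifts in generative AI models are small, but that the sensitivity of the attribute classifier in the evaluation framework can affect results, which calls for better labeling practices and evaluation frameworks.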

Details

Motivation: Widespread use of generative AI models has raised concerns about representational harm and discriminatory outcomes, but how bias arises in unconditional generation is still unclear.

Method: Train unconditional image generative models and use a common bias evaluation framework to study the shift between training and generated distributions.

Result: Attribute shifts are small, but classifier sensitivity is significant in the evaluation, especially for attributes whose values lie on a spectrum.

Conclusion: More accurate bias evaluation requires better labeling practices, closer scrutiny of evaluation frameworks, and attention to the social complexity of attributes.

Abstract: The widespread adoption of generative AI models has raised growing concerns about representational harm and potential discriminatory outcomes. Yet, despite growing literature on this topic, the mechanisms by which bias emerges, especially in unconditional generation, remain poorly understood. We define the bias of an attribute as the difference between the probability of its presence in the observed distribution and its expected proportion in an ideal reference distribution. In our analysis, we train a set of unconditional image generative models and adopt a commonly used bias evaluation framework to study bias shift between training and generated distributions. Our experiments reveal that the detected attribute shifts are small. We find that the attribute shifts are sensitive to the attribute classifier used to label generated images in the evaluation framework, particularly when its decision boundaries fall in high-density regions. Our empirical analysis indicates that this classifier sensitivity is often observed in attribute values that lie on a spectrum, as opposed to exhibiting a binary nature. This highlights the need for more representative labeling practices, understanding the shortcomings through greater scrutiny of evaluation frameworks, and recognizing the socially complex nature of attributes when evaluating bias.
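The bias definition in this abstract (observed probability of an attribute minus its expected proportion in a reference distribution) and the classifier-sensitivity finding can be made concrete with a small sketch. The function names, the 0/1 label encoding, and the toy classifier scores are hypothetical. Treating the shift between training and generated data as a difference of presence rates is an assumption that follows from the definition, since the reference term cancels.

```python
def attribute_rate(labels):
    """Fraction of samples in which the attribute is present (labels are 0/1)."""
    return sum(labels) / len(labels)


def bias(labels, reference_rate):
    """Observed presence rate minus the expected rate in the reference distribution."""
    return attribute_rate(labels) - reference_rate


def bias_shift(train_labels, generated_labels):
    """Bias of generated data minus bias of training data; the reference cancels."""
    return attribute_rate(generated_labels) - attribute_rate(train_labels)


def label_rate(scores, threshold):
    """Presence rate after thresholding hypothetical classifier scores."""
    return sum(1 for s in scores if s >= threshold) / len(scores)


# Toy scores clustered around the decision boundary (a high-density region).
scores = [0.10, 0.48, 0.49, 0.51, 0.52, 0.90]
rate_at_050 = label_rate(scores, 0.50)  # 3 of 6 samples labeled present
rate_at_053 = label_rate(scores, 0.53)  # 1 of 6 samples labeled present
```

In this toy case a small move of the threshold changes the measured rate from one half to one sixth, which is the kind of sensitivity the abstract attributes to decision boundaries that fall in high-density regions.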

[10] CAIRe: Cultural Attribution of Images by Retrieval-Augmented Evaluation

Arnav Yayavaram,Siddharth Yayavaram,Simran Khanuja,Michael Saxon,Graham Neubig

Main category: cs.CV

TL;DR: The paper proposes CAIRe, a new metric that assesses the cultural relevance of images, to help address cross-cultural bias in text-to-image models.

Details

Motivation: Text-to-image models should perform equitably across cultural contexts, but existing efforts to reduce cross-cultural bias suffer from performance loss, factual errors, or offensive outputs.

Method: CAIRe grounds entities and concepts in the image to a knowledge base and uses factual information to score each culture label independently.

Result: CAIRe surpasses baselines by 28% F1 points on a manually curated dataset of culturally salient but rare items, and achieves correlations of 0.56 and 0.66 with human ratings on two datasets of culturally universal concepts.

Conclusion: CAIRe assesses the cultural relevance of images effectively and aligns strongly with human judgment, providing a reliable tool for addressing cross-cultural bias.

Abstract: As text-to-image models become increasingly prevalent, ensuring their equitable performance across diverse cultural contexts is critical. Efforts to mitigate cross-cultural biases have been hampered by trade-offs, including a loss in performance, factual inaccuracies, or offensive outputs. Despite widespread recognition of these challenges, an inability to reliably measure these biases has stalled progress. To address this gap, we introduce CAIRe, a novel evaluation metric that assesses the degree of cultural relevance of an image, given a user-defined set of labels. Our framework grounds entities and concepts in the image to a knowledge base and uses factual information to give independent graded judgments for each culture label. On a manually curated dataset of culturally salient but rare items built using language models, CAIRe surpasses all baselines by 28% F1 points. Additionally, we construct two datasets for culturally universal concepts, one comprising T2I-generated outputs and another retrieved from naturally occurring data. CAIRe achieves Pearson's correlations of 0.56 and 0.66 with human ratings on these sets, based on a 5-point Likert scale of cultural relevance. This demonstrates its strong alignment with human judgment across diverse image sources.

[11] Seedance 1.0: Exploring the Boundaries of Video Generation Models

Yu Gao,Haoyuan Guo,Tuyen Hoang,Weilin Huang,Lu Jiang,Fangyuan Kong,Huixia Li,Jiashi Li,Liang Li,Xiaojie Li,Xunsong Li,Yifu Li,Shanchuan Lin,Zhijie Lin,Jiawei Liu,Shu Liu,Xiaonan Nie,Zhiwu Qing,Yuxi Ren,Li Sun,Zhi Tian,Rui Wang,Sen Wang,Guoqiang Wei,Guohong Wu,Jie Wu,Ruiqi Xia,Fei Xiao,Xuefeng Xiao,Jiangqiao Yan,Ceyuan Yang,Jianchao Yang,Runkai Yang,Tao Yang,Yihang Yang,Zilyu Ye,Xuejiao Zeng,Yan Zeng,Heng Zhang,Yang Zhao,Xiaozheng Zheng,Peihao Zhu,Jiaxin Zou,Feilong Zuo

Main category: cs.CV

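TL;DR: Seedance 1.0 is a high-performance video generation model that improves prompt following, motion plausibility, and visual quality through multi-source data, an efficient architecture, and optimized training.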

Details

Motivation: Current foundation models struggle to balance prompt following, motion plausibility, and visual quality in video generation; Seedance 1.0 aims to address this.

Method: Multi-source data curation, efficient architecture design, an optimized training paradigm, and post-training with fine-grained supervised fine-tuning and video-specific RLHF.

Result: Seedance 1.0 generates a 5-second video at 1080p resolution in 41.4 seconds, with high quality, fast generation, and strong spatiotemporal fluidity.

Conclusion: Seedance 1.0 performs well in complex multi-subject scenes and supports multi-shot narrative coherence, and is presented as a major advance in video generation.

Abstract: Notable breakthroughs in diffusion modeling have propelled rapid improvements in video generation, yet current foundational models still face critical challenges in simultaneously balancing prompt following, motion plausibility, and visual quality. In this report, we introduce Seedance 1.0, a high-performance and inference-efficient video foundation generation model that integrates several core technical improvements: (i) multi-source data curation augmented with precise and meaningful video captioning, enabling comprehensive learning across diverse scenarios; (ii) an efficient architecture design with a proposed training paradigm, which natively supports multi-shot generation and joint learning of both text-to-video and image-to-video tasks; (iii) carefully optimized post-training approaches leveraging fine-grained supervised fine-tuning and video-specific RLHF with multi-dimensional reward mechanisms for comprehensive performance improvements; (iv) excellent model acceleration achieving ~10x inference speedup through multi-stage distillation strategies and system-level optimizations. Seedance 1.0 can generate a 5-second video at 1080p resolution in only 41.4 seconds (NVIDIA-L20). Compared to state-of-the-art video generation models, Seedance 1.0 stands out with high-quality and fast video generation that offers superior spatiotemporal fluidity with structural stability, precise instruction adherence in complex multi-subject contexts, and native multi-shot narrative coherence with consistent subject representation.

[12] Cross-Frame Representation Alignment for Fine-Tuning Video Diffusion Models

Sungwon Hwang,Hyojin Jang,Kinam Kim,Minho Park,Jaegul Choo

Main category: cs.CV

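TL;DR: The paper proposes CREPA, a method that improves fine-tuning of video diffusion models by aligning representations across frames, which improves semantic consistency and visual quality.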

Details

Motivation: User-level fine-tuning of video diffusion models struggles to preserve specific attributes of the training data, and existing methods such as REPA fall short on semantic consistency.

Method: CREPA aligns the hidden states of a frame with external features from neighboring frames to improve fine-tuning of video diffusion models.

Result: On large-scale models including CogVideoX-5B and Hunyuan Video, CREPA improves visual fidelity and cross-frame semantic coherence.

Conclusion: CREPA is a broadly applicable regularization technique that improves fine-tuning of video diffusion models.

Abstract: Fine-tuning Video Diffusion Models (VDMs) at the user level to generate videos that reflect specific attributes of training data presents notable challenges, yet remains underexplored despite its practical importance. Meanwhile, recent work such as Representation Alignment (REPA) has shown promise in improving the convergence and quality of DiT-based image diffusion models by aligning, or assimilating, their internal hidden states with external pretrained visual features, suggesting its potential for VDM fine-tuning. In this work, we first propose a straightforward adaptation of REPA for VDMs and empirically show that, while effective for convergence, it is suboptimal in preserving semantic consistency across frames. To address this limitation, we introduce Cross-frame Representation Alignment (CREPA), a novel regularization technique that aligns hidden states of a frame with external features from neighboring frames. Empirical evaluations on large-scale VDMs, including CogVideoX-5B and Hunyuan Video, demonstrate that CREPA improves both visual fidelity and cross-frame semantic coherence when fine-tuned with parameter-efficient methods such as LoRA. We further validate CREPA across diverse datasets with varying attributes, confirming its broad applicability. Project page: https://crepavideo.github.io
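The idea of aligning a frame's hidden state with external features from neighboring frames can be sketched as a simple regularizer. The abstract does not give the loss, so everything specific here is an assumption: cosine similarity as the alignment measure, the neighbor offsets `(-1, 1)`, the `neighbor_weight` of 0.5, and hidden states that are already projected to the dimension of the external features. It is an illustration of the concept, not the CREPA objective.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def cross_frame_alignment_loss(hidden, external, offsets=(-1, 1), neighbor_weight=0.5):
    """Negative mean alignment of each frame with its own and neighboring features.

    hidden[f] and external[f] are per-frame vectors of equal dimension.
    """
    per_frame = []
    for f, h in enumerate(hidden):
        same = cosine(h, external[f])  # per-frame, REPA-style term
        neigh = [cosine(h, external[f + d])
                 for d in offsets if 0 <= f + d < len(external)]
        cross = sum(neigh) / len(neigh) if neigh else 0.0  # cross-frame term
        per_frame.append(same + neighbor_weight * cross)
    return -sum(per_frame) / len(per_frame)
```

Setting `neighbor_weight=0` reduces the sketch to a purely per-frame alignment term, which corresponds to the straightforward REPA adaptation that the abstract reports as suboptimal for cross-frame consistency.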

[13] PatchGuard: Adversarially Robust Anomaly Detection and Localization through Vision Transformers and Pseudo Anomalies

Mojtaba Nafez,Amirhossein Koochakian,Arad Maleki,Jafar Habibi,Mohammad Hossein Rohban

Main category: cs.CV

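TL;DR: PatchGuard is an adversarially robust anomaly detection and localization method based on Vision Transformers that uses pseudo anomalies with localization masks to substantially improve performance under adversarial attack.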

Details

Motivation: Current anomaly detection and localization methods are vulnerable to adversarial attacks because their training data contain only normal samples; PatchGuard aims to address this.

Method: Foreground-aware pseudo anomalies in a ViT architecture, combined with adversarial training and a novel loss function, to improve robustness.

Result: On industrial and medical datasets, PatchGuard improves AD and AL performance in adversarial settings by 53.2% and 68.5%, respectively.

Conclusion: PatchGuard performs well in both adversarial and non-adversarial settings, offering a more robust solution for anomaly detection and localization.

Abstract: Anomaly Detection (AD) and Anomaly Localization (AL) are crucial in fields that demand high reliability, such as medical imaging and industrial monitoring. However, current AD and AL approaches are often susceptible to adversarial attacks due to limitations in training data, which typically include only normal, unlabeled samples. This study introduces PatchGuard, an adversarially robust AD and AL method that incorporates pseudo anomalies with localization masks within a Vision Transformer (ViT)-based architecture to address these vulnerabilities. We begin by examining the essential properties of pseudo anomalies, and then provide theoretical insights into the attention mechanisms required to enhance the adversarial robustness of AD and AL systems. We then present our approach, which leverages Foreground-Aware Pseudo-Anomalies to overcome the deficiencies of previous anomaly-aware methods. Our method incorporates these crafted pseudo-anomaly samples into a ViT-based framework, with adversarial training guided by a novel loss function designed to improve model robustness, as supported by our theoretical analysis. Experimental results on well-established industrial and medical datasets demonstrate that PatchGuard significantly outperforms previous methods in adversarial settings, achieving performance gains of 53.2% in AD and 68.5% in AL, while also maintaining competitive accuracy in non-adversarial settings. The code repository is available at https://github.com/rohban-lab/PatchGuard

[14] UFM: A Simple Path towards Unified Dense Correspondence with Flow

Yuchen Zhang,Nikhil Keetha,Chenwei Lyu,Bhuvan Jhamb,Yutian Chen,Yuheng Qiu,Jay Karhade,Shreyas Jha,Yaoyu Hu,Deva Ramanan,Sebastian Scherer,Wenshan Wang

Main category: cs.CV

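TL;DR: With unified training, UFM outperforms specialized methods on both optical flow estimation and wide-baseline matching, enabling fast, general-purpose image correspondence.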

Details

Motivation: Optical flow estimation and wide-baseline matching have traditionally been handled separately; the paper explores the potential of unified training.

Method: A simple, generic transformer architecture that directly regresses the (u,v) flow, avoiding the complexity of coarse-to-fine cost volumes.

Result: UFM is 28% more accurate than Unimatch on optical flow, and has 62% less error while being 6.7x faster than RoMa on wide-baseline matching.

Conclusion: Unified training can outperform specialized approaches across both domains, opening new directions for multi-modal, long-range, and real-time correspondence tasks.

Abstract: Dense image correspondence is central to many applications, such as visual odometry, 3D reconstruction, object association, and re-identification. Historically, dense correspondence has been tackled separately for wide-baseline scenarios and optical flow estimation, despite the common goal of matching content between two images. In this paper, we develop a Unified Flow & Matching model (UFM), which is trained on unified data for pixels that are co-visible in both source and target images. UFM uses a simple, generic transformer architecture that directly regresses the (u,v) flow. It is easier to train and more accurate for large flows compared to the typical coarse-to-fine cost volumes in prior work. UFM is 28% more accurate than state-of-the-art flow methods (Unimatch), while also having 62% less error and being 6.7x faster than dense wide-baseline matchers (RoMa). UFM is the first to demonstrate that unified training can outperform specialized approaches across both domains. This result enables fast, general-purpose correspondence and opens new directions for multi-modal, long-range, and real-time correspondence tasks.

[15] Lightweight Object Detection Using Quantized YOLOv4-Tiny for Emergency Response in Aerial Imagery

Sindhu Boddu,Arindam Mukherjee

Main category: cs.CV

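TL;DR: This paper presents a lightweight, energy-efficient object detection solution for aerial emergency imagery. It uses a YOLOv4-Tiny model optimized with INT8 quantization and trained on a custom dataset, which substantially reduces model size and speeds up inference.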

Details

Motivation: Public datasets lack drone-view emergency imagery, and existing models are not efficient enough on low-power edge devices.

Method: A YOLOv4-Tiny model optimized with INT8 quantization and trained on a custom dataset of 10,820 emergency images.

Result: The quantized model is 71% smaller and 44% faster at inference, with detection performance comparable to YOLOv5-small.

Conclusion: The quantized YOLOv4-Tiny model is suitable for real-time emergency detection on low-power edge devices.

Abstract: This paper presents a lightweight and energy-efficient object detection solution for aerial imagery captured during emergency response situations. We focus on deploying the YOLOv4-Tiny model, a compact convolutional neural network, optimized through post-training quantization to INT8 precision. The model is trained on a custom-curated aerial emergency dataset, consisting of 10,820 annotated images covering critical emergency scenarios. Unlike prior works that rely on publicly available datasets, we created this dataset ourselves due to the lack of publicly available drone-view emergency imagery, making the dataset itself a key contribution of this work. The quantized model is evaluated against YOLOv5-small across multiple metrics, including mean Average Precision (mAP), F1 score, inference time, and model size. Experimental results demonstrate that the quantized YOLOv4-Tiny achieves comparable detection performance while reducing the model size from 22.5 MB to 6.4 MB and improving inference speed by 44%. With a 71% reduction in model size and a 44% increase in inference speed, the quantized YOLOv4-Tiny model proves highly suitable for real-time emergency detection on low-power edge devices.
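The size figures in this abstract can be checked directly, and the general idea of INT8 quantization can be shown on a toy weight list. The abstract does not describe the quantization scheme, so the asymmetric (affine) per-tensor mapping below, and the names `quantize_int8` and `dequantize`, are assumptions for illustration rather than the authors' pipeline.

```python
def size_reduction_percent(before_mb, after_mb):
    """Relative size reduction in percent."""
    return (before_mb - after_mb) / before_mb * 100.0


def quantize_int8(values):
    """Affine quantization of floats to integers in [-128, 127] (assumed scheme)."""
    lo = min(min(values), 0.0)  # the represented range must contain 0.0
    hi = max(max(values), 0.0)
    scale = (hi - lo) / 255.0
    if scale == 0.0:            # all-zero input: any positive scale works
        scale = 1.0
    zero_point = round(-128 - lo / scale)
    q = [max(-128, min(127, round(v / scale) + zero_point)) for v in values]
    return q, scale, zero_point


def dequantize(q, scale, zero_point):
    """Map quantized integers back to approximate float values."""
    return [(x - zero_point) * scale for x in q]


# 22.5 MB -> 6.4 MB is about a 71.6% reduction, reported as 71% in the abstract.
reduction = size_reduction_percent(22.5, 6.4)

weights = [-1.0, -0.25, 0.0, 0.5, 1.55]  # hypothetical FP32 weights
q, scale, zero_point = quantize_int8(weights)
restored = dequantize(q, scale, zero_point)  # each value within scale / 2 of the original
```

Storing one byte per weight instead of four gives at most a 4x reduction for the weights themselves, which is in line with the reported drop from 22.5 MB to 6.4 MB once non-weight data in the model file is taken into account.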

[16] Efficient Edge Deployment of Quantized YOLOv4-Tiny for Aerial Emergency Object Detection on Raspberry Pi 5

Sindhu Boddu,Arindam Mukherjee

Main category: cs.CV

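TL;DR: The paper evaluates a quantized YOLOv4-Tiny model deployed on the resource-constrained Raspberry Pi 5 for real-time object detection; the quantized model performs well in both speed and power consumption.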

Details

Motivation: Explore efficient, low-power real-time object detection on resource-constrained edge devices to support safety-critical emergency response applications.

Method: Quantize the YOLOv4-Tiny model to INT8 precision with TensorFlow Lite post-training quantization, and evaluate detection speed, power consumption, and thermal feasibility on the Raspberry Pi 5.

Result: The quantized model takes 28.2 ms per image with an average power consumption of 13.85 W, significantly lower than its FP32 counterpart, while detection accuracy on key emergency classes remains robust.

Conclusion: The results show the potential of low-power embedded AI systems for real-time deployment in safety-critical emergency response applications.

Abstract: This paper presents the deployment and performance evaluation of a quantized YOLOv4-Tiny model for real-time object detection in aerial emergency imagery on a resource-constrained edge device, the Raspberry Pi 5. The YOLOv4-Tiny model was quantized to INT8 precision using TensorFlow Lite post-training quantization techniques and evaluated for detection speed, power consumption, and thermal feasibility under embedded deployment conditions. The quantized model achieved an inference time of 28.2 ms per image with an average power consumption of 13.85 W, demonstrating a significant reduction in power usage compared to its FP32 counterpart. Detection accuracy remained robust across key emergency classes such as Ambulance, Police, Fire Engine, and Car Crash. These results highlight the potential of low-power embedded AI systems for real-time deployment in safety-critical emergency response applications.
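The two reported numbers (28.2 ms per image, 13.85 W average power) imply a throughput and a rough energy cost per image. The short calculation below makes that arithmetic explicit. It assumes back-to-back inference with no pre- or post-processing overhead and that the average power figure applies during inference, neither of which the abstract states.

```python
def frames_per_second(latency_ms):
    """Throughput implied by a per-image latency, assuming back-to-back inference."""
    return 1000.0 / latency_ms


def energy_per_image_joules(power_watts, latency_ms):
    """Energy per image, assuming the average power is drawn for the whole inference."""
    return power_watts * latency_ms / 1000.0


fps = frames_per_second(28.2)                   # about 35.5 images per second
energy = energy_per_image_joules(13.85, 28.2)   # about 0.39 J per image
```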

Tong Wang,Guanzhou Chen,Xiaodong Zhang,Chenxi Liu,Jiaqi Wang,Xiaoliang Tan,Wenchao Guo,Qingyuan Yang,Kaiqi Zhang

Main category: cs.CV

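TL;DR: Proposes a multi-modal self-supervised learning framework that pre-trains on RGB images, multi-spectral data, and DSMs, substantially improving performance on downstream remote sensing tasks.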

Details

Motivation: High-quality labeled data is costly and time-consuming to acquire, so an efficient self-supervised learning method is needed.

Method: An information-aware adaptive masking strategy, a cross-modal masking mechanism, and multi-task self-supervised objectives that capture both cross-modal correlations and the feature structure within each modality.

Result: Strong results on 26 tasks across 15 remote sensing datasets, including mIoU of 78.30% and 76.50% on the Potsdam and Vaihingen semantic segmentation tasks and an RMSE of 0.182 on the US3D depth estimation task.

Conclusion: The method outperforms existing pretraining approaches on multi-modal remote sensing tasks; the code and dataset are open source.

Abstract: Remote sensing image interpretation plays a critical role in environmental monitoring, urban planning, and disaster assessment. However, acquiring high-quality labeled data is often costly and time-consuming. To address this challenge, we propose a multi-modal self-supervised learning framework that leverages high-resolution RGB images, multi-spectral data, and digital surface models (DSM) for pre-training. By designing an information-aware adaptive masking strategy, cross-modal masking mechanism, and multi-task self-supervised objectives, the framework effectively captures both the correlations across different modalities and the unique feature structures within each modality. We evaluated the proposed method on multiple downstream tasks, covering typical remote sensing applications such as scene classification, semantic segmentation, change detection, object detection, and depth estimation. Experiments are conducted on 15 remote sensing datasets, encompassing 26 tasks. The results demonstrate that the proposed method outperforms existing pretraining approaches in most tasks. Specifically, on the Potsdam and Vaihingen semantic segmentation tasks, our method achieved mIoU scores of 78.30% and 76.50% with only 50% of the training set. For the US3D depth estimation task, the RMSE is reduced to 0.182, and for the binary change detection task on the SECOND dataset, our method achieved an mIoU of 47.51%, surpassing the second-best method, CS-MAE, by 3 percentage points. Our pretraining code, checkpoints, and HR-Pairs dataset can be found at https://github.com/CVEO/MSSDF.

[18] CheckManual: A New Challenge and Benchmark for Manual-based Appliance Manipulation

Yuxing Long,Jiyao Zhang,Mingjie Pan,Tianshu Wu,Taewhan Kim,Hao Dong

Main category: cs.CV

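TL;DR: Proposes CheckManual, the first manual-based appliance manipulation benchmark, with manuals generated with large-model assistance, along with associated challenges, metrics, and simulator environments.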

Details

Motivation: Correct use of appliances significantly improves quality of life, but existing work does not make full use of manuals and cannot comprehend multi-page manuals.

Method: A large-model-assisted, human-revised data generation pipeline that creates manuals from CAD appliance models, together with manual-based manipulation challenges, metrics, and simulator environments.

Result: The paper proposes ManualPlan, the first manual-based manipulation planning model, which sets baselines for the CheckManual benchmark.

Conclusion: CheckManual provides a new benchmark and direction for appliance manipulation research, and ManualPlan shows the importance of manual information.

Abstract: Correct use of electrical appliances has significantly improved human life quality. Unlike simple tools that can be manipulated with common sense, different parts of electrical appliances have specific functions defined by manufacturers. If we want a robot to heat bread in a microwave, we should enable it to review the microwave manual first. From the manual, it can learn about component functions, interaction methods, and representative task steps for appliances. However, previous manual-related works remain limited to question-answering tasks, while existing manipulation research ignores the manual's important role and fails to comprehend multi-page manuals. In this paper, we propose the first manual-based appliance manipulation benchmark, CheckManual. Specifically, we design a large model-assisted human-revised data generation pipeline to create manuals based on CAD appliance models. With these manuals, we establish novel manual-based manipulation challenges, metrics, and simulator environments for model performance evaluation. Furthermore, we propose the first manual-based manipulation planning model, ManualPlan, to set up a group of baselines for the CheckManual benchmark.

[19] An Effective End-to-End Solution for Multimodal Action Recognition

Songping Wang,Xiantao Hu,Yueming Lyu,Caifeng Shan

Main category: cs.CV

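TL;DR: Proposes a comprehensive multimodal action recognition solution that combines data augmentation, transfer learning, spatial-temporal feature extraction, and prediction enhancement methods, reaching 99% Top-1 and 100% Top-5 accuracy in the competition.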

Details

Motivation: Tri-modal data is scarce, which makes multimodal action recognition challenging and requires effective use of multimodal information. Method: Optimizes data augmentation to enlarge the training set, pre-trains the backbone on RGB datasets, combines 2D CNNs with TSM to extract spatial-temporal features, and applies SWA, ensembling, and TTA to strengthen predictions. Result: Achieves 99% Top-1 and 100% Top-5 accuracy in the competition. Conclusion: The solution shows superior performance on multimodal action recognition. Abstract: Recently, multimodal tasks have strongly advanced the field of action recognition with their rich multimodal information. However, due to the scarcity of tri-modal data, research on tri-modal action recognition tasks faces many challenges. To this end, we have proposed a comprehensive multimodal action recognition solution that effectively utilizes multimodal information. First, the existing data are transformed and expanded by optimizing data enhancement techniques to enlarge the training scale. At the same time, more RGB datasets are used to pre-train the backbone network, which is better adapted to the new task by means of transfer learning. Second, multimodal spatial features are extracted with the help of 2D CNNs and combined with the Temporal Shift Module (TSM) to achieve multimodal spatial-temporal feature extraction comparable to 3D CNNs while improving computational efficiency. In addition, common prediction enhancement methods, such as Stochastic Weight Averaging (SWA), ensembling, and Test-Time Augmentation (TTA), are used to integrate the knowledge of models from different training periods of the same architecture and from different architectures, so as to predict the actions from different perspectives and fully exploit the target information. Ultimately, we achieved a Top-1 accuracy of 99% and a Top-5 accuracy of 100% on the competition leaderboard, demonstrating the superiority of our solution.
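
The abstract above relies on the Temporal Shift Module to give 2D CNN features a temporal component. As a rough illustration of the shifting idea only, the following pure-Python sketch moves one slice of channels one frame backward in time and another slice one frame forward, leaving the rest untouched. The function name `temporal_shift`, the list-of-lists layout, and the `fold_div=4` default are assumptions made for this example and are not taken from the authors' code.

```python
def temporal_shift(frames, fold_div=4):
    """Shift a fraction of channels along the time axis.

    frames: list of T per-frame feature vectors, each a list of C numbers
            (a hypothetical layout chosen for this sketch).
    The first C // fold_div channels are read from the next frame,
    the following C // fold_div channels from the previous frame,
    and the remaining channels are copied unchanged. Positions with
    no neighbouring frame are zero-filled.
    """
    t = len(frames)
    c = len(frames[0])
    fold = c // fold_div
    out = [[0.0] * c for _ in range(t)]
    for i in range(t):
        for j in range(c):
            if j < fold:
                # channel slice that looks one frame into the future
                if i + 1 < t:
                    out[i][j] = frames[i + 1][j]
            elif j < 2 * fold:
                # channel slice that looks one frame into the past
                if i - 1 >= 0:
                    out[i][j] = frames[i - 1][j]
            else:
                # untouched channels keep purely spatial information
                out[i][j] = frames[i][j]
    return out


if __name__ == "__main__":
    clip = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]]
    for row in temporal_shift(clip):
        print(row)
```

The shift itself adds no parameters or multiplications, which is why it is commonly described as a cheap way to mix information across neighbouring frames before a 2D convolution.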

[20] Autoregressive Adversarial Post-Training for Real-Time Interactive Video Generation

Shanchuan Lin,Ceyuan Yang,Hao He,Jianwen Jiang,Yuxi Ren,Xin Xia,Yang Zhao,Xuefeng Xiao,Lu Jiang

Main category: cs.CV

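TL;DR: Proposes autoregressive adversarial post-training (AAPT), which turns a pre-trained latent video diffusion model into a real-time interactive video generator supporting one-step generation and interactive control.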

Details

Motivation: Existing large-scale video generation models are computationally heavy and cannot meet the needs of real-time, interactive applications. Method: Uses autoregressive adversarial training to generate each latent frame in a single step, exploits the KV cache for efficiency, and trains in a student-forcing manner to reduce error accumulation in long video generation. Result: The 8B model generates video in real time at 24fps and 736x416 resolution on a single H100, or at 1280x720 on 8xH100 for videos up to one minute long. Conclusion: AAPT performs well in both real-time speed and interactivity, offering an efficient solution for video generation. Abstract: Existing large-scale video generation models are computationally intensive, preventing adoption in real-time and interactive applications. In this work, we propose autoregressive adversarial post-training (AAPT) to transform a pre-trained latent video diffusion model into a real-time, interactive video generator. Our model autoregressively generates one latent frame at a time using a single neural function evaluation (1NFE). The model can stream the result to the user in real time and receive interactive responses as controls to generate the next latent frame. Unlike existing approaches, our method explores adversarial training as an effective paradigm for autoregressive generation. This not only allows us to design an architecture that is more efficient for one-step generation while fully utilizing the KV cache, but also enables training the model in a student-forcing manner that proves to be effective in reducing error accumulation during long video generation. Our experiments demonstrate that our 8B model achieves real-time, 24fps, streaming video generation at 736x416 resolution on a single H100, or 1280x720 on 8xH100 up to a minute long (1440 frames). Visit our research website at https://seaweed-apt.com/2

[21] A new approach for image segmentation based on diffeomorphic registration and gradient fields

Junchao Zhou

Main category: cs.CV

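TL;DR: Proposes a 2D image segmentation method based on a variational framework and diffeomorphic transformations, combining shape analysis with the LDDMM framework to achieve accurate segmentation without large amounts of training data.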

Details

Motivation: Traditional image segmentation methods either depend on large amounts of data or offer limited flexibility; this work aims for a theoretically grounded, flexible method that reduces the dependence on data. Method: Uses the LDDMM framework and diffeomorphic transformations to segment by deforming a template curve, with the optimization driven by a varifold representation and the image gradient field. Result: Achieves accurate image segmentation with a flexible, theoretically well-supported method suited to small-data settings. Conclusion: The method provides a new theoretical framework for image segmentation with potential for practical use. Abstract: Image segmentation is a fundamental task in computer vision aimed at delineating object boundaries within images. Traditional approaches, such as edge detection and variational methods, have been widely explored, while recent advances in deep learning have shown promising results but often require extensive training data. In this work, we propose a novel variational framework for 2D image segmentation that integrates concepts from shape analysis and diffeomorphic transformations. Our method models segmentation as the deformation of a template curve via a diffeomorphic transformation of the image domain, using the Large Deformation Diffeomorphic Metric Mapping (LDDMM) framework. The curve evolution is guided by a loss function that compares the deformed curve to the image gradient field, formulated through the varifold representation of geometric shapes. The approach is implemented in Python with GPU acceleration using the PyKeops library. This framework allows for accurate segmentation with a flexible and theoretically grounded methodology that does not rely on large datasets.

[22] SAGE: Exploring the Boundaries of Unsafe Concept Domain with Semantic-Augment Erasing

Hongguang Zhu,Yunchao Wei,Mengyu Wang,Siyu Jiao,Yan Fang,Jiannan Huang,Yao Zhao

Main category: cs.CV

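TL;DR: Proposes SAGE, which uses semantic-augment erasing and a global-local collaborative retention mechanism to address the safety problems caused by sensitive information in text-to-image diffusion models.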

Details

Motivation: Diffusion models inevitably absorb sensitive information during pre-training, creating risks of unsafe content generation and copyright infringement. Existing methods treat an unsafe concept as a fixed word and erase it repeatedly, falling into a "word concept abyss" that prevents generalized concept erasing. Method: SAGE uses semantic-augment erasing to turn concept word erasure into concept domain erasure, exploring the boundary representation of the concept domain through cyclic self-check and self-erasure; it also introduces a global-local collaborative retention mechanism that combines global semantic relationship alignment with local predicted noise preservation to limit the degradation of irrelevant concepts. Result: Experiments show that SAGE comprehensively outperforms other methods in safe generation with diffusion models. Conclusion: SAGE effectively addresses sensitive information in diffusion models and improves safe generation; the code and weights will be open-sourced. Abstract: Diffusion models (DMs) have achieved significant progress in text-to-image generation. However, the inevitable inclusion of sensitive information during pre-training poses safety risks, such as unsafe content generation and copyright infringement. Concept erasing finetunes weights to unlearn undesirable concepts, and has emerged as a promising solution. However, existing methods treat an unsafe concept as a fixed word and repeatedly erase it, trapping DMs in a "word concept abyss", which prevents generalized concept-related erasing. To escape this abyss, we introduce semantic-augment erasing, which transforms concept word erasure into concept domain erasure through cyclic self-check and self-erasure. It efficiently explores and unlearns the boundary representation of the concept domain through semantic spatial relationships between the original and training DMs, without requiring additional preprocessed data. Meanwhile, to mitigate the retention degradation of irrelevant concepts while erasing unsafe concepts, we further propose the global-local collaborative retention mechanism that combines global semantic relationship alignment with local predicted noise preservation, effectively expanding the retentive receptive field for irrelevant concepts. We name our method SAGE, and extensive experiments demonstrate the comprehensive superiority of SAGE compared with other methods in the safe generation of DMs. The code and weights will be open-sourced at https://github.com/KevinLight831/SAGE.

[23] ScaleLSD: Scalable Deep Line Segment Detection Streamlined

Zeran Ke,Bin Tan,Xianwei Zheng,Yujun Shen,Tianfu Wu,Nan Xue

Main category: cs.CV

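TL;DR: Proposes ScaleLSD, a self-supervised learning method for efficient, high-performing line segment detection (LSD) that outperforms the classic non-deep approach on natural images.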

Details

Motivation: The goal is to learn a domain-agnostic, robust LSD model that works on any natural image, and to characterize line geometry at scale through self-supervised learning. Method: Revisits and streamlines the fundamental designs of deep and non-deep LSD approaches to obtain ScaleLSD, which is trained on more than 10 million unlabeled real-world images. Result: ScaleLSD performs very well in zero-shot detection, single-view 3D geometry estimation, two-view line segment matching, and multiview 3D line mapping, and is the first deep approach to outperform the classic non-deep method in all tested aspects. Conclusion: ScaleLSD significantly expands and reinforces the versatility of line geometry in images. Abstract: This paper studies the problem of Line Segment Detection (LSD) for the characterization of line geometry in images, with the aim of learning a domain-agnostic robust LSD model that works well for any natural image. With a focus on scalable self-supervised learning of LSD, we revisit and streamline the fundamental designs of (deep and non-deep) LSD approaches to obtain a high-performing and efficient LSD learner, dubbed ScaleLSD, for the curation of line geometry at scale from over 10M unlabeled real-world images. Our ScaleLSD detects many more line segments in any natural image than even the pioneering non-deep LSD approach, giving a more complete and accurate geometric characterization of images using line segments. Experimentally, our proposed ScaleLSD is comprehensively tested under zero-shot protocols in detection performance, single-view 3D geometry estimation, two-view line segment matching, and multiview 3D line mapping, all with excellent performance obtained. Based on the thorough evaluation, our ScaleLSD is observed to be the first deep approach that outperforms the pioneering non-deep LSD in all aspects we have tested, significantly expanding and reinforcing the versatility of the line geometry of images. Code and models are available at https://github.com/ant-research/scalelsd

[24] UniForward: Unified 3D Scene and Semantic Field Reconstruction via Feed-Forward Gaussian Splatting from Only Sparse-View Images

Qijian Tian,Xin Tan,Jingyu Gong,Yuan Xie,Lizhuang Ma

Main category: cs.CV

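TL;DR: Proposes UniForward, a feed-forward Gaussian Splatting model that unifies 3D scene and semantic field reconstruction and runs in real time from only uncalibrated sparse-view images.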

Details

Motivation: Combining 3D scenes with semantic fields helps with perceiving and understanding the environment, but existing methods need camera parameters or ground truth depth, which limits their practicality. Method: Embeds semantic features into 3D Gaussians through a dual-branch decoupled decoder and trains with a loss-guided view sampler, without ground truth depth or masks. Result: The model reconstructs high-quality 3D scenes and semantic fields in real time, and supports rendering semantic features from arbitrary views as well as open-vocabulary segmentation. Conclusion: UniForward achieves state-of-the-art performance for unified 3D scene and semantic field reconstruction. Abstract: We propose a feed-forward Gaussian Splatting model that unifies 3D scene and semantic field reconstruction. Combining 3D scenes with semantic fields facilitates the perception and understanding of the surrounding environment. However, key challenges include embedding semantics into 3D representations, achieving generalizable real-time reconstruction, and ensuring practical applicability by using only images as input without camera parameters or ground truth depth. To this end, we propose UniForward, a feed-forward model to predict 3D Gaussians with anisotropic semantic features from only uncalibrated and unposed sparse-view images. To enable the unified representation of the 3D scene and semantic field, we embed semantic features into 3D Gaussians and predict them through a dual-branch decoupled decoder. During training, we propose a loss-guided view sampler to sample views from easy to hard, eliminating the need for ground truth depth or masks required by previous methods and stabilizing the training process. The whole model can be trained end-to-end using a photometric loss and a distillation loss that leverages semantic features from a pre-trained 2D semantic model. At the inference stage, our UniForward can reconstruct 3D scenes and the corresponding semantic fields in real time from only sparse-view images. The reconstructed 3D scenes achieve high-quality rendering, and the reconstructed 3D semantic field enables the rendering of view-consistent semantic features from arbitrary views, which can be further decoded into dense segmentation masks in an open-vocabulary manner. Experiments on novel view synthesis and novel view segmentation demonstrate that our method achieves state-of-the-art performance for unifying 3D scene and semantic field reconstruction.

Jialong Zuo,Yongtai Deng,Mengdan Tan,Rui Jin,Dongyue Wu,Nong Sang,Liang Pan,Changxin Gao

Main category: cs.CV

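TL;DR: Introduces a new problem, Omni Multi-modal Person Re-identification (OM-ReID), builds ORBench, the first high-quality multi-modal dataset for it, and proposes the multi-modal learning framework ReID5o.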

Details

Motivation: Existing methods and datasets are restricted to a few modalities and cannot handle the multi-modal queries that arise in real scenarios. Method: Builds the ORBench dataset, which covers five modalities, and proposes the ReID5o framework for multi-modal fusion and alignment. Result: ORBench offers greater diversity, and ReID5o gives the best performance in the experiments. Conclusion: ORBench and ReID5o provide an ideal platform and an effective solution for multi-modal person re-identification research. Abstract: In real-world scenarios, person re-identification (ReID) is expected to identify a person of interest via a descriptive query, regardless of whether the query is a single modality or a combination of multiple modalities. However, existing methods and datasets remain constrained to limited modalities, failing to meet this requirement. Therefore, we investigate a new challenging problem called Omni Multi-modal Person Re-identification (OM-ReID), which aims to achieve effective retrieval with varying multi-modal queries. To address dataset scarcity, we construct ORBench, the first high-quality multi-modal dataset comprising 1,000 unique identities across five modalities: RGB, infrared, color pencil, sketch, and textual description. This dataset also has significant superiority in terms of diversity, such as the painting perspectives and textual information. It could serve as an ideal platform for follow-up investigations in OM-ReID. Moreover, we propose ReID5o, a novel multi-modal learning framework for person ReID. It enables synergistic fusion and cross-modal alignment of arbitrary modality combinations in a single model, through a unified encoding and multi-expert routing mechanism. Extensive experiments verify the advancement and practicality of our ORBench. A wide range of models have been evaluated and compared on it, and our proposed ReID5o model gives the best performance. The dataset and code will be made publicly available at https://github.com/Zplusdragon/ReID5o_ORBench.

[26] Improving Out-of-Distribution Detection via Dynamic Covariance Calibration

Kaiyu Guo,Zijian Wang,Brian C. Lovell,Mahsa Baktashmotlagh

Main category: cs.CV

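TL;DR: Proposes dynamically adjusting the prior geometry by updating the covariance matrix with real-time input features to correct the influence of ill-distributed samples, significantly improving OOD detection.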

Details

Motivation: Existing subspace-based methods extract the information geometry statically and therefore cannot handle the geometric distortion caused by ill-distributed samples; the prior geometry needs to be adjusted dynamically. Method: Dynamically updates the prior covariance matrix, reducing the covariance along the direction of real-time input features and constraining the adjustment to the residual space so that essential data characteristics are preserved. Result: Significantly improves OOD detection on the CIFAR and ImageNet-1k datasets, including with the self-supervised DINO model. Conclusion: Dynamically adjusting the prior geometry effectively counters the influence of ill-distributed samples and makes OOD detection more robust. Abstract: Out-of-Distribution (OOD) detection is essential for the trustworthiness of AI systems. Methods using prior information (i.e., subspace-based methods) have shown effective performance by extracting information geometry to detect OOD data with a more appropriate distance metric. However, these methods fail to address the geometry distorted by ill-distributed samples, due to the limitation of statically extracting information geometry from the training distribution. In this paper, we argue that the influence of ill-distributed samples can be corrected by dynamically adjusting the prior geometry in response to new data. Based on this insight, we propose a novel approach that dynamically updates the prior covariance matrix using real-time input features, refining its information. Specifically, we reduce the covariance along the direction of real-time input features and constrain adjustments to the residual space, thus preserving essential data characteristics and avoiding effects on unintended directions in the principal space. We evaluate our method on two pre-trained models for the CIFAR dataset and five pre-trained models for ImageNet-1k, including the self-supervised DINO model. Extensive experiments demonstrate that our approach significantly enhances OOD detection across various models. The code is released at https://github.com/workerbcd/ooddcc.

[27] SRPL-SFDA: SAM-Guided Reliable Pseudo-Labels for Source-Free Domain Adaptation in Medical Image Segmentation

Xinya Liu,Jianghao Wu,Tao Lu,Shaoting Zhang,Guotai Wang

Main category: cs.CV

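TL;DR: Proposes SRPL-SFDA, a source-free domain adaptation method guided by the Segment Anything Model (SAM) that improves target-domain segmentation by raising the quality and reliability of pseudo-labels.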

Details

Motivation: Addresses the lack of supervision that source-free domain adaptation (SFDA) faces when the target domain has only unlabeled data, while protecting the privacy of source-domain data. Method: 1) Test-Time Tri-branch Intensity Enhancement (T3IE) to improve pseudo-label quality; 2) reliable pseudo-label selection based on the consistency of SAM outputs; 3) reliability-aware training. Result: Outperforms existing SFDA methods on two medical image datasets and comes close to supervised training in the target domain. Conclusion: SRPL-SFDA effectively improves pseudo-label quality and SFDA performance, and suits privacy-sensitive medical image segmentation tasks. Abstract: Domain Adaptation (DA) is crucial for robust deployment of medical image segmentation models when applied to new clinical centers with significant domain shifts. Source-Free Domain Adaptation (SFDA) is appealing as it can deal with privacy concerns and access constraints on source-domain data during adaptation to target-domain data. However, SFDA faces challenges such as insufficient supervision in the target domain with unlabeled images. In this work, we propose a Segment Anything Model (SAM)-guided Reliable Pseudo-Labels method for SFDA (SRPL-SFDA) with three key components: 1) Test-Time Tri-branch Intensity Enhancement (T3IE) that not only improves the quality of raw pseudo-labels in the target domain, but also leads to SAM-compatible inputs with three channels to better leverage SAM's zero-shot inference ability for refining the pseudo-labels; 2) a reliable pseudo-label selection module that rejects low-quality pseudo-labels based on Consistency of Multiple SAM Outputs (CMSO) under input perturbations with T3IE; and 3) a reliability-aware training procedure in the unlabeled target domain where reliable pseudo-labels are used for supervision and unreliable parts are regularized by entropy minimization. Experiments conducted on two multi-domain medical image segmentation datasets, for the fetal brain and the prostate respectively, demonstrate that: 1) SRPL-SFDA effectively enhances pseudo-label quality in the unlabeled target domain, and improves SFDA performance by leveraging the reliability-aware training; 2) SRPL-SFDA outperformed state-of-the-art SFDA methods, and its performance is close to that of supervised training in the target domain. The code of this work is available online: https://github.com/HiLab-git/SRPL-SFDA.
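
The third component described above, supervising on reliable pseudo-labels and regularizing unreliable regions with entropy minimization, can be sketched as a per-pixel loss. This is a minimal illustration under stated assumptions: the function name `reliability_aware_loss`, the flat per-pixel lists, the equal weighting of the two terms, and the boolean reliability mask (taken as already computed) are all hypothetical and are not drawn from the released SRPL-SFDA code.

```python
import math


def reliability_aware_loss(probs, pseudo_labels, reliable):
    """Average a cross-entropy term and an entropy term over pixels.

    probs:         list of per-pixel class probability lists
    pseudo_labels: list of per-pixel pseudo-label class indices
    reliable:      list of booleans, True where the pseudo-label is trusted
                   (how this mask is produced is outside this sketch)
    """
    eps = 1e-12  # avoids log(0)
    total = 0.0
    for p, y, is_reliable in zip(probs, pseudo_labels, reliable):
        if is_reliable:
            # supervised term: cross-entropy against the pseudo-label
            total += -math.log(p[y] + eps)
        else:
            # unsupervised term: Shannon entropy of the prediction
            total += -sum(q * math.log(q + eps) for q in p)
    return total / len(probs)


if __name__ == "__main__":
    probs = [[0.9, 0.1], [0.5, 0.5]]
    print(reliability_aware_loss(probs, [0, 0], [True, False]))
```

Minimizing the entropy term pushes uncertain pixels toward confident predictions without committing to a possibly wrong pseudo-label, which is the usual motivation for using it on the untrusted regions.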

[28] Synthetic Human Action Video Data Generation with Pose Transfer

Vaclav Knapp,Matyas Bohacek

Main category: cs.CV

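TL;DR: Proposes a pose-transfer-based method for generating synthetic human action video data that improves action recognition performance, and open-sources a dataset.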

Details

Motivation: Synthetic data for video understanding often has unnatural features that hurt training, limiting its potential in areas such as gesture recognition and autonomous driving. Method: Uses controllable 3D Gaussian avatar models for pose transfer to generate synthetic human action video data. Result: Validated on the Toyota Smarthome and NTU RGB+D datasets, where it improves action recognition performance and can scale few-shot datasets. Conclusion: The method generates high-quality synthetic data that compensates for gaps in real data, and the dataset is open-sourced to support further research. Abstract: In video understanding tasks, particularly those involving human motion, synthetic data generation often suffers from uncanny features, diminishing its effectiveness for training. Tasks such as sign language translation, gesture recognition, and human motion understanding in autonomous driving have thus been unable to exploit the full potential of synthetic data. This paper proposes a method for generating synthetic human action video data using pose transfer (specifically, controllable 3D Gaussian avatar models). We evaluate this method on the Toyota Smarthome and NTU RGB+D datasets and show that it improves performance in action recognition tasks. Moreover, we demonstrate that the method can effectively scale few-shot datasets, making up for groups underrepresented in the real training data and adding diverse backgrounds. We open-source the method along with RANDOM People, a dataset with videos and avatars of novel human identities for pose transfer crowd-sourced from the internet.

[29] Noise Conditional Variational Score Distillation

Xinyu Peng,Ziyang Zheng,Yaoming Wang,Han Li,Nuowen Kan,Wenrui Dai,Chenglin Li,Junni Zou,Hongkai Xiong

Main category: cs.CV

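TL;DR: NCVSD is a new method that distills pretrained diffusion models into generative denoisers by revealing that the unconditional score function implicitly characterizes the score function of denoising posterior distributions, enabling both fast generation and iterative refinement.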

Details

Motivation: The work aims to distill diffusion models into generative denoisers that deliver fast generation and high sample quality at the same time. Method: Integrates the insight about the unconditional score function into the VSD framework to learn generative denoisers that can sample from the denoising posterior across noise levels from high to low. Result: NCVSD performs strongly on class-conditional image generation and inverse problem solving, outperforming the teacher diffusion models and matching larger consistency models. Conclusion: Through fast generation and multi-step sampling refinement, NCVSD markedly improves generation efficiency and sample quality, and suits flexible, controllable sampling tasks. Abstract: We propose Noise Conditional Variational Score Distillation (NCVSD), a novel method for distilling pretrained diffusion models into generative denoisers. We achieve this by revealing that the unconditional score function implicitly characterizes the score function of denoising posterior distributions. By integrating this insight into the Variational Score Distillation (VSD) framework, we enable scalable learning of generative denoisers capable of approximating samples from the denoising posterior distribution across a wide range of noise levels. The proposed generative denoisers exhibit desirable properties that allow fast generation while preserving the benefit of iterative refinement: (1) fast one-step generation through sampling from pure Gaussian noise at high noise levels; (2) improved sample quality by scaling the test-time compute with multi-step sampling; and (3) zero-shot probabilistic inference for flexible and controllable sampling. We evaluate NCVSD through extensive experiments, including class-conditional image generation and inverse problem solving. By scaling the test-time compute, our method outperforms teacher diffusion models and is on par with consistency models of larger sizes. Additionally, with significantly fewer NFEs than diffusion-based methods, we achieve record-breaking LPIPS on inverse problems.

[30] ODG: Occupancy Prediction Using Dual Gaussians

Yunxiao Shi,Yinhao Zhu,Shizhong Han,Jisoo Jeong,Amin Ansari,Hong Cai,Fatih Porikli

Main category: cs.CV

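TL;DR: Proposes ODG, a new 3D occupancy prediction method that combines BEV and sparse-point representations through a dual-branch design to address the shortcomings of existing methods, and shows superior performance in experiments.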

Details

Motivation: Existing 3D occupancy prediction methods are either computationally expensive or lose information: BEV handles small objects poorly, while sparse points are inefficient for large objects and flat surfaces. Method: Uses a dual-branch design with a query-based sparse points branch and a BEV branch that share information through cross-attention, and fuses both outputs into the predicted 3D occupancy. Result: Achieves superior results on the Occ3D-nuScenes and Occ3D-Waymo benchmarks, with inference speed comparable to the latest efficient methods. Conclusion: ODG effectively combines BEV and sparse-point representations, overcoming the limitations of existing methods while remaining efficient. Abstract: 3D occupancy provides fine-grained 3D geometry and semantics for scene understanding, which is critical for autonomous driving. Most existing methods, however, carry high compute costs, requiring dense 3D feature volumes and cross-attention to effectively aggregate information. More recent works have adopted Bird's Eye View (BEV) or sparse points as the scene representation with much reduced cost, but still suffer from their respective shortcomings. More concretely, BEV struggles with small objects that often experience significant information loss after being projected to the ground plane. On the other hand, points can flexibly model small objects in 3D, but are inefficient at capturing flat surfaces or large objects. To address these challenges, in this paper, we present a novel 3D occupancy prediction approach, ODG, which combines BEV and sparse points based representations. We propose a dual-branch design: a query-based sparse points branch and a BEV branch. The 3D information learned in the sparse points branch is shared with the BEV stream via cross-attention, which enriches the weakened signals of difficult objects on the BEV plane. The outputs of both branches are finally fused to generate the predicted 3D occupancy. We conduct extensive experiments on the Occ3D-nuScenes and Occ3D-Waymo benchmarks that demonstrate the superiority of our proposed ODG. Moreover, ODG also delivers competitive inference speed when compared to the latest efficient approaches.

[31] A High-Quality Dataset and Reliable Evaluation for Interleaved Image-Text Generation

Yukang Feng,Jianwen Sun,Chuanhao Li,Zizhen Li,Jiaxin Ai,Fanrui Zhang,Yifan Chang,Sizhuo Zhou,Shenglin Zhang,Yu Dai,Kaipeng Zhang

Main category: cs.CV

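TL;DR: Introduces the InterSyn dataset and the SEIR method to improve the interleaved image-text generation ability of multimodal models, together with the SynJudge evaluation model.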

Details

Motivation: Current multimodal models struggle to generate tightly interleaved image-text outputs, mainly because training data is limited in scale, quality, and instructional richness. Method: Proposes the Self-Evaluation with Iterative Refinement (SEIR) method to build the InterSyn dataset, and develops the SynJudge evaluation model. Result: SEIR substantially raises dataset quality, and models trained on InterSyn improve on all evaluation metrics. Conclusion: InterSyn and SynJudge provide effective support for developing next-generation multimodal systems. Abstract: Recent advancements in Large Multimodal Models (LMMs) have significantly improved multimodal understanding and generation. However, these models still struggle to generate tightly interleaved image-text outputs, primarily due to the limited scale, quality, and instructional richness of current training datasets. To address this, we introduce InterSyn, a large-scale multimodal dataset constructed using our Self-Evaluation with Iterative Refinement (SEIR) method. InterSyn features multi-turn, instruction-driven dialogues with tightly interleaved image-text responses, providing rich object diversity and rigorous automated quality refinement, making it well-suited for training next-generation instruction-following LMMs. Furthermore, to address the lack of reliable evaluation tools capable of assessing interleaved multimodal outputs, we introduce SynJudge, an automatic evaluation model designed to quantitatively assess multimodal outputs along four dimensions: text content, image content, image quality, and image-text synergy. Experimental studies show that the SEIR method leads to substantially higher dataset quality compared to an otherwise identical process without refinement. Moreover, LMMs trained on InterSyn achieve uniform performance gains across all evaluation metrics, confirming InterSyn's utility for advancing multimodal systems.

[32] A Novel Lightweight Transformer with Edge-Aware Fusion for Remote Sensing Image Captioning

Swadhin Das,Divyansh Mundra,Priyanshu Dayal,Raksha Sharma

Main category: cs.CV

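TL;DR: Proposes a lightweight transformer architecture that reduces the dimensionality of the encoder layers and uses a distilled GPT-2 decoder, combined with knowledge distillation and an edge-aware enhancement strategy, to significantly improve remote sensing image captioning quality.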

Details

Motivation: Existing transformer models for remote sensing image captioning are computationally expensive and overlook fine-grained structural features. Method: Adopts a lightweight transformer architecture combined with knowledge distillation and an edge-aware enhancement strategy. Result: Experiments show that the method significantly outperforms existing techniques. Conclusion: The lightweight architecture and fine-grained feature enhancement strategy effectively improve remote sensing image captioning. Abstract: Transformer-based models have achieved strong performance in remote sensing image captioning by capturing long-range dependencies and contextual information. However, their practical deployment is hindered by high computational costs, especially in multi-modal frameworks that employ separate transformer-based encoders and decoders. In addition, existing remote sensing image captioning models primarily focus on high-level semantic extraction while often overlooking fine-grained structural features such as edges, contours, and object boundaries. To address these challenges, a lightweight transformer architecture is proposed by reducing the dimensionality of the encoder layers and employing a distilled version of GPT-2 as the decoder. A knowledge distillation strategy is used to transfer knowledge from a more complex teacher model to improve the performance of the lightweight network. Furthermore, an edge-aware enhancement strategy is incorporated to enhance image representation and object boundary understanding, enabling the model to capture fine-grained spatial details in remote sensing images. Experimental results demonstrate that the proposed approach significantly improves caption quality compared to state-of-the-art methods.
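
The abstract does not spell out the distillation objective, so the following is only a sketch of the widely used temperature-scaled formulation: the KL divergence between softened teacher and student distributions, multiplied by the squared temperature. The function names and the `temperature=2.0` default are assumptions for illustration; the paper may use a different loss or weighting.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with a temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(
        pt * math.log(pt / ps)
        for pt, ps in zip(p_teacher, p_student)
        if pt > 0.0
    )
    # T^2 keeps gradient magnitudes comparable across temperatures
    return kl * temperature ** 2


if __name__ == "__main__":
    print(distillation_loss([0.0, 0.0], [4.0, 0.0]))
```

In practice this term is usually added to the ordinary task loss (here, the captioning loss) with a mixing weight; that weight is another hyperparameter not given in the text above.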

[33] TOGA: Temporally Grounded Open-Ended Video QA with Weak Supervision

Ayush Gupta,Anirban Roy,Rama Chellappa,Nathaniel D. Bastian,Alvaro Velasquez,Susmit Jha

Main category: cs.CV

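TL;DR: Proposes TOGA, a model for weakly supervised video question answering that generates answers together with their temporal grounding without any temporal annotations.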

Details

Motivation: Addresses temporal grounding for video question answering under weak supervision, avoiding reliance on annotated data. Method: Generates pseudo labels and imposes a consistency constraint, jointly generating the answer and the temporal grounding. Result: Achieves state-of-the-art performance on the NExT-GQA, MSVD-QA, and ActivityNet-QA benchmarks. Conclusion: TOGA effectively improves both video question answering and temporal grounding under weak supervision. Abstract: We address the problem of video question answering (video QA) with temporal grounding in a weakly supervised setup, without any temporal annotations. Given a video and a question, we generate an open-ended answer grounded with the start and end time. For this task, we propose TOGA: a vision-language model for Temporally Grounded Open-Ended Video QA with Weak Supervision. We instruct-tune TOGA to jointly generate the answer and the temporal grounding. We operate in a weakly supervised setup where the temporal grounding annotations are not available. We generate pseudo labels for temporal grounding and ensure the validity of these labels by imposing a consistency constraint between the question of a grounding response and the response generated by a question referring to the same temporal segment. We notice that jointly generating the answers with the grounding improves performance on question answering as well as grounding. We evaluate TOGA on grounded QA and open-ended QA tasks. For grounded QA, we consider the NExT-GQA benchmark, which is designed to evaluate weakly supervised grounded question answering. For open-ended QA, we consider the MSVD-QA and ActivityNet-QA benchmarks. We achieve state-of-the-art performance for both tasks on these benchmarks.

[34] Harmonizing and Merging Source Models for CLIP-based Domain Generalization

Yuhe Ding,Jian Liang,Bo Jiang,Zi Wang,Aihua Zheng,Bin Luo

Main category: cs.CV

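TL;DR: HAM is a CLIP-based domain generalization method that resolves sample conflicts and optimization conflicts in multi-source training through conflict-free sample enrichment and model merging, significantly improving generalization.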

Details

Motivation: Existing methods suffer from sample conflicts and optimization conflicts in multi-source training, which limits generalization. HAM aims to resolve them through conflict-free sample enrichment and model merging. Method: During training, HAM enriches the source samples without introducing conflicting ones, harmonizes the update directions of the models, and introduces a redundancy-aware historical model merging method to integrate knowledge. Result: Experiments on five benchmark datasets show that HAM achieves state-of-the-art performance. Conclusion: By effectively integrating multi-source information, HAM significantly improves the domain generalization ability of CLIP models. Abstract: CLIP-based domain generalization aims to improve model generalization to unseen domains by leveraging the powerful zero-shot classification capabilities of CLIP and multiple source datasets. Existing methods typically train a single model across multiple source domains to capture domain-shared information. However, this paradigm inherently suffers from two types of conflicts: 1) sample conflicts, arising from noisy samples and extreme domain shifts among sources; and 2) optimization conflicts, stemming from competition and trade-offs during multi-source training. Both hinder generalization and lead to suboptimal solutions. Recent studies have shown that model merging can effectively mitigate the competition of multi-objective optimization and improve generalization performance. Inspired by these findings, we propose Harmonizing and Merging (HAM), a novel source model merging framework for CLIP-based domain generalization. During the training process of the source models, HAM enriches the source samples without conflicting samples, and harmonizes the update directions of all models. Then, a redundancy-aware historical model merging method is introduced to effectively integrate knowledge across all source models. HAM comprehensively consolidates source domain information while enabling mutual enhancement among source models, ultimately yielding a final model with optimal generalization capabilities. Extensive experiments on five widely used benchmark datasets demonstrate the effectiveness of our approach, achieving state-of-the-art performance.
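
To make the notion of model merging concrete, here is a deliberately simple stand-in: element-wise weighted averaging of parameters from several source models. This is not HAM's redundancy-aware historical merging, whose details are not given in the abstract; the function name `merge_models`, the dict-of-lists parameter format, and the uniform default weights are all assumptions chosen for illustration.

```python
def merge_models(state_dicts, weights=None):
    """Merge models by weighted element-wise parameter averaging.

    state_dicts: list of dicts mapping parameter name -> list of floats;
                 all models are assumed to share names and shapes.
    weights:     optional per-model coefficients; uniform if omitted.
    """
    n = len(state_dicts)
    if weights is None:
        weights = [1.0 / n] * n
    merged = {}
    for name in state_dicts[0]:
        length = len(state_dicts[0][name])
        merged[name] = [
            sum(w * sd[name][i] for w, sd in zip(weights, state_dicts))
            for i in range(length)
        ]
    return merged


if __name__ == "__main__":
    model_a = {"w": [1.0, 2.0]}
    model_b = {"w": [3.0, 4.0]}
    print(merge_models([model_a, model_b]))
```

Plain averaging only makes sense when the models start from the same initialization (as with fine-tuned copies of one CLIP checkpoint), which is the setting the abstract describes.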

[35] Evidential Deep Learning with Spectral-Spatial Uncertainty Disentanglement for Open-Set Hyperspectral Domain Generalization

Amirreza Khoshbakht,Erchan Aptoula

Main category: cs.CV

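TL;DR: Proposes a new open-set domain generalization framework for hyperspectral image classification that combines spectrum-invariant frequency disentanglement, a dual-channel residual network, evidential deep learning, and spectral-spatial uncertainty disentanglement.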

Details

Motivation: 解决现有域适应方法在目标域存在未知类别时无法处理域偏移的问题，避免负迁移和分类性能下降。 Method: 结合SIFD（频谱不变频率解耦）、DCRN（双通道残差网络）、EDL（证据深度学习）和SSUD（频谱空间不确定性解耦）四个模块，提取域不变特征并量化不确定性。 Result: 在三个跨场景高光谱分类任务中表现优异，性能接近最先进的域适应方法，且无需目标域训练数据。 Conclusion: 提出的框架有效解决了开放集域泛化问题，为高光谱图像分类提供了可靠解决方案。 Abstract: Open-set domain generalization(OSDG) for hyperspectral image classification presents significant challenges due to the presence of unknown classes in target domains and the need for models to generalize across multiple unseen domains without target-specific adaptation. Existing domain adaptation methods assume access to target domain data during training and fail to address the fundamental issue of domain shift when unknown classes are present, leading to negative transfer and reduced classification performance. To address these limitations, we propose a novel open-set domain generalization framework that combines four key components: Spectrum-Invariant Frequency Disentanglement (SIFD) for domain-agnostic feature extraction, Dual-Channel Residual Network (DCRN) for robust spectral-spatial feature learning, Evidential Deep Learning (EDL) for uncertainty quantification, and Spectral-Spatial Uncertainty Disentanglement (SSUD) for reliable open-set classification. The SIFD module extracts domain-invariant spectral features in the frequency domain through attention-weighted frequency analysis and domain-agnostic regularization, while DCRN captures complementary spectral and spatial information via parallel pathways with adaptive fusion. EDL provides principled uncertainty estimation using Dirichlet distributions, enabling the SSUD module to make reliable open-set decisions through uncertainty-aware pathway weighting and adaptive rejection thresholding. Experimental results on three cross-scene hyperspectral classification tasks show that our approach achieves performance comparable to state-of-the-art domain adaptation methods while requiring no access to the target domain during training. 
The implementation will be made available at https://github.com/amir-khb/SSUDOSDG upon acceptance.
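
As a rough illustration of the Dirichlet-based uncertainty that EDL provides, the sketch below follows the common subjective-logic formulation (alpha = evidence + 1, uncertainty = K / sum(alpha)). It is not taken from the paper: the function names and the rejection threshold of 0.5 are hypothetical, and the paper's SSUD module uses its own pathway weighting and adaptive thresholding.

```python
def dirichlet_uncertainty(evidence):
    """Return (expected class probabilities, uncertainty mass) for non-negative evidence.

    Assumes the usual EDL parameterisation alpha_k = evidence_k + 1.
    """
    if any(e < 0 for e in evidence):
        raise ValueError("evidence must be non-negative")
    alpha = [e + 1.0 for e in evidence]
    strength = sum(alpha)                   # Dirichlet strength S
    probs = [a / strength for a in alpha]   # expected probability per class
    uncertainty = len(alpha) / strength     # u = K / S, equal to 1 when there is no evidence
    return probs, uncertainty


def classify_open_set(evidence, reject_threshold=0.5):
    """Return the predicted class index, or "unknown" when uncertainty is too high.

    reject_threshold is a hypothetical fixed value chosen for illustration.
    """
    probs, uncertainty = dirichlet_uncertainty(evidence)
    if uncertainty > reject_threshold:
        return "unknown"
    return max(range(len(probs)), key=probs.__getitem__)
```

With no evidence for any class the uncertainty is 1 and the sample is rejected as unknown; strong evidence for one class lowers the uncertainty and yields an ordinary prediction.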

[36] Optimizing Cooperative Multi-Object Tracking using Graph Signal Processing

Maria Damanaki,Nikos Piperigkos,Alexandros Gkillas,Aris S. Lalos

Main category: cs.CV

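TL;DR: This paper proposes a multi-agent cooperative multi-object tracking (MOT) framework based on graph topology-aware optimization, which fuses information from multiple vehicles to improve tracking accuracy in 3D LiDAR scenes.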

Details

Motivation: Single-agent MOT struggles to perceive the environment comprehensively because of occlusions, sensor failures, and similar issues, so fusing multi-agent information is essential for autonomous driving systems. Method: Builds a fully connected graph topology from the detected bounding boxes and applies a graph Laplacian optimization technique to smooth position errors and fuse multi-agent detections, optimizing localization and tracking accuracy in two stages. Result: Experiments on the V2V4Real dataset show that the method significantly outperforms baseline frameworks such as DMSTrack and V2V4Real. Conclusion: Through multi-agent cooperation and information fusion, the proposed framework effectively improves 3D MOT performance. Abstract: Multi-Object Tracking (MOT) plays a crucial role in autonomous driving systems, as it lays the foundations for advanced perception and precise path planning modules. Nonetheless, single-agent MOT perceives its surroundings incompletely due to occlusions, sensor failures, etc. Hence, the integration of multi-agent information is essential for a comprehensive understanding of the environment. This paper proposes a novel Cooperative MOT framework for tracking objects in 3D LiDAR scenes by formulating and solving a graph topology-aware optimization problem so as to fuse information coming from multiple vehicles. By exploiting a fully connected graph topology defined by the detected bounding boxes, we employ the Graph Laplacian processing optimization technique to smooth the position error of bounding boxes and effectively combine them. In that manner, we reveal and leverage inherent coherences of diverse multi-agent detections, and associate the refined bounding boxes to tracked objects at two stages, optimizing localization and tracking accuracies. An extensive evaluation study has been conducted, using the real-world V2V4Real dataset, where the proposed method significantly outperforms the baseline frameworks, including the state-of-the-art deep-learning DMSTrack and V2V4Real, in various testing sequences.
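
To make the idea of graph Laplacian smoothing concrete, here is a minimal sketch of generic Laplacian-regularised denoising, not the paper's exact formulation. It assumes several vehicles each report a noisy box centre for the same object, that the detections form a fully connected graph with unit edge weights, and that we minimise ||x - z||^2 + lam * x^T L x. For that graph L = n*I - 1*1^T, so the solution has the closed form used below. The function name and the weight `lam` are hypothetical.

```python
def laplacian_smooth(points, lam=1.0):
    """Smooth noisy box centres over a fully connected, unit-weight graph.

    Solves (I + lam * L) x = z per coordinate, where L = n*I - 1*1^T.
    Because sum(x) == sum(z) at the optimum, the solution is
    x_i = (z_i + lam * sum(z)) / (1 + lam * n).
    """
    n = len(points)
    dim = len(points[0])
    sums = [sum(p[d] for p in points) for d in range(dim)]
    denom = 1.0 + lam * n
    return [[(p[d] + lam * sums[d]) / denom for d in range(dim)] for p in points]
```

Each detection is pulled toward the centroid of all detections; `lam = 0` leaves the input unchanged, and larger values enforce stronger agreement between agents while preserving the centroid.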

Cheng Chen,Yunpeng Zhai,Yifan Zhao,Jinyang Gao,Bolin Ding,Jia Li

Main category: cs.CV

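TL;DR: This paper proposes a multi-modal demonstration selection method based on an exploration-exploitation reinforcement learning framework to improve the in-context learning ability of Large Vision-Language Models (LVLMs), and verifies its superior performance on four VQA datasets.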

Details

Motivation: Existing in-context learning methods rely on pre-defined demonstrations or heuristic selection strategies, which cover task requirements inadequately and ignore interactions between demonstrations, leading to sub-optimal performance. Method: Uses an exploration-exploitation reinforcement learning framework to adaptively select multi-modal demonstrations, and refines the selection policy through self-exploration. Result: Effectiveness is verified on four VQA datasets, with a significant improvement in the generalization capability of few-shot LVLMs. Conclusion: By selecting demonstrations adaptively, the method improves the in-context learning ability of LVLMs and offers a new direction for multi-modal tasks. Abstract: In-context learning (ICL), a predominant trend in instruction learning, aims at enhancing the performance of large language models by providing clear task guidance and examples, improving their capability in task understanding and execution. This paper investigates ICL on Large Vision-Language Models (LVLMs) and explores the policies of multi-modal demonstration selection. Existing research efforts in ICL face significant challenges: First, they rely on pre-defined demonstrations or heuristic selection strategies based on human intuition, which are usually inadequate for covering diverse task requirements, leading to sub-optimal solutions; Second, individually selecting each demonstration fails to model the interactions between them, resulting in information redundancy. Unlike these prevailing efforts, we propose a new exploration-exploitation reinforcement learning framework, which explores policies to fuse multi-modal information and adaptively select adequate demonstrations as an integrated whole. The framework allows LVLMs to optimize themselves by continually refining their demonstrations through self-exploration, enabling them to autonomously identify and generate the most effective selection policies for in-context learning. Experimental results verify the superior performance of our approach on four Visual Question-Answering (VQA) datasets, demonstrating its effectiveness in enhancing the generalization capability of few-shot LVLMs.

[38] Urban1960SatSeg: Unsupervised Semantic Segmentation of Mid-20$^{th}$ century Urban Landscapes with Satellite Imageries

Tianxiang Hao,Lixian Zhang,Yingjia Zhang,Mengxuan Chen,Jinxiao Zhang,Haohuan Fu

Main category: cs.CV

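TL;DR: The paper introduces the Urban1960SatBench dataset and the Urban1960SatUSM framework to address quality degradation and missing annotations in semantic segmentation of historical satellite imagery, with the aim of advancing the understanding of early urban development.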

Details

Motivation: Historical satellite imagery (such as mid-20th century Keyhole data) is a valuable resource for studying early urban development, but poor image quality and the lack of annotations hinder semantic segmentation. Method: Introduces the Urban1960SatBench dataset (annotated historical imagery) and the Urban1960SatUSM framework (an unsupervised segmentation method built on self-supervised learning that uses a confidence-aware alignment mechanism and a focal-confidence loss). Result: Urban1960SatUSM significantly outperforms existing unsupervised segmentation methods on the Urban1960SatSeg dataset and is suited to noisy historical data. Conclusion: The work provides a new tool for quantitative study of long-term urban change using modern computer vision. Abstract: Historical satellite imagery, such as mid-20$^{th}$ century Keyhole data, offers rare insights into understanding early urban development and long-term transformation. However, severe quality degradation (e.g., distortion, misalignment, and spectral scarcity) and annotation absence have long hindered semantic segmentation on such historical RS imagery. To bridge this gap and enhance understanding of urban development, we introduce $\textbf{Urban1960SatBench}$, an annotated segmentation dataset based on historical satellite imagery with the earliest observation time among all existing segmentation datasets, along with a benchmark framework for unsupervised segmentation tasks, $\textbf{Urban1960SatUSM}$. First, $\textbf{Urban1960SatBench}$ serves as a novel, expertly annotated semantic segmentation dataset built on mid-20$^{th}$ century Keyhole imagery, covering 1,240 km$^2$ and key urban classes (buildings, roads, farmland, water). As the earliest segmentation dataset of its kind, it provides a pioneering benchmark for historical urban understanding. Second, $\textbf{Urban1960SatUSM}$ (Unsupervised Segmentation Model) is a novel unsupervised semantic segmentation framework for historical RS imagery. It employs a confidence-aware alignment mechanism and focal-confidence loss based on a self-supervised learning architecture, which generates robust pseudo-labels and adaptively prioritizes prediction difficulty and label reliability to improve unsupervised segmentation on noisy historical data without manual supervision.
Experiments show that Urban1960SatUSM significantly outperforms existing unsupervised segmentation methods on Urban1960SatSeg for segmenting historical urban scenes, which is a promising step toward quantitative studies of long-term urban change using modern computer vision. Our benchmark and supplementary material are available at https://github.com/Tianxiang-Hao/Urban1960SatSeg.

[39] TinySplat: Feedforward Approach for Generating Compact 3D Scene Representation

Zetian Song,Jiaye Fu,Jiaqi Zhang,Xiaohan Lu,Chuanmin Jia,Siwei Ma,Wen Gao

Main category: cs.CV

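TL;DR: TinySplat proposes a new feedforward approach for generating compact 3D scene representations; by eliminating redundancy it achieves over 100x compression while greatly reducing encoding and decoding time.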

Details

Motivation: Existing feedforward 3D Gaussian Splatting (3DGS) methods reconstruct quickly but incur high storage costs, and existing compression methods cannot be applied because of architectural incompatibilities. Method: TinySplat integrates a training-free compression framework consisting of View-Projection Transformation (VPT) to reduce geometric redundancy, Visibility-Aware Basis Reduction (VABR) to reduce perceptual redundancy, and a video codec to reduce spatial redundancy. Result: On multiple benchmark datasets, TinySplat achieves over 100x compression, needs only 6% of the storage of the best existing compression method at comparable quality, and reduces encoding and decoding time by 75% and 99%, respectively. Conclusion: TinySplat is an efficient feedforward compression method that significantly lowers the storage and computation cost of 3D Gaussian data. Abstract: The recent development of feedforward 3D Gaussian Splatting (3DGS) presents a new paradigm to reconstruct 3D scenes. Using neural networks trained on large-scale multi-view datasets, it can directly infer 3DGS representations from sparse input views. Although the feedforward approach achieves high reconstruction speed, it still suffers from the substantial storage cost of 3D Gaussians. Existing 3DGS compression methods relying on scene-wise optimization are not applicable due to architectural incompatibilities. To overcome this limitation, we propose TinySplat, a complete feedforward approach for generating compact 3D scene representations. Built upon standard feedforward 3DGS methods, TinySplat integrates a training-free compression framework that systematically eliminates key sources of redundancy. Specifically, we introduce View-Projection Transformation (VPT) to reduce geometric redundancy by projecting geometric parameters into a more compact space. We further present Visibility-Aware Basis Reduction (VABR), which mitigates perceptual redundancy by aligning feature energy along dominant viewing directions via basis transformation. Lastly, spatial redundancy is addressed through an off-the-shelf video codec. Comprehensive experimental results on multiple benchmark datasets demonstrate that TinySplat achieves over 100x compression for 3D Gaussian data generated by feedforward methods. Compared to the state-of-the-art compression approach, we achieve comparable quality with only 6% of the storage size. Meanwhile, our compression framework requires only 25% of the encoding time and 1% of the decoding time.

[40] Marrying Autoregressive Transformer and Diffusion with Multi-Reference Autoregression

Dingcheng Zhen,Qian Qiao,Tan Yu,Kangxi Wu,Ziwei Zhang,Siyuan Liu,Shunshun Yin,Ming Tao

Main category: cs.CV

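TL;DR: TransDiff combines an autoregressive Transformer with a diffusion model to significantly improve image generation performance, and introduces Multi-Reference Autoregression (MRAR) to further improve generation quality.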

Details

Motivation: Combine the strengths of autoregressive Transformers and diffusion models to improve the quality and efficiency of image generation. Method: TransDiff uses a joint modeling framework that encodes labels and images into high-level semantic features and employs a diffusion model to estimate the distribution of image samples. MRAR generates images autoregressively while referencing multiple previously generated images. Result: On the ImageNet 256x256 benchmark, TransDiff reaches an FID of 1.61 and an IS of 293.4, with inference significantly faster than other methods. MRAR further lowers the FID to 1.42. Conclusion: TransDiff opens a new direction for image generation, and MRAR improves its performance further. Abstract: We introduce TransDiff, the first image generation model that marries Autoregressive (AR) Transformer with diffusion models. In this joint modeling framework, TransDiff encodes labels and images into high-level semantic features and employs a diffusion model to estimate the distribution of image samples. On the ImageNet 256x256 benchmark, TransDiff significantly outperforms other image generation models based on standalone AR Transformer or diffusion models. Specifically, TransDiff achieves a Fr\'echet Inception Distance (FID) of 1.61 and an Inception Score (IS) of 293.4, and further provides x2 faster inference latency compared to state-of-the-art methods based on AR Transformer and x112 faster inference compared to diffusion-only models. Furthermore, building on the TransDiff model, we introduce a novel image generation paradigm called Multi-Reference Autoregression (MRAR), which performs autoregressive generation by predicting the next image. MRAR enables the model to reference multiple previously generated images, thereby facilitating the learning of more diverse representations and improving the quality of generated images in subsequent iterations. By applying MRAR, the performance of TransDiff is improved, with the FID reduced from 1.61 to 1.42. We expect TransDiff to open up a new frontier in the field of image generation.

[41] Generalized Gaussian Entropy Model for Point Cloud Attribute Compression with Dynamic Likelihood Intervals

Changhao Peng,Yuqi Ye,Wei Gao

Main category: cs.CV

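TL;DR: The paper proposes a generalized Gaussian entropy model and a method for dynamically adjusting likelihood intervals, significantly improving point cloud attribute compression.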

Details

Motivation: Existing methods leave information in the estimated entropy parameters unused and are limited by fixed likelihood intervals. Method: Introduces a generalized Gaussian entropy model that controls the tail shape, and proposes a Mean Error Discriminator to adjust likelihood intervals dynamically. Result: Significantly improves rate-distortion performance on three VAE-based point cloud attribute compression models. Conclusion: The method applies not only to point cloud compression but can also be extended to image and video compression tasks. Abstract: Gaussian and Laplacian entropy models have proved effective in learned point cloud attribute compression, as they assist in arithmetic coding of latents. However, we demonstrate through experiments that there is still unutilized information in entropy parameters estimated by neural networks in current methods, which can be used for more accurate probability estimation. Thus we introduce a generalized Gaussian entropy model, which controls the tail shape through a shape parameter to more accurately estimate the probability of latents. Meanwhile, to the best of our knowledge, existing methods use fixed likelihood intervals for each integer during arithmetic coding, which limits model performance. We propose a Mean Error Discriminator (MED) to determine whether the entropy parameter estimation is accurate and then dynamically adjust likelihood intervals. Experiments show that our method significantly improves rate-distortion (RD) performance on three VAE-based models for point cloud attribute compression, and our method can be applied to other compression tasks, such as image and video compression.
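
For readers unfamiliar with the generalized Gaussian family, the following is a sketch of the standard textbook definitions, not the paper's exact parameterisation. The symbols $\mu$ (mean), $\alpha$ (scale), $\beta$ (shape), and the interval half-width $\delta$ are notation assumed here for illustration, as is the symmetric form of the interval; $\delta = 1/2$ corresponds to the conventional fixed interval around each integer, and how MED actually chooses the interval is not specified in the abstract.

```latex
% Density: \beta = 2 recovers a Gaussian (with \sigma^2 = \alpha^2 / 2),
% \beta = 1 recovers a Laplacian with scale \alpha.
f(y;\mu,\alpha,\beta) = \frac{\beta}{2\alpha\,\Gamma(1/\beta)}
    \exp\!\left(-\left(\frac{|y-\mu|}{\alpha}\right)^{\beta}\right)

% CDF, with P(a,x) = \gamma(a,x)/\Gamma(a) the regularized lower incomplete gamma function.
F(y) = \frac{1}{2} + \frac{\operatorname{sgn}(y-\mu)}{2}\,
    P\!\left(\frac{1}{\beta},\ \left(\frac{|y-\mu|}{\alpha}\right)^{\beta}\right)

% Likelihood assigned to a quantized latent \hat{y} for arithmetic coding.
p(\hat{y}) = F(\hat{y}+\delta) - F(\hat{y}-\delta), \qquad \delta = \tfrac{1}{2} \text{ in the fixed-interval case}
```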

[42] HAIF-GS: Hierarchical and Induced Flow-Guided Gaussian Splatting for Dynamic Scene

Jianing Chen,Zehao Li,Yujun Cai,Hao Jiang,Chengxuan Qian,Juyuan Kang,Shuqin Gao,Honglong Zhao,Tianlu Mao,Yucheng Zhang

Main category: cs.CV

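TL;DR: HAIF-GS is a dynamic 3D scene reconstruction framework that uses sparse anchor-driven deformation to address redundant updates, insufficient motion supervision, and weak modeling of non-rigid deformation in existing methods.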

Details

Motivation: Reconstructing dynamic 3D scenes from monocular video remains challenging, and existing methods suffer from redundant updates, insufficient motion supervision, and weak modeling of non-rigid deformation. Method: HAIF-GS identifies motion-relevant regions with an Anchor Filter, and handles complex deformations with a self-supervised flow-guided deformation module and a multi-level anchor propagation mechanism. Result: Experiments show that HAIF-GS significantly outperforms existing dynamic 3DGS methods in rendering quality, temporal coherence, and reconstruction efficiency. Conclusion: Through structured dynamic modeling and an efficient deformation mechanism, HAIF-GS significantly improves dynamic 3D scene reconstruction. Abstract: Reconstructing dynamic 3D scenes from monocular videos remains a fundamental challenge in 3D vision. While 3D Gaussian Splatting (3DGS) achieves real-time rendering in static settings, extending it to dynamic scenes is challenging due to the difficulty of learning structured and temporally consistent motion representations. This challenge often manifests as three limitations in existing methods: redundant Gaussian updates, insufficient motion supervision, and weak modeling of complex non-rigid deformations. These issues collectively hinder coherent and efficient dynamic reconstruction. To address these limitations, we propose HAIF-GS, a unified framework that enables structured and consistent dynamic modeling through sparse anchor-driven deformation. It first identifies motion-relevant regions via an Anchor Filter to suppress redundant updates in static areas. A self-supervised Induced Flow-Guided Deformation module induces anchor motion using multi-frame feature aggregation, eliminating the need for explicit flow labels. To further handle fine-grained deformations, a Hierarchical Anchor Propagation mechanism increases anchor resolution based on motion complexity and propagates multi-level transformations. Extensive experiments on synthetic and real-world benchmarks validate that HAIF-GS significantly outperforms prior dynamic 3DGS methods in rendering quality, temporal coherence, and reconstruction efficiency.

[43] Revisit What You See: Disclose Language Prior in Vision Tokens for Efficient Guided Decoding of LVLMs

Beomsik Cho,Jaehyung Kim

Main category: cs.CV

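TL;DR: ReVisiT is a simple yet effective decoding method that references vision tokens to guide text generation in Large Vision-Language Models (LVLMs), improving the use of visual information.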

Details

Motivation: Conventional LVLM decoding strategies fail to make full use of visual information, producing visually ungrounded responses, and existing remedies usually require additional training or multi-step inference. Method: ReVisiT projects vision tokens into the text token distribution space, dynamically selects the most relevant vision token through constrained divergence minimization, and uses it to refine the output distribution. Result: On three LVLM hallucination benchmarks, ReVisiT significantly improves visual grounding with very low computational overhead, matching or surpassing the best existing methods. Conclusion: ReVisiT is an efficient method that significantly improves how LVLMs use visual information without additional training or complex inference. Abstract: Large Vision-Language Models (LVLMs) have demonstrated remarkable performance across various multimodal tasks by integrating visual perception with language understanding. However, conventional decoding strategies of LVLMs often fail to successfully utilize visual information, leading to visually ungrounded responses. While various approaches have been proposed to address this limitation, they typically require additional training, multi-step inference procedures, or external model dependencies. This paper introduces ReVisiT, a simple yet effective decoding method that references vision tokens to guide the text generation process in LVLMs. Our approach leverages the semantic information embedded within vision tokens by projecting them into the text token distribution space, and dynamically selecting the most relevant vision token at each decoding step through constrained divergence minimization. This selected vision token is then used to refine the output distribution to better incorporate visual semantics. Experiments on three LVLM hallucination benchmarks with two recent LVLMs demonstrate that ReVisiT consistently enhances visual grounding with minimal computational overhead. Moreover, our method achieves competitive or superior results relative to state-of-the-art baselines while reducing computational costs by up to $2\times$.

[44] Gaussian Herding across Pens: An Optimal Transport Perspective on Global Gaussian Reduction for 3DGS

Tao Wang,Mengyu Li,Geduo Zeng,Cheng Meng,Qiong Zhang

Main category: cs.CV

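TL;DR: 3D Gaussian Splatting (3DGS) is an efficient radiance field rendering technique, but it typically needs a large number of redundant Gaussian primitives, which consumes substantial memory and rendering resources. This paper proposes a global Gaussian mixture reduction method based on optimal transport that significantly reduces the number of Gaussian primitives while preserving rendering quality.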

Details

Motivation: Existing 3DGS compaction methods rely on heuristic importance scores and offer no global fidelity guarantee. This paper addresses the problem from an optimal transport perspective to compact Gaussian primitives efficiently and faithfully. Method: First minimizes the composite transport divergence over a KD-tree partition to produce a compact geometric representation, then decouples appearance from geometry by fine-tuning color and opacity attributes with far fewer Gaussian primitives. Result: Experiments show that the method reaches rendering quality (PSNR, SSIM, LPIPS) comparable to vanilla 3DGS with only 10% of the Gaussian primitives and outperforms existing compaction techniques. Conclusion: The method applies to any 3DGS pipeline and provides an efficient, general solution for lightweight neural rendering. Abstract: 3D Gaussian Splatting (3DGS) has emerged as a powerful technique for radiance field rendering, but it typically requires millions of redundant Gaussian primitives, overwhelming memory and rendering budgets. Existing compaction approaches address this by pruning Gaussians based on heuristic importance scores, without global fidelity guarantee. To bridge this gap, we propose a novel optimal transport perspective that casts 3DGS compaction as global Gaussian mixture reduction. Specifically, we first minimize the composite transport divergence over a KD-tree partition to produce a compact geometric representation, and then decouple appearance from geometry by fine-tuning color and opacity attributes with far fewer Gaussian primitives. Experiments on benchmark datasets show that our method (i) yields negligible loss in rendering quality (PSNR, SSIM, LPIPS) compared to vanilla 3DGS with only 10% Gaussians; and (ii) consistently outperforms state-of-the-art 3DGS compaction techniques. Notably, our method is applicable to any stage of vanilla or accelerated 3DGS pipelines, providing an efficient and agnostic pathway to lightweight neural rendering.
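
Gaussian mixture reduction replaces groups of components with fewer ones. As a minimal sketch of one classic building block, the code below merges two weighted Gaussians while preserving the mixture's overall weight, mean, and covariance (moment matching). This is a stand-in for illustration only: it is not the paper's composite transport divergence minimization, and the function name is hypothetical.

```python
def merge_gaussians(w1, mu1, cov1, w2, mu2, cov2):
    """Moment-preserving merge of two weighted Gaussian components.

    mu* are lists of length d, cov* are d x d nested lists.
    Returns (weight, mean, covariance) of the single merged component.
    """
    w = w1 + w2
    if w <= 0:
        raise ValueError("total weight must be positive")
    dim = len(mu1)
    mu = [(w1 * mu1[i] + w2 * mu2[i]) / w for i in range(dim)]
    # Offsets of the original means from the merged mean.
    d1 = [mu1[i] - mu[i] for i in range(dim)]
    d2 = [mu2[i] - mu[i] for i in range(dim)]
    # Merged covariance = weighted within-component covariance
    # plus the spread of the component means around the merged mean.
    cov = [
        [
            (w1 * (cov1[i][j] + d1[i] * d1[j]) + w2 * (cov2[i][j] + d2[i] * d2[j])) / w
            for j in range(dim)
        ]
        for i in range(dim)
    ]
    return w, mu, cov
```

Merging two equal-weight unit-variance components centred at 0 and 2 gives a single component at 1 whose variance grows to account for the distance between the original means.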

[45] AngleRoCL: Angle-Robust Concept Learning for Physically View-Invariant T2I Adversarial Patches

Wenjun Ji,Yuxiang Fu,Luyang Ying,Deng-Ping Fan,Yuyi Wang,Ming-Ming Cheng,Ivor Tsang,Qing Guo

Main category: cs.CV

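TL;DR: This paper studies the angle robustness of adversarial patches generated by text-to-image (T2I) diffusion models and proposes Angle-Robust Concept Learning (AngleRoCL), which significantly improves the attack effectiveness of patches across viewing angles.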

Details

Motivation: Existing methods overlook how effective T2I adversarial patches are when viewed from different angles in the physical world; this paper addresses that gap and shows how text affects the angle robustness of patches. Method: Proposes AngleRoCL, which learns a generalizable concept (text embeddings) for generating angle-robust adversarial patches and incorporates it into textual prompts. Result: Experiments show that AngleRoCL significantly improves angle robustness, keeping attack success rates high under challenging viewing angles, with an average relative improvement of over 50%. Conclusion: The study deepens the understanding of physically angle-robust patches and reveals the relationship between textual concepts and the physical properties of T2I-generated content. Abstract: Cutting-edge works have demonstrated that text-to-image (T2I) diffusion models can generate adversarial patches that mislead state-of-the-art object detectors in the physical world, revealing detectors' vulnerabilities and risks. However, these methods neglect the T2I patches' attack effectiveness when observed from different views in the physical world (i.e., angle robustness of the T2I adversarial patches). In this paper, we study the angle robustness of T2I adversarial patches comprehensively, revealing their angle-robustness issues, demonstrating that texts affect the angle robustness of generated patches significantly, and showing that task-specific linguistic instructions fail to enhance the angle robustness. Motivated by the studies, we introduce Angle-Robust Concept Learning (AngleRoCL), a simple and flexible approach that learns a generalizable concept (i.e., text embeddings in implementation) representing the capability of generating angle-robust patches. The learned concept can be incorporated into textual prompts and guides T2I models to generate patches with their attack effectiveness inherently resistant to viewpoint variations. Through extensive simulation and physical-world experiments on five SOTA detectors across multiple views, we demonstrate that AngleRoCL significantly enhances the angle robustness of T2I adversarial patches compared to baseline methods. Our patches maintain high attack success rates even under challenging viewing conditions, with over 50% average relative improvement in attack effectiveness across multiple angles.
This research advances the understanding of physically angle-robust patches and provides insights into the relationship between textual concepts and physical properties in T2I-generated contents.

[46] 3DGeoDet: General-purpose Geometry-aware Image-based 3D Object Detection

Yi Zhang,Yi Wang,Yawen Cui,Lap-Pui Chau

Main category: cs.CV

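TL;DR: 3DGeoDet is a novel geometry-aware 3D object detection method that improves performance through explicit and implicit 3D geometric representations, without supervision from 3D signals.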

Details

Motivation: Address the lack of 3D geometric cues in image-based 3D object detection. Method: Uses predicted depth to build explicit (voxel occupancy) and implicit (TSDF) 3D representations, refined with voxel occupancy attention. Result: Performs strongly on multiple benchmark datasets, for example a 9.3 mAP@0.5 improvement on SUN RGB-D. Conclusion: 3DGeoDet significantly improves 3D detection through geometric representations and works across diverse environments. Abstract: This paper proposes 3DGeoDet, a novel geometry-aware 3D object detection approach that effectively handles single- and multi-view RGB images in indoor and outdoor environments, showcasing its general-purpose applicability. The key challenge for image-based 3D object detection tasks is the lack of 3D geometric cues, which leads to ambiguity in establishing correspondences between images and 3D representations. To tackle this problem, 3DGeoDet generates efficient 3D geometric representations in both explicit and implicit manners based on predicted depth information. Specifically, we utilize the predicted depth to learn voxel occupancy and optimize the voxelized 3D feature volume explicitly through the proposed voxel occupancy attention. To further enhance 3D awareness, the feature volume is integrated with an implicit 3D representation, the truncated signed distance function (TSDF). Without requiring supervision from 3D signals, we significantly improve the model's comprehension of 3D geometry by leveraging intermediate 3D representations and achieve end-to-end training. Our approach surpasses the performance of state-of-the-art image-based methods on both single- and multi-view benchmark datasets across diverse environments, achieving a 9.3 mAP@0.5 improvement on the SUN RGB-D dataset, a 3.3 mAP@0.5 improvement on the ScanNetV2 dataset, and a 0.19 AP3D@0.7 improvement on the KITTI dataset. The project page is available at: https://cindy0725.github.io/3DGeoDet/.
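
For context on the implicit representation mentioned above, here is a minimal sketch of how a projective truncated signed distance value is commonly computed from a depth estimate. It illustrates the general TSDF definition only; the function name, the normalisation to [-1, 1], and the sign convention (positive in front of the surface) are assumptions and may differ from what 3DGeoDet uses.

```python
def tsdf_value(surface_depth, voxel_depth, truncation):
    """Projective TSDF for one voxel along a camera ray.

    surface_depth: (predicted) depth of the surface along the ray.
    voxel_depth:   depth of the voxel centre along the same ray.
    truncation:    truncation distance in the same unit, must be > 0.
    Returns a value in [-1, 1]: positive in front of the surface,
    zero on it, negative behind it.
    """
    if truncation <= 0:
        raise ValueError("truncation must be positive")
    signed_distance = surface_depth - voxel_depth
    return max(-1.0, min(1.0, signed_distance / truncation))
```

Voxels far from the surface saturate at +1 or -1, so only a thin band around the predicted surface carries graded geometric information.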

[47] GLD-Road:A global-local decoding road network extraction model for remote sensing images

Ligao Deng,Yupeng Deng,Yu Meng,Jingbo Chen,Zhihao Xi,Diyou Liu,Qifeng Chu

Main category: cs.CV

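TL;DR: GLD-Road is a two-stage model that combines global efficiency with local precision to extract road networks efficiently, significantly improving performance and reducing computation time.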

Details

Motivation: Road networks are crucial for mapping, autonomous driving, and disaster response, but manual annotation is costly and existing deep learning methods suffer from errors or inefficiency. Method: GLD-Road takes a two-stage approach: it first detects road nodes globally and connects them, then iteratively refines broken roads locally. Result: Experiments show that GLD-Road outperforms existing methods on the APLS metric (by 1.9% on City-Scale and 0.67% on SpaceNet3) and significantly reduces retrieval time (40% faster than Sat2Graph and 92% faster than RNGDet++). Conclusion: GLD-Road balances efficiency and precision in road network extraction and has practical potential. Abstract: Road networks are crucial for mapping, autonomous driving, and disaster response. While manual annotation is costly, deep learning offers efficient extraction. Current methods include postprocessing (prone to errors), global parallel (fast but misses nodes), and local iterative (accurate but slow). We propose GLD-Road, a two-stage model combining global efficiency and local precision. First, it detects road nodes and connects them via a Connect Module. Then, it iteratively refines broken roads using local searches, drastically reducing computation. Experiments show GLD-Road outperforms state-of-the-art methods, improving APLS by 1.9% (City-Scale) and 0.67% (SpaceNet3). It also reduces retrieval time by 40% vs. Sat2Graph (global) and 92% vs. RNGDet++ (local). The experimental results are available at https://github.com/ucas-dlg/GLD-Road.

[48] AD^2-Bench: A Hierarchical CoT Benchmark for MLLM in Autonomous Driving under Adverse Conditions

Zhaoyang Wei,Chenhui Qiang,Bowen Jiang,Xumeng Han,Xuehui Yu,Zhenjun Han

Main category: cs.CV

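TL;DR: AD^2-Bench is the first Chain-of-Thought (CoT) benchmark for autonomous driving under adverse weather and complex scenes. It contains 5.4k high-quality annotated instances, supports multi-step reasoning evaluation, and shows that current MLLMs reach less than 60% accuracy.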

Details

Motivation: Existing benchmarks overlook the need for rigorous evaluation of CoT reasoning under adverse weather and complex traffic environments; AD^2-Bench fills this gap. Method: Builds AD^2-Bench with data covering diverse adverse environments, fine-grained annotations, and a dedicated evaluation framework that supports reasoning analysis under text-level, point-level, and region-level visual prompts. Result: Evaluation shows that current MLLMs score below 60% accuracy on AD^2-Bench, underlining its difficulty and the need for robust reasoning. Conclusion: AD^2-Bench provides a standardized evaluation platform for CoT reasoning in autonomous driving and drives improvement in MLLM reasoning. Abstract: Chain-of-Thought (CoT) reasoning has emerged as a powerful approach to enhance the structured, multi-step decision-making capabilities of Multi-Modal Large Models (MLLMs), and is particularly crucial for autonomous driving with adverse weather conditions and complex traffic environments. However, existing benchmarks have largely overlooked the need for rigorous evaluation of CoT processes in these specific and challenging scenarios. To address this critical gap, we introduce AD^2-Bench, the first Chain-of-Thought benchmark specifically designed for autonomous driving with adverse weather and complex scenes. AD^2-Bench is meticulously constructed to fulfill three key criteria: comprehensive data coverage across diverse adverse environments, fine-grained annotations that support multi-step reasoning, and a dedicated evaluation framework tailored for assessing CoT performance. The core contribution of AD^2-Bench is its extensive collection of over 5.4k high-quality, manually annotated CoT instances. Each intermediate reasoning step in these annotations is treated as an atomic unit with explicit ground truth, enabling unprecedented fine-grained analysis of MLLMs' inferential processes under text-level, point-level, and region-level visual prompts. Our comprehensive evaluation of state-of-the-art MLLMs on AD^2-Bench reveals accuracy below 60%, highlighting the benchmark's difficulty and the need to advance robust, interpretable end-to-end autonomous driving systems. AD^2-Bench thus provides a standardized evaluation platform, driving research forward by improving MLLMs' reasoning in autonomous driving, making it an invaluable resource.

[49] SemanticSplat: Feed-Forward 3D Scene Understanding with Language-Aware Gaussian Fields

Qijing Li,Jingxiang Sun,Liang An,Zhaoqi Su,Hongwen Zhang,Yebin Liu

Main category: cs.CV

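TL;DR: SemanticSplat proposes a feed-forward method based on 3D Gaussians with semantic attributes that jointly models geometry, appearance, and semantics, addressing the shortcomings of existing methods in semantic extraction and geometry reconstruction.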

Details

Motivation: Existing methods perform poorly at semantic extraction and geometry reconstruction, and their reliance on dense input views limits practicality. Method: SemanticSplat fuses multiple feature fields (such as LSeg and SAM) with a cost volume representation to predict semantic anisotropic Gaussians, and uses a two-stage distillation framework to reconstruct a multi-modal semantic feature field from sparse-view images. Result: Experiments show that the method performs well on 3D scene understanding tasks such as promptable and open-vocabulary segmentation. Conclusion: By jointly modeling geometry, appearance, and semantics, SemanticSplat achieves more holistic scene understanding. Abstract: Holistic 3D scene understanding, which jointly models geometry, appearance, and semantics, is crucial for applications like augmented reality and robotic interaction. Existing feed-forward 3D scene understanding methods (e.g., LSM) are limited to extracting language-based semantics from scenes, failing to achieve holistic scene comprehension. Additionally, they suffer from low-quality geometry reconstruction and noisy artifacts. In contrast, per-scene optimization methods rely on dense input views, which reduces practicality and increases complexity during deployment. In this paper, we propose SemanticSplat, a feed-forward semantic-aware 3D reconstruction method, which unifies 3D Gaussians with latent semantic attributes for joint geometry-appearance-semantics modeling. To predict the semantic anisotropic Gaussians, SemanticSplat fuses diverse feature fields (e.g., LSeg, SAM) with a cost volume representation that stores cross-view feature similarities, enhancing coherent and accurate scene comprehension. Leveraging a two-stage distillation framework, SemanticSplat reconstructs a holistic multi-modal semantic feature field from sparse-view images. Experiments demonstrate the effectiveness of our method for 3D scene understanding tasks like promptable and open-vocabulary segmentation. Video results are available at https://semanticsplat.github.io.

[50] Consistent Story Generation with Asymmetry Zigzag Sampling

Mingxiao LI,mang ning,Marie-Francine Moens

Main category: cs.CV

TL;DR: Proposes a training-free sampling strategy called Zigzag Sampling with Asymmetric Prompts and Visual Sharing to improve subject consistency in visual story generation.

Details

Motivation: Existing methods struggle to keep the subject consistent across multiple generated images, and current solutions are either resource-intensive or of limited effectiveness. Method: Uses a zigzag sampling mechanism that alternates between asymmetric prompts, together with a visual sharing module, to retain subject characteristics and strengthen consistency. Result: Experiments show that the method significantly outperforms previous approaches in generating coherent and consistent visual stories. Conclusion: The method provides an efficient way to improve consistency in visual story generation without additional training. Abstract: Text-to-image generation models have made significant progress in producing high-quality images from textual descriptions, yet they continue to struggle with maintaining subject consistency across multiple images, a fundamental requirement for visual storytelling. Existing methods attempt to address this by either fine-tuning models on large-scale story visualization datasets, which is resource-intensive, or by using training-free techniques that share information across generations, which still yield limited success. In this paper, we introduce a novel training-free sampling strategy called Zigzag Sampling with Asymmetric Prompts and Visual Sharing to enhance subject consistency in visual story generation. Our approach uses a zigzag sampling mechanism that alternates between asymmetric prompts to retain subject characteristics, while a visual sharing module transfers visual cues across generated images to further enforce consistency. Experimental results, based on both quantitative metrics and qualitative evaluations, demonstrate that our method significantly outperforms previous approaches in generating coherent and consistent visual stories. The code is available at https://github.com/Mingxiao-Li/Asymmetry-Zigzag-StoryDiffusion.

[51] ECAM: A Contrastive Learning Approach to Avoid Environmental Collision in Trajectory Forecasting

Giacomo Rosin,Muhammad Rameez Ur Rahman,Sebastiano Vascon

Main category: cs.CV

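TL;DR: The paper proposes the ECAM module, which uses contrastive learning to strengthen the obstacle-avoidance ability of trajectory forecasting models and significantly lowers the collision rate.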

Details

Motivation: Existing trajectory forecasting methods often overlook the environment, which leads to collisions, so their obstacle-avoidance ability needs to improve. Method: Introduces the ECAM module, which is based on contrastive learning and can be integrated into existing models. Result: Validated on the ETH/UCY dataset, with the collision rate reduced by 40-50%. Conclusion: The ECAM module effectively improves collision avoidance in trajectory forecasting, and the code is open source. Abstract: Human trajectory forecasting is crucial in applications such as autonomous driving, robotics and surveillance. Accurate forecasting requires models to consider various factors, including social interactions, multi-modal predictions, pedestrian intention and environmental context. While existing methods account for these factors, they often overlook the impact of the environment, which leads to collisions with obstacles. This paper introduces ECAM (Environmental Collision Avoidance Module), a contrastive learning-based module to enhance collision avoidance ability with the environment. The proposed module can be integrated into existing trajectory forecasting models, improving their ability to generate collision-free predictions. We evaluate our method on the ETH/UCY dataset and quantitatively and qualitatively demonstrate its collision avoidance capabilities. Our experiments show that state-of-the-art methods significantly reduce (-40/50%) the collision rate when integrated with the proposed module. The code is available at https://github.com/CVML-CFU/ECAM.

[52] HSENet: Hybrid Spatial Encoding Network for 3D Medical Vision-Language Understanding

Yanzhao Shi,Xiaodan Zhang,Junzhong Ji,Haoning Jiang,Chengxin Zheng,Yinong Wang,Liangqiong Qu

Main category: cs.CV

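TL;DR: HSENet proposes a new framework that uses dual 3D vision encoders and spatial compression to better connect 3D medical imaging with language models, significantly improving diagnostic accuracy.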

Details

Motivation: Existing multimodal large language models mainly target 2D medical images and cannot adequately capture complex 3D anatomical structures, which leads to diagnostic errors. Method: HSENet uses dual 3D vision encoders to perceive global and fine-grained information, and a Spatial Packer that compresses high-resolution 3D spatial regions into compact visual tokens. Result: HSENet achieves significant gains in 3D vision-language retrieval, medical report generation, and visual question answering. Conclusion: Through efficient 3D visual encoding and compression, HSENet significantly improves diagnostic accuracy on 3D medical images and has broad application potential. Abstract: Automated 3D CT diagnosis empowers clinicians to make timely, evidence-based decisions by enhancing diagnostic accuracy and workflow efficiency. While multimodal large language models (MLLMs) exhibit promising performance in visual-language understanding, existing methods mainly focus on 2D medical images, which fundamentally limits their ability to capture complex 3D anatomical structures. This limitation often leads to misinterpretation of subtle pathologies and causes diagnostic hallucinations. In this paper, we present Hybrid Spatial Encoding Network (HSENet), a framework that exploits enriched 3D medical visual cues by effective visual perception and projection for accurate and robust vision-language understanding. Specifically, HSENet employs dual-3D vision encoders to perceive both global volumetric contexts and fine-grained anatomical details, which are pre-trained by dual-stage alignment with diagnostic reports. Furthermore, we propose Spatial Packer, an efficient multimodal projector that condenses high-resolution 3D spatial regions into a compact set of informative visual tokens via centroid-based compression. By assigning spatial packers with dual-3D vision encoders, HSENet can seamlessly perceive and transfer hybrid visual representations to LLM's semantic space, facilitating accurate diagnostic text generation. Experimental results demonstrate that our method achieves state-of-the-art performance in 3D language-visual retrieval (39.85% of R@100, +5.96% gain), 3D medical report generation (24.01% of BLEU-4, +8.01% gain), and 3D visual question answering (73.60% of Major Class Accuracy, +1.99% gain), confirming its effectiveness. Our code is available at https://github.com/YanzhaoShi/HSENet.

[53] DGAE: Diffusion-Guided Autoencoder for Efficient Latent Representation Learning

Dongxu Liu,Yuang Peng,Haomiao Tang,Yuwei Chen,Chunrui Han,Zheng Ge,Daxin Jiang,Mingxue Liao

Main category: cs.CV

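TL;DR: DGAE is a diffusion-guided autoencoder that strengthens the decoder's expressiveness to counter performance degradation under high compression ratios while achieving a more compact latent representation.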

Details

Motivation: Address the performance degradation of autoencoders under high compression ratios and the training instability caused by GANs, while reducing latent dimensionality for more efficient representations. Method: DGAE uses a diffusion model to guide the decoder in recovering informative signals that are not fully decoded from the latent representation. Result: DGAE mitigates performance degradation under high spatial compression rates, uses a 2x smaller latent space, and performs competitively on ImageNet-1K image generation when combined with diffusion models. Conclusion: By improving decoder expressiveness, DGAE yields efficient, compact latent representations and speeds up diffusion model convergence. Abstract: Autoencoders empower state-of-the-art image and video generative models by compressing pixels into a latent space through visual tokenization. Although recent advances have alleviated the performance degradation of autoencoders under high compression ratios, addressing the training instability caused by GAN remains an open challenge. While improving spatial compression, we also aim to minimize the latent space dimensionality, enabling more efficient and compact representations. To tackle these challenges, we focus on improving the decoder's expressiveness. Concretely, we propose DGAE, which employs a diffusion model to guide the decoder in recovering informative signals that are not fully decoded from the latent representation. With this design, DGAE effectively mitigates the performance degradation under high spatial compression rates. At the same time, DGAE achieves state-of-the-art performance with a 2x smaller latent space. When integrated with Diffusion Models, DGAE demonstrates competitive performance on image generation for ImageNet-1K and shows that this compact latent representation facilitates faster convergence of the diffusion model.

[54] HopaDIFF: Holistic-Partial Aware Fourier Conditioned Diffusion for Referring Human Action Segmentation in Multi-Person Scenarios

Kunyu Peng,Junchao Huang,Xiangsheng Huang,Di Wen,Junwei Zheng,Yufan Chen,Kailun Yang,Jiamin Wu,Chongqing Hao,Rainer Stiefelhagen

Main category: cs.CV

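TL;DR: The paper introduces text-referred human action segmentation in multi-person scenes and builds RHAS133, the first dataset for this task. The proposed HopaDIFF framework combines Fourier conditioning with an attention mechanism and achieves state-of-the-art performance.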

Details

Motivation: Existing action segmentation methods mainly handle single-person activities with fixed action sequences and overlook multi-person scenes. This work fills that gap by using a textual description to guide action segmentation of the target person. Method: The HopaDIFF framework combines a cross-input gate attentional xLSTM with a Fourier condition to strengthen holistic-partial long-range reasoning and fine-grained control. Result: On RHAS133, HopaDIFF reaches state-of-the-art performance across diverse evaluation settings. Conclusion: HopaDIFF offers an effective solution for multi-person action segmentation, and the new dataset and framework advance the field. Abstract: Action segmentation is a core challenge in high-level video understanding, aiming to partition untrimmed videos into segments and assign each a label from a predefined action set. Existing methods primarily address single-person activities with fixed action sequences, overlooking multi-person scenarios. In this work, we pioneer textual reference-guided human action segmentation in multi-person settings, where a textual description specifies the target person for segmentation. We introduce the first dataset for Referring Human Action Segmentation, i.e., RHAS133, built from 133 movies and annotated with 137 fine-grained actions with 33h video data, together with textual descriptions for this new task. Benchmarking existing action recognition methods on RHAS133 using VLM-based feature extractors reveals limited performance and poor aggregation of visual cues for the target person. To address this, we propose a holistic-partial aware Fourier-conditioned diffusion framework, i.e., HopaDIFF, leveraging a novel cross-input gate attentional xLSTM to enhance holistic-partial long-range reasoning and a novel Fourier condition to introduce more fine-grained control to improve the action segmentation generation. HopaDIFF achieves state-of-the-art results on RHAS133 in diverse evaluation settings. The code is available at https://github.com/KPeng9510/HopaDIFF.git.

[55] Self-Supervised Multi-Part Articulated Objects Modeling via Deformable Gaussian Splatting and Progressive Primitive Segmentation

Haowen Wang,Xiaoping Yuan,Zhao Jin,Zhen Zhao,Zhengping Che,Yousong Xue,Jin Tian,Yakun Huang,Jian Tang

Main category: cs.CV

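TL;DR: DeGSS is a unified framework that encodes the geometry, appearance, and motion of articulated objects as deformable 3D Gaussian fields, supporting unsupervised part segmentation and precise modeling.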

Details

Motivation: Articulated objects are ubiquitous in the real world, but existing methods struggle to build a unified representation without human annotation. Method: DeGSS models each interaction state as a smooth deformation of a shared field and uses the deformation trajectories to drive progressive part segmentation. Result: The method performs well on synthetic and real datasets and outperforms existing methods. Conclusion: DeGSS provides a continuous, decoupled representation of articulated objects that supports part-level reconstruction and motion modeling. Abstract: Articulated objects are ubiquitous in everyday life, and accurate 3D representations of their geometry and motion are critical for numerous applications. However, in the absence of human annotation, existing approaches still struggle to build a unified representation for objects that contain multiple movable parts. We introduce DeGSS, a unified framework that encodes articulated objects as deformable 3D Gaussian fields, embedding geometry, appearance, and motion in one compact representation. Each interaction state is modeled as a smooth deformation of a shared field, and the resulting deformation trajectories guide a progressive coarse-to-fine part segmentation that identifies distinct rigid components, all in an unsupervised manner. The refined field provides a spatially continuous, fully decoupled description of every part, supporting part-level reconstruction and precise modeling of their kinematic relationships. To evaluate generalization and realism, we enlarge the synthetic PartNet-Mobility benchmark and release RS-Art, a real-to-sim dataset that pairs RGB captures with accurately reverse-engineered 3D models. Extensive experiments demonstrate that our method outperforms existing methods in both accuracy and stability.

Maik Dannecker,Vasiliki Sideri-Lampretsa,Sophie Starck,Angeline Mihailov,Mathieu Milh,Nadine Girard,Guillaume Auzias,Daniel Rueckert

Main category: cs.CV

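TL;DR: CINeMA is a framework for building high-resolution, multimodal fetal and neonatal brain atlases in low-data settings, with large gains in efficiency and flexibility.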

Details

Motivation: Studying the rapid developmental stage of fetal and neonatal brains requires high-resolution atlases, but traditional methods depend on large datasets, and data on pathologies is scarce. Method: CINeMA operates in latent space, avoiding compute-intensive image registration, and supports flexible conditioning on anatomical features. Result: CINeMA surpasses existing methods in accuracy, efficiency, and versatility, and supports tasks such as tissue segmentation and age prediction. Conclusion: CINeMA is a powerful tool for advancing brain research; the code and atlases are released. Abstract: Magnetic resonance imaging of fetal and neonatal brains reveals rapid neurodevelopment marked by substantial anatomical changes unfolding within days. Studying this critical stage of the developing human brain, therefore, requires accurate brain models, referred to as atlases, of high spatial and temporal resolution. To meet these demands, established traditional atlases and recently proposed deep learning-based methods rely on large and comprehensive datasets. This poses a major challenge for studying brains in the presence of pathologies for which data remains scarce. We address this limitation with CINeMA (Conditional Implicit Neural Multi-Modal Atlas), a novel framework for creating high-resolution, spatio-temporal, multimodal brain atlases, suitable for low-data settings. Unlike established methods, CINeMA operates in latent space, avoiding compute-intensive image registration and reducing atlas construction times from days to minutes. Furthermore, it enables flexible conditioning on anatomical features including GA, birth age, and pathologies like ventriculomegaly (VM) and agenesis of the corpus callosum (ACC). CINeMA supports downstream tasks such as tissue segmentation and age prediction, whereas its generative properties enable synthetic data creation and anatomically informed data augmentation. Surpassing state-of-the-art methods in accuracy, efficiency, and versatility, CINeMA represents a powerful tool for advancing brain research. We release the code and atlases at https://github.com/m-dannecker/CINeMA.

[57] Reasoning Models Are More Easily Gaslighted Than You Think

Bin Zhu,Hailong Yin,Jingjing Chen,Yu-Gang Jiang

Main category: cs.CV

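TL;DR: The paper evaluates three state-of-the-art reasoning models under misleading user input, finds that their accuracy drops substantially, and introduces the GaslightingBench-R benchmark to probe this vulnerability further.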

Details

Motivation: Examine how robust reasoning models are to misleading user input, a question that remains underexplored. Method: Systematically evaluate three reasoning models (OpenAI's o4-mini, Claude-3.7-Sonnet, and Gemini-2.5-Flash) on three multimodal benchmarks (MMMU, MathVista, and CharXiv), and design the GaslightingBench-R benchmark. Result: Accuracy drops by 25-29% on average under gaslighting negation prompts, and by more than 53% on average on GaslightingBench-R. Conclusion: There is a clear gap between step-by-step reasoning and belief persistence, which points to fundamental limits on the robustness of reasoning models. Abstract: Recent advances in reasoning-centric models promise improved robustness through mechanisms such as chain-of-thought prompting and test-time scaling. However, their ability to withstand misleading user input remains underexplored. In this paper, we conduct a systematic evaluation of three state-of-the-art reasoning models, i.e., OpenAI's o4-mini, Claude-3.7-Sonnet and Gemini-2.5-Flash, across three multimodal benchmarks: MMMU, MathVista, and CharXiv. Our evaluation reveals significant accuracy drops (25-29% on average) following gaslighting negation prompts, indicating that even top-tier reasoning models struggle to preserve correct answers under manipulative user feedback. Built upon the insights of the evaluation and to further probe this vulnerability, we introduce GaslightingBench-R, a new diagnostic benchmark specifically designed to evaluate reasoning models' ability to defend their beliefs under gaslighting negation prompts. Constructed by filtering and curating 1,025 challenging samples from the existing benchmarks, GaslightingBench-R induces even more dramatic failures, with accuracy drops exceeding 53% on average. Our findings reveal fundamental limitations in the robustness of reasoning models, highlighting the gap between step-by-step reasoning and belief persistence.

[58] Adding simple structure at inference improves Vision-Language Compositionality

Imanol Miranda,Ander Salaberria,Eneko Agirre,Gorka Azkune

Main category: cs.CV

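TL;DR: The paper proposes an inference-time method for improving the compositionality of dual-encoder vision-language models (VLMs): split the image and the text into parts and match the aligned parts to improve retrieval performance.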

Details

Motivation: Existing dual-encoder VLMs such as CLIP perform poorly on compositionality tasks, which limits retrieval performance. Most research focuses on training methods, while inference-time techniques have received little attention. Method: At inference, divide the image into smaller crops, extract text segments (objects, attributes, and relations), use the VLM to align image crops with text segments, and aggregate the similarity scores. Result: The method consistently improves the compositionality of the evaluated VLMs across datasets without additional training, especially for attribute-object binding. Conclusion: Inference-time techniques show promise, processing image crops is essential to the gains, and there is room to improve inference-time approaches further. Abstract: Dual encoder Vision-Language Models (VLM) such as CLIP are widely used for image-text retrieval tasks. However, those models struggle with compositionality, showing a bag-of-words-like behavior that limits their retrieval performance. Many different training approaches have been proposed to improve the vision-language compositionality capabilities of those models. In comparison, inference-time techniques have received little attention. In this paper, we propose to add simple structure at inference, where, given an image and a caption: i) we divide the image into different smaller crops, ii) we extract text segments, capturing objects, attributes and relations, iii) using a VLM, we find the image crops that better align with text segments obtaining matches, and iv) we compute the final image-text similarity aggregating the individual similarities of the matches. Based on various popular dual encoder VLMs, we evaluate our approach in controlled and natural datasets for VL compositionality. We find that our approach consistently improves the performance of evaluated VLMs without any training, which shows the potential of inference-time techniques. The results are especially good for attribute-object binding as shown in the controlled dataset. As a result of an extensive analysis: i) we show that processing image crops is actually essential for the observed gains in performance, and ii) we identify specific areas to further improve inference-time approaches.
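
Steps iii) and iv) of the abstract can be sketched as follows, assuming the crop and text-segment embeddings have already been produced by some dual-encoder VLM. The function name `structured_similarity`, the use of cosine similarity, the best-crop-per-segment matching rule, and the mean as the aggregation are all assumptions for illustration; the abstract does not specify these details.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def structured_similarity(crop_embs, segment_embs):
    """Match each text segment to its best-aligned crop, then average.

    crop_embs: list of embedding vectors, one per image crop.
    segment_embs: list of embedding vectors, one per text segment.
    Returns (score, matches) where matches holds
    (segment_index, crop_index, similarity) triples.
    """
    matches = []
    for s_idx, seg in enumerate(segment_embs):
        scores = [cosine(crop, seg) for crop in crop_embs]
        best = max(range(len(scores)), key=scores.__getitem__)
        matches.append((s_idx, best, scores[best]))
    # Assumed aggregation: mean of the matched similarities.
    score = sum(m[2] for m in matches) / len(matches)
    return score, matches


# Toy 2-D embeddings standing in for real VLM outputs.
crops = [[1.0, 0.0], [0.0, 1.0]]
segments = [[0.0, 2.0], [3.0, 0.0]]
score, matches = structured_similarity(crops, segments)
```

Compared with a single global image-caption similarity, this kind of scoring rewards captions whose individual segments each find a supporting region, which is one plausible reason it helps with attribute-object binding.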

[59] Towards Practical Alzheimer's Disease Diagnosis: A Lightweight and Interpretable Spiking Neural Model

Changwei Wu,Yifei Chen,Yuxin Du,Jinying Zong,Jie Dong,Mingxuan Liu,Yong Peng,Jin Fan,Feiwei Qin,Changmiao Wang

Main category: cs.CV

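TL;DR: The paper proposes FasterSNN, a hybrid neural architecture for early diagnosis of Alzheimer's Disease (AD) that combines biologically inspired LIF neurons, region-adaptive convolution, and multi-scale spiking attention to improve efficiency and stability.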

Details

Motivation: Early diagnosis of AD, especially at the mild cognitive impairment stage, is vital, but current practice suffers from subjective assessment and high cost. Deep learning methods are automated but energy-hungry, while SNNs suit the sparse, event-driven patterns of AD but lack expressiveness and training stability. Method: FasterSNN combines LIF neurons, region-adaptive convolution, and multi-scale spiking attention to process 3D MRI data sparsely and efficiently. Result: On benchmark datasets, FasterSNN shows competitive performance with substantially improved efficiency and stability. Conclusion: FasterSNN offers an efficient, stable approach to AD screening with practical potential. Abstract: Early diagnosis of Alzheimer's Disease (AD), especially at the mild cognitive impairment (MCI) stage, is vital yet hindered by subjective assessments and the high cost of multimodal imaging modalities. Although deep learning methods offer automated alternatives, their energy inefficiency and computational demands limit real-world deployment, particularly in resource-constrained settings. As a brain-inspired paradigm, spiking neural networks (SNNs) are inherently well-suited for modeling the sparse, event-driven patterns of neural degeneration in AD, offering a promising foundation for interpretable and low-power medical diagnostics. However, existing SNNs often suffer from weak expressiveness and unstable training, which restrict their effectiveness in complex medical tasks. To address these limitations, we propose FasterSNN, a hybrid neural architecture that integrates biologically inspired LIF neurons with region-adaptive convolution and multi-scale spiking attention. This design enables sparse, efficient processing of 3D MRI while preserving diagnostic accuracy. Experiments on benchmark datasets demonstrate that FasterSNN achieves competitive performance with substantially improved efficiency and stability, supporting its potential for practical AD screening. Our source code is available at https://github.com/wuchangw/FasterSNN.

[60] CHIP: A multi-sensor dataset for 6D pose estimation of chairs in industrial settings

Mattia Nardon,Mikel Mujika Agirre,Ander González Tomé,Daniel Sedano Algarabel,Josep Rueda Collell,Ana Paola Caro,Andrea Caraffa,Fabio Poiesi,Paul Ian Chippendale,Davide Boscaini

Main category: cs.CV

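TL;DR: CHIP is the first 6D pose estimation dataset of chairs manipulated by a robotic arm in a real industrial environment. It fills a gap left by existing datasets, contains 77,811 RGBD images, and poses distinctive challenges.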

Details

Motivation: Existing 6D pose estimation benchmarks fall short in evaluating methods under realistic industrial conditions; CHIP fills this gap. Method: The CHIP dataset covers seven distinct chairs captured with three RGBD sensing technologies, with ground-truth 6D poses annotated automatically. Result: Three zero-shot 6D pose estimation methods benchmarked on CHIP leave substantial room for improvement. Conclusion: CHIP poses new challenges for 6D pose estimation in industrial environments and will be publicly released. Abstract: Accurate 6D pose estimation of complex objects in 3D environments is essential for effective robotic manipulation. Yet, existing benchmarks fall short in evaluating 6D pose estimation methods under realistic industrial conditions, as most datasets focus on household objects in domestic settings, while the few available industrial datasets are limited to artificial setups with objects placed on tables. To bridge this gap, we introduce CHIP, the first dataset designed for 6D pose estimation of chairs manipulated by a robotic arm in a real-world industrial environment. CHIP includes seven distinct chairs captured using three different RGBD sensing technologies and presents unique challenges, such as distractor objects with fine-grained differences and severe occlusions caused by the robotic arm and human operators. CHIP comprises 77,811 RGBD images annotated with ground-truth 6D poses automatically derived from the robot's kinematics, averaging 11,115 annotations per chair. We benchmark CHIP using three zero-shot 6D pose estimation methods, assessing performance across different sensor types, localization priors, and occlusion levels. Results show substantial room for improvement, highlighting the unique challenges posed by the dataset. CHIP will be publicly released.

[61] Non-Contact Health Monitoring During Daily Personal Care Routines

Xulin Ma,Jiankai Tang,Zhang Jiang,Songqin Cheng,Yuanchun Shi,Dong LI,Xin Liu,Daniel McDuff,Xiaojing Liu,Yuntao Wang

Main category: cs.CV

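TL;DR: The paper presents the LADH dataset and shows that combining RGB and infrared video improves the accuracy and robustness of remote photoplethysmography (rPPG) for long-term health monitoring.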

Details

Motivation: Address the challenges that lighting variation, occlusion, and dynamic facial postures pose for rPPG in long-term personal care scenarios in high-altitude environments. Method: Use 240 synchronized RGB and infrared facial videos together with multi-task learning to improve physiological signal monitoring. Result: With combined RGB and infrared video input, heart rate estimation reaches a mean absolute error of 4.99 BPM. Conclusion: The LADH dataset and multi-task learning improve rPPG performance in long-term health monitoring. Abstract: Remote photoplethysmography (rPPG) enables non-contact, continuous monitoring of physiological signals and offers a practical alternative to traditional health sensing methods. Although rPPG is promising for daily health monitoring, its application in long-term personal care scenarios, such as mirror-facing routines in high-altitude environments, remains challenging due to ambient lighting variations, frequent occlusions from hand movements, and dynamic facial postures. To address these challenges, we present LADH (Long-term Altitude Daily Health), the first long-term rPPG dataset containing 240 synchronized RGB and infrared (IR) facial videos from 21 participants across five common personal care scenarios, along with ground-truth PPG, respiration, and blood oxygen signals. Our experiments demonstrate that combining RGB and IR video inputs improves the accuracy and robustness of non-contact physiological monitoring, achieving a mean absolute error (MAE) of 4.99 BPM in heart rate estimation. Furthermore, we find that multi-task learning enhances performance across multiple physiological indicators simultaneously. Dataset and code are open at https://github.com/McJackTang/FusionVitals.

[62] The Four Color Theorem for Cell Instance Segmentation

Ye Zhang,Yu Zhou,Yifeng Wang,Jun Xiao,Ziyue Wang,Yongbing Zhang,Jianxu Chen

Main category: cs.CV

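TL;DR: The paper proposes a cell instance segmentation method inspired by the four-color theorem. A four-color encoding simplifies instance differentiation, and an asymptotic training strategy handles the non-uniqueness of the encoding.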

Details

Motivation: Accurately segmenting tightly touching cells in biomedical images is a persistent challenge, and existing methods struggle to balance performance with computational efficiency. Method: Treating cells as countries and tissues as oceans, the method introduces a four-color encoding scheme that turns instance segmentation into a constrained semantic segmentation problem with only four predicted classes, and designs an asymptotic training strategy. Result: Experiments on various modes show that the method reaches state-of-the-art performance. Conclusion: With four-color encoding and the asymptotic training strategy, the method substantially simplifies cell instance segmentation and achieves strong results. Abstract: Cell instance segmentation is critical to analyzing biomedical images, yet accurately distinguishing tightly touching cells remains a persistent challenge. Existing instance segmentation frameworks, including detection-based, contour-based, and distance mapping-based approaches, have made significant progress, but balancing model performance with computational efficiency remains an open problem. In this paper, we propose a novel cell instance segmentation method inspired by the four-color theorem. By conceptualizing cells as countries and tissues as oceans, we introduce a four-color encoding scheme that ensures adjacent instances receive distinct labels. This reformulation transforms instance segmentation into a constrained semantic segmentation problem with only four predicted classes, substantially simplifying the instance differentiation process. To solve the training instability caused by the non-uniqueness of four-color encoding, we design an asymptotic training strategy and encoding transformation method. Extensive experiments on various modes demonstrate our approach achieves state-of-the-art performance. The code is available at https://github.com/zhangye-zoe/FCIS.
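
To make the four-color encoding concrete, the sketch below assigns one of four labels to each cell so that touching cells never share a label. It takes a cell adjacency map as input and uses simple backtracking search. The function name `four_color`, the input format, and the backtracking approach are assumptions for illustration, not the encoding procedure from the paper; the recursion also makes it suitable only for small examples.

```python
def four_color(adjacency, num_colors=4):
    """Assign each cell a color in 1..num_colors, distinct from its neighbors.

    adjacency: dict mapping cell_id -> set of touching cell_ids (symmetric).
    Returns a dict cell_id -> color, or None if no valid assignment exists.
    Color 0 is left free so it can stand for background in a label map.
    """
    # Visit highly connected cells first to prune the search earlier.
    cells = sorted(adjacency, key=lambda c: -len(adjacency[c]))
    colors = {}

    def assign(i):
        if i == len(cells):
            return True
        cell = cells[i]
        used = {colors[n] for n in adjacency[cell] if n in colors}
        for color in range(1, num_colors + 1):
            if color not in used:
                colors[cell] = color
                if assign(i + 1):
                    return True
                del colors[cell]
        return False

    return colors if assign(0) else None


# Four mutually touching cells need all four colors.
touching = {
    "a": {"b", "c", "d"},
    "b": {"a", "c", "d"},
    "c": {"a", "b", "d"},
    "d": {"a", "b", "c"},
}
coloring = four_color(touching)
```

A valid coloring is generally not unique (any permutation of the four labels also works), which is the non-uniqueness the abstract names as a source of training instability.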

[63] MPFNet: A Multi-Prior Fusion Network with a Progressive Training Strategy for Micro-Expression Recognition

Chuang Ma,Shaokai Zhao,Dongdong Zhou,Yu Pei,Zhiguo Luo,Liang Xie,Ye Yan,Erwei Yin

Main category: cs.CV

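TL;DR: The paper proposes the Multi-Prior Fusion Network (MPFNet), which uses a progressive training strategy for micro-expression recognition and combines generic and advanced feature encoders to improve recognition accuracy.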

Details

Motivation: Micro-expression recognition is harder than macro-expression recognition because micro-expressions are brief and low in intensity, and existing methods fail to exploit multi-source prior knowledge fully. Method: MPFNet comprises a Generic Feature Encoder (GFE) and an Advanced Feature Encoder (AFE), both based on I3D with Coordinate Attention, and comes in two variants, MPFNet-P and MPFNet-C. Result: Accuracies of 0.811, 0.924, and 0.857 on the SMIC, CASME II, and SAMM datasets, respectively, with state-of-the-art results on SMIC and SAMM. Conclusion: Multi-prior fusion substantially improves micro-expression recognition and offers a new direction for the field. Abstract: Micro-expression recognition (MER), a critical subfield of affective computing, presents greater challenges than macro-expression recognition due to its brief duration and low intensity. While incorporating prior knowledge has been shown to enhance MER performance, existing methods predominantly rely on simplistic, singular sources of prior knowledge, failing to fully exploit multi-source information. This paper introduces the Multi-Prior Fusion Network (MPFNet), leveraging a progressive training strategy to optimize MER tasks. We propose two complementary encoders: the Generic Feature Encoder (GFE) and the Advanced Feature Encoder (AFE), both based on Inflated 3D ConvNets (I3D) with Coordinate Attention (CA) mechanisms, to improve the model's ability to capture spatiotemporal and channel-specific features. Inspired by developmental psychology, we present two variants of MPFNet, MPFNet-P and MPFNet-C, corresponding to two fundamental modes of infant cognitive development: parallel and hierarchical processing. These variants enable the evaluation of different strategies for integrating prior knowledge. Extensive experiments demonstrate that MPFNet significantly improves MER accuracy while maintaining balanced performance across categories, achieving accuracies of 0.811, 0.924, and 0.857 on the SMIC, CASME II, and SAMM datasets, respectively. To the best of our knowledge, our approach achieves state-of-the-art performance on the SMIC and SAMM datasets.

[64] Vision Matters: Simple Visual Perturbations Can Boost Multimodal Math Reasoning

Yuting Li,Lai Wei,Kaipeng Zheng,Jingyuan Huang,Linghe Kong,Lichao Sun,Weiran Huang

Main category: cs.CV

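TL;DR: The study finds that current multimodal large language models (MLLMs) fall short in visual processing: language-only models given image captions can match or even outperform them. The authors propose a simple visual perturbation framework that improves perceptual robustness without algorithmic changes or extra data and yields consistent gains in mathematical reasoning.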

Details

Motivation: Current MLLMs fall short in visual processing, in particular in integrating visual information during reasoning, which motivates a visual perturbation method that needs no complex modifications. Method: Three visual perturbation strategies (distractor concatenation, dominance-preserving mixup, and random rotation) that can be easily integrated into existing post-training pipelines such as SFT, DPO, and GRPO. Result: Experiments show consistent improvements in mathematical reasoning across multiple datasets, and the trained Qwen2.5-VL-7B is competitive among open-source 7B RL-tuned models. Conclusion: Visual perturbation plays a critical role in multimodal mathematical reasoning, each perturbation strategy contributes uniquely to visual reasoning, and "better reasoning begins with better seeing". The code is open source. Abstract: Despite the rapid progress of multimodal large language models (MLLMs), they have largely overlooked the importance of visual processing. In a simple yet revealing experiment, we interestingly find that language-only models, when provided with image captions, can achieve comparable or even better performance than MLLMs that consume raw visual inputs. This suggests that current MLLMs may generate accurate visual descriptions but fail to effectively integrate them during reasoning. Motivated by this, we propose a simple visual perturbation framework that enhances perceptual robustness without requiring algorithmic modifications or additional training data. Our approach introduces three targeted perturbations: distractor concatenation, dominance-preserving mixup, and random rotation, that can be easily integrated into existing post-training pipelines including SFT, DPO, and GRPO. Through extensive experiments across multiple datasets, we demonstrate consistent improvements in mathematical reasoning performance, with gains comparable to those achieved through algorithmic changes. Additionally, we achieve competitive performance among open-source 7B RL-tuned models by training Qwen2.5-VL-7B with visual perturbation. Through comprehensive ablation studies, we analyze the effectiveness of different perturbation strategies, revealing that each perturbation type contributes uniquely to different aspects of visual reasoning. Our findings highlight the critical role of visual perturbation in multimodal mathematical reasoning: better reasoning begins with better seeing. Our code is available at https://github.com/YutingLi0606/Vision-Matters.

[65] ELBO-T2IAlign: A Generic ELBO-Based Method for Calibrating Pixel-level Text-Image Alignment in Diffusion Models

Qin Zhou,Zhiyang Zhang,Jinglong Wang,Xiaobin Li,Jing Zhang,Qian Yu,Lu Sheng,Dong Xu

Main category: cs.CV

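TL;DR: The paper proposes ELBO-T2IAlign, a method for calibrating pixel-level text-image alignment in diffusion models, and evaluates alignment through zero-shot referring image segmentation.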

Details

Motivation: Existing methods assume that text-image alignment in diffusion models is perfect, but in practice it is biased, especially for small, occluded, or rare object classes. Method: Calibrate pixel-text alignment using the evidence lower bound (ELBO); the method is training-free and applies to various diffusion model architectures. Result: Experiments show that the method improves pixel-text alignment in image segmentation and generation tasks. Conclusion: ELBO-T2IAlign is a simple, generic method that improves text-image alignment in diffusion models. Abstract: Diffusion models excel at image generation. Recent studies have shown that these models not only generate high-quality images but also encode text-image alignment information through attention maps or loss functions. This information is valuable for various downstream tasks, including segmentation, text-guided image editing, and compositional image generation. However, current methods heavily rely on the assumption of perfect text-image alignment in diffusion models, which is not the case. In this paper, we propose using zero-shot referring image segmentation as a proxy task to evaluate the pixel-level image and class-level text alignment of popular diffusion models. We conduct an in-depth analysis of pixel-text misalignment in diffusion models from the perspective of training data bias. We find that misalignment occurs in images with small-sized, occluded, or rare object classes. Therefore, we propose ELBO-T2IAlign, a simple yet effective method to calibrate pixel-text alignment in diffusion models based on the evidence lower bound (ELBO) of likelihood. Our method is training-free and generic, eliminating the need to identify the specific cause of misalignment and works well across various diffusion model architectures. Extensive experiments on commonly used benchmark datasets on image segmentation and generation have verified the effectiveness of our proposed calibration approach.

[66] Class Similarity-Based Multimodal Classification under Heterogeneous Category Sets

Yangrui Zhu,Junhua Bao,Yipan Wei,Yapeng Li,Bo Du

Main category: cs.CV

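TL;DR: The paper introduces the practical setting of Multi-Modal Heterogeneous Category-set Learning (MMHCL) and proposes a Class Similarity-based Cross-modal Fusion model (CSCF) to handle inconsistent category sets across modalities.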

Details

Motivation: In real-world applications, category distributions are inconsistent across modalities, so existing methods struggle to use cross-modal information to recognize all categories. Method: CSCF aligns modality-specific features to a shared semantic space, selects the most discriminative modality for decision fusion through uncertainty estimation, and integrates cross-modal information based on class similarity. Result: Experiments show that CSCF significantly outperforms existing methods on multiple benchmark datasets and handles the MMHCL task effectively. Conclusion: Class similarity-driven cross-modal fusion allows CSCF to address multi-modal heterogeneous category-set learning. Abstract: Existing multimodal methods typically assume that different modalities share the same category set. However, in real-world applications, the category distributions in multimodal data exhibit inconsistencies, which can hinder the model's ability to effectively utilize cross-modal information for recognizing all categories. In this work, we propose the practical setting termed Multi-Modal Heterogeneous Category-set Learning (MMHCL), where models are trained on heterogeneous category sets of multi-modal data and aim to recognize the complete class set of all modalities at test time. To effectively address this task, we propose a Class Similarity-based Cross-modal Fusion model (CSCF). Specifically, CSCF aligns modality-specific features to a shared semantic space to enable knowledge transfer between seen and unseen classes. It then selects the most discriminative modality for decision fusion through uncertainty estimation. Finally, it integrates cross-modal information based on class similarity, where the auxiliary modality refines the prediction of the dominant one. Experimental results show that our method significantly outperforms existing state-of-the-art (SOTA) approaches on multiple benchmark datasets, effectively addressing the MMHCL task.

[67] Hierarchical Image Matching for UAV Absolute Visual Localization via Semantic and Structural Constraints

Xiangkai Zhang,Xiang Zhou,Mao Chen,Yuchen Lu,Xu Yang,Zhiyong Liu

Main category: cs.CV

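TL;DR: The paper proposes a UAV absolute localization method based on hierarchical cross-source image matching. It combines a semantic-aware, structure-constrained coarse matching module with a lightweight fine-grained matching module to improve localization accuracy and robustness where GNSS is unavailable.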

Details

Motivation: When GNSS signals are unavailable, UAVs must rely on vision-based absolute localization, but existing methods struggle with matching because of cross-source discrepancies and temporal variations. Method: A hierarchical cross-source image matching method with a semantic-aware, structure-constrained coarse matching module and a lightweight fine-grained matching module, combined with image retrieval to form a localization pipeline. Result: Evaluations on public benchmark datasets and the newly introduced CS-UAV dataset demonstrate the method's superior accuracy and robustness. Conclusion: The method improves the accuracy and stability of UAV absolute localization in GNSS-denied scenarios. Abstract: Absolute localization, aiming to determine an agent's location with respect to a global reference, is crucial for unmanned aerial vehicles (UAVs) in various applications, but it becomes challenging when global navigation satellite system (GNSS) signals are unavailable. Vision-based absolute localization methods, which locate the current view of the UAV in a reference satellite map to estimate its position, have become popular in GNSS-denied scenarios. However, existing methods mostly rely on traditional and low-level image matching, suffering from difficulties due to significant differences introduced by cross-source discrepancies and temporal variations. To overcome these limitations, in this paper, we introduce a hierarchical cross-source image matching method designed for UAV absolute localization, which integrates a semantic-aware and structure-constrained coarse matching module with a lightweight fine-grained matching module. Specifically, in the coarse matching module, semantic features derived from a vision foundation model first establish region-level correspondences under semantic and structural constraints. Then, the fine-grained matching module is applied to extract fine features and establish pixel-level correspondences. Building upon this, a UAV absolute visual localization pipeline is constructed without any reliance on relative localization techniques, mainly by employing an image retrieval module before the proposed hierarchical image matching modules. Experimental evaluations on public benchmark datasets and a newly introduced CS-UAV dataset demonstrate superior accuracy and robustness of the proposed method under various challenging conditions, confirming its effectiveness.

[68] Outside Knowledge Conversational Video (OKCV) Dataset -- Dialoguing over Videos

Benjamin Reichman,Constantin Patsch,Jack Truxal,Atishay Jain,Larry Heck

Main category: cs.CV

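TL;DR: The paper proposes a video-based visual question answering task in which the model must draw on outside knowledge to answer, and introduces a dataset of 2,017 videos and 5,986 dialogues.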

Details

Motivation: Explore the challenge of answering questions in video-grounded dialogue by combining visual information with outside knowledge. Method: Build a dataset of videos and dialogues and evaluate baseline models on it. Result: The baselines expose the challenges of the task, and the dataset is publicly released. Conclusion: The task opens a new research direction for dialogue systems that combine visual and outside knowledge. Abstract: In outside knowledge visual question answering (OK-VQA), the model must identify relevant visual information within an image and incorporate external knowledge to accurately respond to a question. Extending this task to a visually grounded dialogue setting based on videos, a conversational model must both recognize pertinent visual details over time and answer questions where the required information is not necessarily present in the visual information. Moreover, the context of the overall conversation must be considered for the subsequent dialogue. To explore this task, we introduce a dataset comprised of 2,017 videos with 5,986 human-annotated dialogues consisting of 40,954 interleaved dialogue turns. While the dialogue context is visually grounded in specific video segments, the questions further require external knowledge that is not visually present. Thus, the model not only has to identify relevant video parts but also leverage external knowledge to converse within the dialogue. We further provide several baselines evaluated on our dataset and show future challenges associated with this task. The dataset is made publicly available here: https://github.com/c-patsch/OKCV.

[69] Inverting Black-Box Face Recognition Systems via Zero-Order Optimization in Eigenface Space

Anton Razzhigaev,Matvey Mikhalchuk,Klim Kireev,Igor Udovichenko,Andrey Kuznetsov,Aleksandr Petiushko

Main category: cs.CV

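TL;DR: DarkerBB is a new method that reconstructs color face images from black-box recognition models by running zero-order optimization in a PCA-derived eigenface space using only similarity scores.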

Details

Motivation: Tackle the more challenging scenario of model inversion using only similarity scores, in order to expose the privacy threat posed by black-box recognition models. Method: Perform zero-order optimization in a PCA-derived eigenface space, relying only on similarity scores to reconstruct color face images. Result: On the LFW, AgeDB-30, and CFP-FP benchmarks, DarkerBB achieves state-of-the-art verification accuracy in the similarity-only setting with competitive query efficiency. Conclusion: DarkerBB shows that face images can still be reconstructed effectively under highly limited information, which highlights the privacy risk of black-box models. Abstract: Reconstructing facial images from black-box recognition models poses a significant privacy threat. While many methods require access to embeddings, we address the more challenging scenario of model inversion using only similarity scores. This paper introduces DarkerBB, a novel approach that reconstructs color faces by performing zero-order optimization within a PCA-derived eigenface space. Despite this highly limited information, experiments on LFW, AgeDB-30, and CFP-FP benchmarks demonstrate that DarkerBB achieves state-of-the-art verification accuracies in the similarity-only setting, with competitive query efficiency.
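
Zero-order (derivative-free) optimization improves a candidate using only the scalar score returned by a black box, with no gradients. The sketch below shows the simplest variant, a random local search over a low-dimensional coefficient vector, which in an eigenface setting would parameterize a face as the mean face plus a weighted sum of principal components. The function `zero_order_search`, its parameters, and the toy score function are assumptions for illustration; the abstract does not say which zero-order optimizer DarkerBB uses.

```python
import random


def zero_order_search(score_fn, dim, steps=200, sigma=0.5, seed=0):
    """Maximize score_fn over a dim-dimensional coefficient vector.

    score_fn is treated as a black box: only its scalar output is used.
    Each step perturbs the current coefficients with Gaussian noise and
    keeps the candidate only if the score improves.
    """
    rng = random.Random(seed)
    coeffs = [0.0] * dim
    best = score_fn(coeffs)
    for _ in range(steps):
        candidate = [c + rng.gauss(0.0, sigma) for c in coeffs]
        score = score_fn(candidate)
        if score > best:
            coeffs, best = candidate, score
    return coeffs, best


# Toy stand-in for a recognition system: the score is higher the closer
# the coefficients are to a hidden target vector.
hidden_target = [1.0, -2.0, 0.5]


def toy_score(coeffs):
    return -sum((c - t) ** 2 for c, t in zip(coeffs, hidden_target))


coeffs, best = zero_order_search(toy_score, dim=3)
```

Searching over a handful of PCA coefficients rather than raw pixels keeps the search space small, which matters because every evaluation of the score is one query to the black-box system.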

[70] Q-SAM2: Accurate Quantization for Segment Anything Model 2

Nicola Farronato,Florian Scheidegger,Mattia Rigotti,Cristiano Malossi,Michele Magno,Haotong Qin

Main category: cs.CV

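TL;DR: Q-SAM2 is an efficient low-bit quantization method for SAM2 that uses linear layer calibration and quantization-aware training to counter the performance loss caused by quantization while substantially improving efficiency.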

Details

Motivation: SAM2's high computational and memory cost limits its use in resource-constrained scenarios, so an efficient quantization method is needed. Method: Q-SAM2 introduces a linear layer calibration method for low-bit initialization and a Quantization-Aware Training (QAT) pipeline that suppresses outliers and lets the network adapt to quantization thresholds. Result: Experiments show that Q-SAM2 stays highly accurate while being efficient, especially at 2-bit quantization, and that the calibration technique is also effective in post-training quantization. Conclusion: Q-SAM2 is an efficient and accurate quantization method suited to resource-constrained scenarios that makes SAM2 more practical. Abstract: The Segment Anything Model 2 (SAM2) has gained significant attention as a foundational approach for promptable image and video segmentation. However, its expensive computational and memory consumption poses a severe challenge for its application in resource-constrained scenarios. In this paper, we propose an accurate low-bit quantization method for efficient SAM2, termed Q-SAM2. To address the performance degradation caused by the singularities in weight and activation distributions during quantization, Q-SAM2 introduces two novel technical contributions. We first introduce a linear layer calibration method for low-bit initialization of SAM2, which minimizes the Frobenius norm over a small image batch to reposition weight distributions for improved quantization. We then propose a Quantization-Aware Training (QAT) pipeline that applies clipping to suppress outliers and allows the network to adapt to quantization thresholds during training. Our comprehensive experiments demonstrate that Q-SAM2 allows for highly accurate inference while substantially improving efficiency. Both quantitative and visual results show that our Q-SAM2 surpasses existing state-of-the-art general quantization schemes, especially for ultra-low 2-bit quantization. While designed for quantization-aware training, our proposed calibration technique also proves effective in post-training quantization, achieving up to a 66% mIoU accuracy improvement over non-calibrated models.
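
To show why clipping matters at very low bit widths, the sketch below implements generic uniform quantization with a symmetric clipping threshold. It is a textbook-style illustration, not Q-SAM2's calibration or QAT pipeline; the function name `quantize` and its parameters are assumptions.

```python
def quantize(values, bits, clip):
    """Uniformly quantize values onto 2**bits levels spanning [-clip, clip].

    Values outside the range are clipped first, so a few outliers do not
    stretch the grid and waste the small number of available levels.
    """
    levels = 2 ** bits - 1          # number of steps between grid points
    step = 2 * clip / levels
    out = []
    for v in values:
        v = max(-clip, min(clip, v))          # suppress outliers
        index = round((v + clip) / step)      # nearest grid index
        out.append(index * step - clip)       # map back to a real value
    return out


# With 2 bits and clip=1.5 the grid is {-1.5, -0.5, 0.5, 1.5}.
weights = [-5.0, 0.1, 0.9, 5.0]
print(quantize(weights, bits=2, clip=1.5))  # -> [-1.5, 0.5, 0.5, 1.5]
```

Without clipping, the range would have to cover the outliers at -5.0 and 5.0, and the two small weights would no longer land on a grid point near their true values. Choosing the threshold well is therefore a large part of what a low-bit scheme has to get right.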

[71] Accurate and efficient zero-shot 6D pose estimation with frozen foundation models

Andrea Caraffa,Davide Boscaini,Fabio Poiesi

Main category: cs.CV

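TL;DR: FreeZeV2 is a training-free 6D pose estimation method that generalizes strongly to novel objects by leveraging pre-trained geometric and vision foundation models, with substantial gains in speed and accuracy.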

Details

Motivation: Existing methods require large amounts of task-specific training data; the paper asks whether pre-trained models alone can deliver accurate and efficient 6D pose estimation. Method: Sparse feature extraction, a feature-aware scoring mechanism, and a modular design that supports ensembles of instance segmentation models. Result: A new state of the art on the BOP Benchmark, with an 8x speedup and a 5% accuracy gain over FreeZe; with segmentation ensembles, accuracy improves by a further 8% while still running 2.5x faster. Conclusion: FreeZeV2 shows that accurate and efficient 6D pose estimation is achievable without task-specific training, offering a new direction for the field. Abstract: Estimating the 6D pose of objects from RGBD data is a fundamental problem in computer vision, with applications in robotics and augmented reality. A key challenge is achieving generalization to novel objects that were not seen during training. Most existing approaches address this by scaling up training on synthetic data tailored to the task, a process that demands substantial computational resources. But is task-specific training really necessary for accurate and efficient 6D pose estimation of novel objects? To answer No!, we introduce FreeZeV2, the second generation of FreeZe: a training-free method that achieves strong generalization to unseen objects by leveraging geometric and vision foundation models pre-trained on unrelated data. FreeZeV2 improves both accuracy and efficiency over FreeZe through three key contributions: (i) a sparse feature extraction strategy that reduces inference-time computation without sacrificing accuracy; (ii) a feature-aware scoring mechanism that improves both pose selection during RANSAC-based 3D registration and the final ranking of pose candidates; and (iii) a modular design that supports ensembles of instance segmentation models, increasing robustness to segmentation mask errors. We evaluate FreeZeV2 on the seven core datasets of the BOP Benchmark, where it establishes a new state-of-the-art in 6D pose estimation of unseen objects. When using the same segmentation masks, FreeZeV2 achieves a remarkable 8x speedup over FreeZe while also improving accuracy by 5%. When using ensembles of segmentation models, FreeZeV2 gains an additional 8% in accuracy while still running 2.5x faster than FreeZe. FreeZeV2 was awarded Best Overall Method at the BOP Challenge 2024.

[72] DreamCS: Geometry-Aware Text-to-3D Generation with Unpaired 3D Reward Supervision

Xiandong Zou,Ruihao Xia,Hongsong Wang,Pan Zhou

Main category: cs.CV

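TL;DR: The paper proposes DreamCS, a framework that improves text-to-3D generation by building 3D-MeshPref, the first large-scale unpaired 3D preference dataset, and training the reward model RewardCS with a Cauchy-Schwarz divergence objective.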

Details

Motivation: Existing text-to-3D methods struggle to produce 3D assets that match human preferences, and current preference alignment techniques depend on hard-to-collect paired multi-view 2D images, which leads to geometric artifacts. Method: Build the 3D-MeshPref dataset, annotated by a large language model and refined by human evaluators; develop the RewardCS reward model with a Cauchy-Schwarz divergence objective; and propose the DreamCS framework, which integrates RewardCS into text-to-3D pipelines. Result: Experiments show that DreamCS outperforms prior methods, producing 3D assets that are geometrically faithful and human-preferred. Conclusion: By learning directly from 3D preference data, DreamCS significantly improves the quality and human alignment of text-to-3D generation. Abstract: While text-to-3D generation has attracted growing interest, existing methods often struggle to produce 3D assets that align well with human preferences. Current preference alignment techniques for 3D content typically rely on hard-to-collect preference-paired multi-view 2D images to train 2D reward models, which then guide 3D generation -- leading to geometric artifacts due to their inherent 2D bias. To address these limitations, we construct 3D-MeshPref, the first large-scale unpaired 3D preference dataset, featuring diverse 3D meshes annotated by a large language model and refined by human evaluators. We then develop RewardCS, the first reward model trained directly on unpaired 3D-MeshPref data using a novel Cauchy-Schwarz divergence objective, enabling effective learning of human-aligned 3D geometric preferences without requiring paired comparisons. Building on this, we propose DreamCS, a unified framework that integrates RewardCS into text-to-3D pipelines -- enhancing both implicit and explicit 3D generation with human preference feedback. Extensive experiments show DreamCS outperforms prior methods, producing 3D assets that are both geometrically faithful and human-preferred. Code and models will be released publicly.
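For reference, the Cauchy-Schwarz divergence between two densities is commonly defined as below. This is the textbook form; the abstract does not spell out how RewardCS turns it into a training objective, so the symbols $p$ and $q$ (for example, the distributions of preferred and non-preferred samples) are assumptions made for illustration.

```latex
D_{\mathrm{CS}}(p, q) = -\log \frac{\int p(x)\, q(x)\, dx}{\sqrt{\int p(x)^2\, dx \,\int q(x)^2\, dx}}
```

By the Cauchy-Schwarz inequality the ratio is at most 1, so $D_{\mathrm{CS}}(p, q) \ge 0$, with equality exactly when $p = q$. The divergence is symmetric and compares two sets of samples as wholes, which is consistent with learning from unpaired data.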

Chuang Ma,Yu Pei,Jianhang Zhang,Shaokai Zhao,Bowen Ji,Liang Xie,Ye Yan,Erwei Yin

Main category: cs.CV

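TL;DR: The paper introduces MMME, a new micro-expression dataset that is the first to synchronously collect facial action signals, central nervous system signals, and peripheral physiological signals, and shows that multimodal fusion significantly improves micro-expression recognition and spotting.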

Details

Motivation: Existing micro-expression research is confined to a single visual modality and overlooks the rich emotional information carried by other physiological modalities, which limits performance. Exploring the cross-modal association between visual features and physiological signals, and developing a multimodal fusion framework, is therefore key to advancing micro-expression analysis. Method: The study introduces the MMME dataset, comprising 634 micro-expressions, 2,841 macro-expressions, and 2,890 trials of synchronized multimodal physiological signals, and validates both the dataset's reliability and the gains from multimodal fusion experimentally. Result: Experiments show that incorporating physiological signals significantly improves micro-expression recognition and spotting; MMME is the most comprehensive micro-expression dataset to date in terms of modality diversity. Conclusion: MMME provides critical data support for exploring the neural mechanisms of micro-expressions and visual-physiological synergy, driving a shift in micro-expression research from single-modality visual analysis to multimodal fusion. Abstract: Micro-expressions (MEs) are subtle, fleeting nonverbal cues that reveal an individual's genuine emotional state. Their analysis has attracted considerable interest due to its promising applications in fields such as healthcare, criminal investigation, and human-computer interaction. However, existing ME research is limited to a single visual modality, overlooking the rich emotional information conveyed by other physiological modalities, resulting in ME recognition and spotting performance far below practical application needs. Therefore, exploring the cross-modal association mechanism between ME visual features and physiological signals (PS), and developing a multimodal fusion framework, represents a pivotal step toward advancing ME analysis. This study introduces a novel ME dataset, MMME, which, for the first time, enables synchronized collection of facial action signals (MEs), central nervous system signals (EEG), and peripheral PS (PPG, RSP, SKT, EDA, and ECG). By overcoming the constraints of existing ME corpora, MMME comprises 634 MEs, 2,841 macro-expressions (MaEs), and 2,890 trials of synchronized multimodal PS, establishing a robust foundation for investigating ME neural mechanisms and conducting multimodal fusion-based analyses. Extensive experiments validate the dataset's reliability and provide benchmarks for ME analysis, demonstrating that integrating MEs with PS significantly enhances recognition and spotting performance. To the best of our knowledge, MMME is the most comprehensive ME dataset to date in terms of modality diversity. It provides critical data support for exploring the neural mechanisms of MEs and uncovering the visual-physiological synergistic effects, driving a paradigm shift in ME research from single-modality visual analysis to multimodal fusion. The dataset will be publicly available upon acceptance of this paper.

[74] DynaSplat: Dynamic-Static Gaussian Splatting with Hierarchical Motion Decomposition for Scene Reconstruction

Junli Deng,Ping Shi,Qipei Li,Jinyang Guo

Main category: cs.CV

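TL;DR: DynaSplat extends Gaussian Splatting with dynamic-static separation and hierarchical motion modeling to reconstruct complex dynamic scenes with high accuracy.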

Details

Motivation: Existing methods struggle with the complexity of real-world dynamic scenes, so a more efficient and intuitive approach to dynamic scene reconstruction is needed. Method: Classify scene elements as static or dynamic by combining deformation offset statistics with 2D motion flow consistency, adopt a hierarchical motion modeling strategy, and introduce physically-based opacity estimation. Result: On several challenging datasets, DynaSplat surpasses existing methods in accuracy and realism while being more compact and efficient. Conclusion: DynaSplat offers an accurate, intuitive, and efficient solution for dynamic scene reconstruction. Abstract: Reconstructing intricate, ever-changing environments remains a central ambition in computer vision, yet existing solutions often crumble before the complexity of real-world dynamics. We present DynaSplat, an approach that extends Gaussian Splatting to dynamic scenes by integrating dynamic-static separation and hierarchical motion modeling. First, we classify scene elements as static or dynamic through a novel fusion of deformation offset statistics and 2D motion flow consistency, refining our spatial representation to focus precisely where motion matters. We then introduce a hierarchical motion modeling strategy that captures both coarse global transformations and fine-grained local movements, enabling accurate handling of intricate, non-rigid motions. Finally, we integrate physically-based opacity estimation to ensure visually coherent reconstructions, even under challenging occlusions and perspective shifts. Extensive experiments on challenging datasets reveal that DynaSplat not only surpasses state-of-the-art alternatives in accuracy and realism but also provides a more intuitive, compact, and efficient route to dynamic scene reconstruction.

Chen Gao,Liankai Jin,Xingyu Peng,Jiazhao Zhang,Yue Deng,Annan Li,He Wang,Si Liu

Main category: cs.CV

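TL;DR: The paper proposes OctoNav-R1, a generalist navigation agent that follows free-form multimodal instructions, together with the OctoNav-Bench benchmark and the TBA-CoT dataset to support training.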

Details

Motivation: Existing navigation research is split across separate tasks with independently designed methods and lacks generality. The paper aims to develop a generalist navigation agent that can handle multimodal instructions. Method: Propose the OctoNav-Bench benchmark and the TBA-CoT dataset, build the MLLM-based OctoNav-R1 model, and train it with a three-stage Hybrid Training Paradigm (HTP). Result: OctoNav-R1 outperforms previous methods. Conclusion: With TBA-SFT and Nav-GRPO training, the model shows stronger reasoning ability on navigation tasks, pointing to a new direction for generalist navigation agents. Abstract: Embodied navigation stands as a foundational pillar within the broader pursuit of embodied AI. However, previous navigation research is divided into different tasks/capabilities, e.g., ObjNav, ImgNav and VLN, which differ in task objectives and modalities, so datasets and methods are designed individually. In this work, we take steps toward generalist navigation agents, which can follow free-form instructions that include arbitrary compounds of multi-modal and multi-capability. To achieve this, we propose a large-scale benchmark and corresponding method, termed OctoNav-Bench and OctoNav-R1. Specifically, OctoNav-Bench features continuous environments and is constructed via a designed annotation pipeline. We thoroughly craft instruction-trajectory pairs, where instructions are diverse in free-form with arbitrary modality and capability. Also, we construct a Think-Before-Action (TBA-CoT) dataset within OctoNav-Bench to provide the thinking process behind actions. For OctoNav-R1, we build it upon MLLMs and adapt it to a VLA-type model, which can produce low-level actions solely based on 2D visual observations. Moreover, we design a Hybrid Training Paradigm (HTP) that consists of three stages, i.e., Action-/TBA-SFT, Nav-GRPO, and Online RL stages. Each stage contains specifically designed learning policies and rewards. Importantly, for TBA-SFT and Nav-GRPO designs, we are inspired by the OpenAI-o1 and DeepSeek-R1, which show impressive reasoning ability via thinking-before-answer. Thus, we aim to investigate how to achieve thinking-before-action in the embodied navigation field, to improve the model's reasoning ability toward generalists. Specifically, we propose TBA-SFT to utilize the TBA-CoT dataset to fine-tune the model as a cold-start phase and then leverage Nav-GRPO to improve its thinking ability. Finally, OctoNav-R1 shows superior performance compared with previous methods.

[76] Learning to Align: Addressing Character Frequency Distribution Shifts in Handwritten Text Recognition

Panagiotis Kaliosis,John Pavlopoulos

Main category: cs.CV

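TL;DR: Proposes a Wasserstein-distance-based loss function that improves the accuracy and robustness of handwritten text recognition, particularly under shifts in character frequency distribution.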

Details

Motivation: Handwritten text recognition is challenging because character sets and their frequencies vary across time periods and regions, and existing models underperform on specific subsets. Method: Propose a new loss function that uses the Wasserstein distance to align the character frequency distribution of the predicted text with a target distribution, and reuse the same alignment at inference time through a guided decoding scheme. Result: Experiments confirm that the method improves generalization and performance across multiple datasets and architectures. Conclusion: The method effectively mitigates the problems caused by character frequency distribution shifts, and the code is open source. Abstract: Handwritten text recognition aims to convert visual input into machine-readable text, and it remains challenging due to the evolving and context-dependent nature of handwriting. Character sets change over time, and character frequency distributions shift across historical periods or regions, often causing models trained on broad, heterogeneous corpora to underperform on specific subsets. To tackle this, we propose a novel loss function that incorporates the Wasserstein distance between the character frequency distribution of the predicted text and a target distribution empirically derived from training data. By penalizing divergence from expected distributions, our approach enhances both accuracy and robustness under temporal and contextual intra-dataset shifts. Furthermore, we demonstrate that character distribution alignment can also improve existing models at inference time without requiring retraining by integrating it as a scoring function in a guided decoding scheme. Experimental results across multiple datasets and architectures confirm the effectiveness of our method in boosting generalization and performance. We open source our code at https://github.com/pkaliosis/fada.
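The quantity being penalized can be sketched for plain strings as follows. The sketch assumes that characters are placed on a line in alphabet order with unit spacing, so the 1D Wasserstein-1 distance reduces to a sum of absolute CDF differences; the abstract does not state the ground metric or the differentiable form used in the actual loss, and the function names are hypothetical.

```python
from collections import Counter

def char_distribution(text, alphabet):
    # Relative frequency of each alphabet character in the text.
    counts = Counter(ch for ch in text if ch in alphabet)
    total = sum(counts.values())
    return [counts[ch] / total if total else 0.0 for ch in alphabet]

def wasserstein_1d(p, q):
    # W1 between two histograms on positions 0..n-1 with unit spacing:
    # the sum of absolute differences between their cumulative sums.
    cum, dist = 0.0, 0.0
    for a, b in zip(p, q):
        cum += a - b
        dist += abs(cum)
    return dist

alphabet = "abc"
predicted = char_distribution("aab", alphabet)  # [2/3, 1/3, 0]
target = char_distribution("abc", alphabet)     # [1/3, 1/3, 1/3]
penalty = wasserstein_1d(predicted, target)     # moves 1/3 of the mass by 2 positions
```

The same score could rank decoding candidates at inference time, which is the role the abstract describes for the guided decoding scheme.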

[77] IntPhys 2: Benchmarking Intuitive Physics Understanding In Complex Synthetic Environments

Florian Bordes,Quentin Garrido,Justine T Kao,Adina Williams,Michael Rabbat,Emmanuel Dupoux

Main category: cs.CV

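TL;DR: IntPhys 2 is a video benchmark for evaluating deep learning models' understanding of intuitive physics, built on four core principles; current models perform poorly in complex scenes and fall far short of human performance.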

Details

Motivation: Evaluate how well deep learning models understand intuitive physics and measure the gap between current models and human cognition. Method: Design tests based on the violation-of-expectation framework, covering four core principles (Permanence, Immutability, Spatio-Temporal Continuity, and Solidity), and evaluate models in diverse virtual environments. Result: Current models perform near chance (50%) in complex scenes, far below the near-perfect accuracy of humans. Conclusion: Existing models show clear deficits in intuitive physics understanding, and better model architectures and training methods are needed. Abstract: We present IntPhys 2, a video benchmark designed to evaluate the intuitive physics understanding of deep learning models. Building on the original IntPhys benchmark, IntPhys 2 focuses on four core principles related to macroscopic objects: Permanence, Immutability, Spatio-Temporal Continuity, and Solidity. These conditions are inspired by research into intuitive physical understanding emerging during early childhood. IntPhys 2 offers a comprehensive suite of tests, based on the violation of expectation framework, that challenge models to differentiate between possible and impossible events within controlled and diverse virtual environments. Alongside the benchmark, we provide performance evaluations of several state-of-the-art models. Our findings indicate that while these models demonstrate basic visual understanding, they face significant challenges in grasping intuitive physics across the four principles in complex scenes, with most models performing at chance levels (50%), in stark contrast to human performance, which achieves near-perfect accuracy. This underscores the gap between current models and human-like intuitive physics understanding, highlighting the need for advancements in model architectures and training methodologies.
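The 50% chance level follows from how violation-of-expectation tests are typically scored: each possible video is paired with an impossible one, and the model is counted correct when it assigns more "surprise" to the impossible clip. The sketch below assumes such a pairwise protocol and a scalar surprise score per video; the benchmark's exact scoring procedure is not given in the abstract.

```python
def voe_accuracy(pairs):
    # pairs: list of (surprise_possible, surprise_impossible) scores.
    # A pair is correct when the impossible event is rated more surprising;
    # ties count as half, so an uninformative model scores exactly 0.5.
    credit = 0.0
    for possible, impossible in pairs:
        if impossible > possible:
            credit += 1.0
        elif impossible == possible:
            credit += 0.5
    return credit / len(pairs)

# Hypothetical scores: one pair ranked correctly, one ranked incorrectly.
mixed = voe_accuracy([(0.1, 0.9), (0.5, 0.2)])
# A model that outputs the same score for every video sits at chance.
constant = voe_accuracy([(0.3, 0.3), (0.3, 0.3), (0.3, 0.3)])
```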

[78] Leveraging Depth and Language for Open-Vocabulary Domain-Generalized Semantic Segmentation

Siyu Chen,Ting Han,Chengzheng Fu,Changshe Zhang,Chaolei Wang,Jinhe Su,Guorong Cai,Meiliu Wu

Main category: cs.CV

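TL;DR: The paper proposes Vireo, a single-stage framework for Open-Vocabulary Domain-Generalized Semantic Segmentation (OV-DGSS) that combines the strengths of open-vocabulary semantic segmentation (OVSS) and domain-generalized semantic segmentation (DGSS), using techniques such as aligning geometric features with language cues.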

Details

Motivation: The complementarity of open-vocabulary and domain-generalized semantic segmentation motivates OV-DGSS, which aims to produce pixel-level masks for unseen categories while staying robust across domains, as required in real-world settings such as autonomous driving. Method: Vireo builds on frozen Visual Foundation Models (VFMs), incorporates scene geometry through Depth VFMs, and proposes three key components, GeoText Prompts, CMPE, and DOV-VEH, to align the visual and textual modalities. Result: Vireo significantly surpasses existing methods in both domain generalization and open-vocabulary recognition, achieving state-of-the-art performance. Conclusion: Vireo offers a unified and scalable solution for robust visual understanding in diverse and dynamic environments. Abstract: Open-Vocabulary semantic segmentation (OVSS) and domain generalization in semantic segmentation (DGSS) highlight a subtle complementarity that motivates Open-Vocabulary Domain-Generalized Semantic Segmentation (OV-DGSS). OV-DGSS aims to generate pixel-level masks for unseen categories while maintaining robustness across unseen domains, a critical capability for real-world scenarios such as autonomous driving in adverse conditions. We introduce Vireo, a novel single-stage framework for OV-DGSS that unifies the strengths of OVSS and DGSS for the first time. Vireo builds upon frozen Visual Foundation Models (VFMs) and incorporates scene geometry via Depth VFMs to extract domain-invariant structural features. To bridge the gap between visual and textual modalities under domain shift, we propose three key components: (1) GeoText Prompts, which align geometric features with language cues and progressively refine VFM encoder representations; (2) Coarse Mask Prior Embedding (CMPE) for enhancing gradient flow for faster convergence and stronger textual influence; and (3) the Domain-Open-Vocabulary Vector Embedding Head (DOV-VEH), which fuses refined structural and semantic features for robust prediction. Comprehensive evaluation on these components demonstrates the effectiveness of our designs. Our proposed Vireo achieves state-of-the-art performance and surpasses existing methods by a large margin in both domain generalization and open-vocabulary recognition, offering a unified and scalable solution for robust visual understanding in diverse and dynamic environments. Code is available at https://github.com/anonymouse-9c53tp182bvz/Vireo.

[79] 3D-Aware Vision-Language Models Fine-Tuning with Geometric Distillation

Seonho Lee,Jiho Choi,Inha Kang,Jiwook Kim,Junsung Park,Hyunjung Shim

Main category: cs.CV

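TL;DR: Proposes Geometric Distillation, a lightweight, annotation-free fine-tuning framework that injects geometric cues to improve the 3D spatial understanding of pretrained vision-language models (VLMs).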

Details

Motivation: Existing vision-language models are limited in their understanding of 3D spatial structure, and an efficient way to improve their 3D awareness is needed. Method: Distill sparse correspondences, relative depth relations, and dense cost volumes from off-the-shelf 3D foundation models (e.g., MASt3R, VGGT) to inject geometric cues without changing the model architecture. Result: The method outperforms prior approaches on 3D vision-language reasoning and 3D perception benchmarks at significantly lower computational cost. Conclusion: The work provides a scalable and efficient path from 2D-trained VLMs to 3D understanding, suited to spatially grounded multimodal tasks. Abstract: Vision-Language Models (VLMs) have shown remarkable performance on diverse visual and linguistic tasks, yet they remain fundamentally limited in their understanding of 3D spatial structures. We propose Geometric Distillation, a lightweight, annotation-free fine-tuning framework that injects human-inspired geometric cues into pretrained VLMs without modifying their architecture. By distilling (1) sparse correspondences, (2) relative depth relations, and (3) dense cost volumes from off-the-shelf 3D foundation models (e.g., MASt3R, VGGT), our method shapes representations to be geometry-aware while remaining compatible with natural image-text inputs. Through extensive evaluations on 3D vision-language reasoning and 3D perception benchmarks, our method consistently outperforms prior approaches, achieving improved 3D spatial reasoning with significantly lower computational cost. Our work demonstrates a scalable and efficient path to bridge 2D-trained VLMs with 3D understanding, opening up wider use in spatially grounded multimodal tasks.

[80] The Less You Depend, The More You Learn: Synthesizing Novel Views from Sparse, Unposed Images without Any 3D Knowledge

Haoru Wang,Kai Ye,Yangyan Li,Wenzheng Chen,Baoquan Chen

Main category: cs.CV

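TL;DR: The paper studies generalizable novel view synthesis (NVS) and proposes a data-driven method that reduces dependence on 3D knowledge while still achieving strong performance.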

Details

Motivation: Examine the role of 3D knowledge in novel view synthesis and test whether dependence on it can be reduced. Method: Propose a framework that minimizes 3D inductive bias and pose dependence, learning 3D information directly from sparse 2D images. Result: Experiments show that the method generates photorealistic, 3D-consistent novel views, with performance comparable to methods that rely on posed inputs. Conclusion: A data-driven approach can reduce dependence on 3D knowledge and still achieve high performance, which validates its effectiveness. Abstract: We consider the problem of generalizable novel view synthesis (NVS), which aims to generate photorealistic novel views from sparse or even unposed 2D images without per-scene optimization. This task remains fundamentally challenging, as it requires inferring 3D structure from incomplete and ambiguous 2D observations. Early approaches typically rely on strong 3D knowledge, including architectural 3D inductive biases (e.g., embedding explicit 3D representations, such as NeRF or 3DGS, into network design) and ground-truth camera poses for both input and target views. While recent efforts have sought to reduce the 3D inductive bias or the dependence on known camera poses of input views, critical questions regarding the role of 3D knowledge and the necessity of circumventing its use remain under-explored. In this work, we conduct a systematic analysis on the 3D knowledge and uncover a critical trend: the performance of methods that require less 3D knowledge accelerates more as data scales, eventually achieving performance on par with their 3D knowledge-driven counterparts, which highlights the increasing importance of reducing dependence on 3D knowledge in the era of large-scale data. Motivated by and following this trend, we propose a novel NVS framework that minimizes 3D inductive bias and pose dependence for both input and target views. By eliminating this 3D knowledge, our method fully leverages data scaling and learns implicit 3D awareness directly from sparse 2D images, without any 3D inductive bias or pose annotation during training. Extensive experiments demonstrate that our model generates photorealistic and 3D-consistent novel views, achieving even comparable performance with methods that rely on posed inputs, thereby validating the feasibility and effectiveness of our data-centric paradigm. Project page: https://pku-vcl-geometry.github.io/Less3Depend/ .

[81] EquiCaps: Predictor-Free Pose-Aware Pre-Trained Capsule Networks

Athinoulla Konstantinou,Georgios Leontidis,Mamatha Thota,Aiden Durrant

Main category: cs.CV

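TL;DR: The paper proposes EquiCaps, a capsule-network-based self-supervised method that learns pose-aware representations without a dedicated predictor and performs strongly on pose estimation tasks.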

Details

Motivation: Explore how the inherent pose-awareness of capsule networks can replace dedicated predictors for enforcing equivariance, and thereby improve pose estimation performance. Method: Introduce EquiCaps, which uses the intrinsic properties of capsule networks for pose-aware self-supervision, and evaluate it on multi-geometric transformation tasks using the 3DIEBench-T dataset. Result: EquiCaps outperforms existing methods on rotation prediction, reaching an R² of 0.78, and maintains robust equivariance under combined geometric transformations. Conclusion: EquiCaps demonstrates the potential of capsule networks to achieve equivariance without a predictor, offering a new approach to pose-aware representation learning. Abstract: Learning self-supervised representations that are invariant and equivariant to transformations is crucial for advancing beyond traditional visual classification tasks. However, many methods rely on predictor architectures to encode equivariance, despite evidence that architectural choices, such as capsule networks, inherently excel at learning interpretable pose-aware representations. To explore this, we introduce EquiCaps (Equivariant Capsule Network), a capsule-based approach to pose-aware self-supervision that eliminates the need for a specialised predictor for enforcing equivariance. Instead, we leverage the intrinsic pose-awareness capabilities of capsules to improve performance in pose estimation tasks. To further challenge our assumptions, we increase task complexity via multi-geometric transformations to enable a more thorough evaluation of invariance and equivariance by introducing 3DIEBench-T, an extension of a 3D object-rendering benchmark dataset. Empirical results demonstrate that EquiCaps outperforms prior state-of-the-art equivariant methods on rotation prediction, achieving a supervised-level $R^2$ of 0.78 on the 3DIEBench rotation prediction benchmark and improving upon SIE and CapsIE by 0.05 and 0.04 $R^2$, respectively. Moreover, in contrast to non-capsule-based equivariant approaches, EquiCaps maintains robust equivariant performance under combined geometric transformations, underscoring its generalisation capabilities and the promise of predictor-free capsule architectures.
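For readers unfamiliar with the metric, the coefficient of determination $R^2$ reported above is the standard regression score sketched below. The scalar targets in the example are hypothetical; how the benchmark parameterizes rotations and aggregates over their components is not described in the abstract.

```python
def r2_score(y_true, y_pred):
    # R^2 = 1 - SS_res / SS_tot, where SS_tot is measured around the mean
    # of the true values. 1.0 is a perfect fit; 0.0 matches the score of
    # always predicting the mean.
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

y_true = [1.0, 2.0, 3.0]
perfect = r2_score(y_true, [1.0, 2.0, 3.0])
mean_only = r2_score(y_true, [2.0, 2.0, 2.0])
```

Under this definition, an $R^2$ of 0.78 means the predictions account for 78% of the variance in the targets.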

[82] CEM-FBGTinyDet: Context-Enhanced Foreground Balance with Gradient Tuning for tiny Objects

Tao Liu,Zhenchao Cui

Main category: cs.CV

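TL;DR: The paper proposes E-FPN-BS, which uses multi-scale feature enhancement and adaptive optimization to address untrained high-level features in tiny object detection.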

Details

Motivation: Under standard label assignment protocols, high-level features (P5-P6) often receive zero positive anchors and therefore go untrained, leaving their semantic representations undeveloped and depriving low-level features of semantic context. Method: Propose the E-FPN-BS architecture, which comprises a Context Enhancement Module (CEM) and a Foreground-Background Separation Module (FBSM), and introduce a Dynamic Gradient-Balanced Loss (DCLoss). Result: The method shows strong performance and generalization on multiple benchmark datasets. Conclusion: E-FPN-BS effectively addresses untrained high-level features in tiny object detection and improves detection performance. Abstract: Tiny object detection (TOD) reveals a fundamental flaw in feature pyramid networks: high-level features (P5-P6) frequently receive zero positive anchors under standard label assignment protocols, leaving their semantic representations untrained due to exclusion from loss computation. This creates dual deficiencies: (1) stranded high-level features become semantic dead-ends without gradient updates, while (2) low-level features lack essential semantic context for robust classification. To address these issues, we propose E-FPN-BS, a novel architecture that systematically converts wasted high-level semantics into low-level feature enhancements by integrating multi-scale feature enhancement and adaptive optimization. First, our Context Enhancement Module (CEM) employs dual-branch processing to align and compress high-level features for effective global-local fusion. Second, the Foreground-Background Separation Module (FBSM) generates spatial gating masks that dynamically amplify discriminative regions. To address gradient imbalance across object scales, we further propose a Dynamic Gradient-Balanced Loss (DCLoss) that automatically modulates loss contributions via scale-aware gradient equilibrium. Extensive experiments across multiple benchmark datasets demonstrate the outstanding performance and generalization ability of our approach.

[83] Only-Style: Stylistic Consistency in Image Generation without Content Leakage

Tilemachos Aravanis,Panagiotis Filntisis,Petros Maragos,George Retsinas

Main category: cs.CV

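TL;DR: The paper proposes Only-Style, a method that mitigates content leakage in image generation while preserving stylistic consistency. By localizing content leakage and adaptively tuning a style-alignment parameter, it significantly outperforms existing techniques.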

Details

Motivation: Existing methods for style-consistent image generation struggle to separate semantic content from stylistic elements, which causes content leakage. Method: Only-Style localizes content leakage and adaptively tunes the style alignment parameter to balance stylistic consistency against leakage elimination. Result: Experiments show that the method significantly outperforms existing techniques across diverse instances, achieving robust stylistic consistency without content leakage. Conclusion: Only-Style effectively addresses content leakage while preserving stylistic consistency, and the paper also contributes a new evaluation framework. Abstract: Generating images in a consistent reference visual style remains a challenging computer vision task. State-of-the-art methods aiming for style-consistent generation struggle to effectively separate semantic content from stylistic elements, leading to content leakage from the image provided as a reference to the targets. To address this challenge, we propose Only-Style: a method designed to mitigate content leakage in a semantically coherent manner while preserving stylistic consistency. Only-Style works by localizing content leakage during inference, allowing the adaptive tuning of a parameter that controls the style alignment process, specifically within the image patches containing the subject in the reference image. This adaptive process best balances stylistic consistency with leakage elimination. Moreover, the localization of content leakage can function as a standalone component, given a reference-target image pair, allowing the adaptive tuning of any method-specific parameter that provides control over the impact of the stylistic reference. In addition, we propose a novel evaluation framework to quantify the success of style-consistent generations in avoiding undesired content leakage. Our approach demonstrates a significant improvement over state-of-the-art methods through extensive evaluation across diverse instances, consistently achieving robust stylistic consistency without undesired content leakage.

[84] MetricHMR: Metric Human Mesh Recovery from Monocular Images

He Zhang,Chentao Song,Hongwen Zhang,Tao Yu

Main category: cs.CV

TL;DR: MetricHMR recovers metric human meshes with accurate global translation from monocular images, resolving the scale and depth ambiguity of existing HMR methods.

Details

Motivation: Existing HMR methods suffer from severe scale and depth ambiguity, which makes their reconstructions geometrically implausible. MetricHMR aims to resolve this and recover human meshes at metric scale. Method: Systematically analyze the camera models of existing HMR methods, emphasize the critical role of the standard perspective projection model, and propose a new ray-map-based approach that jointly encodes bounding-box information, camera parameters, and geometric cues for end-to-end metric HMR. Result: Experiments show that MetricHMR achieves state-of-the-art performance in metric pose, shape, and global translation estimation, outperforming existing HMR methods. Conclusion: By combining the standard perspective projection model with a ray map, MetricHMR resolves scale and depth ambiguity and achieves accurate metric human mesh recovery. Abstract: We introduce MetricHMR (Metric Human Mesh Recovery), an approach for metric human mesh recovery with accurate global translation from monocular images. In contrast to existing HMR methods that suffer from severe scale and depth ambiguity, MetricHMR is able to produce geometrically reasonable body shape and global translation in the reconstruction results. To this end, we first systematically analyze previous HMR methods on camera models to emphasize the critical role of the standard perspective projection model in enabling metric-scale HMR. We then validate the acceptable ambiguity range of metric HMR under the standard perspective projection model. Finally, we contribute a novel approach that introduces a ray map based on the standard perspective projection to jointly encode bounding-box information, camera parameters, and geometric cues for End2End metric HMR without any additional metric-regularization modules. Extensive experiments demonstrate that our method achieves state-of-the-art performance, even compared with sequential HMR methods, in metric pose, shape, and global translation estimation across both indoor and in-the-wild scenarios.
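One plausible reading of a "ray map based on the standard perspective projection" is a grid of unit viewing rays, one per pixel inside the person's bounding box, computed from the pinhole intrinsics. The sketch below follows that reading; the function name, the inclusive box convention, and the choice of unit-normalized rays are assumptions, and the paper's actual encoding may differ.

```python
import math

def ray_map(fx, fy, cx, cy, box, step=1):
    # Pinhole back-projection: pixel (u, v) maps to the viewing direction
    # ((u - cx) / fx, (v - cy) / fy, 1), normalized to unit length.
    # box = (u_min, v_min, u_max, v_max) in pixels, inclusive.
    u0, v0, u1, v1 = box
    rows = []
    for v in range(v0, v1 + 1, step):
        row = []
        for u in range(u0, u1 + 1, step):
            x, y = (u - cx) / fx, (v - cy) / fy
            norm = math.sqrt(x * x + y * y + 1.0)
            row.append((x / norm, y / norm, 1.0 / norm))
        rows.append(row)
    return rows

# Hypothetical intrinsics for a 640x480 image and a small 3x2 pixel box
# whose top-left corner is the principal point.
rays = ray_map(500.0, 500.0, 320.0, 240.0, (320, 240, 322, 241))
```

Because each ray depends on both the pixel location and the intrinsics, such a map carries the bounding box's position in the full image and the camera parameters together, which a crop-only input would lose.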

[85] Structural-Spectral Graph Convolution with Evidential Edge Learning for Hyperspectral Image Clustering

Jianhan Qi,Yuheng Jia,Hui Liu,Junhui Hou

Main category: cs.CV

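TL;DR: The paper proposes an improved method for hyperspectral image (HSI) clustering that combines a structural-spectral graph convolutional operator (SSGCO) with an evidence-guided adaptive edge learning (EGAEL) module to raise clustering accuracy.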

Details

Motivation: Existing graph neural network (GNN) based methods do not fully exploit the spectral information in HSIs, and inaccuracies in the superpixel topological graph can confuse class semantics. Method: Propose SSGCO to extract spatial and spectral features jointly, and design the EGAEL module to adaptively refine edge weights. The method is integrated into a contrastive learning framework so that representation learning and clustering proceed simultaneously. Result: Clustering accuracy improves by 2.61%, 6.06%, 4.96%, and 3.15% on four HSI datasets. Conclusion: The SSGCO and EGAEL modules effectively improve HSI clustering, and the method has potential for practical use. Abstract: Hyperspectral image (HSI) clustering assigns similar pixels to the same class without any annotations, which is an important yet challenging task. For large-scale HSIs, most methods rely on superpixel segmentation and perform superpixel-level clustering based on graph neural networks (GNNs). However, existing GNNs cannot fully exploit the spectral information of the input HSI, and the inaccurate superpixel topological graph may lead to the confusion of different class semantics during information aggregation. To address these challenges, we first propose a structural-spectral graph convolutional operator (SSGCO) tailored for graph-structured HSI superpixels to improve their representation quality through the co-extraction of spatial and spectral features. Second, we propose an evidence-guided adaptive edge learning (EGAEL) module that adaptively predicts and refines edge weights in the superpixel topological graph. We integrate the proposed method into a contrastive learning framework to achieve clustering, where representation learning and clustering are simultaneously conducted. Experiments demonstrate that the proposed method improves clustering accuracy by 2.61%, 6.06%, 4.96% and 3.15% over the best compared methods on four HSI datasets. Our code is available at https://github.com/jhqi/SSGCO-EGAEL.

[86] HadaNorm: Diffusion Transformer Quantization through Mean-Centered Transformations

Marco Federici,Riccardo Del Chiaro,Boris van Breugel,Paul Whatmough,Markus Nagel

Main category: cs.CV

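TL;DR: HadaNorm is a new linear transformation that normalizes activation feature channels before applying a Hadamard transform, effectively reducing outliers and enabling more aggressive activation quantization for diffusion models.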

Details

Motivation: Diffusion models excel at image generation, but their high memory and compute demands hinder deployment on resource-constrained devices. Post-training quantization (PTQ) can reduce the bitwidth of matrix operations, but standard methods struggle with outliers, and higher compression requires additional transformations of weights and activations. Method: Propose HadaNorm, which normalizes activation feature channels and then applies a Hadamard transform to reduce the effect of outliers, supporting more aggressive activation quantization. Result: HadaNorm consistently reduces quantization error across the components of transformer blocks and achieves a better efficiency-performance trade-off than existing methods. Conclusion: HadaNorm provides an efficient solution for quantizing diffusion models and improves their prospects for deployment on resource-constrained devices. Abstract: Diffusion models represent the cutting edge in image generation, but their high memory and computational demands hinder deployment on resource-constrained devices. Post-Training Quantization (PTQ) offers a promising solution by reducing the bitwidth of matrix operations. However, standard PTQ methods struggle with outliers, and achieving higher compression often requires transforming model weights and activations before quantization. In this work, we propose HadaNorm, a novel linear transformation that extends existing approaches and effectively mitigates outliers by normalizing activation feature channels before applying Hadamard transformations, enabling more aggressive activation quantization. We demonstrate that HadaNorm consistently reduces quantization error across the various components of transformer blocks, achieving superior efficiency-performance trade-offs when compared to state-of-the-art methods.
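The effect of centering channels before a Hadamard rotation can be seen in a small sketch. It assumes that "normalizing" means subtracting each channel's mean computed across tokens (suggested by the "Mean-Centered" in the title) and uses a plain normalized fast Walsh-Hadamard transform; the toy activations are hypothetical, and the actual HadaNorm transformation may include further steps.

```python
def fwht(vec):
    # Normalized fast Walsh-Hadamard transform; len(vec) must be a power of 2.
    out = list(vec)
    n = len(out)
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                a, b = out[j], out[j + h]
                out[j], out[j + h] = a + b, a - b
        h *= 2
    scale = n ** 0.5
    return [x / scale for x in out]

def center_channels(tokens):
    # Subtract each channel's mean, computed across all tokens.
    n = len(tokens)
    means = [sum(t[c] for t in tokens) / n for c in range(len(tokens[0]))]
    return [[x - m for x, m in zip(t, means)] for t in tokens]

def max_abs(tokens):
    return max(abs(x) for t in tokens for x in t)

# Hypothetical activations (2 tokens x 4 channels); channel 0 has a large mean.
acts = [[10.0, 0.1, -0.2, 0.3],
        [10.2, -0.1, 0.2, -0.3]]

hadamard_only = [fwht(t) for t in acts]
centered_then_hadamard = [fwht(t) for t in center_channels(acts)]
```

A Hadamard rotation alone spreads the outlier channel's energy over all channels, which lowers the peak but leaves every channel carrying a large offset. Removing the per-channel mean first leaves only the small residuals, so the range a quantizer must cover shrinks much further.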

[87] LEO-VL: Towards 3D Vision-Language Generalists via Data Scaling with Efficient Representation

Jiangyong Huang,Xiaojian Ma,Xiongkun Linghu,Yue Fan,Junchao He,Wenxin Tan,Qing Li,Song-Chun Zhu,Yixin Chen,Baoxiong Jia,Siyuan Huang

Main category: cs.CV

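TL;DR: The paper proposes LEO-VL, a model built on the condensed feature grid (CFG), an efficient scene representation that addresses the data scalability problem of 3D-VL generalist models, and achieves state-of-the-art performance on multiple tasks.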

Details

Motivation: Building 3D-VL generalists that understand 3D scenes and follow natural language instructions is a long-standing goal, but current models lag behind their 2D counterparts in capability and robustness, with data scalability as the main obstacle. Method: Propose LEO-VL, which uses the condensed feature grid (CFG), an efficient scene representation, to reduce token overhead, and train it on over 700k high-quality 3D-VL data samples. Result: LEO-VL achieves state-of-the-art performance on several 3D QA benchmarks, including SQA3D, MSQA, and Beacon3D. Conclusion: The study confirms the efficiency of CFG, the importance of task and scene diversity, and the validity of the data curation principle; the proposed SceneDPO further improves model robustness, advancing 3D-VL generalists. Abstract: Developing 3D-VL generalists capable of understanding 3D scenes and following natural language instructions to perform a wide range of tasks has been a long-standing goal in the 3D-VL community. Despite recent progress, 3D-VL models still lag behind their 2D counterparts in capability and robustness, falling short of the generalist standard. A key obstacle to developing 3D-VL generalists lies in data scalability, hindered by the lack of an efficient scene representation. We propose LEO-VL, a 3D-VL model built upon condensed feature grid (CFG), an efficient scene representation that bridges 2D perception and 3D spatial structure while significantly reducing token overhead. This efficiency unlocks large-scale training towards 3D-VL generalists, for which we curate over 700k high-quality 3D-VL data spanning four domains of real-world indoor scenes and five tasks such as captioning and dialogue. LEO-VL achieves state-of-the-art performance on a variety of 3D QA benchmarks, including SQA3D, MSQA, and Beacon3D. Ablation studies confirm the efficiency of our representation, the importance of task and scene diversity, and the validity of our data curation principle. Furthermore, we introduce SceneDPO, a novel post-training objective that enhances the robustness of 3D-VL models. We hope our findings contribute to the advancement of scalable and robust 3D-VL generalists.

[88] CausalVQA: A Physically Grounded Causal Reasoning Benchmark for Video Models

Aaron Foss,Chloe Evans,Sasha Mitts,Koustuv Sinha,Ammar Rizvi,Justine T. Kao

Main category: cs.CV

TL;DR: CausalVQA is a benchmark dataset for video question answering (VQA) that tests models' understanding of causality in the real world.

Details

Motivation: Existing VQA datasets mostly focus on surface perception or narrow physical reasoning questions and lack in-depth probing of causality; CausalVQA fills this gap. Method: The dataset contains five question types (counterfactual, hypothetical, anticipation, planning, and descriptive) and uses quality control mechanisms to prevent models from exploiting trivial shortcuts. Result: Current frontier multimodal models perform substantially below humans on CausalVQA, especially on anticipation and hypothetical questions. Conclusion: CausalVQA reveals the shortcomings of current systems in spatial-temporal reasoning, understanding of physical principles, and predicting alternatives. Abstract: We introduce CausalVQA, a benchmark dataset for video question answering (VQA) composed of question-answer pairs that probe models' understanding of causality in the physical world. Existing VQA benchmarks either tend to focus on surface perceptual understanding of real-world videos, or on narrow physical reasoning questions created using simulation environments. CausalVQA fills an important gap by presenting challenging questions that are grounded in real-world scenarios, while focusing on models' ability to predict the likely outcomes of different actions and events through five question types: counterfactual, hypothetical, anticipation, planning and descriptive. We designed quality control mechanisms that prevent models from exploiting trivial shortcuts, requiring models to base their answers on deep visual understanding instead of linguistic cues. We find that current frontier multimodal models fall substantially below human performance on the benchmark, especially on anticipation and hypothetical questions. This highlights a challenge for current systems to leverage spatial-temporal reasoning, understanding of physical principles, and comprehension of possible alternatives to make accurate predictions in real-world settings.

[89] UniPre3D: Unified Pre-training of 3D Point Cloud Models with Cross-Modal Gaussian Splatting

Ziyi Wang,Yanran Zhang,Jie Zhou,Jiwen Lu

Main category: cs.CV

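TL;DR: UniPre3D is a unified pre-training method applicable to point clouds of any scale and 3D models of any architecture; it predicts Gaussian primitives and uses differentiable Gaussian splatting for end-to-end optimization.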

Details

Motivation: The scale diversity of point cloud data challenges unified representation learning, and no unified pre-training method currently works for both object- and scene-level point clouds. Method: Proposes UniPre3D, which predicts Gaussian primitives as the pre-training task and combines differentiable Gaussian splatting with 2D features from pre-trained image models to steer learning toward geometric structure. Result: Experiments validate the method's universal effectiveness across a variety of object- and scene-level tasks. Conclusion: UniPre3D is the first unified pre-training method that applies seamlessly to point clouds of different scales, with broad applicability. Abstract: The scale diversity of point cloud data presents significant challenges in developing unified representation learning techniques for 3D vision. Currently, there are few unified 3D models, and no existing pre-training method is equally effective for both object- and scene-level point clouds. In this paper, we introduce UniPre3D, the first unified pre-training method that can be seamlessly applied to point clouds of any scale and 3D models of any architecture. Our approach predicts Gaussian primitives as the pre-training task and employs differentiable Gaussian splatting to render images, enabling precise pixel-level supervision and end-to-end optimization. To further regulate the complexity of the pre-training task and direct the model's focus toward geometric structures, we integrate 2D features from pre-trained image models to incorporate well-established texture knowledge. We validate the universal effectiveness of our proposed method through extensive experiments across a variety of object- and scene-level tasks, using diverse point cloud models as backbones. Code is available at https://github.com/wangzy22/UniPre3D.

[90] Vision Generalist Model: A Survey

Ziyi Wang,Yongming Rao,Shuofeng Sun,Xinrun Liu,Yi Wei,Xumin Yu,Zuyan Liu,Yanbo Wang,Hongmin Liu,Jie Zhou,Jiwen Lu

Main category: cs.CV

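TL;DR: This survey reviews vision generalist models, covering their characteristics, capabilities, and applications in computer vision tasks, including background, framework design, performance-enhancing techniques, connections to related domains, and future research directions.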

Details

Motivation: The success of generalist models in natural language processing has spurred interest in applying them to computer vision, but the diversity of inputs and outputs in vision tasks poses challenges. Method: Reviews the background (datasets, tasks, benchmarks), analyzes existing framework designs, introduces performance-enhancing techniques, and discusses connections to related domains. Result: Provides a comprehensive overview of vision generalist models, including their application scenarios and performance. Conclusion: Summarizes current challenges and proposes future research directions, offering guidance for researchers seeking a deeper understanding of the field. Abstract: Recently, we have witnessed the great success of the generalist model in natural language processing. The generalist model is a general framework trained with massive data and is able to process various downstream tasks simultaneously. Encouraged by their impressive performance, an increasing number of researchers are venturing into the realm of applying these models to computer vision tasks. However, the inputs and outputs of vision tasks are more diverse, and it is difficult to summarize them as a unified representation. In this paper, we provide a comprehensive overview of the vision generalist models, delving into their characteristics and capabilities within the field. First, we review the background, including the datasets, tasks, and benchmarks. Then, we dig into the design of frameworks that have been proposed in existing research, while also introducing the techniques employed to enhance their performance. To better help the researchers comprehend the area, we take a brief excursion into related domains, shedding light on their interconnections and potential synergies. To conclude, we provide some real-world application scenarios, undertake a thorough examination of the persistent challenges, and offer insights into possible directions for future research endeavors.

[91] Kvasir-VQA-x1: A Multimodal Dataset for Medical Reasoning and Robust MedVQA in Gastrointestinal Endoscopy

Sushant Gautam,Michael A. Riegler,Pål Halvorsen

Main category: cs.CV

TL;DR: Kvasir-VQA-x1 is a new large-scale gastrointestinal endoscopy dataset that aims to advance MedVQA by increasing clinical complexity and visual diversity.

Details

Motivation: Existing MedVQA datasets lack clinical complexity and visual diversity, limiting progress on clinical decision support systems. Method: Uses large language models to generate 159,549 new question-answer pairs and introduces visual augmentations that mimic common imaging artifacts, supporting two evaluation tracks. Result: Kvasir-VQA-x1 provides a more challenging and clinically relevant benchmark to support the development of multimodal AI systems. Conclusion: The dataset adheres to FAIR principles and offers a valuable resource for the research community, accelerating the development of clinical AI systems. Abstract: Medical Visual Question Answering (MedVQA) is a promising field for developing clinical decision support systems, yet progress is often limited by the available datasets, which can lack clinical complexity and visual diversity. To address these gaps, we introduce Kvasir-VQA-x1, a new, large-scale dataset for gastrointestinal (GI) endoscopy. Our work significantly expands upon the original Kvasir-VQA by incorporating 159,549 new question-answer pairs that are designed to test deeper clinical reasoning. We developed a systematic method using large language models to generate these questions, which are stratified by complexity to better assess a model's inference capabilities. To ensure our dataset prepares models for real-world clinical scenarios, we have also introduced a variety of visual augmentations that mimic common imaging artifacts. The dataset is structured to support two main evaluation tracks: one for standard VQA performance and another to test model robustness against these visual perturbations. By providing a more challenging and clinically relevant benchmark, Kvasir-VQA-x1 aims to accelerate the development of more reliable and effective multimodal AI systems for use in clinical settings. The dataset is fully accessible and adheres to FAIR data principles, making it a valuable resource for the wider research community. Code and data: https://github.com/Simula/Kvasir-VQA-x1 and https://huggingface.co/datasets/SimulaMet/Kvasir-VQA-x1

[92] Reinforcing Spatial Reasoning in Vision-Language Models with Interwoven Thinking and Visual Drawing

Junfei Wu,Jian Guan,Kaituo Feng,Qiang Liu,Shu Wu,Liang Wang,Wei Wu,Tieniu Tan

Main category: cs.CV

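TL;DR: Proposes a new method that enhances the spatial reasoning of large vision-language models (LVLMs) through drawing operations in the visual space, significantly improving performance on spatial reasoning tasks.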

Details

Motivation: Existing multimodal reasoning methods rely mainly on pure text and lack precise geometric understanding and continuous tracking of spatial relationships, which limits performance on spatial reasoning tasks. Method: Proposes a new paradigm called "drawing to reason in space", which equips the model with basic drawing operations (such as annotating bounding boxes and drawing auxiliary lines) so that it can express and analyze spatial relationships directly in the visual space. A three-stage training framework is used: cold-start training with synthetic data, reflective rejection sampling, and reinforcement learning. Result: The model, VILASR, improves by an average of 18.4% across maze navigation, static spatial reasoning, video-based reasoning, and multi-view reasoning tasks, significantly outperforming existing methods. Conclusion: Enhancing the spatial reasoning of LVLMs through visual drawing operations is effective and offers a new direction for multimodal reasoning. Abstract: As textual reasoning with large language models (LLMs) has advanced significantly, there has been growing interest in enhancing the multimodal reasoning capabilities of large vision-language models (LVLMs). However, existing methods primarily approach multimodal reasoning in a straightforward, text-centric manner, where both reasoning and answer derivation are conducted purely through text, with the only difference being the presence of multimodal input. As a result, these methods often encounter fundamental limitations in spatial reasoning tasks that demand precise geometric understanding and continuous spatial tracking, capabilities that humans achieve through mental visualization and manipulation. To address the limitations, we propose drawing to reason in space, a novel paradigm that enables LVLMs to reason through elementary drawing operations in the visual space. By equipping models with basic drawing operations, including annotating bounding boxes and drawing auxiliary lines, we empower them to express and analyze spatial relationships through direct visual manipulation, while avoiding the performance ceiling imposed by specialized perception tools in previous tool-integrated reasoning approaches. To cultivate this capability, we develop a three-stage training framework: cold-start training with synthetic data to establish basic drawing abilities, reflective rejection sampling to enhance self-reflection behaviors, and reinforcement learning to directly optimize for target rewards. Extensive experiments demonstrate that our model, named VILASR, consistently outperforms existing methods across diverse spatial reasoning benchmarks, involving maze navigation, static spatial reasoning, video-based reasoning, and multi-view-based reasoning tasks, with an average improvement of 18.4%.
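
As a rough illustration of the two drawing operations named in the abstract (annotating bounding boxes and drawing auxiliary lines), the sketch below rasterizes both onto a small character grid. It is not the VILASR implementation: the canvas representation and the function names `blank_canvas`, `draw_box`, and `draw_line` are assumptions made for this example, and the line routine is the standard Bresenham algorithm.

```python
def blank_canvas(width, height, fill="."):
    """Hypothetical stand-in for an image: a grid of single characters."""
    return [[fill] * width for _ in range(height)]


def draw_box(canvas, x0, y0, x1, y1, mark="#"):
    """Outline an axis-aligned bounding box with inclusive corners."""
    for x in range(x0, x1 + 1):
        canvas[y0][x] = mark
        canvas[y1][x] = mark
    for y in range(y0, y1 + 1):
        canvas[y][x0] = mark
        canvas[y][x1] = mark


def draw_line(canvas, x0, y0, x1, y1, mark="*"):
    """Draw an auxiliary line between two points (Bresenham's algorithm)."""
    dx, dy = abs(x1 - x0), -abs(y1 - y0)
    sx = 1 if x0 < x1 else -1
    sy = 1 if y0 < y1 else -1
    err = dx + dy
    while True:
        canvas[y0][x0] = mark
        if x0 == x1 and y0 == y1:
            break
        e2 = 2 * err
        if e2 >= dy:  # step along x
            err += dy
            x0 += sx
        if e2 <= dx:  # step along y
            err += dx
            y0 += sy


canvas = blank_canvas(6, 4)
draw_box(canvas, 0, 0, 2, 2)   # mark an object
draw_line(canvas, 3, 3, 5, 3)  # mark a path segment next to it
for row in canvas:
    print("".join(row))
```

In the paradigm described above, the model would issue such operations as intermediate reasoning steps and then look at the annotated image; how operations are encoded and fed back is not specified in the abstract.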

[93] Vectorized Region Based Brush Strokes for Artistic Rendering

Jeripothula Prudviraj,Vikram Jamwal

Main category: cs.CV

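TL;DR: Proposes a semantically guided image-to-painting method that improves the artistry and quality of paintings by optimizing brush stroke parameters and rendering sequentially.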

Details

Motivation: Address the shortcomings of existing stroke-based painting systems in capturing artistic intent and principles, and bridge the emotional and educational gap between a static artwork and its creation process. Method: Combines semantic guidance, brush stroke parameter computation, and a sequential rendering strategy to achieve region-based painting generation. Result: Experiments show that the method produces paintings with high fidelity and superior stroke quality across various input images. Conclusion: The method effectively improves the artistry and practicality of painting generation, providing a new tool for artistic creation and education. Abstract: Creating a stroke-by-stroke evolution process of a visual artwork tries to bridge the emotional and educational gap between the finished static artwork and its creation process. Recent stroke-based painting systems focus on capturing stroke details by predicting and iteratively refining stroke parameters to maximize the similarity between the input image and the rendered output. However, these methods often struggle to produce stroke compositions that align with artistic principles and intent. To address this, we explore an image-to-painting method that (i) facilitates semantic guidance for brush strokes in targeted regions, (ii) computes the brush stroke parameters, and (iii) establishes a sequence among segments and strokes to sequentially render the final painting. Experimental results on various input image types, such as face images, paintings, and photographic images, show that our method aligns with a region-based painting strategy while rendering a painting with high fidelity and superior stroke quality.

[94] Efficient Part-level 3D Object Generation via Dual Volume Packing

Jiaxiang Tang,Ruijie Lu,Zhaoshuo Li,Zekun Hao,Xuan Li,Fangyin Wei,Shuran Song,Gang Zeng,Ming-Yu Liu,Tsung-Yi Lin

Main category: cs.CV

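TL;DR: Proposes a new end-to-end framework for generating 3D objects with an arbitrary number of semantically meaningful parts, addressing the limitation of existing methods that generate a single fused mesh.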

Details

Motivation: Existing 3D object generation methods usually produce a single mesh, which limits part-level editing, and the number of parts varies across objects. Method: Uses a dual volume packing strategy that organizes all parts into two complementary volumes, generating complete and interleaved parts that assemble into the final object. Result: Experiments show that the method outperforms previous image-based part-level generation methods in quality, diversity, and generalization. Conclusion: The method addresses the challenges of part-level 3D object generation and offers greater flexibility for editing and manipulation. Abstract: Recent progress in 3D object generation has greatly improved both the quality and efficiency. However, most existing methods generate a single mesh with all parts fused together, which limits the ability to edit or manipulate individual parts. A key challenge is that different objects may have a varying number of parts. To address this, we propose a new end-to-end framework for part-level 3D object generation. Given a single input image, our method generates high-quality 3D objects with an arbitrary number of complete and semantically meaningful parts. We introduce a dual volume packing strategy that organizes all parts into two complementary volumes, allowing for the creation of complete and interleaved parts that assemble into the final object. Experiments show that our model achieves better quality, diversity, and generalization than previous image-based part-level generation methods.

[95] ReSim: Reliable World Simulation for Autonomous Driving

Jiazhi Yang,Kashyap Chitta,Shenyuan Gao,Long Chen,Yuqian Shao,Xiaosong Jia,Hongyang Li,Andreas Geiger,Xiangyu Yue,Li Chen

Main category: cs.CV

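TL;DR: The paper proposes ReSim, a model that combines real-world driving data with non-expert data from a simulator to improve the diversity and reliability of driving scenario simulation, and introduces a Video2Reward module to estimate action rewards.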

Details

Motivation: Existing driving world models are built only on real-world safe driving data and struggle to simulate hazardous or non-expert behaviors, limiting their use in tasks such as policy evaluation. Method: Combines real-world driving data with non-expert simulator data to build a controllable world model, starting from a video generator with a diffusion transformer architecture, and devises strategies to improve prediction controllability and fidelity. Result: ReSim achieves up to 44% higher visual fidelity, improves controllability for both expert and non-expert actions by over 50%, and boosts planning and policy selection performance on NAVSIM by 2% and 25%, respectively. Conclusion: Through data diversification and improved controllability, ReSim achieves reliable simulation of open-world driving scenarios and provides an effective tool for tasks such as policy evaluation. Abstract: How can we reliably simulate future driving scenarios under a wide range of ego driving behaviors? Recent driving world models, developed exclusively on real-world driving data composed mainly of safe expert trajectories, struggle to follow hazardous or non-expert behaviors, which are rare in such data. This limitation restricts their applicability to tasks such as policy evaluation. In this work, we address this challenge by enriching real-world human demonstrations with diverse non-expert data collected from a driving simulator (e.g., CARLA), and building a controllable world model trained on this heterogeneous corpus. Starting with a video generator featuring a diffusion transformer architecture, we devise several strategies to effectively integrate conditioning signals and improve prediction controllability and fidelity. The resulting model, ReSim, enables Reliable Simulation of diverse open-world driving scenarios under various actions, including hazardous non-expert ones. To close the gap between high-fidelity simulation and applications that require reward signals to judge different actions, we introduce a Video2Reward module that estimates a reward from ReSim's simulated future. Our ReSim paradigm achieves up to 44% higher visual fidelity, improves controllability for both expert and non-expert actions by over 50%, and boosts planning and policy selection performance on NAVSIM by 2% and 25%, respectively.

[96] AnimateAnyMesh: A Feed-Forward 4D Foundation Model for Text-Driven Universal Mesh Animation

Zijie Wu,Chaohui Yu,Fan Wang,Xiang Bai

Main category: cs.CV

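TL;DR: AnimateAnyMesh is the first feed-forward framework for efficient text-driven animation of arbitrary 3D meshes; with its DyMeshVAE architecture and Rectified Flow-based training strategy, it significantly improves the efficiency and quality of 4D content generation.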

Details

Motivation: 4D content generation is challenged by the complexity of modeling spatio-temporal distributions and the scarcity of training data; AnimateAnyMesh aims to address these problems. Method: Uses the DyMeshVAE architecture to disentangle spatial and temporal features while preserving local topological structure, combined with a Rectified Flow-based training strategy for high-quality text-conditional generation. Result: Experiments show that the method generates semantically accurate and temporally coherent mesh animations in a few seconds, outperforming existing methods in both quality and efficiency. Conclusion: AnimateAnyMesh makes 4D content creation substantially more practical and accessible; all data, code, and models will be open-sourced. Abstract: Recent advances in 4D content generation have attracted increasing attention, yet creating high-quality animated 3D models remains challenging due to the complexity of modeling spatio-temporal distributions and the scarcity of 4D training data. In this paper, we present AnimateAnyMesh, the first feed-forward framework that enables efficient text-driven animation of arbitrary 3D meshes. Our approach leverages a novel DyMeshVAE architecture that effectively compresses and reconstructs dynamic mesh sequences by disentangling spatial and temporal features while preserving local topological structures. To enable high-quality text-conditional generation, we employ a Rectified Flow-based training strategy in the compressed latent space. Additionally, we contribute the DyMesh Dataset, containing over 4M diverse dynamic mesh sequences with text annotations. Experimental results demonstrate that our method generates semantically accurate and temporally coherent mesh animations in a few seconds, significantly outperforming existing approaches in both quality and efficiency. Our work marks a substantial step forward in making 4D content creation more accessible and practical. All the data, code, and models will be open-released.
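
For readers unfamiliar with rectified flow, the sketch below shows its generic form: a training pair is built by linearly interpolating between a noise sample `x0` and a data sample `x1`, the regression target is the constant velocity `x1 - x0`, and sampling integrates a velocity field with Euler steps. This is only the textbook formulation. The abstract does not give the latent parameterization, text conditioning, or loss weighting used by AnimateAnyMesh, and the names `rectified_flow_pair` and `euler_sample` are hypothetical.

```python
def rectified_flow_pair(x0, x1, t):
    """Build one training example for time t in [0, 1].

    x0: noise sample, x1: data (e.g. latent) sample, both lists of floats.
    Returns (x_t, target_velocity); a network v(x_t, t) would be trained
    to regress target_velocity with a squared-error loss.
    """
    x_t = [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]
    target_velocity = [b - a for a, b in zip(x0, x1)]
    return x_t, target_velocity


def euler_sample(velocity_fn, x0, steps=4):
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with fixed Euler steps."""
    x = list(x0)
    dt = 1.0 / steps
    for i in range(steps):
        v = velocity_fn(x, i * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


noise, data = [0.0, 0.0], [1.0, 2.0]
x_half, v_target = rectified_flow_pair(noise, data, t=0.5)
print(x_half, v_target)

# With a perfect (oracle) velocity field, Euler integration lands on the data point.
print(euler_sample(lambda x, t: v_target, noise))
```

Because the target velocity is constant along each straight path, a well-trained model can often be sampled with few integration steps, which is consistent with the "few seconds" generation time reported above, though the abstract does not state the step count.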

[97] InterActHuman: Multi-Concept Human Animation with Layout-Aligned Audio Conditions

Zhenzhi Wang,Jiaqi Yang,Jianwen Jiang,Chao Liang,Gaojie Lin,Zerong Zheng,Ceyuan Yang,Dahua Lin

Main category: cs.CV

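TL;DR: Proposes a new framework that achieves precise control of multiple concepts (such as multiple people and objects) through region-specific condition binding, addressing the limitation that existing methods support only a single subject and global condition injection.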

Details

Motivation: Existing methods cannot handle human-human and human-object interactions in multi-concept scenes, which limits applications; this work aims to achieve precise multi-concept control through local condition binding. Method: Uses a mask predictor to automatically infer layout information and iteratively injects local audio conditions into the corresponding regions, achieving layout-aligned modality matching. Result: Experiments show that explicit layout control for multi-modal conditions outperforms implicit counterparts and other existing methods. Conclusion: The proposed framework generates high-quality, controllable, multi-concept human-centric videos, offering an effective solution for complex scenes. Abstract: End-to-end human animation with rich multi-modal conditions, e.g., text, image and audio, has achieved remarkable advancements in recent years. However, most existing methods could only animate a single subject and inject conditions in a global manner, ignoring scenarios in which multiple concepts could appear in the same video with rich human-human interactions and human-object interactions. Such a global assumption prevents precise and per-identity control of multiple concepts including humans and objects, and therefore hinders applications. In this work, we discard the single-entity assumption and introduce a novel framework that enforces strong, region-specific binding of conditions from modalities to each identity's spatiotemporal footprint. Given reference images of multiple concepts, our method could automatically infer layout information by leveraging a mask predictor to match appearance cues between the denoised video and each reference appearance. Furthermore, we inject local audio condition into its corresponding region to ensure layout-aligned modality matching in an iterative manner. This design enables the high-quality generation of controllable multi-concept human-centric videos. Empirical results and ablation studies validate the effectiveness of our explicit layout control for multi-modal conditions compared to implicit counterparts and other existing methods.

[98] A Shortcut-aware Video-QA Benchmark for Physical Understanding via Minimal Video Pairs

Benno Krojer,Mojtaba Komeili,Candace Ross,Quentin Garrido,Koustuv Sinha,Nicolas Ballas,Mahmoud Assran

Main category: cs.CV

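TL;DR: The paper proposes the MVP benchmark for evaluating the physical-world understanding of video language models, avoiding the score inflation that superficial visual or textual cues cause in existing benchmarks.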

Details

Motivation: Existing benchmarks are susceptible to superficial cues, leading to inaccurate assessment of model performance, so a more reliable evaluation method is needed. Method: Introduces the MVP benchmark, containing 55K high-quality multiple-choice video QA examples; each example has a minimal-change pair, and a model must answer both questions correctly, which prevents reliance on superficial cues. Result: Human performance is 92.9%, the best open-source model achieves 40.2%, and random performance is 25%. Conclusion: The MVP benchmark effectively avoids interference from superficial cues and provides a reliable tool for evaluating the physical understanding of video language models. Abstract: Existing benchmarks for assessing the spatio-temporal understanding and reasoning abilities of video language models are susceptible to score inflation due to the presence of shortcut solutions based on superficial visual or textual cues. This paper mitigates the challenges in accurately assessing model performance by introducing the Minimal Video Pairs (MVP) benchmark, a simple shortcut-aware video QA benchmark for assessing the physical understanding of video language models. The benchmark is comprised of 55K high-quality multiple-choice video QA examples focusing on physical world understanding. Examples are curated from nine video data sources, spanning first-person egocentric and exocentric videos, robotic interaction data, and cognitive science intuitive physics benchmarks. To mitigate shortcut solutions that rely on superficial visual or textual cues and biases, each sample in MVP has a minimal-change pair: a visually similar video accompanied by an identical question but an opposing answer. To answer a question correctly, a model must provide correct answers for both examples in the minimal-change pair; as such, models that solely rely on visual or textual biases would achieve below random performance. Human performance on MVP is 92.9%, while the best open-source state-of-the-art video-language model achieves 40.2% compared to random performance at 25%.
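
The paired scoring rule described above can be sketched as follows: a pair counts as solved only if both of its questions are answered correctly. The function names are hypothetical, and the chance-level helper assumes independent uniform guesses; the abstract does not state the number of answer options, so the two-option case shown here is an assumption that happens to reproduce the reported 25% random baseline.

```python
from typing import Sequence, Tuple


def paired_accuracy(pair_results: Sequence[Tuple[bool, bool]]) -> float:
    """Fraction of minimal-change pairs where BOTH answers are correct."""
    if not pair_results:
        raise ValueError("need at least one pair")
    solved = sum(1 for first, second in pair_results if first and second)
    return solved / len(pair_results)


def chance_paired_accuracy(num_options: int) -> float:
    """Expected paired accuracy under independent uniform guessing (assumed)."""
    return (1.0 / num_options) ** 2


# Each tuple is (correct on video A, correct on its minimal-change twin B).
results = [(True, True), (True, False), (False, False), (True, True)]
print(paired_accuracy(results))
print(chance_paired_accuracy(2))

# A model that ignores the video gives the same answer to both members of a
# pair; since the two answers oppose each other, it gets exactly one right.
video_blind = [(True, False), (False, True), (True, False)]
print(paired_accuracy(video_blind))
```

The last case shows why shortcut-driven models can score below random under this rule: per-question accuracy would be 50%, but no pair is fully solved.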

[99] EditInspector: A Benchmark for Evaluation of Text-Guided Image Edits

Ron Yosef,Moran Yanuka,Yonatan Bitton,Dani Lischinski

Main category: cs.CV

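TL;DR: EditInspector is a new benchmark based on human annotations for evaluating the quality of text-guided image edits; it finds that existing models struggle to evaluate edits comprehensively and proposes two new methods that improve on them.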

Details

Motivation: With the advance of generative AI, text-guided image editing is increasingly widespread, but a comprehensive framework for verifying and assessing the quality of these edits is lacking. Method: Introduces the EditInspector benchmark, which uses human annotations collected with an edit-verification template, and evaluates existing models across multiple dimensions. Two new methods are proposed to improve performance. Result: Existing models perform poorly at evaluating edits and frequently hallucinate when describing changes. The proposed methods outperform existing models in artifact detection and difference caption generation. Conclusion: EditInspector provides a new benchmark for evaluating text-guided edits, and the new methods significantly improve evaluation performance. Abstract: Text-guided image editing, fueled by recent advancements in generative AI, is becoming increasingly widespread. This trend highlights the need for a comprehensive framework to verify text-guided edits and assess their quality. To address this need, we introduce EditInspector, a novel benchmark for evaluation of text-guided image edits, based on human annotations collected using an extensive template for edit verification. We leverage EditInspector to evaluate the performance of state-of-the-art (SoTA) vision and language models in assessing edits across various dimensions, including accuracy, artifact detection, visual quality, seamless integration with the image scene, adherence to common sense, and the ability to describe edit-induced changes. Our findings indicate that current models struggle to evaluate edits comprehensively and frequently hallucinate when describing the changes. To address these challenges, we propose two novel methods that outperform SoTA models in both artifact detection and difference caption generation.

[100] Hearing Hands: Generating Sounds from Physical Interactions in 3D Scenes

Yiming Dou,Wonseok Oh,Yuqing Luo,Antonio Loquercio,Andrew Owens

Main category: cs.CV

TL;DR: Studies how to make 3D scene reconstructions interactive by predicting the sounds of human hands physically interacting with a scene.

Details

Motivation: Explore how sound can enhance the interactivity of 3D scenes by simulating real-world interactions between human hands and objects. Method: Record videos of human hands manipulating objects in a 3D scene, then use the action-sound pairs to train a rectified flow model that maps 3D hand trajectories to the corresponding audio. Result: The generated audio accurately conveys material properties and actions, and human observers often cannot distinguish it from real sounds. Conclusion: The method enhances the interactivity of 3D scenes through sound prediction with a high degree of realism. Abstract: We study the problem of making 3D scene reconstructions interactive by asking the following question: can we predict the sounds of human hands physically interacting with a scene? First, we record a video of a human manipulating objects within a 3D scene using their hands. We then use these action-sound pairs to train a rectified flow model to map 3D hand trajectories to their corresponding audio. At test time, a user can query the model for other actions, parameterized as sequences of hand poses, to estimate their corresponding sounds. In our experiments, we find that our generated sounds accurately convey material properties and actions, and that they are often indistinguishable to human observers from real sounds. Project page: https://www.yimingdou.com/hearing_hands/

[101] Text-Aware Image Restoration with Diffusion Models

Jaewon Min,Jin Hyeon Kim,Paul Hyunbin Cho,Jaeeun Lee,Jihye Park,Minkyu Park,Sangpil Kim,Hyunhee Park,Seungryong Kim

Main category: cs.CV

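TL;DR: This paper proposes TAIR, a new image restoration task focused on simultaneously recovering visual content and textual fidelity, and presents the TeReDiff framework, which significantly improves text recognition accuracy.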

Details

Motivation: Existing diffusion-based image restoration methods perform well on natural images but tend to produce incorrect text patterns when reconstructing textual regions (text-image hallucination). Method: Presents the SA-Text benchmark dataset and the TeReDiff framework, which combines a diffusion model with a text-spotting module and extracts rich text representations through joint training. Result: Experiments show that TeReDiff significantly outperforms existing methods in text recognition accuracy. Conclusion: The TAIR task and the TeReDiff framework offer an effective solution to text-image hallucination. Abstract: Image restoration aims to recover degraded images. However, existing diffusion-based restoration methods, despite great success in natural image restoration, often struggle to faithfully reconstruct textual regions in degraded images. Those methods frequently generate plausible but incorrect text-like patterns, a phenomenon we refer to as text-image hallucination. In this paper, we introduce Text-Aware Image Restoration (TAIR), a novel restoration task that requires the simultaneous recovery of visual contents and textual fidelity. To tackle this task, we present SA-Text, a large-scale benchmark of 100K high-quality scene images densely annotated with diverse and complex text instances. Furthermore, we propose a multi-task diffusion framework, called TeReDiff, that integrates internal features from diffusion models into a text-spotting module, enabling both components to benefit from joint training. This allows for the extraction of rich text representations, which are utilized as prompts in subsequent denoising steps. Extensive experiments demonstrate that our approach consistently outperforms state-of-the-art restoration methods, achieving significant gains in text recognition accuracy. See our project page: https://cvlab-kaist.github.io/TAIR/

[102] PlayerOne: Egocentric World Simulator

Yuanpeng Tu,Hao Luo,Xi Chen,Xiang Bai,Fan Wang,Hengshuang Zhao

Main category: cs.CV

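TL;DR: PlayerOne is the first egocentric realistic world simulator, able to dynamically generate egocentric videos that are strictly aligned with the user's real-scene motion.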

Details

Motivation: Aims to enable immersive and unrestricted exploration of dynamic environments and to advance research on world modeling and its diverse applications. Method: Uses a coarse-to-fine pipeline, with pretraining on large-scale egocentric text-video pairs followed by finetuning on synchronous motion-video data, and designs a part-disentangled motion injection scheme and a joint reconstruction framework. Result: Experiments show strong generalization in precise control of varying human movements and world-consistent modeling. Conclusion: PlayerOne is the first endeavor into egocentric real-world simulation and opens new directions for world modeling and its applications. Abstract: We introduce PlayerOne, the first egocentric realistic world simulator, facilitating immersive and unrestricted exploration within vividly dynamic environments. Given an egocentric scene image from the user, PlayerOne can accurately construct the corresponding world and generate egocentric videos that are strictly aligned with the real scene human motion of the user captured by an exocentric camera. PlayerOne is trained in a coarse-to-fine pipeline that first performs pretraining on large-scale egocentric text-video pairs for coarse-level egocentric understanding, followed by finetuning on synchronous motion-video data extracted from egocentric-exocentric video datasets with our automatic construction pipeline. Besides, considering the varying importance of different components, we design a part-disentangled motion injection scheme, enabling precise control of part-level movements. In addition, we devise a joint reconstruction framework that progressively models both the 4D scene and video frames, ensuring scene consistency in the long-form video generation. Experimental results demonstrate its great generalization ability in precise control of varying human movements and world-consistent modeling of diverse scenarios. It marks the first endeavor into egocentric real-world simulation and can pave the way for the community to delve into fresh frontiers of world modeling and its diverse applications.

cs.GR

[103] STREAMINGGS: Voxel-Based Streaming 3D Gaussian Splatting with Memory Optimization and Architectural Support

Chenqi Zhang,Yu Feng,Jieru Zhao,Guangda Liu,Wenchao Ding,Chentao Wu,Minyi Guo

Main category: cs.GR

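TL;DR: STREAMINGGS is an algorithm-architecture co-design for 3D Gaussian Splatting (3DGS) that significantly improves real-time performance on mobile devices through memory-centric rendering.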

Details

Motivation: 3DGS struggles to meet the 90 FPS real-time requirement on mobile devices, and existing accelerators overlook memory efficiency, leading to redundant DRAM traffic. Method: Adopts a fully streaming 3DGS design that shifts from tile-centric rendering to memory-centric rendering, achieving fine-grained pipelining and reducing DRAM traffic. Result: The design achieves up to 45.7× speedup and 62.9× energy savings over mobile Ampere GPUs. Conclusion: By optimizing memory efficiency, STREAMINGGS significantly improves the performance and energy efficiency of 3DGS on mobile devices. Abstract: 3D Gaussian Splatting (3DGS) has gained popularity for its efficiency and sparse Gaussian-based representation. However, 3DGS struggles to meet the real-time requirement of 90 frames per second (FPS) on resource-constrained mobile devices, achieving only 2 to 9 FPS. Existing accelerators focus on compute efficiency but overlook memory efficiency, leading to redundant DRAM traffic. We introduce STREAMINGGS, a fully streaming 3DGS algorithm-architecture co-design that achieves fine-grained pipelining and reduces DRAM traffic by transforming from a tile-centric rendering to a memory-centric rendering. Results show that our design achieves up to 45.7× speedup and 62.9× energy savings over mobile Ampere GPUs.
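
To put the frame rates quoted above in terms of a per-frame time budget, the small sketch below converts FPS to milliseconds per frame. It uses only the figures stated in the abstract (the 90 FPS target and the reported 2 to 9 FPS); the helper name `frame_time_ms` is made up for this example.

```python
def frame_time_ms(fps: float) -> float:
    """Time available per frame, in milliseconds, at a given frame rate."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return 1000.0 / fps


for label, fps in [
    ("90 FPS real-time target", 90.0),
    ("9 FPS (reported upper end on mobile)", 9.0),
    ("2 FPS (reported lower end on mobile)", 2.0),
]:
    print(f"{label}: {frame_time_ms(fps):.1f} ms per frame")
```

The target leaves roughly 11 ms per frame, while the reported mobile baselines take on the order of a hundred to several hundred milliseconds, which is the gap the co-design is aimed at.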

[104] SILK: Smooth InterpoLation frameworK for motion in-betweening A Simplified Computational Approach

Elly Akhoundi,Hung Yu Ling,Anup Anand Deshmukh,Judith Butepage

Main category: cs.GR

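TL;DR: This paper proposes a simple Transformer-based framework for motion in-betweening and emphasizes the importance of data modeling choices for performance.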

Details

Motivation: Motion in-betweening is a key tool for animators, and existing machine learning methods rely on complex models; this work aims to simplify the model and explore the impact of data modeling. Method: Uses a framework with a single Transformer encoder and studies how data volume, pose representation, and velocity input features affect performance. Result: Experiments show that increasing data volume, choosing a suitable pose representation, and adding velocity features improve in-betweening quality, challenging the assumption that model complexity determines performance. Conclusion: Data modeling choices are critical to motion in-betweening performance, pointing to a more data-centric approach. Abstract: Motion in-betweening is a crucial tool for animators, enabling intricate control over pose-level details in each keyframe. Recent machine learning solutions for motion in-betweening rely on complex models, incorporating skeleton-aware architectures or requiring multiple modules and training steps. In this work, we introduce a simple yet effective Transformer-based framework, employing a single Transformer encoder to synthesize realistic motions for motion in-betweening tasks. We find that data modeling choices play a significant role in improving in-betweening performance. Among others, we show that increasing data volume can yield equivalent or improved motion transitions, that the choice of pose representation is vital for achieving high-quality results, and that incorporating velocity input features enhances animation performance. These findings challenge the assumption that model complexity is the primary determinant of animation quality and provide insights into a more data-centric approach to motion interpolation. Additional videos and supplementary material are available at https://silk-paper.github.io.
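
One simple way to realize the "velocity input features" mentioned above is to append a finite-difference velocity to each pose frame before feeding it to the encoder. The sketch below does this with a backward difference and zero velocity for the first frame. Both choices, the flat per-frame pose vector, and the function name `add_velocity_features` are assumptions for illustration; the abstract does not say how SILK computes or encodes velocities.

```python
def add_velocity_features(poses, fps=30.0):
    """Append per-frame velocities to a pose sequence.

    poses: list of frames, each a list of floats (e.g. flattened joint values).
    Returns a new list where each frame is pose values followed by velocities
    (units per second), using a backward difference between frames.
    """
    dt = 1.0 / fps
    augmented = []
    for i, frame in enumerate(poses):
        prev = poses[i - 1] if i > 0 else frame  # first frame: zero velocity
        velocity = [(cur - old) / dt for cur, old in zip(frame, prev)]
        augmented.append(list(frame) + velocity)
    return augmented


sequence = [[0.0, 0.0], [1.0, 2.0], [3.0, 2.0]]
for frame in add_velocity_features(sequence, fps=1.0):
    print(frame)
```

With two values per pose frame, each augmented frame has four values, so the model's input dimension doubles; whether velocities are given for all frames or only for keyframes is a design decision the abstract leaves open.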

[105] VideoMat: Extracting PBR Materials from Video Diffusion Models

Jacob Munkberg,Zian Wang,Ruofan Liang,Tianchang Shen,Jon Hasselgren

Main category: cs.GR

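TL;DR: Uses video diffusion models, intrinsic decomposition of videos, and physically-based differentiable rendering to generate high-quality materials for 3D models from a text prompt or a single image.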

Details

Motivation: Generate high-quality materials for 3D models from text or single-image input to make content creation more efficient. Method: (1) Finetune a video diffusion model to respect the input geometry and lighting condition; (2) extract material properties (base color, roughness, metallic) from the generated video; (3) use a differentiable path tracer to produce PBR materials compatible with common tools. Result: Produces multi-view-consistent materials for 3D models that can be used directly in content creation tools. Conclusion: By combining diffusion models with physically-based rendering, the method achieves efficient, high-quality 3D material generation. Abstract: We leverage finetuned video diffusion models, intrinsic decomposition of videos, and physically-based differentiable rendering to generate high quality materials for 3D models given a text prompt or a single image. We condition a video diffusion model to respect the input geometry and lighting condition. This model produces multiple views of a given 3D model with coherent material properties. Secondly, we use a recent model to extract intrinsics (base color, roughness, metallic) from the generated video. Finally, we use the intrinsics alongside the generated video in a differentiable path tracer to robustly extract PBR materials directly compatible with common content creation tools.

[106] TransGI: Real-Time Dynamic Global Illumination With Object-Centric Neural Transfer Model

Yijie Deng,Lei Han,Lu Fang

Main category: cs.GR

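TL;DR: TransGI is a novel neural rendering method for real-time, high-fidelity global illumination that achieves efficient rendering with an object-centric neural transfer model and a radiance-sharing lighting system.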

Details

Motivation: Address the limitations of existing neural rendering algorithms in real-time rendering under arbitrary lighting conditions, in particular the lack of a material representation that is both compact and expressive. Method: Proposes an object-centric neural transfer model (an MLP-based decoder with vertex-attached latent features) and local light probes combined with a radiance-sharing strategy. Result: Experiments show that TransGI renders a frame in under 10 ms with significantly better rendering quality than baseline methods. Conclusion: TransGI balances real-time performance and rendering quality, offering an effective solution for high-fidelity global illumination. Abstract: Neural rendering algorithms have revolutionized computer graphics, yet their impact on real-time rendering under arbitrary lighting conditions remains limited due to strict latency constraints in practical applications. The key challenge lies in formulating a compact yet expressive material representation. To address this, we propose TransGI, a novel neural rendering method for real-time, high-fidelity global illumination. It comprises an object-centric neural transfer model for material representation and a radiance-sharing lighting system for efficient illumination. Traditional BSDF representations and spatial neural material representations lack expressiveness, requiring thousands of ray evaluations to converge to noise-free colors. Conversely, real-time methods trade quality for efficiency by supporting only diffuse materials. In contrast, our object-centric neural transfer model achieves compactness and expressiveness through an MLP-based decoder and vertex-attached latent features, supporting glossy effects with low memory overhead. For dynamic, varying lighting conditions, we introduce local light probes capturing scene radiance, coupled with an across-probe radiance-sharing strategy for efficient probe generation. We implemented our method in a real-time rendering engine, combining compute shaders and CUDA-based neural networks. Experimental results demonstrate that our method achieves real-time performance of less than 10 ms to render a frame and significantly improved rendering quality compared to baseline methods.

[107] DGS-LRM: Real-Time Deformable 3D Gaussian Reconstruction From Monocular Videos

Chieh Hubert Lin,Zhaoyang Lv,Songyin Wu,Zhen Xu,Thu Nguyen-Phuoc,Hung-Yu Tseng,Julian Straub,Numair Khan,Lei Xiao,Ming-Hsuan Yang,Yuheng Ren,Richard Newcombe,Zhao Dong,Zhengqin Li

Main category: cs.GR

TL;DR: DGS-LRM is the first feed-forward method to predict deformable 3D Gaussian splats from monocular video for dynamic scene reconstruction.

Details

Motivation: Existing feed-forward models are mostly limited to static scenes and cannot reconstruct the motion of moving objects, so dynamic scene reconstruction remains an open challenge. Method: Introduces a large-scale synthetic dataset, a deformable 3D Gaussian representation, and a large transformer network that together enable real-time dynamic reconstruction. Result: DGS-LRM matches the reconstruction quality of optimization-based methods, outperforms existing predictive dynamic reconstruction methods, and supports long-range 3D tracking. Conclusion: DGS-LRM offers an efficient, high-quality solution for dynamic scene reconstruction. Abstract: We introduce the Deformable Gaussian Splats Large Reconstruction Model (DGS-LRM), the first feed-forward method predicting deformable 3D Gaussian splats from a monocular posed video of any dynamic scene. Feed-forward scene reconstruction has gained significant attention for its ability to rapidly create digital replicas of real-world environments. However, most existing models are limited to static scenes and fail to reconstruct the motion of moving objects. Developing a feed-forward model for dynamic scene reconstruction poses significant challenges, including the scarcity of training data and the need for appropriate 3D representations and training paradigms. To address these challenges, we introduce several key technical contributions: an enhanced large-scale synthetic dataset with ground-truth multi-view videos and dense 3D scene flow supervision; a per-pixel deformable 3D Gaussian representation that is easy to learn, supports high-quality dynamic view synthesis, and enables long-range 3D tracking; and a large transformer network that achieves real-time, generalizable dynamic scene reconstruction. Extensive qualitative and quantitative experiments demonstrate that DGS-LRM achieves dynamic scene reconstruction quality comparable to optimization-based methods, while significantly outperforming the state-of-the-art predictive dynamic reconstruction method on real-world examples. Its predicted physically grounded 3D deformation is accurate and can readily adapt for long-range 3D tracking tasks, achieving performance on par with state-of-the-art monocular video 3D tracking methods.

cs.CL

[108] LLM-as-a-qualitative-judge: automating error analysis in natural language generation

Nadezhda Chirkova,Tunde Oluwaseyi Ajayi,Seth Aycock,Zain Muhammad Mujahid,Vladana Perlić,Ekaterina Borisova,Markarit Vartampetian

Main category: cs.CL

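TL;DR: Proposes LLM-as-a-qualitative-judge, an LLM-based qualitative evaluation approach that produces a structured report of common issue types in an NLG system's outputs to help developers improve the system.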

Details

Motivation: LLM-based evaluation currently relies mainly on numerical scores, which say little about specific problems and give limited guidance for improving a system. Method: A two-step approach: (1) open-ended per-instance issue analysis; (2) clustering of the discovered issues with an intuitive cumulative algorithm. Result: Correctly recognizes instance-specific issues in about 2/3 of cases and produces error-type reports resembling those composed by human annotators. Conclusion: LLM-as-a-qualitative-judge can provide concrete suggestions for improving NLG systems; code and data are publicly available. Abstract: Prompting large language models (LLMs) to evaluate generated text, known as LLM-as-a-judge, has become a standard evaluation approach in natural language generation (NLG), but is primarily used as a quantitative tool, i.e. with numerical scores as main outputs. In this work, we propose LLM-as-a-qualitative-judge, an LLM-based evaluation approach with the main output being a structured report of common issue types in the NLG system outputs. Our approach is targeted at providing developers with meaningful insights on what improvements can be done to a given NLG system and consists of two main steps, namely open-ended per-instance issue analysis and clustering of the discovered issues using an intuitive cumulative algorithm. We also introduce a strategy for evaluating the proposed approach, coupled with ~300 annotations of issues in instances from 12 NLG datasets. Our results show that LLM-as-a-qualitative-judge correctly recognizes instance-specific issues in 2/3 cases and is capable of producing error type reports resembling the reports composed by human annotators. Our code and data are publicly available at https://github.com/tunde-ajayi/llm-as-a-qualitative-judge.
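
The cumulative clustering step can be pictured with a minimal sketch. It assumes, hypothetically, that each per-instance issue is a short text string and that a word-overlap score stands in for the LLM's decision about whether an issue belongs to an existing cluster; the names `IssueCluster`, `token_overlap`, and `cluster_cumulatively`, and the 0.3 threshold, are illustrative and not taken from the paper's code.

```python
from dataclasses import dataclass, field


@dataclass
class IssueCluster:
    label: str                                   # short name for the issue type
    members: list = field(default_factory=list)  # issues assigned to this type


def token_overlap(a: str, b: str) -> float:
    """Jaccard overlap of lowercase word sets (hypothetical stand-in for an LLM judgment)."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0


def cluster_cumulatively(issues, threshold=0.3):
    """Process issues one at a time: join the best-matching cluster or open a new one."""
    clusters = []
    for issue in issues:
        best = max(clusters, key=lambda c: token_overlap(issue, c.label), default=None)
        if best is not None and token_overlap(issue, best.label) >= threshold:
            best.members.append(issue)
        else:
            clusters.append(IssueCluster(label=issue, members=[issue]))
    # The report lists issue types ordered by how often they occur.
    return sorted(clusters, key=lambda c: len(c.members), reverse=True)


issues = [
    "answer is in the wrong language",
    "translation omits the second sentence",
    "answer is in the wrong language for the query",
    "translation omits part of the sentence",
    "answer is in the wrong language",
]
report = cluster_cumulatively(issues)
for cluster in report:
    print(len(cluster.members), cluster.label)
```

In an LLM-based version, the overlap function would presumably be replaced by a prompt asking whether a new issue matches any existing issue type, which is why the clusters are built incrementally rather than in a single batch.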

[109] PHRASED: Phrase Dictionary Biasing for Speech Translation

Peidong Wang,Jian Xue,Rui Zhao,Junkun Chen,Aswin Shanmugam Subramanian,Jinyu Li

Main category: cs.CL

TL;DR: Proposes a phrase dictionary biasing method that significantly improves phrase translation accuracy in speech translation.

Details

Motivation: Phrases occur rarely in training data, which makes translating them correctly in speech translation tasks challenging. Method: Proposes phrase dictionary biasing, which uses pairs of phrases mapped from the source language to the target language, and applies it to a streaming speech translation model and a multimodal large language model. Result: Phrase dictionary biasing outperforms phrase list biasing by 21% relative for the streaming speech translation model and yields an 85% relative improvement in phrase recall for the multimodal large language model. Conclusion: Phrase dictionary biasing effectively improves phrase translation accuracy, with especially strong gains for multimodal large language models. Abstract: Phrases are essential to understand the core concepts in conversations. However, due to their rare occurrence in training data, correct translation of phrases is challenging in speech translation tasks. In this paper, we propose a phrase dictionary biasing method to leverage pairs of phrases mapping from the source language to the target language. We apply the phrase dictionary biasing method to two types of widely adopted models, a transducer-based streaming speech translation model and a multimodal large language model. Experimental results show that the phrase dictionary biasing method outperforms phrase list biasing by 21% relatively for the streaming speech translation model. In addition, phrase dictionary biasing enables multimodal large language models to use external phrase information, achieving 85% relative improvement in phrase recall.

[110] A Technique for Isolating Lexically-Independent Phonetic Dependencies in Generative CNNs

Bruno Ferenc Šegedin

Main category: cs.CL

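TL;DR: Studies the lexically independent generalization of generative convolutional neural networks (CNNs) trained on audio waveforms, and proposes a new probing technique that bypasses the fully-connected (FC) layer and feeds randomized feature maps into the convolutional block.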

Details

Motivation: Explore whether deep neural networks (DNNs) can extract phonotactic regularities from lexical learning, and study how narrowing the fully-connected bottleneck affects generalization. Method: Train generative CNNs on raw audio waveforms, shrink the FC layer's channel count, and generate audio by bypassing the FC and inputting randomized feature maps. Result: The convolutional layers dynamically generalize phonetic dependencies beyond the lexically constrained configurations learned by the FC. Conclusion: Proposes a new technique and shows that the convolutional layers have lexically independent phonetic generalization ability. Abstract: The ability of deep neural networks (DNNs) to represent phonotactic generalizations derived from lexical learning remains an open question. This study (1) investigates the lexically-invariant generalization capacity of generative convolutional neural networks (CNNs) trained on raw audio waveforms of lexical items and (2) explores the consequences of shrinking the fully-connected layer (FC) bottleneck from 1024 channels to 8 before training. Ultimately, a novel technique for probing a model's lexically-independent generalizations is proposed that works only under the narrow FC bottleneck: generating audio outputs by bypassing the FC and inputting randomized feature maps into the convolutional block. These outputs are equally biased by a phonotactic restriction in training as are outputs generated with the FC. This result shows that the convolutional layers can dynamically generalize phonetic dependencies beyond lexically-constrained configurations learned by the FC.

[111] Extrapolation by Association: Length Generalization Transfer in Transformers

Ziyang Cai,Nayoung Lee,Avi Schwarzschild,Samet Oymak,Dimitris Papailiopoulos

Main category: cs.CL

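TL;DR: Finds that transformers can transfer length generalization across related tasks: training on an auxiliary task lets the model generalize to longer inputs on a target task.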

Details

Motivation: Explore how transformers achieve length generalization through task association, in order to understand where their generalization ability comes from. Method: Train models on diverse algorithmic tasks (arithmetic operations, string transformations, maze navigation) and study how length generalization transfers between them. Result: Models inherit generalization ability from related tasks, and pretrained language models show similar effects, suggesting that pretraining provides reusable computational scaffolding. Conclusion: The study shows how transformers generalize through compositional reuse across tasks and deepens our understanding of how they handle out-of-distribution inputs. Abstract: Transformer language models have demonstrated impressive generalization capabilities in natural language domains, yet we lack a fine-grained understanding of how such generalization arises. In this paper, we investigate length generalization (the ability to extrapolate from shorter to longer inputs) through the lens of task association. We find that length generalization can be transferred across related tasks. That is, training a model with a longer and related auxiliary task can lead it to generalize to unseen and longer inputs from some other target task. We demonstrate this length generalization transfer across diverse algorithmic tasks, including arithmetic operations, string transformations, and maze navigation. Our results show that transformer models can inherit generalization capabilities from similar tasks when trained jointly. Moreover, we observe similar transfer effects in pretrained language models, suggesting that pretraining equips models with reusable computational scaffolding that facilitates extrapolation in downstream settings. Finally, we provide initial mechanistic evidence that length generalization transfer correlates with the re-use of the same attention heads between the tasks. Together, our findings deepen our understanding of how transformers generalize to out-of-distribution inputs and highlight the compositional reuse of inductive structure across tasks.

[112] Self-Anchored Attention Model for Sample-Efficient Classification of Prosocial Text Chat

Zhuofang Li,Rafal Kocielnik,Fereshteh Soltani,Penphob Boonyarungsrit,Animashree Anandkumar,R. Michael Alvarez

Main category: cs.CL

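TL;DR: Proposes a new method (SAAM) for identifying and classifying prosocial behavior in game chat, improving on the best existing technique by 7.9% and proving effective in low-resource settings.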

Details

Motivation: Prior work has focused on detecting negative content in game chat; resources for identifying prosocial behavior are scarce, even though it can be as important as toxicity detection. Method: Combines unsupervised discovery with input from game domain experts and proposes the SAAM model, which uses the entire training set as "anchors" to improve performance. Result: SAAM performs well in low-resource settings and enables the first automated classification of prosocial behavior in game chat. Conclusion: The work suggests a way for online platforms to move from solely penalizing toxicity to encouraging positive interactions. Abstract: Millions of players engage daily in competitive online games, communicating through in-game chat. Prior research has focused on detecting relatively small volumes of toxic content using various Natural Language Processing (NLP) techniques for the purpose of moderation. However, recent studies emphasize the importance of detecting prosocial communication, which can be as crucial as identifying toxic interactions. Recognizing prosocial behavior allows for its analysis, rewarding, and promotion. Unlike toxicity, there are limited datasets, models, and resources for identifying prosocial behaviors in game-chat text. In this work, we employed unsupervised discovery combined with game domain expert collaboration to identify and categorize prosocial player behaviors from game chat. We further propose a novel Self-Anchored Attention Model (SAAM) which gives a 7.9% improvement compared to the best existing technique. The approach utilizes the entire training set as "anchors" to help improve model performance under the scarcity of training data. This approach led to the development of the first automated system for classifying prosocial behaviors in in-game chats, particularly given the low-resource settings where large-scale labeled data is not available. Our methodology was applied to one of the most popular online gaming titles, Call of Duty®: Modern Warfare® II, showcasing its effectiveness. This research is novel in applying NLP techniques to discover and classify prosocial behaviors in player in-game chat communication. It can help shift the focus of moderation from solely penalizing toxicity to actively encouraging positive interactions on online platforms.

[113] Did I Faithfully Say What I Thought? Bridging the Gap Between Neural Activity and Self-Explanations in Large Language Models

Milan Bhan,Jean-Noel Vittaut,Nicolas Chesneau,Sarath Chandar,Marie-Jeanne Lesot

Main category: cs.CL

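TL;DR: Proposes a new framework that quantitatively measures the faithfulness of LLM-generated self natural language explanations (self-NLE) by comparing them with interpretations of the model's internal hidden states.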

Details

Motivation: Existing methods mostly rely on behavioral tests or computational block identification and do not examine the neural activity underlying the model's reasoning, so they cannot accurately assess self-NLE faithfulness. Method: Introduces a flexible framework that directly compares self-NLE with interpretations of the model's internal hidden states to measure faithfulness quantitatively. Result: The framework gives deeper insight into self-NLE faithfulness and establishes a direct connection between self-NLE and model reasoning. Conclusion: The approach advances the study of self-NLE faithfulness and lays groundwork for generating more faithful self-NLE. Abstract: Large Language Models (LLMs) have demonstrated the capability of generating free-text self Natural Language Explanations (self-NLE) to justify their answers. Despite their logical appearance, self-NLE do not necessarily reflect the LLM's actual decision-making process, making such explanations unfaithful. While existing methods for measuring self-NLE faithfulness mostly rely on behavioral tests or computational block identification, none of them examines the neural activity underlying the model's reasoning. This work introduces a novel flexible framework for quantitatively measuring the faithfulness of LLM-generated self-NLE by directly comparing the latter with interpretations of the model's internal hidden states. The proposed framework is versatile and provides deep insights into self-NLE faithfulness by establishing a direct connection between self-NLE and model reasoning. This approach advances the understanding of self-NLE faithfulness and provides building blocks for generating more faithful self-NLE.

[114] $(RSA)^2$: A Rhetorical-Strategy-Aware Rational Speech Act Framework for Figurative Language Understanding

Cesare Spinoso-Di Piano,David Austin,Pablo Piantanida,Jackie Chi Kit Cheung

Main category: cs.CL

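TL;DR: Proposes the Rhetorical-Strategy-Aware RSA framework, $(RSA)^2$, which interprets figurative language without modeling speaker motivations and, combined with LLMs, achieves state-of-the-art performance on an irony interpretation task.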

Details

Motivation: Figurative language (e.g., irony, hyperbole, understatement) is ubiquitous in human communication, but existing RSA implementations either cannot account for it or require setting-specific modeling of speaker motivations. Method: Introduces the $(RSA)^2$ framework, which models figurative language use by considering the speaker's rhetorical strategy. Result: $(RSA)^2$ yields human-compatible interpretations of non-literal utterances without modeling speaker motivations and, combined with LLMs, achieves state-of-the-art performance on the ironic split of the new PragMega+ dataset. Conclusion: $(RSA)^2$ is an effective framework for interpreting figurative language and shows strong results on irony interpretation. Abstract: Figurative language (e.g., irony, hyperbole, understatement) is ubiquitous in human communication, resulting in utterances where the literal and the intended meanings do not match. The Rational Speech Act (RSA) framework, which explicitly models speaker intentions, is the most widespread theory of probabilistic pragmatics, but existing implementations are either unable to account for figurative expressions or require modeling the implicit motivations for using figurative language (e.g., to express joy or annoyance) in a setting-specific way. In this paper, we introduce the Rhetorical-Strategy-Aware RSA $(RSA)^2$ framework which models figurative language use by considering a speaker's employed rhetorical strategy. We show that $(RSA)^2$ enables human-compatible interpretations of non-literal utterances without modeling a speaker's motivations for being non-literal. Combined with LLMs, it achieves state-of-the-art performance on the ironic split of PragMega+, a new irony interpretation dataset introduced in this study.
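
For readers unfamiliar with RSA, the sketch below shows the standard (vanilla) recursion that $(RSA)^2$ builds on: a literal listener, a speaker who prefers informative utterances, and a pragmatic listener who inverts the speaker. It is not the paper's rhetorical-strategy extension. The two-utterance scalar-implicature lexicon, the uniform prior, and the rationality parameter `alpha` are illustrative assumptions, and utterance costs are taken to be zero.

```python
MEANINGS = ["some_not_all", "all"]
UTTERANCES = ["some", "all"]

# Literal truth values: "some" is true in both worlds, "all" only in the "all" world.
LEXICON = {
    ("some", "some_not_all"): 1.0,
    ("some", "all"): 1.0,
    ("all", "some_not_all"): 0.0,
    ("all", "all"): 1.0,
}
PRIOR = {m: 0.5 for m in MEANINGS}  # uniform prior over meanings (assumed)


def normalize(weights):
    """Scale non-negative weights so they sum to one."""
    total = sum(weights.values())
    return {k: (v / total if total else 0.0) for k, v in weights.items()}


def literal_listener(utterance):
    """L0(m | u): proportional to truth value times prior."""
    return normalize({m: LEXICON[(utterance, m)] * PRIOR[m] for m in MEANINGS})


def speaker(meaning, alpha=1.0):
    """S1(u | m): proportional to L0(m | u) ** alpha, with zero utterance cost."""
    return normalize({u: literal_listener(u)[meaning] ** alpha for u in UTTERANCES})


def pragmatic_listener(utterance, alpha=1.0):
    """L1(m | u): proportional to S1(u | m) times prior."""
    return normalize({m: speaker(m, alpha)[utterance] * PRIOR[m] for m in MEANINGS})


posterior = pragmatic_listener("some")
print(posterior)
```

With these assumptions the pragmatic listener hearing "some" assigns probability 0.75 to "some but not all", even though the literal listener is indifferent between the two meanings.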

[115] Alzheimer's Dementia Detection Using Perplexity from Paired Large Language Models

Yao Xiao,Heidi Christensen,Stefan Goetze

Main category: cs.CL

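TL;DR: Proposes a paired perplexity method based on the Mistral-7B large language model for detecting Alzheimer's dementia (AD); it is more accurate than existing methods and offers interpretability and potential further applications.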

Details

Motivation: Alzheimer's dementia (AD) is commonly accompanied by declining language ability, and existing detection methods fall short in accuracy and interpretability. Method: Extends the paired perplexity approach using the instruction-following Mistral-7B LLM, and prompts the fine-tuned models to analyze AD language patterns. Result: Accuracy improves by an average of 3.33% over the best current paired perplexity method and by 6.35% over the top-ranked method from the ADReSS 2020 challenge, with a clear and interpretable decision boundary. Conclusion: The method improves AD detection accuracy and opens up new possibilities for model interpretation and data augmentation. Abstract: Alzheimer's dementia (AD) is a neurodegenerative disorder with cognitive decline that commonly impacts language ability. This work extends the paired perplexity approach to detecting AD by using a recent large language model (LLM), the instruction-following version of Mistral-7B. We improve accuracy by an average of 3.33% over the best current paired perplexity method and by 6.35% over the top-ranked method from the ADReSS 2020 challenge benchmark. Our further analysis demonstrates that the proposed approach can effectively detect AD with a clear and interpretable decision boundary in contrast to other methods that suffer from opaque decision-making processes. Finally, by prompting the fine-tuned LLMs and comparing the model-generated responses to human responses, we illustrate that the LLMs have learned the special language patterns of AD speakers, which opens up possibilities for novel methods of model interpretation and data augmentation.
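
The idea behind paired perplexity can be sketched in a few lines. The sketch assumes the usual setup, in which one language model is adapted to AD speech and another to control speech, and a transcript is scored by comparing its perplexity under the two. The per-token log-probabilities below are made-up numbers, and the function names and the zero threshold are hypothetical rather than taken from the paper.

```python
import math


def perplexity(token_logprobs):
    """Perplexity = exp of the negative mean per-token natural-log probability."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def paired_perplexity_score(logprobs_control_lm, logprobs_ad_lm):
    """log PPL_control - log PPL_AD; positive means the AD-adapted model fits better."""
    return math.log(perplexity(logprobs_control_lm)) - math.log(perplexity(logprobs_ad_lm))


# Hypothetical log-probabilities for the same three-token transcript under each model.
under_control_lm = [-3.0, -2.5, -3.5]
under_ad_lm = [-2.0, -1.5, -2.5]

score = paired_perplexity_score(under_control_lm, under_ad_lm)
predicted_label = "AD" if score > 0.0 else "control"  # threshold of 0 is an assumption
print(round(score, 3), predicted_label)
```

Because the decision reduces to a single scalar compared against a threshold, the boundary is easy to inspect, which is consistent with the interpretability argument made in the abstract.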

[116] Towards Efficient and Effective Alignment of Large Language Models

Yuxin Jiang

Main category: cs.CL

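TL;DR: Proposes a series of new methods (Lion, WebR, LTE, BMC, and FollowBench) to improve the alignment of large language models (LLMs), spanning data collection, training, and evaluation.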

Details

Motivation: Although LLMs perform well across many tasks, aligning them efficiently and effectively with human expectations remains a key challenge. Method: (1) Lion, a framework that refines data collection through adversarial distillation; (2) WebR, which automatically synthesizes data from raw web documents; (3) LTE, a framework that uses meta-learning to improve knowledge updates; (4) BMC, which refines DPO to capture token-level correlations; (5) FollowBench, which evaluates how well models follow complex constraints. Result: The proposed methods yield notable improvements in zero-shot reasoning, data diversity, knowledge updating, and constraint following. Conclusion: The thesis addresses several key problems in LLM alignment with new methods and points to directions for future research. Abstract: Large language models (LLMs) exhibit remarkable capabilities across diverse tasks, yet aligning them efficiently and effectively with human expectations remains a critical challenge. This thesis advances LLM alignment by introducing novel methodologies in data collection, training, and evaluation. We first address alignment data collection. Existing approaches rely heavily on manually curated datasets or proprietary models. To overcome these limitations, we propose Lion, an adversarial distillation framework that iteratively refines training data by identifying and generating challenging instructions, enabling state-of-the-art zero-shot reasoning. Additionally, we introduce Web Reconstruction (WebR), a fully automated framework that synthesizes instruction-tuning data directly from raw web documents, significantly improving data diversity and scalability over existing synthetic data methods. Next, we enhance alignment training through novel optimization techniques. We develop Learning to Edit (LTE), a framework that enables LLMs to efficiently integrate new knowledge while preserving existing information. LTE leverages meta-learning to improve both real-time and batch knowledge updates. Furthermore, we introduce Bridging and Modeling Correlations (BMC), a refinement of Direct Preference Optimization (DPO) that explicitly captures token-level correlations in preference data, leading to superior alignment across QA and mathematical reasoning tasks. Finally, we tackle the challenge of evaluating alignment. Existing benchmarks emphasize response quality but overlook adherence to specific constraints. To bridge this gap, we introduce FollowBench, a multi-level, fine-grained benchmark assessing LLMs' ability to follow complex constraints across diverse instruction types. Our results expose key weaknesses in current models' constraint adherence, offering insights for future improvements.
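
As background for the DPO refinement mentioned in this entry, the sketch below computes the standard DPO loss for a single preference pair from sequence-level log-probabilities. This is the plain DPO objective, not BMC's token-level variant; the argument names and the default `beta` are illustrative assumptions.

```python
import math


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair.

    Each argument is the summed log-probability of a full response under the
    policy (logp_*) or the frozen reference model (ref_logp_*).
    """
    margin = beta * (
        (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    )
    # -log(sigmoid(margin)) written as log(1 + exp(-margin)).
    return math.log1p(math.exp(-margin))


# When the policy equals the reference, the margin is zero and the loss is log 2.
baseline = dpo_loss(-10.0, -12.0, -10.0, -12.0)
# Raising the chosen response's log-probability relative to the reference lowers the loss.
improved = dpo_loss(-8.0, -12.0, -10.0, -12.0)
print(round(baseline, 4), round(improved, 4))
```

The loss depends only on how far the policy has moved from the reference on each response, so a sequence-level formulation like this one has no view of which tokens drive the preference; that gap is what a token-level refinement would target.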

[117] Multi-Agent Language Models: Advancing Cooperation, Coordination, and Adaptation

Arjun Vaithilingam Sudhakar

Main category: cs.CL

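TL;DR: Investigates whether large language models (LLMs) possess theory-of-mind abilities, evaluating them on cooperative tasks within a multi-agent reinforcement learning (MARL) framework.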

Details

Motivation: Understanding whether LLMs can infer the intentions of others matters for human-AI collaboration. Method: Uses LLM-based agents that interact in natural language to run cooperation experiments within a MARL framework. Result: The study indicates that LLMs show some ability to reason about intentions in cooperative tasks. Conclusion: LLMs have potential theory-of-mind abilities, which suggests new directions for human-AI collaborative systems. Abstract: Modern Large Language Models (LLMs) exhibit impressive zero-shot and few-shot generalization capabilities across complex natural language tasks, enabling their widespread use as virtual assistants for diverse applications such as translation and summarization. Despite being trained solely on large corpora of text without explicit supervision on author intent, LLMs appear to infer the underlying meaning of textual interactions. This raises a fundamental question: can LLMs model and reason about the intentions of others, i.e., do they possess a form of theory of mind? Understanding others' intentions is crucial for effective collaboration, which underpins human societal success and is essential for cooperative interactions among multiple agents, including humans and autonomous systems. In this work, we investigate the theory of mind in LLMs through the lens of cooperative multi-agent reinforcement learning (MARL), where agents learn to collaborate via repeated interactions, mirroring human social reasoning. Our approach aims to enhance artificial agents' ability to adapt and cooperate with both artificial and human partners. By leveraging LLM-based agents capable of natural language interaction, we move towards creating hybrid human-AI systems that can foster seamless collaboration, with broad implications for the future of human-artificial interaction.

[118] RePO: Replay-Enhanced Policy Optimization

Siheng Li,Zhanhui Zhou,Wai Lam,Chao Yang,Chaochao Lu

Main category: cs.CL

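TL;DR: RePO uses diverse replay strategies to retrieve off-policy samples from a replay buffer, improving on GRPO's performance and data efficiency when optimizing large language models at a modest increase in computational cost.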

Details

Motivation: GRPO uses multiple on-policy outputs per prompt, which leads to high computational cost and low data efficiency, so a more efficient way to optimize large language models is needed. Method: Proposes RePO, which performs policy optimization with off-policy samples retrieved from a replay buffer, increasing sample diversity. Result: Across several mathematical reasoning benchmarks, RePO improves substantially over GRPO (for example, +18.4 points for Qwen2.5-Math-1.5B), with a 15% increase in computational cost reported for Qwen3-1.7B. Conclusion: RePO is an efficient and effective policy optimization method for large language models. Abstract: Reinforcement learning (RL) is vital for optimizing large language models (LLMs). Recent Group Relative Policy Optimization (GRPO) estimates advantages using multiple on-policy outputs per prompt, leading to high computational costs and low data efficiency. To address this, we introduce Replay-Enhanced Policy Optimization (RePO), which leverages diverse replay strategies to retrieve off-policy samples from a replay buffer, allowing policy optimization based on a broader and more diverse set of samples for each prompt. Experiments on five LLMs across seven mathematical reasoning benchmarks demonstrate that RePO achieves absolute average performance gains of 18.4 and 4.1 points for Qwen2.5-Math-1.5B and Qwen3-1.7B, respectively, compared to GRPO. Further analysis indicates that RePO increases computational cost by 15% while raising the number of effective optimization steps by 48% for Qwen3-1.7B, with both on-policy and off-policy sample numbers set to 8. The repository can be accessed at https://github.com/SihengLi99/RePO.
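
A rough sketch of the two ingredients named above: GRPO-style group-relative advantages, and a per-prompt replay buffer from which earlier samples are retrieved. It is a simplified illustration under stated assumptions, not the RePO implementation. The `ReplayBuffer` class, its "most recent" retrieval strategy, and the pooling of on-policy and replayed rewards into one group are hypothetical, and any off-policy correction the actual method applies is omitted.

```python
import statistics
from collections import defaultdict, deque


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward by the mean and standard deviation of its group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


class ReplayBuffer:
    """Keeps the most recent (output, reward) pairs for each prompt."""

    def __init__(self, capacity_per_prompt=32):
        self._store = defaultdict(lambda: deque(maxlen=capacity_per_prompt))

    def add(self, prompt, output, reward):
        self._store[prompt].append((output, reward))

    def retrieve_recent(self, prompt, k):
        """One possible replay strategy (assumed here): return the k newest samples."""
        if k <= 0:
            return []
        return list(self._store[prompt])[-k:]


buffer = ReplayBuffer()
prompt = "2 + 2 = ?"
buffer.add(prompt, "5", 0.0)  # samples generated by an earlier policy
buffer.add(prompt, "4", 1.0)

on_policy = [("4", 1.0), ("3", 0.0)]                # fresh samples from the current policy
off_policy = buffer.retrieve_recent(prompt, k=2)    # replayed samples
rewards = [reward for _, reward in on_policy + off_policy]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Reusing stored samples enlarges the group for each prompt without generating new outputs, which is the intuition behind trading a modest compute increase for more effective optimization steps.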

[119] Latent Multi-Head Attention for Small Language Models

Sushant Mehta,Raj Dandekar,Rajat Dandekar,Sreedath Panat

Main category: cs.CL

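TL;DR: Studies latent multi-head attention (MLA) in small language models and finds that MLA combined with rotary positional embeddings (RoPE) yields a Pareto improvement in memory and quality.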

Details

Motivation: Explore the efficiency-quality trade-offs of latent multi-head attention in small language models to support memory-constrained applications. Method: Train 30M-parameter GPT models and compare three architectures, standard multi-head attention (MHA), MLA, and MLA+RoPE, focusing on memory and quality metrics. Result: MLA+RoPE with a half-rank latent dimension cuts KV-cache memory by 45% with only a 0.3% increase in validation loss, and half-rank MLA runs 1.4 times faster at inference than full-rank MLA. Conclusion: MLA+RoPE performs well in small models and offers an efficient option for resource-constrained settings. Abstract: We present the first comprehensive study of latent multi-head attention (MLA) for small language models, revealing interesting efficiency-quality trade-offs. Training 30M-parameter GPT models on 100,000 synthetic stories, we benchmark three architectural variants: standard multi-head attention (MHA), MLA, and MLA with rotary positional embeddings (MLA+RoPE). Our key finding is that MLA+RoPE with half-rank latent dimensions (r = d/2) achieves a 45% KV-cache memory reduction while incurring only a 0.3% increase in validation loss (essentially matching MHA quality), a Pareto improvement for memory-constrained deployment. We further show that RoPE is crucial for MLA in small models: without it, MLA underperforms vanilla attention by 3-5%, but with RoPE, it surpasses vanilla by 2%. Inference benchmarks on NVIDIA A100 GPUs reveal that MLA with r=d/2 achieves a 1.4 times speedup over full-rank MLA while maintaining the memory savings. GPT-4 evaluations corroborate perplexity results, with ours achieving the highest quality scores (7.4/10) across grammar, creativity, and consistency metrics. Code and models will be released upon acceptance.

[120] OmniDRCA: Parallel Speech-Text Foundation Model via Dual-Resolution Speech Representations and Contrastive Alignment

Chao-Hong Tan,Qian Chen,Wen Wang,Chong Deng,Qinglin Zhang,Luyao Cheng,Hai Yu,Xin Zhang,Xiang Lv,Tianyu Zhao,Chong Zhang,Yukun Ma,Yafeng Chen,Hui Wang,Jiaqing Liu,Jieping Ye

Main category: cs.CL

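TL;DR: OmniDRCA is a parallel speech-text foundation model based on joint autoregressive modeling; using dual-resolution speech representations and contrastive cross-modal alignment, it processes speech and text in parallel and sets a new state of the art on spoken question answering among parallel models.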

Details

Motivation: Existing methods fall short in coordinating speech generation with text generation; OmniDRCA aims to improve mutual modality awareness through parallel modeling and contrastive alignment. Method: OmniDRCA uses joint autoregressive modeling with dual-resolution speech representations and contrastive cross-modal alignment, processing speech and text representations in parallel. Result: On spoken question answering benchmarks, OmniDRCA sets a new state of the art among parallel joint speech-text models and is competitive with interleaved models. Conclusion: OmniDRCA demonstrates the potential of parallel speech-text modeling and may extend to full-duplex conversational scenarios. Abstract: Recent studies on end-to-end speech generation with large language models (LLMs) have attracted significant community attention, with multiple works extending text-based LLMs to generate discrete speech tokens. Existing approaches primarily fall into two categories: (1) Methods that generate discrete speech tokens independently without incorporating them into the LLM's autoregressive process, resulting in text generation being unaware of concurrent speech synthesis. (2) Models that generate interleaved or parallel speech-text tokens through joint autoregressive modeling, enabling mutual modality awareness during generation. This paper presents OmniDRCA, a parallel speech-text foundation model based on joint autoregressive modeling, featuring dual-resolution speech representations and contrastive cross-modal alignment. Our approach processes speech and text representations in parallel while enhancing audio comprehension through contrastive alignment. Experimental results on Spoken Question Answering benchmarks demonstrate that OmniDRCA establishes new state-of-the-art (SOTA) performance among parallel joint speech-text modeling based foundation models, and achieves competitive performance compared to interleaved models. Additionally, we explore the potential of extending the framework to full-duplex conversational scenarios.

[121] DIVE into MoE: Diversity-Enhanced Reconstruction of Large Language Models from Dense into Mixture-of-Experts

Yuchen Feng,Bowen Shen,Naibin Gu,Jiaxuan Zhao,Peng Fu,Zheng Lin,Weiping Wang

Main category: cs.CL

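TL;DR: DIVE is a diversity-enhanced reconstruction method that prunes and reassembles FFN modules to build MoE LLMs from dense models efficiently, reducing expert redundancy and improving training efficiency.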

Details

Motivation: Existing methods for reconstructing MoE LLMs overlook diversity among experts, which leads to redundancy; DIVE aims to address this. Method: DIVE comprises domain affinity mining, pruning-based expert reconstruction, and efficient retraining; the reconstruction prunes and reassembles the FFN modules. Result: Experiments show that DIVE outperforms existing methods at the same number of activated parameters, training efficiently with minimal accuracy loss. Conclusion: By increasing expert diversity, DIVE improves the training efficiency and performance of MoE LLMs. Abstract: Large language models (LLMs) with the Mixture-of-Experts (MoE) architecture achieve high cost-efficiency by selectively activating a subset of the parameters. Despite the inference efficiency of MoE LLMs, the training of extensive experts from scratch incurs substantial overhead, whereas reconstructing a dense LLM into an MoE LLM significantly reduces the training budget. However, existing reconstruction methods often overlook the diversity among experts, leading to potential redundancy. In this paper, we observe that a specific LLM exhibits notable diversity after being pruned on different calibration datasets, based on which we present a Diversity-Enhanced reconstruction method named DIVE. The recipe of DIVE includes domain affinity mining, pruning-based expert reconstruction, and efficient retraining. Specifically, the reconstruction includes pruning and reassembly of the feed-forward network (FFN) module. After reconstruction, we efficiently retrain the model on routers, experts and normalization modules. We implement DIVE on Llama-style LLMs with open-source training corpora. Experiments show that DIVE achieves training efficiency with minimal accuracy trade-offs, outperforming existing pruning and MoE reconstruction methods with the same number of activated parameters.

[122] Taming SQL Complexity: LLM-Based Equivalence Evaluation for Text-to-SQL

Qingyun Zeng,Simin Ma,Arash Niknafs,Ashish Basran,Carol Szabo

Main category: cs.CL

TL;DR: Explores using large language models (LLMs) to evaluate the semantic equivalence of generated SQL, analyzing common patterns and challenges.

Details

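Motivation: Evaluating the semantic equivalence of generated SQL is challenging, especially with ambiguous user queries and multiple valid interpretations. Method: Uses LLMs to assess both semantic and "weak" semantic equivalence, and analyzes patterns of SQL equivalence and inequivalence. Result: Summarizes common patterns of SQL equivalence and the challenges of LLM-based evaluation. Conclusion: LLMs show promise for evaluating the semantic equivalence of SQL, but practical challenges remain. Abstract: The rise of Large Language Models (LLMs) has significantly advanced Text-to-SQL (NL2SQL) systems, yet evaluating the semantic equivalence of generated SQL remains a challenge, especially given ambiguous user queries and multiple valid SQL interpretations. This paper explores using LLMs to assess both semantic and a more practical "weak" semantic equivalence. We analyze common patterns of SQL equivalence and inequivalence, and discuss challenges in LLM-based evaluation.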
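
For contrast with LLM-based judging, a common execution-based baseline compares two queries by running them on a concrete database. The sketch below does this with Python's built-in `sqlite3`. It is not the paper's method; the `orders` table, the three queries, and the helper name `same_result_on` are made up for illustration.

```python
import sqlite3
from collections import Counter


def same_result_on(conn, sql_a, sql_b):
    """Compare two queries as multisets of rows on one database (row order is ignored)."""
    rows_a = Counter(conn.execute(sql_a).fetchall())
    rows_b = Counter(conn.execute(sql_b).fetchall())
    return rows_a == rows_b


conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, total REAL);
    INSERT INTO orders VALUES (1, 'ana', 20.0), (2, 'bo', 5.0), (3, 'ana', 7.5);
    """
)

q1 = "SELECT customer FROM orders WHERE total > 6"
q2 = "SELECT customer FROM orders WHERE NOT total <= 6"      # rewritten predicate
q3 = "SELECT DISTINCT customer FROM orders WHERE total > 6"  # drops duplicate rows

print(same_result_on(conn, q1, q2))
print(same_result_on(conn, q1, q3))
```

Agreement on one database instance does not prove that two queries are equivalent in general (for example, `q1` and `q3` would agree on data without repeated customers), which is one reason a judgment-based notion of equivalence is attractive.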

[123] COGENT: A Curriculum-oriented Framework for Generating Grade-appropriate Educational Content

Zhengyuan Liu,Stella Xin Yin,Dion Hoe-Lian Goh,Nancy F. Chen

Main category: cs.CL

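TL;DR: COGENT is a curriculum-oriented framework for generating grade-appropriate educational content, addressing challenges of generative AI in education such as curriculum alignment and reading-level control.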

Details

Motivation: Generative AI faces challenges in education, such as failing to align with curriculum standards and grade-level reading requirements, especially in STEM education. Method: COGENT incorporates science concepts, core ideas, and learning objectives, controls readability through text length, vocabulary, and sentence complexity, and adopts a "wonder-based" approach to increase student interest. Result: Experiments show that COGENT produces grade-appropriate passages that are comparable or superior to human references. Conclusion: COGENT offers a viable approach for scaling high-quality adaptive learning resources. Abstract: While Generative AI has demonstrated strong potential and versatility in content generation, its application to educational contexts presents several challenges. Models often fail to align with curriculum standards and maintain grade-appropriate reading levels consistently. Furthermore, STEM education poses additional challenges in balancing scientific explanations with everyday language when introducing complex and abstract ideas and phenomena to younger students. In this work, we propose COGENT, a curriculum-oriented framework for generating grade-appropriate educational content. We incorporate three curriculum components (science concepts, core ideas, and learning objectives), control readability through length, vocabulary, and sentence complexity, and adopt a "wonder-based" approach to increase student engagement and interest. We conduct a multi-dimensional evaluation via both LLM-as-a-judge and human expert analysis. Experimental results show that COGENT consistently produces grade-appropriate passages that are comparable or superior to human references. Our work establishes a viable approach for scaling adaptive and high-quality learning resources.

[124] CoLMbo: Speaker Language Model for Descriptive Profiling

Massa Baali,Shuo Han,Syed Abdul Hannan,Purusottam Samal,Karanveer Singh,Soham Deshmukh,Rita Singh,Bhiksha Raj

Main category: cs.CL

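TL;DR: CoLMbo is a new Speaker Language Model (SLM) that combines a speaker encoder with prompt-based conditioning to generate detailed speaker descriptions, addressing the limits of traditional speaker recognition systems in extracting demographic attributes.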

Details

Motivation: Traditional speaker recognition systems handle only classification tasks and cannot generate detailed speaker characteristics or context-rich descriptions, performing poorly in particular at extracting demographic attributes such as dialect, gender, and age. Method: CoLMbo integrates a speaker encoder with prompt-based conditioning and uses user-defined prompts to adapt dynamically to new speaker characteristics, producing customized descriptions that include dialect variations and age-related traits. Result: CoLMbo enhances traditional speaker profiling and performs well in zero-shot scenarios across diverse datasets. Conclusion: CoLMbo marks significant progress in speaker recognition and offers a new way to generate detailed speaker descriptions. Abstract: Speaker recognition systems are often limited to classification tasks and struggle to generate detailed speaker characteristics or provide context-rich descriptions. These models primarily extract embeddings for speaker identification but fail to capture demographic attributes such as dialect, gender, and age in a structured manner. This paper introduces CoLMbo, a Speaker Language Model (SLM) that addresses these limitations by integrating a speaker encoder with prompt-based conditioning. This allows for the creation of detailed captions based on speaker embeddings. CoLMbo utilizes user-defined prompts to adapt dynamically to new speaker characteristics and provides customized descriptions, including regional dialect variations and age-related traits. This innovative approach not only enhances traditional speaker profiling but also excels in zero-shot scenarios across diverse datasets, marking a significant advancement in the field of speaker recognition.

[125] Binary classification for perceived quality of headlines and links on worldwide news websites, 2018-2024

Austin McCutcheon,Thiago E. A. de Oliveira,Aleksandr Zheleznov,Chris Brogly

Main category: cs.CL

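TL;DR: Investigates how to automatically distinguish perceived lower-quality from higher-quality news headlines/links, evaluating twelve machine learning models; both traditional ensemble methods and fine-tuned DistilBERT perform well, though the latter takes longer to train.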

Details

Motivation: The proliferation of online news enables the wide spread of low-quality news headlines/links; the study aims to distinguish their quality automatically. Method: Uses a balanced dataset of 57,544,214 news headlines/links, extracts 115 linguistic features, and evaluates twelve machine learning models, including traditional ensemble methods and deep learning models. Result: Traditional ensemble methods such as the bagging classifier perform well (88.1% accuracy); fine-tuned DistilBERT achieves the highest accuracy (90.3%) but takes longer to train. Conclusion: Both NLP features with traditional classifiers and deep learning models can effectively distinguish news quality, with a trade-off between predictive performance and training time. Abstract: The proliferation of online news enables potential widespread publication of perceived low-quality news headlines/links. As a result, we investigated whether it was possible to automatically distinguish perceived lower-quality news headlines/links from perceived higher-quality headlines/links. We evaluated twelve machine learning models on a binary, balanced dataset of 57,544,214 worldwide news website links/headings from 2018-2024 (28,772,107 per class) with 115 extracted linguistic features. Binary labels for each text were derived from scores based on expert consensus regarding the respective news domain quality. Traditional ensemble methods, particularly the bagging classifier, had strong performance (88.1% accuracy, 88.3% F1, 80/20 train/test split). Fine-tuned DistilBERT achieved the highest accuracy (90.3%, 80/20 train/test split) but required more training time. The results suggest that both NLP features with traditional classifiers and deep learning models can effectively differentiate perceived news headline/link quality, with some trade-off between predictive performance and train time.

[126] Comparing human and LLM politeness strategies in free production

Haoran Zhao,Robert D. Hawkins

Main category: cs.CL

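TL;DR: Large language models (LLMs) differ from humans in their use of polite language: larger models reproduce key preferences from computational pragmatics, but their over-reliance on negative politeness strategies may lead to misinterpretation.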

Details

Motivation: Investigate whether LLMs can use polite language as flexibly as humans across contexts, balancing informational and social goals. Method: Compare human and LLM responses in constrained and open-ended production tasks and analyze the politeness strategies used. Result: Larger models (≥70B parameters) reproduce preferences from computational pragmatics but over-rely on negative politeness strategies in positive contexts; human evaluators prefer LLM responses in open-ended contexts. Conclusion: LLMs handle politeness strategies well, but subtle differences from human usage raise questions about pragmatic alignment in AI systems. Abstract: Polite speech poses a fundamental alignment challenge for large language models (LLMs). Humans deploy a rich repertoire of linguistic strategies to balance informational and social goals -- from positive approaches that build rapport (compliments, expressions of interest) to negative strategies that minimize imposition (hedging, indirectness). We investigate whether LLMs employ a similarly context-sensitive repertoire by comparing human and LLM responses in both constrained and open-ended production tasks. We find that larger models ($\ge$70B parameters) successfully replicate key preferences from the computational pragmatics literature, and human evaluators surprisingly prefer LLM-generated responses in open-ended contexts. However, further linguistic analyses reveal that models disproportionately rely on negative politeness strategies even in positive contexts, potentially leading to misinterpretations. While modern LLMs demonstrate an impressive handle on politeness strategies, these subtle differences raise important questions about pragmatic alignment in AI systems.

[127] A Hierarchical Probabilistic Framework for Incremental Knowledge Tracing in Classroom Settings

Xinyi Gao,Qiucheng Wu,Yang Zhang,Xuechen Liu,Kaizhi Qian,Ying Xu,Shiyu Chang

Main category: cs.CL

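TL;DR: The paper proposes a knowledge-tree-based probabilistic knowledge tracing framework (KT$^2$) that uses hierarchical knowledge concepts to improve student performance prediction under low-resource conditions.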

Details

Motivation: Knowledge tracing in real classrooms often faces sparse data and the need for online updates, which existing methods handle poorly. Method: A Hidden Markov Tree Model represents hierarchical knowledge concepts, an EM algorithm estimates student mastery, and incremental updates are supported. Result: Experiments show KT$^2$ outperforms baselines in low-resource online settings. Conclusion: By exploiting hierarchical knowledge concepts, KT$^2$ effectively addresses low-resource knowledge tracing. Abstract: Knowledge tracing (KT) aims to estimate a student's evolving knowledge state and predict their performance on new exercises based on performance history. Many realistic classroom settings for KT are typically low-resource in data and require online updates as students' exercise history grows, which creates significant challenges for existing KT approaches. To restore strong performance under low-resource conditions, we revisit the hierarchical knowledge concept (KC) information, which is typically available in many classroom settings and can provide strong prior when data are sparse. We therefore propose Knowledge-Tree-based Knowledge Tracing (KT$^2$), a probabilistic KT framework that models student understanding over a tree-structured hierarchy of knowledge concepts using a Hidden Markov Tree Model. KT$^2$ estimates student mastery via an EM algorithm and supports personalized prediction through an incremental update mechanism as new responses arrive. Our experiments show that KT$^2$ consistently outperforms strong baselines in realistic online, low-resource settings.
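To make the idea of an incremental mastery update concrete, here is a minimal sketch for a single knowledge concept. It is a classic Bayesian-knowledge-tracing style update, not the Hidden Markov Tree Model or EM procedure of KT$^2$; the function name and the `slip`, `guess`, and `learn` parameters are illustrative assumptions.

```python
def update_mastery(prior, correct, slip=0.1, guess=0.2, learn=0.1):
    """Update P(mastered) for one concept after one observed response.

    Assumed (hypothetical) parameters:
      slip  - P(wrong answer | mastered)
      guess - P(right answer | not mastered)
      learn - P(not mastered -> mastered) after the exercise
    """
    if correct:
        evidence_mastered = prior * (1.0 - slip)
        evidence_unmastered = (1.0 - prior) * guess
    else:
        evidence_mastered = prior * slip
        evidence_unmastered = (1.0 - prior) * (1.0 - guess)
    # Bayes rule: posterior probability of mastery given the response.
    posterior = evidence_mastered / (evidence_mastered + evidence_unmastered)
    # Hidden-state transition: the student may learn from the exercise.
    return posterior + (1.0 - posterior) * learn


# Responses arrive one at a time, so the estimate is updated online.
mastery = 0.3
for response in [True, True, False]:
    mastery = update_mastery(mastery, response)
```

A tree-structured model would additionally couple each concept's state to its parent concept, which is what lets sparse data on one concept borrow strength from related ones.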

[128] Token Constraint Decoding Improves Robustness on Question Answering for Large Language Models

Jui-Ming Yao,Hao-Yuan Chen,Zi-Xian Tang,Bing-Jia Tan,Sheng-Wei Peng,Bing-Cheng Xie,Shun-Feng Su

Main category: cs.CL

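TL;DR: The paper proposes Token Constraint Decoding (TCD), an inference-time algorithm that improves the robustness of large language models in noisy settings by enforcing consistency among token-level predictions. Experiments show that TCD substantially restores performance degraded by input noise.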

Details

Motivation: LLMs perform well on multiple-choice question answering but are highly sensitive to input perturbations, so a method for improving robustness is needed. Method: The TCD algorithm enforces alignment among token-level predictions to improve robustness, combined with prompt engineering fixes. Result: Across several datasets, TCD substantially improves performance, especially for weaker models, with gains of up to 39%. Conclusion: TCD is a practical, model-agnostic method for improving the reasoning stability of language models under real-world imperfections. Abstract: Large Language Models (LLMs) have demonstrated impressive performance on multiple-choice question answering (MCQA) benchmarks, yet they remain highly vulnerable to minor input perturbations. In this paper, we introduce and evaluate Token Constraint Decoding (TCD). This simple yet effective inference-time algorithm enforces alignment between token-level predictions to enhance robustness in noisy settings. Through extensive experiments on CommonsenseQA, MMLU, and MMLU-Pro, we show that TCD, especially when paired with prompt engineering (PE) fixes, significantly restores performance degraded by input noise, yielding up to +39% absolute gains for weaker models like Gemma3 1B. Penalty sweep analyses further reveal that TCD implicitly regularizes overconfident outputs, with different models requiring distinct penalty schedules to maximize resilience. Our findings establish TCD as a practical, model-agnostic approach for improving reasoning stability under real-world imperfections and pave the way for more reliable deployment of LLMs in safety-critical or user-facing applications.
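The abstract does not spell out the decoding rule, so the following is only one plausible reading of penalty-based constrained decoding for multiple-choice QA: tokens outside the set of valid answer labels have a penalty subtracted from their logits before the argmax. The function name, the toy logits, and the penalty values are assumptions for illustration.

```python
def constrained_choice(logits, allowed, penalty):
    """Pick the highest-scoring token after penalizing disallowed tokens.

    logits  - dict mapping token string to raw score (toy stand-in for a model)
    allowed - set of tokens that are valid answers, e.g. option labels
    penalty - amount subtracted from every token outside `allowed`
    """
    adjusted = {
        token: (score if token in allowed else score - penalty)
        for token, score in logits.items()
    }
    return max(adjusted, key=adjusted.get)


# Toy next-token scores: noise in the prompt pushes the model toward "The".
logits = {"A": 1.2, "B": 2.0, "C": 0.3, "D": 0.1, "The": 2.6}
allowed = {"A", "B", "C", "D"}

unconstrained = constrained_choice(logits, allowed, penalty=0.0)
constrained = constrained_choice(logits, allowed, penalty=5.0)
```

Sweeping `penalty` between these two extremes is a simple way to picture the penalty schedules mentioned in the abstract: a small penalty leaves the model's preference intact, a large one forces a valid option label.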

[129] PGDA-KGQA: A Prompt-Guided Generative Framework with Multiple Data Augmentation Strategies for Knowledge Graph Question Answering

Xiujun Zhou,Pingjian Zhang,Deyou Tang

Main category: cs.CL

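TL;DR: PGDA-KGQA is a prompt-guided generative framework that uses multiple data augmentation strategies to address data scarcity and multi-hop reasoning in KGQA, yielding clear performance gains.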

Details

Motivation: Address data scarcity in KGQA, the shortage of multi-hop reasoning samples, and the semantic distortion introduced by traditional data augmentation. Method: A prompt-guided generative framework combines single-hop pseudo-question generation, semantic-preserving question rewriting, and answer-guided reverse path exploration to diversify the training data. Result: On WebQSP and ComplexWebQuestions, F1, Hits@1, and Accuracy improve by 2.8%, 1.2%, 3.1% and 1.8%, 1.1%, 2.4%, respectively. Conclusion: Through data augmentation, PGDA-KGQA substantially improves KGQA performance, particularly in multi-hop reasoning and semantic alignment. Abstract: Knowledge Graph Question Answering (KGQA) is a crucial task in natural language processing that requires reasoning over knowledge graphs (KGs) to answer natural language questions. Recent methods utilizing large language models (LLMs) have shown remarkable semantic parsing capabilities but are limited by the scarcity of diverse annotated data and multi-hop reasoning samples. Traditional data augmentation approaches focus mainly on single-hop questions and are prone to semantic distortion, while LLM-based methods primarily address semantic distortion but usually neglect multi-hop reasoning, thus limiting data diversity. The scarcity of multi-hop samples further weakens models' generalization. To address these issues, we propose PGDA-KGQA, a prompt-guided generative framework with multiple data augmentation strategies for KGQA. At its core, PGDA-KGQA employs a unified prompt-design paradigm: by crafting meticulously engineered prompts that integrate the provided textual content, it leverages LLMs to generate large-scale (question, logical form) pairs for model training. Specifically, PGDA-KGQA enriches its training set by: (1) generating single-hop pseudo questions to improve the alignment of question semantics with KG relations; (2) applying semantic-preserving question rewriting to improve robustness against linguistic variations; (3) employing answer-guided reverse path exploration to create realistic multi-hop questions. By adopting an augment-generate-retrieve semantic parsing pipeline, PGDA-KGQA utilizes the augmented data to enhance the accuracy of logical form generation and thus improve answer retrieval performance. Experiments demonstrate that PGDA-KGQA outperforms state-of-the-art methods on standard KGQA datasets, achieving improvements on WebQSP by 2.8%, 1.2%, and 3.1% and on ComplexWebQuestions by 1.8%, 1.1%, and 2.4% in F1, Hits@1, and Accuracy, respectively.

[130] Hidden in Plain Sight: Evaluation of the Deception Detection Capabilities of LLMs in Multimodal Settings

Md Messal Monem Miah,Adrita Anika,Xi Shi,Ruihong Huang

Main category: cs.CL

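TL;DR: The study evaluates the automated deception detection capabilities of Large Language Models (LLMs) and Large Multimodal Models (LMMs) across several domains, finding that fine-tuned LLMs excel at textual deception detection while LMMs struggle to exploit multimodal cues.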

Details

Motivation: Detecting deception in a digital world is both critical and challenging; the study aims to evaluate how LLMs and LMMs perform on this task. Method: The study uses three datasets (RLTD, MU3D, OpSpam), compares zero-shot and few-shot approaches, and analyzes the impact of auxiliary features and prompting strategies. Result: Fine-tuned LLMs achieve state-of-the-art performance on textual deception detection, while LMMs fail to fully exploit multimodal cues. Conclusion: The study reveals the potential and limitations of LLMs in deception detection and offers insights for real-world applications. Abstract: Detecting deception in an increasingly digital world is both a critical and challenging task. In this study, we present a comprehensive evaluation of the automated deception detection capabilities of Large Language Models (LLMs) and Large Multimodal Models (LMMs) across diverse domains. We assess the performance of both open-source and commercial LLMs on three distinct datasets: real life trial interviews (RLTD), instructed deception in interpersonal scenarios (MU3D), and deceptive reviews (OpSpam). We systematically analyze the effectiveness of different experimental setups for deception detection, including zero-shot and few-shot approaches with random or similarity-based in-context example selection. Our results show that fine-tuned LLMs achieve state-of-the-art performance on textual deception detection tasks, while LMMs struggle to fully leverage cross-modal cues. Additionally, we analyze the impact of auxiliary features, such as non-verbal gestures and video summaries, and examine the effectiveness of different prompting strategies, including direct label generation and chain-of-thought reasoning. Our findings provide key insights into how LLMs process and interpret deceptive cues across modalities, highlighting their potential and limitations in real-world deception detection applications.

[131] Improved Supervised Fine-Tuning for Large Language Models to Mitigate Catastrophic Forgetting

Fei Ding,Baiqiao Wang

Main category: cs.CL

TL;DR: Proposes a new SFT method that reduces the risk of catastrophic forgetting while improving task performance.

Details

Motivation: Address the loss of general capabilities and the catastrophic forgetting caused by SFT. Method: Reconstruct the base model's instruction distribution, screen data with multiple models, and mix the selected data with new data for SFT. Result: Experiments show the method preserves general capabilities while improving task performance. Conclusion: The new method is effective and low-cost, making it suitable for third-party practitioners. Abstract: Supervised Fine-Tuning (SFT), while enhancing large language models' (LLMs') instruction-following capabilities and domain-specific task adaptability, often diminishes their general capabilities. Moreover, due to the inaccessibility of original pre-training data, catastrophic forgetting tends to be exacerbated when third-party practitioners implement SFT on open-sourced models. To address this challenge, we propose a novel, more cost-effective SFT method which could effectively reduce the risk of catastrophic forgetting without access to original SFT data. Our approach begins by reconstructing the likely SFT instruction distribution of the base model, followed by a multi-model screening process to select optimal data, which is then mixed with new data for SFT. Experimental results demonstrate that our method preserves generalization capabilities in general domains while improving task-specific performance.

[132] GigaChat Family: Efficient Russian Language Modeling Through Mixture of Experts Architecture

GigaChat team,Mamedov Valentin,Evgenii Kosarev,Gregory Leleytner,Ilya Shchuckin,Valeriy Berezovskiy,Daniil Smirnov,Dmitry Kozlov,Sergei Averkiev,Lukyanenko Ivan,Aleksandr Proshunin,Ainur Israfilova,Ivan Baskov,Artem Chervyakov,Emil Shakirov,Mikhail Kolesov,Daria Khomich,Darya Latortseva,Sergei Porkhun,Yury Fedorov,Oleg Kutuzov,Polina Kudriavtseva,Sofiia Soldatova,Kolodin Egor,Stanislav Pyatkin,Dzmitry Menshykh,Grafov Sergei,Eldar Damirov,Karlov Vladimir,Ruslan Gaitukiev,Arkadiy Shatenov,Alena Fenogenova,Nikita Savushkin,Fedor Minkin

Main category: cs.CL

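TL;DR: The paper introduces the GigaChat family of Russian large language models, including base and instruction-tuned versions, describes the architecture, pre-training process, and experimental design in detail, and evaluates performance on Russian and English benchmarks against multilingual models.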

Details

Motivation: Because developing Russian foundation models demands substantial resources, the work aims to fill this gap and advance Russian NLP research and industrial applications. Method: Model architectures of various sizes are pre-trained and instruction-tuned, with experiments guiding design choices, and performance is evaluated on multilingual benchmarks. Result: GigaChat performs strongly on Russian and English tasks; a system demonstration is available via an API, a Telegram bot, and a Web interface, and three models are open-sourced. Conclusion: The GigaChat family provides an important resource for Russian NLP, supporting research and industrial applications, with room for future extension. Abstract: Generative large language models (LLMs) have become crucial for modern NLP research and applications across various languages. However, the development of foundational models specifically tailored to the Russian language has been limited, primarily due to the significant computational resources required. This paper introduces the GigaChat family of Russian LLMs, available in various sizes, including base models and instruction-tuned versions. We provide a detailed report on the model architecture, pre-training process, and experiments to guide design choices. In addition, we evaluate their performance on Russian and English benchmarks and compare GigaChat with multilingual analogs. The paper presents a system demonstration of the top-performing models accessible via an API, a Telegram bot, and a Web interface. Furthermore, we have released three open GigaChat models in open-source (https://huggingface.co/ai-sage), aiming to expand NLP research opportunities and support the development of industrial solutions for the Russian language.

[133] UniToMBench: Integrating Perspective-Taking to Improve Theory of Mind in LLMs

Prameshwar Thiyagarajan,Vaishnavi Parimi,Shamant Sai,Soumil Garg,Zhangir Meirbek,Nitin Yarlagadda,Kevin Zhu,Chris Kim

Main category: cs.CL

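TL;DR: UniToMBench is a unified benchmark that combines the strengths of SimToM and TOMBENCH, using multi-interaction task designs and evolving story scenarios to systematically assess and improve the theory-of-mind capabilities of LLMs.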

Details

Motivation: Theory of Mind (ToM) remains challenging for large language models (LLMs), which predict human mental states poorly, so a more comprehensive evaluation tool is needed. Method: UniToMBench combines perspective-taking techniques with diverse evaluation metrics, building multi-interaction tasks and evolving story scenarios on a dataset of over 1,000 hand-written scenarios. Result: GPT-4o and GPT-4o Mini perform well on emotion- and belief-related tasks (accuracy usually above 80%) but are inconsistent on knowledge-based tasks. Conclusion: UniToMBench exposes the strengths and limitations of current LLMs on ToM tasks and provides a comprehensive evaluation tool for future research. Abstract: Theory of Mind (ToM), the ability to understand the mental states of oneself and others, remains a challenging area for large language models (LLMs), which often fail to predict human mental states accurately. In this paper, we introduce UniToMBench, a unified benchmark that integrates the strengths of SimToM and TOMBENCH to systematically improve and assess ToM capabilities in LLMs by integrating multi-interaction task designs and evolving story scenarios. Supported by a custom dataset of over 1,000 hand-written scenarios, UniToMBench combines perspective-taking techniques with diverse evaluation metrics to better stimulate social cognition in LLMs. Through evaluation, we observe that while models like GPT-4o and GPT-4o Mini show consistently high accuracy in tasks involving emotional and belief-related scenarios, with results usually above 80%, there is significant variability in their performance across knowledge-based tasks. These results highlight both the strengths and limitations of current LLMs in ToM-related tasks, underscoring the value of UniToMBench as a comprehensive tool for future development. Our code is publicly available here: https://github.com/Shamant/unifiedtombenchmark.

[134] Towards Bridging the Reward-Generation Gap in Direct Alignment Algorithms

Zeguan Xiao,Yun Chen,Guanhua Chen

Main category: cs.CL

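TL;DR: The paper proposes POET, which truncates preferred and dispreferred responses to the same length to address the reward-generation gap in Direct Alignment Algorithms (DAAs), substantially improving model performance.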

Details

Motivation: Direct alignment algorithms such as DPO and SimPO suffer from a mismatch between their training objective and generation performance at inference (the reward-generation gap), which the paper sets out to address. Method: Prefix-Oriented Equal-length Training (POET) truncates both responses to the same length, so that optimizing the DAA objective pays more attention to prefix tokens. Result: Experiments show POET improves over the standard implementations of DPO and SimPO, with gains of up to 15.6 points on AlpacaEval 2 and overall improvements on downstream tasks. Conclusion: POET effectively addresses the mismatch between reward optimization and generation performance and points to a direction for improving DAAs. Abstract: Direct Alignment Algorithms (DAAs), such as Direct Preference Optimization (DPO) and Simple Preference Optimization (SimPO), have emerged as efficient alternatives to Reinforcement Learning from Human Feedback (RLHF) algorithms for aligning large language models (LLMs) with human preferences. However, DAAs suffer from a fundamental limitation we identify as the "reward-generation gap" -- a misalignment between optimization objectives during training and actual generation performance during inference. In this paper, we find a contributor to the reward-generation gap is the mismatch between the inherent importance of prefix tokens during the LLM generation process and how this importance is reflected in the implicit reward functions of DAAs. To bridge the gap, we introduce a simple yet effective approach called Prefix-Oriented Equal-length Training (POET), which truncates both preferred and dispreferred responses to match the shorter one's length. When training with POET, both responses in each sample are truncated to equal length, which yields diverse truncated lengths across samples; the optimization of the DAA objective is thus implicitly constrained to converge across all positions, paying more attention to prefix tokens than standard DAAs. We conduct experiments with DPO and SimPO, two representative DAAs, demonstrating that POET improves over their standard implementations, achieving up to 15.6 points in AlpacaEval 2 and overall improvements across downstream tasks. Our results highlight the importance of addressing the misalignment between reward optimization and generation performance in DAAs.
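The truncation step described in the abstract is simple enough to sketch directly. The snippet below assumes responses are already tokenized into lists; the function name `poet_truncate` and the toy token lists are illustrative, and the downstream DPO or SimPO loss is not shown.

```python
def poet_truncate(chosen_tokens, rejected_tokens):
    """Truncate both responses of a preference pair to the shorter length."""
    shared_length = min(len(chosen_tokens), len(rejected_tokens))
    return chosen_tokens[:shared_length], rejected_tokens[:shared_length]


# Toy preference pair (hypothetical tokens).
chosen = ["Sure", ",", "here", "is", "a", "short", "proof", "."]
rejected = ["I", "cannot", "help", "."]

chosen_cut, rejected_cut = poet_truncate(chosen, rejected)
```

Because the shorter response differs from pair to pair, the truncation point varies across the dataset, which is how the abstract explains the objective being pushed to hold at all prefix lengths.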

[135] Bridging Online Behavior and Clinical Insight: A Longitudinal LLM-based Study of Suicidality on YouTube Reveals Novel Digital Markers

Ilanit Sobol,Shir Lissak,Refael Tikochinski,Tal Nakash,Anat Brunstein Klomek,Eyal Fruchter,Roi Reichart

Main category: cs.CL

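TL;DR: By analyzing YouTube video content from individuals who attempted suicide, and combining computational models with expert knowledge, the study reveals how suicidal behavior manifests on social media and how it differs from clinical knowledge.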

Details

Motivation: Suicide is a leading cause of death in Western countries, and social media data offer a new perspective for studying suicidal behavior. Method: Three approaches (computational bottom-up, hybrid, and expert-driven top-down) are applied to YouTube channel data from 181 individuals who attempted suicide and 134 controls. Result: Five topics are associated with suicide attempts, two of which change significantly over time; platform-specific indicators not flagged by the expert (such as YouTube Engagement) prove valuable. Conclusion: The combined approach provides a deeper understanding of suicidal behavior, connecting digital behavior with clinical knowledge. Abstract: Suicide remains a leading cause of death in Western countries, underscoring the need for new research approaches. As social media becomes central to daily life, digital footprints offer valuable insight into suicidal behavior. Focusing on individuals who attempted suicide while uploading videos to their channels, we investigate: How do suicidal behaviors manifest on YouTube, and how do they differ from expert knowledge? We applied complementary approaches: computational bottom-up, hybrid, and expert-driven top-down, on a novel longitudinal dataset of 181 YouTube channels from individuals with life-threatening attempts, alongside 134 control channels. In the bottom-up approach, we applied LLM-based topic modeling to identify behavioral indicators. Of 166 topics, five were associated with suicide-attempt, with two also showing temporal attempt-related changes ($p<.01$) - Mental Health Struggles ($+0.08$)* and YouTube Engagement ($+0.1$)*. In the hybrid approach, a clinical expert reviewed LLM-derived topics and flagged 19 as suicide-related. However, none showed significant attempt-related temporal effects beyond those identified bottom-up. Notably, YouTube Engagement, a platform-specific indicator, was not flagged by the expert, underscoring the value of bottom-up discovery. In the top-down approach, psychological assessment of suicide attempt narratives revealed that the only significant difference between individuals who attempted before and those who attempted during their upload period was the motivation to share this experience: the former aimed to Help Others ($\beta=-1.69$, $p<.01$), while the latter framed it as part of their Personal Recovery ($\beta=1.08$, $p<.01$). By integrating these approaches, we offer a nuanced understanding of suicidality, bridging digital behavior and clinical insights. * Within-group changes in relation to the suicide attempt.

[136] Give Me FP32 or Give Me Death? Challenges and Solutions for Reproducible Reasoning

Jiayi Yuan,Hao Li,Xinheng Ding,Wenya Xie,Yu-Jhe Li,Wentian Zhao,Kun Wan,Jing Shi,Xia Hu,Zirui Liu

Main category: cs.CL

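TL;DR: The study finds that the reproducibility of large language model (LLM) performance is fragile: system configuration (such as GPU count, GPU version, and evaluation batch size) can cause significant differences, especially in reasoning models.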

Details

Motivation: Examine the reproducibility of LLM performance evaluation and show how numerical precision affects model outputs. Method: Controlled experiments analyze how model outputs differ across hardware, software, and precision settings, and a lightweight inference pipeline, LayerCast, is proposed. Result: Experiments show that a reasoning model's accuracy and response length can vary by up to 9% and 9,000 tokens across configurations; the root cause is the non-associativity of floating-point arithmetic. Conclusion: Numerical precision is critical to the reproducibility of LLM inference but is often neglected; LayerCast balances memory efficiency with numerical stability. Abstract: Large Language Models (LLMs) are now integral across various domains and have demonstrated impressive performance. Progress, however, rests on the premise that benchmark scores are both accurate and reproducible. We demonstrate that the reproducibility of LLM performance is fragile: changing system configuration such as evaluation batch size, GPU count, and GPU version can introduce significant difference in the generated responses. This issue is especially pronounced in reasoning models, where minor rounding differences in early tokens can cascade into divergent chains of thought, ultimately affecting accuracy. For instance, under bfloat16 precision with greedy decoding, a reasoning model like DeepSeek-R1-Distill-Qwen-7B can exhibit up to 9% variation in accuracy and 9,000 tokens difference in response length due to differences in GPU count, type, and evaluation batch size. We trace the root cause of this variability to the non-associative nature of floating-point arithmetic under limited numerical precision. This work presents the first systematic investigation into how numerical precision affects reproducibility in LLM inference. Through carefully controlled experiments across various hardware, software, and precision settings, we quantify when and how model outputs diverge. Our analysis reveals that floating-point precision -- while critical for reproducibility -- is often neglected in evaluation practices. Inspired by this, we develop a lightweight inference pipeline, dubbed LayerCast, that stores weights in 16-bit precision but performs all computations in FP32, balancing memory efficiency with numerical stability. Code is available at https://github.com/nanomaoli/llm_reproducibility.
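The non-associativity the abstract refers to can be reproduced with a few lines of standard-library Python. This sketch uses IEEE 754 half precision (via the `struct` format code `e`) as a stand-in for bfloat16, and 64-bit Python floats as a stand-in for the FP32 accumulator; it illustrates the store-narrow, compute-wide idea and is not the LayerCast implementation.

```python
import struct


def to_fp16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]


def sum_fp16(values):
    """Accumulate in half precision: round after every addition."""
    total = 0.0
    for v in values:
        total = to_fp16(total + v)
    return total


def sum_wide(values):
    """Accumulate in a wider format (64-bit floats here, FP32 in the paper)."""
    total = 0.0
    for v in values:
        total += v
    return total


# All three inputs are exactly representable in half precision.
values = [2048.0, 1.0, 1.0]

forward = sum_fp16(values)                    # (2048 + 1) + 1
backward = sum_fp16(list(reversed(values)))   # (1 + 1) + 2048
wide_forward = sum_wide(values)
wide_backward = sum_wide(list(reversed(values)))
```

In half precision the spacing between representable values near 2048 is 2, so adding 1 twice is lost while adding 2 once is not; the two orders therefore disagree, whereas the wide accumulator gives 2050 either way. Different GPU counts and batch sizes change reduction order in much the same way.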

[137] TransXSSM: A Hybrid Transformer State Space Model with Unified Rotary Position Embedding

Bingheng Wu,Jingze Shi,Yifan Wu,Nan Tang,Yuyu Luo

Main category: cs.CL

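TL;DR: Proposes a unified rotary position embedding (RoPE) method that resolves the positional-encoding incompatibility between Transformers and state space models (SSMs), and a hybrid architecture, TransXSSM, that outperforms standard Transformers in both accuracy and efficiency.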

Details

Motivation: Transformers and SSMs each have strengths in long-sequence modeling, but their positional encoding mechanisms are incompatible, which hurts the performance of hybrid architectures. Method: A unified RoPE method gives the self-attention and state-space components a consistent positional encoding, and the hybrid architecture TransXSSM is built on it. Result: At a 4K sequence length, TransXSSM trains and runs inference 42.3% and 29.5% faster, respectively, than a standard Transformer, improves accuracy on language modeling tasks by over 4%, and scales better. Conclusion: Unified positional encoding resolves the positional incompatibility in hybrid models, enabling efficient, high-performance long-context modeling. Abstract: Transformers exhibit proficiency in capturing long-range dependencies, whereas State Space Models (SSMs) facilitate linear-time sequence modeling. Notwithstanding their synergistic potential, the integration of these architectures presents a significant challenge, primarily attributable to a fundamental incongruity in their respective positional encoding mechanisms: Transformers rely on explicit Rotary Position Embeddings (RoPE), while SSMs leverage implicit positional representations via convolutions. This divergence often precipitates discontinuities and suboptimal performance. To address this impediment, we propose a unified rotary position embedding (unified RoPE) methodology, thereby establishing a consistent positional encoding framework for both self-attention and state-space components. Using this unified RoPE, we introduce TransXSSM, a hybrid architecture that coherently integrates the Transformer and SSM layers under this unified positional encoding scheme. At a 4K sequence length, TransXSSM exhibits training and inference speeds that are 42.3% and 29.5% faster, respectively, relative to standard Transformer models. It also delivers higher accuracy: under comparable settings, it surpasses a Transformer baseline by over 4% on language modeling benchmarks. TransXSSM furthermore scales more effectively: TransXSSM-1.3B gains 7.22% in average accuracy over its 320M version (versus about 6% gains for equivalent Transformers or SSMs). Our results show that unified positional encoding resolves positional incompatibility in hybrid models, enabling efficient, high-performance long-context modeling.
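For readers unfamiliar with rotary embeddings, the sketch below shows the standard RoPE rotation and the property that makes it attractive as a shared encoding: the dot product of two rotated vectors depends only on their relative offset. This is plain RoPE as commonly defined, not the paper's unified variant for state-space layers; the vectors and the `base` value are illustrative assumptions.

```python
import math


def rope(vec, pos, base=10000.0):
    """Apply a rotary position embedding to an even-length vector."""
    dim = len(vec)
    rotated = []
    for i in range(0, dim, 2):
        # Each consecutive pair of features rotates at its own frequency.
        theta = pos * base ** (-i / dim)
        cos_t, sin_t = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        rotated += [x * cos_t - y * sin_t, x * sin_t + y * cos_t]
    return rotated


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


query = [1.0, 0.5, -0.3, 0.8]
key = [0.2, -0.7, 0.4, 0.1]

# Same relative offset (2) at two different absolute positions.
score_near = dot(rope(query, 5), rope(key, 3))
score_far = dot(rope(query, 12), rope(key, 10))
```

The two scores agree up to floating-point error, so a component that consumes rotated vectors sees relative position only, which is the kind of consistency a hybrid attention and SSM stack needs.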

[138] ReasonMed: A 370K Multi-Agent Generated Dataset for Advancing Medical Reasoning

Yu Sun,Xingyu Qian,Weiwen Xu,Hao Zhang,Chenghao Xiao,Long Li,Yu Rong,Wenbing Huang,Qifeng Bai,Tingyang Xu

Main category: cs.CL

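TL;DR: ReasonMed is a large-scale medical reasoning dataset built through a multi-agent verification and refinement process to improve LLM performance on medical question answering. Using a training strategy that combines detailed reasoning with concise answer summaries, the ReasonMed-7B model outperforms comparable models.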

Details

Motivation: Explore the potential of reasoning LLMs in knowledge-intensive medical question answering, filling a gap in existing research. Method: Build the ReasonMed dataset (370k high-quality examples), improve reasoning-path quality through a multi-agent verification and refinement process (including an Error Refiner), and study the best training strategy (combining detailed reasoning with concise answer summaries). Result: ReasonMed-7B outperforms comparable models (by 4.17%) and exceeds LLaMA3.1-70B on PubMedQA (by 4.60%). Conclusion: The ReasonMed dataset and training strategy substantially improve LLM performance on medical reasoning tasks and set a new benchmark for future research. Abstract: Though reasoning-based large language models (LLMs) have excelled in mathematics and programming, their capabilities in knowledge-intensive medical question answering remain underexplored. To address this, we introduce ReasonMed, the largest medical reasoning dataset, comprising 370k high-quality examples distilled from 1.7 million initial reasoning paths generated by various LLMs. ReasonMed is constructed through a multi-agent verification and refinement process, where we design an Error Refiner to enhance the reasoning paths by identifying and correcting error-prone steps flagged by a verifier. Leveraging ReasonMed, we systematically investigate best practices for training medical reasoning models and find that combining detailed Chain-of-Thought (CoT) reasoning with concise answer summaries yields the most effective fine-tuning strategy. Based on this strategy, we train ReasonMed-7B, which sets a new benchmark for sub-10B models, outperforming the prior best by 4.17% and even exceeding LLaMA3.1-70B on PubMedQA by 4.60%.

[139] KG-Infused RAG: Augmenting Corpus-Based RAG with External Knowledge Graphs

Dingjun Wu,Yukun Yan,Zhenghao Liu,Zhiyuan Liu,Maosong Sun

Main category: cs.CL

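TL;DR: KG-Infused RAG combines knowledge graphs (KGs) with RAG systems, using a cognitively inspired spreading activation mechanism to improve factual accuracy; experiments show it outperforms conventional RAG on multiple QA benchmarks.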

Details

Motivation: Existing RAG methods rely on a single knowledge source and lack cognitively inspired mechanisms for activating knowledge, which limits their performance. Method: The KG-Infused RAG framework integrates KGs to implement spreading activation, enhancing generation through query expansion and multi-source retrieval. Result: On five QA benchmarks, KG-Infused RAG improves performance by 3.8% to 13.8% and, used as a plug-in, further improves Self-RAG. Conclusion: KG-Infused RAG is an efficient, extensible enhancement module for corpus-based RAG methods. Abstract: Retrieval-Augmented Generation (RAG) improves factual accuracy by grounding responses in external knowledge. However, existing methods typically rely on a single source, either unstructured text or structured knowledge. Moreover, they lack cognitively inspired mechanisms for activating relevant knowledge. To address these issues, we propose KG-Infused RAG, a framework that integrates KGs into RAG systems to implement spreading activation, a cognitive process that enables concept association and inference. KG-Infused RAG retrieves KG facts, expands the query accordingly, and enhances generation by combining corpus passages with structured facts, enabling interpretable, multi-source retrieval grounded in semantic structure. We further improve KG-Infused RAG via preference learning on sampled key stages in the pipeline. Experiments on five QA benchmarks show that KG-Infused RAG consistently outperforms vanilla RAG (by 3.8% to 13.8%). Additionally, when integrated into Self-RAG, KG-Infused RAG brings further performance gains, demonstrating its effectiveness and versatility as a plug-and-play enhancement module for corpus-based RAG methods.
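Spreading activation can be pictured as a decaying breadth-first walk from the entities mentioned in the query. The sketch below is a generic textbook version over a toy adjacency list, not the KG-Infused RAG pipeline; the graph contents, the `decay` and `hops` parameters, and the query-expansion step are all illustrative assumptions.

```python
def spread_activation(graph, seeds, decay=0.5, hops=2):
    """Propagate activation from seed entities through a knowledge graph.

    graph - dict mapping an entity to a list of neighbouring entities
    seeds - entities found in the query, each starting at activation 1.0
    """
    activation = {seed: 1.0 for seed in seeds}
    frontier = list(seeds)
    for _ in range(hops):
        next_frontier = []
        for node in frontier:
            for neighbour in graph.get(node, []):
                score = activation[node] * decay
                # Keep the strongest activation seen for each entity.
                if score > activation.get(neighbour, 0.0):
                    activation[neighbour] = score
                    next_frontier.append(neighbour)
        frontier = next_frontier
    return activation


# Toy knowledge graph (hypothetical edges).
graph = {
    "Marie Curie": ["radium", "Nobel Prize"],
    "radium": ["radioactivity"],
    "radioactivity": ["Henri Becquerel"],
}

activation = spread_activation(graph, ["Marie Curie"], decay=0.5, hops=2)

# Expand the query with activated entities, strongest first.
expansion_terms = sorted(
    (entity for entity in activation if entity != "Marie Curie"),
    key=lambda entity: -activation[entity],
)
```

The expanded term list would then be appended to the original query before corpus retrieval, so that passages about associated concepts are retrieved alongside direct matches.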

[140] MEDUSA: A Multimodal Deep Fusion Multi-Stage Training Framework for Speech Emotion Recognition in Naturalistic Conditions

Georgios Chatzichristodoulou,Despoina Kosmopoulou,Antonios Kritikos,Anastasia Poulopoulou,Efthymios Georgiou,Athanasios Katsamanis,Vassilis Katsouros,Alexandros Potamianos

Main category: cs.CL

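TL;DR: MEDUSA is a multimodal framework whose four-stage training pipeline effectively handles class imbalance and emotion ambiguity, performing strongly on speech emotion recognition in naturalistic conditions.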

Details

Motivation: Speech emotion recognition (SER) is challenging because human emotions are subjective and unevenly represented under naturalistic conditions. Method: MEDUSA uses a four-stage training pipeline: the first two stages train an ensemble of DeepSER-based classifiers, and the last two optimize a trainable meta-classifier, combining several regularization and multitask learning techniques. Result: MEDUSA ranked first in the emotion recognition task of the Interspeech 2025 challenge. Conclusion: Through multimodal and ensemble learning, MEDUSA substantially improves emotion recognition in naturalistic conditions. Abstract: Speech emotion recognition (SER) is a challenging task due to the subjective nature of human emotions and their uneven representation under naturalistic conditions. We propose MEDUSA, a multimodal framework with a four-stage training pipeline, which effectively handles class imbalance and emotion ambiguity. The first two stages train an ensemble of classifiers that utilize DeepSER, a novel extension of a deep cross-modal transformer fusion mechanism from pretrained self-supervised acoustic and linguistic representations. Manifold MixUp is employed for further regularization. The last two stages optimize a trainable meta-classifier that combines the ensemble predictions. Our training approach incorporates human annotation scores as soft targets, coupled with balanced data sampling and multitask learning. MEDUSA ranked 1st in Task 1: Categorical Emotion Recognition in the Interspeech 2025: Speech Emotion Recognition in Naturalistic Conditions Challenge.

[141] Gender Bias in English-to-Greek Machine Translation

Eleni Gkovedarou,Joke Daems,Luna De Bruyne

Main category: cs.CL

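TL;DR: The study investigates gender bias in English-to-Greek translation by Google Translate and DeepL, finding that both perform well when gender is explicit but struggle to produce inclusive or neutral translations when gender is unspecified. GPT-4o shows promise, though residual bias remains.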

Details

Motivation: As demand for inclusive language grows, concern has risen that machine translation systems may reinforce gender stereotypes. Method: Using the GendEL dataset (240 sentences), the study analyzes how two commercial MT systems and GPT-4o perform on three aspects of gender bias. Result: Both MT systems perform well when gender is explicit (DeepL outperforms Google Translate and GPT-4o) but poorly when gender is unspecified. GPT-4o produces appropriate gendered or neutral alternatives for most ambiguous cases. Conclusion: Commercial MT systems still need improvement to achieve gender-inclusive translation; GPT-4o shows promise but needs further refinement. Abstract: As the demand for inclusive language increases, concern has grown over the susceptibility of machine translation (MT) systems to reinforce gender stereotypes. This study investigates gender bias in two commercial MT systems, Google Translate and DeepL, focusing on the understudied English-to-Greek language pair. We address three aspects of gender bias: i) male bias, ii) occupational stereotyping, and iii) errors in anti-stereotypical translations. Additionally, we explore the potential of prompted GPT-4o as a bias mitigation tool that provides both gender-explicit and gender-neutral alternatives when necessary. To achieve this, we introduce GendEL, a manually crafted bilingual dataset of 240 gender-ambiguous and unambiguous sentences that feature stereotypical occupational nouns and adjectives. We find persistent gender bias in translations by both MT systems; while they perform well in cases where gender is explicitly defined, with DeepL outperforming both Google Translate and GPT-4o in feminine gender-unambiguous sentences, they are far from producing gender-inclusive or neutral translations when the gender is unspecified. GPT-4o shows promise, generating appropriate gendered and neutral alternatives for most ambiguous cases, though residual biases remain evident.

[142] Towards Open Foundation Language Model and Corpus for Macedonian: A Low-Resource Language

Stefan Krsteski,Matea Tashkovska,Borjan Sazdov,Hristijan Gjoreski,Branislav Gerazov

Main category: cs.CL

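TL;DR: By building a large-scale Macedonian corpus, an instruction dataset, and an evaluation suite, the paper trains an 8B-parameter LLM that substantially improves performance for this low-resource language.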

Details

Motivation: Address the limitations of LLMs for low-resource languages such as Macedonian and promote related research and practical applications. Method: Collect a 40GB Macedonian corpus (3.5B words) and a 106k-instance instruction dataset, build an evaluation suite, and train the 8B-parameter model domestic-yak. Result: domestic-yak is the best model in the 8B parameter range, performs comparably to models up to 10x larger, and is rated by native speakers as better in grammatical correctness and cultural appropriateness. Conclusion: The openly released datasets, code, and model weights lay a foundation for LLM research on low-resource languages. Abstract: The increase in technological adoption worldwide comes with demands for novel tools to be used by the general population. Large Language Models (LLMs) provide a great opportunity in this respect, but their capabilities remain limited for low-resource languages, restricting applications in countries where such languages are spoken. We create several resources to facilitate the adoption of LLMs and to support research advancements for Macedonian. We collect the largest Macedonian corpus to date, consisting of 40GB of textual data and totaling 3.5B words. To support conversational applications, we collect a 106k-instance instruction dataset, carefully built to be culturally grounded. For evaluation, we construct a Macedonian evaluation suite covering seven benchmarks. Finally, we train domestic-yak, a state-of-the-art 8B-parameter model, on our curated datasets and evaluate it against eight baseline models using the newly constructed benchmark suite. Our model outperforms all existing models in the 8B parameter range across all benchmarks, and achieves performance comparable to models up to 10x larger. Furthermore, a qualitative analysis with native speakers reveals that our model is preferred over larger counterparts, receiving higher ratings for grammatical correctness and cultural appropriateness. All datasets, code, and model weights are openly released, setting a foundation for advancing LLMs in similarly underrepresented languages. These resources are publicly available at github.com/LVSTCK for source code, and at huggingface.co/LVSTCK for pretrained model weights and data.

[143] From Symbolic to Neural and Back: Exploring Knowledge Graph-Large Language Model Synergies

Blaž Škrlj,Boshko Koloski,Senja Pollak,Nada Lavrač

Main category: cs.CL

TL;DR: This survey examines the combination of knowledge graphs (KGs) and large language models (LLMs), divides existing work into KG-enhanced LLMs and LLM-augmented KGs, analyzes strengths and gaps, and proposes future research directions.

Details

Motivation: Integrate structured knowledge to strengthen the factual grounding and reasoning of LLMs, while using LLMs to improve KG construction and querying. Method: Systematically categorize existing approaches into KG-enhanced LLMs and LLM-augmented KGs, and analyze their technical details and effectiveness. Result: The survey shows that structured knowledge integration benefits both sides and emphasizes scalability, computational efficiency, and data quality. Conclusion: Future research should address neuro-symbolic integration, dynamic KG updating, data reliability, and ethical considerations to support more complex knowledge tasks. Abstract: Integrating structured knowledge from Knowledge Graphs (KGs) into Large Language Models (LLMs) enhances factual grounding and reasoning capabilities. This survey paper systematically examines the synergy between KGs and LLMs, categorizing existing approaches into two main groups: KG-enhanced LLMs, which improve reasoning, reduce hallucinations, and enable complex question answering; and LLM-augmented KGs, which facilitate KG construction, completion, and querying. Through comprehensive analysis, we identify critical gaps and highlight the mutual benefits of structured knowledge integration. Compared to existing surveys, our study uniquely emphasizes scalability, computational efficiency, and data quality. Finally, we propose future research directions, including neuro-symbolic integration, dynamic KG updating, data reliability, and ethical considerations, paving the way for intelligent systems capable of managing more complex real-world knowledge tasks.

[144] Memorization in Language Models through the Lens of Intrinsic Dimension

Stefan Arnold

Main category: cs.CL

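TL;DR: The study finds that the intrinsic dimension (ID) of a sequence suppresses memorization in language models: high-ID sequences are less likely to be memorized than low-ID ones, particularly in overparameterized models and under sparse exposure.

Details

Motivation: Language models can unintentionally memorize training data and emit it at generation time, leaking private data or intellectual property, yet little is known about how latent structure modulates the memorization rate. Method: Use intrinsic dimension (ID) as a geometric proxy for the structural complexity of a sequence in latent space and study how it modulates memorization. Result: High-ID sequences are less likely to be memorized than low-ID sequences, particularly in overparameterized models and under sparse exposure. Conclusion: The study highlights the interaction between model scale, data exposure, and structural complexity in shaping memorization. Abstract: Language Models (LMs) are prone to memorizing parts of their data during training and unintentionally emitting them at generation time, raising concerns about privacy leakage and disclosure of intellectual property. While previous research has identified properties such as context length, parameter size, and duplication frequency as key drivers of unintended memorization, little is known about how the latent structure modulates this rate of memorization. We investigate the role of Intrinsic Dimension (ID), a geometric proxy for the structural complexity of a sequence in latent space, in modulating memorization. Our findings suggest that ID acts as a suppressive signal for memorization: compared to low-ID sequences, high-ID sequences are less likely to be memorized, particularly in overparameterized models and under sparse exposure. These findings highlight the interaction between scale, exposure, and complexity in shaping memorization.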
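
To make "intrinsic dimension" concrete, here is a minimal sketch of one common estimator, TwoNN, which uses the ratio of each point's second- to first-nearest-neighbor distance. The abstract does not say which estimator the author used, so the choice of TwoNN, the function name `two_nn_dimension`, and the toy data (a flat 2-D sheet embedded in 5-D space, standing in for latent representations) are all assumptions made for illustration.

```python
import math
import random


def two_nn_dimension(points):
    """TwoNN maximum-likelihood estimate: N / sum(log(r2 / r1))."""
    log_ratios = []
    for i, p in enumerate(points):
        # Distances from point i to every other point, smallest first.
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        r1, r2 = dists[0], dists[1]
        if r1 > 0:  # skip exact duplicates, which leave the ratio undefined
            log_ratios.append(math.log(r2 / r1))
    return len(log_ratios) / sum(log_ratios)


random.seed(0)
# Hypothetical data: 400 points on a 2-D sheet embedded in 5-D space.
points = [(random.random(), random.random(), 0.0, 0.0, 0.0) for _ in range(400)]
estimate = two_nn_dimension(points)
print(round(estimate, 2))  # expected to land near 2, not near the ambient 5
```

The estimate tracks the dimension of the sheet rather than the dimension of the space it sits in, which is the sense in which ID serves as a proxy for structural complexity.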

[145] Benchmarking Debiasing Methods for LLM-based Parameter Estimates

Nicolas Audinet de Pieuchon,Adel Daoud,Connor T. Jerzak,Moa Johansson,Richard Johansson

Main category: cs.CL

TL;DR: The paper studies bias in LLM text annotation and compares two debiasing methods (DSL and PPI) in finite samples, finding that DSL often outperforms PPI on bias reduction and efficiency but is less consistent across datasets.

Details

Motivation: LLM annotations are often inconsistent with expert annotations, which can bias downstream statistical estimates, so effective debiasing methods are needed. Method: Compare Design-based Supervised Learning (DSL) and Prediction-Powered Inference (PPI) across tasks and numbers of expert annotations. Result: DSL often outperforms PPI on bias reduction and empirical efficiency, but its performance is less consistent across datasets; both achieve low bias with large datasets. Conclusion: There is a bias-variance tradeoff at the level of debiasing methods, and more research is needed on quantifying their efficiency in finite samples. Abstract: Large language models (LLMs) offer an inexpensive yet powerful way to annotate text, but are often inconsistent when compared with experts. These errors can bias downstream estimates of population parameters such as regression coefficients and causal effects. To mitigate this bias, researchers have developed debiasing methods such as Design-based Supervised Learning (DSL) and Prediction-Powered Inference (PPI), which promise valid estimation by combining LLM annotations with a limited number of expensive expert annotations. Although these methods produce consistent estimates under theoretical assumptions, it is unknown how they compare in finite samples of sizes encountered in applied research. We make two contributions: First, we study how each method's performance scales with the number of expert annotations, highlighting regimes where LLM bias or limited expert labels significantly affect results. Second, we compare DSL and PPI across a range of tasks, finding that although both achieve low bias with large datasets, DSL often outperforms PPI on bias reduction and empirical efficiency, but its performance is less consistent across datasets. Our findings indicate that there is a bias-variance tradeoff at the level of debiasing methods, calling for more research on developing metrics for quantifying their efficiency in finite samples.
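
The idea of combining many LLM annotations with a few expert annotations can be sketched for the simplest target, a population mean. The PPI-style point estimate averages the LLM labels on the large unlabeled set and then adds a correction term measured on the small expert-labeled subset. This is an illustrative sketch only: the paper studies regression coefficients and other parameters, and the function name `ppi_mean` and the toy labels below are hypothetical.

```python
from statistics import fmean


def ppi_mean(llm_unlabeled, llm_labeled, expert_labeled):
    """PPI-style estimate of a mean: LLM average plus an expert-based correction."""
    if len(llm_labeled) != len(expert_labeled):
        raise ValueError("labeled LLM and expert annotations must be paired")
    # Average error of the LLM on items where an expert label is available.
    rectifier = fmean([e - p for e, p in zip(expert_labeled, llm_labeled)])
    return fmean(llm_unlabeled) + rectifier


# Hypothetical binary annotations (1 = positive class).
llm_unlabeled = [1, 1, 0, 1, 1, 0, 1, 1]   # LLM labels on the large pool
llm_labeled = [1, 1, 0, 1]                 # LLM labels on the expert subset
expert_labeled = [1, 0, 0, 1]              # expert labels on the same items

estimate = ppi_mean(llm_unlabeled, llm_labeled, expert_labeled)
print(estimate)  # 0.75 from the LLM alone, shifted by -0.25 to 0.5
```

With only four expert labels the correction term is itself noisy, which illustrates the finite-sample regime the paper examines.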

[146] Modeling Probabilistic Reduction using Information Theory and Naive Discriminative Learning

Anna Stein,Kevin Tang

Main category: cs.CL

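TL;DR: The paper compares information-theoretic probabilistic predictors with Naive Discriminative Learning (NDL) predictors for modeling acoustic word duration; the N-gram model outperforms the NDL models, but information-theoretic formulas improve NDL performance.

Details

Motivation: Examine whether NDL, given its cognitive motivation, is more effective for modeling acoustic word duration, and whether information-theoretic formulas can improve NDL. Method: Compare three models on the Buckeye corpus: NDL-derived predictors with information-theoretic formulas, traditional NDL predictors, and N-gram probabilistic predictors. Result: The N-gram model performs best, challenging the assumed advantage of NDL's cognitive motivation, while information-theoretic formulas improve NDL over the traditional model. Conclusion: Modeling acoustic reduction should incorporate frequency, contextual predictability, and average contextual predictability, and should combine information-theoretic metrics with information derived from discriminative learning. Abstract: This study compares probabilistic predictors based on information theory with Naive Discriminative Learning (NDL) predictors in modeling acoustic word duration, focusing on probabilistic reduction. We examine three models using the Buckeye corpus: one with NDL-derived predictors using information-theoretic formulas, one with traditional NDL predictors, and one with N-gram probabilistic predictors. Results show that the N-gram model outperforms both NDL models, challenging the assumption that NDL is more effective due to its cognitive motivation. However, incorporating information-theoretic formulas into NDL improves model performance over the traditional model. This research highlights a) the need to incorporate not only frequency and contextual predictability but also average contextual predictability, and b) the importance of combining information-theoretic metrics of predictability and information derived from discriminative learning in modeling acoustic reduction.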
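
As a rough illustration of an N-gram probabilistic predictor, the sketch below computes bigram surprisal, the negative log probability of a word given the preceding word, from raw counts. The abstract does not list the exact predictors used, so the bigram order, the unsmoothed estimate, the function name `bigram_surprisal`, and the toy token sequence are assumptions for illustration only.

```python
import math
from collections import Counter


def bigram_surprisal(tokens, prev_word, word):
    """Surprisal in bits, -log2 P(word | prev_word), from unsmoothed bigram counts."""
    bigrams = Counter(zip(tokens, tokens[1:]))
    contexts = Counter(tokens[:-1])  # every token that has a successor
    # Unsmoothed estimate: an unseen bigram has probability 0 and raises ValueError.
    prob = bigrams[(prev_word, word)] / contexts[prev_word]
    return -math.log2(prob)


# Hypothetical toy corpus; a real study would use corpus-scale counts.
tokens = "the cat sat the cat ran the dog sat".split()
frequent = bigram_surprisal(tokens, "the", "cat")  # P = 2/3
rare = bigram_surprisal(tokens, "the", "dog")      # P = 1/3
print(round(frequent, 3), round(rare, 3))
```

Under probabilistic reduction, the more predictable continuation (lower surprisal) is the one expected to be produced with shorter duration.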

[147] Using Sign Language Production as Data Augmentation to enhance Sign Language Translation

Harry Walsh,Maksym Ivashechkin,Richard Bowden

Main category: cs.CL

TL;DR: Uses Sign Language Production to augment training data for Sign Language Translation through skeleton-based production, sign stitching, and generative models (SignGAN, SignSplat), improving translation performance by up to 19%.

Details

Motivation: Sign language data is scarce and hard to collect, which limits the performance of Sign Language Translation models. Method: Use skeleton-based production, sign stitching, and two photo-realistic generative models (SignGAN and SignSplat) to generate varied sign language data. Result: Sign Language Translation performance improves by up to 19%. Conclusion: The approach effectively augments existing datasets and improves Sign Language Translation, including in resource-constrained settings. Abstract: Machine learning models fundamentally rely on large quantities of high-quality data. Collecting the necessary data for these models can be challenging due to cost, scarcity, and privacy restrictions. Signed languages are visual languages used by the deaf community and are considered low-resource languages. Sign language datasets are often orders of magnitude smaller than their spoken language counterparts. Sign Language Production is the task of generating sign language videos from spoken language sentences, while Sign Language Translation is the reverse translation task. Here, we propose leveraging recent advancements in Sign Language Production to augment existing sign language datasets and enhance the performance of Sign Language Translation models. For this, we utilize three techniques: a skeleton-based approach to production, sign stitching, and two photo-realistic generative models, SignGAN and SignSplat. We evaluate the effectiveness of these techniques in enhancing the performance of Sign Language Translation models by generating variation in the signer's appearance and the motion of the skeletal data. Our results demonstrate that the proposed methods can effectively augment existing datasets and enhance the performance of Sign Language Translation models by up to 19%, paving the way for more robust and accurate Sign Language Translation systems, even in resource-constrained environments.

[148] Learning Efficient and Generalizable Graph Retriever for Knowledge-Graph Question Answering

Tianjun Yao,Haoxuan Li,Zhiqiang Shen,Pan Li,Tongliang Liu,Kun Zhang

Main category: cs.CL

TL;DR: RAPL is a framework that improves retrieval efficiency and generalization in knowledge graph question answering through two-stage labeling, model-agnostic graph transformation, and path-based reasoning.

Details

Motivation: Address the limited interpretability and structured reasoning that result from existing retrieval-augmented generation methods relying on unstructured text in knowledge graph question answering. Method: Propose RAPL, comprising a two-stage labeling strategy, a model-agnostic graph transformation, and a path-based reasoning strategy. Result: RAPL outperforms existing methods by 2.66%-20.34% and significantly narrows the performance gap between LLM reasoners of different sizes. Conclusion: Through structured retrieval and reasoning, RAPL improves the efficiency and generalization of knowledge graph question answering. Abstract: Large Language Models (LLMs) have shown strong inductive reasoning ability across various domains, but their reliability is hindered by outdated knowledge and hallucinations. Retrieval-Augmented Generation mitigates these issues by grounding LLMs with external knowledge; however, most existing RAG pipelines rely on unstructured text, limiting interpretability and structured reasoning. Knowledge graphs, which represent facts as relational triples, offer a more structured and compact alternative. Recent studies have explored integrating knowledge graphs with LLMs for knowledge graph question answering (KGQA), with a significant proportion adopting the retrieve-then-reasoning paradigm. In this framework, graph-based retrievers have demonstrated strong empirical performance, yet they still face challenges in generalization ability. In this work, we propose RAPL, a novel framework for efficient and effective graph retrieval in KGQA. RAPL addresses these limitations through three aspects: (1) a two-stage labeling strategy that combines heuristic signals with parametric models to provide causally grounded supervision; (2) a model-agnostic graph transformation approach to capture both intra- and inter-triple interactions, thereby enhancing representational capacity; and (3) a path-based reasoning strategy that facilitates learning from the injected rational knowledge, and supports the downstream reasoner through structured inputs. Empirically, RAPL outperforms state-of-the-art methods by 2.66%-20.34%, and significantly reduces the performance gap between smaller and more powerful LLM-based reasoners, as well as the gap under cross-dataset settings, highlighting its superior retrieval capability and generalizability. Code is available at: https://github.com/tianyao-aka/RAPL.

[149] Bridging the Gap Between Open-Source and Proprietary LLMs in Table QA

Nikolas Evkarpidi,Elena Tutubalina

Main category: cs.CL

TL;DR: The paper presents a question answering system for SemEval 2025 Task 8 that combines text-to-SQL and text-to-code generation, self-correction, and retrieval-augmented generation (RAG), orchestrated by an LLM. It ranked in the top 13 with 80% accuracy.

Details

Motivation: Address the challenges of question answering over tabular data and improve the performance of open-source models. Method: Integrate text-to-SQL and text-to-code generation, self-correction, RAG, and an end-to-end module, orchestrated by an LLM. Result: A top-13 ranking among 38 teams, 80% accuracy, and performance comparable to proprietary LLMs. Conclusion: The system significantly improves accuracy for open-source models, and the code is released. Abstract: This paper presents a system developed for SemEval 2025 Task 8: Question Answering (QA) over tabular data. Our approach integrates several key components: text-to-SQL and text-to-code generation modules, a self-correction mechanism, and retrieval-augmented generation (RAG). Additionally, it includes an end-to-end (E2E) module, all orchestrated by a large language model (LLM). Through ablation studies, we analyzed the effects of different parts of our pipeline and identified the challenges that are still present in this field. During the evaluation phase of the competition, our solution achieved an accuracy of 80%, resulting in a top-13 ranking among the 38 participating teams. Our pipeline demonstrates a significant improvement in accuracy for open-source models and achieves a performance comparable to proprietary LLMs in QA tasks over tables. The code is available in a GitHub repository.

[150] Query-Level Uncertainty in Large Language Models

Lihu Chen,Gaël Varoquaux

Main category: cs.CL

TL;DR: Proposes detecting knowledge boundaries through query-level uncertainty, using a training-free "Internal Confidence" mechanism to support adaptive inference.

Details

Motivation: LLMs should be aware of their knowledge boundary and distinguish known from unknown queries, which enables adaptive inference (such as RAG, slow and deep thinking, or abstention) and supports efficient, trustworthy AI. Method: Propose a training-free method, Internal Confidence, that detects query-level uncertainty via self-evaluations across layers and tokens. Result: On factual QA and mathematical reasoning tasks, Internal Confidence outperforms several baselines and enables efficient RAG and model cascading, reducing inference cost. Conclusion: The method effectively detects knowledge boundaries, improves adaptive inference, and lowers inference cost, giving it practical value. Abstract: It is important for Large Language Models to be aware of the boundary of their knowledge, that is, to distinguish known from unknown queries. This type of awareness can help models perform adaptive inference, such as invoking RAG, engaging in slow and deep thinking, or adopting the abstention mechanism, which is beneficial to the development of efficient and trustworthy AI. In this work, we propose a method to detect knowledge boundaries via Query-Level Uncertainty, which aims to determine if the model is able to address a given query without generating any tokens. To this end, we introduce a novel and training-free method called Internal Confidence, which leverages self-evaluations across layers and tokens. Empirical results on both factual QA and mathematical reasoning tasks demonstrate that our internal confidence can outperform several baselines. Furthermore, we showcase that our proposed method can be used for efficient RAG and model cascading, reducing inference costs while maintaining performance.

[151] Is Fine-Tuning an Effective Solution? Reassessing Knowledge Editing for Unstructured Data

Hao Xiong,Chuanyuan Tan,Wenliang Chen

Main category: cs.CL

TL;DR: The paper reassesses Unstructured Knowledge Editing (UKE) for large language models (LLMs), addressing the missing Locality evaluation and the abnormal failure of fine-tuning-based methods, and shows experimentally that a well-configured fine-tuning method outperforms existing approaches.

Details

Motivation: UKE is crucial for updating the knowledge of LLMs, but existing work lacks Locality evaluation and fine-tuning-based methods fail abnormally. Method: Construct two datasets (UnKEBench-Loc and AKEW-Loc), analyze four factors that affect fine-tuning-based methods, and derive an optimized fine-tuning method (FT-UKE). Result: FT-UKE outperforms existing methods in both single and batch editing, with its advantage growing as the batch size increases. Conclusion: FT-UKE is an effective UKE method and provides a training recipe for future research. Abstract: Unstructured Knowledge Editing (UKE) is crucial for updating the relevant knowledge of large language models (LLMs). It focuses on unstructured inputs, such as long or free-form texts, which are common forms of real-world knowledge. Although previous studies have proposed effective methods and tested them, some issues exist: (1) Lack of Locality evaluation for UKE, and (2) Abnormal failure of fine-tuning (FT) based methods for UKE. To address these issues, we first construct two datasets, UnKEBench-Loc and AKEW-Loc (CF), by extending two existing UKE datasets with locality test data from the unstructured and structured views. This enables a systematic evaluation of the Locality of post-edited models. Furthermore, we identify four factors that may affect the performance of FT-based methods. Based on these factors, we conduct experiments to determine how the well-performing FT-based methods should be trained for the UKE task, providing a training recipe for future research. Our experimental results indicate that the FT-based method with the optimal setting (FT-UKE) is surprisingly strong, outperforming the existing state-of-the-art (SOTA). In batch editing scenarios, FT-UKE shows strong performance as well, with its advantage over SOTA methods increasing as the batch size grows, expanding the average metric lead from +6.78% to +10.80%.

[152] Inv-Entropy: A Fully Probabilistic Framework for Uncertainty Quantification in Language Models

Haoyi Song,Ruihan Ji,Naichen Shi,Fan Lai,Raed Al Kontar

Main category: cs.CL

TL;DR: The paper proposes a probabilistic framework for uncertainty quantification that defines a new measure, Inv-Entropy, via a dual random walk and an inverse model, and introduces the genetic-algorithm-based perturbation strategy GAAP and the evaluation metric TSU; experiments show it outperforms existing methods.

Details

Motivation: Reliable deployment of large language models (LLMs) requires effective uncertainty quantification (UQ), but existing methods lack a probabilistic foundation. The paper aims to fill this theoretical gap and provide a practical framework. Method: Propose a dual random walk perspective that models input-output pairs as two Markov chains with transition probabilities defined by semantic similarity, then build a probabilistic framework based on an inverse model that quantifies uncertainty through systematic perturbations. Result: Experiments show that Inv-Entropy outperforms existing semantic UQ methods. Conclusion: The paper provides a flexible, theoretically grounded UQ framework applicable to a range of LLM settings. Abstract: Large language models (LLMs) have transformed natural language processing, but their reliable deployment requires effective uncertainty quantification (UQ). Existing UQ methods are often heuristic and lack a probabilistic foundation. This paper begins by providing a theoretical justification for the role of perturbations in UQ for LLMs. We then introduce a dual random walk perspective, modeling input-output pairs as two Markov chains with transition probabilities defined by semantic similarity. Building on this, we propose a fully probabilistic framework based on an inverse model, which quantifies uncertainty by evaluating the diversity of the input space conditioned on a given output through systematic perturbations. Within this framework, we define a new uncertainty measure, Inv-Entropy. A key strength of our framework is its flexibility: it supports various definitions of uncertainty measures, embeddings, perturbation strategies, and similarity metrics. We also propose GAAP, a perturbation algorithm based on genetic algorithms, which enhances the diversity of sampled inputs. In addition, we introduce a new evaluation metric, Temperature Sensitivity of Uncertainty (TSU), which directly assesses uncertainty without relying on correctness as a proxy. Extensive experiments demonstrate that Inv-Entropy outperforms existing semantic UQ methods. The code to reproduce the results can be found at https://github.com/UMDataScienceLab/Uncertainty-Quantification-for-LLMs.

[153] ComfyUI-R1: Exploring Reasoning Models for Workflow Generation

Zhenran Xu,Yiyu Wang,Xue Yang,Longyue Wang,Weihua Luo,Kaifu Zhang,Baotian Hu,Min Zhang

Main category: cs.CL

TL;DR: ComfyUI-R1 is a large reasoning model for automated AI workflow generation; its two-stage training framework (CoT fine-tuning and reinforcement learning) markedly improves format validity and structural integrity and outperforms existing methods.

Details

Motivation: Customizing workflows for AI-generated content requires expertise and has a steep learning curve; ComfyUI-R1 aims to address this through automation. Method: Build long chain-of-thought reasoning data from a dataset of 4K workflows, then train a 7B-parameter model with CoT fine-tuning and reinforcement learning. Result: The model reaches a 97% format validity rate and significantly surpasses methods based on GPT-4o and the Claude series on node-level and graph-level F1 scores. Conclusion: Long CoT reasoning and code-based workflow representation show potential for AI art creation, as ComfyUI-R1 demonstrates. Abstract: AI-generated content has evolved from monolithic models to modular workflows, particularly on platforms like ComfyUI, enabling customization in creative pipelines. However, crafting effective workflows requires great expertise to orchestrate numerous specialized components, presenting a steep learning curve for users. To address this challenge, we introduce ComfyUI-R1, the first large reasoning model for automated workflow generation. Starting with our curated dataset of 4K workflows, we construct long chain-of-thought (CoT) reasoning data, including node selection, workflow planning, and code-level workflow representation. ComfyUI-R1 is trained through a two-stage framework: (1) CoT fine-tuning for cold start, adapting models to the ComfyUI domain; (2) reinforcement learning for incentivizing reasoning capability, guided by a fine-grained rule-metric hybrid reward, ensuring format validity, structural integrity, and node-level fidelity. Experiments show that our 7B-parameter model achieves a 97% format validity rate, along with high pass rate and node-level and graph-level F1 scores, significantly surpassing prior state-of-the-art methods that employ leading closed-source models such as GPT-4o and the Claude series. Further analysis highlights the critical role of the reasoning process and the advantage of transforming workflows into code. Qualitative comparison reveals our strength in synthesizing intricate workflows with diverse nodes, underscoring the potential of long CoT reasoning in AI art creation.

[154] Do LLMs Give Psychometrically Plausible Responses in Educational Assessments?

Andreas Säuberli,Diego Frassinelli,Barbara Plank

Main category: cs.CL

TL;DR: Evaluates whether the responses of large language models (LLMs) to multiple-choice test items resemble human response behavior, which could accelerate test development.

Details

Motivation: Test development usually requires extensive pilot studies with human participants; if LLMs could mimic human response behavior, efficiency would improve substantially. Method: Using classical test theory and item response theory from psychometrics, evaluate responses from 18 instruction-tuned LLMs in reading, U.S. history, and economics. Result: Larger models are excessively confident, but their response distributions become more human-like after temperature scaling; LLMs correlate better with humans on reading comprehension items, but correlations are not strong overall. Conclusion: LLMs should not be used for piloting educational assessments in a zero-shot setting. Abstract: Knowing how test takers answer items in educational assessments is essential for test development, to evaluate item quality, and to improve test validity. However, this process usually requires extensive pilot studies with human participants. If large language models (LLMs) exhibit human-like response behavior to test items, this could open up the possibility of using them as pilot participants to accelerate test development. In this paper, we evaluate the human-likeness or psychometric plausibility of responses from 18 instruction-tuned LLMs with two publicly available datasets of multiple-choice test items across three subjects: reading, U.S. history, and economics. Our methodology builds on two theoretical frameworks from psychometrics which are commonly used in educational assessment, classical test theory and item response theory. The results show that while larger models are excessively confident, their response distributions can be more human-like when calibrated with temperature scaling. In addition, we find that LLMs tend to correlate better with humans in reading comprehension items compared to other subjects. However, the correlations are not very strong overall, indicating that LLMs should not be used for piloting educational assessments in a zero-shot setting.
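
The effect of temperature scaling on an overconfident response distribution can be sketched as follows: dividing the logits by a temperature above 1 before the softmax spreads probability mass across the answer options without changing their ranking. How the authors chose the temperature is not stated in the abstract, so the values below, the function name `softmax_with_temperature`, and the example logits are hypothetical.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Softmax over logits / temperature; temperature > 1 flattens the distribution."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical model scores for answer options A-D of one test item.
option_logits = [4.0, 1.0, 0.5, 0.0]
sharp = softmax_with_temperature(option_logits, 1.0)  # near-certain about option A
flat = softmax_with_temperature(option_logits, 3.0)   # closer to a mixed population
print([round(p, 3) for p in sharp])
print([round(p, 3) for p in flat])
```

A flattened distribution can be compared with the proportion of human test takers choosing each option, which is the kind of comparison the psychometric analysis relies on.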

[155] CoRT: Code-integrated Reasoning within Thinking

Chengpeng Li,Zhengyang Tang,Ziniu Li,Mingfeng Xue,Keqin Bao,Tian Ding,Ruoyu Sun,Benyou Wang,Xiang Wang,Junyang Lin,Dayiheng Liu

Main category: cs.CL

TL;DR: CoRT is a post-training framework that teaches large reasoning models (LRMs) to use a code interpreter (CI) effectively and efficiently; it synthesizes data with Hint-Engineering to optimize LRM-CI interaction and significantly improves mathematical reasoning.

Details

Motivation: LRMs such as o1 and DeepSeek-R1 remain inefficient or inaccurate on complex mathematical operations, and directly combining them with external computational tools such as a code interpreter is not efficient. Method: Synthesize code-integrated reasoning data through Hint-Engineering and post-train models from 1.5B to 32B parameters with supervised fine-tuning, rejection fine-tuning, and reinforcement learning. Result: Hint-Engineering models achieve absolute improvements of 4% on the 32B model and 8% on the 1.5B model across five mathematical reasoning datasets, while using substantially fewer tokens. Conclusion: CoRT makes LRM-CI interaction more efficient and significantly improves mathematical reasoning. Abstract: Large Reasoning Models (LRMs) like o1 and DeepSeek-R1 have shown remarkable progress in natural language reasoning with long chain-of-thought (CoT), yet they remain inefficient or inaccurate when handling complex mathematical operations. Addressing these limitations through computational tools (e.g., computation libraries and symbolic solvers) is promising, but it introduces a technical challenge: Code Interpreter (CI) brings external knowledge beyond the model's internal text representations, thus the direct combination is not efficient. This paper introduces CoRT, a post-training framework for teaching LRMs to leverage CI effectively and efficiently. As a first step, we address the data scarcity issue by synthesizing code-integrated reasoning data through Hint-Engineering, which strategically inserts different hints at appropriate positions to optimize LRM-CI interaction. We manually create 30 high-quality samples, upon which we post-train models ranging from 1.5B to 32B parameters, with supervised fine-tuning, rejection fine-tuning and reinforcement learning. Our experimental results demonstrate that Hint-Engineering models achieve 4% and 8% absolute improvements on DeepSeek-R1-Distill-Qwen-32B and DeepSeek-R1-Distill-Qwen-1.5B respectively, across five challenging mathematical reasoning datasets. Furthermore, Hint-Engineering models use about 30% fewer tokens for the 32B model and 50% fewer tokens for the 1.5B model compared with the natural language models. The models and code are available at https://github.com/ChengpengLi1003/CoRT.

[156] EmoNet-Voice: A Fine-Grained, Expert-Verified Benchmark for Speech Emotion Detection

Christoph Schuhmann,Robert Kaczmarczyk,Gollam Rabby,Felix Friedrich,Maurice Kraus,Kourosh Nadi,Huu Nguyen,Kristian Kersting,Sören Auer

Main category: cs.CL

TL;DR: EmoNet-Voice is a new resource for speech emotion detection, comprising a large-scale pre-training dataset and an expert-annotated benchmark, for evaluating fine-grained recognition across 40 emotion categories.

Details

Motivation: Current speech emotion recognition datasets are limited in emotional granularity, raise privacy concerns, or rely on acted portrayals, so a more comprehensive evaluation resource is needed. Method: Generate synthetic audio snippets simulating scenes designed to evoke specific emotions, have psychology experts assign intensity labels, build the EmoNet-Voice datasets, and develop the Empathic Insight Voice models. Result: The new models reach high agreement with human experts, and high-arousal emotions such as anger are much easier to detect than low-arousal states such as concentration. Conclusion: EmoNet-Voice provides a more comprehensive, privacy-preserving evaluation resource for speech emotion recognition. Abstract: The advancement of text-to-speech and audio generation models necessitates robust benchmarks for evaluating the emotional understanding capabilities of AI systems. Current speech emotion recognition (SER) datasets often exhibit limitations in emotional granularity, privacy concerns, or reliance on acted portrayals. This paper introduces EmoNet-Voice, a new resource for speech emotion detection, which includes EmoNet-Voice Big, a large-scale pre-training dataset (featuring over 4,500 hours of speech across 11 voices, 40 emotions, and 4 languages), and EmoNet-Voice Bench, a novel benchmark dataset with human expert annotations. EmoNet-Voice is designed to evaluate SER models on a fine-grained spectrum of 40 emotion categories with different levels of intensities. Leveraging state-of-the-art voice generation, we curated synthetic audio snippets simulating actors portraying scenes designed to evoke specific emotions. Crucially, we conducted rigorous validation by psychology experts who assigned perceived intensity labels. This synthetic, privacy-preserving approach allows for the inclusion of sensitive emotional states often absent in existing datasets. Lastly, we introduce Empathic Insight Voice models that set a new standard in speech emotion recognition with high agreement with human experts. Our evaluations across the current model landscape exhibit valuable findings, such as high-arousal emotions like anger being much easier to detect than low-arousal states like concentration.

[157] Error-Guided Pose Augmentation: Enhancing Rehabilitation Exercise Assessment through Targeted Data Generation

Omar Sherif,Ali Hamdi

Main category: cs.CL

TL;DR: The paper proposes EGPA, which generates synthetic skeleton data by simulating clinically relevant movement errors; combined with an attention-based graph convolutional network, it substantially improves the accuracy and interpretability of rehabilitation assessment.

Details

Motivation: Existing rehabilitation assessment systems suffer from data imbalance and have difficulty detecting subtle movement errors, so a more effective approach is needed. Method: Propose Error-Guided Pose Augmentation (EGPA), which simulates clinical movement errors to generate synthetic data, and train with an attention-based graph convolutional network. Result: EGPA reduces mean absolute error by up to 27.6% and improves error classification accuracy by 45.8%, and the model focuses on clinically significant joints and movement phases. Conclusion: EGPA offers an effective and interpretable approach to automated movement quality assessment in clinical and home-based rehabilitation. Abstract: Effective rehabilitation assessment is essential for monitoring patient progress, particularly in home-based settings. Existing systems often face challenges such as data imbalance and difficulty detecting subtle movement errors. This paper introduces Error-Guided Pose Augmentation (EGPA), a method that generates synthetic skeleton data by simulating clinically relevant movement mistakes. Unlike standard augmentation techniques, EGPA targets biomechanical errors observed in rehabilitation. Combined with an attention-based graph convolutional network, EGPA improves performance across multiple evaluation metrics. Experiments demonstrate reductions in mean absolute error of up to 27.6 percent and gains in error classification accuracy of 45.8 percent. Attention visualizations show that the model learns to focus on clinically significant joints and movement phases, enhancing both accuracy and interpretability. EGPA offers a promising approach for improving automated movement quality assessment in both clinical and home-based rehabilitation contexts.

[158] Dataset of News Articles with Provenance Metadata for Media Relevance Assessment

Tomas Peterka,Matyas Bohacek

Main category: cs.CL

TL;DR: The paper introduces a dataset of news articles with provenance-tagged images for assessing the relevance of image provenance, and evaluates six large language models on location and date-and-time relevance tasks.

Details

Motivation: Existing methods for detecting media manipulation only check whether image semantics match the text and ignore provenance relevance, so manipulation goes undetected. Method: Build the News Media Provenance Dataset and define two tasks, location of origin relevance (LOR) and date and time of origin relevance (DTOR), evaluating six LLMs in a zero-shot setting. Result: Zero-shot performance on LOR is promising, but performance on DTOR lags. Conclusion: Specialized architectures are needed to improve performance on DTOR. Abstract: Out-of-context and misattributed imagery is the leading form of media manipulation in today's misinformation and disinformation landscape. The existing methods attempting to detect this practice often only consider whether the semantics of the imagery corresponds to the text narrative, missing manipulation so long as the depicted objects or scenes somewhat correspond to the narrative at hand. To tackle this, we introduce the News Media Provenance Dataset, a dataset of news articles with provenance-tagged images. We formulate two tasks on this dataset, location of origin relevance (LOR) and date and time of origin relevance (DTOR), and present baseline results on six large language models (LLMs). We identify that, while the zero-shot performance on LOR is promising, the performance on DTOR lags, leaving room for specialized architectures and future work.

[159] Causal Sufficiency and Necessity Improves Chain-of-Thought Reasoning

Xiangning Yu,Zhuohan Wang,Linyi Yang,Haoxuan Li,Anjie Liu,Xiao Xue,Jun Wang,Mengyue Yang

Main category: cs.CL

TL;DR: The paper proposes a causal framework that optimizes CoT reasoning through the dual lenses of sufficiency and necessity, automatically adding missing steps and pruning redundant ones to improve efficiency without sacrificing accuracy.

Details

Motivation: Address the lack of sufficiency and necessity among the steps of CoT reasoning in order to improve the complex reasoning of LLMs. Method: Use a causal framework with the Probability of Sufficiency and Necessity to quantify each reasoning step's influence on the outcome and automatically optimize the steps. Result: Substantially improved efficiency and reduced token usage on mathematical and commonsense reasoning benchmarks, with accuracy maintained. Conclusion: The framework offers a promising direction for improving LLM reasoning performance and cost-effectiveness. Abstract: Chain-of-Thought (CoT) prompting plays an indispensable role in endowing large language models (LLMs) with complex reasoning capabilities. However, CoT currently faces two fundamental challenges: (1) Sufficiency, which ensures that the generated intermediate inference steps comprehensively cover and substantiate the final conclusion; and (2) Necessity, which identifies the inference steps that are truly indispensable for the soundness of the resulting answer. We propose a causal framework that characterizes CoT reasoning through the dual lenses of sufficiency and necessity. Incorporating causal Probability of Sufficiency and Necessity allows us not only to determine which steps are logically sufficient or necessary to the prediction outcome, but also to quantify their actual influence on the final reasoning outcome under different intervention scenarios, thereby enabling the automated addition of missing steps and the pruning of redundant ones. Extensive experimental results on various mathematical and commonsense reasoning benchmarks confirm substantial improvements in reasoning efficiency and reduced token usage without sacrificing accuracy. Our work provides a promising direction for improving LLM reasoning performance and cost-effectiveness.

[160] Attention Head Embeddings with Trainable Deep Kernels for Hallucination Detection in LLMs

Rodion Oblovatny,Alexandra Bazarova,Alexey Zaytsev

Main category: cs.CL

TL;DR: Proposes detecting hallucinations in large language models by analyzing the probabilistic divergence between prompt and response hidden-state distributions; hallucinated responses deviate less from their prompts, which motivates a model-intrinsic detector that needs no external knowledge.

Details

Motivation: Large language models (LLMs) often hallucinate (produce untrue or ungrounded responses), and existing methods depend on external knowledge or auxiliary models, lacking an efficient self-contained solution. Method: Use the distance between prompt and response hidden-state distributions as a hallucination score, with learnable deep kernels to capture subtle differences between the distributions. Result: The method outperforms existing baselines on several benchmarks and remains competitive even without kernel training. Conclusion: The paper offers an efficient, scalable hallucination detection method that requires no external resources. Abstract: We present a novel approach for detecting hallucinations in large language models (LLMs) by analyzing the probabilistic divergence between prompt and response hidden-state distributions. Counterintuitively, we find that hallucinated responses exhibit smaller deviations from their prompts compared to grounded responses, suggesting that hallucinations often arise from superficial rephrasing rather than substantive reasoning. Leveraging this insight, we propose a model-intrinsic detection method that uses distributional distances as principled hallucination scores, eliminating the need for external knowledge or auxiliary models. To enhance sensitivity, we employ deep learnable kernels that automatically adapt to capture nuanced geometric differences between distributions. Our approach outperforms existing baselines, demonstrating state-of-the-art performance on several benchmarks. The method remains competitive even without kernel training, offering a robust, scalable solution for hallucination detection.
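
One way to picture a kernel-based distributional distance between prompt and response hidden states is the maximum mean discrepancy (MMD) with a fixed RBF kernel, sketched below. The abstract does not name the distance or the kernel, so MMD, the fixed bandwidth, the function names, and the 2-D toy vectors are all assumptions; the paper's learnable deep kernels are not reproduced here.

```python
import math


def rbf_kernel(x, y, bandwidth=1.0):
    """Gaussian RBF kernel between two vectors."""
    squared = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-squared / (2 * bandwidth ** 2))


def mean_kernel(xs, ys, bandwidth):
    """Average kernel value over all pairs drawn from xs and ys."""
    total = sum(rbf_kernel(x, y, bandwidth) for x in xs for y in ys)
    return total / (len(xs) * len(ys))


def mmd_squared(xs, ys, bandwidth=1.0):
    """Biased estimate of squared MMD between two samples."""
    return (mean_kernel(xs, xs, bandwidth)
            + mean_kernel(ys, ys, bandwidth)
            - 2 * mean_kernel(xs, ys, bandwidth))


# Hypothetical 2-D stand-ins for per-token hidden states.
prompt_states = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2)]
close_response = [(0.1, 0.1), (0.2, 0.1), (0.0, 0.2)]   # stays near the prompt
far_response = [(2.0, 2.1), (2.2, 1.9), (1.9, 2.0)]     # moves away from the prompt

score_close = mmd_squared(prompt_states, close_response)
score_far = mmd_squared(prompt_states, far_response)
print(round(score_close, 4), round(score_far, 4))
```

Following the finding reported in the abstract, a response whose states stay unusually close to the prompt (a small score) would be the more suspicious one; how a score is turned into a decision is not specified there.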

[161] The Emergence of Abstract Thought in Large Language Models Beyond Any Language

Yuxin Chen,Yiran Zhao,Yang Zhang,An Zhang,Kenji Kawaguchi,Shafiq Joty,Junnan Li,Tat-Seng Chua,Michael Qizhe Shieh,Wenxuan Zhang

Main category: cs.CL

TL;DR: The study finds that large language models (LLMs) progressively develop a core language-agnostic parameter space that supports abstract thought across languages.

Details

Motivation: Challenge the assumption that LLMs "think" in English and explore the nature of their multilingual ability. Method: Identify language-related neurons (shared and exclusive) and propose neuron-specific training strategies tailored to different development stages. Result: The proportion and functional importance of shared neurons increase while exclusive neurons lose influence, forming a core language-agnostic parameter space. Conclusion: Abstract thought in LLMs rests on a language-agnostic parameter space, and neuron-specific training strategies support the development of multilingual ability. Abstract: As large language models (LLMs) continue to advance, their capacity to function effectively across a diverse range of languages has shown marked improvement. Preliminary studies observe that the hidden activations of LLMs often resemble English, even when responding to non-English prompts. This has led to the widespread assumption that LLMs may "think" in English. However, more recent results showing strong multilingual performance, even surpassing English performance on specific tasks in other languages, challenge this view. In this work, we find that LLMs progressively develop a core language-agnostic parameter space, a remarkably small subset of parameters whose deactivation results in significant performance degradation across all languages. This compact yet critical set of parameters underlies the model's ability to generalize beyond individual languages, supporting the emergence of abstract thought that is not tied to any specific linguistic system. Specifically, we identify language-related neurons, those that are consistently activated during the processing of particular languages, and categorize them as either shared (active across multiple languages) or exclusive (specific to one). As LLMs undergo continued development over time, we observe a marked increase in both the proportion and functional importance of shared neurons, while exclusive neurons progressively diminish in influence. These shared neurons constitute the backbone of the core language-agnostic parameter space, supporting the emergence of abstract thought. Motivated by these insights, we propose neuron-specific training strategies tailored to LLMs' language-agnostic levels at different development stages. Experiments across diverse LLM families support our approach.

[162] PersonaLens: A Benchmark for Personalization Evaluation in Conversational AI Assistants

Zheng Zhao,Clara Vania,Subhradeep Kayal,Naila Khan,Shay B. Cohen,Emine Yilmaz

Main category: cs.CL

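TL;DR: The paper introduces PersonaLens, a comprehensive benchmark for evaluating personalization in task-oriented AI assistants, and reveals significant variability in the personalization capabilities of current LLM assistants.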

Details

Motivation: Existing personalization benchmarks fail to fully evaluate personalization in task-oriented AI assistants, so a more comprehensive evaluation tool is needed. Method: Introduce the PersonaLens benchmark, which includes diverse user profiles and two LLM-based agents (a user agent and a judge agent) for assessing personalization, response quality, and task success. Result: Experiments show significant variability in the personalization capabilities of current LLM assistants. Conclusion: PersonaLens provides important insights for improving personalization in conversational AI systems. Abstract: Large language models (LLMs) have advanced conversational AI assistants. However, systematically evaluating how well these assistants apply personalization--adapting to individual user preferences while completing tasks--remains challenging. Existing personalization benchmarks focus on chit-chat, non-conversational tasks, or narrow domains, failing to capture the complexities of personalized task-oriented assistance. To address this, we introduce PersonaLens, a comprehensive benchmark for evaluating personalization in task-oriented AI assistants. Our benchmark features diverse user profiles equipped with rich preferences and interaction histories, along with two specialized LLM-based agents: a user agent that engages in realistic task-oriented dialogues with AI assistants, and a judge agent that employs the LLM-as-a-Judge paradigm to assess personalization, response quality, and task success. Through extensive experiments with current LLM assistants across diverse tasks, we reveal significant variability in their personalization capabilities, providing crucial insights for advancing conversational AI systems.

[163] Aspect-Based Opinion Summarization with Argumentation Schemes

Wendi Zhou,Ameer Saadat-Yazd,Nadin Kokciyan

Main category: cs.CL

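TL;DR: The paper proposes ASESUM, a novel summarization system that summarizes opinions around a product's key aspects and adapts to different domains without a predefined set of aspects.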

Details

Motivation: Online shoppers cannot realistically read large volumes of reviews and summarize the main opinions by hand, so automated opinion summarization systems are needed. Method: The ASESUM framework summarizes viewpoints on a product's key aspects by extracting aspect-centric arguments and measuring their salience and validity. Result: Experiments show that ASESUM outperforms both new and existing methods in capturing the diverse perspectives of the original reviews. Conclusion: ASESUM is an effective automated opinion summarization system that adapts to different domains and produces summaries backed by supporting evidence. Abstract: Reviews are valuable resources for customers making purchase decisions in online shopping. However, it is impractical for customers to go over the vast number of reviews and manually conclude the prominent opinions, which prompts the need for automated opinion summarization systems. Previous approaches, either extractive or abstractive, face challenges in automatically producing grounded aspect-centric summaries. In this paper, we propose a novel summarization system that not only captures predominant opinions from an aspect perspective with supporting evidence, but also adapts to varying domains without relying on a pre-defined set of aspects. Our proposed framework, ASESUM, summarizes viewpoints relevant to the critical aspects of a product by extracting aspect-centric arguments and measuring their salience and validity. We conduct experiments on a real-world dataset to demonstrate the superiority of our approach in capturing diverse perspectives of the original reviews compared to new and existing methods.

[164] VerIF: Verification Engineering for Reinforcement Learning in Instruction Following

Hao Peng,Yunjia Qi,Xiaozhi Wang,Bin Xu,Lei Hou,Juanzi Li

Main category: cs.CL

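TL;DR: The paper proposes VerIF, a verification method that combines rule-based and LLM-based checks to strengthen reinforcement learning for instruction following, and reports significant gains in experiments.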

Details

Motivation: Explore the verification challenge in reinforcement learning for instruction following and propose an effective verification method to improve model performance. Method: VerIF combines rule-based code verification with verification by a large reasoning model (e.g., QwQ-32B); a high-quality dataset, VerInstruct, is built to support it. Result: The trained models improve significantly on several instruction-following benchmarks and reach state-of-the-art performance among models of comparable size, with general capabilities unaffected. Conclusion: VerIF can be integrated into existing reinforcement learning recipes to improve overall model performance; the related resources are open-sourced. Abstract: Reinforcement learning with verifiable rewards (RLVR) has become a key technique for enhancing large language models (LLMs), with verification engineering playing a central role. However, best practices for RL in instruction following remain underexplored. In this work, we explore the verification challenge in RL for instruction following and propose VerIF, a verification method that combines rule-based code verification with LLM-based verification from a large reasoning model (e.g., QwQ-32B). To support this approach, we construct a high-quality instruction-following dataset, VerInstruct, containing approximately 22,000 instances with associated verification signals. We apply RL training with VerIF to two models, achieving significant improvements across several representative instruction-following benchmarks. The trained models reach state-of-the-art performance among models of comparable size and generalize well to unseen constraints. We further observe that their general capabilities remain unaffected, suggesting that RL with VerIF can be integrated into existing RL recipes to enhance overall model performance. We have released our datasets, codes, and models to facilitate future research at https://github.com/THU-KEG/VerIF.

[165] Query-Focused Retrieval Heads Improve Long-Context Reasoning and Re-ranking

Wuwei Zhang,Fangcong Yin,Howard Yen,Danqi Chen,Xi Ye

Main category: cs.CL

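TL;DR: The paper proposes QRHEAD and QR-RETRIEVER, which use query-focused attention to improve retrieval from long contexts and perform strongly across several tasks.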

Details

Motivation: Existing retrieval heads in long-context language models have limited effectiveness and need improvement to retrieve better. Method: Identify QRHEAD by aggregating attention scores with respect to the input query, and build QR-RETRIEVER on top of it as an efficient retriever. Result: Performance improves by over 10% on LongMemEval and CLIPPER, and zero-shot re-ranking on the BEIR benchmark outperforms RankGPT. Conclusion: QRHEAD and QR-RETRIEVER yield a general-purpose retriever and also improve the interpretability of long-context capabilities. Abstract: Recent work has identified retrieval heads (Wu et al., 2025b), a subset of attention heads responsible for retrieving salient information in long-context language models (LMs), as measured by their copy-paste behavior in Needle-in-a-Haystack tasks. In this paper, we introduce QRHEAD (Query-Focused Retrieval Head), an improved set of attention heads that enhance retrieval from long context. We identify QRHEAD by aggregating attention scores with respect to the input query, using a handful of examples from real-world tasks (e.g., long-context QA). We further introduce QR-RETRIEVER, an efficient and effective retriever that uses the accumulated attention mass of QRHEAD as retrieval scores. We use QR-RETRIEVER for long-context reasoning by selecting the most relevant parts with the highest retrieval scores. On multi-hop reasoning tasks LongMemEval and CLIPPER, this yields over 10% performance gains over full context and outperforms strong dense retrievers. We also evaluate QR-RETRIEVER as a re-ranker on the BEIR benchmark and find that it achieves strong zero-shot performance, outperforming other LLM-based re-rankers such as RankGPT. Further analysis shows that both the query-context attention scoring and task selection are crucial for identifying QRHEAD with strong downstream utility. Overall, our work contributes a general-purpose retriever and offers interpretability insights into the long-context capabilities of LMs.
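
The idea of using accumulated attention mass as a retrieval score can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: attention weights are assumed to be already extracted into nested lists indexed as `attention[head][query_token][context_token]`, the selected heads are given by name, and each passage is a half-open token span. The names `retrieval_scores` and `top_k`, and the plain sum used for aggregation, are hypothetical choices.

```python
from typing import Dict, List, Sequence, Tuple

# attention[head][q][c]: weight from query token q to context token c.
Attention = Dict[str, List[List[float]]]


def retrieval_scores(
    attention: Attention,
    heads: Sequence[str],
    spans: Sequence[Tuple[int, int]],
) -> List[float]:
    """Score each passage by the attention mass it receives from the query.

    The mass is summed over the selected heads, over all query tokens,
    and over the context tokens in the half-open span [start, end).
    """
    scores: List[float] = []
    for start, end in spans:
        mass = 0.0
        for head in heads:
            for query_row in attention[head]:
                mass += sum(query_row[start:end])
        scores.append(mass)
    return scores


def top_k(scores: Sequence[float], k: int) -> List[int]:
    """Indices of the k highest-scoring passages, best first."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]


if __name__ == "__main__":
    # Toy input: 2 heads, 2 query tokens, 4 context tokens, 2 passages.
    toy_attention = {
        "h0": [[0.5, 0.25, 0.125, 0.125], [0.25, 0.25, 0.25, 0.25]],
        "h1": [[0.125, 0.125, 0.25, 0.5], [0.0, 0.0, 0.5, 0.5]],
    }
    toy_spans = [(0, 2), (2, 4)]
    scores = retrieval_scores(toy_attention, ["h0", "h1"], toy_spans)
    print(scores, top_k(scores, 1))
```

Which heads go into `heads` is the part the paper addresses with QRHEAD selection; the sketch simply takes that list as input.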

[166] Resa: Transparent Reasoning Models via SAEs

Shangshang Wang,Julian Asilis,Ömer Faruk Akgül,Enes Burak Bilgin,Ollie Liu,Deqing Fu,Willie Neiswanger

Main category: cs.CL

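TL;DR: Resa is a family of 1.5B reasoning models that uses sparse autoencoder tuning (SAE-Tuning) to elicit reasoning abilities from language models efficiently, sharply reducing cost and training time.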

Details

Motivation: Investigate how to elicit and strengthen reasoning abilities in language models efficiently and at low cost. Method: Train a sparse autoencoder (SAE) to capture reasoning abilities from a source model, then use it to guide supervised fine-tuning of a target model, without any reasoning traces. Result: SAE-Tuning retains over 97% of the reasoning performance of its RL-trained counterpart while cutting training cost by more than 2000x and training time by more than 450x; applied to lightly RL-trained models, it delivers strong reasoning performance for about $1 of additional cost. Conclusion: The reasoning abilities extracted by SAE-Tuning appear to be generalizable and modular, supporting the method's efficiency and scalability. Abstract: How cost-effectively can we elicit strong reasoning in language models by leveraging their underlying representations? We answer this question with Resa, a family of 1.5B reasoning models trained via a novel and efficient sparse autoencoder tuning (SAE-Tuning) procedure. This method first trains an SAE to capture reasoning abilities from a source model, and then uses the trained SAE to guide a standard supervised fine-tuning process to elicit such abilities in a target model, all using verified question-answer data without any reasoning traces. Notably, when applied to certain base models before further RL post-training, SAE-Tuning retains >97% of its RL-trained counterpart's reasoning performance while reducing training costs by >2000x to roughly $1 and training time by >450x to around 20 minutes. Furthermore, when applied to lightly RL-trained models (e.g., within 1 hour on 2 GPUs), it enables reasoning performance such as 43.33% Pass@1 on AIME24 and 90% Pass@1 on AMC23 for only around $1 additional cost. Surprisingly, the reasoning abilities extracted via SAEs are potentially both generalizable and modular. Generality means abilities extracted from one dataset still elevate performance on a larger and overlapping corpus. Modularity means abilities extracted from Qwen or Qwen-Math can be attached to the R1-Distill model at test time, without any retraining, and yield comparable gains. Extensive ablations validate these findings and all artifacts are fully open-sourced.

Hillary Dawkins,Kathleen C. Fraser,Svetlana Kiritchenko

Main category: cs.CL

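TL;DR: The paper studies the detection of AI-generated text on social media, which is hard because posts are short and the language is informal. Using a large-scale dataset and experiments, it finds that detectability drops sharply when an attacker does not release their fine-tuned model.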

Details

Motivation: Social media is a significant attack vector in online influence campaigns, and AI-generated content may be used to support or oppose particular policies or events, so detecting such content is important. Method: Taking the perspective of a threat actor, the authors build a dataset of 505,159 AI-generated social media posts covering 11 controversial topics and test detectability under different assumptions. Result: Detectability drops dramatically when the attacker does not release their fine-tuned model, a result confirmed by a human study. Ablation experiments expose the vulnerability of detection algorithms to fine-tuned LLMs. Conclusion: The broad applicability of fine-tuned LLMs poses a challenge for the whole detection field, and detection methods need further improvement to handle realistic threats. Abstract: Detecting AI-generated text is a difficult problem to begin with; detecting AI-generated text on social media is made even more difficult due to the short text length and informal, idiosyncratic language of the internet. It is nonetheless important to tackle this problem, as social media represents a significant attack vector in online influence campaigns, which may be bolstered through the use of mass-produced AI-generated posts supporting (or opposing) particular policies, decisions, or events. We approach this problem with the mindset and resources of a reasonably sophisticated threat actor, and create a dataset of 505,159 AI-generated social media posts from a combination of open-source, closed-source, and fine-tuned LLMs, covering 11 different controversial topics. We show that while the posts can be detected under typical research assumptions about knowledge of and access to the generating models, under the more realistic assumption that an attacker will not release their fine-tuned model to the public, detectability drops dramatically. This result is confirmed with a human study. Ablation experiments highlight the vulnerability of various detection algorithms to fine-tuned LLMs. This result has implications across all detection domains, since fine-tuning is a generally applicable and realistic LLM use case.

[168] Step-by-step Instructions and a Simple Tabular Output Format Improve the Dependency Parsing Accuracy of LLMs

Hiroshi Matsuda,Chunpeng Ma,Masayuki Asahara

Main category: cs.CL

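TL;DR: The paper proposes a dependency parsing method built on a step-by-step instruction strategy; with part-of-speech tagging as the first step and a simplified output format, it achieves state-of-the-art accuracy across 17 languages and shows the benefit of multilingual fine-tuning.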

Details

Motivation: Standard prompting struggles to produce structurally valid and accurate outputs in dependency parsing and needs improvement. Method: Use a step-by-step instruction strategy that performs part-of-speech tagging first, then predicts syntactic heads and dependency labels, with output in a simplified CoNLL-U-like format. Result: The method reaches state-of-the-art accuracy on Universal Dependencies datasets across 17 languages, without hallucination or contamination; multilingual fine-tuning improves cross-language generalization. Conclusion: Step-by-step reasoning is effective in LLM-based parsing and offers a scalable, format-consistent alternative. Abstract: Recent advances in large language models (LLMs) have enabled impressive performance in various tasks. However, standard prompting often struggles to produce structurally valid and accurate outputs, especially in dependency parsing. We propose a novel step-by-step instruction strategy, where universal part-of-speech tagging precedes the prediction of syntactic heads and dependency labels, together with a simplified CoNLL-U-like output format. Our method achieves state-of-the-art accuracy on Universal Dependencies datasets across 17 languages without hallucination or contamination. We further show that multilingual fine-tuning simultaneously improves cross-language generalization performance. Our results highlight the effectiveness of explicit reasoning steps in LLM-based parsing and offer a scalable, format-consistent alternative to bracket-based approaches.
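
To make "structurally valid" tabular output concrete, here is a minimal sketch of parsing and checking a simplified CoNLL-U-like table. The exact column layout used in the paper is not given in the abstract, so the five tab-separated columns assumed here (ID, FORM, UPOS, HEAD, DEPREL) and the names `Token`, `parse_table`, and `is_valid_tree` are hypothetical.

```python
from typing import List, NamedTuple


class Token(NamedTuple):
    idx: int      # 1-based token position
    form: str     # surface word
    upos: str     # universal part-of-speech tag
    head: int     # index of the syntactic head, 0 for the root
    deprel: str   # dependency label


def parse_table(text: str) -> List[Token]:
    """Parse tab-separated rows: ID, FORM, UPOS, HEAD, DEPREL (assumed layout)."""
    tokens: List[Token] = []
    for line in text.strip().splitlines():
        idx, form, upos, head, deprel = line.split("\t")
        tokens.append(Token(int(idx), form, upos, int(head), deprel))
    return tokens


def is_valid_tree(tokens: List[Token]) -> bool:
    """Check that the rows form a single-rooted dependency tree."""
    n = len(tokens)
    if [t.idx for t in tokens] != list(range(1, n + 1)):
        return False  # IDs must be 1..n in order
    heads = {t.idx: t.head for t in tokens}
    if any(h < 0 or h > n for h in heads.values()):
        return False  # head must point at the root (0) or an existing token
    if sum(1 for h in heads.values() if h == 0) != 1:
        return False  # exactly one root
    for start in heads:
        seen = set()
        node = start
        while node != 0:  # walk up towards the root
            if node in seen:
                return False  # cycle detected
            seen.add(node)
            node = heads[node]
    return True


if __name__ == "__main__":
    output = "1\tDogs\tNOUN\t2\tnsubj\n2\tbark\tVERB\t0\troot"
    print(is_valid_tree(parse_table(output)))
```

A check like this can reject malformed model output before accuracy is scored, which is one practical benefit of a flat tabular format over nested brackets.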

[169] Large Language Models for Toxic Language Detection in Low-Resource Balkan Languages

Amel Muminovic,Amela Kadric Muminovic

Main category: cs.CL

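TL;DR: The study evaluates how large language models handle toxic comments in Serbian, Croatian, and Bosnian, and finds that adding a short context snippet noticeably improves detection.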

Details

Motivation: Online toxic language causes real harm in regions with limited moderation tools, especially for languages that lack labeled data. Method: Build and manually label a dataset of 4,500 YouTube and TikTok comments, then test four models (GPT-3.5 Turbo, GPT-4.1, Gemini 1.5 Pro, and Claude 3 Opus) in zero-shot and context-augmented modes. Result: Context augmentation raises recall by about 0.12 on average and F1 score by up to 0.10; Gemini performs best in context-augmented mode (F1 = 0.82, accuracy = 0.82). Conclusion: Improved prompt design and threshold calibration can meaningfully improve toxic language detection in low-resource language settings. Abstract: Online toxic language causes real harm, especially in regions with limited moderation tools. In this study, we evaluate how large language models handle toxic comments in Serbian, Croatian, and Bosnian, languages with limited labeled data. We built and manually labeled a dataset of 4,500 YouTube and TikTok comments drawn from videos across diverse categories, including music, politics, sports, modeling, influencer content, discussions of sexism, and general topics. Four models (GPT-3.5 Turbo, GPT-4.1, Gemini 1.5 Pro, and Claude 3 Opus) were tested in two modes: zero-shot and context-augmented. We measured precision, recall, F1 score, accuracy and false positive rates. Including a short context snippet raised recall by about 0.12 on average and improved F1 score by up to 0.10, though it sometimes increased false positives. The best balance came from Gemini in context-augmented mode, reaching an F1 score of 0.82 and accuracy of 0.82, while zero-shot GPT-4.1 led on precision and had the lowest false alarms. We show how adding minimal context can improve toxic language detection in low-resource settings and suggest practical strategies such as improved prompt design and threshold calibration. These results show that prompt design alone can yield meaningful gains in toxicity detection for underserved Balkan language communities.
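
For readers less familiar with the five reported metrics, the sketch below computes them from binary confusion-matrix counts using the standard definitions. The function name `classification_metrics` and the example counts are made up for illustration and are not taken from the paper.

```python
from typing import Dict


def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> Dict[str, float]:
    """Standard binary metrics, with 'toxic' as the positive class.

    tp/fp/fn/tn are true positives, false positives, false negatives,
    and true negatives. Ratios with an empty denominator are reported as 0.0.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    false_positive_rate = fp / (fp + tn) if (fp + tn) else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "accuracy": accuracy,
        "false_positive_rate": false_positive_rate,
    }


if __name__ == "__main__":
    # Hypothetical counts for one model on a balanced 100-comment sample.
    print(classification_metrics(tp=40, fp=10, fn=10, tn=40))
```

The trade-off the abstract describes shows up directly in these formulas: flagging more comments tends to raise `tp` (higher recall) but also `fp` (lower precision and a higher false positive rate).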

[170] From Judgment to Interference: Early Stopping LLM Harmful Outputs via Streaming Content Monitoring

Yang Li,Qiang Sheng,Yehan Yang,Xueyao Zhang,Juan Cao

Main category: cs.CL

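TL;DR: The paper proposes a data-and-model solution that natively supports partial detection: it builds the FineHarm dataset and introduces the streaming content monitor (SCM), which substantially improves detection efficiency and performance.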

Details

Motivation: Existing moderators mainly use full detection, which causes high latency; partial detection reduces latency but loses performance because of a training-inference gap. Method: Build the FineHarm dataset (29K prompt-response pairs) and propose the SCM model, trained with dual supervision at the response and token levels. Result: SCM reaches a macro F1 score of 0.95+, comparable to full detection, after seeing only the first 18% of response tokens on average, and it can also improve safety alignment. Conclusion: SCM is an efficient streaming detection method that substantially lowers latency while maintaining strong performance. Abstract: Though safety alignment has been applied to most large language models (LLMs), LLM service providers generally deploy a subsequent moderation as the external safety guardrail in real-world products. Existing moderators mainly practice a conventional full detection, which determines the harmfulness based on the complete LLM output, causing high service latency. Recent works pay more attention to partial detection where moderators oversee the generation midway and early stop the output if harmfulness is detected, but they directly apply moderators trained with the full detection paradigm to incomplete outputs, introducing a training-inference gap that lowers the performance. In this paper, we explore how to form a data-and-model solution that natively supports partial detection. For the data, we construct FineHarm, a dataset consisting of 29K prompt-response pairs with fine-grained annotations to provide reasonable supervision for token-level training. Then, we propose the streaming content monitor (SCM), which is trained with dual supervision of response- and token-level labels and can follow the output stream of LLM to make a timely judgment of harmfulness. Experiments show that SCM gains 0.95+ in macro F1 score that is comparable to full detection, by only seeing the first 18% of tokens in responses on average. Moreover, the SCM can serve as a pseudo-harmfulness annotator for improving safety alignment and lead to a higher harmlessness score than DPO.
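
The partial-detection loop the abstract describes (follow the token stream, stop early once harmfulness is detected) can be sketched as below. This is only an illustration of the control flow: `score_prefix` stands in for a trained monitor such as SCM, and the `threshold` and `patience` parameters, the function name `monitor_stream`, and the keyword-based toy scorer are all hypothetical.

```python
from typing import Callable, Iterable, List, Tuple


def monitor_stream(
    tokens: Iterable[str],
    score_prefix: Callable[[List[str]], float],
    threshold: float = 0.5,
    patience: int = 1,
) -> Tuple[List[str], bool]:
    """Consume a token stream and stop early if the prefix looks harmful.

    Returns the tokens seen so far and whether generation was stopped.
    Stopping requires `patience` consecutive prefixes scoring at or
    above `threshold` (an assumed rule, not the paper's).
    """
    seen: List[str] = []
    consecutive_hits = 0
    for token in tokens:
        seen.append(token)
        if score_prefix(seen) >= threshold:
            consecutive_hits += 1
            if consecutive_hits >= patience:
                return seen, True
        else:
            consecutive_hits = 0
    return seen, False


def toy_scorer(prefix: List[str]) -> float:
    """Placeholder scorer: flags any prefix containing the word 'attack'."""
    return 1.0 if "attack" in prefix else 0.0


if __name__ == "__main__":
    stream = ["how", "to", "attack", "a", "server"]
    seen, stopped = monitor_stream(stream, toy_scorer)
    print(seen, stopped, len(seen) / len(stream))
```

The ratio `len(seen) / len(stream)` corresponds to the "fraction of tokens seen" figure quoted in the abstract; in a real deployment the flagged tokens would be withheld from the user rather than returned.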

Table of Contents

cs.CV [Back]

[1] ReStNet: A Reusable & Stitchable Network for Dynamic Adaptation on IoT Devices

[2] Enhancing the Safety of Medical Vision-Language Models by Synthetic Demonstrations

[3] BG-HOP: A Bimanual Generative Hand-Object Prior

[4] Segment Any Architectural Facades (SAAF):An automatic segmentation model for building facades, walls and windows based on multimodal semantics guidance

[5] VersaVid-R1: A Versatile Video Understanding and Reasoning Model from Question Answering to Captioning Tasks

[6] FlagEvalMM: A Flexible Framework for Comprehensive Multimodal Model Evaluation

[7] AVA-Bench: Atomic Visual Ability Benchmark for Vision Foundation Models

[8] BakuFlow: A Streamlining Semi-Automatic Label Generation Tool

[9] Bias Analysis in Unconditional Image Generative Models

[10] CAIRe: Cultural Attribution of Images by Retrieval-Augmented Evaluation

[11] Seedance 1.0: Exploring the Boundaries of Video Generation Models

[12] Cross-Frame Representation Alignment for Fine-Tuning Video Diffusion Models

[13] PatchGuard: Adversarially Robust Anomaly Detection and Localization through Vision Transformers and Pseudo Anomalies

[14] UFM: A Simple Path towards Unified Dense Correspondence with Flow

[15] Lightweight Object Detection Using Quantized YOLOv4-Tiny for Emergency Response in Aerial Imagery

[16] Efficient Edge Deployment of Quantized YOLOv4-Tiny for Aerial Emergency Object Detection on Raspberry Pi 5

[17] MSSDF: Modality-Shared Self-supervised Distillation for High-Resolution Multi-modal Remote Sensing Image Learning

[18] CheckManual: A New Challenge and Benchmark for Manual-based Appliance Manipulation

[19] An Effective End-to-End Solution for Multimodal Action Recognition

[20] Autoregressive Adversarial Post-Training for Real-Time Interactive Video Generation

[21] A new approach for image segmentation based on diffeomorphic registration and gradient fields

[22] SAGE: Exploring the Boundaries of Unsafe Concept Domain with Semantic-Augment Erasing

[23] ScaleLSD: Scalable Deep Line Segment Detection Streamlined

[24] UniForward: Unified 3D Scene and Semantic Field Reconstruction via Feed-Forward Gaussian Splatting from Only Sparse-View Images

[25] ReID5o: Achieving Omni Multi-modal Person Re-identification in a Single Model

[26] Improving Out-of-Distribution Detection via Dynamic Covariance Calibration

[27] SRPL-SFDA: SAM-Guided Reliable Pseudo-Labels for Source-Free Domain Adaptation in Medical Image Segmentation

[28] Synthetic Human Action Video Data Generation with Pose Transfer

[29] Noise Conditional Variational Score Distillation

[30] ODG: Occupancy Prediction Using Dual Gaussians

[31] A High-Quality Dataset and Reliable Evaluation for Interleaved Image-Text Generation

[32] A Novel Lightweight Transformer with Edge-Aware Fusion for Remote Sensing Image Captioning

[33] TOGA: Temporally Grounded Open-Ended Video QA with Weak Supervision

[34] Harmonizing and Merging Source Models for CLIP-based Domain Generalization

[35] Evidential Deep Learning with Spectral-Spatial Uncertainty Disentanglement for Open-Set Hyperspectral Domain Generalization

[36] Optimizing Cooperative Multi-Object Tracking using Graph Signal Processing

[37] Provoking Multi-modal Few-Shot LVLM via Exploration-Exploitation In-Context Learning

[38] Urban1960SatSeg: Unsupervised Semantic Segmentation of Mid-20$^{th}$ century Urban Landscapes with Satellite Imageries

[39] TinySplat: Feedforward Approach for Generating Compact 3D Scene Representation

[40] Marrying Autoregressive Transformer and Diffusion with Multi-Reference Autoregression

[41] Generalized Gaussian Entropy Model for Point Cloud Attribute Compression with Dynamic Likelihood Intervals

[42] HAIF-GS: Hierarchical and Induced Flow-Guided Gaussian Splatting for Dynamic Scene

[43] Revisit What You See: Disclose Language Prior in Vision Tokens for Efficient Guided Decoding of LVLMs

[44] Gaussian Herding across Pens: An Optimal Transport Perspective on Global Gaussian Reduction for 3DGS

[45] AngleRoCL: Angle-Robust Concept Learning for Physically View-Invariant T2I Adversarial Patches

[46] 3DGeoDet: General-purpose Geometry-aware Image-based 3D Object Detection

[47] GLD-Road:A global-local decoding road network extraction model for remote sensing images

[48] AD^2-Bench: A Hierarchical CoT Benchmark for MLLM in Autonomous Driving under Adverse Conditions

[49] SemanticSplat: Feed-Forward 3D Scene Understanding with Language-Aware Gaussian Fields

[50] Consistent Story Generation with Asymmetry Zigzag Sampling

[51] ECAM: A Contrastive Learning Approach to Avoid Environmental Collision in Trajectory Forecasting

[52] HSENet: Hybrid Spatial Encoding Network for 3D Medical Vision-Language Understanding

[53] DGAE: Diffusion-Guided Autoencoder for Efficient Latent Representation Learning

[54] HopaDIFF: Holistic-Partial Aware Fourier Conditioned Diffusion for Referring Human Action Segmentation in Multi-Person Scenarios

[55] Self-Supervised Multi-Part Articulated Objects Modeling via Deformable Gaussian Splatting and Progressive Primitive Segmentation

[56] CINeMA: Conditional Implicit Neural Multi-Modal Atlas for a Spatio-Temporal Representation of the Perinatal Brain

[57] Reasoning Models Are More Easily Gaslighted Than You Think

[58] Adding simple structure at inference improves Vision-Language Compositionality

[59] Towards Practical Alzheimer's Disease Diagnosis: A Lightweight and Interpretable Spiking Neural Model

[60] CHIP: A multi-sensor dataset for 6D pose estimation of chairs in industrial settings

[61] Non-Contact Health Monitoring During Daily Personal Care Routines

[62] The Four Color Theorem for Cell Instance Segmentation

[63] MPFNet: A Multi-Prior Fusion Network with a Progressive Training Strategy for Micro-Expression Recognition

[64] Vision Matters: Simple Visual Perturbations Can Boost Multimodal Math Reasoning

[65] ELBO-T2IAlign: A Generic ELBO-Based Method for Calibrating Pixel-level Text-Image Alignment in Diffusion Models

[66] Class Similarity-Based Multimodal Classification under Heterogeneous Category Sets

[67] Hierarchical Image Matching for UAV Absolute Visual Localization via Semantic and Structural Constraints

[68] Outside Knowledge Conversational Video (OKCV) Dataset -- Dialoguing over Videos

[69] Inverting Black-Box Face Recognition Systems via Zero-Order Optimization in Eigenface Space

[70] Q-SAM2: Accurate Quantization for Segment Anything Model 2

[71] Accurate and efficient zero-shot 6D pose estimation with frozen foundation models

[72] DreamCS: Geometry-Aware Text-to-3D Generation with Unpaired 3D Reward Supervision

[73] MMME: A Spontaneous Multi-Modal Micro-Expression Dataset Enabling Visual-Physiological Fusion

[74] DynaSplat: Dynamic-Static Gaussian Splatting with Hierarchical Motion Decomposition for Scene Reconstruction

[75] OctoNav: Towards Generalist Embodied Navigation

[76] Learning to Align: Addressing Character Frequency Distribution Shifts in Handwritten Text Recognition

[77] IntPhys 2: Benchmarking Intuitive Physics Understanding In Complex Synthetic Environments

[78] Leveraging Depth and Language for Open-Vocabulary Domain-Generalized Semantic Segmentation